neutron should forbid configuring agent_down_time that is known to crash due to CPython epoll limitation

Bug #2028724 reported by Lewis Denny
This bug affects 1 person
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| neutron | Fix Released | Low | Lewis Denny | |
| oslo.service | New | Undecided | Unassigned | |

Bug Description

This bug was filed so that neutron stops accepting agent_down_time values that are known to misbehave. The cause is a limitation of the CPython C-types interface, which doesn't seem to support any value larger than (2^32 / 2 - 1) [in milliseconds] for green thread waiting.

We can either truncate an invalid value or raise an error on it (truncation is probably preferable).
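A minimal sketch of the limit and of the two options above, for illustration only. It assumes a 32-bit C int, and the names `MAX_AGENT_DOWN_TIME`, `validate_down_time` and `truncate_down_time` are hypothetical; they are not taken from neutron or oslo.service.

```python
# Hypothetical sketch; the names below are illustrative, not neutron's or
# oslo.service's actual code.

# Assumption: C int is 32 bits, so INT_MAX equals (2^32 / 2 - 1).
INT_MAX = 2**31 - 1

# The low-level timeout is in milliseconds, the option is in seconds.
MAX_AGENT_DOWN_TIME = INT_MAX // 1000


def validate_down_time(seconds):
    """Option 1: raise an error on a value the timeout cannot represent."""
    if seconds > MAX_AGENT_DOWN_TIME:
        raise ValueError(
            "agent_down_time must not exceed %d seconds, got %d"
            % (MAX_AGENT_DOWN_TIME, seconds)
        )
    return seconds


def truncate_down_time(seconds):
    """Option 2: silently cap the value at the largest safe number."""
    return min(seconds, MAX_AGENT_DOWN_TIME)


print(MAX_AGENT_DOWN_TIME)
```

With these assumptions the largest safe setting works out to 2147483 seconds, roughly 24.8 days.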

Also, we may want to consider patching oslo.service (?) to apply similar truncation for values passed through loopingcall module. If the library is patched to do the truncation, then neutron enforcement won't be needed.

To reproduce, set agent_down_time to a number larger than (2^32 / 2 - 1)/1000 and check the neutron server log for an error like:
```
05:28:58.327 39 ERROR oslo_service.threadgroup [req-39043291-6236-4d9b-a1e5-45b6cfc7eb2d - - - - -] Error waiting on thread.: OverflowError: timeout is too large
```
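The same error message can probably be triggered without neutron by calling CPython's `select.epoll` directly. This is a sketch under the assumption that neutron's timeout eventually reaches `epoll` through eventlet, as discussed in this report; the helper name `epoll_timeout_error` is hypothetical. Only pass values above the limit, because a valid timeout would make the call block for that long.

```python
import select


def epoll_timeout_error(seconds):
    """Return the OverflowError message for this timeout, or None.

    Hypothetical helper for illustration. Returns None when epoll is
    unavailable (it is Linux-only) or when no OverflowError is raised.
    """
    if not hasattr(select, "epoll"):
        return None
    ep = select.epoll()
    try:
        # The timeout is given in seconds; CPython converts it to
        # milliseconds before handing it to epoll_wait().
        ep.poll(seconds)
    except OverflowError as exc:
        return str(exc)
    finally:
        ep.close()
    return None


# 2147484 s is one second above (2^32 / 2 - 1) / 1000.
print(epoll_timeout_error(2147484))
```

On a Linux CPython build this is expected to report the same "timeout is too large" text that appears in the log line above.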

This bug applies to all current versions of Neutron and can be reproduced on a master devstack.

Lewis Denny (lewisdenny)
Changed in neutron:
assignee: nobody → Lewis Denny (lewisdenny)
Bence Romsics (bence-romsics) wrote :

Hi,

Thanks for the report! I'm not really sure we need to protect ourselves against an admin maintaining the configuration, but at the same time I don't see a problem with the validation either. By the way, let me link your patch here:

https://review.opendev.org/c/openstack/neutron/+/889373

Changed in neutron:
status: New → Triaged
status: Triaged → In Progress
importance: Undecided → Low
Ihar Hrachyshka (ihar-hrachyshka) wrote :

@Bence, an admin may not be aware that a particular config option gets passed down to epoll, which has this (undocumented) limit. But we are aware, and so oslo.service (and neutron) could help admins by failing early, that is, by raising an exception for insane intervals.

Ihar Hrachyshka (ihar-hrachyshka) wrote :

The admins in the scenario of this bug were setting `agent_down_time` to a really high value because they wanted to effectively disable the periodic checks / updates of the agent status in neutron. While it may be argued that they shouldn't have done it for other reasons (there's a good reason these checks should happen periodically and not be disabled), that's still what they did, and nothing stopped them. This bug is to make sure they are stopped at a limit that we know is broken.

Perhaps neutron may also want to tighten the option even further to make sure that the periodic check is executed more often, but that's a different discussion. This bug is about the theoretical limit defined by the common libc implementation.

Bence Romsics (bence-romsics) wrote :

@Ihar: Thanks for the explanation!

tags: added: low-hanging-fruit
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/889373
Committed: https://opendev.org/openstack/neutron/commit/6fef1e65250dbda057206e1c2ee64f59b21d490f
Submitter: "Zuul (22348)"
Branch: master

commit 6fef1e65250dbda057206e1c2ee64f59b21d490f
Author: Lewis Denny <email address hidden>
Date: Mon Jul 31 16:38:22 2023 +1000

    Add max limit to agent_down_time

    The agent_down_time ends up being passed to an eventlet green-thread;
    under the hood, this uses a CPython C-types interface with a limitation
    of (2^32 / 2 - 1) INT_MAX (as defined in C) where int is usually 32 bits

    I have set the max value to (2^32 / 2 - 1)/1000 as agent_down_time
    configured in seconds, this ends up being 2147483.

    This patch is required as passing a larger number
    causes this error: OverflowError: timeout is too large

    If a user currently has a value larger than (2^32 / 2 - 1)/1000 set,
    Neutron Server will fail to start and will print out a very helpful
    error message.

    Closes-Bug: #2028724
    Change-Id: Ib5b943344cddbd468c00768461ba1ee00a2b4c58

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 23.0.0.0b3

This issue was fixed in the openstack/neutron 23.0.0.0b3 development milestone.
