Hanging services when connectivity to RabbitMQ is lost

Bug #1961402 reported by Andrew Bonney
This bug affects 5 people
Affects               Status        Importance  Assigned to  Milestone
Ubuntu Cloud Archive  New           Undecided   Unassigned   -
oslo.messaging        Fix Released  Critical    Unassigned   -

Bug Description

# Versions
- oslo.messaging 12.9.1
- rabbitmq 3.9.8
- ubuntu 20.04

Hi,
We are observing issues with services failing to recover after they encounter connectivity problems with the RabbitMQ cluster. We have seen this across Nova, Neutron and Cinder services in particular, across all of our deployments. When this occurs, the following greenlet-related traceback always appears in the service logs, following a number of reconnection-related messages (example from nova-compute):

Feb 18 08:42:33 compute102 nova-compute[1402787]: 2022-02-18 08:42:33.514 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.99:5671 is unreachable: . Trying again in 1 seconds.: socket.timeout
Feb 18 08:42:34 compute102 nova-compute[1402787]: 2022-02-18 08:42:34.517 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.99:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:35 compute102 nova-compute[1402787]: 2022-02-18 08:42:35.050 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.99:5671 is unreachable: . Trying again in 1 seconds.: socket.timeout
Feb 18 08:42:35 compute102 nova-compute[1402787]: 2022-02-18 08:42:35.520 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.98:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:36 compute102 nova-compute[1402787]: 2022-02-18 08:42:36.052 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.99:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:36 compute102 nova-compute[1402787]: 2022-02-18 08:42:36.521 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.97:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 2 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:37 compute102 nova-compute[1402787]: 2022-02-18 08:42:37.053 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.98:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:38 compute102 nova-compute[1402787]: 2022-02-18 08:42:38.055 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.97:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 2 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:38 compute102 nova-compute[1402787]: 2022-02-18 08:42:38.524 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.99:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:39 compute102 nova-compute[1402787]: 2022-02-18 08:42:39.526 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.98:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:40 compute102 nova-compute[1402787]: 2022-02-18 08:42:40.058 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.99:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:40 compute102 nova-compute[1402787]: 2022-02-18 08:42:40.527 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.97:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 4 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:41 compute102 nova-compute[1402787]: 2022-02-18 08:42:41.060 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.98:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:42 compute102 nova-compute[1402787]: 2022-02-18 08:42:42.062 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.97:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 4 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:44 compute102 nova-compute[1402787]: 2022-02-18 08:42:44.532 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.99:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:45 compute102 nova-compute[1402787]: 2022-02-18 08:42:45.534 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.98:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:46 compute102 nova-compute[1402787]: 2022-02-18 08:42:46.067 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.99:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:46 compute102 nova-compute[1402787]: 2022-02-18 08:42:46.536 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.97:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 6 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:47 compute102 nova-compute[1402787]: 2022-02-18 08:42:47.068 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.98:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:48 compute102 nova-compute[1402787]: 2022-02-18 08:42:48.070 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.97:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 6 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:52 compute102 nova-compute[1402787]: 2022-02-18 08:42:52.543 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.99:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:53 compute102 nova-compute[1402787]: 2022-02-18 08:42:53.545 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.98:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:54 compute102 nova-compute[1402787]: 2022-02-18 08:42:54.077 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.99:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:54 compute102 nova-compute[1402787]: 2022-02-18 08:42:54.546 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.97:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 8 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:55 compute102 nova-compute[1402787]: 2022-02-18 08:42:55.079 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.98:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:56 compute102 nova-compute[1402787]: 2022-02-18 08:42:56.080 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.97:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 8 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:58 compute102 nova-compute[1402787]: 2022-02-18 08:42:58.700 1402787 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 110] Connection timed out
Feb 18 08:42:58 compute102 nova-compute[1402787]: 2022-02-18 08:42:58.701 1402787 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 110] Connection timed out
Feb 18 08:42:58 compute102 nova-compute[1402787]: 2022-02-18 08:42:58.702 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 101] ENETUNREACH (retrying in 0 seconds): OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:58 compute102 nova-compute[1402787]: Traceback (most recent call last):
Feb 18 08:42:58 compute102 nova-compute[1402787]: File "/openstack/venvs/nova-24.0.0.0rc1/lib/python3.8/site-packages/eventlet/hubs/hub.py", line 476, in fire_timers
Feb 18 08:42:58 compute102 nova-compute[1402787]: timer()
Feb 18 08:42:58 compute102 nova-compute[1402787]: File "/openstack/venvs/nova-24.0.0.0rc1/lib/python3.8/site-packages/eventlet/hubs/timer.py", line 59, in __call__
Feb 18 08:42:58 compute102 nova-compute[1402787]: cb(*args, **kw)
Feb 18 08:42:58 compute102 nova-compute[1402787]: File "/openstack/venvs/nova-24.0.0.0rc1/lib/python3.8/site-packages/eventlet/semaphore.py", line 152, in _do_acquire
Feb 18 08:42:58 compute102 nova-compute[1402787]: waiter.switch()
Feb 18 08:42:58 compute102 nova-compute[1402787]: greenlet.error: cannot switch to a different thread

Typically, if the RabbitMQ cluster is taken down, this impacts ~5% of the services in the deployment, all of which must be restarted in order to recover. Similar recovery issues have been seen when the host's network interface is taken down and brought back up (which is how the traceback above was generated).
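The "greenlet.error: cannot switch to a different thread" in the traceback above is a general property of greenlets: a greenlet is bound to the OS thread that created it, and resuming it from a different native thread fails. A minimal sketch of that failure mode (assuming the `greenlet` package is installed; the names here are illustrative, not taken from the Nova code):

```python
import threading

import greenlet  # the C-level greenlet package that eventlet builds on

# A greenlet obtained in the main thread is bound to the main thread.
main_gl = greenlet.getcurrent()
errors = []

def resume_from_other_thread():
    # Switching to a greenlet owned by another OS thread raises
    # greenlet.error, just like the eventlet hub traceback above.
    try:
        main_gl.switch()
    except greenlet.error as exc:
        errors.append(str(exc))

t = threading.Thread(target=resume_from_other_thread)
t.start()
t.join()
print(errors[0])  # e.g. "cannot switch to a different thread"
```

This is why running the AMQP heartbeat in a native pthread can deadlock a service whose I/O waiters are greenlets parked in the main thread's eventlet hub.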

As far as we can tell, this started to occur around the same time as https://bugs.launchpad.net/oslo.messaging/+bug/1949964, i.e. around the Wallaby OpenStack release. It also coincided with a switch from TLSv1.0/v1.1 to TLSv1.2 in our RabbitMQ connections, plus a move to a full PKI infrastructure with certificate validation rather than ignoring certificate errors.

Any suggestions for diagnosing this further would be appreciated.

Thanks

Revision history for this message
Andrew Bonney (andrewbonney) wrote :

After some further testing I suspect this and the unresolved portion of https://bugs.launchpad.net/oslo.messaging/+bug/1949964 have the same root cause as https://bugs.launchpad.net/oslo.messaging/+bug/1934937. I've switched the 'heartbeat_in_pthread' parameter to False on a sample host and will report back next week if this resolves the issues.

Revision history for this message
Andrew Bonney (andrewbonney) wrote :

I have now confirmed that this issue can be resolved by setting 'heartbeat_in_pthread' to False.
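For reference, the workaround described above lives in the service's oslo.messaging configuration (e.g. nova.conf). A sketch of the relevant section only:

```ini
[oslo_messaging_rabbit]
# Run the AMQP heartbeat in a greenthread rather than a native
# pthread; True is known to break non-WSGI services such as
# nova-compute and the neutron agents.
heartbeat_in_pthread = false
```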

Changed in oslo.messaging:
status: New → In Progress
Changed in oslo.messaging:
importance: Undecided → Critical
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (master)

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/852251
Committed: https://opendev.org/openstack/oslo.messaging/commit/e44f286ebca0fbde5eae2f7eb9a21ba55ba2a549
Submitter: "Zuul (22348)"
Branch: master

commit e44f286ebca0fbde5eae2f7eb9a21ba55ba2a549
Author: Slawek Kaplonski <email address hidden>
Date: Fri Aug 5 12:40:40 2022 +0200

    Change default value of "heartbeat_in_pthread" to False

    As was reported in the related bug some time ago, setting that
    option to True for nova-compute can break it, as it is a non-WSGI
    service. We also noticed the same problems with randomly stuck
    non-WSGI services, e.g. the neutron agents, and the same issue can
    probably happen with any other non-WSGI service.

    To avoid that, this patch changes the default value of that config
    option to False.
    Together with [1] it effectively reverts the change done in [2] some
    time ago.

    [1] https://review.opendev.org/c/openstack/oslo.messaging/+/800621
    [2] https://review.opendev.org/c/openstack/oslo.messaging/+/747395

    Related-Bug: #1934937
    Closes-Bug: #1961402

    Change-Id: I85f5b9d1b5d15ad61a9fcd6e25925b7eeb8bf6e7

Changed in oslo.messaging:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 14.0.0

This issue was fixed in the openstack/oslo.messaging 14.0.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/yoga)

Fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/854904

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to oslo.messaging (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/876318

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to oslo.messaging (master)

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/876318
Committed: https://opendev.org/openstack/oslo.messaging/commit/fd2381c723fe805b17aca1f80bfff4738fbe9628
Submitter: "Zuul (22348)"
Branch: master

commit fd2381c723fe805b17aca1f80bfff4738fbe9628
Author: Arnaud Morin <email address hidden>
Date: Fri Mar 3 11:16:56 2023 +0100

    Disable greenthreads for RabbitDriver "listen" connections

    When enabling heartbeat_in_pthread, we were restoring the "threading"
    Python library from eventlet to the original one in RabbitDriver, but
    we forgot to do the same in AMQPDriverBase (RabbitDriver is a
    subclass of AMQPDriverBase).

    We also need to use the original "queue" so that queues do not use
    greenthreads either.

    Related-bug: #1961402
    Related-bug: #1934937
    Closes-bug: #2009138

    Signed-off-by: Arnaud Morin <email address hidden>
    Change-Id: I34ea0d1381e934297df2f793e0d2594ef8254f00

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to oslo.messaging (stable/2023.1)

Related fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/880187

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to oslo.messaging (stable/zed)

Related fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/880188

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to oslo.messaging (stable/yoga)

Related fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/oslo.messaging/+/880189

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to oslo.messaging (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/880187
Committed: https://opendev.org/openstack/oslo.messaging/commit/3645839d162a989da46ba1c0fcbe7f32b48b9fb1
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit 3645839d162a989da46ba1c0fcbe7f32b48b9fb1
Author: Arnaud Morin <email address hidden>
Date: Fri Mar 3 11:16:56 2023 +0100

    Disable greenthreads for RabbitDriver "listen" connections

    When enabling heartbeat_in_pthread, we were restoring the "threading"
    Python library from eventlet to the original one in RabbitDriver, but
    we forgot to do the same in AMQPDriverBase (RabbitDriver is a
    subclass of AMQPDriverBase).

    We also need to use the original "queue" so that queues do not use
    greenthreads either.

    Related-bug: #1961402
    Related-bug: #1934937
    Closes-bug: #2009138

    Signed-off-by: Arnaud Morin <email address hidden>
    Change-Id: I34ea0d1381e934297df2f793e0d2594ef8254f00
    (cherry picked from commit fd2381c723fe805b17aca1f80bfff4738fbe9628)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to oslo.messaging (stable/zed)

Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/880188
Committed: https://opendev.org/openstack/oslo.messaging/commit/4d15b7c4fe0c14e285484d23c15fe5531e952679
Submitter: "Zuul (22348)"
Branch: stable/zed

commit 4d15b7c4fe0c14e285484d23c15fe5531e952679
Author: Arnaud Morin <email address hidden>
Date: Fri Mar 3 11:16:56 2023 +0100

    Disable greenthreads for RabbitDriver "listen" connections

    When enabling heartbeat_in_pthread, we were restoring the "threading"
    Python library from eventlet to the original one in RabbitDriver, but
    we forgot to do the same in AMQPDriverBase (RabbitDriver is a
    subclass of AMQPDriverBase).

    We also need to use the original "queue" so that queues do not use
    greenthreads either.

    Related-bug: #1961402
    Related-bug: #1934937
    Closes-bug: #2009138

    Signed-off-by: Arnaud Morin <email address hidden>
    Change-Id: I34ea0d1381e934297df2f793e0d2594ef8254f00
    (cherry picked from commit 3645839d162a989da46ba1c0fcbe7f32b48b9fb1)

tags: added: in-stable-zed