RabbitMQ fails to synchronize exchanges under high load (Note for ubuntu: stein, rocky, queens(bionic) changes only fix compatibility with fully patched releases)
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ubuntu Cloud Archive | Invalid | Undecided | Unassigned |
Mitaka | Triaged | Medium | Seyeong Kim |
Queens | Fix Released | Medium | Seyeong Kim |
Rocky | Fix Released | Medium | Chris MacNaughton |
Stein | Fix Released | Medium | Unassigned |
Train | Fix Released | Undecided | Unassigned |
oslo.messaging | Fix Released | Undecided | Oleg Bondarev |
python-oslo.messaging (Ubuntu) | Fix Released | Medium | Unassigned |
Xenial | Invalid | Medium | Seyeong Kim |
Bionic | Fix Released | Medium | Seyeong Kim |
Bug Description
[Impact]
If there are many exchanges and queues, rabbitmq-server reports errors after a failover saying that exchanges cannot be found.
Affected
Bionic (Queens)
Not affected
Focal
[Test Case]
1. deploy simple rabbitmq cluster
- https:/
2. juju ssh neutron-gateway/0
- for i in {1..1000}; do systemctl restart neutron-
3. optionally, add more exchanges, queues, and bindings to make the problem easier to reproduce
- rabbitmq-plugins enable rabbitmq_management
- rabbitmqctl add_user test password
- rabbitmqctl set_user_tags test administrator
- rabbitmqctl set_permissions -p openstack test ".*" ".*" ".*"
- https:/
- for i in {1..2000}; do ./create.sh test_$i; done
4. restart the rabbitmq-server service, or shut the machine down and power it back on, several times.
5. the "exchange not found" error appears in the rabbitmq-server log
[1] create.sh (pasting here because pastebins don't last forever)
#!/bin/bash
rabbitmqadmin declare exchange -V openstack name=$1 type=direct -u test -p password
rabbitmqadmin declare queue -V openstack name=$1 durable=false -u test -p password 'arguments=
rabbitmqadmin -V openstack declare binding source=$1 destination_
[Where problems could occur]
1. Every service that uses oslo.messaging needs to be restarted.
2. Message delivery could be affected.
[Others]
Possible Workaround
1. for the "exchange not found" error:
- create the exchange, queue, and binding for each problematic name reported in the log
- then restart the rabbitmq-server nodes one by one
2. for a queue that crashed and failed to restart:
- delete the specific queue named in the log
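A minimal sketch of the first step of workaround 1, pulling the problematic names out of a log. It assumes the log lines contain the text "no exchange 'NAME'" as in the excerpt quoted in the original description; the function name `extract_missing_exchanges` and the log path in the usage comment are hypothetical.

```shell
#!/bin/bash
# Sketch: list the exchange names that RabbitMQ reported as missing.
# Assumes log lines containing: no exchange 'NAME'

extract_missing_exchanges() {
    # Reads log text on stdin and prints each missing exchange name once.
    grep -o "no exchange '[^']*'" \
        | sed "s/^no exchange '//; s/'\$//" \
        | sort -u
}

# Usage (the log path is an assumption; adjust it to your deployment):
#   extract_missing_exchanges < /var/log/rabbitmq/rabbit@HOSTNAME.log |
#       while read -r name; do ./create.sh "$name"; done
```

Feeding the names to create.sh recreates the exchange, queue, and binding under the same name, which is what the workaround describes; review the list before recreating anything on a production cluster.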
// original description
Input:
- OpenStack Pike cluster with ~500 nodes
- DVR enabled in neutron
- Lots of messages
Scenario: failover of one rabbit node in a cluster
Issue: after the failed rabbit node comes back online, some RPC communication appears to be broken
Logs from rabbit:
=ERROR REPORT==== 10-Aug-
Channel error on connection <0.14839.1> (10.200.0.24:55834 -> 10.200.0.31:5672, vhost: '/openstack', user: 'openstack'), channel 1:
operation basic.publish caused a channel exception not_found: no exchange 'reply_
Investigation:
After the rabbit node comes back online it immediately receives many new connections and, for some reason, fails to synchronize exchanges (the cluster had ~1600 exchanges): the exchange count on that node stays low and does not increase.
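One way to observe the exchange count per node is sketched below. The helper names `count_names` and `exchanges_on_node` are hypothetical, the vhost default is taken from the log excerpt above, and the node names in the usage comment are placeholders.

```shell
#!/bin/bash
# Sketch: compare how many exchanges each cluster node reports.
VHOST="${VHOST:-/openstack}"   # vhost as it appears in the log excerpt

count_names() {
    # Counts non-empty lines on stdin (one exchange name per line).
    # The default exchange has an empty name, so it is not counted.
    grep -c . || true
}

exchanges_on_node() {
    # -n selects the node to query, -q suppresses informational output.
    rabbitmqctl -q -n "$1" list_exchanges -p "$VHOST" name | count_names
}

# Usage (node names are placeholders):
#   for node in rabbit@node1 rabbit@node2 rabbit@node3; do
#       echo "$node: $(exchanges_on_node "$node")"
#   done
```

A recovered node whose count stays well below that of its peers would match the behaviour described here.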
Workaround: let the recovered node synchronize all exchanges by blocking new connections with iptables rules for a short time (about 30 seconds) after the failed node comes back online.
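The report does not include the actual rules, so the following is only a sketch under assumptions: clients connect on port 5672 (as in the log excerpt), the rule goes into the INPUT chain, and only connection-opening packets are rejected. The names `run` and `hold_amqp_clients` and the `DRY_RUN` switch are hypothetical; by default the script only prints the commands.

```shell
#!/bin/bash
# Sketch: hold off new AMQP client connections for a short window.
AMQP_PORT="${AMQP_PORT:-5672}"       # client port seen in the log excerpt
HOLD_SECONDS="${HOLD_SECONDS:-30}"   # window suggested in the report
DRY_RUN="${DRY_RUN:-1}"              # 1 = print commands, 0 = execute them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

hold_amqp_clients() {
    # Reject new TCP connections (SYN packets) to the AMQP port,
    # wait for the window to pass, then remove the same rule.
    run iptables -I INPUT -p tcp --dport "$AMQP_PORT" --syn -j REJECT
    run sleep "$HOLD_SECONDS"
    run iptables -D INPUT -p tcp --dport "$AMQP_PORT" --syn -j REJECT
}

# Usage: as root, with DRY_RUN=0, call hold_amqp_clients before the
# recovered node starts accepting client connections.
```

Matching only the client port leaves inter-node cluster traffic alone, since by default it uses other ports; that is what allows the node to synchronize while clients are held off.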
Proposal: do not create new exchanges (use default) for all direct messages - this also fixes the issue.
Is there a good reason for creating new exchanges for direct messages?
Changed in oslo.messaging:
assignee: nobody → Oleg Bondarev (obondarev)
Changed in python-oslo.messaging (Ubuntu):
assignee: nobody → Seyeong Kim (seyeongkim)
tags: added: sts
description: updated
description: updated
Changed in python-oslo.messaging (Ubuntu):
importance: Undecided → Medium
Changed in python-oslo.messaging (Ubuntu Bionic):
importance: Undecided → Medium
Changed in python-oslo.messaging (Ubuntu Xenial):
importance: Undecided → Medium
Changed in python-oslo.messaging (Ubuntu Xenial):
status: New → In Progress
assignee: nobody → Seyeong Kim (seyeongkim)
Changed in python-oslo.messaging (Ubuntu Bionic):
status: New → In Progress
assignee: nobody → Seyeong Kim (seyeongkim)
Changed in python-oslo.messaging (Ubuntu):
assignee: Seyeong Kim (seyeongkim) → nobody
tags: added: verification-queens-needed
tags: added: verification-done-bionic removed: verification-needed-bionic
tags: added: verification-queens-done removed: verification-queens-needed
tags: removed: verification-needed
Changed in python-oslo.messaging (Ubuntu Bionic):
status: Fix Committed → Fix Released
tags: added: verification-queens-failed removed: verification-queens-done
description: updated
Changed in python-oslo.messaging (Ubuntu Bionic):
status: New → Triaged
description: updated
Fix proposed to branch: master
Review: https://review.openstack.org/596661