[RabbitMQ] nova-compute stuck for a while (AMQP)
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
Fuel for OpenStack | Invalid | High | Bogdan Dobrelya | |
4.1.x | Fix Committed | High | Bogdan Dobrelya | |
Bug Description
Symptoms:
1)
* From time to time a random nova-compute is marked as "XXX" for a while.
* The compute service itself works properly. The logs record status update reports being sent to the conductor, but nothing is actually sent.
* "netstat" shows that all connections to/from rabbit are "ESTABLISHED".
* rabbitmqctl shows that the "compute.node-x" queue is synced to all slaves.
2)
* The computes' queues grow once some time has passed since the last compute service restart.
Axe style solution:
/etc/init.
Summary:
1) Fuel should provide TCP KA (keepalives) for RabbitMQ sessions in HA mode.
These TCP KA should be visible at the app layer as well as at the network stack layer.
related Oslo.messaging issue: https:/
related fuel-dev ML: https:/
2) Instances on compute nodes should be consistent with their state in the nova db in order to prevent uncontrolled growth of the computes' queues. The reaping logic update that was done in Icehouse should be synced as well (running_
related zendesk issues, #1663, #1743
Perhaps this issue should be fixed in 5.0, but backporting should be considered critical for the 3.2.1, 4.1, and 4.1.1 releases (due to the increasing number of related tickets in zendesk).
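For illustration, the network-stack side of the TCP keepalives requested in item 1 could look roughly like the sketch below. This is not the Fuel or oslo.messaging change itself; the helper name `enable_tcp_keepalive` and the timing values are assumptions chosen for the example, and the tuning options are applied only where the platform exposes them.

```python
import socket


def enable_tcp_keepalive(sock, idle=60, interval=10, count=5):
    """Enable TCP keepalive probes on an existing socket (hypothetical helper).

    idle/interval/count are illustrative values, not recommendations.
    """
    # Ask the kernel to send keepalive probes on this connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket tuning knobs; not every platform defines all of them,
    # so each one is set only if the socket module exposes it.
    tuning = (
        ("TCP_KEEPIDLE", idle),       # seconds of idleness before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between unanswered probes
        ("TCP_KEEPCNT", count),       # unanswered probes before the peer is declared dead
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

With settings like these a half-open connection to a failed rabbit node would be torn down by the kernel after roughly idle + interval * count seconds instead of staying "ESTABLISHED" indefinitely.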
Changed in fuel: | |
assignee: | nobody → Fuel Hardening Team (fuel-hardening) |
Changed in fuel: | |
milestone: | 5.0 → 4.1.1 |
Changed in fuel: | |
milestone: | 4.1.1 → 5.0 |
status: | In Progress → Invalid |
The support for Rabbit heartbeat was reverted: https://review.openstack.org/#/c/36606/. With kombu you have to call heartbeat_check() once per second. Without a thread calling that function your connections will all die after heartbeat seconds.
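A minimal sketch of the kind of thread this comment describes is shown below. It is not the reverted oslo change; `start_heartbeat_thread` is a hypothetical helper that accepts any object exposing a `heartbeat_check()` method, and it assumes the caller takes care of not using the same connection concurrently from other threads.

```python
import threading


def start_heartbeat_thread(connection, interval=1.0):
    """Call connection.heartbeat_check() every `interval` seconds.

    Hypothetical helper: `connection` is any object with a
    heartbeat_check() method. Returns (thread, stop_event).
    """
    stop = threading.Event()

    def _run():
        # Event.wait() returns False on timeout and True once stop is set,
        # so the loop ticks once per interval until asked to stop.
        while not stop.wait(interval):
            try:
                connection.heartbeat_check()
            except Exception:
                # A failed check suggests the connection is gone; stop
                # ticking and leave reconnecting to the caller.
                stop.set()

    thread = threading.Thread(target=_run, name="amqp-heartbeat", daemon=True)
    thread.start()
    return thread, stop
```

With kombu the object passed in would presumably be a `kombu.Connection` created with a non-zero `heartbeat` argument; calling `stop.set()` ends the thread.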
The kombu reconnect changes here: https://review.openstack.org/#/c/76686/ along with the CCN changes are already in our packages. The config changes to rabbit here: https://bugs.launchpad.net/oslo.messaging/+bug/856764/comments/19 sound helpful though and are worth testing.