Lost communication between nova-conductor and nova-compute because of lost exchanges
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Mirantis OpenStack | Status tracked in 10.0.x | | |
10.0.x | Fix Committed | High | Kirill Bespalov |
7.0.x | Fix Released | High | Kirill Bespalov |
8.0.x | Fix Released | High | Kirill Bespalov |
9.x | Fix Released | High | Kirill Bespalov |
Bug Description
At a scale of 200 nodes, Nova reports that all nova-compute services are down. The issue appeared after some load, but at the time of observation the system had been idle for more than a day, so it had plenty of time to recover.
According to the nova-compute logs, the service tries to ping the conductor but never receives a reply:
2016-05-06 15:06:36.561 23305 ERROR oslo_service.
2016-05-06 15:06:36.561 23305 ERROR oslo_service.
The original message is present in the traffic and is received by the conductor:
2016-05-06 15:05:36.562 16992 DEBUG oslo_messaging.
Message body:
{"oslo.message": "{\"_context_
The reply queue is present:
root@node-159:~# rabbitmqctl list_consumers | grep reply_ffdd1b7c9
reply_ffdd1b7c9
root@node-159:~# rabbitmqctl list_channels name pid connection | grep <email address hidden>
192.168.0.65:53749 -> 192.168.0.88:5673 (1) <email address hidden> <email address hidden>
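The reply-queue check above can also be scripted when many compute nodes have to be inspected. The following is a minimal sketch, not part of the original report: it assumes the text produced by `rabbitmqctl list_consumers` has the queue name as the first whitespace-separated column of each row, and the function names are hypothetical.

```python
def reply_queues_with_consumers(listing, prefix="reply_"):
    """Return the set of queue names starting with `prefix` that appear
    in the first column of a `rabbitmqctl list_consumers` listing.

    Assumption: one consumer per row, queue name in the first column.
    Banner and header rows are ignored because they do not start with
    the prefix.
    """
    names = set()
    for line in listing.splitlines():
        fields = line.split()
        if fields and fields[0].startswith(prefix):
            names.add(fields[0])
    return names


def has_consumer(listing, queue_prefix):
    """True if at least one consumer row belongs to a queue whose name
    starts with `queue_prefix` (for example a truncated reply queue id)."""
    return bool(reply_queues_with_consumers(listing, prefix=queue_prefix))
```

Feeding it the captured output of `rabbitmqctl list_consumers` and calling `has_consumer(listing, "reply_ffdd1b7c9")` gives the same answer as the `grep` above, in a form that can be looped over every reply queue.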
The nova-compute service is connected to RabbitMQ and the connection is alive:
root@node-63:~# lsof -i 4 | grep 53749
nova-comp 23305 nova 4u IPv4 2422830 0t0 TCP node-63.
However, there is no evidence that the reply was ever sent to the compute service (it is not present in the traffic capture).
-------------
Restarting one of the nova-conductor services helps, but over time the number of down compute services increases, and in approximately an hour the situation becomes stable again (all dead). The conductor logs are different after a restart: according to the service, it at least tries to send the reply for about a minute:
2016-05-06 15:18:05.828 36910 DEBUG oslo_messaging.
2016-05-06 15:18:06.637 36910 DEBUG oslo_messaging.
2016-05-06 15:18:07.208 36910 DEBUG oslo_messaging.
2016-05-06 15:19:07.169 36910 WARNING oslo_messaging.
2016-05-06 15:19:07.519 36910 INFO oslo_messaging.
In the stable state there are no log entries related to replying at all.
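The "about a minute" window can be checked directly from the timestamps of the conductor log lines quoted above. This is a small sketch, assuming every log line starts with a timestamp in the `YYYY-MM-DD HH:MM:SS.mmm` form seen in the excerpts; the helper names are hypothetical.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"
TS_LENGTH = len("2016-05-06 15:18:07.208")  # width of the timestamp prefix


def parse_ts(line):
    """Parse the leading timestamp of a log line."""
    return datetime.strptime(line[:TS_LENGTH], TS_FORMAT)


def seconds_between(first_line, second_line):
    """Elapsed seconds from first_line to second_line."""
    return (parse_ts(second_line) - parse_ts(first_line)).total_seconds()


# Prefixes copied from the conductor log lines quoted in the report
# (the rest of each line is truncated in the report itself).
last_debug = "2016-05-06 15:18:07.208 36910 DEBUG oslo_messaging."
warning = "2016-05-06 15:19:07.169 36910 WARNING oslo_messaging."

print(seconds_between(last_debug, warning))  # prints 59.961
```

The gap between the last DEBUG line and the WARNING is just under 60 seconds, which matches the observation that the conductor keeps trying to send the reply for roughly a minute before giving up.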
VERSION:
[root@fuel ~]# fuel2 fuel-version
api: '1'
auth_required: true
feature_groups: []
openstack_version: mitaka-9.0
release: '9.0'
DEPLOYMENT:
200 HW nodes, Neutron DVR