nova-compute doesn't reconnect properly after control plane outage

Bug #1284431 reported by Robert Collins
This bug affects 3 people
Affects                   Status        Importance  Assigned to  Milestone
OpenStack Compute (nova)  Invalid       Undecided   Unassigned
oslo.messaging            Incomplete    Undecided   Unassigned
tripleo                   Fix Released  High        Unassigned

Bug Description

We had to reboot the control node for ci-overcloud. After that, and after ensuring it was properly back online, the service listing looked like this:

+------------------+-------------------------------------------------+----------+---------+-------+---------------------------+
| Binary           | Host                                            | Zone     | Status  | State | Updated_at                |
+------------------+-------------------------------------------------+----------+---------+-------+---------------------------+
| nova-conductor   | ci-overcloud-notcompute0-gxezgcvv4v2q           | internal | enabled | down  | 2014-02-24T19:42:26.00000 |
| nova-cert        | ci-overcloud-notcompute0-gxezgcvv4v2q           | internal | enabled | down  | 2014-02-24T19:42:18.00000 |
| nova-scheduler   | ci-overcloud-notcompute0-gxezgcvv4v2q           | internal | enabled | down  | 2014-02-24T19:42:26.00000 |
| nova-consoleauth | ci-overcloud-notcompute0-gxezgcvv4v2q           | internal | enabled | down  | 2014-02-24T19:42:18.00000 |
| nova-compute     | ci-overcloud-novacompute4-5aywwwqlmtv3          | nova     | enabled | down  | 2014-02-25T02:07:37.00000 |
| nova-compute     | ci-overcloud-novacompute7-mosbehy6ikhz          | nova     | enabled | down  | 2014-02-25T02:07:44.00000 |
| nova-compute     | ci-overcloud-novacompute0-vidddfuaauhw          | nova     | enabled | down  | 2014-02-25T02:07:36.00000 |
| nova-compute     | ci-overcloud-novacompute6-6fnuizd4n4gv          | nova     | enabled | down  | 2014-02-25T02:07:36.00000 |
| nova-compute     | ci-overcloud-novacompute1-4q2dbhdklrkq          | nova     | enabled | down  | 2014-02-25T02:07:43.00000 |
| nova-compute     | ci-overcloud-novacompute5-y27zvc4o5fps          | nova     | enabled | down  | 2014-02-25T02:07:36.00000 |
| nova-compute     | ci-overcloud-novacompute3-sxibwe5v5gpw          | nova     | enabled | down  | 2014-02-25T02:08:40.00000 |
| nova-compute     | ci-overcloud-novacompute8-4qu2kxq4e6pb          | nova     | enabled | down  | 2014-02-25T02:08:41.00000 |
| nova-compute     | ci-overcloud-novacompute2-tvsutghnaofq          | nova     | enabled | down  | 2014-02-25T02:07:36.00000 |
| nova-compute     | ci-overcloud-novacompute9-qt7sqeqcexjh          | nova     | enabled | down  | 2014-02-25T02:08:45.00000 |
| nova-scheduler   | ci-overcloud-notcompute0-gxezgcvv4v2q.novalocal | internal | enabled | up    | 2014-02-25T03:24:53.00000 |
| nova-conductor   | ci-overcloud-notcompute0-gxezgcvv4v2q.novalocal | internal | enabled | up    | 2014-02-25T03:24:59.00000 |
| nova-consoleauth | ci-overcloud-notcompute0-gxezgcvv4v2q.novalocal | internal | enabled | up    | 2014-02-25T03:24:53.00000 |
| nova-cert        | ci-overcloud-notcompute0-gxezgcvv4v2q.novalocal | internal | enabled | up    | 2014-02-25T03:24:51.00000 |
+------------------+-------------------------------------------------+----------+---------+-------+---------------------------+
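
For context on the State column: nova's default DB servicegroup driver lists a service as "down" when its last heartbeat (the Updated_at value) is older than the service_down_time option. A minimal sketch of that check, assuming the default 60-second value rather than anything configured in this deployment:

import datetime

SERVICE_DOWN_TIME = 60  # nova.conf option service_down_time, default value (assumed here)

def is_up(updated_at, now=None):
    # A service is shown as "up" only if it has reported in recently enough.
    now = now or datetime.datetime.utcnow()
    return (now - updated_at) <= datetime.timedelta(seconds=SERVICE_DOWN_TIME)

So the compute services above were not necessarily dead; they had apparently stopped reporting after the control plane reboot, which is what the rest of this bug digs into.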

Tags: compute
Tracy Jones (tjones-i)
tags: added: compute
John Garbutt (johngarbutt) wrote :

Could we have some logs from nova-compute? What did it fail on when it came up?

Changed in nova:
status: New → Incomplete
Giulio Fidente (gfidente) wrote :

The periodic task seems to fail while reporting the status:

nova.openstack.common.periodic_task Traceback (most recent call last):
nova.openstack.common.periodic_task File "/opt/stack/venvs/nova/lib/python2.7/site-packages/nova/openstack/common/periodic_task.py", line 198, in run_periodic_tasks
nova.openstack.common.periodic_task task(self, context)
nova.openstack.common.periodic_task File "/opt/stack/venvs/nova/lib/python2.7/site-packages/nova/compute/manager.py", line 4997, in _heal_instance_info_cache
nova.openstack.common.periodic_task context, self.host, expected_attrs=[], use_slave=True)
nova.openstack.common.periodic_task File "/opt/stack/venvs/nova/lib/python2.7/site-packages/nova/objects/base.py", line 151, in wrapper
nova.openstack.common.periodic_task args, kwargs)
nova.openstack.common.periodic_task File "/opt/stack/venvs/nova/lib/python2.7/site-packages/nova/conductor/rpcapi.py", line 344, in object_class_action
nova.openstack.common.periodic_task objver=objver, args=args, kwargs=kwargs)
nova.openstack.common.periodic_task File "/opt/stack/venvs/nova/lib/python2.7/site-packages/oslo/messaging/rpc/client.py", line 150, in call
nova.openstack.common.periodic_task wait_for_reply=True, timeout=timeout)
nova.openstack.common.periodic_task File "/opt/stack/venvs/nova/lib/python2.7/site-packages/oslo/messaging/transport.py", line 90, in _send
nova.openstack.common.periodic_task timeout=timeout)
nova.openstack.common.periodic_task File "/opt/stack/venvs/nova/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 412, in send
nova.openstack.common.periodic_task return self._send(target, ctxt, message, wait_for_reply, timeout)
nova.openstack.common.periodic_task File "/opt/stack/venvs/nova/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 403, in _send
nova.openstack.common.periodic_task result = self._waiter.wait(msg_id, timeout)
nova.openstack.common.periodic_task File "/opt/stack/venvs/nova/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 280, in wait
nova.openstack.common.periodic_task reply, ending, trylock = self._poll_queue(msg_id, timeout)
nova.openstack.common.periodic_task File "/opt/stack/venvs/nova/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 220, in _poll_queue
nova.openstack.common.periodic_task message = self.waiters.get(msg_id, timeout)
nova.openstack.common.periodic_task File "/opt/stack/venvs/nova/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 126, in get
nova.openstack.common.periodic_task 'to message ID %s' % msg_id)
nova.openstack.common.periodic_task MessagingTimeout: Timed out waiting for a reply to message ID fd733e68c2da4e4b8175a0df442cc897
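
For illustration, the call that times out here is an ordinary blocking RPC call that waits for a reply on a reply queue. A minimal sketch using the public oslo.messaging RPC client API (modern import paths; the broker URL, the 60 s timeout and the method name are assumptions for illustration, not values from this deployment):

from oslo_config import cfg
import oslo_messaging as messaging

# Hypothetical transport pointing at the broker address seen in the logs below.
transport = messaging.get_transport(cfg.CONF, 'rabbit://guest:guest@192.0.2.29:5672/')
client = messaging.RPCClient(transport, messaging.Target(topic='conductor'), timeout=60)

try:
    # Blocking call: publish the request, then wait for the reply.
    client.call({}, 'ping')
except messaging.MessagingTimeout:
    # This is the failure mode above: the reply never arrives, so every
    # run of the periodic task times out the same way.
    pass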

Giulio Fidente (gfidente) wrote :

This starts happening despite oslo.messaging apparently having reconnected:

oslo.messaging._drivers.impl_rabbit [-] Failed to publish message to topic 'conductor': Socket closed
oslo.messaging._drivers.impl_rabbit Traceback (most recent call last):
oslo.messaging._drivers.impl_rabbit File "/opt/stack/venvs/nova/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 622, in ensure
oslo.messaging._drivers.impl_rabbit return method(*args, **kwargs)
oslo.messaging._drivers.impl_rabbit File "/opt/stack/venvs/nova/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 718, in _publish
oslo.messaging._drivers.impl_rabbit publisher = cls(self.conf, self.channel, topic, **kwargs)
oslo.messaging._drivers.impl_rabbit File "/opt/stack/venvs/nova/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 379, in __init__
oslo.messaging._drivers.impl_rabbit **options)
oslo.messaging._drivers.impl_rabbit File "/opt/stack/venvs/nova/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 326, in __init__
oslo.messaging._drivers.impl_rabbit self.reconnect(channel)
oslo.messaging._drivers.impl_rabbit File "/opt/stack/venvs/nova/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 334, in reconnect
oslo.messaging._drivers.impl_rabbit routing_key=self.routing_key)
oslo.messaging._drivers.impl_rabbit File "/opt/stack/venvs/nova/lib/python2.7/site-packages/kombu/messaging.py", line 85, in __init__
oslo.messaging._drivers.impl_rabbit self.revive(self._channel)
oslo.messaging._drivers.impl_rabbit File "/opt/stack/venvs/nova/lib/python2.7/site-packages/kombu/messaging.py", line 218, in revive
oslo.messaging._drivers.impl_rabbit self.declare()
oslo.messaging._drivers.impl_rabbit File "/opt/stack/venvs/nova/lib/python2.7/site-packages/kombu/messaging.py", line 105, in declare
oslo.messaging._drivers.impl_rabbit self.exchange.declare()
oslo.messaging._drivers.impl_rabbit File "/opt/stack/venvs/nova/lib/python2.7/site-packages/kombu/entity.py", line 166, in declare
oslo.messaging._drivers.impl_rabbit nowait=nowait, passive=passive,
oslo.messaging._drivers.impl_rabbit File "/opt/stack/venvs/nova/lib/python2.7/site-packages/amqp/channel.py", line 620, in exchange_declare
oslo.messaging._drivers.impl_rabbit (40, 11), # Channel.exchange_declare_ok
oslo.messaging._drivers.impl_rabbit File "/opt/stack/venvs/nova/lib/python2.7/site-packages/amqp/abstract_channel.py", line 67, in wait
oslo.messaging._drivers.impl_rabbit self.channel_id, allowed_methods)
oslo.messaging._drivers.impl_rabbit File "/opt/stack/venvs/nova/lib/python2.7/site-packages/amqp/connection.py", line 237, in _wait_method
oslo.messaging._drivers.impl_rabbit self.method_reader.read_method()
oslo.messaging._drivers.impl_rabbit File "/opt/stack/venvs/nova/lib/python2.7/site-packages/amqp/method_framing.py", line 189, in read_method
oslo.messaging._drivers.impl_rabbit raise m
oslo.messaging._drivers.impl_rabbit IOError: Socket closed

oslo.messaging._drivers.impl_rabbit [-] Reconnecting to AMQP server on 192.0.2.29:5672
oslo.messaging._drivers...
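
The step that keeps failing in this trace is the exchange re-declaration kombu performs when a publisher is (re)bound to a channel. A rough sketch of that step in isolation, assuming a RabbitMQ broker at the address from the logs; this is not taken from the nova or oslo.messaging code itself:

import kombu

conn = kombu.Connection('amqp://guest:guest@192.0.2.29:5672//')
channel = conn.channel()
exchange = kombu.Exchange('conductor', type='topic', durable=False)

try:
    # Binding the producer declares the exchange on the channel; if the TCP
    # socket behind that channel has silently died (e.g. the broker was
    # restarted), py-amqp surfaces IOError('Socket closed') here, which is
    # what the traceback above shows.
    producer = kombu.Producer(channel, exchange=exchange, routing_key='conductor')
    producer.publish({'method': 'ping'})
except IOError:
    # A clean recovery needs a brand-new connection; in this report the
    # reconnect path apparently reused stale state, so the error repeated
    # until the services were restarted by hand (see the comments below).
    conn = kombu.Connection('amqp://guest:guest@192.0.2.29:5672//')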


Giulio Fidente (gfidente) wrote :

nova-conductor was silently failing; it printed a large number of identical messages:

oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 192.0.2.29:5672

and only then were the hypervisors seen as up again.

Giulio Fidente (gfidente) wrote :

^^ In the previous comment I forgot to say that I _restarted_ nova-conductor for that to happen.

Sean Dague (sdague) wrote :

My understanding from the Ops meetups is that this is all about RabbitMQ and oslo.messaging.

Changed in nova:
status: Incomplete → Invalid
Mehdi Abaakouk (sileht) wrote :

This should be fixed as of oslo.messaging 1.5.0.

Mehdi Abaakouk (sileht)
Changed in oslo.messaging:
status: New → Incomplete
Ben Nemec (bnemec)
Changed in tripleo:
status: Triaged → Fix Released