I think this may be related to the way the compute service calls conductor during the startup process? We see this very early in the compute log:

2018-04-02 15:41:49.130 6 WARNING nova.conductor.api [req-22254fdc-ee62-41b5-9347-3b788ac40d8c - - - - -] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor? Reattempting establishment of nova-conductor connection...: MessagingTimeout: Timed out waiting for a reply to message ID f89d35ae47c8478d91946b0607384653
2018-04-02 15:41:59.139 6 WARNING nova.conductor.api [req-22254fdc-ee62-41b5-9347-3b788ac40d8c - - - - -] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor? Reattempting establishment of nova-conductor connection...: MessagingTimeout: Timed out waiting for a reply to message ID b1604c3ca79c400ea6a499ca73baf256
2018-04-02 15:42:09.143 6 WARNING nova.conductor.api [req-22254fdc-ee62-41b5-9347-3b788ac40d8c - - - - -] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor? Reattempting establishment of nova-conductor connection...: MessagingTimeout: Timed out waiting for a reply to message ID 701e75d59a94448ca7c913f84ac5ab52
2018-04-02 15:42:19.155 6 WARNING nova.conductor.api [req-22254fdc-ee62-41b5-9347-3b788ac40d8c - - - - -] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor? Reattempting establishment of nova-conductor connection...: MessagingTimeout: Timed out waiting for a reply to message ID 0235454609f94457b5c2b4e220b1e4bf
2018-04-02 15:42:29.161 6 WARNING nova.conductor.api [req-22254fdc-ee62-41b5-9347-3b788ac40d8c - - - - -] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor? Reattempting establishment of nova-conductor connection...: MessagingTimeout: Timed out waiting for a reply to message ID 3ccdb5654ba847d9b1c00c5bf32ab319
2018-04-02 15:42:39.165 6 WARNING nova.conductor.api [req-22254fdc-ee62-41b5-9347-3b788ac40d8c - - - - -] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor? Reattempting establishment of nova-conductor connection...: MessagingTimeout: Timed out waiting for a reply to message ID b823ce13ba9a4268bfa08457fdeaa9e0
2018-04-02 15:42:49.176 6 WARNING nova.conductor.api [req-22254fdc-ee62-41b5-9347-3b788ac40d8c - - - - -] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor? Reattempting establishment of nova-conductor connection...: MessagingTimeout: Timed out waiting for a reply to message ID c42f32732878498e8db0eddf54b263f7
2018-04-02 15:42:49.214 6 INFO nova.conductor.api [req-22254fdc-ee62-41b5-9347-3b788ac40d8c - - - - -] nova-conductor connection established successfully
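For context, the 10-second cadence of those warnings matches the retry-until-ready behaviour at compute startup: the service pings the conductor over RPC and keeps retrying on MessagingTimeout until a reply comes back. A rough sketch of that pattern (this is not the actual nova.conductor.api code; ping_rpc and log are placeholders):

# Sketch of the retry-until-ready loop behind the warnings above. `ping_rpc`
# stands in for a short RPC call to the conductor that raises
# MessagingTimeout when no reply arrives in time.
import oslo_messaging as messaging


def wait_for_conductor(ping_rpc, log):
    while True:
        try:
            ping_rpc()
            log.info('nova-conductor connection established successfully')
            return
        except messaging.MessagingTimeout:
            log.warning('Timed out waiting for nova-conductor. Is it '
                        'running? Or did this service start before '
                        'nova-conductor? Reattempting establishment of '
                        'nova-conductor connection...')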
And in the rabbit log, there are two connections established from compute:

=INFO REPORT==== 2-Apr-2018::15:41:39 ===
accepting AMQP connection <0.6144.0> (192.168.24.11:45302 -> 192.168.24.10:5672)

=INFO REPORT==== 2-Apr-2018::15:41:39 ===
Connection <0.6144.0> (192.168.24.11:45302 -> 192.168.24.10:5672) has a client-provided name: nova-compute:6:362e8ca8-736e-4aaf-b0db-c4d02389bdb0

=INFO REPORT==== 2-Apr-2018::15:41:39 ===
accepting AMQP connection <0.6162.0> (192.168.24.11:45306 -> 192.168.24.10:5672)

=INFO REPORT==== 2-Apr-2018::15:41:39 ===
Connection <0.6162.0> (192.168.24.11:45306 -> 192.168.24.10:5672) has a client-provided name: nova-compute:6:a53af820-6970-468d-b5c2-d41124850165

The first connection should be the one the connection pool allocates for the outgoing call request, and the second should be the one spawned as the reply-waiter connection that receives all RPC replies for this worker.

A few minutes later, two more connections are made from compute:

=INFO REPORT==== 2-Apr-2018::15:44:33 ===
accepting AMQP connection <0.9108.0> (192.168.24.11:45482 -> 192.168.24.10:5672)

=INFO REPORT==== 2-Apr-2018::15:44:34 ===
accepting AMQP connection <0.9111.0> (192.168.24.11:45484 -> 192.168.24.10:5672)

The latter one (<0.9111.0>) provides *the same* client name as the previous reply-waiter connection (<0.6162.0>), meaning the connection object is being reused on the client side (the UUID is generated once and stored as an instance variable):

=INFO REPORT==== 2-Apr-2018::15:44:34 ===
Connection <0.9111.0> (192.168.24.11:45484 -> 192.168.24.10:5672) has a client-provided name: nova-compute:6:a53af820-6970-468d-b5c2-d41124850165

Immediately afterwards, the original connection is closed abruptly, which makes sense if the same connection object is being reused:

=WARNING REPORT==== 2-Apr-2018::15:44:34 ===
closing AMQP connection <0.6162.0> (192.168.24.11:45306 -> 192.168.24.10:5672 - nova-compute:6:a53af820-6970-468d-b5c2-d41124850165):
client unexpectedly closed TCP connection

It is at this time that the compute service logs these errors:

2018-04-02 15:44:33.183 6 ERROR oslo.messaging._drivers.impl_rabbit [req-22254fdc-ee62-41b5-9347-3b788ac40d8c - - - - -] [a53af820-6970-468d-b5c2-d41124850165] AMQP server on overcloud-controller-0.internalapi.localdomain:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds. Client port: 45482: error: [Errno 110] Connection timed out
2018-04-02 15:44:34.207 6 INFO oslo.messaging._drivers.impl_rabbit [req-22254fdc-ee62-41b5-9347-3b788ac40d8c - - - - -] [a53af820-6970-468d-b5c2-d41124850165] Reconnected to AMQP server on overcloud-controller-0.internalapi.localdomain:5672 via [amqp] client with port 45484.

After 10 seconds, rabbitmq gives up on the other new connection (<0.9108.0>) because it never performed an AMQP handshake.
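That reuse of the client-provided name is consistent with a pattern like the one below. To be clear, this is a hypothetical sketch and not the oslo.messaging code; the class and method names are made up purely to illustrate the UUID-as-instance-variable behaviour:

# Hypothetical sketch of why a brand-new TCP connection shows up in the
# rabbit log under the *same* client-provided name: the UUID is generated
# once in __init__ and kept on the instance, so every re-establishment of
# the underlying socket advertises the same name to the broker.
import os
import uuid


class RPCConnection(object):
    def __init__(self, service_name):
        # Generated once per connection object, e.g.
        # "nova-compute:6:a53af820-6970-468d-b5c2-d41124850165".
        self.name = '%s:%d:%s' % (service_name, os.getpid(), uuid.uuid4())
        self._connect()

    def _connect(self):
        # Stand-in for opening the AMQP transport; the name is handed to the
        # broker during the connection handshake, which is what rabbit
        # reports as the "client-provided name".
        print('connecting as %s' % self.name)

    def reconnect(self):
        # Reuses self.name, so the old TCP connection (<0.6162.0>) goes away
        # and a new one (<0.9111.0>) shows up under the same name.
        self._connect()


conn = RPCConnection('nova-compute')
conn.reconnect()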
Note this corresponds to the first error in the compute log above, on port 45482:

=ERROR REPORT==== 2-Apr-2018::15:44:43 ===
closing AMQP connection <0.9108.0> (192.168.24.11:45482 -> 192.168.24.10:5672):
{handshake_timeout,frame_header}

Some time later this pattern repeats, with one significant difference -- the compute log starts logging that the client connection has no port at all:

2018-04-02 15:45:34.218 6 ERROR oslo.messaging._drivers.impl_rabbit [req-22254fdc-ee62-41b5-9347-3b788ac40d8c - - - - -] [a53af820-6970-468d-b5c2-d41124850165] AMQP server on overcloud-controller-0.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds. Client port: None: timeout: timed out

It gets a new port:

2018-04-02 15:45:35.240 6 INFO oslo.messaging._drivers.impl_rabbit [req-22254fdc-ee62-41b5-9347-3b788ac40d8c - - - - -] [a53af820-6970-468d-b5c2-d41124850165] Reconnected to AMQP server on overcloud-controller-0.internalapi.localdomain:5672 via [amqp] client with port 45490.

And then later loses it again somehow:

2018-04-02 15:46:35.251 6 ERROR oslo.messaging._drivers.impl_rabbit [req-22254fdc-ee62-41b5-9347-3b788ac40d8c - - - - -] [a53af820-6970-468d-b5c2-d41124850165] AMQP server on overcloud-controller-0.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds. Client port: None: timeout: timed out

This repeats forever.
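My guess on the "Client port: None" entries is that the port in that message is looked up on the current socket at the moment the error is logged, so once the transport has been torn down there is simply no socket left to ask. That is only a guess; here is a hedged sketch of the idea, with an assumed attribute path rather than the real oslo.messaging internals:

# Hypothetical helper showing one way "Client port: None" can happen: the
# local port is read back from the live socket when the error is logged, and
# after the transport has been torn down there is nothing left to report.
import socket


def client_port(connection):
    """Best-effort local TCP port of the AMQP transport, or None."""
    try:
        # Assumed attribute path, for illustration only.
        return connection.transport.sock.getsockname()[1]
    except (AttributeError, socket.error):
        return None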
Now, the reason I said initially that this might be related to startup is that I see this stuck thread in the greenthread output:

------ Green Thread ------

/usr/lib/python2.7/site-packages/eventlet/greenthread.py:214 in main
    `result = function(*args, **kwargs)`
/usr/lib/python2.7/site-packages/oslo_service/service.py:721 in run_service
    `service.start()`
/usr/lib/python2.7/site-packages/nova/service.py:174 in start
    `self.manager.pre_start_hook()`
/usr/lib/python2.7/site-packages/nova/compute/manager.py:1183 in pre_start_hook
    `startup=True)`
/usr/lib/python2.7/site-packages/nova/compute/manager.py:6758 in update_available_resource
    `self.update_available_resource_for_node(context, nodename)`
/usr/lib/python2.7/site-packages/nova/compute/manager.py:6720 in update_available_resource_for_node
    `rt.update_available_resource(context, nodename)`
/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py:657 in update_available_resource
    `self._update_available_resource(context, resources)`
/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:271 in inner
    `return f(*args, **kwargs)`
/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py:681 in _update_available_resource
    `self._init_compute_node(context, resources)`
/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py:550 in _init_compute_node
    `self._setup_pci_tracker(context, cn, resources)`
/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py:572 in _setup_pci_tracker
    `self.pci_tracker = pci_manager.PciDevTracker(context, node_id=n_id)`
/usr/lib/python2.7/site-packages/nova/pci/manager.py:72 in __init__
    `context, node_id)`
/usr/lib/python2.7/site-packages/oslo_versionedobjects/base.py:177 in wrapper
    `args, kwargs)`
/usr/lib/python2.7/site-packages/nova/conductor/rpcapi.py:240 in object_class_action_versions
    `args=args, kwargs=kwargs)`
/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py:169 in call
    `retry=self.retry)`
/usr/lib/python2.7/site-packages/oslo_messaging/transport.py:123 in _send
    `timeout=timeout, retry=retry)`
/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:566 in send
    `retry=retry)`
/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:552 in _send
    `msg=msg, timeout=timeout, retry=retry)`
/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py:1278 in topic_send
    `retry=retry)`
/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py:1161 in _ensure_publishing
    `self.ensure(method, retry=retry, error_callback=_error_callback)`
/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py:807 in ensure
    `ret, channel = autoretry_method()`
/usr/lib/python2.7/site-packages/kombu/connection.py:494 in _ensured
    `return fun(*args, **kwargs)`
/usr/lib/python2.7/site-packages/kombu/connection.py:570 in __call__
    `return fun(*args, channel=channels[0], **kwargs), channels[0]`
/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py:796 in execute_method
    `method()`
/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py:1193 in _publish
    `compression=self.kombu_compression)`
/usr/lib/python2.7/site-packages/kombu/messaging.py:181 in publish
    `exchange_name, declare,`
/usr/lib/python2.7/site-packages/kombu/messaging.py:203 in _publish
    `mandatory=mandatory, immediate=immediate,`
/usr/lib/python2.7/site-packages/amqp/channel.py:1759 in basic_publish_confirm
    `self.wait(spec.Basic.Ack)`
/usr/lib/python2.7/site-packages/amqp/abstract_channel.py:93 in wait
    `self.connection.drain_events(timeout=timeout)`
/usr/lib/python2.7/site-packages/amqp/connection.py:464 in drain_events
    `return self.blocking_read(timeout)`
/usr/lib/python2.7/site-packages/amqp/connection.py:468 in blocking_read
    `frame = self.transport.read_frame()`
/usr/lib/python2.7/site-packages/amqp/transport.py:237 in read_frame
    `frame_header = read(7, True)`
/usr/lib/python2.7/site-packages/amqp/transport.py:377 in _read
    `s = recv(n - len(rbuf))`
/usr/lib/python2.7/site-packages/eventlet/greenio/base.py:354 in recv
    `return self._recv_loop(self.fd.recv, b'', bufsize, flags)`
/usr/lib/python2.7/site-packages/eventlet/greenio/base.py:348 in _recv_loop
    `self._read_trampoline()`
/usr/lib/python2.7/site-packages/eventlet/greenio/base.py:319 in _read_trampoline
    `timeout_exc=socket.timeout("timed out"))`
/usr/lib/python2.7/site-packages/eventlet/greenio/base.py:203 in _trampoline
    `mark_as_closed=self._mark_as_closed)`
/usr/lib/python2.7/site-packages/eventlet/hubs/__init__.py:162 in trampoline
    `return hub.switch()`
/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py:294 in switch
    `return self.greenlet.switch()`

That whole stack is under pre_start_hook, and the thread is still stuck hours after the service started. I'm not sure whether this is causing the RPC weirdness, whether the RPC weirdness is causing this, or whether the two combined are interacting poorly. Next I'm going to run some tests starting the compute service before and after the conductor service, to see whether the ordering makes any difference. It shouldn't... but best to verify.
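One more note on the stuck thread: the shape of that hang is easy to see outside of oslo.messaging. The publish-confirm wait bottoms out in a socket read with no deadline, so if the broker side of the TCP connection silently goes away the read never returns and the greenthread stays parked. Below is a minimal sketch with plain sockets (not kombu/amqp, and not how oslo.messaging actually configures things), just to illustrate that a read with no deadline means an indefinite hang while a finite one turns a dead peer into a recoverable error:

import socket


def wait_for_frame_header(sock, timeout=None):
    # read_frame() starts by reading a 7-byte AMQP frame header; with
    # timeout=None this is an indefinite blocking read, which is what the
    # stuck pre_start_hook thread above is doing.
    sock.settimeout(timeout)
    try:
        return sock.recv(7)
    except socket.timeout:
        # A finite deadline (or heartbeats doing the equivalent) is what
        # turns a silently dead peer into an error you can recover from.
        return None


if __name__ == '__main__':
    ours, theirs = socket.socketpair()  # "theirs" never sends anything
    print(wait_for_frame_header(ours, timeout=1.0))  # None after 1s, no hang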