ovb 1ctrl 1comp fs022 pike: neutron/nova AMQP MessagingTimeout

Bug #1760189 reported by Rafael Folco
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Gabriele Cerami

Bug Description

fs022 pike is reporting AMQP timeouts for nova-compute and the neutron OVS agent.

Errors on the controller side:
https://logs.rdoproject.org/openstack-periodic-24hr/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset022-pike/d9e1506/overcloud-controller-0/var/log/extra/errors.txt.gz#_2018-03-27_06_45_18_768

2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource Traceback (most recent call last):
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 849, in _action_recorder
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource yield
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 957, in _do_action
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource yield self.action_handler_task(action, args=handler_args)
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 351, in wrapper
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource step = next(subtask)
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 908, in action_handler_task
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource done = check(handler_data)
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resources/openstack/nova/server.py", line 869, in check_create_complete
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource check = self.client_plugin()._check_active(server_id)
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/clients/os/nova.py", line 238, in _check_active
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource 'code': fault.get('code', _('Unknown'))
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource ResourceInError: Went to status ERROR due to "Message: No valid host was found. , Code: 500"

Errors on the compute side:

https://logs.rdoproject.org/openstack-periodic-24hr/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset022-pike/d9e1506/overcloud-novacompute-0/var/log/extra/errors.txt.gz#_2018-03-27_06_42_50_630

2018-03-27 06:41:13.072 ERROR /var/log/extra/docker/containers/neutron_ovs_agent/log/neutron/neutron-openvswitch-agent.log: 23260 ERROR neutron.common.rpc [req-392cd54f-005c-4d4f-a350-af1cde36bf4b - - - - -] Timeout in RPC method tunnel_sync. Waiting for 53 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID e495637cb37148db9a4ca34b61db11de
2018-03-27 06:42:50.630 ERROR /var/log/extra/docker/containers/nova_libvirt/log/nova/nova-compute.log: 7 ERROR oslo.messaging._drivers.impl_rabbit [req-1d44cbaa-5d4e-4f88-928e-c80cb50711fe - - - - -] [ad70f6b4-1e81-49df-baee-1b35303fd4cf] AMQP server on overcloud-controller-0.internalapi.localdomain:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds. Client port: 43950: error: [Errno 110] Connection timed out
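
The neutron message above points at rpc_response_timeout as a knob when the server is merely overloaded. For reference, a minimal sketch of bumping it on the controller (this is not the fix for this bug; the config path and restart command are assumptions and will differ on a containerized deployment):

# Not the fix for this bug -- only the mitigation the neutron log itself suggests.
# Config path and restart command are assumptions; adjust for the deployment.
$ sudo crudini --set /etc/neutron/neutron.conf DEFAULT rpc_response_timeout 180
$ sudo systemctl restart neutron-server    # or restart the neutron server container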

Revision history for this message
John Eckersberg (jeckersb) wrote :

This looks like a problem on the client side, somewhere in oslo.messaging/kombu/py-amqp.

You can see the client src ports being used from the compute log:

$ curl -s https://logs.rdoproject.org/openstack-periodic-24hr/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset022-pike/d9e1506/overcloud-novacompute-0/var/log/extra/errors.txt.gz#_2018-03-27_06_42_50_630 | egrep -o 'port: [0-9]+' | sort | uniq
port: 43950
port: 43960
port: 44038
port: 44048
port: 44060

And then you can see in the rabbit log that those same client connections are not handshaking properly for some reason:

$ curl -s https://logs.rdoproject.org/openstack-periodic-24hr/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset022-pike/d9e1506<email address hidden> | grep -B2 handshake
=ERROR REPORT==== 27-Mar-2018::06:43:00 ===
closing AMQP connection <0.7859.0> (192.168.24.14:43950 -> 192.168.24.17:5672):
{handshake_timeout,frame_header}
--
=ERROR REPORT==== 27-Mar-2018::06:44:45 ===
closing AMQP connection <0.8794.0> (192.168.24.14:43960 -> 192.168.24.17:5672):
{handshake_timeout,frame_header}
--
=ERROR REPORT==== 27-Mar-2018::06:47:30 ===
closing AMQP connection <0.10299.0> (192.168.24.14:44038 -> 192.168.24.17:5672):
{handshake_timeout,frame_header}
--
=ERROR REPORT==== 27-Mar-2018::06:49:14 ===
closing AMQP connection <0.11187.0> (192.168.24.14:44048 -> 192.168.24.17:5672):
{handshake_timeout,frame_header}
--
=ERROR REPORT==== 27-Mar-2018::06:50:58 ===
closing AMQP connection <0.12067.0> (192.168.24.14:44060 -> 192.168.24.17:5672):
{handshake_timeout,frame_header}
--
=ERROR REPORT==== 27-Mar-2018::06:52:43 ===
closing AMQP connection <0.12963.0> (192.168.24.14:44130 -> 192.168.24.17:5672):
{handshake_timeout,frame_header}

So for some reason the client knows something is wrong and reconnects with a new TCP connection from a new source port, but it never performs the AMQP handshake after the connection is established.
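
One way to tie the two logs together is a sketch along the same lines as the pipelines above, assuming both files have been downloaded and gunzipped locally as compute-errors.txt and rabbit.log (both filenames are placeholders):

# Extract the client source ports nova-compute reported, then look each one up
# in the rabbit log to confirm the broker closed it with a handshake_timeout.
$ egrep -o 'Client port: [0-9]+' compute-errors.txt | egrep -o '[0-9]+' | sort -u |
    while read port; do
        grep -A1 ":${port} ->" rabbit.log
    done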

Revision history for this message
John Eckersberg (jeckersb) wrote :

I think this may be related to the way the compute service calls out to the conductor during the startup process? We see this very early in the compute log:

2018-04-02 15:41:49.130 6 WARNING nova.conductor.api [req-22254fdc-ee62-41b5-9347-3b788ac40d8c - - - - -] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor? Reattempting establishment of nova-conductor connection...: MessagingTimeout: Timed out waiting for a reply to message ID f89d35ae47c8478d91946b0607384653
2018-04-02 15:41:59.139 6 WARNING nova.conductor.api [req-22254fdc-ee62-41b5-9347-3b788ac40d8c - - - - -] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor? Reattempting establishment of nova-conductor connection...: MessagingTimeout: Timed out waiting for a reply to message ID b1604c3ca79c400ea6a499ca73baf256
2018-04-02 15:42:09.143 6 WARNING nova.conductor.api [req-22254fdc-ee62-41b5-9347-3b788ac40d8c - - - - -] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor? Reattempting establishment of nova-conductor connection...: MessagingTimeout: Timed out waiting for a reply to message ID 701e75d59a94448ca7c913f84ac5ab52
2018-04-02 15:42:19.155 6 WARNING nova.conductor.api [req-22254fdc-ee62-41b5-9347-3b788ac40d8c - - - - -] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor? Reattempting establishment of nova-conductor connection...: MessagingTimeout: Timed out waiting for a reply to message ID 0235454609f94457b5c2b4e220b1e4bf
2018-04-02 15:42:29.161 6 WARNING nova.conductor.api [req-22254fdc-ee62-41b5-9347-3b788ac40d8c - - - - -] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor? Reattempting establishment of nova-conductor connection...: MessagingTimeout: Timed out waiting for a reply to message ID 3ccdb5654ba847d9b1c00c5bf32ab319
2018-04-02 15:42:39.165 6 WARNING nova.conductor.api [req-22254fdc-ee62-41b5-9347-3b788ac40d8c - - - - -] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor? Reattempting establishment of nova-conductor connection...: MessagingTimeout: Timed out waiting for a reply to message ID b823ce13ba9a4268bfa08457fdeaa9e0
2018-04-02 15:42:49.176 6 WARNING nova.conductor.api [req-22254fdc-ee62-41b5-9347-3b788ac40d8c - - - - -] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor? Reattempting establishment of nova-conductor connection...: MessagingTimeout: Timed out waiting for a reply to message ID c42f32732878498e8db0eddf54b263f7
2018-04-02 15:42:49.214 6 INFO nova.conductor.api [req-22254fdc-ee62-41b5-9347-3b788ac40d8c - - - - -] nova-conductor connection established successfully

And in the rabbit log, there are two connections established from compute:

=INFO REPORT==== 2-Apr-2018::15:41:39 ===
accepting AMQP connection <0.6144.0> (192.168.24.11:45302 -> 192.168.24.10:5672)

=INFO REPORT==== 2-Apr-2018::15:41:39 ===
Connection <0.6144.0> (192.168.24.11:45302 -> 192.168.24.10:5672) has a client-provided na...
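
For the broker-side view of those connections, the connection table can also be listed directly on the controller; a sketch, assuming the pacemaker-bundle container name TripleO uses here (on a non-containerized broker, drop the docker exec wrapper):

# A connection that never completes the AMQP handshake will not reach the
# 'running' state. The container name is an assumption for this environment.
$ sudo docker exec rabbitmq-bundle-docker-0 \
    rabbitmqctl list_connections peer_host peer_port state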

Revision history for this message
John Eckersberg (jeckersb) wrote :

Same behavior if I restart the nova_compute container while the conductor is already running. The AMQP connections are flapping in the same way, and the pre_start_hook thread is stuck.
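
Roughly how that restart test looks on the compute node (the container name is the one used above; the host-side log path is an assumption about where the container writes its logs):

# Restart the compute container and watch for the AMQP reconnect/flap
# behaviour described above. Host-side log path is an assumption.
$ sudo docker restart nova_compute
$ sudo tail -f /var/log/containers/nova/nova-compute.log | grep -Ei 'amqp|conductor'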

Revision history for this message
John Eckersberg (jeckersb) wrote :

Something really weird is going on at a low level in the network... adding an attachment with the Wireshark output. This is taken from the compute side.

The connection to RabbitMQ is established fine and transmitting data, and then suddenly a particular payload is never ACKd so it gets retransmitted repeatedly. The odd thing is that the connection is clearly still alive, because the keepalive traffic is flowing correctly in both directions.

Eventually the AMQP session dies because this behavior blocks the heartbeat traffic. It reconnects and the same thing repeats forever.
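
A sketch of how such a capture can be taken and filtered on the compute node (interface choice and file name are placeholders):

# Capture AMQP traffic to the controller, then pull out just the TCP
# retransmissions to see the payload that never gets ACKed.
$ sudo tcpdump -i any -s 0 -w /tmp/amqp.pcap 'tcp port 5672'
$ tshark -r /tmp/amqp.pcap -Y 'tcp.analysis.retransmission'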

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-quickstart-extras (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/559107

tags: removed: fs022 ovb pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-quickstart-extras (master)

Reviewed: https://review.openstack.org/559107
Committed: https://git.openstack.org/cgit/openstack/tripleo-quickstart-extras/commit/?id=e1e91e55cc2f53273d2c996becb532e4ffc12ec0
Submitter: Zuul
Branch: master

commit e1e91e55cc2f53273d2c996becb532e4ffc12ec0
Author: Alex Schultz <email address hidden>
Date: Thu Apr 5 15:16:44 2018 +0000

    Revert "Remove adjust-interface-mtus script"

    This reverts commit c1d0eb1c8748a52a92788ecaaa25e49fa0818cfd.

    Since this was fixed in queens/master, we don't need to run it in queens
    and master so we've added a release clause.

    Change-Id: I8b5b6ed983b1560f9f834abd8a54ae53b4db3465
    Related-Bug: #1760189
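
The reverted adjust-interface-mtus script presumably lowers interface MTUs on the OVB test environment, which would fit the symptom above of large frames being retransmitted while small keepalives get through. A rough sketch of that kind of check and adjustment (interface name and value are placeholders; the real logic is in the script itself in tripleo-quickstart-extras):

# Illustrative only -- interface name and MTU value are placeholders.
$ ip link show eth0 | grep -o 'mtu [0-9]*'     # check the current MTU
$ sudo ip link set dev eth0 mtu 1350           # lower it to allow for tunnel overhead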

Changed in tripleo:
assignee: nobody → Gabriele Cerami (gcerami)
Revision history for this message
wes hayutin (weshayutin) wrote :
Changed in tripleo:
status: Triaged → Fix Released