ovb 1ctrl 1comp fs022 pike: neutron/nova AMQP MessagingTimeout

Bug #1760189 reported by Rafael Folco
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Gabriele Cerami

Bug Description

fs022 pike is reporting AMQP timeouts for nova-compute and the neutron OVS agent.

Errors on the controller side:
https://logs.rdoproject.org/openstack-periodic-24hr/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset022-pike/d9e1506/overcloud-controller-0/var/log/extra/errors.txt.gz#_2018-03-27_06_45_18_768

2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource Traceback (most recent call last):
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 849, in _action_recorder
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource yield
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 957, in _do_action
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource yield self.action_handler_task(action, args=handler_args)
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 351, in wrapper
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource step = next(subtask)
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 908, in action_handler_task
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource done = check(handler_data)
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resources/openstack/nova/server.py", line 869, in check_create_complete
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource check = self.client_plugin()._check_active(server_id)
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/clients/os/nova.py", line 238, in _check_active
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource 'code': fault.get('code', _('Unknown'))
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource ResourceInError: Went to status ERROR due to "Message: No valid host was found. , Code: 500"

Errors on the compute side:

https://logs.rdoproject.org/openstack-periodic-24hr/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset022-pike/d9e1506/overcloud-novacompute-0/var/log/extra/errors.txt.gz#_2018-03-27_06_42_50_630

2018-03-27 06:41:13.072 ERROR /var/log/extra/docker/containers/neutron_ovs_agent/log/neutron/neutron-openvswitch-agent.log: 23260 ERROR neutron.common.rpc [req-392cd54f-005c-4d4f-a350-af1cde36bf4b - - - - -] Timeout in RPC method tunnel_sync. Waiting for 53 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID e495637cb37148db9a4ca34b61db11de
2018-03-27 06:42:50.630 ERROR /var/log/extra/docker/containers/nova_libvirt/log/nova/nova-compute.log: 7 ERROR oslo.messaging._drivers.impl_rabbit [req-1d44cbaa-5d4e-4f88-928e-c80cb50711fe - - - - -] [ad70f6b4-1e81-49df-baee-1b35303fd4cf] AMQP server on overcloud-controller-0.internalapi.localdomain:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds. Client port: 43950: error: [Errno 110] Connection timed out
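
The neutron message above points at rpc_response_timeout as a knob when the server is merely overloaded. For reference, a minimal sketch of bumping it on the controller (this is not the fix for this bug; the config path and restart command are assumptions and will differ on a containerized deployment):

# Not the fix for this bug -- only the mitigation the neutron log itself suggests.
# Config path and restart command are assumptions; adjust for the deployment.
$ sudo crudini --set /etc/neutron/neutron.conf DEFAULT rpc_response_timeout 180
$ sudo systemctl restart neutron-server    # or restart the neutron server container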

Revision history for this message
John Eckersberg (jeckersb) wrote :

This looks like a problem on the client side, somewhere in oslo.messaging/kombu/py-amqp.

You can see the client src ports being used from the compute log:

$ curl -s https://logs.rdoproject.org/openstack-periodic-24hr/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset022-pike/d9e1506/overcloud-novacompute-0/var/log/extra/errors.txt.gz#_2018-03-27_06_42_50_630 | egrep -o 'port: [0-9]+' | sort | uniq
port: 43950
port: 43960
port: 44038
port: 44048
port: 44060

And then you can see in the rabbit log that those same client connections are not handshaking properly for some reason:

$ curl -s https://logs.rdoproject.org/openstack-periodic-24hr/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset022-pike/d9e1506<email address hidden> | grep -B2 handshake
=ERROR REPORT==== 27-Mar-2018::06:43:00 ===
closing AMQP connection <0.7859.0> (192.168.24.14:43950 -> 192.168.24.17:5672):
{handshake_timeout,frame_header}
--
=ERROR REPORT==== 27-Mar-2018::06:44:45 ===
closing AMQP connection <0.8794.0> (192.168.24.14:43960 -> 192.168.24.17:5672):
{handshake_timeout,frame_header}
--
=ERROR REPORT==== 27-Mar-2018::06:47:30 ===
closing AMQP connection <0.10299.0> (192.168.24.14:44038 -> 192.168.24.17:5672):
{handshake_timeout,frame_header}
--
=ERROR REPORT==== 27-Mar-2018::06:49:14 ===
closing AMQP connection <0.11187.0> (192.168.24.14:44048 -> 192.168.24.17:5672):
{handshake_timeout,frame_header}
--
=ERROR REPORT==== 27-Mar-2018::06:50:58 ===
closing AMQP connection <0.12067.0> (192.168.24.14:44060 -> 192.168.24.17:5672):
{handshake_timeout,frame_header}
--
=ERROR REPORT==== 27-Mar-2018::06:52:43 ===
closing AMQP connection <0.12963.0> (192.168.24.14:44130 -> 192.168.24.17:5672):
{handshake_timeout,frame_header}

So for some reason the client knows something is wrong and reconnects with a new TCP connection from a new source port, but it never performs the AMQP handshake after the connection is established.
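
One way to tie the two logs together is a sketch along the same lines as the pipelines above, assuming both files have been downloaded and gunzipped locally as compute-errors.txt and rabbit.log (both filenames are placeholders):

# Extract the client source ports nova-compute reported, then look each one up
# in the rabbit log to confirm the broker closed it with a handshake_timeout.
$ egrep -o 'Client port: [0-9]+' compute-errors.txt | egrep -o '[0-9]+' | sort -u |
    while read port; do
        grep -A1 ":${port} ->" rabbit.log
    done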

Revision history for this message
John Eckersberg (jeckersb) wrote :

I think this may be related to the way the compute service calls out to the conductor during the startup process? We see this very early in the compute log:

2018-04-02 15:41:49.130 6 WARNING nova.conductor.api [req-22254fdc-ee62-41b5-9347-3b788ac40d8c - - - - -] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor? Reattempting establishment of nova-conductor connection...: MessagingTimeout: Timed out waiting for a reply to message ID f89d35ae47c8478d91946b0607384653
2018-04-02 15:41:59.139 6 WARNING nova.conductor.api [req-22254fdc-ee62-41b5-9347-3b788ac40d8c - - - - -] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor? Reattempting establishment of nova-conductor connection...: MessagingTimeout: Timed out waiting for a reply to message ID b1604c3ca79c400ea6a499ca73baf256
2018-04-02 15:42:09.143 6 WARNING nova.conductor.api [req-22254fdc-ee62-41b5-9347-3b788ac40d8c - - - - -] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor? Reattempting establishment of nova-conductor connection...: MessagingTimeout: Timed out waiting for a reply to message ID 701e75d59a94448ca7c913f84ac5ab52
2018-04-02 15:42:19.155 6 WARNING nova.conductor.api [req-22254fdc-ee62-41b5-9347-3b788ac40d8c - - - - -] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor? Reattempting establishment of nova-conductor connection...: MessagingTimeout: Timed out waiting for a reply to message ID 0235454609f94457b5c2b4e220b1e4bf
2018-04-02 15:42:29.161 6 WARNING nova.conductor.api [req-22254fdc-ee62-41b5-9347-3b788ac40d8c - - - - -] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor? Reattempting establishment of nova-conductor connection...: MessagingTimeout: Timed out waiting for a reply to message ID 3ccdb5654ba847d9b1c00c5bf32ab319
2018-04-02 15:42:39.165 6 WARNING nova.conductor.api [req-22254fdc-ee62-41b5-9347-3b788ac40d8c - - - - -] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor? Reattempting establishment of nova-conductor connection...: MessagingTimeout: Timed out waiting for a reply to message ID b823ce13ba9a4268bfa08457fdeaa9e0
2018-04-02 15:42:49.176 6 WARNING nova.conductor.api [req-22254fdc-ee62-41b5-9347-3b788ac40d8c - - - - -] Timed out waiting for nova-conductor. Is it running? Or did this service start before nova-conductor? Reattempting establishment of nova-conductor connection...: MessagingTimeout: Timed out waiting for a reply to message ID c42f32732878498e8db0eddf54b263f7
2018-04-02 15:42:49.214 6 INFO nova.conductor.api [req-22254fdc-ee62-41b5-9347-3b788ac40d8c - - - - -] nova-conductor connection established successfully

And in the rabbit log, there are two connections established from compute:

=INFO REPORT==== 2-Apr-2018::15:41:39 ===
accepting AMQP connection <0.6144.0> (192.168.24.11:45302 -> 192.168.24.10:5672)

=INFO REPORT==== 2-Apr-2018::15:41:39 ===
Connection <0.6144.0> (192.168.24.11:45302 -> 192.168.24.10:5672) has a client-provided na...
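
For the broker-side view of those connections, the connection table can also be listed directly on the controller; a sketch, assuming the pacemaker-bundle container name TripleO uses here (on a non-containerized broker, drop the docker exec wrapper):

# A connection that never completes the AMQP handshake will not reach the
# 'running' state. The container name is an assumption for this environment.
$ sudo docker exec rabbitmq-bundle-docker-0 \
    rabbitmqctl list_connections peer_host peer_port state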

Revision history for this message
John Eckersberg (jeckersb) wrote :

Same behavior if I restart the nova_compute container while the conductor is already running. The AMQP connections are flapping in the same way, and the pre_start_hook thread is stuck.
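
Roughly how that restart test looks on the compute node (the container name is the one used above; the host-side log path is an assumption about where the container writes its logs):

# Restart the compute container and watch for the AMQP reconnect/flap
# behaviour described above. Host-side log path is an assumption.
$ sudo docker restart nova_compute
$ sudo tail -f /var/log/containers/nova/nova-compute.log | grep -Ei 'amqp|conductor'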

Revision history for this message
John Eckersberg (jeckersb) wrote :

Something really weird is going on at a low level in the network... adding an attachment with the Wireshark output. This is taken from the compute side.

The connection to RabbitMQ is established fine and transmitting data, and then suddenly a particular payload is never ACKd so it gets retransmitted repeatedly. The odd thing is that the connection is clearly still alive, because the keepalive traffic is flowing correctly in both directions.

Eventually the AMQP session dies because this behavior blocks the heartbeat traffic. It reconnects and the same thing repeats forever.
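
A sketch of how such a capture can be taken and filtered on the compute node (interface choice and file name are placeholders):

# Capture AMQP traffic to the controller, then pull out just the TCP
# retransmissions to see the payload that never gets ACKed.
$ sudo tcpdump -i any -s 0 -w /tmp/amqp.pcap 'tcp port 5672'
$ tshark -r /tmp/amqp.pcap -Y 'tcp.analysis.retransmission'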

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-quickstart-extras (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/559107

tags: removed: fs022 ovb pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-quickstart-extras (master)

Reviewed: https://review.openstack.org/559107
Committed: https://git.openstack.org/cgit/openstack/tripleo-quickstart-extras/commit/?id=e1e91e55cc2f53273d2c996becb532e4ffc12ec0
Submitter: Zuul
Branch: master

commit e1e91e55cc2f53273d2c996becb532e4ffc12ec0
Author: Alex Schultz <email address hidden>
Date: Thu Apr 5 15:16:44 2018 +0000

    Revert "Remove adjust-interface-mtus script"

    This reverts commit c1d0eb1c8748a52a92788ecaaa25e49fa0818cfd.

    Since this was fixed in queens/master, we don't need to run it in queens
    and master so we've added a release clause.

    Change-Id: I8b5b6ed983b1560f9f834abd8a54ae53b4db3465
    Related-Bug: #1760189
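
The reverted adjust-interface-mtus script presumably lowers interface MTUs on the OVB test environment, which would fit the symptom above of large frames being retransmitted while small keepalives get through. A rough sketch of that kind of check and adjustment (interface name and value are placeholders; the real logic is in the script itself in tripleo-quickstart-extras):

# Illustrative only -- interface name and MTU value are placeholders.
$ ip link show eth0 | grep -o 'mtu [0-9]*'     # check the current MTU
$ sudo ip link set dev eth0 mtu 1350           # lower it to allow for tunnel overhead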

Changed in tripleo:
assignee: nobody → Gabriele Cerami (gcerami)
Revision history for this message
wes hayutin (weshayutin) wrote :
Changed in tripleo:
status: Triaged → Fix Released