mistral engine sporadically stops processing

Bug #1835097 reported by Steffn
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Mistral
Incomplete
High
Unassigned

Bug Description

Hi!

I am running mistral in a container based setup, with the multi node configuration. After some time the Mistral Engine stops processing messages and everything comes to a halt (no workflow, no task is processed anymore). You can see the errors in the API log below (for a cron and an API triggered run). If I redeploy the Mistral Engine everything is resumed where it was left off and works, till the error reoccurs.
I observed the sporadic error after weeks, as well as only after 30mins after a redeploy. Atm I wouldn't know where to look next, already did online research but didnt't find any hint, thus filing this bug report.

Mistral 8.0.0
Python 3.7.2
PostgresDB
RabbitMQ 3.7.12

Happy to provide any further information!
Steffn

2019-07-01 08:10:58.216 28 ERROR mistral.rpc.base [req-c140887e-e4a6-414f-9231-fed2ba6df47b - - - - -] Caught a messaging remote error. See details of the original exception.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 2505ae837bc54e118b6c9b13b021da74
2019-07-01 08:10:58.216 28 ERROR mistral.rpc.base Traceback (most recent call last):
2019-07-01 08:10:58.216 28 ERROR mistral.rpc.base File "/usr/local/lib/python3.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 393, in get
2019-07-01 08:10:58.216 28 ERROR mistral.rpc.base return self._queues[msg_id].get(block=True, timeout=timeout)
2019-07-01 08:10:58.216 28 ERROR mistral.rpc.base File "/usr/local/lib/python3.7/dist-packages/eventlet/queue.py", line 322, in get
2019-07-01 08:10:58.216 28 ERROR mistral.rpc.base return waiter.wait()
2019-07-01 08:10:58.216 28 ERROR mistral.rpc.base File "/usr/local/lib/python3.7/dist-packages/eventlet/queue.py", line 141, in wait
2019-07-01 08:10:58.216 28 ERROR mistral.rpc.base return get_hub().switch()
2019-07-01 08:10:58.216 28 ERROR mistral.rpc.base File "/usr/local/lib/python3.7/dist-packages/eventlet/hubs/hub.py", line 297, in switch
2019-07-01 08:10:58.216 28 ERROR mistral.rpc.base return self.greenlet.switch()
2019-07-01 08:10:58.216 28 ERROR mistral.rpc.base _queue.Empty
2019-07-01 08:10:58.216 28 ERROR mistral.rpc.base
2019-07-01 08:10:58.216 28 ERROR mistral.rpc.base During handling of the above exception, another exception occurred:
2019-07-01 08:10:58.216 28 ERROR mistral.rpc.base
2019-07-01 08:10:58.216 28 ERROR mistral.rpc.base Traceback (most recent call last):
2019-07-01 08:10:58.216 28 ERROR mistral.rpc.base File "/opt/stack/mistral/mistral/rpc/base.py", line 99, in decorator
2019-07-01 08:10:58.216 28 ERROR mistral.rpc.base return method(*args, **kwargs)
2019-07-01 08:10:58.216 28 ERROR mistral.rpc.base File "/opt/stack/mistral/mistral/rpc/clients.py", line 143, in start_workflow
2019-07-01 08:10:58.216 28 ERROR mistral.rpc.base params=params
2019-07-01 08:10:58.216 28 ERROR mistral.rpc.base File "/opt/stack/mistral/mistral/rpc/oslo/oslo_client.py", line 38, in sync_call
2019-07-01 08:10:58.216 28 ERROR mistral.rpc.base **kwargs
2019-07-01 08:10:58.216 28 ERROR mistral.rpc.base File "/usr/local/lib/python3.7/dist-packages/oslo_messaging/rpc/client.py", line 178, in call
2019-07-01 08:10:58.216 28 ERROR mistral.rpc.base retry=self.retry)
2019-07-01 08:10:58.216 28 ERROR mistral.rpc.base File "/usr/local/lib/python3.7/dist-packages/oslo_messaging/transport.py", line 128, in _send
2019-07-01 08:10:58.216 28 ERROR mistral.rpc.base retry=retry)
2019-07-01 08:10:58.216 28 ERROR mistral.rpc.base File "/usr/local/lib/python3.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 645, in send
2019-07-01 08:10:58.216 28 ERROR mistral.rpc.base call_monitor_timeout, retry=retry)
2019-07-01 08:10:58.216 28 ERROR mistral.rpc.base File "/usr/local/lib/python3.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 634, in _send
2019-07-01 08:10:58.216 28 ERROR mistral.rpc.base call_monitor_timeout)
2019-07-01 08:10:58.216 28 ERROR mistral.rpc.base File "/usr/local/lib/python3.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 520, in wait
2019-07-01 08:10:58.216 28 ERROR mistral.rpc.base message = self.waiters.get(msg_id, timeout=timeout)
2019-07-01 08:10:58.216 28 ERROR mistral.rpc.base File "/usr/local/lib/python3.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 397, in get
2019-07-01 08:10:58.216 28 ERROR mistral.rpc.base 'to message ID %s' % msg_id)
2019-07-01 08:10:58.216 28 ERROR mistral.rpc.base oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 2505ae837bc54e118b6c9b13b021da74
2019-07-01 08:10:58.216 28 ERROR mistral.rpc.base
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic [req-c140887e-e4a6-414f-9231-fed2ba6df47b - - - - -] Failed to process cron trigger CronTrigger {'created_at': '2019-06-26 10:07:53', 'updated_at': '2019-07-01 07:59:58', 'scope': 'private', 'project_id': '<default-project>', 'id': '7f5ceac3-e536-4faf-9990-8dc0207e1c5a', 'name': 'trigger@foo', 'pattern': '*/10 * * * *', 'first_execution_time': None, 'next_execution_time': '2019-07-01 08:10:00', 'workflow_name': 'worfklow-1', 'remaining_executions': None, 'workflow_id': '030c3299-fe18-472a-b212-66c22d852955', 'workflow_params': {'env': {'...'}}"}}, 'workflow_params_hash': 'dc102023c194847d728fc982418ab7bc3ab532644a153b4e7556165125394453', 'workflow_input': {}, 'workflow_input_hash': '4f53cda18c2baa0c0354bb5f9a3ecbe5ed12ab4d8e11ba873c2f11161202b945', 'trust_id': None}: mistral.exceptions.MistralException: MessagingTimeout: Timed out waiting for a reply to message ID 2505ae837bc54e118b6c9b13b021da74
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic Traceback (most recent call last):
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic File "/usr/local/lib/python3.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 393, in get
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic return self._queues[msg_id].get(block=True, timeout=timeout)
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic File "/usr/local/lib/python3.7/dist-packages/eventlet/queue.py", line 322, in get
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic return waiter.wait()
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic File "/usr/local/lib/python3.7/dist-packages/eventlet/queue.py", line 141, in wait
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic return get_hub().switch()
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic File "/usr/local/lib/python3.7/dist-packages/eventlet/hubs/hub.py", line 297, in switch
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic return self.greenlet.switch()
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic _queue.Empty
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic During handling of the above exception, another exception occurred:
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic Traceback (most recent call last):
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic File "/opt/stack/mistral/mistral/rpc/base.py", line 99, in decorator
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic return method(*args, **kwargs)
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic File "/opt/stack/mistral/mistral/rpc/clients.py", line 143, in start_workflow
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic params=params
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic File "/opt/stack/mistral/mistral/rpc/oslo/oslo_client.py", line 38, in sync_call
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic **kwargs
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic File "/usr/local/lib/python3.7/dist-packages/oslo_messaging/rpc/client.py", line 178, in call
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic retry=self.retry)
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic File "/usr/local/lib/python3.7/dist-packages/oslo_messaging/transport.py", line 128, in _send
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic retry=retry)
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic File "/usr/local/lib/python3.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 645, in send
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic call_monitor_timeout, retry=retry)
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic File "/usr/local/lib/python3.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 634, in _send
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic call_monitor_timeout)
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic File "/usr/local/lib/python3.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 520, in wait
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic message = self.waiters.get(msg_id, timeout=timeout)
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic File "/usr/local/lib/python3.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 397, in get
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic 'to message ID %s' % msg_id)
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 2505ae837bc54e118b6c9b13b021da74
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic During handling of the above exception, another exception occurred:
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic Traceback (most recent call last):
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic File "/opt/stack/mistral/mistral/services/periodic.py", line 85, in process_cron_triggers_v2
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic **trigger.workflow_params
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic File "/opt/stack/mistral/mistral/rpc/base.py", line 115, in decorator
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic _wrap_exception_and_reraise(e)
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic File "/opt/stack/mistral/mistral/rpc/base.py", line 84, in _wrap_exception_and_reraise
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic raise exc.MistralException(message)
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic mistral.exceptions.MistralException: MessagingTimeout: Timed out waiting for a reply to message ID 2505ae837bc54e118b6c9b13b021da74
2019-07-01 08:10:58.218 28 ERROR mistral.services.periodic

Tags: engine
Steffn (steffn99)
description: updated
Steffn (steffn99)
summary: - mistral engine stops processing
+ mistral engine sporadically stops processing
Revision history for this message
Steffn (steffn99) wrote :

In the meantime I run into the issue again and was able to check if the rabbitmq connection is still there. I was looking for that inspired by the healthcheck script https://github.com/openstack/tripleo-common/blob/master/healthcheck/mistral-engine
So in the error situation the rabbitmq connection is gone. As a workaround I applied the same healthcheck in my environment and redeploy the mistral-engine container if it fails.

Still would be interesting to figure out the root cause and avoid the problem overall.

Revision history for this message
Renat Akhmerov (rakhmerov) wrote :

Hi,

If it started recently then most likely it's related with the recent deprecation of some config properties in oslo.messaging. Try to add "default_reply_retry=-1" under the group "oslo_messaging_amqp". The bottom line is that if you don't specify this property then the connection retry count by default is 0, i.e. it never tries to reconnect. So we need to make it infinite explicitly.

Let me know if it helps.

Changed in mistral:
milestone: none → train-1
Revision history for this message
Steffn (steffn99) wrote :

Hi,

Thanks for the reply, only saw it now! Will try the suggested configuration and report back!

Revision history for this message
Renat Akhmerov (rakhmerov) wrote :

Yep, thanks. Once you have any new info, please let us know.

Changed in mistral:
milestone: train-1 → train-rc1
importance: Undecided → High
Changed in mistral:
status: New → Incomplete
milestone: train-rc1 → none
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.