heat engine failure results in loss of RPC requests

Bug #1580977 reported by Anant Patil
This bug affects 1 person
Affects: OpenStack Heat
Status: In Progress
Importance: Undecided
Assigned to: Anant Patil
Milestone: no-priority-tag-bugs

Bug Description

RPC requests are queued up locally in the heat engine, and a failure of
the heat engine leads to the loss of those requests. The requests (in
the form of AMQP messages) should remain available in the messaging
server if the heat engine is not able to process them.

The messages carrying the RPC call/cast requests are drained from the
messaging server (RabbitMQ) and submitted to the thread pool executor
(GreenThreadPoolExecutor from the futurist library). Before a message is
submitted to the executor, it is acknowledged, which means it is deleted
from the messaging server. The thread pool executor queues the messages
locally when there are no eventlets available to process them. This is
bad: the messages are queued up locally, and if the process goes down
they are lost and very difficult to recover, because they are no longer
available in the messaging server. The mail thread
http://lists.openstack.org/pipermail/openstack-dev/2015-July/068742.html
gives more context; I cried and wept when I read it.
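
To make the ordering concrete, here is a rough sketch of the flow
described above. FakeMessage, handle_rpc and the deque standing in for
the RabbitMQ queue are made up for illustration; only futurist's
GreenThreadPoolExecutor (which needs eventlet installed) is the real
executor class, and this is not the actual oslo.messaging code.

    import collections

    import futurist

    # Hypothetical stand-ins for the AMQP pieces -- not oslo.messaging code.
    class FakeMessage(object):
        def __init__(self, body):
            self.body = body
            self.acked = False

        def acknowledge(self):
            # A real broker (RabbitMQ) deletes the message at this point.
            self.acked = True

    broker = collections.deque(FakeMessage(i) for i in range(8))

    def handle_rpc(message):
        print('processing %s' % message.body)   # pretend RPC endpoint

    executor = futurist.GreenThreadPoolExecutor(max_workers=2)

    # The old (broken) ordering: acknowledge, then submit to the executor.
    while broker:
        message = broker.popleft()   # drain the next AMQP message
        message.acknowledge()        # ack first: the broker forgets it now
        executor.submit(handle_rpc, message)
        # If both green threads are busy, submit() only queues the message
        # in process memory; a heat-engine crash here loses the request.

    executor.shutdown()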

In convergence, the heat engine casts the requests to process the
resources, and we don't want heat engine failures to result in the loss
of those resource requests, as there is no easy way to recover them.

The issue is fixed by https://review.openstack.org/#/c/297988 . I
installed and tested with version 5.0.0, which is the latest version of
oslo.messaging and has the fix. In the new version, a message is
acknowledged only after it gets an eventlet. This is not ideal, in the
sense that it doesn't give the service/client the freedom to acknowledge
the message when it wants to, but it is better than the older versions.
So, if the engine process cannot get an eventlet/thread to process a
message, the message is not acknowledged and it remains in the messaging
server.
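
The fixed ordering, with the same made-up stand-ins as the sketch above
(again just an illustration of the described behaviour, not the actual
oslo.messaging 5.0.0 code), looks roughly like this:

    # Reusing FakeMessage/handle_rpc from the sketch above; only the
    # ordering changes: the ack moves inside the green thread.
    broker = collections.deque(FakeMessage(i) for i in range(8))
    executor = futurist.GreenThreadPoolExecutor(max_workers=2)

    def dispatch(message):
        message.acknowledge()   # ack only once a green thread has the message
        handle_rpc(message)

    while broker:
        executor.submit(dispatch, broker.popleft())
        # In the real system, messages still waiting for a green thread are
        # unacknowledged, so RabbitMQ keeps (or redelivers) them if the
        # heat-engine process dies.

    executor.shutdown()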

I tested with two engine processes, each with the executor thread pool
size set to 2. This means at most 4 resources should be processed at a
time, and the remaining ones should stay available in the messaging
server. I created a stack of 8 test resources, each with a 20-second
wait time, and saw that 4 messages were available in the messaging
server while the other 4 were being processed. I restarted the engine
processes and the remaining messages were again taken up for processing.
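
For reference, the arithmetic behind that observation (plain Python,
nothing Heat-specific):

    engines = 2                       # heat-engine processes
    pool_size = 2                     # executor thread pool size per engine
    resources = 8                     # test resources in the stack
    in_flight = engines * pool_size   # 4 resources processed at a time
    waiting = resources - in_flight   # 4 messages left in the messaging server
    print(in_flight, waiting)         # -> 4 4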

I am glad that the issue is fixed in the new version and we should move
to it before enabling convergence by default.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/315488

Changed in heat:
assignee: nobody → Anant Patil (ananta)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on heat (master)

Change abandoned by Anant Patil (<email address hidden>) on branch: master
Review: https://review.openstack.org/315488
Reason: Submitted patch in global requirements.

Rico Lin (rico-lin)
Changed in heat:
milestone: none → no-priority-tag-bugs