Comment 2 for bug 1423116

Bogdan Dobrelya (bogdando) wrote : Re: primary controller has been marked as offline by fuel.

The debug session on node-19 (Wed Feb 18 11:22:53 UTC 2015 - Wed Feb 18 13:05:43 UTC 2015) showed the following:
1) The beam process is able to handle AMQP connections (~900 existed within each 10-second frame):
- The number of sessions fluctuates
- AMQP messages are flowing in the conversations

2) There are multiple {handshake_timeout,frame_header} and {handshake_timeout,frame_header}
 entries in the rabbitmq log, starting at 17-Feb-2015::14:49:04. Example: http://pastebin.com/tvCMLikr
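To see how often each handshake_timeout variant occurs, the log can be tallied with standard tools. This is a minimal sketch, not a command used during the debug session: the function name count_handshake_timeouts is hypothetical, and the log path is whatever your deployment uses.

```shell
#!/usr/bin/env bash
# Hypothetical helper: tally {handshake_timeout,<reason>} tuples found in a log file.
# Assumes the tuples appear literally in the log, as quoted in the finding above.
count_handshake_timeouts() {
    local log_file="$1"
    # -o prints only the matched tuple; sort | uniq -c counts each distinct tuple;
    # the final sort puts the most frequent tuple first.
    grep -o '{handshake_timeout,[a-z_]*}' "$log_file" | sort | uniq -c | sort -rn
}
```

Call it with the path to the rabbitmq log as the only argument. Each output line is a count followed by the tuple it belongs to.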

3) rabbitmqctl status / list* is not responsive at all. The output of strace -s 2048 -T -tt -f -ff -o status.strace rabbitmqctl status is attached. Also, rabbitmqctl "eval" "rabbit_misc:which_applications()." reports that the node is down: http://paste.openstack.org/show/176860/
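An unresponsive command like this can be probed without blocking the shell by wrapping it in a time limit. The sketch below assumes GNU coreutils timeout, which exits with status 124 when the limit is hit. The function name check_responsive is hypothetical and is not part of rabbitmqctl or the resource agent.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command under a time limit and report the outcome.
# Relies on GNU coreutils `timeout`, which exits 124 if the command timed out.
check_responsive() {
    local seconds="$1"; shift
    # Discard the command's own output; only its exit status matters here.
    timeout "$seconds" "$@" >/dev/null 2>&1
    local rc=$?
    case "$rc" in
        0)   echo "responsive" ;;
        124) echo "timed out after ${seconds}s" ;;
        *)   echo "failed (exit $rc)" ;;
    esac
    return "$rc"
}
```

For example, check_responsive 30 rabbitmqctl status would report a timeout instead of hanging; the 30-second limit is an arbitrary choice for illustration.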

4) pacemaker stopped processing monitor actions for the rabbitmq resource at 2015-02-18T08:42:15.104256+00:00. Here are the last two rounds: http://paste.openstack.org/show/176843/ and there are none after that moment. Note, however, that pacemaker does not mark the resource as failed or not running, which is odd.

5) There is a hung OCF monitor action that started at Feb 18 8:42 and was still present after 4h, for the entire duration of the debug session: http://paste.openstack.org/show/176859/

Postmortem: It looks like the RA monitor action (rabbitmqctl "eval" "rabbit_misc:which_applications().") invoked at 8:42 hung and brought down the entire application.