Rabbit OCF scripts cannot identify partial split brain

Bug #1584504 reported by Andrey Epifanov
This bug affects 1 person
Affects               Status   Importance  Assigned to        Milestone
Mirantis OpenStack    Status tracked in 10.0.x
  10.0.x              Invalid  High        MOS Oslo
  6.0.x               Invalid  High        MOS Maintenance
  6.1.x               Invalid  High        MOS Maintenance
  7.0.x               Invalid  High        Anton Chevychalov
  8.0.x               Invalid  High        Anton Chevychalov
  9.x                 Invalid  High        MOS Oslo

Bug Description

MOS 6.1

After RabbitMQ failed on all 3 controllers (with a lot of stack traces),
it was repaired by restarting the RabbitMQ resource in Pacemaker.

The RMQ cluster status after that became the following:
node-1# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-1' ...
[{nodes,[{disc,['rabbit@node-1','rabbit@node-2','rabbit@node-3']}]},
 {running_nodes,['rabbit@node-1','rabbit@node-2','rabbit@node-3']},
 {cluster_name,<<"rabbit@node-1">>},
 {partitions,[]}]
...done.

node-2# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-2' ...
[{nodes,[{disc,['rabbit@node-1','rabbit@node-2','rabbit@node-3']}]},
 {running_nodes,['rabbit@node-1','rabbit@node-2']},
 {cluster_name,<<"rabbit@node-1">>},
 {partitions,[]}]
...done.

node-3# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-3' ...
[{nodes,[{disc,['rabbit@node-1','rabbit@node-2','rabbit@node-3']}]},
 {running_nodes,['rabbit@node-1','rabbit@node-3']},
 {cluster_name,<<"rabbit@node-1">>},
 {partitions,[]}]
...done.

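Note that node-2 and node-3 do not list each other in running_nodes, while node-1 lists both and every node reports empty partitions. It looks like the OCF scripts cannot identify this kind of partial split brain.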
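
For illustration only, the mismatch in the three outputs above can be detected by comparing each node's running_nodes view against the union of all views. This is a minimal Python sketch, assuming the `rabbitmqctl cluster_status` text has already been collected from every node; the function names `running_nodes` and `missing_peers` are hypothetical and are not part of the OCF script.

```python
import re


def running_nodes(status_text):
    """Extract the running_nodes set from `rabbitmqctl cluster_status` text."""
    match = re.search(r"\{running_nodes,\[(.*?)\]\}", status_text, re.S)
    if match is None:
        return set()
    # Node names are single-quoted Erlang atoms, e.g. 'rabbit@node-1'.
    return set(re.findall(r"'([^']+)'", match.group(1)))


def missing_peers(views):
    """Map each node to the peers that some node reports running but it does not.

    `views` maps a node name to the set of nodes it reports as running.
    Nodes whose view already covers the union of all views are omitted.
    """
    all_running = set().union(*views.values())
    result = {}
    for node, seen in views.items():
        missing = all_running - seen
        if missing:
            result[node] = sorted(missing)
    return result
```

Fed with the three outputs from this report, `missing_peers` would flag node-2 (missing node-3) and node-3 (missing node-2), even though `partitions` is empty on every node.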

Changed in mos:
status: New → Confirmed
tags: added: customer-found support
summary: - Rabbit ocf cannot identify patial split brain
+ Rabbit OCF scripts cannot identify patial split brain
tags: added: ct1
Bogdan Dobrelya (bogdando) wrote : Re: Rabbit OCF scripts cannot identify patial split brain

Please backport; this was fixed in master.

Changed in mos:
milestone: none → 6.1-updates
importance: Undecided → High
status: Confirmed → Triaged
assignee: nobody → MOS Maintenance (mos-maintenance)
Dina Belova (dbelova) wrote :

After a conversation with the Oslo team, marking as High (was Critical) for 6.1.

summary: - Rabbit OCF scripts cannot identify patial split brain
+ Rabbit OCF scripts cannot identify partial split brain
Dmitry Mescheryakov (dmitrymex) wrote :

This is a very rare issue, hence moving to the 10.0 milestone.

tags: added: move-to-10.0
Dina Belova (dbelova)
tags: added: move-to-mu
tags: added: 10.0-reviewed
Bogdan Dobrelya (bogdando) wrote :

There is nothing to fix for 9.x and 10.0, as it has already been fixed there. Moving to Incomplete, as additional confirmation is required on whether the bug applies to those releases.

tags: added: move-to-9.2
Vitaly Sedelnik (vsedelnik) wrote :

Invalid for 9.x and 10.0

Roman Rufanov (rrufanov) wrote :

This is being experienced by a MOS 7 customer; please fix in MOS 7.

Dmitry Mescheryakov (dmitrymex) wrote :

I have set 9.2 to Fix Committed to indicate that the problem is actually fixed in 9.2, and that the fix is not available in 9.1 or 9.0.

Dmitry Mescheryakov (dmitrymex) wrote :

For clarity: the bug is fixed in 9.2 after commit https://review.openstack.org/#/c/378600/

tags: added: on-verification
Alexey Lebedeff (alebedev-a) wrote :

The following command should be run on a non-master node:

rabbitmqctl eval "sys:replace_state(rabbit_node_monitor, fun(OldState) -> setelement(3, OldState, ['test@localhost']) end)."

After some time this node should restart, and the Pacemaker logs on the master should contain the "thinks that it is partitoned with" message from https://review.openstack.org/#/c/378600/2/files/fuel-ha-utils/ocf/rabbitmq
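
As a rough aid for the log check described above, here is a small Python sketch that filters log lines for that message. The marker string is copied as quoted in this comment; the log file location is not given in this report, so the path is taken from the command line rather than assumed, and the function name `partition_reports` is hypothetical.

```python
import sys

# Marker spelled exactly as quoted in the comment above.
MARKER = "thinks that it is partitoned with"


def partition_reports(log_lines):
    """Return the log lines that contain the partition marker."""
    return [line.rstrip("\n") for line in log_lines if MARKER in line]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_log.py <path-to-pacemaker-log>
    with open(sys.argv[1], errors="replace") as log_file:
        for report in partition_reports(log_file):
            print(report)
```

An empty result on the master after the node restart would suggest the check from the commit did not fire.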

TatyanaGladysheva (tgladysheva) wrote :

Verified on 9.2 snapshot #636 using the verification steps from the previous comment.

Actual results: http://paste.openstack.org/show/592457/

tags: removed: on-verification
Anton Chevychalov (achevychalov) wrote :

The commit mentioned above is incorrectly pinned to this issue. That fix will not work for the split brain described above (where partitions is empty but running_nodes differs between nodes).

Our developers consider the issue to be somewhere inside the RabbitMQ code, so it should not be fixed in the OCF scripts. Perhaps it has been fixed in newer versions of MOS by bumping the RabbitMQ version.

But we need more information about the steps to reproduce, so I am marking this bug as Incomplete until new information is received.

Vitaly Sedelnik (vsedelnik) wrote :

Closing as Invalid, as the bug has stayed in Incomplete for more than a month. Please reopen if the issue is reproduced again.
