The previous split-brain logic worked as follows: each slave checked
that it was connected to the master and restarted if the check failed.
The fundamental flaw in that logic is that there is little guarantee
that the master is alive at that moment. Moreover, if the master dies,
the slaves will very probably detect its death during their next
monitor check and restart, causing complete RabbitMQ cluster downtime.

With the new approach, the master node checks that the slaves are
connected to it and orders them to restart if they are not. The check
is performed after the master node's health check, meaning that at
least that node survives. Also, orders expire in one minute, and a
freshly started node ignores orders to restart for three minutes to
give the cluster time to stabilize.
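
The timing rules above can be sketched as a small decision function.
This is only an illustration in Python, not the OCF script's actual
code: the function name, the constants, and the use of Unix timestamps
are assumptions, and the exact boundary behaviour (for example, whether
an order that is exactly 60 seconds old still counts) is a guess.

```python
ORDER_TTL_SECONDS = 60        # assumed: an order to restart expires after one minute
STARTUP_GRACE_SECONDS = 180   # assumed: a fresh node ignores orders for three minutes


def should_obey_restart_order(now, ordered_at, started_at):
    """Decide whether a slave should act on a restart order (hypothetical helper).

    now, ordered_at and started_at are Unix timestamps in seconds.
    ordered_at is None when the master has not issued any order.
    """
    if ordered_at is None:
        return False  # nothing was ordered
    if now - ordered_at > ORDER_TTL_SECONDS:
        return False  # the order is stale and must be ignored
    if now - started_at < STARTUP_GRACE_SECONDS:
        return False  # node started recently; give the cluster time to stabilize
    return True
```

A node that started long ago and received an order ten seconds ago
would restart; the same order is ignored once it is older than a minute
or if the node itself started less than three minutes ago.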

Also corrected a problem that occurred when a node starts and is
already clustered. In that case the OCF script forgot to start the
RabbitMQ app, causing a subsequent restart. Now we ensure that the
RabbitMQ app is running.

The two newly introduced attributes, rabbit-start-phase-1-time and
rabbit-ordered-to-restart, are made private. To allow the master to
set a node's order to restart, the signatures of both
ocf_update_private_attr and ocf_get_private_attr are expanded to
accept a node name.

Finally, a bug is fixed in ocf_get_private_attr. Unlike crm_attribute,
attrd_updater returns an empty string instead of "(null)" when an
attribute is not defined on the requested node but is defined on some
other node. The code is changed accordingly to expect an empty string
rather than "(null)".
Reviewed: https://review.openstack.org/324647
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=67e9b3d74f22a433da8def35a7c8bfb40f78ae89
Submitter: Jenkins
Branch: stable/mitaka
commit 67e9b3d74f22a433da8def35a7c8bfb40f78ae89
Author: Dmitry Mescheryakov <email address hidden>
Date: Wed May 25 10:48:50 2016 +0300
Enhance split-brain detection logic
Closes-Bug: #1561894
Closes-Bug: #1559136
Change-Id: Ib72794361dac54817975163593ea7e07f7e8b4e1