hook failed: "leader-settings-changed" (double bootstrap-pxc due to a leadership change before cluster bootstrap completion)

Bug #1732257 reported by Ashley Lai
This bug affects 2 people
Affects: OpenStack Percona Cluster Charm
Status: Confirmed
Importance: Critical
Assigned to: Unassigned

Bug Description

16.04 HWE-edge kernel
OpenStack version: Pike

Using percona-cluster r275, we deployed 3 units. Two of them (mysql/0 and mysql/2) failed in the leader-settings-changed hook.

The juju log shows that the leader UUID and the unit UUID do not match.

2017-11-14 19:27:19 DEBUG leader-settings-changed Traceback (most recent call last):
2017-11-14 19:27:19 DEBUG leader-settings-changed File "/var/lib/juju/agents/unit-mysql-0/charm/hooks/leader-settings-changed", line 780, in <module>
2017-11-14 19:27:19 DEBUG leader-settings-changed main()
2017-11-14 19:27:19 DEBUG leader-settings-changed File "/var/lib/juju/agents/unit-mysql-0/charm/hooks/leader-settings-changed", line 770, in main
2017-11-14 19:27:19 DEBUG leader-settings-changed hooks.execute(sys.argv)
2017-11-14 19:27:19 DEBUG leader-settings-changed File "/var/lib/juju/agents/unit-mysql-0/charm/hooks/charmhelpers/core/hookenv.py", line 768, in execute
2017-11-14 19:27:19 DEBUG leader-settings-changed self._hooks[hook_name]()
2017-11-14 19:27:19 DEBUG leader-settings-changed File "/var/lib/juju/agents/unit-mysql-0/charm/hooks/leader-settings-changed", line 737, in leader_settings_changed
2017-11-14 19:27:19 DEBUG leader-settings-changed update_bootstrap_uuid()
2017-11-14 19:27:19 DEBUG leader-settings-changed File "/var/lib/juju/agents/unit-mysql-0/charm/hooks/percona_utils.py", line 512, in update_bootstrap_uuid
2017-11-14 19:27:19 DEBUG leader-settings-changed cluster_state_uuid)
2017-11-14 19:27:19 DEBUG leader-settings-changed percona_utils.InconsistentUUIDError: Leader UUID ('0dde54c4-c961-11e7-9775-2f72db67cbda') != Unit UUID ('b3e1b7e8-c95f-11e7-91fa-56c8581eb616')
2017-11-14 19:27:19 ERROR juju.worker.uniter.operation runhook.go:107 hook "leader-settings-changed" failed: exit status 1
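For context, a minimal sketch of the kind of check that raises this error. The names InconsistentUUIDError and update_bootstrap_uuid come from the traceback above; the bodies here are assumed, not the actual charm-percona-cluster code:

```python
# Hedged sketch: reproduces the failure mode from the traceback.
# The real update_bootstrap_uuid in percona_utils.py does more work
# (querying wsrep status, handling the not-yet-bootstrapped case).

class InconsistentUUIDError(Exception):
    """Raised when the leader's bootstrap UUID differs from this unit's."""
    def __init__(self, leader_uuid, unit_uuid):
        super().__init__(
            "Leader UUID ('%s') != Unit UUID ('%s')" % (leader_uuid, unit_uuid))

def update_bootstrap_uuid(lead_cluster_state_uuid, cluster_state_uuid):
    """Fail hard if the local wsrep cluster UUID diverges from the leader's."""
    if lead_cluster_state_uuid != cluster_state_uuid:
        raise InconsistentUUIDError(lead_cluster_state_uuid,
                                    cluster_state_uuid)
```

Since the hook re-raises the exception, Juju marks the hook run as failed (exit status 1), which is what the ERROR line above shows.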

I've attached the juju crashdump log.

Revision history for this message
Ashley Lai (alai) wrote :
tags: added: cpe-onsite
Ashley Lai (alai)
tags: added: cdo-qa
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

I can confirm that alitvinov and I saw exactly this behavior a few weeks ago on stable/17.08.

Leader UUID ('0dde54c4-c961-11e7-9775-2f72db67cbda') != Unit UUID ('b3e1b7e8-c95f-11e7-91fa-56c8581eb616')

In our case all leader-get invocations returned the same result (so the Juju leader bucket is consistent across units).

http://paste.ubuntu.com/25962876/

As the code suggests, this is case 1 and it seems that there were two concurrent bootstrap attempts.

    elif lead_cluster_state_uuid != cluster_state_uuid:
        # this may mean 2 things:
        # 1) the units have diverged, which it's bad and we do stop.
        # 2) cluster_state_uuid could not be retrieved because it
        # hasn't been bootstrapped, mysqld is stopped, etc.

The question is: how is this even possible with a single leader?

render_config_restart_on_changed -> bootstrap_pxc

https://github.com/openstack/charm-percona-cluster/blob/stable/17.08/hooks/percona_hooks.py#L323-L327
    if is_leader():
        log("Leader unit - bootstrap required=%s" % (not bootstrapped),
            DEBUG)
        render_config_restart_on_changed(clustered, hosts,
                                         bootstrap=not bootstrapped)

https://github.com/openstack/charm-percona-cluster/blob/stable/17.08/hooks/percona_hooks.py#L221-L224
            bootstrap_pxc()
            # NOTE(dosaboy): this will not actually do anything if no cluster
            # relation id exists yet.
            notify_bootstrapped()

If both units have passed the is_leader gate, then they were both leaders at some point.
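The race can be reproduced as a toy simulation (all names here are hypothetical stand-ins, not the charm's actual code): leadership flips to a second unit before the first leader's bootstrap UUID reaches leader settings, so both units see bootstrapped=False and both run bootstrap-pxc.

```python
# Toy reproduction of the double-bootstrap race, under the assumption
# that "bootstrapped" is derived from a leader-settings key that the
# first leader never got to write before losing leadership.
import uuid

leader_settings = {}  # stand-in for Juju's leader bucket

def bootstrap_pxc():
    """Pretend to bootstrap PXC; a real bootstrap mints a cluster state UUID."""
    return str(uuid.uuid4())

def maybe_bootstrap(unit_is_leader):
    # Gate modeled on percona_hooks.py: only the leader bootstraps,
    # and only if no bootstrap UUID has been published yet.
    bootstrapped = 'bootstrap-uuid' in leader_settings
    if unit_is_leader and not bootstrapped:
        return bootstrap_pxc()
    return None

# mysql/2 is leader and bootstraps, but leadership flips to mysql/1
# before mysql/2 records its UUID in leader settings...
uuid_first_leader = maybe_bootstrap(unit_is_leader=True)
# ...so mysql/1, now the leader, still sees bootstrapped=False and
# bootstraps a second, divergent cluster.
uuid_second_leader = maybe_bootstrap(unit_is_leader=True)
leader_settings['bootstrap-uuid'] = uuid_second_leader
```

The two calls return two different UUIDs, which is exactly the mismatch the minion units later trip over.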

===

Each time a new leader is elected there will be a resignation operation by the operation resolver in uniter:

https://github.com/juju/juju/blob/juju-2.2.6/worker/uniter/leadership/resolver.go#L56-L60

Normally, a current leader will keep its lease (lease renewal).

https://github.com/juju/juju/blob/juju-2.2.6/worker/leadership/tracker.go#L249

===

Looking at the logs:

unit-mysql-0 - never elected as a leader
unit-mysql-1 - elected as leader once at 2017-11-14 17:27:07, never resigned
unit-mysql-2 - first leader ever for this app, elected as a leader once at 2017-11-14 17:17:00, resigned at 2017-11-14 17:21:33

unit-mysql-2.log

# there is no leader-elected hook implemented in percona-cluster so this is fine

post-install queued leader-elected event
2017-11-14 17:16:33 INFO juju.worker.uniter resolver.go:104 found queued "leader-elected" hook

2017-11-14 17:16:33 DEBUG juju.worker.uniter agent.go:17 [AGENT-STATUS] executing: running leader-elected hook

2017-11-14 17:16:35 INFO juju-log Unknown hook leader-elected - skipping.

2017-11-14 17:17:00 INFO juju.worker.uniter.operation runhook.go:113 ran "leader-elected" hook
2017-11-14 17:17:00 DEBUG juju.worker.uniter.operation executor.go:100 committing operation "run leader-elected hook"

...

2017-11-14 17:17:00 INFO juju.worker.uniter.operation runhook.go:113 ran "leader-elected" hook
2017-11-14 17:17:00 DEBUG juju.worker.uniter.operation executor.go:100 committing operation "run leader-elected hook"

...

2017-11-14 17:17:28 DEBUG juju-log Leader unit - bootstrap required=True

2017-11-14 17:17:32 DEBUG config-changed Unknown oper...


Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

What I think happens is that there is a Percona-level contention over whose cluster UUID is the right one, as we effectively ran bootstrap-pxc twice due to a leadership flip to a different unit.

I don't think a double bootstrap-pxc is generally a good idea (https://www.percona.com/doc/percona-xtradb-cluster/LATEST/bootstrap.html does not discuss consecutive bootstrap-pxc executions on different nodes - I think there is a high chance the cluster will end up in an inconsistent state, as ours did).
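One possible mitigation (a sketch only, not the charm's actual fix) is to re-check leader settings for an existing bootstrap UUID immediately before bootstrapping, and to publish the UUID in the same hook execution that performs the bootstrap. The function names below are injected stand-ins for the charm-helpers and charm functions of the same names:

```python
def safe_bootstrap(leader_get, leader_set, bootstrap_pxc, is_leader):
    """Bootstrap only if no previous leader has already recorded a UUID.

    leader_get/leader_set/bootstrap_pxc/is_leader are assumed callables
    standing in for the real charm-helpers/charm functions.
    """
    if not is_leader():
        return None
    existing = leader_get('bootstrap-uuid')
    if existing:
        # A previous leader already bootstrapped; join, don't re-bootstrap.
        return existing
    new_uuid = bootstrap_pxc()
    # Record the UUID before this hook returns, so a subsequent leader
    # elected after a leadership flip sees bootstrapped=True.
    leader_set('bootstrap-uuid', new_uuid)
    return new_uuid
```

This narrows the window but cannot close it entirely: Juju leader settings are only written when the hook tool runs, so a leadership flip mid-bootstrap can still race.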

1.
mysql/2 (the first leader) has set a cluster_state_uuid to b3e1b7e8-c95f-11e7-91fa-56c8581eb616

2017-11-14 17:17:43 DEBUG juju-log Bootstrap PXC Succeeded

2.

mysql/1 - the new leader does not error out, so it must have generated 0dde54c4-c961-11e7-9775-2f72db67cbda and set it in leader settings.

2017-11-14 17:27:24 DEBUG juju-log Bootstrap PXC Succeeded

3.

mysql/0 - a minion "for life"

Used the correct UUID of the new leader (mysql/1) (0dde54c4-c961-11e7-9775-2f72db67cbda) at first and even started notifying peer units, so:

somehow got past

    if wsrep_ready.lower() in ['on', 'ready']:
        cluster_state_uuid = get_wsrep_value('wsrep_cluster_state_uuid')

and

    elif lead_cluster_state_uuid != cluster_state_uuid:

        log('bootstrap uuid differs: %s != %s' % (lead_cluster_state_uuid,
                                                  cluster_state_uuid),

and got to

    for rid in relation_ids('cluster'):
        notify_bootstrapped(cluster_rid=rid,
                            cluster_uuid=cluster_state_uuid)

2017-11-14 17:31:07 DEBUG juju-log cluster:28: Notifying peers that percona is bootstrapped (uuid=0dde54c4-c961-11e7-9775-2f72db67cbda)

Notice the timestamp: 17:31:07 - this happened well after mysql/2 set wsrep_cluster_state_uuid to b3e1b7e8-c95f-11e7-91fa-56c8581eb616

4. mysql/0 (minion) then started reporting wsrep_cluster_state_uuid == b3e1b7e8-c95f-11e7-91fa-56c8581eb616

2017-11-14 17:31:57 ERROR juju-log bootstrap uuid differs: 0dde54c4-c961-11e7-9775-2f72db67cbda != b3e1b7e8-c95f-11e7-91fa-56c8581eb616

summary: - hook failed: "leader-settings-changed"
+ hook failed: "leader-settings-changed" (double bootstrap-pxc due to a
+ leadership change before cluster bootstrap completion)
Changed in charm-percona-cluster:
status: New → Confirmed
David Ames (thedac)
Changed in charm-percona-cluster:
importance: Undecided → Critical
milestone: none → 17.11
Ashley Lai (alai)
tags: added: cdo-qa-blocker foundation-engine