MySQL units in error state due to 'leader-elected' hook failing

Bug #1784584 reported by Lorenzo Cavassa
This bug affects 2 people
Affects                                   | Status       | Importance | Assigned to  | Milestone
Canonical Juju                            | Invalid      | Undecided  | Unassigned   |
OpenStack Percona Cluster Charm           | Fix Released | Undecided  | Felipe Reyes |
percona-cluster (Juju Charms Collection)  | Invalid      | Undecided  | Unassigned   |

Bug Description

An OpenStack cloud that was already deployed and working properly.

Queens + 18.05 charms - Xenial

Suddenly, 2 of the 3 MySQL units (the remaining one is the leader) started to show Juju error messages:

"
2018-07-31 08:15:18 DEBUG worker.uniter.jujuc server.go:178 running hook tool "leader-set"
2018-07-31 08:15:18 DEBUG juju.worker.leadership tracker.go:269 mysql/1 is not mysql leader
2018-07-31 08:15:18 DEBUG leader-elected ERROR cannot write leadership settings: cannot write settings: not the leader
2018-07-31 08:15:18 DEBUG leader-elected Traceback (most recent call last):
2018-07-31 08:15:18 DEBUG leader-elected File "/var/lib/juju/agents/unit-mysql-1/charm/hooks/leader-elected", line 848, in <module>
2018-07-31 08:15:18 DEBUG leader-elected main()
2018-07-31 08:15:18 DEBUG leader-elected File "/var/lib/juju/agents/unit-mysql-1/charm/hooks/leader-elected", line 838, in main
2018-07-31 08:15:18 DEBUG leader-elected hooks.execute(sys.argv)
2018-07-31 08:15:18 DEBUG leader-elected File "/var/lib/juju/agents/unit-mysql-1/charm/hooks/charmhelpers/core/hookenv.py", line 823, in execute
2018-07-31 08:15:18 DEBUG leader-elected self._hooks[hook_name]()
2018-07-31 08:15:18 DEBUG leader-elected File "/var/lib/juju/agents/unit-mysql-1/charm/hooks/leader-elected", line 810, in leader_elected
2018-07-31 08:15:18 DEBUG leader-elected leader_set(**{'leader-ip': get_relation_ip('cluster')})
2018-07-31 08:15:18 DEBUG leader-elected File "/var/lib/juju/agents/unit-mysql-1/charm/hooks/charmhelpers/core/hookenv.py", line 952, in inner_translate_exc2
2018-07-31 08:15:18 DEBUG leader-elected return f(*args, **kwargs)
2018-07-31 08:15:18 DEBUG leader-elected File "/var/lib/juju/agents/unit-mysql-1/charm/hooks/charmhelpers/core/hookenv.py", line 1005, in leader_set
2018-07-31 08:15:18 DEBUG leader-elected subprocess.check_call(cmd)
2018-07-31 08:15:18 DEBUG leader-elected File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
2018-07-31 08:15:18 DEBUG leader-elected raise CalledProcessError(retcode, cmd)
2018-07-31 08:15:18 DEBUG leader-elected subprocess.CalledProcessError: Command '['leader-set', 'leader-ip=10.218.5.28']' returned non-zero exit status 1
2018-07-31 08:15:18 ERROR juju.worker.uniter.operation runhook.go:113 hook "leader-elected" failed: exit status 1
2018-07-31 08:15:18 DEBUG juju.worker.uniter.operation executor.go:84 lock released
2018-07-31 08:15:18 INFO juju.worker.uniter resolver.go:100 awaiting error resolution for "leader-elected" hook
2018-07-31 08:15:18 DEBUG juju.worker.uniter agent.go:17 [AGENT-STATUS] error: hook failed: "leader-elected"
"

Tags: cpe-onsite sts
Revision history for this message
James Page (james-page) wrote :

Note that the hook being called is 'leader-elected', which is actually a single line of Python: the leader-set call.

Yes, we can wrap that in an is_leader check, but it looks like something went funky with the leadership managed by Juju across these units.
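
For reference, the hook body is roughly the following (reconstructed from the traceback above, so this is only a sketch of the charm source, assuming the helpers come from the usual charmhelpers modules):

```
from charmhelpers.core.hookenv import Hooks, leader_set
from charmhelpers.contrib.network.ip import get_relation_ip

hooks = Hooks()

@hooks.hook('leader-elected')
def leader_elected():
    # The leader-set call the traceback points at; it fails with
    # "cannot write leadership settings: not the leader" when the unit has
    # already lost leadership by the time the hook actually runs.
    leader_set(**{'leader-ip': get_relation_ip('cluster')})
```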

Revision history for this message
Frode Nordahl (fnordahl) wrote :
Revision history for this message
Hua Zhang (zhhuabj) wrote :

Only the leader can call leader-set. unit-mysql-1 should not be the leader, which is why we saw the error "Command '['leader-set', 'leader-ip=10.218.5.28']' returned non-zero exit status 1". But if unit-mysql-1 was not the leader, why was the leader-elected hook called?

The following log suggests that unit "hacluster-mysql/2" (this log comes from another environment, so the unit "hacluster-mysql/2" here corresponds to unit-mysql-1 above) first shut down and then started again.

2018-07-30 16:31:14 ERROR juju.worker.uniter agent.go:31 updating agent status: cannot get unit "hacluster-mysql/2": no reachable servers
2018-07-30 16:31:14 INFO juju.worker.uniter uniter.go:399 unit "hacluster-mysql/2" shutting down: could not acquire lock: cancelled acquiring mutex
...
2018-07-30 16:31:17 DEBUG juju.worker.leadership tracker.go:130 hacluster-mysql/2 making initial claim for hacluster-mysql leadership
...
2018-07-30 16:31:18 ERROR juju.worker.dependency engine.go:548 "migration-minion" manifold worker returned unexpected error: setting up watcher: connection is shut down
2018-07-30 16:31:18 DEBUG juju.worker.dependency engine.go:550 stack trace:
github.com/juju/juju/rpc/client.go:12: connection is shut down
...
2018-07-30 16:31:31 DEBUG juju.worker.apiaddressupdater apiaddressupdater.go:88 updating API hostPorts to [[10.82.112.60:17070 127.0.0.1:17070 [::1]:17070] [10.82.112.62:17070 127.0.0.1:17070 [::1]:17070] ...
2018-07-30 16:31:31 DEBUG juju.worker.dependency engine.go:483 "leadership-tracker" manifold worker started
2018-07-30 16:31:31 DEBUG juju.worker.leadership tracker.go:130 hacluster-mysql/2 making initial claim for hacluster-mysql leadership
...
2018-07-30 16:31:31 DEBUG juju.worker.leadership tracker.go:275 hacluster-mysql/2 is not hacluster-mysql leader
2018-07-30 16:32:35 DEBUG juju.worker.uniter uniter.go:641 acquire lock "machine-lock" for uniter hook execution
2018-07-30 16:32:35 DEBUG juju.worker.uniter uniter.go:646 lock "machine-lock" acquired

So I have a guess: at the beginning, unit-mysql-0 was the leader and unit-mysql-1 was dead, so it was removed from the election queue; but then unit-mysql-0 started again, so it was added into the election queue again. Maybe this kind of race made unit-mysql-1 believe it was the leader too at that time, and so the problem happened. Of course, I am not familiar with the Juju code, so this is just a guess. Do you think this is a possibility? Thanks.

Revision history for this message
Felipe Reyes (freyes) wrote :

First, some background on how leadership works. The uniter (the Juju agent) has
a worker called leadership-tracker which is responsible for keeping track of
leadership and keeping the lease updated. If the uniter decides to run the
leader-elected hook, there is *no* guarantee that the unit is still the leader
during the execution of the hook. So a charm *must* verify that it is the
leader before doing something that is meant to be done only by the leader, for
example (pseudo-code):

Wrong approach:
```
@hooks.hook('leader-elected')
def leader_elected():
    do_bootstrap()
```

Expected approach:
```
@hooks.hook('leader-elected')
def leader_elected():
    if is_leader():
        do_bootstrap()
```

So in the case of the percona-cluster charm, the leader-elected hook simply
calls leader_set() without wrapping it in an "if is_leader()" check:
https://github.com/openstack/charm-percona-cluster/blob/master/hooks/percona_hooks.py#L807
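
Applying the expected approach to that hook would look roughly like this (a sketch based on the traceback and the charmhelpers API, not necessarily the exact patch that landed):

```
from charmhelpers.core.hookenv import Hooks, is_leader, leader_set
from charmhelpers.contrib.network.ip import get_relation_ip

hooks = Hooks()

@hooks.hook('leader-elected')
def leader_elected():
    # Only the current leader may write leadership settings, so guard the
    # call; non-leader units simply skip it instead of failing the hook.
    if is_leader():
        leader_set(**{'leader-ip': get_relation_ip('cluster')})
```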

Revision history for this message
Felipe Reyes (freyes) wrote :
tags: added: sts
Changed in percona-cluster (Juju Charms Collection):
assignee: nobody → Felipe Reyes (freyes)
status: New → In Progress
Revision history for this message
Felipe Reyes (freyes) wrote :
Changed in percona-cluster (Juju Charms Collection):
assignee: Felipe Reyes (freyes) → nobody
status: In Progress → Invalid
Changed in charm-percona-cluster:
status: New → Fix Released
assignee: nobody → Felipe Reyes (freyes)
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

Marking this as invalid for Juju, as Felipe is right: the hook is queued when the unit is the leader, but it may take some time to execute.

Changed in juju:
status: New → Invalid