MySQL units in error state due to 'leader-elected' hook failing

Bug #1784584 reported by Lorenzo Cavassa
This bug affects 2 people
Affects                                   | Status       | Importance | Assigned to  | Milestone
Canonical Juju                            | Invalid      | Undecided  | Unassigned   |
OpenStack Percona Cluster Charm           | Fix Released | Undecided  | Felipe Reyes |
percona-cluster (Juju Charms Collection)  | Invalid      | Undecided  | Unassigned   |

Bug Description

An OpenStack cloud that was already deployed and working properly.

Queens + 18.05 charms - Xenial

Suddenly, 2 of the 3 MySQL units (the remaining one is the leader) started to show Juju error messages:

"
2018-07-31 08:15:18 DEBUG worker.uniter.jujuc server.go:178 running hook tool "leader-set"
2018-07-31 08:15:18 DEBUG juju.worker.leadership tracker.go:269 mysql/1 is not mysql leader
2018-07-31 08:15:18 DEBUG leader-elected ERROR cannot write leadership settings: cannot write settings: not the leader
2018-07-31 08:15:18 DEBUG leader-elected Traceback (most recent call last):
2018-07-31 08:15:18 DEBUG leader-elected File "/var/lib/juju/agents/unit-mysql-1/charm/hooks/leader-elected", line 848, in <module>
2018-07-31 08:15:18 DEBUG leader-elected main()
2018-07-31 08:15:18 DEBUG leader-elected File "/var/lib/juju/agents/unit-mysql-1/charm/hooks/leader-elected", line 838, in main
2018-07-31 08:15:18 DEBUG leader-elected hooks.execute(sys.argv)
2018-07-31 08:15:18 DEBUG leader-elected File "/var/lib/juju/agents/unit-mysql-1/charm/hooks/charmhelpers/core/hookenv.py", line 823, in execute
2018-07-31 08:15:18 DEBUG leader-elected self._hooks[hook_name]()
2018-07-31 08:15:18 DEBUG leader-elected File "/var/lib/juju/agents/unit-mysql-1/charm/hooks/leader-elected", line 810, in leader_elected
2018-07-31 08:15:18 DEBUG leader-elected leader_set(**{'leader-ip': get_relation_ip('cluster')})
2018-07-31 08:15:18 DEBUG leader-elected File "/var/lib/juju/agents/unit-mysql-1/charm/hooks/charmhelpers/core/hookenv.py", line 952, in inner_translate_exc2
2018-07-31 08:15:18 DEBUG leader-elected return f(*args, **kwargs)
2018-07-31 08:15:18 DEBUG leader-elected File "/var/lib/juju/agents/unit-mysql-1/charm/hooks/charmhelpers/core/hookenv.py", line 1005, in leader_set
2018-07-31 08:15:18 DEBUG leader-elected subprocess.check_call(cmd)
2018-07-31 08:15:18 DEBUG leader-elected File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
2018-07-31 08:15:18 DEBUG leader-elected raise CalledProcessError(retcode, cmd)
2018-07-31 08:15:18 DEBUG leader-elected subprocess.CalledProcessError: Command '['leader-set', 'leader-ip=10.218.5.28']' returned non-zero exit status 1
2018-07-31 08:15:18 ERROR juju.worker.uniter.operation runhook.go:113 hook "leader-elected" failed: exit status 1
2018-07-31 08:15:18 DEBUG juju.worker.uniter.operation executor.go:84 lock released
2018-07-31 08:15:18 INFO juju.worker.uniter resolver.go:100 awaiting error resolution for "leader-elected" hook
2018-07-31 08:15:18 DEBUG juju.worker.uniter agent.go:17 [AGENT-STATUS] error: hook failed: "leader-elected"
"

Tags: cpe-onsite sts
Revision history for this message
James Page (james-page) wrote :

Note that the hook being called is 'leader-elected', which is actually a single line of Python: the leader-set call.

Yes, we can wrap that in an is_leader check, but it looks like something went funky with the leadership managed by Juju across these units.
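
For reference, the hook body is roughly the following (reconstructed from the traceback above, so this is only a sketch of the charm source, assuming the helpers come from the usual charmhelpers modules):

```
from charmhelpers.core.hookenv import Hooks, leader_set
from charmhelpers.contrib.network.ip import get_relation_ip

hooks = Hooks()

@hooks.hook('leader-elected')
def leader_elected():
    # The leader-set call the traceback points at; it fails with
    # "cannot write leadership settings: not the leader" when the unit has
    # already lost leadership by the time the hook actually runs.
    leader_set(**{'leader-ip': get_relation_ip('cluster')})
```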

Revision history for this message
Frode Nordahl (fnordahl) wrote :
Revision history for this message
Hua Zhang (zhhuabj) wrote :

Only the leader can call leader-set. unit-mysql-1 should not be the leader, which is why we saw the error "Command '['leader-set', 'leader-ip=10.218.5.28']' returned non-zero exit status 1". But if unit-mysql-1 was not the leader, why was the leader-elected hook called?

The following log suggests that unit "hacluster-mysql/2" (this log comes from another environment, so the unit "hacluster-mysql/2" here corresponds to unit-mysql-1 above) first shut down and then started again.

2018-07-30 16:31:14 ERROR juju.worker.uniter agent.go:31 updating agent status: cannot get unit "hacluster-mysql/2": no reachable servers
2018-07-30 16:31:14 INFO juju.worker.uniter uniter.go:399 unit "hacluster-mysql/2" shutting down: could not acquire lock: cancelled acquiring mutex
...
2018-07-30 16:31:17 DEBUG juju.worker.leadership tracker.go:130 hacluster-mysql/2 making initial claim for hacluster-mysql leadership
...
2018-07-30 16:31:18 ERROR juju.worker.dependency engine.go:548 "migration-minion" manifold worker returned unexpected error: setting up watcher: connection is shut down
2018-07-30 16:31:18 DEBUG juju.worker.dependency engine.go:550 stack trace:
github.com/juju/juju/rpc/client.go:12: connection is shut down
...
2018-07-30 16:31:31 DEBUG juju.worker.apiaddressupdater apiaddressupdater.go:88 updating API hostPorts to [[10.82.112.60:17070 127.0.0.1:17070 [::1]:17070] [10.82.112.62:17070 127.0.0.1:17070 [::1]:17070] ...
2018-07-30 16:31:31 DEBUG juju.worker.dependency engine.go:483 "leadership-tracker" manifold worker started
2018-07-30 16:31:31 DEBUG juju.worker.leadership tracker.go:130 hacluster-mysql/2 making initial claim for hacluster-mysql leadership
...
2018-07-30 16:31:31 DEBUG juju.worker.leadership tracker.go:275 hacluster-mysql/2 is not hacluster-mysql leader
2018-07-30 16:32:35 DEBUG juju.worker.uniter uniter.go:641 acquire lock "machine-lock" for uniter hook execution
2018-07-30 16:32:35 DEBUG juju.worker.uniter uniter.go:646 lock "machine-lock" acquired

So I have a guess: at the beginning, unit-mysql-0 was the leader and unit-mysql-1 was dead, so it was removed from the election queue; but then unit-mysql-0 started again, so it was added into the election queue again. Maybe this kind of race made unit-mysql-1 believe it was the leader too at that time, and so the problem happened. Of course, I am not familiar with the Juju code, so this is just a guess. Do you think this is a possibility? Thanks.

Revision history for this message
Felipe Reyes (freyes) wrote :

First, some background on how leadership works. The uniter (the Juju agent) has
a worker called leadership-tracker which is responsible for keeping track of
leadership and keeping the lease updated. If the uniter decides to run the
leader-elected hook, there is *no* guarantee that the unit is still the leader
during the execution of the hook. So a charm *must* verify that it is the
leader before doing something that is meant to be done only by the leader, for
example (pseudo-code):

Wrong approach:
```
@hooks.hook('leader-elected')
def leader_elected():
    do_bootstrap()
```

Expected approach:
```
@hooks.hook('leader-elected')
def leader_elected():
    if is_leader():
        do_bootstrap()
```

So in the case of the percona-cluster charm, the leader-elected hook simply
calls leader_set() without wrapping it in an "if is_leader()" check:
https://github.com/openstack/charm-percona-cluster/blob/master/hooks/percona_hooks.py#L807
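
Applying the expected approach to that hook would look roughly like this (a sketch based on the traceback and the charmhelpers API, not necessarily the exact patch that landed):

```
from charmhelpers.core.hookenv import Hooks, is_leader, leader_set
from charmhelpers.contrib.network.ip import get_relation_ip

hooks = Hooks()

@hooks.hook('leader-elected')
def leader_elected():
    # Only the current leader may write leadership settings, so guard the
    # call; non-leader units simply skip it instead of failing the hook.
    if is_leader():
        leader_set(**{'leader-ip': get_relation_ip('cluster')})
```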

Revision history for this message
Felipe Reyes (freyes) wrote :
tags: added: sts
Changed in percona-cluster (Juju Charms Collection):
assignee: nobody → Felipe Reyes (freyes)
status: New → In Progress
Revision history for this message
Felipe Reyes (freyes) wrote :
Changed in percona-cluster (Juju Charms Collection):
assignee: Felipe Reyes (freyes) → nobody
status: In Progress → Invalid
Changed in charm-percona-cluster:
status: New → Fix Released
assignee: nobody → Felipe Reyes (freyes)
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

Marking this as invalid for Juju, as Felipe is right: the hook is queued when the unit is the leader, but it may take some time to execute.

Changed in juju:
status: New → Invalid