Elected leader in K8s should not be a unit in error or terminating state

Bug #1960475 reported by Haw Loeung
Affects: Canonical Juju
Status: Triaged
Importance: High
Assigned to: Unassigned
Milestone: none

Bug Description

Hi,

In a K8s model, the elected leader can sometimes be a unit that's either in error state (with the pod no longer around) or terminating (see the juju status output below, from 2022-02-10). Juju should not allow non-"active" units to be elected as the leader:

|
| Model Controller Cloud/Region Version SLA Timestamp
| prod-charmhub-mattermost prodstack-is-2 k8s-is-external/default 2.9.18 unsupported 03:13:27Z
|
| SAAS Status Store URL
| postgresql active prodstack-is-2 admin/prod-charmhub-mattermost-db.postgresql
|
| App Version Status Scale Charm Store Channel Rev OS Address Message
| mattermost mattermost:5.39.0-canonical active 8/5 mattermost local stable 4 kubernetes 10.85.0.39
|
| Unit Workload Agent Address Ports Message
| mattermost/35 error idle 10.86.56.182 8065/TCP hook failed: "db-relation-changed"
| mattermost/36* terminated failed 10.86.77.171 8065/TCP unit stopped by the cloud
| mattermost/37 terminated failed 10.86.56.205 8065/TCP unit stopped by the cloud
| mattermost/39 error idle 10.86.77.232 8065/TCP hook failed: "db-relation-created"
| mattermost/40 error idle 10.86.56.34 8065/TCP hook failed: "leader-settings-changed"
| mattermost/49 active idle 10.86.56.84 8065/TCP
| mattermost/50 active idle 10.86.56.85 8065/TCP
| mattermost/51 active idle 10.86.77.10 8065/TCP
| mattermost/52 active idle 10.86.77.11 8065/TCP
| mattermost/53 active idle 10.86.56.86 8065/TCP

| Model Controller Cloud/Region Version SLA Timestamp
| prod-charmhub-mattermost prodstack-is-2 k8s-is-external/default 2.9.18 unsupported 03:13:02Z
|
| SAAS Status Store URL
| postgresql active prodstack-is-2 admin/prod-charmhub-mattermost-db.postgresql
|
| App Version Status Scale Charm Store Channel Rev OS Address Message
| mattermost mattermost:5.39.0-canonical active 8/5 mattermost local stable 4 kubernetes 10.85.0.39
|
| Unit Workload Agent Address Ports Message
| mattermost/35* error idle 10.86.56.182 8065/TCP hook failed: "db-relation-changed"
| mattermost/36 terminated failed 10.86.77.171 8065/TCP unit stopped by the cloud
| mattermost/37 terminated failed 10.86.56.205 8065/TCP unit stopped by the cloud
| mattermost/39 error idle 10.86.77.232 8065/TCP hook failed: "db-relation-created"
| mattermost/40 error idle 10.86.56.34 8065/TCP hook failed: "leader-settings-changed"
| mattermost/49 active idle 10.86.56.84 8065/TCP
| mattermost/50 active idle 10.86.56.85 8065/TCP
| mattermost/51 active idle 10.86.77.10 8065/TCP
| mattermost/52 active idle 10.86.77.11 8065/TCP
| mattermost/53 active idle 10.86.56.86 8065/TCP

Had to use juju_revoke_lease multiple times before an "active" unit was eventually elected and things became functional again.
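
For reference, a quick way to double-check which unit currently holds leadership in a 2.9 model, beyond the "*" marker in juju status (the unit name below is just an example):

```
# The current leader is marked with '*' in the Unit column.
juju status mattermost

# Ask a specific unit whether it believes it is the leader.
juju run --unit mattermost/49 "is-leader"
```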

Ian Booth (wallyworld)
Changed in juju:
milestone: none → 2.9.26
status: New → Triaged
importance: Undecided → High
Changed in juju:
milestone: 2.9.26 → 2.9.27
Changed in juju:
milestone: 2.9.27 → 2.9.28
Changed in juju:
milestone: 2.9.28 → 2.9.29
Changed in juju:
milestone: 2.9.29 → 2.9.30
Changed in juju:
status: Triaged → In Progress
assignee: nobody → Yang Kelvin Liu (kelvin.liu)
Yang Kelvin Liu (kelvin.liu) wrote:

This usually happens when the application has many units and the operator is just too busy: the uniters running on the operator cannot handle those events and state changes quickly enough (or can even be stuck temporarily). The new units should stabilize and the dead units should be removed after some time. Because Juju doesn't define resource limits/requests for the operator pod, the pod will ideally consume as many resources as it can get from the k8s worker node, so we don't know the exact maximum number of units that an operator can manage.
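
For reference, a hedged way to confirm that the operator pod has no resource requests/limits set and to see what it is actually consuming. The namespace and pod name below are assumptions based on Juju's usual per-model namespace and <application>-operator-0 naming, and kubectl top requires metrics-server:

```
# Show the resources stanza for the operator pod's containers
# (an empty map means no requests/limits are defined).
kubectl -n prod-charmhub-mattermost get pod mattermost-operator-0 \
  -o jsonpath='{.spec.containers[*].resources}'

# Current CPU/memory usage of the operator pod (requires metrics-server).
kubectl -n prod-charmhub-mattermost top pod mattermost-operator-0
```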

Changed in juju:
status: In Progress → Invalid
Haw Loeung (hloeung) wrote:

I don't believe that to be true. Those additional units, IIRC, were spun up to try to tickle leader elections and get things to "work". Isn't there a way for the current leader to detect its status and, if it's either "error" or "terminated", give up its leadership?

Also, even on calling juju_revoke_lease, a pod should not elect itself as leader if its status is either "error" or "terminated". I think this is a bug and should be better guarded against.

Changed in juju:
status: Invalid → New
Yang Kelvin Liu (kelvin.liu) wrote:

Today I was only able to reproduce it by scaling Mattermost to 60; all the uniters were then stuck for a while, and during that period leadership could not be transferred to new units because the operator was too busy.
Do you have a way to reproduce the case you mentioned above?

Haw Loeung (hloeung) wrote:

Unfortunately, I do not have a good way to reproduce the case as originally reported above, sorry. We don't have many services in K8s yet.

Perhaps this could reproduce the "terminating" issue, though it's untested (a rough command sketch follows after the list):

* Delete a pod using kubectl.

* Then try a new deployment, triggering the spawning of new pods.

* Then, if the pod stuck in "terminating" is still around, try "juju_revoke_lease" to see if it gets elected as the leader.

For pods in "error", we might need a creative way to make pods go into an error state, either intentionally via charm or Docker image bugs, for testing.
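
A rough, untested sketch of the steps above against a throwaway model; the model name is a placeholder, the application and label follow the test deployment in the next comment, and the exact juju_revoke_lease invocation (run via the controller's introspection tooling) is deliberately left as a comment:

```
# Delete one application pod so Kubernetes marks it Terminating and respawns it.
kubectl -n my-test-model delete pod -l app.kubernetes.io/name=mattermost-k8s

# Scale the application up so new pods are spawned while the old one
# may still be stuck Terminating.
juju scale-application mattermost-k8s 3

# Check which unit holds leadership while the Terminating pod is still around.
juju status mattermost-k8s

# Then run juju_revoke_lease on the controller for this model/application and
# check whether a Terminating unit gets elected as leader.
```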

Yang Kelvin Liu (kelvin.liu) wrote:

I tried the steps below, and all the pods got terminated and leadership was moved to the new pods correctly.

```
juju deploy postgresql-k8s
juju deploy mattermost-k8s
juju relate mattermost-k8s postgresql-k8s:db
juju scale-application mattermost-k8s 6

# mkubectl: presumably a local kubectl alias for the test cluster; -n t1 targets the model namespace.
# Delete the pods several times in a row to keep churning them.
mkubectl -n t1 delete pods -l app.kubernetes.io/name=mattermost-k8s
mkubectl -n t1 delete pods -l app.kubernetes.io/name=mattermost-k8s
mkubectl -n t1 delete pods -l app.kubernetes.io/name=mattermost-k8s
```

John A Meinel (jameinel)
Changed in juju:
milestone: 2.9.30 → 2.9-next
status: New → Incomplete
status: Incomplete → Triaged
Changed in juju:
assignee: Yang Kelvin Liu (kelvin.liu) → nobody
Haw Loeung (hloeung) wrote:

This is still happening, with Juju 2.9.32:

| Unit Workload Agent Address Ports Message
| mattermost/79* error idle 10.86.56.232 8065/TCP hook failed: "db-relation-broken"
| mattermost/80 active idle 10.86.77.99 8065/TCP
| ...

See https://pastebin.canonical.com/p/yJJQNbq8PF/

Maybe a way to reproduce it is, per the above, to cause some relation breakage (see the sketch below)? Mind you, this environment is using cross-model relations.
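
A hedged sketch of what such relation breakage could look like for reproduction; the application names follow the earlier test deployment rather than this production cross-model setup:

```
# Break and re-add the db relation to force *-relation-broken / *-relation-created
# hooks, ideally while a pod is being replaced.
juju remove-relation mattermost-k8s postgresql-k8s:db
juju relate mattermost-k8s postgresql-k8s:db
```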

Haw Loeung (hloeung) wrote (last edit):

Adding "waiting" too - https://pastebin.canonical.com/p/tXzqd4vF2S/:

| mattermost/88* waiting executing 10.86.77.102 8065/TCP Waiting for database relation

Ian Booth (wallyworld) wrote:

We'd need logs to see why the hook failed, etc.
What's causing the pods to churn? Are there config changes being applied?

Ideally, the charm would be migrated to a sidecar charm, where a StatefulSet is used and the unit numbers are stable.

Ian Booth (wallyworld)
Changed in juju:
milestone: 2.9-next → none