stale relation data for sdn-ip affects kubelet clusterDNS

Bug #2022151 reported by Kevin W Monroe
This bug affects 1 person
Affects: Kubernetes Control Plane Charm
Status: Triaged
Importance: High
Assigned to: Unassigned
Milestone: 1.29

Bug Description

We observed a problem where a kubernetes-control-plane (k-c-p) leadership change can leave stale DNS info on the kube-control relation, resulting in invalid config on kubernetes-worker units.

Consider the scenario where k-c-p/0 is the leader and discovers the cluster DNS service IP to be x.y.z.119. It transmits this value over the kube-control relation; the kubernetes-worker units eventually consume it and write it to /root/cdk/kubelet/config.yaml as:

...
clusterDNS:
- x.y.z.119
...

This data originates from the send_cluster_dns_detail handler:

https://github.com/charmed-kubernetes/charm-kubernetes-control-plane/blob/main/reactive/kubernetes_control_plane.py#L1399
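
For context, a rough sketch of the handler's shape (hedged; the real code is at the link above, and the helper names, the kube-control flag name, and the set_dns() call below are placeholders/approximations rather than verified signatures):

from charms.reactive import when

# Placeholder helpers standing in for however the charm actually discovers
# the cluster DNS service; the values match the example above.
def get_dns_ip():
    return 'x.y.z.119'

def get_dns_domain():
    return 'cluster.local'

# The 'cdk-addons.configured' gate is only ever set on the current leader.
@when('cdk-addons.configured', 'kube-control.connected')
def send_cluster_dns_detail(kube_control):
    # Publish the DNS details (including the 'sdn-ip' key) over the
    # kube-control relation for kubernetes-worker units to consume.
    kube_control.set_dns(53, get_dns_domain(), get_dns_ip(), True)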

That handler is gated by the `cdk-addons.configured` flag, which is only set on the current leader, so the leader's own relation data is always valid. However, the consuming side of the relation sees data from all control-plane units as a combined view, and that view prefers the lowest relation ID and lowest unit name as the source of truth for each key:

https://github.com/juju-solutions/charms.reactive/blob/master/charms/reactive/endpoints.py#L783-L784

If leadership changes to k-c-p/1 and the DNS service IP changes, kubernetes-worker units will see both the previous and the current IP on the relation and prefer the old leader's value for sdn-ip (k-c-p/0 sorts before k-c-p/1). This misconfigures the kubelet service on the workers.
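
To make the preference concrete, here is a minimal, standalone illustration (plain Python, not charm or charms.reactive code) of a combined view that keeps the first value it sees in (relation id, unit name) order; the new leader's IP x.y.z.201 is a made-up stand-in:

# Both units have published 'sdn-ip': the old leader's stale value and the
# new leader's current one.
unit_data = [
    ('kube-control:0', 'kubernetes-control-plane/0', {'sdn-ip': 'x.y.z.119'}),  # old leader, stale
    ('kube-control:0', 'kubernetes-control-plane/1', {'sdn-ip': 'x.y.z.201'}),  # new leader, current
]

received = {}
for _relid, _unit, data in sorted(unit_data):
    for key, value in data.items():
        received.setdefault(key, value)  # first (lowest relid/unit name) wins

print(received['sdn-ip'])  # x.y.z.119 -- the stale IP lands in the kubelet config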

There are a few ways to fix this:
- adjust charms.reactive to detect when a leader is sending data over a relation and prefer that
- clear relation data keys from k-c-p units on leadership change
- fire send_cluster_dns_detail for all k-c-p units regardless of leadership

Option 3 feels the safest to implement, since it keeps the DNS info consistent across all k-c-p units regardless of leadership.
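
A hedged sketch of what option 3 could look like, reusing the placeholder helpers from the sketch above (again, flag names other than `cdk-addons.configured` and the set_dns() call are assumptions, not the charm's verified API):

from charms.reactive import when

# Drop the leader-only 'cdk-addons.configured' gate so every control-plane
# unit publishes the current DNS detail. A per-unit readiness check may
# still be needed so a unit only publishes once it can discover the IP.
@when('kube-control.connected')
def send_cluster_dns_detail(kube_control):
    # Leaders and followers now publish identical, current values, so the
    # workers' combined view can no longer mix a former leader's stale
    # sdn-ip with the current leader's fresh one.
    kube_control.set_dns(53, get_dns_domain(), get_dns_ip(), True)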

Changed in charm-kubernetes-master:
status: New → Triaged
importance: Undecided → High
Changed in charm-kubernetes-master:
milestone: none → 1.28
Adam Dyess (addyess) wrote:

It seems likely that this bug will be fixed by the rewrite in the ops framework.

Changed in charm-kubernetes-master:
milestone: 1.28 → 1.28+ck1
Adam Dyess (addyess)
Changed in charm-kubernetes-master:
milestone: 1.28+ck1 → 1.29