sysctl max inotify watch number is not updated on the host when k8s master is deployed in an LXD container

Bug #1967154 reported by Camille Rodriguez
This bug affects 2 people
Affects                         Status      Importance  Assigned to    Milestone
Canonical Juju                  Incomplete  Undecided   Harry Pidcock
Kubernetes Control Plane Charm  Triaged     Medium      Unassigned
Kubernetes Worker Charm         Triaged     Medium      Unassigned

Bug Description

The kubernetes-master charm has a sysctl config option which allows configuration of several kernel parameters, including fs.inotify.max_user_watches. This parameter is critical in Kubernetes: without a high enough limit, the host cannot launch new containers. The recommended value for production systems is 1048576, according to https://linuxcontainers.org/lxd/docs/master/production-setup/.

The default template for deploying Kubernetes places k8s-master in LXD containers. When deployed this way, the value is not updated on the host, which eventually leaves the host unable to spin up new containers. The fact that the config option exists on k8s-master leads the user to believe the value is set on the host, when in reality it is not.
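In the meantime the limit can be checked and raised manually on the host itself. A minimal sketch of the manual workaround, assuming root access on the host (the file name 99-inotify.conf is arbitrary):

$ sysctl fs.inotify.max_user_watches                   # current limit on the host
$ echo 'fs.inotify.max_user_watches = 1048576' | sudo tee /etc/sysctl.d/99-inotify.conf
$ sudo sysctl --system                                 # re-apply all sysctl config files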

Below is an example of what is seen on a system where the limit has not been increased: the new LXD containers do not get an IP address.

ubuntu@k8s-control-03:~$ sudo lxc list
+---------------------+---------+-------------------------------+------+-----------+-----------+
| NAME                | STATE   | IPV4                          | IPV6 | TYPE      | SNAPSHOTS |
+---------------------+---------+-------------------------------+------+-----------+-----------+
| juju-2addd2-0-lxd-0 | RUNNING | 192.168.20.33 (eth0)          |      | CONTAINER | 0         |
+---------------------+---------+-------------------------------+------+-----------+-----------+
| juju-2addd2-0-lxd-1 | RUNNING | 192.168.20.124 (eth1)         |      | CONTAINER | 0         |
|                     |         | 138.26.125.136 (eth0)         |      |           |           |
+---------------------+---------+-------------------------------+------+-----------+-----------+
| juju-2addd2-0-lxd-2 | RUNNING | 192.168.20.66 (eth1)          |      | CONTAINER | 0         |
|                     |         | 192.168.20.209 (eth1)         |      |           |           |
|                     |         | 138.26.125.244 (eth0)         |      |           |           |
|                     |         | 138.26.125.135 (eth0)         |      |           |           |
+---------------------+---------+-------------------------------+------+-----------+-----------+
| juju-2addd2-0-lxd-3 | RUNNING | 192.168.20.187 (eth1)         |      | CONTAINER | 0         |
|                     |         | 138.26.125.137 (eth0)         |      |           |           |
|                     |         | 10.128.196.192 (vxlan.calico) |      |           |           |
+---------------------+---------+-------------------------------+------+-----------+-----------+
| juju-2addd2-0-lxd-4 | RUNNING | 192.168.20.150 (eth0)         |      | CONTAINER | 0         |
+---------------------+---------+-------------------------------+------+-----------+-----------+
| juju-2addd2-0-lxd-5 | RUNNING |                               |      | CONTAINER | 0         |
+---------------------+---------+-------------------------------+------+-----------+-----------+
| juju-2addd2-0-lxd-6 | RUNNING |                               |      | CONTAINER | 0         |
+---------------------+---------+-------------------------------+------+-----------+-----------+

$ tail /var/log/syslog
Mar 29 15:40:41 k8s-control-03 systemd[1]: user-1000.slice: Failed to add control inotify watch descriptor for control group /user.slice/user-1000.slice: No space left on device
Mar 29 15:40:41 k8s-control-03 systemd[1]: Created slice User Slice of UID 1000.
Mar 29 15:40:41 k8s-control-03 systemd[1]: user-runtime-dir@1000.service: Failed to add control inotify watch descriptor for control group /user.slice/user-1000.slice/user-runtime-dir@1000.service: No space left on device
Mar 29 15:40:41 k8s-control-03 systemd[1]: Starting User Runtime Directory /run/user/1000...
Mar 29 15:40:41 k8s-control-03 systemd[1]: Finished User Runtime Directory /run/user/1000.
Mar 29 15:40:41 k8s-control-03 systemd[1]: user@1000.service: Failed to add control inotify watch descriptor for control group /user.slice/user-1000.slice/user@1000.service: No space left on device
Mar 29 15:40:41 k8s-control-03 systemd[1]: Starting User Manager for UID 1000...
Mar 29 15:40:41 k8s-control-03 systemd[382967]: Reached target Paths.
Mar 29 15:40:41 k8s-control-03 systemd[382967]: Reached target Timers.
Mar 29 15:40:41 k8s-control-03 systemd[382967]: Starting D-Bus User Message Bus Socket.
Mar 29 15:40:41 k8s-control-03 systemd[382967]: Listening on GnuPG network certificate management daemon.
Mar 29 15:40:41 k8s-control-03 systemd[382967]: Listening on GnuPG cryptographic agent and passphrase cache (access for web browsers).
Mar 29 15:40:41 k8s-control-03 systemd[382967]: Listening on GnuPG cryptographic agent and passphrase cache (restricted).
Mar 29 15:40:41 k8s-control-03 systemd[382967]: Listening on GnuPG cryptographic agent (ssh-agent emulation).
Mar 29 15:40:41 k8s-control-03 systemd[382967]: Listening on GnuPG cryptographic agent and passphrase cache.
Mar 29 15:40:41 k8s-control-03 systemd[382967]: Listening on debconf communication socket.
Mar 29 15:40:41 k8s-control-03 systemd[382967]: Listening on REST API socket for snapd user session agent.
Mar 29 15:40:41 k8s-control-03 systemd[382967]: Listening on D-Bus User Message Bus Socket.
Mar 29 15:40:41 k8s-control-03 systemd[382967]: Reached target Sockets.
Mar 29 15:40:41 k8s-control-03 systemd[382967]: Reached target Basic System.
Mar 29 15:40:41 k8s-control-03 systemd[382967]: Reached target Main User Target.
Mar 29 15:40:41 k8s-control-03 systemd[382967]: Startup finished in 88ms.
Mar 29 15:40:41 k8s-control-03 systemd[1]: Started User Manager for UID 1000.
Mar 29 15:40:41 k8s-control-03 systemd[1]: session-795.scope: Failed to add control inotify watch descriptor for control group /user.slice/user-1000.slice/session-795.scope: No space left on device
Mar 29 15:40:41 k8s-control-03 systemd[1]: Started Session 795 of user ubuntu.
Mar 29 15:41:34 k8s-control-03 systemd[1]: motd-news.service: Failed to add control inotify watch descriptor for control group /system.slice/motd-news.service: No space left on device
Mar 29 15:41:34 k8s-control-03 systemd[1]: Starting Message of the Day...
Mar 29 15:41:35 k8s-control-03 50-motd-news[384858]: * Super-optimized for small spaces - read how we shrank the memory
Mar 29 15:41:35 k8s-control-03 50-motd-news[384858]: footprint of MicroK8s to make it the smallest full K8s around.
Mar 29 15:41:35 k8s-control-03 50-motd-news[384858]: https://ubuntu.com/blog/microk8s-memory-optimisation
Mar 29 15:41:35 k8s-control-03 systemd[1]: motd-news.service: Succeeded.
Mar 29 15:41:35 k8s-control-03 systemd[1]: Finished Message of the Day.
Mar 29 15:41:38 k8s-control-03 systemd[1]: snap.lxd.lxc.c803c336-fa40-4565-a960-f58d94735603.scope: Failed to add control inotify watch descriptor for control group /system.slice/snap.lxd.lxc.c803c336-fa40-4565-a960-f58d94735603.scope: No space left on device
Mar 29 15:41:38 k8s-control-03 systemd[1]: Started snap.lxd.lxc.c803c336-fa40-4565-a960-f58d94735603.scope.
Mar 29 15:41:38 k8s-control-03 systemd[382967]: run-snapd-ns-lxd.mnt.mount: Succeeded.
Mar 29 15:41:38 k8s-control-03 systemd[1]: run-snapd-ns-lxd.mnt.mount: Succeeded.
Mar 29 15:41:38 k8s-control-03 systemd[382967]: tmp-snap.rootfs_MckOjj.mount: Succeeded.
Mar 29 15:41:38 k8s-control-03 systemd[1]: tmp-snap.rootfs_MckOjj.mount: Succeeded.

Revision history for this message
George Kraft (cynerva) wrote:

The kubernetes-master charm, when deployed to LXD containers, does not have write access to kernel parameters. I recommend we remove the charm's sysctl config option to prevent confusion on this front. The same applies to kubernetes-worker.
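For illustration, this is why the option cannot work from inside the container. The exact behavior varies by kernel: the write is either rejected outright, or, on kernels that namespace the inotify limits, it succeeds but only changes the container's own limit, never the host's:

$ sudo sysctl -w fs.inotify.max_user_watches=1048576
# Run inside an unprivileged LXD container: either rejected as read-only,
# or applied only to the container's user namespace; the host value on
# which new container launches depend is unchanged either way.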

Added Juju as an affected project. If the LXD project is recommending fs.inotify.max_user_watches=1048576 for production environments, perhaps Juju should set it?

Changed in charm-kubernetes-master:
importance: Undecided → Medium
Changed in charm-kubernetes-worker:
importance: Undecided → Medium
Changed in charm-kubernetes-master:
status: New → Triaged
Changed in charm-kubernetes-worker:
status: New → Triaged
Revision history for this message
Juan M. Tirado (tiradojm) wrote:

Any comments @wallyworld?

Changed in juju:
assignee: nobody → Harry Pidcock (hpidcock)
Revision history for this message
John A Meinel (jameinel) wrote:

Fundamentally, this isn't something that Juju or the charm can do from within a container. If a charm is being put into a container, it is unable to set kernel parameters. Either you need to use a VM, or provision it on the host machine.

The other option would be to have a different charm that sets the kernel parameters, and have someone deploy that to the host machine.

There is some possibility with the lxd-profile.yaml feature of charms (it supports charms that need to have a custom kernel module loaded, etc), but I don't think it supports arbitrary kernel parameters.

Revision history for this message
John A Meinel (jameinel) wrote:

I'm tempted to put this into Won't Fix, but for now I want to understand what you think you might be able to accomplish, and what the priority is for it. It's plausible that we could expand lxd-profile.yaml if we really need additional kernel tuning for a good number of use cases.

I don't particularly like the workaround of "deploy this to the host machine to make deploying this to a container work". It falls into the "the system doesn't just work, you have to understand the workarounds to make it work" category that I really don't like.

Changed in juju:
status: New → Incomplete
Revision history for this message
George Kraft (cynerva) wrote:

Thanks for the feedback. Assuming we were to proceed with no changes in Juju, here's what I think we would need to do:

1. For Juju controllers deployed to localhost, we'll need to update our documentation to include setting required kernel parameters prior to deployment.

2. For Juju controllers with MAAS/vSphere/etc clouds, we'll need to update any bundles that deploy units to e.g. `machine: lxd:0` to also include an Ubuntu charm deployed to `machine: 0` with a sysconfig subordinate[1] that sets the kernel parameters properly (a minimal sketch follows). We will also need some hefty documentation around this to ensure that people building their own bundles know it is required.
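A minimal sketch of that pattern with the Juju CLI, assuming the sysconfig charm exposes a `sysctl` option that accepts a YAML map as its Charmhub page describes (option name and value format are taken from that page, not re-verified here):

$ juju deploy ubuntu --to 0                            # principal charm on the LXD host
$ juju deploy sysconfig                                # subordinate that applies sysctls
$ juju config sysconfig sysctl="{fs.inotify.max_user_watches: 1048576}"
$ juju add-relation sysconfig ubuntu                   # attach the subordinate to machine 0

The subordinate then sets the parameters on machine 0, the host of the lxd:0 units.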

The kernel parameters that we're discussing in this issue aren't specific to Kubernetes, but rather seem to be those that the LXD documentation recommends for production use[2]. In my mind, it would make sense for Juju, when initializing LXD on a host VM, to also set the kernel parameters that the LXD project recommends. This is what I had in mind when adding Juju to this issue.
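For reference, the host-side configuration would look something like the file below; the keys and values are taken from the LXD production-setup guidance, so treat this list as indicative rather than exhaustive (see [2] for the authoritative set):

# /etc/sysctl.d/99-lxd-production.conf (indicative values from [2])
fs.inotify.max_queued_events = 1048576
fs.inotify.max_user_instances = 1048576
fs.inotify.max_user_watches = 1048576
vm.max_map_count = 262144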

The Kubernetes part of this issue is that the kubernetes-control-plane and kubernetes-worker charms provide a `sysctl` config option that simply does not work in LXD containers, and should be removed to avoid confusion. It's worth noting that Kubernetes does come with its own kernel parameter requirements[3], so we will probably need to tweak our documentation and bundles to handle those anyway.

If there's anything that Juju can do to either:

1. Set recommended kernel parameters for LXD when initializing LXD on host machines, or
2. Allow units in LXD to "bubble up" kernel parameter needs to the host machine,

it would certainly reduce the weight of our documentation and the need for sysconfig as a workaround, and would be much appreciated.

[1]: https://charmhub.io/sysconfig
[2]: https://linuxcontainers.org/lxd/docs/master/reference/server_settings/
[3]: https://github.com/charmed-kubernetes/layer-kubernetes-node-base/blob/38fdcfce8fc89f397c1d8212065e12cdfae6b251/config.yaml#L4
