Race to mount seed device
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
livecd-rootfs (Ubuntu) | Fix Released | High | Unassigned |
Bionic | Fix Released | Critical | Unassigned |
Bug Description
[Impact]
We've just come across a situation where cloud-init races systemd to mount a partition, with the result that either cloud-init fails to load its seed, or a relatively important partition fails to mount.
This can lead to situations where the device does not boot.
[Test Case]
Boot the image built using the bionic-proposed livecd-rootfs multiple times and make sure it is booting up correctly.
TODO: find a better test case?
[Regression Potential]
The change seems relatively safe and there are no obvious regressions it could cause, but one should be on the lookout for any unexpected cloud-init behavior or boot-time issues.
[Original Description]
We've just come across a situation where cloud-init races systemd to mount a partition, with the result that either cloud-init fails to load its seed, or a relatively important partition fails to mount. I'm not entirely sure this is a bug - it could well be argued this is mis-configuration on our part, and there's a trivial workaround, but I'll leave the cloud-init devs to make that call.
The configuration is as follows:
1. On recent Ubuntu classic images for the Raspberry Pi, cloud-init's seed is located on the boot partition (/dev/mmcblk0p1, labelled "system-boot"). This is desirable as it's a straightforward FAT partition, and thus easily accessible and editable on any OS (as opposed to the original seed location on the root ext4 partition).
2. Cloud-init is configured to look for its seed with "fs_label: system-boot"
3. Systemd is configured (via /etc/fstab) to mount /dev/disk/
The result on boot-up is one of three situations:
1. cloud-init mounts /dev/mmcblk0p1 on a temp path. While cloud-init is reading its seed, systemd attempts to mount /dev/mmcblk0p1 on /boot/firmware but fails because the device is already mounted elsewhere. cloud-init succeeds in reading its seed, unmounts /dev/mmcblk0p1 and the system boots successfully, but /boot/firmware isn't mounted (leading to issues further down the line with other things like flash-kernel that expect it to be mounted).
2. cloud-init checks for /dev/mmcblk0p1 in /proc/mounts, and doesn't see it there. It goes to mount /dev/mmcblk0p1 on a temp path, but before it can do so, systemd mounts it on /boot/firmware. cloud-init's mount fails (device already mounted), so it fails to read its seed and uses a default config instead (which isn't entirely desirable for us as the default involves a long wait if ethernet is not connected; our default seed overrides that).
3. Everything works fine. This occurs when either a) systemd mounts /boot/firmware first, then cloud-init sees this mount in /proc/mounts and uses it or b) cloud-init mounts /dev/mmcblk0p1 on a temp path, reads its seed and unmounts it, *then* systemd mounts /boot/firmware.
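To make the check in these outcomes concrete, here is a minimal sketch of looking a device up in /proc/mounts-style text. It is not cloud-init's actual code: `parse_mounts` and the sample lines (including their mount options) are hypothetical and only mirror the shape of the dictionary visible in the log below.

```python
def parse_mounts(text):
    """Map each device in /proc/mounts-style text to its mount details.

    Hypothetical helper for illustration; not cloud-init's own function.
    """
    mounted = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype, opts = fields[:4]
        mounted[device] = {
            "fstype": fstype,
            "mountpoint": mountpoint,
            "opts": opts,
        }
    return mounted


# Illustrative snapshot taken before systemd has mounted the boot partition.
SAMPLE = (
    "sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0\n"
    "/dev/mmcblk0p2 / ext4 rw,relatime 0 0\n"
)

before = parse_mounts(SAMPLE)
# The same table a moment later, once systemd has mounted /boot/firmware.
after = parse_mounts(
    SAMPLE + "/dev/mmcblk0p1 /boot/firmware vfat rw,relatime 0 0\n"
)

print("/dev/mmcblk0p1" in before)
print(after["/dev/mmcblk0p1"]["mountpoint"])
```

Any decision made from the `before` snapshot can be stale by the time the mount command runs, which is the window the second outcome falls into.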
Here's the relevant snippet of traceback from when the second result occurs:
...
2020-01-14 20:48:01,337 - util.py[DEBUG]: Read 386 bytes from /etc/os-release
2020-01-14 20:48:01,338 - util.py[DEBUG]: Running command ['blkid', '-odevice', '/dev/sr0'] with allowed return codes [0, 2] (shell=False, capture=True)
2020-01-14 20:48:01,348 - util.py[DEBUG]: Running command ['blkid', '-odevice', '/dev/sr1'] with allowed return codes [0, 2] (shell=False, capture=True)
2020-01-14 20:48:01,357 - util.py[DEBUG]: Running command ['blkid', '-tTYPE=vfat', '-odevice'] with allowed return codes [0, 2] (shell=False, capture=True)
2020-01-14 20:48:01,415 - util.py[DEBUG]: Running command ['blkid', '-tTYPE=iso9660', '-odevice'] with allowed return codes [0, 2] (shell=False, capture=True)
2020-01-14 20:48:01,432 - util.py[DEBUG]: Running command ['blkid', '-tLABEL=
2020-01-14 20:48:01,448 - util.py[DEBUG]: Running command ['blkid', '-tLABEL=
2020-01-14 20:48:01,466 - DataSourceNoClo
2020-01-14 20:48:01,467 - util.py[DEBUG]: Reading from /proc/mounts (quiet=False)
2020-01-14 20:48:01,467 - util.py[DEBUG]: Read 2062 bytes from /proc/mounts
2020-01-14 20:48:01,468 - util.py[DEBUG]: Fetched {'sysfs': {'fstype': 'sysfs', 'mountpoint': '/sys', 'opts': 'rw,nosuid,
2020-01-14 20:48:01,468 - util.py[DEBUG]: Running command ['mount', '-o', 'ro', '-t', 'auto', '/dev/mmcblk0p1', '/run/cloud-
2020-01-14 20:48:01,529 - util.py[DEBUG]: Failed mount of '/dev/mmcblk0p1' as 'auto': Unexpected error while running command.
Command: ['mount', '-o', 'ro', '-t', 'auto', '/dev/mmcblk0p1', '/run/cloud-
Exit code: 32
Reason: -
Stdout:
Stderr: mount: /run/cloud-
2020-01-14 20:48:01,530 - util.py[WARNING]: Failed to mount /dev/mmcblk0p1 when looking for data
2020-01-14 20:48:01,533 - util.py[DEBUG]: Failed to mount /dev/mmcblk0p1 when looking for data
Traceback (most recent call last):
File "/usr/lib/
pp2d_kwargs)
File "/usr/lib/
(device, tmpd, failure_reason))
cloudinit.
Command: ['mount', '-o', 'ro', '-t', 'auto', '/dev/mmcblk0p1', '/run/cloud-
Exit code: 32
Reason: -
Stdout:
Stderr: mount: /run/cloud-
2020-01-14 20:48:01,549 - __init__.py[DEBUG]: Datasource DataSourceNoCloud [seed=None]
2020-01-14 20:48:01,549 - handlers.py[DEBUG]: finish: init-local/
2020-01-14 20:48:01,549 - main.py[DEBUG]: No local datasource found
...
The relevant code is mount_cb() in util.py, which calls mounts() to read /proc/mounts and then attempts to mount the device if it is not found there (leading to a classic race between the check and the action).
It might be argued that it should simply attempt the mount, and check /proc/mounts in the event of failure, but I'm wary that mounting filesystems is a potentially costly exercise (can the seed be on a remote mount?), and there are probably edge cases here I'm unaware of (is it possible, perhaps dangerous, to double-mount certain devices?).
Perhaps it could be adjusted to check /proc/mounts, attempt the mount, then *re-check* /proc/mounts in the case of failure?
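The check, mount, re-check idea could be sketched roughly as follows. This is only an illustration under stated assumptions, not a patch to mount_cb(): `mount_or_reuse`, `list_mounts` and `try_mount` are hypothetical names, and the mount operations are passed in as callables so the race can be simulated without root access.

```python
def mount_or_reuse(device, list_mounts, try_mount):
    """Return a path where `device` is mounted.

    Hypothetical sketch. Assumed contracts:
      list_mounts() -> dict mapping device to its current mountpoint
      try_mount(device) -> temp mountpoint on success, raises OSError on failure
    """
    mounts = list_mounts()
    if device in mounts:
        return mounts[device]  # already mounted: reuse the existing mount
    try:
        return try_mount(device)
    except OSError:
        # The mount may have failed because something else (e.g. systemd)
        # mounted the device after our first check, so look again.
        mounts = list_mounts()
        if device in mounts:
            return mounts[device]
        raise  # genuinely not mountable: surface the original error


# Simulate the second outcome: systemd wins between the check and the mount.
state = {}


def fake_list_mounts():
    return dict(state)


def racing_mount(device):
    state[device] = "/boot/firmware"  # systemd's mount lands first
    raise OSError("device already mounted")


result = mount_or_reuse("/dev/mmcblk0p1", fake_list_mounts, racing_mount)
print(result)
```

Note that this only covers the case where cloud-init loses the race. It does nothing for the first outcome, where cloud-init's temporary mount causes systemd's mount of /boot/firmware to fail, and a real implementation would also need to track whether it performed the mount itself so that it only unmounts what it mounted.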
Nevertheless, there's a trivial workaround in our case: just add an override for "RequiresMountF
tags: | added: id-5db0502bcb3f2112497e1b92 |
Changed in cloud-init (Ubuntu): | |
status: | New → Triaged |
tags: |
added: verification-done verification-done-bionic removed: verification-needed verification-needed-bionic |
no longer affects: | cloud-init (Ubuntu Bionic) |
no longer affects: | cloud-init (Ubuntu) |
So the mentioned workaround seems fair; let's proceed with it if possible. What I find mysterious is that we have not seen this in eoan or focal so far. Since this is an obvious race condition, I'd expect it to pop up there now and then.