Mantic minimized/minimal cloud images do not receive IP address during provisioning; systemd regression with wait-online

Bug #2036968 reported by Philip Roche
This bug affects 2 people
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| cloud-images | New | Undecided | Unassigned | |
| linux (Ubuntu) | Fix Released | Undecided | Unassigned | |
| systemd (Ubuntu) | Won't Fix | Medium | Nick Rosbrook | |

Bug Description

Following a recent change from the linux-kvm kernel to the linux-generic kernel in the mantic minimized images, there is a reproducible bug where a guest VM does not have an IP address assigned as part of cloud-init provisioning.

This is easiest to reproduce when emulating arm64 on an amd64 host. The bug is a race condition, so sufficiently fast virtualisation on sufficiently fast hardware might not show it, but I have been able to reproduce it in all my testing.

The latest mantic minimized images from http://cloud-images.ubuntu.com/minimal/daily/mantic/ force initrdless boot and have no initrd to fall back to.

This bug is not present in the non-minimized/base images at http://cloud-images.ubuntu.com/mantic/ because these boot with an initrd that contains the required drivers for virtio-net.

Reproducer

```
wget -O "launch-qcow2-image-qemu-arm64.sh" https://people.canonical.com/~philroche/20230921-cloud-images-mantic-fail-to-provision/launch-qcow2-image-qemu-arm64.sh

chmod +x ./launch-qcow2-image-qemu-arm64.sh
wget https://people.canonical.com/~philroche/20230921-cloud-images-mantic-fail-to-provision/livecd.ubuntu-cpc.img
./launch-qcow2-image-qemu-arm64.sh --password passw0rd --image ./livecd.ubuntu-cpc.img
```

You will then be able to log in with user `ubuntu` and password `passw0rd`.

You can run `ip a` and see that there is a network interface present (separate to `lo`) but no IP address has been assigned.

```
ubuntu@cloudimg:~$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp0s1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff

```

This is because cloud-init finds no network interfaces when it tries to configure them, so it configures none. By the time boot completes, the network interface is present, but cloud-init provisioning has already finished.

You can verify this by running `sudo cloud-init clean && sudo cloud-init init`.

You can then see a successfully configured network interface:

```
ubuntu@cloudimg:~$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp0s1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 metric 100 brd 10.0.2.255 scope global dynamic enp0s1
       valid_lft 86391sec preferred_lft 86391sec
    inet6 fec0::5054:ff:fe12:3456/64 scope site dynamic mngtmpaddr noprefixroute
       valid_lft 86393sec preferred_lft 14393sec
    inet6 fe80::5054:ff:fe12:3456/64 scope link
       valid_lft forever preferred_lft forever

```
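A small helper can make this check repeatable. The sketch below is not part of the original report: it assumes the `ip a` output format shown above, and the function name `ifaces_without_ipv4` is hypothetical. It reads `ip a` output on stdin and prints every non-loopback interface that has no IPv4 address.

```shell
# Read `ip a` output on stdin and print the names of non-loopback
# interfaces that have no IPv4 ("inet") address assigned.
# NOTE: hypothetical helper for illustration, not from the bug report.
ifaces_without_ipv4() {
    awk '
        # Interface header lines look like "2: enp0s1: <FLAGS> ..."
        /^[0-9]+: / {
            # Report the previous interface if it never got an IPv4 address
            if (name != "" && name != "lo" && !has_v4) print name
            name = $2
            sub(/:$/, "", name)    # drop the trailing colon
            sub(/@.*$/, "", name)  # drop any "@parent" suffix
            has_v4 = 0
            next
        }
        # An "inet" line (as opposed to "inet6") is an IPv4 address
        $1 == "inet" { has_v4 = 1 }
        END {
            if (name != "" && name != "lo" && !has_v4) print name
        }
    '
}
```

On the affected guest, `ip a | ifaces_without_ipv4` would be expected to print `enp0s1` before cloud-init is re-run and nothing afterwards.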

The bug is also reproducible with an amd64 guest on an amd64 host on older/slower hardware.

The fixes suggested while debugging this issue are:

* include `virtio-net` as a built-in in the mantic generic kernel
* understand what needs to change in cloud-init so that it can react to late additions of network interfaces
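
Whether a given kernel builds `virtio-net` in or ships it as a module can be checked from its build configuration. This is a minimal sketch, assuming the config file is installed at `/boot/config-$(uname -r)` as with standard Ubuntu kernel packages; the function name `virtio_net_mode` is hypothetical and not part of the original report.

```shell
# Report how CONFIG_VIRTIO_NET is set in a kernel config file.
# Prints "builtin" for =y, "module" for =m, and "disabled" otherwise.
# NOTE: hypothetical helper for illustration, not from the bug report.
virtio_net_mode() {
    config_file=$1
    if grep -q '^CONFIG_VIRTIO_NET=y$' "$config_file"; then
        echo builtin
    elif grep -q '^CONFIG_VIRTIO_NET=m$' "$config_file"; then
        echo module
    else
        echo disabled
    fi
}
```

For example, `virtio_net_mode "/boot/config-$(uname -r)"` printing `module` on an initrdless image would mean the driver has to be loaded from the root filesystem after boot has started, which is where the race can occur.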

I will file a separate bug against cloud-init to address the race condition on emulated guest/older hardware.

Changed in linux (Ubuntu):
milestone: none → ubuntu-23.10
description: updated
description: updated
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 2036968

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Philip Roche (philroche)
description: updated
Philip Roche (philroche) wrote : Re: Mantic minimized/minimal cloud images do not receive IP address during provisioning
Dimitri John Ledkov (xnox) wrote :
Changed in linux (Ubuntu):
status: Incomplete → Triaged
Changed in linux (Ubuntu):
status: Triaged → Fix Committed
Dimitri John Ledkov (xnox) wrote :

Analysis at https://github.com/canonical/cloud-init/issues/4451 identified a systemd regression that affects mantic.
The kernel workaround covers virtio-net but not other NIC types (e.g. e1000), so the regression will likely affect other users too, not just the cloud-init use case.

Request to consider fixing https://github.com/canonical/cloud-init/issues/4451#issuecomment-1733881643 in systemd in Mantic for GA, or as a 0-day SRU.

tags: added: rls-mm-incoming
summary: Mantic minimized/minimal cloud images do not receive IP address during
- provisioning
+ provisioning; systemd regression with wait-online
Nick Rosbrook (enr0n)
tags: added: foundations-todo
removed: rls-mm-incoming
tags: added: rls-mm-incoming
Changed in systemd (Ubuntu):
status: New → Triaged
importance: Undecided → High
assignee: nobody → Nick Rosbrook (enr0n)
Philip Roche (philroche) wrote :

@xnox I have successfully verified that -proposed arm64 kernel `6.5.0-7-generic` results in successful network configuration when tested using qemu on an amd64 host. See https://people.canonical.com/~philroche/20231003-mantic-minimal-proposed-kernel/arm64/ for cloud-init logs, some debug output and test image.

Philip Roche (philroche) wrote :

I have also successfully verified that -proposed amd64 kernel `6.5.0-7-generic` results in successful network configuration when tested using qemu on an amd64 host with older hardware (a ThinkPad T460 with a 6th-gen Intel i5, the same hardware on which we previously reproduced the issue). See https://people.canonical.com/~philroche/20231003-mantic-minimal-proposed-kernel/amd64/ for cloud-init logs, some debug output and test image.
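
To confirm that a guest is actually running a kernel at or beyond the verified ABI, the release string from `uname -r` can be compared against `6.5.0-7`. A minimal sketch, assuming GNU `sort -V` is available; the function name `kernel_at_least` is hypothetical and not part of the original report.

```shell
# Succeed if kernel release $1 (e.g. "6.5.0-7-generic") is at least
# version $2 (e.g. "6.5.0-7").
# NOTE: hypothetical helper for illustration, not from the bug report.
kernel_at_least() {
    # Drop the flavour suffix, keeping only "<upstream version>-<ABI number>"
    current=$(printf '%s\n' "$1" | sed 's/^\([0-9.]*-[0-9]*\).*/\1/')
    wanted=$2
    # With version sort, the wanted version sorts first when current >= wanted
    lowest=$(printf '%s\n%s\n' "$wanted" "$current" | sort -V | head -n 1)
    [ "$lowest" = "$wanted" ]
}
```

For example, `kernel_at_least "$(uname -r)" 6.5.0-7 && echo "at or above the verified kernel"` could be run inside the guest before checking the network state.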

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 6.5.0-7.7

---------------
linux (6.5.0-7.7) mantic; urgency=medium

  * mantic/linux: 6.5.0-7.7 -proposed tracker (LP: #2037611)

  * kexec enable to load/kdump zstd compressed zimg (LP: #2037398)
    - [Packaging] Revert arm64 image format to Image.gz

  * Mantic minimized/minimal cloud images do not receive IP address during
    provisioning (LP: #2036968)
    - [Config] Enable virtio-net as built-in to avoid race

  * Miscellaneous Ubuntu changes
    - SAUCE: Add mdev_set_iommu_device() kABI
    - [Config] update gcc version in annotations

 -- Andrea Righi <email address hidden> Thu, 28 Sep 2023 10:19:24 +0200

Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
tags: removed: rls-mm-incoming
Changed in systemd (Ubuntu):
importance: High → Medium
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-oem-6.5/6.5.0-1005.5 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-oem-6.5' to 'verification-done-jammy-linux-oem-6.5'. If the problem still exists, change the tag 'verification-needed-jammy-linux-oem-6.5' to 'verification-failed-jammy-linux-oem-6.5'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-oem-6.5-v2 verification-needed-jammy-linux-oem-6.5
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-azure-6.5/6.5.0-1007.7~22.04.1 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-azure-6.5' to 'verification-done-jammy-linux-azure-6.5'. If the problem still exists, change the tag 'verification-needed-jammy-linux-azure-6.5' to 'verification-failed-jammy-linux-azure-6.5'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-azure-6.5-v2 verification-needed-jammy-linux-azure-6.5
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-aws-6.5/6.5.0-1008.8~22.04.1 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-aws-6.5' to 'verification-done-jammy-linux-aws-6.5'. If the problem still exists, change the tag 'verification-needed-jammy-linux-aws-6.5' to 'verification-failed-jammy-linux-aws-6.5'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-aws-6.5-v2 verification-needed-jammy-linux-aws-6.5
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-nvidia-6.8/6.8.0-1006.6~22.04.2 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-nvidia-6.8' to 'verification-done-jammy-linux-nvidia-6.8'. If the problem still exists, change the tag 'verification-needed-jammy-linux-nvidia-6.8' to 'verification-failed-jammy-linux-nvidia-6.8'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-nvidia-6.8-v2 verification-needed-jammy-linux-nvidia-6.8
Nick Rosbrook (enr0n) wrote :

There was a lot of work in netplan related to this area in Noble (see e.g. bug 2060311). This bug has a lot of valuable information, but the behavior has changed enough that I don't think we should keep this open.

And, mantic is EOL soon enough that doing an SRU for mantic specifically is not practical.

Changed in systemd (Ubuntu):
status: Triaged → Won't Fix