Azure: issues with accelerated networking on Hirsute

Bug #1919177 reported by Gauthier Jolly
Affects                   Status         Importance   Assigned to   Milestone
systemd (Ubuntu)          Fix Released   Undecided    Unassigned
  Hirsute                 New            Undecided    Unassigned

Bug Description

[General]

On Azure, when provisioning a Hirsute VM with Accelerated Networking enabled, part of the cloud-init configuration is sometimes not applied.

In particular, in those cases the public SSH key is not set up properly.

[how to reproduce]

Start a VM with AN enabled:

```
az vm create --name "$VM_NAME" --resource-group "$GROUP" --location "UK South" --image 'Canonical:0001-com-ubuntu-server-hirsute-daily:21_04-daily-gen2:latest' --size Standard_F8s_v2 --admin-username ubuntu --ssh-key-value "$SSH_KEY" --accelerated-networking
```

After a moment, try to SSH in: if you succeed, delete the VM and create a new one, until you hit a failing instance.
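Because the failure is intermittent, it can help to script the create/test/delete cycle. A rough sketch (not part of the original report; VM_NAME, GROUP and SSH_KEY are placeholders, and the sleep time is a guess):

```
#!/bin/bash
# Keep creating VMs with accelerated networking until one refuses the SSH key,
# i.e. until the bug reproduces; delete the VMs that come up fine.
for i in $(seq 1 20); do
    az vm create --name "$VM_NAME-$i" --resource-group "$GROUP" --location "UK South" \
        --image 'Canonical:0001-com-ubuntu-server-hirsute-daily:21_04-daily-gen2:latest' \
        --size Standard_F8s_v2 --admin-username ubuntu \
        --ssh-key-value "$SSH_KEY" --accelerated-networking
    ip="$(az vm show -d -g "$GROUP" -n "$VM_NAME-$i" --query publicIps -o tsv)"
    sleep 60   # give cloud-init time to finish
    if ssh -o BatchMode=yes -o StrictHostKeyChecking=no "ubuntu@$ip" true; then
        echo "attempt $i: SSH worked, deleting the VM and retrying"
        az vm delete -g "$GROUP" -n "$VM_NAME-$i" --yes
    else
        echo "attempt $i: SSH failed, keeping $VM_NAME-$i for inspection"
        break
    fi
done
```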

[troubleshooting]

To be able to connect to the VM, run:

```
az vm run-command invoke -g "$GROUP" -n "$VM_NAME" --command-id RunShellScript --scripts "sudo -u ubuntu ssh-import-id $LP_USERNAME"
```

In "/run/cloud-init/instance-data.json", I can see:
```
     "publicKeys": [
      {
       "keyData": "<my-pub-key>",
       "path": "/home/ubuntu/.ssh/authorized_keys"
      }
     ],
```

as expected.
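One quick way to check this on an instance (an illustrative command, not part of the original report) without relying on the exact nesting of the Azure metadata:

```
# Print any "publicKeys" entries found anywhere in the instance data
sudo jq '.. | .publicKeys? // empty' /run/cloud-init/instance-data.json
# or, without jq:
sudo grep -A 4 '"publicKeys"' /run/cloud-init/instance-data.json
```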

[workaround]

As mentioned, Azure allows the user to run commands in the VM without an SSH connection. To do so, one can use the Azure CLI:

```
az vm run-command invoke -g "$GROUP" -n "$VM_NAME" --command-id RunShellScript --scripts "sudo -u ubuntu ssh-import-id $LP_USERNAME"
```

This example uses "ssh-import-id", but it is also possible to simply append a given public key to /home/ubuntu/.ssh/authorized_keys, for example:
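A minimal sketch of that variant (PUBKEY and its value are placeholders, not from the original report):

```
# Append a literal public key instead of importing one from Launchpad
PUBKEY="ssh-ed25519 AAAA... user@host"
az vm run-command invoke -g "$GROUP" -n "$VM_NAME" --command-id RunShellScript \
    --scripts "echo '$PUBKEY' >> /home/ubuntu/.ssh/authorized_keys && chown ubuntu:ubuntu /home/ubuntu/.ssh/authorized_keys && chmod 600 /home/ubuntu/.ssh/authorized_keys"
```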

NOTE: this only solves the SSH issue; I do not know whether this bug affects other parts of the configuration. If so, the user would have to apply them manually.

Tags: fr-1324
Revision history for this message
Paride Legovini (paride) wrote :

Hi Gauthier,

There are a few lines in the attached logs tarball (thanks for attaching it) that make us suspect that networking is more generally broken when the issue you describe happens, e.g.:

azure.py[ERROR]: Failed to read /var/lib/dhcp/dhclient.eth0.leases: [Errno 2] No such file or directory: '/var/lib/dhcp/dhclient.eth0.leases'

Without networking it is to be expected that cloud-init fails to complete the instance configuration, and there's little cloud-init can do about it. A full networking setup failure could be caused by a kernel issue with the accelerated networking feature. Could you please dig a bit deeper into a failing instance and check whether networking is actually working and everything is fine with the kernel? Thanks!
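A few commands that could be run on a failing instance (for example via az vm run-command) to gather that information; this is a suggested checklist, not something from the original report:

```
# Is the interface up and does it have an address?
networkctl status eth0
ip -br addr

# What did systemd-networkd and the Mellanox/netvsc drivers log during boot?
journalctl -b -u systemd-networkd --no-pager
dmesg | grep -iE 'mlx5|hv_netvsc'

# Did cloud-init itself report errors?
grep -iE 'error|warning' /var/log/cloud-init.log
```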

Changed in cloud-init:
status: New → Incomplete
Revision history for this message
Gauthier Jolly (gjolly) wrote :

Hi Paride,

Thanks for checking the logs. I can confirm that the network seems correctly configured when I access the VM.
As far as I can see with journalctl, the Mellanox network interface is not yet fully configured when the cloud-init "init" stage starts up.

Good instance (no issue):

Mar 22 14:25:40 hirsute-acc systemd-networkd[666]: eth0: DHCPv4 address 10.0.0.4/24 via 10.0.0.1
Mar 22 14:25:41 hirsute-acc cloud-init[678]: Cloud-init v. 21.1-19-gbad84ad4-0ubuntu1 running 'init' at Mon, 22 Mar 2021 14:25:41 +0000. Up 3970.29 seconds.

Bad instance (with issue):

Mar 22 16:26:44 hirsute-acc-2 cloud-init[662]: Cloud-init v. 21.1-19-gbad84ad4-0ubuntu1 running 'init' at Mon, 22 Mar 2021 16:26:44 +0000. Up 14.23 seconds.
Mar 22 16:26:48 hirsute-acc-2 systemd-networkd[653]: eth0: DHCPv4 address 10.0.0.4/24 via 10.0.0.1

It's odd because, in both cases, I can see this line before cloud-init starts up:

Mar 22 16:26:44 hirsute-acc-2 systemd[1]: Finished Wait for Network to be Configured.

which is expected, since the cloud-init unit is set to run after "systemd-networkd-wait-online.service".

So I am trying to understand why "systemd-networkd-wait-online" reports that the interfaces are configured when apparently they are not.
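One way to dig into that is to compare what systemd-networkd-wait-online considers "configured" with the actual link state; a suggested set of commands (not from the original report):

```
# Confirm that cloud-init.service is ordered after the wait-online service
systemctl show -p After cloud-init.service | tr ' ' '\n' | grep wait-online

# See the operational state networkd tracks for each link
networkctl list
networkctl status eth0

# Re-run the waiter by hand with its default threshold and a short timeout
/usr/lib/systemd/systemd-networkd-wait-online --timeout=10; echo "exit: $?"
```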

Revision history for this message
Gauthier Jolly (gjolly) wrote :

Hi,

I think the complete sequence of events is well illustrated here:

Apr 12 08:25:00 hirsute-acc systemd-networkd[652]: enP28243s1np0: Interface name change detected, enP28243s1np0 has been renamed to eth1.
Apr 12 08:25:00 hirsute-acc systemd-networkd[652]: eth1: Interface name change detected, eth1 has been renamed to enP28243s1np0.
Apr 12 08:25:00 hirsute-acc systemd-networkd[652]: enP28243s1np0: Link UP
Apr 12 08:25:00 hirsute-acc systemd-networkd[652]: eth0: Link UP
Apr 12 08:25:00 hirsute-acc systemd-networkd[652]: eth0: Gained carrier
Apr 12 08:25:00 hirsute-acc systemd-networkd[652]: enP28243s1np0: Gained carrier
Apr 12 08:25:00 hirsute-acc systemd-networkd[652]: enP28243s1np0: Link DOWN
Apr 12 08:25:00 hirsute-acc systemd-networkd[652]: enP28243s1np0: Lost carrier
Apr 12 08:25:00 hirsute-acc systemd-networkd[652]: eth0: Link DOWN
Apr 12 08:25:00 hirsute-acc systemd-networkd[652]: eth0: Lost carrier
Apr 12 08:25:00 hirsute-acc systemd[1]: Finished Wait for Network to be Configured.
Apr 12 08:25:00 hirsute-acc systemd[1]: Starting Initial cloud-init job (metadata service crawler)...
Apr 12 08:25:00 hirsute-acc systemd-networkd[652]: enP28243s1np0: Link UP
Apr 12 08:25:00 hirsute-acc systemd-networkd[652]: eth0: Link UP
Apr 12 08:25:00 hirsute-acc systemd-networkd[652]: eth0: Gained carrier
Apr 12 08:25:00 hirsute-acc cloud-init[662]: 2021-04-12 08:25:00,954 - azure.py[ERROR]: Failed to read /var/lib/dhcp/dhclient.eth0.leases: [Errno 2] No such file or directory: '/var/lib/dhcp/dhclient.eth0.leases'

The link goes down, then systemd stops waiting for the network configuration and cloud-init starts; shortly after, the link goes UP again, but cloud-init fails (with an error related to the DHCP lease).
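A convenient way to see this interleaving on any instance is to pull the relevant units into one precisely timestamped view (an illustrative command, not part of the original report):

```
# Merge networkd, wait-online and cloud-init messages from the current boot
journalctl -b -o short-precise --no-pager \
    -u systemd-networkd -u systemd-networkd-wait-online -u cloud-init
```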

Robert C Jennings (rcj)
Changed in cloud-init (Ubuntu):
milestone: none → ubuntu-21.04
tags: added: rls-hh-incoming
Changed in linux-azure (Ubuntu):
milestone: none → ubuntu-21.04
Revision history for this message
Tim Gardner (timg-tpi) wrote :

gjolly - can you attach the whole /var/log/syslog from soon after boot? I see the same pattern of link UP/DOWN/UP on a non-accelerated Hirsute instance, but in that case systemd correctly waits for networking to be configured.

Revision history for this message
Guilherme G. Piccoli (gpiccoli) wrote :

I'm also taking a look at this one; I couldn't reproduce it (tried 11 times, with the same az-cli command line provided by gjolly).
I found 2 interesting things after Gauthier provided me access to one of his failing instances:

(a) Regarding cloud-init, I see the following in the logs (comparing a GOOD and BAD instance):

GOOD:
2021-04-13 19:19:09,341 - DataSourceAzure.py[DEBUG]: Retrieving public SSH keys
2021-04-13 19:19:09,341 - azure.py[DEBUG]: Unable to get keys from IMDS, falling back to OVF
2021-04-13 19:19:09,341 - DataSourceAzure.py[DEBUG]: Retrieved keys from OVF
2021-04-13 19:19:09,342 - handlers.py[DEBUG]: finish: azure-ds/get_public_ssh_keys: SUCCESS: get_public_ssh_keys
2021-04-13 19:19:09,342 - util.py[DEBUG]: Changing the ownership of /home/ubuntu/.ssh to 1000:1000
2021-04-13 19:19:09,343 - util.py[DEBUG]: Reading from /etc/ssh/sshd_config (quiet=False)
2021-04-13 19:19:09,343 - util.py[DEBUG]: Read 3287 bytes from /etc/ssh/sshd_config
2021-04-13 19:19:09,343 - util.py[DEBUG]: Writing to /home/ubuntu/.ssh/authorized_keys - wb: [600] 381 bytes
2021-04-13 19:19:09,343 - util.py[DEBUG]: Changing the ownership of /home/ubuntu/.ssh/authorized_keys to 1000:1000
2021-04-13 19:19:09,344 - util.py[DEBUG]: Changing the ownership of /root/.ssh to 0:0
2021-04-13 19:19:09,344 - util.py[DEBUG]: Reading from /etc/ssh/sshd_config (quiet=False)
2021-04-13 19:19:09,344 - util.py[DEBUG]: Read 3287 bytes from /etc/ssh/sshd_config
2021-04-13 19:19:09,344 - util.py[DEBUG]: Writing to /root/.ssh/authorized_keys - wb: [600] 545 bytes

BAD:
2021-04-12 08:25:07,412 - DataSourceAzure.py[DEBUG]: Retrieving public SSH keys
2021-04-12 08:25:07,412 - azure.py[DEBUG]: Unable to get keys from IMDS, falling back to OVF
2021-04-12 08:25:07,412 - azure.py[DEBUG]: No keys available from OVF
2021-04-12 08:25:07,412 - handlers.py[DEBUG]: finish: azure-ds/get_public_ssh_keys: SUCCESS: get_public_ssh_keys
2021-04-12 08:25:07,413 - util.py[DEBUG]: Changing the ownership of /home/ubuntu/.ssh to 1000:1000
2021-04-12 08:25:07,413 - util.py[DEBUG]: Reading from /etc/ssh/sshd_config (quiet=False)
2021-04-12 08:25:07,413 - util.py[DEBUG]: Read 3287 bytes from /etc/ssh/sshd_config
2021-04-12 08:25:07,414 - util.py[DEBUG]: Writing to /home/ubuntu/.ssh/authorized_keys - wb: [600] 0 bytes
2021-04-12 08:25:07,414 - util.py[DEBUG]: Changing the ownership of /home/ubuntu/.ssh/authorized_keys to 1000:1000
2021-04-12 08:25:07,414 - util.py[DEBUG]: Changing the ownership of /root/.ssh to 0:0
2021-04-12 08:25:07,414 - util.py[DEBUG]: Reading from /etc/ssh/sshd_config (quiet=False)
2021-04-12 08:25:07,415 - util.py[DEBUG]: Read 3287 bytes from /etc/ssh/sshd_config
2021-04-12 08:25:07,415 - util.py[DEBUG]: Writing to /root/.ssh/authorized_keys - wb: [600] 0 bytes

So, the main difference here is:

2021-04-13 19:19:09,341 - DataSourceAzure.py[DEBUG]: Retrieved keys from OVF
vs
2021-04-12 08:25:07,412 - azure.py[DEBUG]: No keys available from OVF

Why is one message logged from DataSourceAzure.py while the other comes from azure.py?
I'm far from an expert in cloud-init, so I'll defer that question to the cloud-init folks.

Will continue in next comment.

Revision history for this message
Guilherme G. Piccoli (gpiccoli) wrote :

(b) Regarding the kernel, I found some oddities too [output of dmesg | grep -v "audit\|apparmor"]:

GOOD: <unrelated intermixed entries, such as ext4's, suppressed>

[ 1627.732924] hv_netvsc 00224840-7fbf-0022-4840-7fbf00224840 eth0: VF slot 1 added
[ 1627.733637] hv_pci fb9ea909-d0dd-41b6-a1c2-98b1233e987d: PCI VMBus probing: Using version 0x10002
[ 1627.742469] hv_pci fb9ea909-d0dd-41b6-a1c2-98b1233e987d: PCI host bridge to bus d0dd:00
[ 1627.742472] pci_bus d0dd:00: root bus resource [mem 0xfe0000000-0xfe00fffff window]
[ 1627.743208] pci d0dd:00:02.0: [15b3:1016] type 00 class 0x020000
[ 1627.747302] pci d0dd:00:02.0: reg 0x10: [mem 0xfe0000000-0xfe00fffff 64bit pref]
[ 1627.825728] pci d0dd:00:02.0: enabling Extended Tags
[ 1627.830106] pci d0dd:00:02.0: 0.000 Gb/s available PCIe bandwidth, limited by Unknown x0 link at d0dd:00:02.0 (capable of 63.008 Gb/s with 8.0 GT/s PCIe x8 link)
[ 1627.834774] pci d0dd:00:02.0: BAR 0: assigned [mem 0xfe0000000-0xfe00fffff 64bit pref]
[ 1627.926227] mlx5_core d0dd:00:02.0: firmware version: 14.25.8362
[ 1627.936526] mlx5_core d0dd:00:02.0: handle_hca_cap:526:(pid 619): log_max_qp value in current profile is 18, changing it to HCA capability limit (12)
[ 1628.131452] mlx5_core d0dd:00:02.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[ 1628.256107] hv_netvsc 00224840-7fbf-0022-4840-7fbf00224840 eth0: VF registering: eth1
[ 1628.256147] mlx5_core d0dd:00:02.0 eth1: joined to eth0
[ 1628.257059] mlx5_core d0dd:00:02.0 eth1: Disabling LRO, not supported in legacy RQ
[ 1628.266117] mlx5_core d0dd:00:02.0 eth1: Disabling LRO, not supported in legacy RQ
[ 1628.266803] mlx5_core d0dd:00:02.0 enP53469s1np0: renamed from eth1
[ 1628.305588] mlx5_core d0dd:00:02.0 enP53469s1np0: Disabling LRO, not supported in legacy RQ
[ 1628.444056] mlx5_core d0dd:00:02.0 enP53469s1np0: Link up
[ 1628.445592] hv_netvsc 00224840-7fbf-0022-4840-7fbf00224840 eth0: Data path switched to VF: enP53469s1np0

BAD:
[ 5.211059] hv_netvsc 000d3ad6-1871-000d-3ad6-1871000d3ad6 eth0: VF slot 1 added
[ 5.211618] hv_pci e94e34d7-6e53-4ab5-95d1-4328102a7c87: PCI VMBus probing: Using version 0x10002
[ 5.220736] hv_pci e94e34d7-6e53-4ab5-95d1-4328102a7c87: PCI host bridge to bus 6e53:00
[ 5.220739] pci_bus 6e53:00: root bus resource [mem 0xfe0000000-0xfe00fffff window]
[ 5.221465] pci 6e53:00:02.0: [15b3:1016] type 00 class 0x020000
[ 5.239844] pci 6e53:00:02.0: reg 0x10: [mem 0xfe0000000-0xfe00fffff 64bit pref]
[ 5.317841] pci 6e53:00:02.0: enabling Extended Tags
[ 5.322168] pci 6e53:00:02.0: 0.000 Gb/s available PCIe bandwidth, limited by Unknown x0 link at 6e53:00:02.0 (capable of 63.008 Gb/s with 8.0 GT/s PCIe x8 link)
[ 5.325766] pci 6e53:00:02.0: BAR 0: assigned [mem 0xfe0000000-0xfe00fffff 64bit pref]
[ 5.412415] mlx5_core 6e53:00:02.0: firmware version: 14.25.8102
[ 5.424153] mlx5_core 6e53:00:02.0: handle_hca_cap:526:(pid 7): log_max_qp value in current profile is 18, changing it to HCA capability limit (12)
[ 5.613614] mlx5_core 6e53:00:02.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[ 5.727686] hv_netvsc 000d3ad6-1871-000d-3ad6-1871000d3ad6 eth0: VF registering: eth1
[ 5.7...


Revision history for this message
Gauthier Jolly (gjolly) wrote :

Hi,

Thanks for looking into that. Indeed, while trying to reproduce the issue this morning, I found it more challenging than I originally thought. I want to add a few points here on how I reproduced the issue:

 1. Usually, I do not use the Azure CLI directly. I use a custom CLI of my own that uses the Azure SDK. This custom CLI always creates the NIC (with AN) before creating the VM; the VM is then created with the existing NIC. I don't know how the Azure CLI handles the "--accelerated-networking" flag under the hood; maybe it does something different that makes the issue harder to reproduce.
 2. (For cost reasons) I always create a new resource group when I create a new VM. Once again, I don't know if it has any impact on reproducibility.
 3. This morning I managed to reproduce the issue using only the Azure CLI and after a few (unsuccessful) tries:

➜ ~ az group create --resource-group hirsute-acc-manual-1 --location 'UK South'
{
  "id": "/subscriptions/5059ce5a-a72d-4085-acb7-33b421daa1ee/resourceGroups/hirsute-acc-manual-1",
  "location": "uksouth",
  "managedBy": null,
  "name": "hirsute-acc-manual-1",
  "properties": {
    "provisioningState": "Succeeded"
  },
  "tags": null,
  "type": "Microsoft.Resources/resourceGroups"
}
➜ ~ az vm create --name hirsute-acc-manual --resource-group hirsute-acc-manual-1 --location "UK South" --image 'Canonical:0001-com-ubuntu-server-hirsute-daily:21_04-daily-gen2:latest' --size Standard_F8s_v2 --admin-username ubuntu --ssh-key-value "$(cat ~/.ssh/canonical.pub)" --accelerated-networking
{- Finished ..
  "fqdns": "",
  "id": "/subscriptions/5059ce5a-a72d-4085-acb7-33b421daa1ee/resourceGroups/hirsute-acc-manual-1/providers/Microsoft.Compute/virtualMachines/hirsute-acc-manual",
  "location": "uksouth",
  "macAddress": "00-22-48-40-82-32",
  "powerState": "VM running",
  "privateIpAddress": "10.0.0.4",
  "publicIpAddress": "51.104.198.218",
  "resourceGroup": "hirsute-acc-manual-1",
  "zones": ""
}
➜ ~ ssh -i ~/.ssh/canonical ubuntu@51.104.198.218
The authenticity of host '51.104.198.218 (51.104.198.218)' can't be established.
ECDSA key fingerprint is SHA256:wIQAUjmIeFvdBeqT5a2RHJEpDtjCnrJ+FggR8pzW7OM.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '51.104.198.218' (ECDSA) to the list of known hosts.
ubuntu@51.104.198.218: Permission denied (publickey).

 4. I will post the full syslog file here, but I also want to point out that I THINK this issue only appears with mlx5 devices/drivers. When I was checking the VM created with no issue, mlx4 modules were loaded. On the previous VM, I can see:

ubuntu@hirsute-acc-manual:~$ lsmod | grep mlx
mlx5_ib 331776 0
ib_uverbs 139264 1 mlx5_ib
ib_core 348160 2 ib_uverbs,mlx5_ib
mlx5_core 1081344 1 mlx5_ib
tls 90112 1 mlx5_core
mlxfw 36864 1 mlx5_core

  Once again, I don't know if that really matters.

Revision history for this message
Paride Legovini (paride) wrote :

Hi Guilherme,

Answering your question in comment #6:

> Why one method executes from DataSourceAzure.py
> whereas the other from azure.py?

The reason is not that interesting and won't really help here. Some log messages generated in DataSourceAzure.py are logged using a helper function defined in azure.py, while others are logged directly from DataSourceAzure.py. Honestly I'm not completely sure why we have this difference; it could be in part a legacy thing. In any case, the interesting logic is all in DataSourceAzure.py.

Revision history for this message
Guilherme G. Piccoli (gpiccoli) wrote :

Thanks Paride. I still think we should understand that difference in the logs precisely, since in the BAD case we always see the "azure.py" messages, not the other ones. This could be related, or at least a clue to the root cause.

Regarding the kernel side, I've built a 5.11 kernel with a debug patch [0] - I'm attaching the patch here; it is very simple, just a parameter-controlled delay in the carrier notification. Unfortunately gjolly tried it in a custom image and the issue didn't reproduce. My theory is that just delaying the notification is not enough: given the complex SR-IOV multi-interface setup in Hyper-V, maybe there is network connectivity even before the carrier is fully UP, so the debug patch could perhaps be extended to block packet transmission in mlx5 for N seconds.

I have a feeling that Groovy should also reproduce this, as discussed with gjolly - in our first reproducer we had a Hirsute image with the Groovy 5.8 kernel, and the cloud-init versions in Groovy and Hirsute are very similar. So, if it reproduces in Groovy, it definitely shouldn't be a release blocker.

Thanks!

[0] https://launchpad.net/~gpiccoli/+archive/ubuntu/test1919177/

Revision history for this message
Paride Legovini (paride) wrote :

Hi again Guilherme,

I don't think there's any useful information hiding in the fact that cloud-init logs some messages as DataSourceAzure.py[DEBUG] and others as azure.py[DEBUG]. Look at:

https://github.com/canonical/cloud-init/blob/45db197cfc7e3488baae7dc1053c45da070248f6/cloudinit/sources/DataSourceAzure.py#L691

Note that some messages are logged with LOG.debug(), while others use report_diagnostic_event(). The latter function is defined in azure.py, hence the difference. It could have been LOG.debug() everywhere, in which case we would always see DataSourceAzure.py[DEBUG]. I'm not sure what the rationale for the two ways of logging was; maybe report_diagnostic_event() was reserved for "more useful" debugging messages? Anyway, I doubt it will help in debugging this issue.

Revision history for this message
Gauthier Jolly (gjolly) wrote :

Hi there,

Regarding whether or not the issue affects Groovy: I've been creating Groovy VMs in a loop to see whether I could reproduce it there, and (after ~50 tries) I still haven't seen it happen. I think we can safely assume that this issue does not affect Groovy.

Also, I was able to reproduce the issue with gen1 Hyper-V (previously I was only using gen2).

To continue the system log investigation, I can add the following:

GOOD
Apr 19 08:53:43 hirsute-man-2 systemd[1]: Finished Wait for Network to be Configured.
Apr 19 08:53:43 hirsute-man-2 systemd-networkd[660]: eth0: Gained carrier
Apr 19 08:53:43 hirsute-man-2 systemd-networkd[660]: eth0: DHCPv4 address 10.0.0.4/24 via 10.0.0.1
Apr 19 08:53:44 hirsute-man-2 cloud-init[667]: Cloud-init v. 21.1-19-gbad84ad4-0ubuntu2 running 'init' at Mon, 19 Apr 2021 08:53:44 +0000. Up 84.81 seconds.
Apr 19 08:53:48 hirsute-man-2 systemd-networkd[660]: enP6454s1: Link UP
Apr 19 08:53:48 hirsute-man-2 systemd-networkd[660]: enP6454s1: Gained carrier

BAD
Apr 19 08:33:48 groovy-acc-UWP5F systemd[1]: Finished Wait for Network to be Configured.
Apr 19 08:33:48 groovy-acc-UWP5F systemd-networkd[616]: eth0: Gained carrier
Apr 19 08:33:48 groovy-acc-UWP5F cloud-init[626]: Cloud-init v. 21.1-19-gbad84ad4-0ubuntu2 running 'init' at Mon, 19 Apr 2021 08:33:48 +0000. Up 7.34 seconds.
Apr 19 08:33:48 groovy-acc-UWP5F systemd-networkd[616]: enP30932s1np0: Gained carrier
Apr 19 08:33:51 groovy-acc-UWP5F systemd-networkd[616]: eth0: DHCPv4 address 10.0.0.4/24 via 10.0.0.1

On GOOD, DHCP is configured before cloud-init runs; on BAD, it is configured after. This is "obvious" given the error raised by cloud-init, but I just wanted to underline it.

Revision history for this message
Gauthier Jolly (gjolly) wrote :

I don't have any particular knowledge of how systemd waits for the network to be ready, but reading systemd-networkd-wait-online.service(8) and networkctl(1) I wonder whether cloud-init shouldn't wait for networkd-wait-online to report that the link is "routable" instead of "degraded" (the default minimum):

"/usr/lib/systemd/systemd-networkd-wait-online -o routable"

Revision history for this message
Paride Legovini (paride) wrote :

Hi, I don't think that's a viable option, as cloud-init may very well be the component that allows the network interface to be configured with a routable address, so we can't wait for such an address to be available before configuring the machine, at least not in general.

Revision history for this message
Gauthier Jolly (gjolly) wrote :

Hi,

OK, thanks. I am now trying to find when the issue was first introduced. So far I have tested an image from 20201222 and confirmed that I cannot reproduce the issue with it.

Then, by modifying the image (always before first boot), I found that:
 - after upgrading only cloud-init from 20.4-0ubuntu1 to 21.1-19-gbad84ad4-0ubuntu2, I still could not reproduce the issue
 - after upgrading only systemd from 246.6-5ubuntu1 to 247.3-1ubuntu4, I could not reproduce the issue
 - after upgrading BOTH cloud-init and systemd at the same time, I was able to reproduce the issue

I am now trying to test newer images to understand what is the first image that introduces the issue.
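For reference, one way to do this kind of pre-first-boot modification (a hedged sketch using libguestfs; not necessarily how the tests above were done, and the image and package paths are placeholders):

```
# Install newer .debs into the image before its first boot
virt-customize -a hirsute-server-cloudimg-amd64-azure.vhd \
    --copy-in ./debs:/tmp \
    --run-command 'dpkg -i /tmp/debs/*.deb'
```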

Revision history for this message
Gauthier Jolly (gjolly) wrote :

The issue probably* started between these two serials: 20210215 and 20210304.

*I say "probably" because while I'm sure I can reproduce the issue with 20210304, the fact I didn't manage to reproduce it with 20210215 (in ~15 tries) doesn't **absolutely prove** that the issue wasn't there.

Gauthier Jolly (gjolly)
description: updated
description: updated
Revision history for this message
Gauthier Jolly (gjolly) wrote :

I isolated the issue between these two serials: 20210219 (doesn't have the issue) and 20210220 (reproduces the issue).

The main difference I can see between those two serials is the systemd version:

20210219 -> 247.1-4ubuntu1
20210220 -> 247.3-1ubuntu2

Cloud-init version is the same (20.4.1-79-g71564dce-0ubuntu1).

I will try to get a better package diff between those two images to understand if anything else significant changed.

Revision history for this message
Gauthier Jolly (gjolly) wrote :

I attach here the full diff of packages between the two serials.

Paride Legovini (paride)
Changed in cloud-init (Ubuntu):
status: New → Incomplete
Changed in cloud-init (Ubuntu):
milestone: ubuntu-21.04 → hirsute-updates
Changed in linux-azure (Ubuntu):
milestone: ubuntu-21.04 → hirsute-updates
tags: added: fr-1324
tags: removed: rls-hh-incoming
Revision history for this message
Gauthier Jolly (gjolly) wrote :

Hi there,

Is there any update on this issue? I would like to understand who owns the investigation/debugging process. Please tell us if you need any help from the CPC team.

Revision history for this message
Balint Reczey (rbalint) wrote :

@gjolly There were networking-related changes in systemd which could have caused this, but I see that the kernel packages also changed between the Azure images:
-ii linux-image-azure 5.8.0.1017.19+21.04.14
...
+ii linux-image-5.8.0-1022-azure 5.8.0-1022.24+21.04.2

Could you please prepare an image that has the older kernel and newer systemd to rule out kernel changes causing the issue?

Also do you still observe the issue in Impish? It has systemd 248 now.

Changed in systemd (Ubuntu):
status: New → Incomplete
Revision history for this message
Marcelo Cerri (mhcerri) wrote :

Hi, Gauthier and Guilherme.

I'm marking linux-azure as invalid for now. If something changes please let us know. Thank you.

Changed in linux-azure (Ubuntu):
status: New → Invalid
Revision history for this message
Gauthier Jolly (gjolly) wrote :

@rbalint I was able to reproduce the issue with an image that was running the following:

ii cloud-init 20.4.1-79-g71564dce-0ubuntu1 all initialization and customization tool for cloud instances
ii linux-image-azure 5.8.0.1017.19+21.04.14 amd64 Linux kernel image for Azure systems.
ii systemd 247.3-3ubuntu3 amd64 system and service manager

For reference, I took 20210219 and only upgraded systemd.

I will also try to reproduce with Impish to confirm whether or not systemd 248 solves the issue.

Revision history for this message
Gauthier Jolly (gjolly) wrote :

@rbalint, I just tested Impish and I cannot reproduce the issue. It doesn't prove it's not there but it's a very good sign!

I can see that our automated testing failed because of this issue on 2021-05-10 but not after. If I'm not mistaken, systemd 248 was pushed on the 14th, which would make a nice correlation.

Balint Reczey (rbalint)
Changed in systemd (Ubuntu):
status: Incomplete → Fix Released
Changed in linux-azure (Ubuntu Hirsute):
status: New → Invalid
Changed in cloud-init (Ubuntu Hirsute):
status: New → Incomplete
Revision history for this message
Balint Reczey (rbalint) wrote :

@ddstreed Could you please look into this SRU candidate?

@gjolly What is the importance/urgency of landing this fix?

Revision history for this message
Gauthier Jolly (gjolly) wrote :

It's been there for almost 3 months; I think it can wait a few more days. It affects Azure users running Hirsute instances with accelerated networking enabled; I don't know how many users that is.

Revision history for this message
Gauthier Jolly (gjolly) wrote :

@rbalint @ddstreed any update on this issue?

Revision history for this message
Gauthier Jolly (gjolly) wrote :

I tried to reproduce the issue with the latest Hirsute image pushed to Azure, and it appears that I cannot.

While I can still reproduce the issue with 20210511.1, I can't with 20210622.1.

Mathew Hodson (mhodson)
no longer affects: linux-azure (Ubuntu)
no longer affects: linux-azure (Ubuntu Hirsute)
no longer affects: cloud-init (Ubuntu Hirsute)
no longer affects: cloud-init (Ubuntu)
affects: cloud-init → ubuntu-translations
no longer affects: ubuntu-translations