Netplan/Systemd/Cloud-init/Dbus Race

Bug #1997124 reported by Brett Holman
This bug affects 5 people
Affects           Status    Importance  Assigned to    Milestone
cloud-init        Expired   High        James Falcon
netplan           Triaged   Wishlist    Lukas Märdian
systemd (Ubuntu)  Triaged   Wishlist    Unassigned

Bug Description

Cloud-init is seeing intermittent failures while running `netplan apply`, which appear to be caused by a resource that is missing at the time of the call.

The symptom in cloud-init logs looks like:

Running ['netplan', 'apply'] resulted in stderr output: Failed to connect system bus: No such file or directory

I think that this error[1] is likely caused by cloud-init running `netplan apply` too early in the boot process (before dbus is active).
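To make the hypothesis concrete, here is a minimal sketch of how one might check whether the system bus socket exists at a given moment in boot. It assumes the conventional socket path `/run/dbus/system_bus_socket`; the helper name `system_bus_socket_present` is made up for illustration and is not part of cloud-init or netplan.

```python
import os
import stat

# Assumption: conventional location of the D-Bus system bus socket on Ubuntu.
SYSTEM_BUS_SOCKET = "/run/dbus/system_bus_socket"


def system_bus_socket_present(path: str = SYSTEM_BUS_SOCKET) -> bool:
    """Return True if `path` exists and is a Unix domain socket.

    Hypothetical helper: a missing path here would be consistent with the
    "No such file or directory" message seen in the cloud-init log.
    """
    try:
        return stat.S_ISSOCK(os.stat(path).st_mode)
    except OSError:
        # Covers a missing file, a missing parent directory, and similar.
        return False
```

A False result only shows that the socket is not there yet; it does not say which unit ordering is at fault.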

Today I stumbled upon this error which was hit in MAAS[2]. We have also hit it intermittently during tests (we didn't have a reproducer).

Realizing that this may not be a cloud-init error, but possibly a dependency bug between dbus and systemd, we decided to file this bug for broader visibility to other projects.

I will follow up this initial report with some comments from our discussion earlier.

[1] https://github.com/canonical/netplan/blob/main/src/dbus.c#L801
[2] https://discourse.maas.io/t/latest-ubuntu-20-04-image-causing-netplan-error/5970

Revision history for this message
Brett Holman (holmanb) wrote :

Some details from a conversation with Chad, James, and Vorlon.

`netplan apply` is executed in cloud-init.service, which runs
`Before=network-online.target` but `After=systemd-networkd-wait-online.service`.

There may be a dependency bug between dbus and systemd-networkd, because systemd-networkd is a dbus service, so when it's "up" it should be accessible over dbus.

Should cloud-init or systemd-networkd-wait-online.service require being ordered after dbus.service?

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

They should be ordered after dbus.socket, which should be enough to activate those services. However, they themselves will enqueue themselves into the Systemd startup sequence.

Separately we really ought to port networkd from dbus communication to varlink such that it can be used safely on critical boot path. The rest of the Systemd critical components are already using varlink.

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

Note this bug was opened against upstream projects. Systemd does not use launchpad for bug tracking, did you mean to mark Ubuntu(Systemd) as affected?

Revision history for this message
Brett Holman (holmanb) wrote :

> Separately we really ought to port networkd from dbus communication to varlink such that it can be used safely on critical boot path. The rest of the Systemd critical components are already using varlink.

+1

> did you mean to mark Ubuntu(Systemd) as affected?

Yes, I'll update that, thanks.

no longer affects: systemd
James Falcon (falcojr)
Changed in cloud-init:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Chad Smith (chad.smith) wrote :

Confirmed that we need dbus.socket in cloud-init.service, as the dependency chain doesn't explicitly define that ordering dependency. We'll need this for `netplan apply` to work without a race.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in systemd (Ubuntu):
status: New → Confirmed
Revision history for this message
Rod Smith (rodsmith) wrote :

We've run into this in the Server Certification lab on mouser, a NEC Express5800/R128h-1M server. In our testing, it's affected 50% (2 of 4) Ubuntu 20.04 deployments, but not 18.04 or 22.04 deployments (0 of 3 for each of those). These sample sizes are low, so this may be a coincidence; but I'm mentioning it here in case it's not a coincidence.

James Falcon (falcojr)
Changed in cloud-init:
assignee: nobody → James Falcon (falcojr)
status: Triaged → In Progress
Revision history for this message
Chad Smith (chad.smith) wrote :

The dbus race that is happening here is due to `networkctl reconfigure`[1] being run by netplan apply, failing to talk to dbus, and netplan apply then restarting systemd-networkd[2] at a point in time when systemd-networkd may actually still be coming up and is in an indeterminate state.

[1] https://github.com/canonical/netplan/blob/main/netplan/cli/utils.py#L116
[2] https://github.com/canonical/netplan/blob/main/netplan/cli/commands/apply.py#L277

I'm guessing the restart here from netplan apply is what's triggering the occasional failure case where not all network config is applied (like IP addresses) in systemd-networkd. It doesn't happen all the time but it's racy as systemd-networkd is mid startup and we're restarting it again via netplan apply.

After discussion with waldi (Bastian Blank) in Debian land about the systemd dependency chain, it seems my suggestion about adding dbus.socket to cloud-init.service will actually introduce an ordering cycle, because dbus.socket is After=sysinit.target, yet cloud-init.service is Before=sysinit.target.

So, trying to shoehorn cloud-init into the dependency chain After=dbus.socket is impossible for systemd to schedule.
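The cycle can be illustrated with a small sketch. This is not how systemd resolves its transaction; it is a plain depth-first search over the three ordering relations named above, with `find_cycle` being a hypothetical helper written for this example.

```python
from collections import defaultdict


def find_cycle(edges):
    """Return one ordering cycle as a list of unit names, or None.

    `edges` is an iterable of (earlier, later) pairs, meaning that
    `earlier` must be started before `later`.
    """
    graph = defaultdict(list)
    for earlier, later in edges:
        graph[earlier].append(later)

    unvisited, in_progress, done = 0, 1, 2
    state = defaultdict(int)  # every unit starts as unvisited
    path = []

    def visit(node):
        state[node] = in_progress
        path.append(node)
        for nxt in graph[node]:
            if state[nxt] == in_progress:
                # Back edge: the units from nxt onwards form a cycle.
                return path[path.index(nxt):] + [nxt]
            if state[nxt] == unvisited:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = done
        return None

    for node in list(graph):
        if state[node] == unvisited:
            found = visit(node)
            if found:
                return found
    return None


# Orderings as described in this comment.
existing = [
    ("cloud-init.service", "sysinit.target"),  # cloud-init.service: Before=sysinit.target
    ("sysinit.target", "dbus.socket"),         # dbus.socket: After=sysinit.target
]
# The suggested addition, cloud-init.service: After=dbus.socket
proposed = existing + [("dbus.socket", "cloud-init.service")]

print(find_cycle(existing))
print(find_cycle(proposed))
```

With only the existing orderings no cycle is found; adding the proposed `After=dbus.socket` closes the loop cloud-init.service, sysinit.target, dbus.socket, cloud-init.service.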

Maybe, we'd want one of the following instead:
 1. `netplan apply` provide an option to avoid falling back to `networkctl reconfigure` and exit non-zero so cloud-init can do something better, or retry where necessary
 2. `netplan apply` can defer or block/retry until dbus.socket/service is ready allowing this only to affect cases where netplan apply is called
 3. cloud-init to defer calling netplan apply on systemd-networkd environments until a later boot stage (cloud-config.service), which comes after sysinit.target (and can therefore expect dbus.socket to be started at that point in boot).
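As a rough illustration of the retry idea in options 1 and 2, seen from the caller's side, the sketch below re-runs `netplan apply` while its stderr still contains the bus connection error. It is an assumption-laden sketch, not cloud-init's or netplan's actual behaviour: `apply_with_retry`, `run_netplan_apply`, the attempt count, and the delay are all invented for the example, and it keys on the stderr text because that is the only signal quoted in this report.

```python
import subprocess
import time

# Substring taken from the stderr output quoted in the bug description.
BUS_ERROR = "Failed to connect system bus"


def run_netplan_apply():
    """Run `netplan apply` once and return (returncode, stderr)."""
    proc = subprocess.run(["netplan", "apply"], capture_output=True, text=True)
    return proc.returncode, proc.stderr


def apply_with_retry(run=run_netplan_apply, attempts=5, delay=1.0, sleep=time.sleep):
    """Retry while stderr reports that the system bus is unreachable.

    Hypothetical helper. Returns True if a run finished with exit code 0
    and without the bus error, False otherwise.
    """
    for attempt in range(1, attempts + 1):
        returncode, stderr = run()
        if BUS_ERROR not in stderr:
            # Either success or an unrelated failure; do not retry those.
            return returncode == 0
        if attempt < attempts:
            sleep(delay)
    return False
```

Note that this does nothing about the fallback restart of systemd-networkd on the failed attempts, which is why an explicit netplan option (option 1) or deferring the call (option 3) may still be preferable.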

I'll add netplan here to see if there are thoughts or counter suggestions here.

Lukas Märdian (slyon)
Changed in netplan:
assignee: nobody → Lukas Märdian (slyon)
Revision history for this message
Lukas Märdian (slyon) wrote :

I think the "Failed to connect system bus: No such file or directory" stderr output rather comes from networkctl [1] than from "netplan-dbus" (Netplan's output would be "... connect TO system bus..."). netplan-dbus is not involved at all AFAICS, as cloud-init is calling into the "netplan apply" CLI and not calling its "io.netplan.Netplan Apply()" DBus method; which would fail due to missing DBus communication, too.

So the root cause IMO is networkctl trying to talk to systemd-networkd via DBus, which is not yet ready. Porting this communication to varlink instead of dbus could solve this (but is probably a big task). Are we sure that systemd-networkd.service is already up and running at this stage, and that dbus.service/.socket is the bottleneck? We're ordered `After=systemd-networkd-wait-online.service`, so I assume: Yes.

Netplan's "apply" CLI could probably implement a "systemctl is-active ..." check for dbus.service/.socket and/or systemd-networkd.service/NetworkManager.service (depending on which backend is about to be (re-)configured). But generally "netplan apply" is designed to be a userspace tool, and only Netplan's generator is designed to be executed during early boot. So if it's possible to postpone the execution of "netplan apply" until after systemd's initial boot transaction has finished (i.e. into cloud-config.service), this would IMO be the cleaner solution and could avoid similar future issues related to early boot.
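A minimal sketch of what such an is-active check might look like, assuming it simply shells out to `systemctl is-active --quiet UNIT` (which exits 0 when the unit is active). The function names `unit_is_active` and `backend_ready` are hypothetical and do not exist in netplan.

```python
import subprocess


def unit_is_active(unit, run=subprocess.run):
    """Return True if `systemctl is-active --quiet UNIT` exits with 0."""
    return run(["systemctl", "is-active", "--quiet", unit]).returncode == 0


def backend_ready(backend_unit, run=subprocess.run):
    """Check the bus first, then the backend that is about to be configured.

    `backend_unit` would be systemd-networkd.service or
    NetworkManager.service, depending on the configured renderer.
    """
    bus_up = unit_is_active("dbus.socket", run) or unit_is_active("dbus.service", run)
    return bus_up and unit_is_active(backend_unit, run)
```

A caller could use `backend_ready("systemd-networkd.service")` to decide whether to proceed, wait, or exit non-zero; which of those is right is exactly the open question in this thread.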

[1] https://github.com/systemd/systemd/blob/main/src/network/networkctl.c#L2992

Changed in netplan:
status: New → Triaged
importance: Undecided → Wishlist
Revision history for this message
James Falcon (falcojr) wrote :
Changed in cloud-init:
status: In Progress → Expired
Nick Rosbrook (enr0n)
Changed in systemd (Ubuntu):
status: Confirmed → Triaged
importance: Undecided → Wishlist