"System is booting up. Unprivileged users are not permitted to log in yet." causes wait subcommand to fail

Bug #2039441 reported by Thibf
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
uvtool
Fix Committed
Undecided
Unassigned
cloud-init (Ubuntu)
Fix Committed
Undecided
Unassigned
Focal
Fix Committed
Undecided
Unassigned
Jammy
Fix Committed
Undecided
Unassigned
Lunar
Fix Committed
Undecided
Unassigned
Mantic
Fix Committed
Undecided
Unassigned
uvtool (Ubuntu)
Fix Released
Undecided
Unassigned
Focal
Triaged
Undecided
Unassigned
Jammy
Triaged
Undecided
Unassigned
Lunar
Triaged
Undecided
Unassigned
Mantic
Triaged
Undecided
Unassigned

Bug Description

After doing `uvt-kvm wait`, we can expect to be able to ssh into the VMs.
That's not always the case as the ssh port can be up before PAM is setup:
`System is booting up. Unprivileged users are not permitted to log in yet. Please come back later. For technical details, see pam_nologin(8).`
This means that subsequent programs can't rely on `uvt-kvm wait` to know if the system is up, which defeats the purpose of this function and drives the complexity up in highly automated environment.

Personally, I see two ways to fix the wait to handle this case:
- Change the behavior of the created VM to avoid this edge case.
- Makes `uvt-kvm wait` smarter by actually establishing a communication to check if we really can login.

The last option seems less intrusive but will make the library more complex.
I'm not convinced that would be a reasonable default or would be better as an option to `uvt-kvm wait`.

Related branches

Revision history for this message
Robie Basak (racb) wrote :

Thank you for the report. I have seen the same thing happening in CI over the past few days. It seems to be some change in behaviour, since that CI was working before. Do you see the same? Any idea what changed, or is it just the change in an outcome of a race condition?

The entire point of `uvt-kvm wait` is to wrap the messy business so the caller doesn't have to worry about it, so I think we do need to adjust it to handle this scenario correctly. Perhaps the easiest way would be a string match on the output in case of failure, and keep retrying if there is a match (with a timeout in case the condition persists).

Changed in uvtool:
status: New → Triaged
summary: - System is booting up prevent ssh at startup
+ "System is booting up. Unprivileged users are not permitted to log in
+ yet." causes wait subcommand to fail
Revision history for this message
Thibf (thibf) wrote :

No idea what changed, I would guess it's result of a race condition which is more likely to trigger due to unrelated changed. I don't have the issue every time, but much more.

I think we can reuse the existing connection timeout and handle this string matching error as if we failed to establish communication.

Revision history for this message
Thibf (thibf) wrote :

I just noticed there is an uvtool/wait.py and uvtool/libvirt/kvm.py:780L:ssh.
I confused both of them. Ignore my previous comment part on the timeout. I don't see any timeout here.

Revision history for this message
Robie Basak (racb) wrote :
Download full text (3.2 KiB)

I tried uvt-kvm wait three times against the following image and it consistently works:

http://cloud-images.ubuntu.com/daily/server/focal/20231003/focal-server-cloudimg-amd64.img

I tried uvt-kvm wait three times against the following image and it consistently fails:

http://cloud-images.ubuntu.com/releases/focal/release-20231011/ubuntu-20.04-server-cloudimg-amd64.img

Differences in package versiosn between these images are as follows:

$ diff -u0 good bad
--- good 2023-10-16 16:13:54.245903813 +0000
+++ bad 2023-10-16 16:14:04.801959310 +0000
@@ -30 +30 @@
-cloud-init 23.2.2-0ubuntu0~20.04.1
+cloud-init 23.3.1-0ubuntu1~20.04.1
@@ -43 +43 @@
-curl 7.68.0-1ubuntu2.19
+curl 7.68.0-1ubuntu2.20
@@ -101,2 +101,2 @@
-grub-efi-amd64-bin 2.06-2ubuntu14.2
-grub-efi-amd64-signed 1.187.4~20.04.1+2.06-2ubuntu14.2
+grub-efi-amd64-bin 2.06-2ubuntu14.4
+grub-efi-amd64-signed 1.187.6~20.04.1+2.06-2ubuntu14.4
@@ -175,2 +175,2 @@
-libcurl3-gnutls:amd64 7.68.0-1ubuntu2.19
-libcurl4:amd64 7.68.0-1ubuntu2.19
+libcurl3-gnutls:amd64 7.68.0-1ubuntu2.20
+libcurl4:amd64 7.68.0-1ubuntu2.20
@@ -380,8 +380,8 @@
-linux-headers-5.4.0-163 5.4.0-163.180
-linux-headers-5.4.0-163-generic 5.4.0-163.180
-linux-headers-generic 5.4.0.163.160
-linux-headers-virtual 5.4.0.163.160
-linux-image-5.4.0-163-generic 5.4.0-163.180
-linux-image-virtual 5.4.0.163.160
-linux-modules-5.4.0-163-generic 5.4.0-163.180
-linux-virtual 5.4.0.163.160
+linux-headers-5.4.0-164 5.4.0-164.181
+linux-headers-5.4.0-164-generic 5.4.0-164.181
+linux-headers-generic 5.4.0.164.161
+linux-headers-virtual 5.4.0.164.161
+linux-image-5.4.0-164-generic 5.4.0-164.181
+linux-image-virtual 5.4.0.164.161
+linux-modules-5.4.0-164-generic 5.4.0-164.181
+linux-virtual 5.4.0.164.161
@@ -579,4 +579,4 @@
-vim 2:8.1.2269-1ubuntu5.17
-vim-common 2:8.1.2269-1ubuntu5.17
-vim-runtime 2:8.1.2269-1ubuntu5.17
-vim-tiny 2:8.1.2269-1ubuntu5.17
+vim 2:8.1.2269-1ubuntu5.18
+vim-common 2:8.1.2269-1ubuntu5.18
+vim-runtime 2:8.1.2269-1ubuntu5.18
+vim-tiny 2:8.1.2269-1ubuntu5.18
@@ -589 +589 @@
-xxd 2:8.1.2269-1ubuntu5.17
+xxd 2:8.1.2269-1ubuntu5.18

cloud-init seems like the most likely change to be affecting this issue.

So then I tried downgrading cloud-init to 23.2.2-0ubuntu0~20.04.1 in 20231011 using mount-image-callback. I tried uvt-kvm wait against this three times, and it consistently works.

To try to eliminate me accidentally fixing the issue due to my method itself, I tried reinstalling cloud-init 23.3.1-0ubuntu1~20.04.1 using mount-image-callback. I tried uvt-kvm wait against this three times, and it consistently fails.

Conclusion: something changed in cloud-init 23.3.1-0ubuntu1~20.04.1 to consistently cause early ssh to fail, whereas it did not before. This could be that the ssh listening socket is open before pam_nologin is deactivated whereas this didn't happen before, or because the window when it is active has grown. The former suggests a behavioural change; the latter suggests a "time to ssh" boot speed regression. So it seems appropriate to add a cloud-init regression task to this bug.

Neverthel...

Read more...

tags: added: regression-update
Revision history for this message
Robie Basak (racb) wrote :

> So it seems appropriate to add a cloud-init regression task to this bug.

To be clear, I'm not claiming that this definitely is a problem in cloud-init; just that appears it may be and warrants further investigation.

Revision history for this message
Thibf (thibf) wrote :

Looking a bit into cloud-init between 23.2.2 and 23.3.1 and found this:

```
commit b3c9b6a79c85ebc8c87908383c34b0314c2205b6
Refs: 23.2-53-gb3c9b6a7
Author: Brett Holman <email address hidden>
AuthorDate: Mon Jul 10 08:36:31 2023 -0600
Commit: GitHub <email address hidden>
CommitDate: Mon Jul 10 08:36:31 2023 -0600

    systemd: Block login until config stage completes (#2111)

    This is required to prevent login prompt prior to user provisioning.
    Performance testing reveals no regression.

    LP: #2013403
---
```

Which seems like a good culprit.

Revision history for this message
Chad Smith (chad.smith) wrote :

Yes the above comment/link is the source of this condition we now are affected on in uvt-kvm wait. Sorry, we are chatting on this internally.

I'm not certain this is truly a bug that cloud-init should fix/revert in this case as the solution was closing a different gap which incorrectly allowed logins prompts prior to cloud-init's default user being created and passwords being setup.

The change we landed in cloud-init was to shift the `Before=systemd-user-sessions.service` from ordering defined in cloud-init.service (earlier boot stage). To cloud-config.service (where ssh_import_id) is performed. Ultimately, this pushed back systemd-user-sessions.service which is the service that cleans up /run/nologin (which contains that unprivileged login rjection message). So now uvt-kvm wait is hitting a case where ssh port 22 is being brought up by ssh.service, beforce /run/nologin is removed by systemd-user-sessions.service and we get a different failure path/behavior.

Revision history for this message
Brett Holman (holmanb) wrote :

This condition was possible prior to the change in cloud-init. Cloud-init's ordering change just revealed this unhandled condition in uvtool.

From systemd-user-sessions.service(8):

> After basic system initialization is complete, it removes /run/nologin, thus permitting logins.

From uvt-kvm(1):

> Wait for a VM to become ready....

Given the purpose and scope of uvtool's `uvt-kvm wait` and systemd-user-sessions.service, this looks like a condition that uvtool should really be able to handle; when logins are not permitted the VM is not "ready".

Cloud-init made the change in b3c9b6a79c to fix a real bug caused by implicit systemd ordering assumptions proved invalid (observed on odd architectures and virtualized instruction sets).

Ultimately, this isn't cloud-init's responsibility, and uvtool should likewise be fixed to handle this condition.

That said, due to LP: #2039505, this change will temporarily be reverted until snapd is no longer an ordering dependency, since #2039505 is a serious regression.

Advise this delay be used to fix uvtool such that when cloud-init re-implements this change uvtool works seamlessly.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in cloud-init (Ubuntu):
status: New → Confirmed
Revision history for this message
Robie Basak (racb) wrote :

Fixed in uvtool commit 35dc66 in the main branch. This needs Ubuntu uploads and SRUs.

Changed in uvtool:
status: Triaged → Fix Committed
Changed in uvtool (Ubuntu):
status: New → Triaged
Changed in uvtool (Ubuntu Focal):
status: New → Triaged
Changed in uvtool (Ubuntu Jammy):
status: New → Triaged
Changed in uvtool (Ubuntu Lunar):
status: New → Triaged
Changed in uvtool (Ubuntu Mantic):
status: New → Triaged
Robie Basak (racb)
Changed in uvtool (Ubuntu):
status: Triaged → Fix Committed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package uvtool - 0~git183-0ubuntu1

---------------
uvtool (0~git183-0ubuntu1) noble; urgency=medium

  [ Lena Voytek ]
  * Swap simplestreams test comments to allowlist for inclusive naming

  [ Andrei Gherzan ]
  * Fix "uvt-kvm list" behaviour on non-login shells (LP: #2002530):
    - Clarify comment for the list command
    - Force virsh for the qemu system daemon
    - Refactor libvirt connection usage

  [ Thibault Ferrante ]
  * Continue waiting while pam_nologin is active in the VM (LP: #2039441)

 -- Robie Basak <email address hidden> Tue, 31 Oct 2023 09:11:32 +0000

Changed in uvtool (Ubuntu):
status: Fix Committed → Fix Released
Revision history for this message
Alberto Contreras (aciba) wrote :

The cloud-init commits provoking this new behavior / issue were reverted as part of https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/2039505/comments/1.

Changed in cloud-init (Ubuntu):
status: Confirmed → Fix Committed
Changed in cloud-init (Ubuntu Focal):
status: New → Fix Committed
Changed in cloud-init (Ubuntu Jammy):
status: New → Fix Committed
Changed in cloud-init (Ubuntu Lunar):
status: New → Fix Committed
Changed in cloud-init (Ubuntu Mantic):
status: New → Fix Committed
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.