rpool fails to import on first boot after automated install (failed PRE_SHUTDOWN?)

Bug #2073772 reported by Richard Hesketh
This bug affects 1 person
Affects: subiquity (Ubuntu)
Status: Fix Committed
Importance: Medium
Assigned to: Olivier Gayot

Bug Description

After installing 24.04 with a ZFS root, the system fails on first boot with an error that the rpool belongs to another system. This suggests the installer is failing to export the pools after the install completes. The zpool import must be manually forced for the boot to proceed, as in the attached screenshot.
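For reference, the manual recovery amounts to roughly the following at the emergency prompt (a sketch - the exact prompt and pool state in the screenshot may differ slightly):

```
# typed by hand at the emergency/initramfs shell on first boot (sketch)
zpool import -f rpool
exit   # resume the boot once the pool has imported
```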

For greater context: I work with an infrastructure in which we network-boot machines from the Ubuntu Server ISO with a preseed to automate the OS installation on our bare metal. The intended result is that we can flag a server for reinstallation, reboot it, and then wait until it comes back up with a clean install, with no user intervention required or desired. For the 22.04 images this works fine and the server installs and reboots cleanly. With the 24.04 installer it doesn't, because of the above error with the zpool after reboot. The same preseed that produces a booting system from the 22.04(.4) ISO results in failure with the 24.04 version.

This seems to be the same ultimate failure mode as in https://bugs.launchpad.net/subiquity/+bug/2049761, although in that issue the reporter used the desktop installer and appears to have rebooted the system before the installer had completely finished. That can't be the case for us, as we run the installer fully automated and it chooses if and when to restart.

I will attach the installer-journal.txt log from the host, though obviously that only extends as far as the moment the installer copies its logfiles into the target. As the installation is done by netbooting the ISO, the installer environment itself is entirely ephemeral, so the complete logs from the installation media are not available.

Revision history for this message
Dan Bungert (dbungert) wrote :

Hi Richard, thanks for the report.

Yes, it's surely similar to the bug you linked - a zpool export is supposed to happen.

I would appreciate more of the logs from the install - a tarball of /var/log/installer would help, as the journal doesn't have it all.

I wonder what's going on in your late commands - is it possible that a process has a file open which is ultimately blocking the zpool export? Just a guess. To that end, you might consider temporarily removing the late-commands section, for debugging purposes, to see if you get different results.
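For example, something along these lines run from the live environment just before reboot (an untested sketch) would show whether anything still has files open on the target:

```
# list any processes still holding files open on the /target mount
fuser -vm /target
```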

Changed in subiquity:
status: New → Incomplete
Revision history for this message
Richard Hesketh (richh-bbc) wrote :

Hi Dan,

Thanks for looking at this. Your initial idea about a leftover process holding a file sounded promising, so I've done a bit of testing. Unfortunately it's only left me more confused!

I have tried entirely removing the late command from the preseed in case anything done there was causing a problem. However, it does not seem to make a difference - on reboot the system still failed to import the zpools. I will attach the relevant logfiles from this attempt.

Aside from that, I have a workaround, as I mentioned in the other report, which uses late commands to manipulate the installer's hostid and bounce the zpool:

    - cp /target/etc/hostid /etc/hostid
    - umount -l /target
    - zpool export rpool
    - zpool import rpool -f
    - mount -t zfs -o zfsutil rpool/ROOT/zfsroot /target

A lazy unmount is required here or the umount call fails, but once that is done it is possible to export and re-import the pool without issues. I tried a variant that did not copy the hostid, in case bouncing the pool was itself sufficient to clear whatever was preventing the installer's export attempt, but it didn't work - the pool still failed to import at reboot.
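(In case it's useful to anyone, a rough way to compare the two host identities from the live installer environment - just a sketch:)

```
hostid                  # hostid of the live installer environment
chroot /target hostid   # hostid the installed system will report
```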

And as mentioned before, this problem doesn't happen at all with the 22.04 installer - we only see it (even when using the same preseed) when installing 24.04.

Dan Bungert (dbungert)
Changed in subiquity:
status: Incomplete → Triaged
importance: Undecided → Medium
Dan Bungert (dbungert)
summary:
- rpool fails to import on first boot after automated install
+ rpool fails to import on first boot after automated install (failed PRE_SHUTDOWN?)
affects: subiquity → subiquity (Ubuntu)
tags: added: foundations-todo
Olivier Gayot (ogayot)
Changed in subiquity (Ubuntu):
assignee: nobody → Olivier Gayot (ogayot)
Revision history for this message
Olivier Gayot (ogayot) wrote (last edit):

Hello Richard,

I've been trying to reproduce this bug today, so far without success. I used the following cloud-config:

```
#cloud-config
users:
    - default
    - name: root
      ssh_authorized_keys:
        - ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIMM/qhS3hS3+IjpJBYXZWCqPKPH9Zag8QYbS548iEjoZ olivier@camelair
autoinstall:
  version: 1
  storage:
    layout:
      name: zfs
  identity:
    username: ubuntu
    password: '$6$wdAcoXrU039hKYPd$508Qvbe7ObUnxoj15DRCkzC3qO7edjH0VV7BPNRDYK4QR8ofJaEEF2heacn0QgD.f8pO8SNp83XNdWG6tocBM1'
    realname: ''
    hostname: ubuntest
```

I used the `users` section to give myself SSH access to the installer environment (so I could run journalctl -f). Below is what the journal captured before the machine started rebooting; it seems to show that the installer did the right thing for me:

```
Aug 20 15:26:50 ubuntu-server systemd[1]: Started run-u70.service - /usr/sbin/zpool export -a.
Aug 20 15:26:50 ubuntu-server zed[21057]: eid=131 class=pool_export pool='bpool' pool_state=EXPORTED

Broadcast message from root@ubuntu-server (Tue 2024-08-20 15:26:50 UTC):

The system will reboot now!

Aug 20 15:26:50 ubuntu-server zed[21059]: eid=132 class=config_sync pool='bpool' pool_state=UNINITIALIZED
Aug 20 15:26:50 ubuntu-server zed[21061]: eid=133 class=pool_export pool='rpool' pool_state=EXPORTED
Aug 20 15:26:50 ubuntu-server zed[21063]: eid=134 class=config_sync pool='rpool' pool_state=UNINITIALIZED
Aug 20 15:26:50 ubuntu-server systemd[1]: run-u70.service: Deactivated successfully.
Aug 20 15:26:50 ubuntu-server systemd-logind[1250]: The system will reboot now!
Aug 20 15:26:50 ubuntu-server subiquity_event.1510[1510]: subiquity/Shutdown/shutdown: mode=REBOOT
```

Are you comfortable sharing the contents of your cloud-config / autoinstall file so I can try locally?

Thank you,
Olivier

Revision history for this message
Olivier Gayot (ogayot) wrote :

I opened https://github.com/canonical/subiquity/pull/2064 to tentatively fix the issue, but since I haven't reproduced the failure I can only hope that it does.

Revision history for this message
Olivier Gayot (ogayot) wrote :

Hello Richard,

I merged the change mentioned above and created an updated installer snap based on our stable noble branch.

If you would like to try and see if it addresses your issue, please add the following chunk to your autoinstall file:

```
refresh-installer:
  update: true
  channel: edge/error-systemd-shutdown
```

Thank you!

Changed in subiquity (Ubuntu):
status: Triaged → Fix Committed
Revision history for this message
Richard Hesketh (richh-bbc) wrote :

Hi Olivier,

Thanks for looking into the issue. I am currently otherwise occupied but I should be able to test this next week and will let you know how it goes.

Revision history for this message
Richard Hesketh (richh-bbc) wrote :

Hi Olivier,

I have tested the new installer snap as suggested but unfortunately it doesn't seem to have made a difference to the behaviour.

I will attach a copy of an autoinstall config which produces the problem for us (mildly redacted for internal network details). The way we're doing the ZFS configuration might be a bit funky and out of date: this part hasn't been updated for a while and dates back to releases where support for it may have been lacking, so some workarounds were required. (Our setup relies on other post-install automation to add a second device to the pool, for instance.)
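(For completeness, that post-install step is essentially a `zpool add` or `zpool attach`, depending on whether the second device extends the pool or mirrors the first one - a sketch with placeholder device paths:)

```
# placeholder device paths - extend the pool with a second vdev...
zpool add rpool /dev/disk/by-id/SECOND-DISK
# ...or mirror the existing device instead:
# zpool attach rpool /dev/disk/by-id/FIRST-DISK /dev/disk/by-id/SECOND-DISK
```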

Thanks,
Rich

Revision history for this message
Olivier Gayot (ogayot) wrote :

Thanks Richard! Using your autoinstall file, I was able to reproduce the issue.

My previous assumption was that the call to `zpool export` was failing, but it now looks like it is not executed at all. No wonder the fix I came up with didn't work. It is probably skipped because the len() test below returns 0:

        if len(self.model._all(type="zpool")) > 0:
            await self.app.command_runner.run(["zpool", "export", "-a"])

https://github.com/canonical/subiquity/blob/f20135f3c705f3862595b0d08f14286e7ee48c1a/subiquity/server/controllers/filesystem.py#L1647

Revision history for this message
Olivier Gayot (ogayot) wrote :

I've confirmed it's the absence of an explicit zpool in the configuration that prevented the `zpool export -a` call from running.

https://github.com/canonical/subiquity/pull/2077 should address the issue.
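For anyone who wants to double-check the behaviour from the live environment once the install has finished (before the reboot), a quick sanity check might look like this sketch:

```
# after the install completes, the pools should already be exported
zpool list                                            # expect "no pools available"
journalctl -b | grep -E 'zpool export|pool_export'    # the export should show up in the journal
```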

Changed in subiquity (Ubuntu):
status: Fix Committed → In Progress
Revision history for this message
Olivier Gayot (ogayot) wrote :

Hello Richard,

It looks like the new fix I wrote addresses the issue in my tests, but if you would like to try it out yourself, you can use the following chunk:

```
refresh-installer:
  update: true
  channel: edge/zpool-export-always
```

Thanks,
Olivier

Olivier Gayot (ogayot)
Changed in subiquity (Ubuntu):
status: In Progress → Fix Committed