snapd.seeded.service blocks for a long time preventing cloud-init from completing

Bug #1834190 reported by Jeremy Collin
This bug affects 2 people
Affects     Status       Importance  Assigned to      Milestone
cloud-init  Invalid      Undecided   Unassigned       –
snapd       In Progress  High        Paweł Stołowski  –

Bug Description

We use qcow2 images to install bare-metal servers. Our datasource is a small ConfigDrive partition containing meta_data.json and vendor_data.json, which is required. The vendor_data.json provides information about default user password settings.
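For reference, a minimal sketch of what such a vendor_data.json might look like, assuming the OpenStack convention of a "cloud-init" key carrying cloud-config content; the user name and password here are placeholders:

{
  "cloud-init": "#cloud-config\nchpasswd:\n  expire: false\n  list: |\n    ubuntu:placeholder-password\n"
}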

This works well for every Linux distribution we use, unless cloud-init is 19.1.

The module steps no longer run:

cat /run/cloud-init/status.json
{
 "v1": {
  "datasource": "DataSourceConfigDrive [net,ver=2][source=/dev/sda4]",
  "init": {
   "errors": [],
   "finished": 1561465042.7556307,
   "start": 1561465042.1868145
  },
  "init-local": {
   "errors": [],
   "finished": 1561465035.4210315,
   "start": 1561465034.724825
  },
  "modules-config": {
   "errors": [],
   "finished": null,
   "start": null
  },
  "modules-final": {
   "errors": [],
   "finished": null,
   "start": null
  },
  "modules-init": {
   "errors": [],
   "finished": null,
   "start": null
  },
  "stage": null
 }
}

and in the /var/lib/cloud/instance/sem directory we can see that the modules have no semaphore files.
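For reference, a quick way to check this on a booted instance; each module that ran leaves a semaphore file behind:

ls /var/lib/cloud/instance/sem/
# on a healthy boot this lists consume_data plus one config_* file per module,
# e.g. config_seed_random, config_set_passwords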

Ryan Harper (raharper) wrote :

Thanks for filing a bug.

Could you run 'cloud-init collect-logs' and attach the tarball it creates?

Changed in cloud-init:
status: New → Incomplete
Jeremy Collin (jcollin) wrote :

The logs will not be from the same server, but the problem is the same.

Jeremy Collin (jcollin) wrote :

Ryan Harper (raharper) wrote :

Thanks for the logs. Could you be more specific about which files you mean are present in which path?

/var/lib/cloud/instance is a symlink that points to the current instance-id
 under /var/lib/cloud/instances. And in your case the log shows this:

2019-06-27 07:20:18,617 - util.py[DEBUG]: Creating symbolic link from '/var/lib/cloud/instance' => '/var/lib/cloud/instances/93471835-2d79-4c91-bed6-4a733a60a185'

And then cloud-init logs all of the semaphores it writes:

% grep instances cloud-init.log
2019-06-27 07:20:18,617 - util.py[DEBUG]: Creating symbolic link from '/var/lib/cloud/instance' => '/var/lib/cloud/instances/93471835-2d79-4c91-bed6-4a733a60a185'
2019-06-27 07:20:18,618 - util.py[DEBUG]: Reading from /var/lib/cloud/instances/93471835-2d79-4c91-bed6-4a733a60a185/datasource (quiet=False)
2019-06-27 07:20:18,618 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/93471835-2d79-4c91-bed6-4a733a60a185/datasource - wb: [644] 75 bytes
2019-06-27 07:20:24,975 - util.py[DEBUG]: Creating symbolic link from '/var/lib/cloud/instance' => '/var/lib/cloud/instances/93471835-2d79-4c91-bed6-4a733a60a185'
2019-06-27 07:20:24,975 - util.py[DEBUG]: Reading from /var/lib/cloud/instances/93471835-2d79-4c91-bed6-4a733a60a185/datasource (quiet=False)
2019-06-27 07:20:24,976 - util.py[DEBUG]: Read 75 bytes from /var/lib/cloud/instances/93471835-2d79-4c91-bed6-4a733a60a185/datasource
2019-06-27 07:20:24,976 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/93471835-2d79-4c91-bed6-4a733a60a185/datasource - wb: [644] 75 bytes
2019-06-27 07:20:25,004 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/93471835-2d79-4c91-bed6-4a733a60a185/user-data.txt - wb: [600] 0 bytes
2019-06-27 07:20:25,021 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/93471835-2d79-4c91-bed6-4a733a60a185/user-data.txt.i - wb: [600] 308 bytes
2019-06-27 07:20:25,022 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/93471835-2d79-4c91-bed6-4a733a60a185/vendor-data.txt - wb: [600] 113 bytes
2019-06-27 07:20:25,024 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/93471835-2d79-4c91-bed6-4a733a60a185/vendor-data.txt.i - wb: [600] 418 bytes
2019-06-27 07:20:25,025 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/93471835-2d79-4c91-bed6-4a733a60a185/sem/consume_data - wb: [644] 24 bytes
2019-06-27 07:20:25,025 - helpers.py[DEBUG]: Running consume_data using lock (<FileLock using file '/var/lib/cloud/instances/93471835-2d79-4c91-bed6-4a733a60a185/sem/consume_data'>)
2019-06-27 07:20:25,027 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/93471835-2d79-4c91-bed6-4a733a60a185/cloud-config.txt - wb: [600] 0 bytes
2019-06-27 07:20:25,030 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/93471835-2d79-4c91-bed6-4a733a60a185/vendor-cloud-config.txt - wb: [600] 159 bytes
2019-06-27 07:20:25,105 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/93471835-2d79-4c91-bed6-4a733a60a185/sem/config_seed_random - wb: [644] 24 bytes
2019-06-27 07:20:25,105 - helpers.py[DEBUG]: Running config-seed_random using lock (<FileLock using file '/var/lib/cloud/instances/93471835-2d79-4c91-bed6-4a733a60a185/sem/config_seed_random'>)
2019-06-27 07:20:25,106 - util.py[DEBUG]...


Jeremy Collin (jcollin) wrote :

We are using the official images (https://cloud-images.ubuntu.com/disco/current/disco-server-cloudimg-amd64.img, and https://cloud-images.ubuntu.com/cosmic/current/cosmic-server-cloudimg-amd64.img)

The process we use is:
1/ we create a configDrive
2/ we rsync the image onto the remaining disk
3/ we reboot the server

At this point we take no other actions. This process worked well until cloud-init was upgraded to 19.1.

The files I mentioned are the semaphore files, which show which actions have been completed.

In the vendor_data.json we have the cc_set_passwords settings, but the module is not run. <- This is the main problem.

If you need any other information, just ask.

Chad Smith (chad.smith) wrote :

Hi Jeremy,
thanks for filing the bug and for making cloud-init better. Taking your cloud-init.log and running cloud-init analyze show on it, it seems your environment isn't running either the modules-config or the modules-final stage, only the init-local and init-network stages, which is why status.json has no timestamps for those boot stages. I'm wondering if something in the environment has disabled or avoided running those stages (like a broken systemd dependency chain).
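For reference, the timing report below comes from something like the following, assuming the attached log was saved locally as cloud-init.log:

cloud-init analyze show -i cloud-init.log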

Starting stage: init-local
|`->no cache found @00.00400s +00.00000s
|`->found local data from DataSourceConfigDrive @00.01400s +00.38300s
Finished stage: (init-local) 00.82400 seconds

Starting stage: init-network
|`->restored from cache with run check: DataSourceConfigDrive [net,ver=2][source=/dev/sda4] @06.75200s +00.00400s
|`->setting up datasource @06.80200s +00.00000s
|`->reading and applying user-data @06.82300s +00.00200s
|`->reading and applying vendor-data @06.82500s +00.00400s
|`->activating datasource @06.84900s +00.00100s
|`->config-migrator ran successfully @06.90200s +00.00000s
|`->config-seed_random ran successfully @06.90300s +00.00000s
|`->config-bootcmd ran successfully @06.90400s +00.00000s
|`->config-write-files ran successfully @06.90400s +00.00100s
|`->config-growpart ran successfully @06.90500s +00.01900s
|`->config-resizefs ran successfully @06.92400s +00.37100s
|`->config-disk_setup ran successfully @07.29500s +00.00100s
|`->config-mounts ran successfully @07.29600s +00.14700s
|`->config-set_hostname ran successfully @07.44300s +00.00100s
|`->config-update_hostname ran successfully @07.44500s +00.00000s
|`->config-update_etc_hosts ran successfully @07.44600s +00.00000s
|`->config-ca-certs ran successfully @07.44600s +00.00100s
|`->config-rsyslog ran successfully @07.44700s +00.00000s
|`->config-users-groups ran successfully @07.44800s +00.88100s
|`->config-ssh ran successfully @08.32900s +00.30600s
Finished stage: (init-network) 01.90700 seconds

Total Time: 2.73100 seconds

-- Boot Record 02 --
The total time elapsed since completing an event is printed after the "@" character.
The time the event takes is printed after the "+" character.

Starting stage: init-local
|`->cache invalid in datasource: DataSourceConfigDrive [net,ver=2][source=/dev/sda4] @00.02200s +00.05000s
|`->found local data from DataSourceConfigDrive @00.07400s +00.48400s
Finished stage: (init-local) 00.70800 seconds

Starting stage: init-network
|`->restored from cache with run check: DataSourceConfigDrive [net,ver=2][source=/dev/sda4] @06.43400s +00.00300s
|`->setting up datasource @06.48600s +00.00000s
|`->reading and applying user-data @06.51700s +00.01000s
|`->reading and applying vendor-data @06.52700s +00.00400s
|`->activating datasource @06.55100s +00.00100s
|`->config-migrator ran successfully @06.59200s +00.00100s
|`->config-seed_random previously ran @06.59300s +00.00000s
|`->config-bootcmd ran successfully @06.59300s +00.00100s
|`->config-write-files previously ran @06.59400s +00.00000s
|`->config-growpart ran successfully @06.59400s +00.01900s
|`->config-resizefs ran successfully @06.61400s +00.28000s
|`->config-disk_setup previously ran @06.89500s +00.00000s
|`->config-m...


Ryan Harper (raharper) wrote :

@Jeremy

2019-06-27 07:20:18,586 - openstack.py[DEBUG]: Failed reading optional path /run/cloud-init/tmp/tmp9gnk8kxy/openstack/latest/network_data.json due to: [Errno 2] No such file or directory: '/run/cloud-init/tmp/tmp9gnk8kxy/openstack/latest/network_data.json'

That looks like what you're seeing.

Could you dump/attach a sanitized version of your ConfigDrive so we can reproduce?

Jeremy Collin (jcollin) wrote :

Oh, is the network_data.json mandatory now?

Jeremy Collin (jcollin) wrote :

I just tried with an empty network_data.json and an empty user_data in the ConfigDrive; the result is still the same. I'll attach an ISO file of the config drive.
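For reference, a sketch of how such a ConfigDrive ISO can be built, assuming the metadata files live under ./config-drive/openstack/latest/ and the standard config-2 volume label:

genisoimage -o configdrive.iso -V config-2 -r -J ./config-drive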

Jeremy Collin (jcollin) wrote :

Ryan Harper (raharper)
Changed in cloud-init:
status: Incomplete → Triaged

Ryan Harper (raharper) wrote :

I cannot recreate the failure with your config drive and the current disco or cosmic image; however, those images do allow you to log in before cloud-init is complete. You can watch/wait for cloud-init to complete with 'cloud-init status --long --wait', and 'systemctl list-jobs' will show you any pending units that are not yet complete. To see the extra time it takes, look at this:

% systemd-analyze critical-chain cloud-final.service
cloud-final.service +481ms
└─cloud-config.service @1min 10.653s +5.742s
  └─snapd.seeded.service @23.132s +47.511s
    └─snapd.service @54.339s +3.108s
      └─snap-lxd-10972.mount @58.097s +28ms
        └─dev-loop1.device @58.116s +47ms

It takes 47 seconds for snapd.seeded.service to install the two snaps, and cloud-final.service will wait until snapd.seeded.service is complete, as cloud-init defers package and snap installs until after snapd is up.
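For reference, that ordering can be confirmed on a booted instance; a quick check, assuming Ubuntu's stock units:

systemctl show -p After cloud-config.service | tr ' ' '\n' | grep snapd
# expected to list (among others): snapd.seeded.service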

On your failing instance's log, I see a snapd message about an error installing the core snap. Could you run:

snap list
snap changes

And for any errors in the snap changes output, please run:

snap tasks <ID>

to get the error message.

Changed in cloud-init:
status: Triaged → Incomplete
Jeremy Collin (jcollin) wrote :

---
root@pwet:~# cloud-init status --long --wait
status: error
time: Tue, 02 Jul 2019 07:35:35 +0000
detail:
'NoneType' object has no attribute 'version'

root@pwet:~# snap list
No snaps are installed yet. Try 'snap install hello-world'.

root@pwet:~# snap changes
ID Status Spawn Ready Summary
1 Error today at 07:35 UTC today at 07:35 UTC Initialize system state
2 Done today at 07:35 UTC today at 07:35 UTC Initialize device

root@pwet:~# systemctl list-jobs
JOB UNIT TYPE STATE
107 snapd.seeded.service start running
103 snapd.autoimport.service start waiting
  2 multi-user.target start waiting
105 systemd-update-utmp-runlevel.service start waiting
111 cloud-config.service start waiting
  1 graphical.target start waiting
109 cloud-init.target start waiting
110 cloud-final.service start waiting

8 jobs listed.

root@pwet:~# systemd-analyze critical-chain cloud-final.service
Bootup is not yet finished (org.freedesktop.systemd1.Manager.FinishTimestampMonotonic=0).
Please try again later.
Hint: Use 'systemctl list-jobs' to see active jobs
---

I found something (https://bugs.launchpad.net/ubuntu/+source/snapd/+bug/1779948) and it seems quite like the problem I have: snapd.seeded.service does not start properly. I still have to find out why.

I purged snapd and squashfs-tools, reinstalled them, and cloud-init finished its startup process (modules-config & modules-final).
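For reference, the purge/reinstall was presumably something like the following, assuming Ubuntu's apt:

sudo apt purge snapd squashfs-tools
sudo apt install snapd squashfs-tools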

I now have to figure out why.

Thanks

Jeremy Collin (jcollin) wrote :

I've masked snapd before the first boot, and everything works fine now.
The problem is not in cloud-init; it's just snapd.seeded.service stalling and preventing cloud-init from finishing its work.
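For reference, a sketch of how snapd can be masked in the image before first boot, assuming the image's root filesystem is mounted at /mnt (masking a unit is a symlink to /dev/null under /etc/systemd/system):

ln -sf /dev/null /mnt/etc/systemd/system/snapd.service
ln -sf /dev/null /mnt/etc/systemd/system/snapd.seeded.service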

Ryan Harper (raharper) wrote :

I've added a snapd task to this bug so that team can look into what's going on with snapd.seeded.service.

Changed in cloud-init:
status: Incomplete → Invalid
summary: - Modules steps not taken into account since 19.1
+ snapd.seeded.service blocks for a long time preventing cloud-init from
+ completing
Jeremy Collin (jcollin) wrote :

Thanks

Zygmunt Krynicki (zyga) wrote :

This topic was discussed at the Paris engineering sprint and we have work scheduled to understand the slowest parts and optimise them with an additional pass that can be performed on a built image.

I'm assigning this bug to Pawel as he is currently looking into it.

Changed in snapd:
status: New → Confirmed
assignee: nobody → Paweł Stołowski (stolowski)
status: Confirmed → In Progress
importance: Undecided → High
Ian Johnson (anonymouse67) wrote :

What is left to do here? We now have snap-preseed, which can speed up classic cloud image boot times, so I'm inclined to close this as Fix Released.
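For reference, a sketch of how preseeding is applied to an image; this assumes the snap-preseed tool shipped with snapd (at /usr/lib/snapd/snap-preseed) and an image root filesystem mounted at /mnt/image:

sudo /usr/lib/snapd/snap-preseed /mnt/image
# performs the seeding work ahead of time and records the result in the image,
# so the first boot only has to apply it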

Anisse Astier (anisse) wrote :

Bug #1846355 was marked as a duplicate of this one, but it seems the cloud image's initial boot is still quite long in groovy. Performance has greatly improved in groovy, but snapd still accounts for almost half of the time spent in the first boot.

snapd.seeded holds the top spot in systemd-analyze blame for this first boot, using the same script as in bug #1846355:

2min 10.521s snapd.seeded.service
     45.134s snapd.apparmor.service
     38.917s cloud-init.service
     37.197s dev-vda1.device
     36.135s cloud-config.service
     32.090s cloud-init-local.service
     27.304s snapd.service
     26.047s systemd-udev-settle.service
     24.403s apparmor.service
     22.987s cloud-final.service
     21.231s accounts-daemon.service
     20.568s pollinate.service
     17.044s snap.lxd.activate.service
     16.678s networkd-dispatcher.service
     11.466s systemd-udev-trigger.service
     11.386s apport.service
      9.659s systemd-logind.service
      9.568s grub-common.service
      8.293s keyboard-setup.service
      5.803s systemd-journald.service
      4.975s grub-initrd-fallback.service
      4.908s rsyslog.service
      4.742s user@1000.service
      4.481s blk-availability.service
      4.280s atd.service
      4.033s systemd-resolved.service
      3.656s systemd-timesyncd.service

(note that it's twice as fast as before)

Ian Johnson (anonymouse67) wrote :

@anisse, do you know whether that image was preseeded, and/or what platform/cloud it was on? AIUI, the CPC team is producing preseeded images for groovy, but perhaps not for all clouds/image sources. I will ask the CPC folks to see if the numbers you reported match up with what they expect to see.

Anisse Astier (anisse) wrote :

I do not know whether the image was preseeded; I took it as-is from the cloud-images mirror. I used the same script I posted, on the same machine; I just replaced eoan with groovy.

To provide more context: since this is emulated, the times are exaggerated. On arm64 hardware with KVM, the first boot time went from ~180s in eoan to ~75s in groovy, but this is still more than Debian bullseye's ~52s on the same hardware.

Elias Sanyana (lephoto)
Changed in snapd:
status: In Progress → New
Colin Watson (cjwatson)
Changed in snapd:
status: New → In Progress
James Falcon (falcojr) wrote :