Commissioning fails when a node is rebooted in the middle - No such file or directory: '/var/run/lldpd.socket'

Bug #2008454 reported by Nobuto Murata
This bug affects 1 person
Affects   Status    Importance   Assigned to   Milestone
MAAS      Triaged   Medium       Unassigned    3.4.x

Bug Description

maas: 1:3.3.0-13159-g.1c22f7beb-0ubuntu1~22.04.1

When a node is rebooted in the middle of commissioning for any reason (admittedly a corner case), commissioning scripts that already completed appear to be skipped on the next boot. The process then fails because lldpd is not installed again in the new boot.

[maas-capture-lldpd]
Traceback (most recent call last):
  File "/tmp/user_data.sh.4Jn6tY/scripts/commissioning/maas-capture-lldpd", line 53, in <module>
    lldpd_capture("/var/run/lldpd.socket", 60)
  File "/tmp/user_data.sh.4Jn6tY/scripts/commissioning/maas-capture-lldpd", line 41, in lldpd_capture
    time_ref = getmtime(reference_file)
  File "/usr/lib/python3.8/genericpath.py", line 55, in getmtime
    return os.stat(filename).st_mtime
FileNotFoundError: [Errno 2] No such file or directory: '/var/run/lldpd.socket'
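
The failure mode in the traceback can be reproduced outside MAAS: getmtime() raises FileNotFoundError when the path does not exist, which is the situation when lldpd was never installed in the current boot. The following is a minimal sketch only; socket_mtime_or_none is a hypothetical helper and is not part of maas-capture-lldpd, which calls getmtime() directly.

```python
import os
import tempfile
from os.path import getmtime


def socket_mtime_or_none(reference_file):
    """Return the mtime of reference_file, or None if it does not exist.

    Hypothetical helper for illustration only. The script in the
    traceback calls getmtime() directly, so a missing socket surfaces
    as an unhandled FileNotFoundError.
    """
    try:
        return getmtime(reference_file)
    except FileNotFoundError:
        return None


if __name__ == "__main__":
    # Use a path inside a fresh temporary directory so that it is
    # guaranteed not to exist, standing in for /var/run/lldpd.socket.
    missing = os.path.join(tempfile.mkdtemp(), "lldpd.socket")
    if socket_mtime_or_none(missing) is None:
        print("socket missing: lldpd is probably not installed or not running")
```

Whether the capture script should report a clearer error or the install script should be re-run after a reboot is a design decision for MAAS; the sketch only shows where the exception comes from.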

Interestingly, the "Installing apt packages for 20-maas-01-install-lldpd" message appears in rsyslog only for the 5 machines that succeeded, although all 7 machines should have an identical commissioning configuration.

$ sudo grep -i 'Installing apt packages for 20-maas-01-install-lldpd' -l /var/log/maas/rsyslog/*/*/messages
/var/log/maas/rsyslog/crack-chow/2023-02-22/messages
/var/log/maas/rsyslog/enough-camel/2023-02-22/messages
/var/log/maas/rsyslog/open-bee/2023-02-22/messages
/var/log/maas/rsyslog/quick-gopher/2023-02-22/messages
/var/log/maas/rsyslog/wanted-louse/2023-02-22/messages
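
The grep above lists the machines that logged the message. To list the complement (machines that never logged it), a short sketch along these lines may help. It assumes the <log_root>/<hostname>/<date>/messages layout seen in the output above; machines_without_message is a hypothetical name, not a MAAS tool.

```python
from pathlib import Path

MESSAGE = "Installing apt packages for 20-maas-01-install-lldpd"


def machines_without_message(log_root, message=MESSAGE):
    """Return sorted hostnames whose messages files never contain message.

    Assumes the layout <log_root>/<hostname>/<date>/messages.
    Matching is case-insensitive, like `grep -i`.
    """
    needle = message.lower()
    found_by_host = {}
    for path in Path(log_root).glob("*/*/messages"):
        host = path.parent.parent.name
        text = path.read_text(errors="replace").lower()
        # A host counts as "found" if any of its dated files has the message.
        found_by_host[host] = found_by_host.get(host, False) or needle in text
    return sorted(host for host, found in found_by_host.items() if not found)
```

Calling machines_without_message("/var/log/maas/rsyslog") needs read access to the log files, which is why the grep above was run with sudo.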

Nobuto Murata (nobuto) wrote :

Failed machine

Nobuto Murata (nobuto) wrote :

Succeeded machine

Nobuto Murata (nobuto)
summary: - Commissioning intermittently fails with No such file or directory:
- '/var/run/lldpd.socket'
+ Commissioning fails when a node is rebooted in the middle - No such file
+ or directory: '/var/run/lldpd.socket'
description: updated
Nobuto Murata (nobuto) wrote :

I think this is a good overview of what's going on.

20-maas-01-install-lldpd passed during the initial boot (its status changed from 'Running' to 'Passed'), so it was not run again on the second boot. As a result, there is no lldpd binary in the ephemeral OS.

$ grep -E 'Script|PXE' commissioning-events.log
Wed, 01 Mar. 2023 02:54:09 Script - maas-capture-lldpd failed
Wed, 01 Mar. 2023 02:54:08 Script result - maas-lshw changed status from 'Running' to 'Passed'
Wed, 01 Mar. 2023 02:54:08 Script result - maas-list-modaliases changed status from 'Running' to 'Passed'
...

Wed, 01 Mar. 2023 02:54:07 Script result - maas-lshw changed status from 'Pending' to 'Running'
Wed, 01 Mar. 2023 02:54:06 Script result - maas-capture-lldpd changed status from 'Pending' to 'Running'
Wed, 01 Mar. 2023 02:54:06 Script result - 50-maas-01-commissioning changed status from 'Running' to 'Passed'
Wed, 01 Mar. 2023 02:54:04 Script result - 50-maas-01-commissioning changed status from 'Pending' to 'Running'
...

Wed, 01 Mar. 2023 02:53:54 Script result - 20-maas-02-dhcp-unconfigured-ifaces changed status from 'Running' to 'Passed'
Wed, 01 Mar. 2023 02:52:55 Performing PXE boot
Wed, 01 Mar. 2023 02:52:55 PXE Request - commissioning
Wed, 01 Mar. 2023 02:52:26 Script result - 20-maas-02-dhcp-unconfigured-ifaces changed status from 'Pending' to 'Running'
Wed, 01 Mar. 2023 02:52:26 Script result - 20-maas-01-install-lldpd changed status from 'Running' to 'Passed'
Wed, 01 Mar. 2023 02:52:26 Script result - 20-maas-01-install-lldpd changed status from 'Installing dependencies' to 'Running'
Wed, 01 Mar. 2023 02:52:18 Script result - 20-maas-01-install-lldpd changed status from 'Pending' to 'Installing dependencies'
Wed, 01 Mar. 2023 02:51:11 Performing PXE boot
Wed, 01 Mar. 2023 02:51:11 PXE Request - commissioning
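
The same reading of the event log can be automated. The sketch below assumes the line format shown in the excerpt above (newest entry first) and uses a hypothetical function name, passed_before_last_boot. It reports scripts that reached 'Passed' before the most recent "Performing PXE boot" entry and have no status change after it.

```python
import re

RESULT_RE = re.compile(
    r"Script result - (?P<name>\S+) changed status from "
    r"'(?P<old>[^']+)' to '(?P<new>[^']+)'"
)


def passed_before_last_boot(lines):
    """Return scripts that passed before the most recent PXE boot and
    show no status change after it.

    `lines` must be ordered newest-first, as in the event log excerpt.
    """
    changed_after_boot = set()  # scripts with activity since the last boot
    skipped = []
    seen_boot = False
    for line in lines:
        if "Performing PXE boot" in line:
            # Everything read from here on happened before the last boot.
            seen_boot = True
            continue
        match = RESULT_RE.search(line)
        if not match:
            continue
        name = match["name"]
        if not seen_boot:
            changed_after_boot.add(name)
        elif (
            match["new"] == "Passed"
            and name not in changed_after_boot
            and name not in skipped
        ):
            skipped.append(name)
    return skipped
```

Fed the lines of commissioning-events.log, this kind of check would flag 20-maas-01-install-lldpd for the failed machine, since it passed at 02:52:26 and has no entries after the PXE boot at 02:52:55.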

Nobuto Murata (nobuto) wrote :

> I'm not 100% sure but it looks like the behavior is changed between 3.2 and 3.3.

For the record, that was not the case. The behavior is consistent between 3.2 and 3.3.

description: updated
Björn Tillenius (bjornt) wrote :

Yes, I've noticed this as well.

Changed in maas:
status: New → Triaged
importance: Undecided → Medium
milestone: none → 3.4.0
Alberto Donato (ack)
Changed in maas:
milestone: 3.4.0 → 3.4.x