deadlock in maas_run_remote_scripts.py

Bug #1799862 reported by Dmitry Sutyagin
This bug affects 1 person
Affects  Status        Importance  Assigned to  Milestone
MAAS     Fix Released  High        Lee Trager

Bug Description

I am troubleshooting a stuck "Testing" phase on a particular server. Commissioning passes, but testing never finishes (and never times out). The only test enabled for this node is "smartctl-validate". Debugging on the target node, I found that the python process is stuck doing nothing. Tracing with gdb showed that the main process and a child thread are both stuck on Lock.acquire() (in different places). There is one thread which sends heartbeats, but the other, which is supposed to run smartctl, is stuck on Lock.acquire(). I don't see a third thread for the second smartctl invocation (the server has 2 disks, /dev/sda and /dev/sdb). This state holds forever.

The traces are available here: http://paste.openstack.org/show/NW4pBwt8aqPvqbtN7R1x/

There are folders for smartctl-validate in the temp folder for both drives, but they are empty, and smartctl is not running. The file tree can be seen here: http://paste.openstack.org/show/LkE6Lsek48xTr40Hha7K/

Changed in maas:
milestone: none → 2.5.0
Lee Trager (ltrager) wrote :

When a hardware storage test like smartctl-validate is run, maas-run-remote-scripts uses lsblk to map the drive MAAS wants tested to what is on the system. To ensure lsblk is only called once, a Lock is used. I suspect lsblk is hanging in the thread for the other drive. As lsblk is run before the script actually starts running, the timeout counter hasn't started yet.

Do you see lsblk running in your process tree?

If you run `lsblk --exclude 1,2,7 -d -P -o NAME,MODEL,SERIAL` does it hang?

What version of MAAS are you using?

What commissioning operating system are you using?

Can you post the output of dmesg after maas-run-remote-scripts has hung?
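
The lock-guarded, run-once lsblk lookup described in this comment could look roughly like the sketch below. It is not the MAAS source: the names `get_block_devices`, `parse_lsblk_pairs`, `_block_devices` and the 60-second timeout are assumptions made for illustration. Only the lsblk command line is taken from the comment. The `with` statement releases the lock even if lsblk fails, and the `timeout` argument keeps a hanging lsblk from blocking the caller indefinitely.

```python
import shlex
import subprocess
import threading

# Hypothetical module-level cache, filled by the first successful call.
_block_devices = None
_block_devices_lock = threading.Lock()

# Command line quoted in the comment above.
LSBLK_CMD = ["lsblk", "--exclude", "1,2,7", "-d", "-P", "-o", "NAME,MODEL,SERIAL"]


def parse_lsblk_pairs(output):
    """Turn `lsblk -P` lines such as NAME="sda" MODEL="X" into dicts."""
    devices = []
    for line in output.splitlines():
        if not line.strip():
            continue
        # shlex.split() strips the quotes and keeps quoted spaces together.
        pairs = (token.split("=", 1) for token in shlex.split(line))
        devices.append({key: value for key, value in pairs})
    return devices


def get_block_devices(run=subprocess.check_output):
    """Return the cached device list, calling lsblk at most once on success."""
    global _block_devices
    # The context manager releases the lock even if `run` raises.
    with _block_devices_lock:
        if _block_devices is None:
            # Assumed timeout so a hung lsblk cannot block forever.
            output = run(LSBLK_CMD, timeout=60).decode()
            _block_devices = parse_lsblk_pairs(output)
        return _block_devices
```

The `run` parameter exists only so the lookup can be exercised without a real lsblk binary; in normal use `get_block_devices()` is called with no arguments.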

Changed in maas:
status: New → Incomplete
Dmitry Sutyagin (dsutyagin) wrote :

No, lsblk is not running:

ubuntu@node:~$ sudo pstree -alp
init,1
  ├─acpid,2137 -c /etc/acpi/events -s /var/run/acpid.socket
  ├─atd,2084
  ├─cloud-init,2260 /usr/bin/cloud-init modules --mode=final
  │ ├─sh,2265 -c tee -a /var/log/cloud-init-output.log
  │ │ └─tee,2266 -a /var/log/cloud-init-output.log
  │ └─user_data.sh,2267 /var/lib/cloud/instance/scripts/user_data.sh
  │   └─python3,3407 /tmp/user_data.sh.wMa95m/bin/maas-run-remote-scripts --config=/etc/cloud/cloud.cfg.d/91_kernel_cmdline_url.cfg /tmp/user_data.sh.wMa95m
  │     └─{python3},6546
  ├─cron,2085
  ├─dbus-daemon,1976 --system --fork
  ├─dhclient,3784 -nw -4 eth0
  ├─dhclient,3791 -nw -4 eth1
  ├─dhclient,3798 -nw -4 eth2
  ├─dhclient,3804 -nw -4 eth3
  ├─dhclient,3812 -nw -4 eth4
  ├─dhclient,3819 -nw -4 eth5
  ├─dhclient,3826 -nw -4 eth6
  ├─getty,2058 -8 38400 tty4
  ├─getty,2061 -8 38400 tty5
  ├─getty,2066 -8 38400 tty2
  ├─getty,2067 -8 38400 tty3
  ├─getty,2069 -8 38400 tty6
  ├─getty,2191 -L ttyS0 115200 vt102
  ├─getty,9627 -8 38400 tty1
  ├─irqbalance,2131
  ├─lldpd,3756
  │ └─lldpd,3759
  ├─rsyslogd,1857
  │ ├─{rsyslogd},1859
  │ ├─{rsyslogd},1860
  │ └─{rsyslogd},1861
  ├─sh,3838 -c for idx in $(seq 10); do dhclient -6 eth4 && break || sleep 10; done
  │ └─dhclient,3842 -6 eth4
  ├─sh,3841 -c for idx in $(seq 10); do dhclient -6 eth5 && break || sleep 10; done
  │ └─dhclient,3845 -6 eth5
  ├─sh,3844 -c for idx in $(seq 10); do dhclient -6 eth6 && break || sleep 10; done
  │ └─dhclient,3848 -6 eth6
  ├─sh,3847 -c for idx in $(seq 10); do dhclient -6 eth7 && break || sleep 10; done
  │ └─dhclient,3850 -6 eth7
  ├─sshd,2089 -D
  │ └─sshd,19953
  │   └─sshd,20037
  │     └─bash,20038
  │       └─sudo,20146 pstree -alp
  │         └─pstree,20147 -alp
  ├─systemd-logind,2022
  ├─systemd-udevd,1243 --daemon
  ├─upstart-file-br,1974 --daemon
  ├─upstart-socket-,1488 --daemon
  └─upstart-udev-br,1238 --daemon

Dmitry Sutyagin (dsutyagin) wrote :

MAAS packages are 2.3.0-6434-gd354690-0ubuntu1~16.04.1

Commissioning operating system:
ubuntu@node:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.5 LTS
Release: 14.04
Codename: trusty

dmesg - https://pastebin.com/q3MpAh4N (looks clean to me)

Changed in maas:
status: Incomplete → New
Dmitry Sutyagin (dsutyagin) wrote :

ubuntu@node:~$ time sudo lsblk --exclude 1,2,7 -d -P -o NAME,MODEL,SERIAL
lsblk: unknown column: SERIAL

real 0m0.011s
user 0m0.000s
sys 0m0.011s

I also see the same output ("lsblk: unknown column: SERIAL") on the server's terminal.

Lee Trager (ltrager) wrote :

The issue is that lsblk in Trusty does not support the SERIAL column, and maas-run-remote-scripts is not properly handling the failure.

Using Xenial or Bionic as your commissioning release should fix the problem. Is there any reason you need to use Trusty?

Changed in maas:
status: New → Confirmed
Dmitry Sutyagin (dsutyagin) wrote :

It's just the default we have in our system. For now I'm just skipping this test, so the issue isn't blocking me. I might try a different bootstrap, but so far there has been no need.

The fact is that it's a deadlock: this situation looks unhandled and can happen on any release whenever lsblk fails for any other reason. So I wouldn't say the issue is with lsblk.
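
One common way an unhandled failure turns into this kind of hang is a manual acquire()/release() pair around a call that raises. The sketch below illustrates that general pattern in Python. It is an assumption about the mechanism, not a reproduction of the MAAS code, and every name in it (`failing_lsblk`, `fragile_lookup`, `safe_lookup`, `lock_is_leaked`) is hypothetical.

```python
import subprocess
import threading


def failing_lsblk():
    # Stand-in for lsblk exiting non-zero, e.g. "unknown column: SERIAL".
    raise subprocess.CalledProcessError(1, ["lsblk"])


def fragile_lookup(lock):
    lock.acquire()
    result = failing_lsblk()  # raises, so release() below never runs
    lock.release()
    return result


def safe_lookup(lock):
    with lock:  # released on the way out, even when the call raises
        return failing_lsblk()


def lock_is_leaked(lookup):
    """Run a lookup that fails, then report whether the lock stayed held."""
    lock = threading.Lock()
    try:
        lookup(lock)
    except subprocess.CalledProcessError:
        pass
    # A plain acquire() here would block forever if the lock leaked,
    # so probe with a short timeout instead.
    acquired = lock.acquire(timeout=0.1)
    if acquired:
        lock.release()
    return not acquired
```

With the fragile variant, any later Lock.acquire() without a timeout waits forever regardless of why lsblk failed, which matches the point that the hang is not specific to Trusty.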

Lee Trager (ltrager)
Changed in maas:
assignee: nobody → Lee Trager (ltrager)
importance: Undecided → High
Lee Trager (ltrager) wrote :

I completely agree that the problem isn't with lsblk. The issue is that we're using a feature in lsblk that isn't available in Trusty. The related branch handles the errors, but the result will be a test failure, as MAAS doesn't know why lsblk failed.

As Trusty will be EOL in April, MAAS 2.5 will no longer allow Trusty to be used as a commissioning OS.
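
The behaviour described in this comment, where an lsblk error is handled and surfaces as a test failure instead of a hang, might be sketched as follows. This is an illustration under assumptions, not the contents of the related branch: `BlockDeviceLookup`, `run_storage_test` and the result dictionaries are hypothetical. The lookup runs once, remembers a failure as well as a success, and every storage test that needs the device list then fails promptly with the recorded error.

```python
import threading


class BlockDeviceLookup:
    """Run a block device query once and remember its result or its error."""

    def __init__(self, query):
        self._query = query  # callable returning a list of devices
        self._lock = threading.Lock()
        self._done = False
        self._devices = None
        self._error = None

    def get(self):
        with self._lock:
            if not self._done:
                try:
                    self._devices = self._query()
                except Exception as exc:
                    # Remember the failure so later callers do not retry or wait.
                    self._error = exc
                self._done = True
        if self._error is not None:
            raise RuntimeError("block device lookup failed") from self._error
        return self._devices


def run_storage_test(lookup, name):
    """Return a failed result instead of hanging when the lookup failed."""
    try:
        devices = lookup.get()
    except RuntimeError as exc:
        return {"name": name, "status": "failed", "error": str(exc.__cause__)}
    return {"name": name, "status": "ready", "devices": devices}
```

Because the failure is cached under the same lock that guards the query, a second thread testing the other disk sees the same error rather than re-running lsblk or blocking.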

Lee Trager (ltrager)
Changed in maas:
status: Confirmed → In Progress
milestone: 2.5.0 → 2.5.0rc1
Lee Trager (ltrager) wrote :

I have opened LP:1800233 to track deprecating Trusty as a commissioning series in MAAS 2.5. It will still be deployable.

Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
milestone: 2.5.0rc1 → 2.5.0beta4
milestone: 2.5.0beta4 → 2.5.0rc1
Changed in maas:
status: Fix Committed → Fix Released