I am troubleshooting a stuck "Testing" phase on a particular server. Commissioning passes, but testing never finishes (and never times out). The only test enabled for this node is "smartctl-validate". Debugging on the target node, I found that the Python process is stuck doing nothing. Tracing with gdb showed that both the main process and a child thread are stuck on Lock.acquire() (in different places). One thread sends heartbeats, but the other, which is supposed to run smartctl, is stuck on Lock.acquire(). I don't see a third thread for the second smartctl invocation (the server has 2 disks, /dev/sda and /dev/sdb). This state persists indefinitely.
The traces can be viewed here: http://paste.openstack.org/show/NW4pBwt8aqPvqbtN7R1x/
The temp folder contains smartctl-validate folders for both drives, but they are empty, and smartctl is not running. The file tree can be seen here: http://paste.openstack.org/show/LkE6Lsek48xTr40Hha7K/
When a hardware storage test like smartctl-validate is run, maas-run-remote-scripts uses lsblk to map the drive MAAS wants tested to what is on the system. A Lock is used to ensure lsblk is only called once. I suspect lsblk is hanging in the thread for the other drive. Because lsblk runs before the script itself starts running, the timeout counter hasn't started yet.
Do you see lsblk running in your process tree?
If you run `lsblk --exclude 1,2,7 -d -P -o NAME,MODEL,SERIAL`, does it hang?
What version of MAAS are you using?
What commissioning operating system are you using?
Can you post the output of dmesg after maas-run-remote-scripts has hung?
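If it helps, the lsblk check can be run with a time limit so that a hang does not tie up the shell. This is only a sketch: `probe` and `LSBLK_CMD` are hypothetical names, and the 30-second default is an arbitrary assumption.

```python
import subprocess

# The same lsblk command as in the question above.
LSBLK_CMD = ["lsblk", "--exclude", "1,2,7", "-d", "-P", "-o",
             "NAME,MODEL,SERIAL"]


def probe(cmd, timeout=30):
    """Return (finished, output); finished is False if cmd exceeded timeout."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # The command did not finish in time, which suggests a hang.
        return False, ""
    return True, result.stdout.decode(errors="replace")
```

On the affected node, `probe(LSBLK_CMD)` returning `(False, "")` would suggest that lsblk itself is hanging.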