[2.0 beta 2] Nodes fail to remain powered after Trusty commission with "Allow SSH" selected

Bug #1570633 reported by Chris Gregan
This bug affects 2 people
Affects              Status        Importance  Assigned to  Milestone
MAAS                 Won't Fix     Critical    Unassigned
cloud-init           Fix Released  Medium      Unassigned
cloud-init (Ubuntu)  Fix Released  Medium      Unassigned
  Trusty             Confirmed     Undecided   Unassigned

Bug Description

Build Version/Date: MAAS 2.0 Beta2
Environment used for testing: Xenial

Summary:
When commissioning nodes with the "Allow SSH" option selected, at least 50% of nodes reach the "Ready" state but fail to remain powered on.

Steps to Reproduce:
Enlist 5+ nodes
Commission all nodes at once

Expected result:
All nodes Ready and powered

Actual result:
50-75% of nodes are Ready but powered off

Syslog shows the following errors:
Apr 14 19:03:41 donphan sh[28839]: 2016-04-14 19:03:41+0000 [RemoteOriginReadSession (UDP)] Got error: <tftp.datagram.ERRORDatagram object at 0x7f6f9409c048>
Apr 14 19:05:37 donphan sh[28839]: 2016-04-14 19:05:37+0000 [RemoteOriginReadSession (UDP)] Got error: <tftp.datagram.ERRORDatagram object at 0x7f6f8e132978>
Apr 14 19:05:37 donphan sh[28839]: 2016-04-14 19:05:37+0000 [RemoteOriginReadSession (UDP)] Got error: <tftp.datagram.ERRORDatagram object at 0x7f6f8e1320f0>
Apr 14 19:07:12 donphan sh[28839]: 2016-04-14 19:07:12+0000 [RemoteOriginReadSession (UDP)] Got error: <tftp.datagram.ERRORDatagram object at 0x7f6f8e11fa20>
Apr 14 19:07:48 donphan sh[28839]: 2016-04-14 19:07:48+0000 [RemoteOriginReadSession (UDP)] Got error: <tftp.datagram.ERRORDatagram object at 0x7f6f9d779a58>
Apr 14 19:08:08 donphan sh[28839]: 2016-04-14 19:08:08+0000 [RemoteOriginReadSession (UDP)] Got error: <tftp.datagram.ERRORDatagram object at 0x7f6f8e137f28>
Apr 14 19:11:37 donphan sh[28839]: 2016-04-14 19:11:37+0000 [RemoteOriginReadSession (UDP)] Got error: <tftp.datagram.ERRORDatagram object at 0x7f6f8e14ff28>
Apr 14 19:11:44 donphan sh[28839]: 2016-04-14 19:11:44+0000 [RemoteOriginReadSession (UDP)] Got error: <tftp.datagram.ERRORDatagram object at 0x7f6f9d779898>
Apr 14 19:11:47 donphan sh[28839]: 2016-04-14 19:11:47+0000 [RemoteOriginReadSession (UDP)] Got error: <tftp.datagram.ERRORDatagram object at 0x7f6f8e125080>
Apr 14 19:12:24 donphan sh[28839]: 2016-04-14 19:12:24+0000 [RemoteOriginReadSession (UDP)] Got error: <tftp.datagram.ERRORDatagram object at 0x7f6f8e198a90>
Apr 14 19:12:24 donphan sh[28839]: 2016-04-14 19:12:24+0000 [RemoteOriginReadSession (UDP)] Got error: <tftp.datagram.ERRORDatagram object at 0x7f6f8e1321d0>
Apr 14 19:13:14 donphan sh[28839]: 2016-04-14 19:13:14+0000 [RemoteOriginReadSession (UDP)] Got error: <tftp.datagram.ERRORDatagram object at 0x7f6f8e14f2b0>
Apr 14 19:13:17 donphan sh[28575]: Failure: twisted.internet.error.ConnectionDone: Connection was closed cleanly.
Apr 14 19:13:18 donphan sh[28575]: Failure: twisted.internet.error.ConnectionDone: Connection was closed cleanly.
Apr 14 19:43:41 donphan sh[28839]: 2016-04-14 19:43:41+0000 [RemoteOriginReadSession (UDP)] Got error: <tftp.datagram.ERRORDatagram object at 0x7f6f8e137128>
Apr 14 19:43:50 donphan sh[28839]: 2016-04-14 19:43:50+0000 [RemoteOriginReadSession (UDP)] Got error: <tftp.datagram.ERRORDatagram object at 0x7f6f8e11fba8>
Apr 14 19:43:57 donphan sh[28839]: 2016-04-14 19:43:57+0000 [RemoteOriginReadSession (UDP)] Got error: <tftp.datagram.ERRORDatagram object at 0x7f6f8e141400>
Apr 14 19:44:05 donphan sh[28839]: 2016-04-14 19:44:05+0000 [RemoteOriginReadSession (UDP)] Got error: <tftp.datagram.ERRORDatagram object at 0x7f6f8e141eb8>
Apr 14 19:44:06 donphan sh[28839]: 2016-04-14 19:44:06+0000 [RemoteOriginReadSession (UDP)] Got error: <tftp.datagram.ERRORDatagram object at 0x7f6f8e195320>
Apr 14 19:45:10 donphan sh[28839]: 2016-04-14 19:45:10+0000 [RemoteOriginReadSession (UDP)] Got error: <tftp.datagram.ERRORDatagram object at 0x7f6f8e137128>
Apr 14 19:46:19 donphan sh[28575]: #011twisted.internet.error.ConnectionDone: Connection was closed cleanly.
Apr 14 21:34:08 donphan sh[28839]: 2016-04-14 21:34:08+0000 [RemoteOriginReadSession (UDP)] Got error: <tftp.datagram.ERRORDatagram object at 0x7f6f8e137400>
Apr 14 21:34:09 donphan sh[28839]: 2016-04-14 21:34:09+0000 [RemoteOriginReadSession (UDP)] Got error: <tftp.datagram.ERRORDatagram object at 0x7f6f8e14f550>
Apr 14 21:34:20 donphan sh[28839]: 2016-04-14 21:34:20+0000 [RemoteOriginReadSession (UDP)] Got error: <tftp.datagram.ERRORDatagram object at 0x7f6f8d8a4438>
Apr 14 21:34:35 donphan sh[28839]: 2016-04-14 21:34:35+0000 [RemoteOriginReadSession (UDP)] Got error: <tftp.datagram.ERRORDatagram object at 0x7f6f8e137c18>
Apr 14 21:34:36 donphan sh[28839]: 2016-04-14 21:34:36+0000 [RemoteOriginReadSession (UDP)] Got error: <tftp.datagram.ERRORDatagram object at 0x7f6f8e0f9f98>
Apr 14 21:35:05 donphan sh[28575]: Failure: twisted.internet.error.ConnectionDone: Connection was closed cleanly.
Apr 14 21:35:46 donphan sh[28839]: 2016-04-14 21:35:46+0000 [RemoteOriginReadSession (UDP)] Got error: <tftp.datagram.ERRORDatagram object at 0x7f6f8e12e0b8>
Apr 14 21:36:51 donphan sh[28575]: #011twisted.internet.error.ConnectionDone: Connection was closed cleanly.
Apr 14 21:37:00 donphan sh[28575]: #011twisted.internet.error.ConnectionDone: Connection was closed cleanly.
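
To see which process emits which kind of message in an excerpt like the one above, the lines can be tallied by PID and message type. This is only a minimal sketch: the `tally` function, the regular expression, and the category names are illustrative assumptions, not part of MAAS, and the counts by themselves say nothing about the root cause.

```python
import re
from collections import Counter

# Matches a prefix such as "Apr 14 19:03:41 donphan sh[28839]: " and
# captures the PID and the remainder of the line.
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+\S+\s+\w+\[(?P<pid>\d+)\]:\s+(?P<msg>.*)$"
)

def tally(lines):
    """Count log lines per (pid, kind). The kinds are illustrative labels."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip anything that is not in the expected syslog shape
        msg = match.group("msg")
        if "tftp.datagram.ERRORDatagram" in msg:
            kind = "tftp_error"
        elif "ConnectionDone" in msg:
            kind = "connection_done"
        else:
            kind = "other"
        counts[(match.group("pid"), kind)] += 1
    return counts

# Three lines copied from the excerpt above.
sample = [
    "Apr 14 19:03:41 donphan sh[28839]: 2016-04-14 19:03:41+0000 [RemoteOriginReadSession (UDP)] Got error: <tftp.datagram.ERRORDatagram object at 0x7f6f9409c048>",
    "Apr 14 19:13:17 donphan sh[28575]: Failure: twisted.internet.error.ConnectionDone: Connection was closed cleanly.",
    "Apr 14 19:46:19 donphan sh[28575]: #011twisted.internet.error.ConnectionDone: Connection was closed cleanly.",
]

counts = tally(sample)
for (pid, kind), n in sorted(counts.items()):
    print(pid, kind, n)
```

In the excerpt, the TFTP error datagrams all come from PID 28839 and the ConnectionDone failures from PID 28575, so a tally like this mainly helps to keep the two sources apart.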

Andres Rodriguez (andreserl) wrote :

I've tried to replicate this in 2 different MAAS clusters with 2 different types of machines and have been unable to reproduce it. Questions:

Are you trying to access the commissioning environment because something is failing? It may be that commissioning is failing, or that something within it is failing, which prevents the SSH key from being imported into the commissioning environment and prevents the machine from being told not to power off.

What commissioning image are you using? Xenial or Trusty?

Changed in maas:
status: New → Incomplete
milestone: none → 2.0.0
Andres Rodriguez (andreserl) wrote :

Also, can you check rsyslog for that node in /var/log/maas/rsyslog/<machine-name>/ and see if there's anything there? If there's nothing, cloud-init is probably failing somewhere.

Chris Gregan (cgregan) wrote :

This occurs when commissioning with Trusty.
There was no particular reason for trying to access the system, other than that it is an option for commissioning and I wanted to give it a try.

Andres Rodriguez (andreserl) wrote : Re: [2.0 beta2] Machine power's off after commissioning with 'Allow SSH...' option enabled, when using Trusty as commissioning image.

In the commissioning machine, I did:

ubuntu@impervious-darrell:~⟫ ls -l /tmp/block-poweroff
-rw-r--r-- 1 root root 0 Apr 15 15:05 /tmp/block-poweroff

The correct file was there, but cloud-init seems to have just ignored it.
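
For reference, the check this marker file is meant to satisfy is a shell test of the form `test ! -e /tmp/block-poweroff`, which is the condition MAAS puts in its cloud-init config. A minimal sketch of how that test behaves, assuming an exit status of 0 means the poweroff may go ahead; `poweroff_allowed` is a hypothetical helper, not a MAAS or cloud-init function:

```python
import os
import subprocess
import tempfile

def poweroff_allowed(marker_path):
    """Return True when `test ! -e marker_path` exits 0, i.e. the marker is absent."""
    # The path is passed as a positional argument ($1) to avoid quoting problems.
    result = subprocess.run(["sh", "-c", 'test ! -e "$1"', "sh", marker_path])
    return result.returncode == 0

# Use a temporary directory instead of the real /tmp/block-poweroff.
with tempfile.TemporaryDirectory() as tmp:
    marker = os.path.join(tmp, "block-poweroff")
    absent = poweroff_allowed(marker)    # marker missing: test succeeds
    open(marker, "w").close()            # create an empty marker file
    present = poweroff_allowed(marker)   # marker exists: test fails

print("poweroff allowed without marker:", absent)
print("poweroff allowed with marker:", present)
```

With the marker in place the test fails, so a consumer that honours the condition should leave the machine powered on. A consumer that never evaluates the condition would power off regardless, which is consistent with what was observed here.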

summary: - Nodes fail to remain powered after commision with "Allow SSH" selected
+ [2.0 beta2] Machine power's off after commissioning with 'Allow SSH...'
+ option enabled, when using Trusty as commissioning image.
Chris Gregan (cgregan) wrote : Re: Nodes fail to remain powered after Trusty commission with "Allow SSH" selected

Cloud-init seems to be ignoring the MAAS signal to stay powered on after commissioning.

summary: - [2.0 beta2] Machine power's off after commissioning with 'Allow SSH...'
- option enabled, when using Trusty as commissioning image.
+ Nodes fail to remain powered after Trusty commission with "Allow SSH"
+ selected
Andres Rodriguez (andreserl) wrote :

My guess is that, compared to how we did this in 1.9, we switched to a new cloud-init feature, and that this feature is available in Xenial's cloud-init but not in Trusty's. My understanding was that it should have been available in both, though. We'll investigate.

summary: - Nodes fail to remain powered after Trusty commission with "Allow SSH"
- selected
+ [2.0 beta 2] Nodes fail to remain powered after Trusty commission with
+ "Allow SSH" selected
Changed in maas:
importance: Undecided → Critical
status: Incomplete → Triaged
Blake Rouse (blake-rouse) wrote :

MAAS sets power_state in the cloud-init config as such:

power_state:
  delay: now
  mode: poweroff
  timeout: 3600
  condition: test ! -e /tmp/block-poweroff

The issue is that the cloud-init in Xenial supports condition, whereas the cloud-init in Trusty does not. Would it be possible to get this backported to Trusty? That would make the code path in MAAS much simpler, instead of having to check which release is in use and change the commissioning script and the generated cloud-init config accordingly.
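
To illustrate the release check described above, here is a minimal sketch of what a release-dependent config generator could look like. Everything in it is an assumption for illustration: `RELEASES_WITH_CONDITION`, `power_state_config`, and the fallback of omitting power_state are hypothetical and are not taken from the MAAS code base.

```python
# Hypothetical set of commissioning releases whose cloud-init honours
# `condition` in power_state (per this report: Xenial yes, Trusty no).
RELEASES_WITH_CONDITION = {"xenial"}

def power_state_config(release, allow_ssh):
    """Build the power_state part of a cloud-init config as a dict."""
    config = {"delay": "now", "mode": "poweroff", "timeout": 3600}
    if release in RELEASES_WITH_CONDITION:
        # cloud-init evaluates the condition itself; the commissioning
        # environment creates the marker file when SSH access is wanted.
        config["condition"] = "test ! -e /tmp/block-poweroff"
        return {"power_state": config}
    if allow_ssh:
        # Assumed fallback: without condition support, leave power_state
        # out entirely so the machine is not powered off.
        return {}
    return {"power_state": config}

print(power_state_config("xenial", allow_ssh=True))
print(power_state_config("trusty", allow_ssh=True))
print(power_state_config("trusty", allow_ssh=False))
```

The branching on `release` is exactly the extra code path that a backport to Trusty would make unnecessary: with condition supported everywhere, the first branch alone would be enough.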

Changed in cloud-init:
status: New → Confirmed
Scott Moser (smoser)
Changed in cloud-init:
importance: Undecided → Medium
status: Confirmed → Fix Released
Changed in cloud-init (Ubuntu):
importance: Undecided → Medium
status: New → Fix Released
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in cloud-init (Ubuntu Trusty):
status: New → Confirmed
tags: added: internal
Andres Rodriguez (andreserl) wrote :

Dear user,

This is an automated message.

We believe this bug report is no longer an issue in the latest version of MAAS. For this reason, we are marking this issue as Won't Fix. If you believe this issue is still present in the latest version of MAAS, please re-open this bug report.

Changed in maas:
status: Triaged → Won't Fix
James Falcon (falcojr) wrote :