Node powering on but not deploying

Bug #1375664 reported by Adam Collard
This bug affects 2 people
Affects: MAAS
Status: Fix Released
Importance: Critical
Assigned to: Unassigned
Milestone: 1.7.0

Bug Description

MAAS 1.7.0~beta4+bzr3130-0ubuntu1~trusty1

When attempting to deploy using Juju I hit a situation where a node powered on, but then booted into the "poweroff" system.

The event log is https://pastebin.canonical.com/117875/

Note the power off "44 mins ago" was from a previous environment that was destroyed with "juju destroy-environment". The error occurred after bootstrapping again and then running juju add-machine on (selected) nodes which were in the Ready state.

Relevant chunk of /var/log/maas/maas.log https://pastebin.canonical.com/117892/

The bootstrap started at 02:08, with the add-machine calls happening around 02:13.

The machine which misbehaved is planck, which seems to be missing a "Stopping monitor" line (which the other machines have).

The node was left Allocated, but note the inconsistency in the event log which shows it should be Deploying.

If I hit the Check Power State button then MAAS tells me that the node is powered on, but it's not responding to pings.

tags: added: cloud-installer landscape
Changed in maas:
importance: Undecided → High
status: New → Triaged
Revision history for this message
Raphaël Badin (rvb) wrote :

When deploying machines, Juju does two things: first it allocates a machine and then it starts it. My guess is that the second step failed but starting the node (which is part of this second step) succeeded even though the transition itself (from Allocated to Deploying) failed. Can you have a look at the django-maas.log and pserv.log files to see if you can find any stacktrace there? Also, apache's log might tell you if the second API call issued by Juju (the one that should have transitioned the node from Allocated to Deploying) failed.
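
For reference, that two-step flow looks roughly like this against the MAAS 1.0 API (a minimal sketch: the acquire and start endpoints match the apache access log quoted below, but the server URL and auth header here are placeholders):

    import requests

    MAAS_URL = "http://maas.example.com/MAAS/api/1.0"  # placeholder URL
    HEADERS = {"Authorization": "<OAuth header>"}      # placeholder credentials

    # Step 1: acquire (allocate) a node; it moves from Ready to Allocated.
    acquired = requests.post(MAAS_URL + "/nodes/?op=acquire", headers=HEADERS)
    acquired.raise_for_status()
    system_id = acquired.json()["system_id"]

    # Step 2: start the node; this request should also transition it from
    # Allocated to Deploying. In this bug the power-on happened, but the
    # status transition apparently did not stick.
    started = requests.post(MAAS_URL + "/nodes/%s/?op=start" % system_id,
                            headers=HEADERS)
    started.raise_for_status()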

Revision history for this message
Adam Collard (adam-collard) wrote :

maas-django.log has a chunk of "cluster does not have matching boot image" lines:

ERROR 2014-09-30 01:59:16,102 maasserver Unable to identify boot image for (ubuntu/amd64/generic/trusty/local): cluster 'beretstack' does not have matching boot image.
ERROR 2014-09-30 02:03:53,515 maasserver Unable to identify boot image for (ubuntu/amd64/generic/trusty/local): cluster 'beretstack' does not have matching boot image.
ERROR 2014-09-30 02:04:02,681 maasserver Unable to identify boot image for (ubuntu/amd64/generic/trusty/local): cluster 'beretstack' does not have matching boot image.
ERROR 2014-09-30 02:04:14,495 maasserver Unable to identify boot image for (ubuntu/amd64/generic/trusty/local): cluster 'beretstack' does not have matching boot image.
ERROR 2014-09-30 02:04:25,387 maasserver Unable to identify boot image for (ubuntu/amd64/generic/trusty/local): cluster 'beretstack' does not have matching boot image.
ERROR 2014-09-30 02:10:44,458 maasserver Unable to identify boot image for (ubuntu/amd64/generic/trusty/local): cluster 'beretstack' does not have matching boot image.
ERROR 2014-09-30 02:13:02,114 twisted {}
ERROR 2014-09-30 02:13:38,465 maasserver Unable to identify boot image for (ubuntu/amd64/generic/trusty/poweroff): cluster 'beretstack' does not have matching boot image.

Revision history for this message
Adam Collard (adam-collard) wrote :

pserv.log has a scattering of

2014-09-30 02:13:38-0500 [RemoteOriginReadSession (UDP)] Got error: <tftp.datagram.ERRORDatagram object at 0x7f918903a210>
2014-09-30 02:13:42-0500 [RemoteOriginReadSession (UDP)] Got error: <tftp.datagram.ERRORDatagram object at 0x7f91860ad190>
2014-09-30 02:13:42-0500 [RemoteOriginReadSession (UDP)] Got error: <tftp.datagram.ERRORDatagram object at 0x7f91860ad190>
2014-09-30 02:13:47-0500 [RemoteOriginReadSession (UDP)] Got error: <tftp.datagram.ERRORDatagram object at 0x7f91861029d0>
2014-09-30 02:13:47-0500 [RemoteOriginReadSession (UDP)] Got error: <tftp.datagram.ERRORDatagram object at 0x7f91861029d0>
2014-09-30 02:13:57-0500 [RemoteOriginReadSession (UDP)] Got error: <tftp.datagram.ERRORDatagram object at 0x7f9186102510>

but otherwise nothing standing out

Revision history for this message
Adam Collard (adam-collard) wrote :

Juju got a 200 code when starting the node (apache access log)

10.1.3.4 - - [30/Sep/2014:02:13:12 -0500] "GET /MAAS/api/1.0/nodegroups/?op=list HTTP/1.1" 200 416 "-" "Go 1.1 package http"
10.1.3.4 - - [30/Sep/2014:02:13:12 -0500] "POST /MAAS/api/1.0/nodes/?op=acquire HTTP/1.1" 200 758 "-" "Go 1.1 package http"
10.1.3.4 - - [30/Sep/2014:02:13:12 -0500] "GET /MAAS/api/1.0/nodegroups/39929927-cfd0-43b0-9f3c-393dc602c340/boot-images/ HTTP/1.1" 200 813 "-" "Go 1.1 package http"
10.1.3.4 - - [30/Sep/2014:02:13:12 -0500] "POST /MAAS/api/1.0/nodes/node-27e59910-e295-11e3-944d-782bcb8e566b/?op=start HTTP/1.1" 200 780 "-" "Go 1.1 package http"

Christian Reis (kiko)
Changed in maas:
milestone: none → 1.7.0
importance: High → Medium
importance: Medium → Critical
Revision history for this message
Raphaël Badin (rvb) wrote :

The content of the node event log shows the node transitioned from 'Allocated' to 'Deploying'; but when the node PXE booted, it seems the node was 'Allocated'. The weird part is that the node event log doesn't show the transition from 'Deploying' back to 'Allocated'. These node transition events are tied to Django's internals, so I see two possible causes:
- something outside Django (i.e. outside MAAS) changed the node's status (very unlikely)
- the transaction 'Allocated' → 'Deploying' got rolled back. This is strange because it is part of the "start node" request, which we can see succeeded (return code 200 in apache's log). It's worth investigating how the recent work that introduced maybe_change_power_state can roll the DB transaction back and still power up the node.
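
To illustrate that second possibility, here is a minimal sketch of the failure mode (not MAAS's actual code): an external side effect fired from inside a database transaction is not undone when the transaction rolls back.

    from django.db import transaction

    def start_node(node, power_on_via_rpc):
        # Hypothetical handler: the status change and the power-on RPC
        # both happen inside one transaction.
        try:
            with transaction.atomic():
                node.status = "Deploying"
                node.save()
                power_on_via_rpc(node)  # side effect escapes the transaction
                raise RuntimeError("later failure in the same request")
        except RuntimeError:
            pass
        # The status change has been rolled back, but the node is now
        # powering on and will PXE boot while MAAS still records it as
        # Allocated, which is exactly the symptom reported here.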

Revision history for this message
Gavin Panella (allenap) wrote :

Just based on comment #5, could it be that the node is PXE booting fast enough that the transaction in Django hasn't been committed yet?

Revision history for this message
Raphaël Badin (rvb) wrote :

> Just based on comment #5, could it be that the node is PXE booting fast enough that the transaction in Django hasn't been
> committed yet?

That's a possibility but I doubt it's what happened because:
- I haven't seen a node boot in less than a couple of seconds, ever
- the node was left 'Allocated' in the end

@Adam: can you please give us the content of the node event log with the timestamps (http://paste.ubuntu.com/8484906/)?

Revision history for this message
Christian Reis (kiko) wrote :

The "cluster does not have boot image" message may be originating from the poweroff action, which apparently was broken. See bug 1376028.

Revision history for this message
Christian Reis (kiko) wrote :

Raphael, Dean has noted that the logs for this are gone. We'll need to see if we can trip this ourselves, or see if Landscape causes it again.

Changed in maas:
status: Triaged → Incomplete
Revision history for this message
Graham Binns (gmb) wrote :

I'm going to spend some time trying to trigger this.

Revision history for this message
Graham Binns (gmb) wrote :

I spent a couple of hours trying to trigger this, but to no avail. Adam, can you confirm that you still see this bug? If so, can you give me detailed steps to reproduce it?

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

I just hit this on a cluster of 33 machines. I acquired them all, then started them all, and about half remained in Allocated, although they were all powered on. The following node event log shows up on all of the Allocated nodes:

INFO 14 minutes ago PXE Request — power off
INFO 16 minutes ago Node powered on
INFO 16 minutes ago Powering node on
INFO 16 minutes ago Node changed status — From 'Allocated' to 'Deploying'
INFO 16 minutes ago Node changed status — From 'Ready' to 'Allocated' (to root)

Revision history for this message
Raphaël Badin (rvb) wrote :

Jason just reproduced this in his lab using 1.7.0~beta6+bzr3252-0ubuntu1~trusty1.

Changed in maas:
status: Incomplete → Triaged
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

I also see the issue show up when trying to release a bunch of nodes at once. When I tried to release all 33, about half failed to change to Releasing state. When I tried to release the remaining ones, only about half again worked.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Exact steps to reproduce:

Using the web UI's bulk actions:

1) Acquire all nodes (I have 33 to test with, not sure how many are required).
2) Start all nodes

Some nodes will be in 'Allocated' still, while some are in 'Deploying'. All should be powered on.

Once all nodes are 'Deploying' or 'Deployed', you can reproduce the issue again with a bulk action:
3) Release all nodes

Only some of them will go to Releasing; others will remain in Deploying or Deployed.

Revision history for this message
Graham Binns (gmb) wrote :

With that many nodes, my guess is that we're timing out before all the RPC commands have returned (see src/maasserver/models/node.py:133; we only wait 30s for *all* the deferreds to return). If each takes a couple of seconds to return, it won't take more than 15-20 nodes for us to start hitting the timeout.
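
As a rough sketch of that pattern (not the MAAS code itself, and Deferred.addTimeout is used here only for brevity): one overall timeout is shared by all of the per-node RPC deferreds, so a large enough batch of slow responses exhausts it.

    from twisted.internet import defer, reactor

    def power_off_all(nodes, send_power_rpc, overall_timeout=30):
        # One deferred per node; the whole batch shares a single timeout,
        # so the more nodes there are, the more likely the batch runs out
        # of time before every deferred has fired.
        deferreds = [send_power_rpc(node) for node in nodes]
        d = defer.DeferredList(deferreds, consumeErrors=True)
        d.addTimeout(overall_timeout, reactor)
        return d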

Revision history for this message
Graham Binns (gmb) wrote :

Quick update: Gavin, Jason and I have been working to debug this. Through crude bisecting, we've isolated the problem (at least with release()) to the call to power_off_nodes() in NodeManager.stop_nodes().

More precisely, if call_power_command() in power_nodes() (src/maasserver/clusterrpc/power.py) returns early rather than calling the Power(Off|On) RPC command, all the nodes release just hunky dory. If the RPC call goes through, roughly half of the nodes will remain Allocated. We've also seen PowerActionAlreadyInProgress errors in the log for some nodes, though we don't know yet whether that's spurious or not.

Gavin and I are leaving to grab dinner now; we'll pick this up later.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

It's interesting to note that Bulk Commissioning and Abort Commissioning work just fine, even though they also involve power on and off commands and state changes. The difference appears to be that in those actions, the transaction is explicitly committed after each node's state is changed, which is not done for deploying/releasing.

If you add an explicit transaction commit to the deploying and releasing actions, things work fine. I've posted a branch that does this:

 lp:~jason-hobbs/maas/fix-bulk-start-release
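
A sketch of what that per-node commit looks like (names here are illustrative; the real change is in the branch above):

    from django.db import transaction

    def release_nodes(nodes):
        for node in nodes:
            node.release()        # status change for this node
            transaction.commit()  # commit per node rather than once at the
                                  # end, so a later rollback in the same
                                  # request cannot undo this transition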

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

FWIW you can hit the issue with the 'nodes release' API also - the branch I posted doesn't cover that.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

It's interesting that it seems to be about half of the nodes that don't get state transitioned properly. If I try to release 33 nodes at once, it's almost always either 13 or 14 left that didn't transition, which means about 20 succeeded. If I try to release 17 at once, 8 fail and 9 succeed.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

I updated my branch to cover the Nodes release API also. It's interesting that the transaction.commit() doesn't fix things if it's at the end of the method, prior to the return; it only works if it's done on a per-node basis.

Revision history for this message
Graham Binns (gmb) wrote :

The branch I just linked adds commit()s that will work for both the UI and the API. That might be enough to fix this bug, but I think we need to fix bug 1330765 to be sure about sane node states on bulk actions.

Changed in maas:
status: Triaged → Fix Committed
Revision history for this message
Julian Edwards (julian-edwards) wrote :

> errors in the log for PowerActionAlreadyInProgress

This is *really* concerning and I've raised this before: I suspect that the power monitor service is clashing with genuine requests.

Revision history for this message
Graham Binns (gmb) wrote : Re: [Bug 1375664] Re: Node powering on but not deploying

On 16 October 2014 00:38, Julian Edwards <email address hidden> wrote:
>> errors in the log for PowerActionAlreadyInProgress
>
> This is *really* concerning and I've raised this before: I suspect that
> the power monitor service is clashing with genuine requests.

From an admittedly quick glance at the code I don't think that the
power monitor service uses anything that puts entries in the power
action "registry"; it's only when we're _changing_ power state that we
take a lock on power actions for a node. It's more likely (in this
case) that we PEBKAC'd — there were three of us futzing with the NEC
lab MAAS at the time.

Revision history for this message
Raphaël Badin (rvb) wrote :

On 10/16/2014 10:11 AM, Graham Binns wrote:
> On 16 October 2014 00:38, Julian Edwards <email address hidden> wrote:
>>> errors in the log for PowerActionAlreadyInProgress
>>
>> This is *really* concerning and I've raised this before: I suspect that
>> the power monitor service is clashing with genuine requests.
>
> From an admittedly quick glance at the code I don't think that the
> power monitor service uses anything that puts entries in the power
> action "registry"; it's only when we're _changing_ power state that we
> take a lock on power actions for a node. It's more likely (in this
> case) that we PEBKAC'd — there were three of us futzing with the NEC
> lab MAAS at the time.

I think gmb is right on both counts: a) PowerActionAlreadyInProgress can
only be raised when power states are being changed b) I know for a fact
I was messing around with the lab at the same time Jason and gmb were
experimenting.

Revision history for this message
Julian Edwards (julian-edwards) wrote :

On Thursday 16 Oct 2014 09:00:25 you wrote:
> I think gmb is right on both counts: a) PowerActionAlreadyInProgress can
> only be raised when power states are being changed b) I know for a fact
> I was messing around with the lab at the same time Jason and gmb were
> experimenting.

Fair enough.

What about when the power query clashes with the state change *on the bmc*?

Changed in maas:
status: Fix Committed → Fix Released