Nodes failed to transition out of "New" state on bulk commission

Bug #1375980 reported by Jason Hobbs
This bug affects 1 person
Affects: MAAS
Status: Fix Released
Importance: Critical
Assigned to: Graham Binns

Bug Description

I tried to bulk commission about 25 IPMI controlled nodes that were all in the 'New' state in MAAS.

A couple of the nodes failed to transition to 'On' for some reason, and MAAS eventually timed out the power on request.

As a result, the nodes that did power on successfully never transitioned out of the 'New' state.

I have these errors in my log:

ERROR 2014-10-01 03:41:43,137 django.request Internal Server Error: /MAAS/nodes/
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/django/core/handlers/base.py", line 112, in get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/lib/python2.7/dist-packages/django/views/generic/base.py", line 69, in view
    return self.dispatch(request, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/django/views/generic/base.py", line 87, in dispatch
    return handler(request, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/maasserver/views/nodes.py", line 276, in post
    return super(NodeListView, self).post(request, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/django/views/generic/edit.py", line 171, in post
    return self.form_valid(form)
  File "/usr/lib/python2.7/dist-packages/maasserver/views/nodes.py", line 288, in form_valid
    stats = form.save()
  File "/usr/lib/python2.7/dist-packages/maasserver/forms.py", line 1910, in save
    return self.perform_action(action_name, system_ids)
  File "/usr/lib/python2.7/dist-packages/maasserver/forms.py", line 1858, in perform_action
    action_instance.execute(allow_redirect=False)
  File "/usr/lib/python2.7/dist-packages/maasserver/node_action.py", line 187, in execute
    self.node.start_commissioning(self.user)
  File "/usr/lib/python2.7/dist-packages/maasserver/models/node.py", line 943, in start_commissioning
    [self.system_id], user, user_data=commissioning_user_data)
  File "/usr/lib/python2.7/dist-packages/maasserver/models/node.py", line 451, in start_nodes
    wait_for_power_commands(deferreds)
  File "/usr/lib/python2.7/dist-packages/maasserver/models/node.py", line 137, in wait_for_power_commands
    results = block_until_commands_complete()
  File "/usr/lib/python2.7/dist-packages/provisioningserver/utils/twisted.py", line 108, in wrapper
    return func_in_reactor(*args, **kwargs).wait(timeout)
  File "/usr/lib/python2.7/dist-packages/crochet/_eventloop.py", line 217, in wait
    result = self._result(timeout)
  File "/usr/lib/python2.7/dist-packages/crochet/_eventloop.py", line 195, in _result
    raise TimeoutError()
TimeoutError

This is with 1.7.0~beta4+bzr3139-0ubuntu1~trusty1

Tags: rpc

Jason Hobbs (jason-hobbs) wrote :

My nodes also failed to power off after PXE booting in the New state:

ERROR 2014-10-01 03:43:29,742 maasserver Unable to identify boot image for (ubuntu/amd64/generic/trusty/poweroff): cluster 'maas' does not have matching boot image.
ERROR 2014-10-01 03:43:34,396 maasserver Unable to identify boot image for (ubuntu/amd64/generic/trusty/poweroff): cluster 'maas' does not have matching boot image.

Gavin Panella (allenap) wrote :

Looks like that one exception is causing the whole transaction to abort. The code that starts multiple nodes needs to be more careful with exceptions.
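
To illustrate the point about exception handling, here is a minimal sketch of a bulk action that records per-node failures instead of letting the first one abort the batch. It is not the MAAS code: `PowerTimeout`, `start_commissioning` and `bulk_commission` are hypothetical stand-ins, and the `failing` parameter exists only to simulate nodes that never power on.

```python
class PowerTimeout(Exception):
    """Stand-in for the TimeoutError seen in the traceback (assumption)."""


def start_commissioning(system_id, failing):
    # Hypothetical per-node action; raises for nodes that never power on.
    if system_id in failing:
        raise PowerTimeout(system_id)
    return "Commissioning"


def bulk_commission(system_ids, failing=frozenset()):
    """Attempt every node; collect failures instead of aborting the batch."""
    done, failed = {}, {}
    for system_id in system_ids:
        try:
            done[system_id] = start_commissioning(system_id, failing)
        except PowerTimeout as error:
            # Record the failure and carry on with the remaining nodes.
            failed[system_id] = error
    return done, failed


done, failed = bulk_commission(
    ["node-a", "node-b", "node-c"], failing={"node-b"})
print(sorted(done), sorted(failed))
```

In a Django view, each node's state change would presumably also need to be committed independently (for example inside its own savepoint), which this sketch does not show.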

However, there may be an underlying cause: block_until_commands_complete() should only time out after 30 seconds, which should be more than enough time to do the preliminary checks that power_on does.

Part of the culprit may be in maybe_change_power_state(), which, while servicing an RPC call, makes a call back to the region (it calls power_change_starting, which sends a message to the node event log). That can probably be done from change_power_state() instead.

Still, 30 seconds is a long time for that to take. I'm wondering if there's a deadlock-like situation.
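
As a rough illustration of waiting on several power commands with one overall timeout, the sketch below reports which commands finished and which did not, rather than raising as soon as the wait expires. It uses Python's standard `concurrent.futures` in place of the Twisted/crochet machinery in the traceback; `power_on`, `wait_for_commands` and the simulated delays are assumptions for the example only.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def power_on(system_id, delay):
    # Hypothetical power command; `delay` simulates a slow BMC.
    time.sleep(delay)
    return system_id


def wait_for_commands(commands, timeout):
    """Wait up to `timeout` seconds overall; return (finished, timed_out)."""
    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        futures = {
            pool.submit(power_on, system_id, delay): system_id
            for system_id, delay in commands.items()
        }
        # wait() returns the futures that completed and those still pending.
        done, not_done = wait(futures, timeout=timeout)
        finished = sorted(futures[f] for f in done)
        timed_out = sorted(futures[f] for f in not_done)
    # Leaving the `with` block still waits for straggling threads to exit.
    return finished, timed_out


finished, timed_out = wait_for_commands(
    {"node-a": 0.0, "node-b": 0.0, "node-c": 1.0}, timeout=0.3)
print(finished, timed_out)
```

With a result like this, the caller could move the nodes that did power on to their next state and flag only the stragglers.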

Gavin Panella (allenap) wrote :

lp:~allenap/maas/log-power-on-a-bit-later moves the logging out of the RPC handler. Let's see if that makes any difference.

Gavin Panella (allenap) wrote :

> My nodes also failed to power off after PXE booting in the New state:

This is a separate bug. Do you mind filing that?

Jason Hobbs (jason-hobbs) wrote :

Gavin - done. bug 1376028

Jason Hobbs (jason-hobbs) wrote :

It doesn't appear to be fully fixed. I'm running 1.7.0~beta4+bzr3168-0ubuntu1~trusty1 and hit it again, the same way.

Here are my logs:
http://paste.ubuntu.com/8480384/

Raphaël Badin (rvb)
Changed in maas:
status: New → Triaged
importance: Undecided → Critical
tags: added: rpc
Graham Binns (gmb) wrote :

This could be related to bug 1375970, though I've not successfully reproduced that yet.

Changed in maas:
milestone: none → 1.7.0
Graham Binns (gmb)
Changed in maas:
status: Triaged → In Progress
assignee: nobody → Graham Binns (gmb)
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released