Timeout leads to inconsistency between maas and real world state, can't commission or start nodes

Bug #1375970 reported by Andreas Hasenack
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MAAS
Fix Released
Critical
Graham Binns

Bug Description

maas 1.7.0~beta4+bzr3130-0ubuntu1~trusty1 upgraded earlier today from 1.7.0~beta4+bzr3095-0ubuntu1.

While viewing a specific node registered with maas, I clicked on "Commission". That took a while, and eventually it showed a bland Internal server error" in a white unstyled page. The logs had several timeout errors and this was the last one:
ERROR 2014-09-30 19:17:27,675 django.request Internal Server Error: /MAAS/nodes/node-6515cc64-44e8-11e4-87d4-2c59e54ace74/view/
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/django/core/handlers/base.py", line 112, in get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/lib/python2.7/dist-packages/django/views/generic/base.py", line 69, in view
    return self.dispatch(request, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/maasserver/views/nodes.py", line 617, in dispatch
    return super(NodeView, self).dispatch(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/django/views/generic/base.py", line 87, in dispatch
    return handler(request, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/django/views/generic/edit.py", line 228, in post
    return super(BaseUpdateView, self).post(request, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/django/views/generic/edit.py", line 171, in post
    return self.form_valid(form)
  File "/usr/lib/python2.7/dist-packages/django/views/generic/edit.py", line 147, in form_valid
    self.object = form.save()
  File "/usr/lib/python2.7/dist-packages/maasserver/forms.py", line 951, in save
    message = action.execute(allow_redirect=allow_redirect)
  File "/usr/lib/python2.7/dist-packages/maasserver/node_action.py", line 187, in execute
    self.node.start_commissioning(self.user)
  File "/usr/lib/python2.7/dist-packages/maasserver/models/node.py", line 943, in start_commissioning
    [self.system_id], user, user_data=commissioning_user_data)
  File "/usr/lib/python2.7/dist-packages/maasserver/models/node.py", line 451, in start_nodes
    wait_for_power_commands(deferreds)
  File "/usr/lib/python2.7/dist-packages/maasserver/models/node.py", line 137, in wait_for_power_commands
    results = block_until_commands_complete()
  File "/usr/lib/python2.7/dist-packages/provisioningserver/utils/twisted.py", line 108, in wrapper
    return func_in_reactor(*args, **kwargs).wait(timeout)
  File "/usr/lib/python2.7/dist-packages/crochet/_eventloop.py", line 217, in wait
    result = self._result(timeout)
  File "/usr/lib/python2.7/dist-packages/crochet/_eventloop.py", line 195, in _result
    raise TimeoutError()
TimeoutError
ERROR 2014-09-30 19:17:27,686 twisted [(True, {})]
ERROR 2014-09-30 19:18:25,360 maasserver Unable to identify boot image for (ubuntu/amd64/generic/trusty/poweroff): cluster 'scapestack' does not have matching boot image.

I then reloaded the node view page, and to my surprise it says the node is "Ready", while in fact the poweron command did work and the node is powered up and sitting at a boot prompt now:
Intel(R) Boot Agent GE v1.4.02
Copyright (C) 1997-2012, Intel Corporation

CLIENT MAC ADDR: 2C 59 E5 4A ED 90 GUID: 37313636 3735 4D58 5133 343230514459
CLIENT IP: 10.96.2.146 MASK: 255.255.0.0 DHCP IP: 10.96.0.10
GATEWAY IP: 10.96.0.1
!PXE entry point found (we hope) at 93DA:0106 via plan A
UNDI code segment at 93DA len 56A0
UNDI data segment at 8D9D len 63D0
Getting cached packet 01 02 03
My IP address seems to be 0A600292 10.96.2.146
ip=10.96.2.146:10.96.0.10:10.96.0.1:255.255.0.0
BOOTIF=01-2c-59-e5-4a-ed-90
SYSUUID=37313636-3735-4d58-5133-343230514459
TFTP prefix:
Trying to load: pxelinux.cfg/01-2c-59-e5-4a-ed-90 ok
APM not present.
boot:

I'm attaching maas-django.log.

The "commission" action was this one:

10.112.0.10 - - [30/Sep/2014:19:16:57 +0000] "POST /MAAS/nodes/node-6515cc64-44e8-11e4-87d4-2c59e54ace74/view/ HTTP/1.1" 500 492 "http://10.96.0.10/MAAS/nodes/node-6515cc64-44e8-11e4-87d4-2c59e54ace74/view/" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:32.0) Gecko/20100101 Firefox/32.0"

Related branches

Revision history for this message
Andreas Hasenack (ahasenack) wrote :
tags: added: cloud-installer landscape
Changed in maas:
milestone: none → 1.7.0
importance: Undecided → Critical
summary: - Timeout leads to inconsistency between maas and real world state
+ Timeout leads to inconsistency between maas and real world state, can't
+ commission or start nodes
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Sequence of events when I click on "commission": http://pastebin.ubuntu.com/8472925/

And the node is powered up afterwards.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

I will purge and reinstall, we need a working maas server.

Revision history for this message
Graham Binns (gmb) wrote :

I can fix the Internal Server Error easily enough… not wholly sure about how to fix the fact that the power up command succeeded but timed out, though.

Gavin: any thoughts?

Changed in maas:
status: New → In Progress
assignee: nobody → Graham Binns (gmb)
Revision history for this message
Graham Binns (gmb) wrote :

I've filed bug 1376304 about the UI part of this bug.

Changed in maas:
status: In Progress → Triaged
assignee: Graham Binns (gmb) → nobody
Revision history for this message
Graham Binns (gmb) wrote :

We really need more info to be able to work out why this timeout happened; the RPC method for powering on a node *should* return pretty promptly, so it's very odd that it it took > 30 seconds.

Andreas, have you been able to reproduce this bug? Did it only happen the once, or has it happened on multiple occasions? Was line 19 of http://pastebin.ubuntu.com/8472925/ the last line in maas.log, or was there more after it that you trimmed for brevity?

Until we can reproduce this (and I haven't been able to do so yet) we can't really fix it. Marking as incomplete for now.

Changed in maas:
status: Triaged → Incomplete
importance: Critical → Undecided
importance: Undecided → Critical
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

I had to reinstall this MAAS from scratch, I needed it working asap.

I uploaded more logs and details to chinstrap.canonical.com:~andreas/atlas-maas-data.tar.bz2 (24Mb), and that's all I have since the installation is gone now.

Revision history for this message
Graham Binns (gmb) wrote : Re: [Bug 1375970] Re: Timeout leads to inconsistency between maas and real world state, can't commission or start nodes

Thanks, I'll take a trawl.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Graham, the logs from the pastebin were not cut in any way. It was a plain tail -f /var/log/maas/*.log that I left running while I clicked on commission and watched it crash.

Revision history for this message
Graham Binns (gmb) wrote :

On 2 October 2014 14:15, Andreas Hasenack <email address hidden> wrote:
>
> I uploaded more logs and details to chinstrap.canonical.com:~andreas
> /atlas-maas-data.tar.bz2 (24Mb), and that's all I have since the
> installation is gone now.

Unfortunately all those logfiles predate the problem (they run up to
September 30th). I'll keep bisecting and see if I can reproduce, but
so far no dice.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

The problem was noticed on September 30th. I left the deployment up as long as I could, until October 1st my lunch time, and reached out to #maas. Andres and rvba helped but we got nowhere.

Revision history for this message
Graham Binns (gmb) wrote :

On 2 October 2014 16:34, Andreas Hasenack <email address hidden> wrote:
> The problem was noticed on September 30th. I left the deployment up as
> long as I could, until October 1st my lunch time, and reached out to
> #maas. Andres and rvba helped but we got nowhere.

Ah, okay. I'll take another look… I was looking for the specific
entries from yesterday. Thanks for the info.

Christian Reis (kiko)
Changed in maas:
assignee: nobody → Graham Binns (gmb)
Revision history for this message
Raphaël Badin (rvb) wrote :

The timeout is a problem in itself but the underlying bug is that the RPC methods aren't tied to the Django transactions: even when things break on the Django side, the RPC operations are still issued and this leads to inconsistent states. Maybe we could have a record of all the RPC calls that need to be made and only fire them when the related Django transaction is committed.

Revision history for this message
Gavin Panella (allenap) wrote :

> ... the underlying bug is that the RPC methods aren't tied to the
> Django transactions ...

One bug is that we're not catching the exception nearer the call-site.
Django wraps a transaction around the whole HTTP request, but we can and
should take more control of transactions.

I think it depends on what we're going to do. For example, power-on is
meant to be a very quick RPC call, though the process that it kicks off
can take a lot longer. In this case, we could either:

- Issue the PowerOn RPC call while in a sub-transaction (i.e. after a
  savepoint), and wait for it to return. If the call failed, rollback to
  the savepoint, undoing status changes and suchlike.

- Commit, then issue the PowerOn RPC call. If it fails, return node
  statuses and whatnot to their previous state (or put them into a
  failed state) and commit again.

Deferring things to at-the-end is half-way back to the situation we had
with Celery. The difference is that we now have a better return path for
reporting back.

But a key benefit of RPC is the immediacy of it. The PowerOn and
PowerOff calls in particular give immediate feedback if an existing
power change is in progress, and that's something we want to be able to
tell the user.

In short, we must be explicit with transaction boundaries and error
handling, far more than before.

I'd still want to know why it timed-out though, while doing what should
have been a very quick RPC call.

Graham Binns (gmb)
Changed in maas:
status: Incomplete → In Progress
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.