Errors logged for failure to shutdown node while not being able to contact cluster

Bug #1395375 reported by Larry Michel
Affects   Status    Importance   Assigned to   Milestone
MAAS      Invalid   Critical     Unassigned

Bug Description

Seeing a server (mier) that was deployed, then powered off, but remained in the Deployed state for 30+ minutes. I started looking in the logs and am seeing errors that seem to point to this inconsistent state. Shouldn't the system go to the Ready state when powered off by MAAS?

From MAAS UI:

===========================================================================

Status
Deployed
...
Owner
oil
...
Installation output
install log curtin install log
Commissioning output
9 output files
Latest node events
Level Emitted Event
INFO 29 minutes ago Powering node off
INFO 33 minutes ago Node powered off
INFO 33 minutes ago Powering node off
INFO 1 hour, 38 minutes ago Node changed status — From 'Deploying' to 'Deployed'
INFO 1 hour, 38 minutes ago PXE Request — local boot
===========================================================================

From the MAAS log, multiple attempts to shut down mier are shown. So the server is powered off but was subsequently not released due to a failure to connect to the cluster? I did verify that the system was powered off with the "Check power state" option.

===========================================================================
ubuntu@maas-trusty-back-may22:/var/log/maas$ grep "Unable to shut node down" maas.log|grep mier
Nov 22 19:09:31 maas-trusty-back-may22 maas.node: [ERROR] mier.local: Unable to shut node down: [Failure instance: Traceback: <class 'provisioningserver.rpc.exceptions.NoConnectionsAvailable'>: Unable to connect to cluster 037c960b-5b9f-4701-8366-eeda2c09d14e; no connections available.#012/usr/lib/python2.7/dist-packages/twisted/internet/base.py:824:runUntilCurrent#012/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:467:cancel#012/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:423:errback#012/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:490:_startRunCallbacks#012--- <exception caught here> ---#012/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:577:_runCallbacks#012/usr/lib/python2.7/dist-packages/maasserver/rpc/regionservice.py:606:cancelled#012]
Nov 22 19:13:19 maas-trusty-back-may22 maas.node: [ERROR] mier.local: Unable to shut node down: [Failure instance: Traceback: <class 'provisioningserver.rpc.exceptions.NoConnectionsAvailable'>: Unable to connect to cluster 037c960b-5b9f-4701-8366-eeda2c09d14e; no connections available.#012/usr/lib/python2.7/dist-packages/twisted/internet/base.py:824:runUntilCurrent#012/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:467:cancel#012/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:423:errback#012/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:490:_startRunCallbacks#012--- <exception caught here> ---#012/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:577:_runCallbacks#012/usr/lib/python2.7/dist-packages/maasserver/rpc/regionservice.py:606:cancelled#012]
ubuntu@maas-trusty-back-may22:/var/log/maas$ grep "Unable to shut node down" maas.log|grep mier|grep "Nov 22 19:09"
Nov 22 19:09:31 maas-trusty-back-may22 maas.node: [ERROR] mier.local: Unable to shut node down: [Failure instance: Traceback: <class 'provisioningserver.rpc.exceptions.NoConnectionsAvailable'>: Unable to connect to cluster 037c960b-5b9f-4701-8366-eeda2c09d14e; no connections available.#012/usr/lib/python2.7/dist-packages/twisted/internet/base.py:824:runUntilCurrent#012/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:467:cancel#012/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:423:errback#012/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:490:_startRunCallbacks#012--- <exception caught here> ---#012/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:577:_runCallbacks#012/usr/lib/python2.7/dist-packages/maasserver/rpc/regionservice.py:606:cancelled#012]
===========================================================================

Looking some time later, the event log shows the system finally going to the Ready state (~36 minutes later), and another power-off is issued. It is then re-allocated and re-deployed:
===========================================================================
INFO 4 minutes ago Node changed status — From 'Allocated' to 'Deploying'
INFO 4 minutes ago Node changed status — From 'Ready' to 'Allocated' (to oil-slave-2)
INFO 7 minutes ago Node powered off
INFO 7 minutes ago Powering node off
INFO 7 minutes ago Node changed status — From 'Deployed' to 'Ready'
INFO 10 minutes ago Node powered off
INFO 10 minutes ago Powering node off
INFO 46 minutes ago Powering node off >>>>>> ~Nov 22 17:13-Nov 22 17:09 time frame
INFO 49 minutes ago Node powered off
INFO 50 minutes ago Powering node off
INFO 1 hour, 54 minutes ago Node changed status — From 'Deploying' to 'Deployed'
===========================================================================

If I've understood this scenario correctly, we could see other symptoms that could result in other bugs.

There are 75 such errors in the log since the day began (UTC time), and I can trace these errors back a few weeks.

ubuntu@maas-trusty-back-may22:/var/log/maas$ grep "Unable to shut node down" maas.log|grep "Nov 22" |wc -l
75

This may or may not be related, but I have also noticed 466 errors for failed deployments in that time frame:

ubuntu@maas-trusty-back-may22:/var/log/maas$ grep "Nov 22" maas.log|grep "Marking node failed" maas.log| wc -l
466
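One note on the command above: the second `grep` is given `maas.log` as an explicit filename, so it reads the whole file directly and ignores the date-filtered input arriving on the pipe; the count of 466 may therefore include days other than Nov 22. Dropping the filename from the second `grep` makes the date filter apply, as this self-contained check (using a made-up two-line sample log, not the real maas.log) illustrates:

```shell
# Build a two-line sample log, then count "Marking node failed" for
# Nov 22 only. The second grep reads stdin (no filename argument), so
# the date filter from the first grep actually takes effect.
printf '%s\n' \
  'Nov 21 00:01:00 host maas.node: [ERROR] mier.local: Marking node failed' \
  'Nov 22 00:01:00 host maas.node: [ERROR] mier.local: Marking node failed' \
  > sample.log
grep "Nov 22" sample.log | grep "Marking node failed" | wc -l   # prints 1
```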

I'll attach maas logs for analysis.

Larry Michel (lmic)
description: updated
Revision history for this message
Larry Michel (lmic) wrote :

logs attached.

Revision history for this message
Graham Binns (gmb) wrote :

Thank you for the detailed bug report! For completeness, can you also let us know which version of MAAS this was? (1.7, clearly, but is it RC3, 1.7.0, or another 1.7 release?)

The problem as I see it is at least twofold:

 - MAAS fails to surface the fact that the cluster wasn't connected when reporting the power command failure in the UI.
 - The error messages in the log are dreadful (they contain tracebacks when they really shouldn't).

Also, when a cluster is disconnected, MAAS should show a warning to the tune of "One or more clusters are currently disconnected. Visit the clusters page for more information." Did this happen on the MAAS region you were working with?

Changed in maas:
status: New → Triaged
importance: Undecided → High
milestone: none → next
tags: added: confusing-ui log
Revision history for this message
Larry Michel (lmic) wrote :

This is RC3.

ubuntu@maas-trusty-back-may22:~$ dpkg -l |grep maas
ii maas 1.7.0~rc3+bzr3299-0ubuntu1~trusty1 all MAAS server all-in-one metapackage

I haven't been seeing any warning on the MAAS UI pointing to an issue with the cluster being disconnected.

Revision history for this message
Graham Binns (gmb) wrote : Re: [Bug 1395375] Re: Errors logged for failure to shutdown node while not being able to contact cluster

Okay, thanks. Is this reproducible, or has the problem pretty much gone
away right now?

Revision history for this message
Larry Michel (lmic) wrote :

The problem seems to have gone away; it last occurred on Nov 22, when there were 75 of those errors across various servers.

ubuntu@maas-trusty-back-may22:/var/log/maas$ grep "Unable to shut node down" maas.log| grep -i "connect"|grep "Nov 20"|wc -l
0
ubuntu@maas-trusty-back-may22:/var/log/maas$ grep "Unable to shut node down" maas.log| grep -i "connect"|grep "Nov 21"|wc -l
0
ubuntu@maas-trusty-back-may22:/var/log/maas$ grep "Unable to shut node down" maas.log| grep -i "connect"|grep "Nov 22"|wc -l
75
ubuntu@maas-trusty-back-may22:/var/log/maas$ grep "Unable to shut node down" maas.log| grep -i "connect"|grep "Nov 23"|wc -l
0
ubuntu@maas-trusty-back-may22:/var/log/maas$ grep "Unable to shut node down" maas.log| grep -i "connect"|grep "Nov 24"|wc -l
0

Revision history for this message
Julian Edwards (julian-edwards) wrote :

This is dreadful behaviour from MAAS, sorry. We'll get it fixed ASAP.

@GMB, it seems as though there are some "getClientFor()" calls out there that are not catching exceptions. The code cannot assume it will always work.
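[Editorial illustration of the pattern described above. The names below are stand-ins, not MAAS's actual implementation; the real getClientFor() lives in provisioningserver.rpc. The sketch only shows the general idea: callers handle the no-connections case and produce a short, readable message instead of letting a raw Twisted traceback escape into the log.]

```python
class NoConnectionsAvailable(Exception):
    """Stand-in for provisioningserver.rpc.exceptions.NoConnectionsAvailable."""


def get_client_for(cluster_uuid, connections):
    # Hypothetical lookup: raise when the cluster has no RPC connection,
    # mirroring the failure mode seen in the attached logs.
    try:
        return connections[cluster_uuid]
    except KeyError:
        raise NoConnectionsAvailable(
            "Unable to connect to cluster %s; no connections available."
            % cluster_uuid)


def shut_node_down(hostname, cluster_uuid, connections):
    # Defensive caller: catch the exception and return a concise message
    # instead of allowing a stack trace to be logged verbatim.
    try:
        client = get_client_for(cluster_uuid, connections)
    except NoConnectionsAvailable as error:
        return "%s: Unable to shut node down: %s" % (hostname, error)
    client("PowerOff", hostname)
    return "%s: power-off requested" % hostname
```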

Changed in maas:
importance: High → Critical
milestone: next → 1.7.1
Revision history for this message
Graham Binns (gmb) wrote :

On 24 November 2014 at 16:49, Julian Edwards <email address hidden>
wrote:

>
> @GMB, it seems as though there's some "getClientFor()" calls out there
> that are not catching exceptions. The code cannot assume it will always
> work.

Yeah, it's pretty atrocious. I'll get on this presently.

tags: added: oil
Graham Binns (gmb)
Changed in maas:
status: Triaged → In Progress
assignee: nobody → Graham Binns (gmb)
Changed in maas:
milestone: 1.7.1 → 1.7.2
Gavin Panella (allenap)
Changed in maas:
assignee: Graham Binns (gmb) → nobody
Changed in maas:
status: In Progress → New
Revision history for this message
Raphaël Badin (rvb) wrote :

Trying to wrap my head around this bug but I've got a question:

> Seeing a server (mier) that was deployed, then powered off but remained in the deployed state for +30 mns.

When you say "then powered off", do you mean that you *released* the node, or did you just power it off (via the UI/API, or manually)? If you just powered it off, the node should just stay "Deployed".

Changed in maas:
milestone: 1.7.2 → 1.7.3
Changed in maas:
status: New → Invalid