Errors logged for failure to shutdown node while not being able to contact cluster

Bug #1395375 reported by Larry Michel
Affects   Status    Importance   Assigned to   Milestone
MAAS      Invalid   Critical     Unassigned

Bug Description

Seeing a server (mier) that was deployed, then powered off, but remained in the Deployed state for 30+ minutes. I started looking in the logs and am seeing errors that seem to point to this inconsistent state. Shouldn't the system go to the Ready state when powered off by MAAS?

From MAAS UI:

===========================================================================

Status
Deployed
...
Owner
oil
...
Installation output
install log curtin install log
Commissioning output
9 output files
Latest node events
Level Emitted Event
INFO 29 minutes ago Powering node off
INFO 33 minutes ago Node powered off
INFO 33 minutes ago Powering node off
INFO 1 hour, 38 minutes ago Node changed status — From 'Deploying' to 'Deployed'
INFO 1 hour, 38 minutes ago PXE Request — local boot
===========================================================================

From the MAAS log, multiple attempts to shut down mier are shown. So the server is powered off but was subsequently not released due to a failure to connect to the cluster? I did verify that the system was powered off with the "Check power state" option.

===========================================================================
ubuntu@maas-trusty-back-may22:/var/log/maas$ grep "Unable to shut node down" maas.log|grep mier
Nov 22 19:09:31 maas-trusty-back-may22 maas.node: [ERROR] mier.local: Unable to shut node down: [Failure instance: Traceback: <class 'provisioningserver.rpc.exceptions.NoConnectionsAvailable'>: Unable to connect to cluster 037c960b-5b9f-4701-8366-eeda2c09d14e; no connections available.#012/usr/lib/python2.7/dist-packages/twisted/internet/base.py:824:runUntilCurrent#012/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:467:cancel#012/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:423:errback#012/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:490:_startRunCallbacks#012--- <exception caught here> ---#012/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:577:_runCallbacks#012/usr/lib/python2.7/dist-packages/maasserver/rpc/regionservice.py:606:cancelled#012]
Nov 22 19:13:19 maas-trusty-back-may22 maas.node: [ERROR] mier.local: Unable to shut node down: [Failure instance: Traceback: <class 'provisioningserver.rpc.exceptions.NoConnectionsAvailable'>: Unable to connect to cluster 037c960b-5b9f-4701-8366-eeda2c09d14e; no connections available.#012/usr/lib/python2.7/dist-packages/twisted/internet/base.py:824:runUntilCurrent#012/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:467:cancel#012/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:423:errback#012/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:490:_startRunCallbacks#012--- <exception caught here> ---#012/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:577:_runCallbacks#012/usr/lib/python2.7/dist-packages/maasserver/rpc/regionservice.py:606:cancelled#012]
ubuntu@maas-trusty-back-may22:/var/log/maas$ grep "Unable to shut node down" maas.log|grep mier|grep "Nov 22 19:09"
Nov 22 19:09:31 maas-trusty-back-may22 maas.node: [ERROR] mier.local: Unable to shut node down: [Failure instance: Traceback: <class 'provisioningserver.rpc.exceptions.NoConnectionsAvailable'>: Unable to connect to cluster 037c960b-5b9f-4701-8366-eeda2c09d14e; no connections available.#012/usr/lib/python2.7/dist-packages/twisted/internet/base.py:824:runUntilCurrent#012/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:467:cancel#012/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:423:errback#012/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:490:_startRunCallbacks#012--- <exception caught here> ---#012/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:577:_runCallbacks#012/usr/lib/python2.7/dist-packages/maasserver/rpc/regionservice.py:606:cancelled#012]
===========================================================================

Looking some time later, the event log shows the system finally going to the Ready state (~36 minutes later), and another power-off is issued. It is then re-allocated and re-deployed:
===========================================================================
INFO 4 minutes ago Node changed status — From 'Allocated' to 'Deploying'
INFO 4 minutes ago Node changed status — From 'Ready' to 'Allocated' (to oil-slave-2)
INFO 7 minutes ago Node powered off
INFO 7 minutes ago Powering node off
INFO 7 minutes ago Node changed status — From 'Deployed' to 'Ready'
INFO 10 minutes ago Node powered off
INFO 10 minutes ago Powering node off
INFO 46 minutes ago Powering node off >>>>>> ~Nov 22 17:13-Nov 22 17:09 time frame
INFO 49 minutes ago Node powered off
INFO 50 minutes ago Powering node off
INFO 1 hour, 54 minutes ago Node changed status — From 'Deploying' to 'Deployed'
===========================================================================

If I've understood this scenario correctly, we could see other symptoms that could result in other bugs.

There are 75 such errors in the log since the day began (UTC time), and I can trace these errors back a few weeks.

ubuntu@maas-trusty-back-may22:/var/log/maas$ grep "Unable to shut node down" maas.log|grep "Nov 22" |wc -l
75

This may or may not be related, but I have also noticed 466 errors for failed deployments in that time frame:

ubuntu@maas-trusty-back-may22:/var/log/maas$ grep "Nov 22" maas.log|grep "Marking node failed" maas.log| wc -l
466
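One note on the command above: the second `grep` is given `maas.log` as an explicit filename, so it reads the whole file directly and ignores the date-filtered input arriving on the pipe; the count of 466 may therefore include days other than Nov 22. Dropping the filename from the second `grep` makes the date filter apply, as this self-contained check (using a made-up two-line sample log, not the real maas.log) illustrates:

```shell
# Build a two-line sample log, then count "Marking node failed" for
# Nov 22 only. The second grep reads stdin (no filename argument), so
# the date filter from the first grep actually takes effect.
printf '%s\n' \
  'Nov 21 00:01:00 host maas.node: [ERROR] mier.local: Marking node failed' \
  'Nov 22 00:01:00 host maas.node: [ERROR] mier.local: Marking node failed' \
  > sample.log
grep "Nov 22" sample.log | grep "Marking node failed" | wc -l   # prints 1
```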

I'll attach maas logs for analysis.

Larry Michel (lmic)
description: updated
Revision history for this message
Larry Michel (lmic) wrote :

logs attached.

Revision history for this message
Graham Binns (gmb) wrote :

Thank you for the detailed bug report! For completeness, can you also let us know which version of MAAS this was? (1.7, clearly, but is it RC3, 1.7.0, or another 1.7 release?)

The problem as I see it is at least twofold:

 - MAAS fails to surface the fact that the cluster wasn't connected when reporting the power command failure in the UI.
 - The error messages in the log are dreadful (they contain tracebacks when they really shouldn't).

Also, when a cluster is disconnected, MAAS should show a warning to the tune of "One or more clusters are currently disconnected. Visit the clusters page for more information." Did this happen on the MAAS region you were working with?

Changed in maas:
status: New → Triaged
importance: Undecided → High
milestone: none → next
tags: added: confusing-ui log
Revision history for this message
Larry Michel (lmic) wrote :

This is RC3.

ubuntu@maas-trusty-back-may22:~$ dpkg -l |grep maas
ii maas 1.7.0~rc3+bzr3299-0ubuntu1~trusty1 all MAAS server all-in-one metapackage

I haven't been seeing any warning on the MAAS UI pointing to an issue with the cluster being disconnected.

Revision history for this message
Graham Binns (gmb) wrote : Re: [Bug 1395375] Re: Errors logged for failure to shutdown node while not being able to contact cluster

Okay, thanks. Is this reproducible, or has the problem pretty much gone
away right now?

Revision history for this message
Larry Michel (lmic) wrote :

The problem seems to have gone away; it last occurred on Nov 22, when there were 75 of those errors across various servers.

ubuntu@maas-trusty-back-may22:/var/log/maas$ grep "Unable to shut node down" maas.log| grep -i "connect"|grep "Nov 20"|wc -l
0
ubuntu@maas-trusty-back-may22:/var/log/maas$ grep "Unable to shut node down" maas.log| grep -i "connect"|grep "Nov 21"|wc -l
0
ubuntu@maas-trusty-back-may22:/var/log/maas$ grep "Unable to shut node down" maas.log| grep -i "connect"|grep "Nov 22"|wc -l
75
ubuntu@maas-trusty-back-may22:/var/log/maas$ grep "Unable to shut node down" maas.log| grep -i "connect"|grep "Nov 23"|wc -l
0
ubuntu@maas-trusty-back-may22:/var/log/maas$ grep "Unable to shut node down" maas.log| grep -i "connect"|grep "Nov 24"|wc -l
0

Revision history for this message
Julian Edwards (julian-edwards) wrote :

This is dreadful behaviour from MAAS, sorry. We'll get it fixed ASAP.

@GMB, it seems as though there are some "getClientFor()" calls out there that are not catching exceptions. The code cannot assume it will always work.
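[Editorial illustration of the pattern described above. The names below are stand-ins, not MAAS's actual implementation; the real getClientFor() lives in provisioningserver.rpc. The sketch only shows the general idea: callers handle the no-connections case and produce a short, readable message instead of letting a raw Twisted traceback escape into the log.]

```python
class NoConnectionsAvailable(Exception):
    """Stand-in for provisioningserver.rpc.exceptions.NoConnectionsAvailable."""


def get_client_for(cluster_uuid, connections):
    # Hypothetical lookup: raise when the cluster has no RPC connection,
    # mirroring the failure mode seen in the attached logs.
    try:
        return connections[cluster_uuid]
    except KeyError:
        raise NoConnectionsAvailable(
            "Unable to connect to cluster %s; no connections available."
            % cluster_uuid)


def shut_node_down(hostname, cluster_uuid, connections):
    # Defensive caller: catch the exception and return a concise message
    # instead of allowing a stack trace to be logged verbatim.
    try:
        client = get_client_for(cluster_uuid, connections)
    except NoConnectionsAvailable as error:
        return "%s: Unable to shut node down: %s" % (hostname, error)
    client("PowerOff", hostname)
    return "%s: power-off requested" % hostname
```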

Changed in maas:
importance: High → Critical
milestone: next → 1.7.1
Revision history for this message
Graham Binns (gmb) wrote :

On 24 November 2014 at 16:49, Julian Edwards <email address hidden>
wrote:

>
> @GMB, it seems as though there's some "getClientFor()" calls out there
> that are not catching exceptions. The code cannot assume it will always
> work.

Yeah, it's pretty atrocious. I'll get on this presently.

tags: added: oil
Graham Binns (gmb)
Changed in maas:
status: Triaged → In Progress
assignee: nobody → Graham Binns (gmb)
Changed in maas:
milestone: 1.7.1 → 1.7.2
Gavin Panella (allenap)
Changed in maas:
assignee: Graham Binns (gmb) → nobody
Changed in maas:
status: In Progress → New
Revision history for this message
Raphaël Badin (rvb) wrote :

Trying to wrap my head around this bug but I've got a question:

> Seeing a server (mier) that was deployed, then powered off but remained in the deployed state for +30 mns.

When you say "then powered off", do you mean that you *released* the node, or did you just power it off (via the UI/API, or manually)? If you just powered it off, the node should just stay "Deployed".

Changed in maas:
milestone: 1.7.2 → 1.7.3
Changed in maas:
status: New → Invalid