Errors logged for failure to shutdown node while not being able to contact cluster
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
MAAS | Invalid | Critical | Unassigned | |
Bug Description
I am seeing a server (mier) that was deployed and then powered off, but that remained in the Deployed state for more than 30 minutes. I started looking in the logs and am seeing errors that seem to point to this inconsistent state. Shouldn't the system go to the Ready state when it is powered off by MAAS?
From MAAS UI:
=======
Status
Deployed
...
Owner
oil
...
Installation output
install log curtin install log
Commissioning output
9 output files
Latest node events
Level Emitted Event
INFO 29 minutes ago Powering node off
INFO 33 minutes ago Node powered off
INFO 33 minutes ago Powering node off
INFO 1 hour, 38 minutes ago Node changed status — From 'Deploying' to 'Deployed'
INFO 1 hour, 38 minutes ago PXE Request — local boot
=======
The MAAS log shows multiple attempts to shut down mier. So the server is powered off but was
subsequently not released because of a failure to connect to the cluster? I did verify that the system is powered off
with the check power state option.
=======
ubuntu@
Nov 22 19:09:31 maas-trusty-
Nov 22 19:13:19 maas-trusty-
ubuntu@
Nov 22 19:09:31 maas-trusty-
=======
Looking some time later, the event log shows the system finally going to the Ready state (~36 minutes later), and another power-off is issued. It is then re-allocated and re-deployed:
=======
INFO 4 minutes ago Node changed status — From 'Allocated' to 'Deploying'
INFO 4 minutes ago Node changed status — From 'Ready' to 'Allocated' (to oil-slave-2)
INFO 7 minutes ago Node powered off
INFO 7 minutes ago Powering node off
INFO 7 minutes ago Node changed status — From 'Deployed' to 'Ready'
INFO 10 minutes ago Node powered off
INFO 10 minutes ago Powering node off
INFO 46 minutes ago Powering node off >>>>>> ~Nov 22 17:13-Nov 22 17:09 time frame
INFO 49 minutes ago Node powered off
INFO 50 minutes ago Powering node off
INFO 1 hour, 54 minutes ago Node changed status — From 'Deploying' to 'Deployed'
=======
If I've understood this scenario correctly, then we could see other symptoms that could result in other bugs.
There are 75 such errors in the log since the day began (UTC time), and I can trace these errors back a few weeks.
ubuntu@
75
This may or may not be related: I have also noticed 466 errors for failed deployments in that time frame.
ubuntu@
466
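The two counts above came from shell commands whose text did not survive in this report. A rough sketch of the same kind of count in Python might look like the following. The function name `count_matching_lines`, the sample lines, and the search pattern are all hypothetical placeholders, not the actual commands or MAAS log messages.

```python
import re
from typing import Iterable


def count_matching_lines(lines: Iterable[str], pattern: str) -> int:
    """Count the lines that contain a match for the regular expression."""
    regex = re.compile(pattern)
    return sum(1 for line in lines if regex.search(line))


# Hypothetical sample lines for illustration only; these are NOT
# taken from the attached MAAS logs.
sample = [
    "Nov 22 19:09:31 host maas: ERROR example failure A",
    "Nov 22 19:10:02 host maas: INFO example event",
    "Nov 22 19:13:19 host maas: ERROR example failure A",
]

# The pattern is a placeholder; substitute the real error text.
print(count_matching_lines(sample, r"ERROR example failure A"))
```

An open file object can be passed as `lines` to count matches in a log file directly, since iterating over a file yields one line at a time.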
I'll attach maas logs for analysis.
description: | updated |
tags: | added: oil |
Changed in maas: | |
status: | Triaged → In Progress |
assignee: | nobody → Graham Binns (gmb) |
Changed in maas: | |
milestone: | 1.7.1 → 1.7.2 |
Changed in maas: | |
assignee: | Graham Binns (gmb) → nobody |
Changed in maas: | |
status: | In Progress → New |
Changed in maas: | |
milestone: | 1.7.2 → 1.7.3 |
Changed in maas: | |
status: | New → Invalid |
logs attached.