Deployment Failed -- Failed to get installation results

Bug #1424080 reported by Larry Michel
This bug affects 1 person
Affects: MAAS
Status: Fix Released
Importance: Critical
Assigned to: Raphaël Badin
Milestone: 1.8.0

Bug Description

While deploying OpenStack, we had 3 failures over the past couple of days caused by servers that failed to deploy as a result of failing to get the installation results.

From console.txt:

2015-02-18 06:49:03,108 [DEBUG] deployer.env: Delta unit: ceph/2 change:installed
2015-02-18 06:49:08,108 [DEBUG] deployer.env: Delta unit: ceph/2 change:started
2015-02-18 06:49:28,113 [DEBUG] deployer.env: Delta unit: rabbitmq-server/0 change:installed
2015-02-18 06:49:43,115 [DEBUG] deployer.env: Delta unit: mongodb/0 change:installed
2015-02-18 06:50:08,119 [DEBUG] deployer.env: Delta unit: mongodb/0 change:started
2015-02-18 06:53:18,146 [DEBUG] deployer.env: Delta unit: keystone/0 change:installed
2015-02-18 06:53:53,151 [DEBUG] deployer.env: Delta unit: heat/0 change:installed
2015-02-18 06:53:58,152 [DEBUG] deployer.env: Delta unit: heat/0 change:started
2015-02-18 06:54:03,171 [DEBUG] deployer.env: Delta unit: keystone/0 change:started
2015-02-18 06:55:48,169 [DEBUG] deployer.env: Delta unit: ceilometer/0 change:installed
2015-02-18 06:56:08,172 [DEBUG] deployer.env: Delta unit: ceilometer/0 change:started
2015-02-18 06:56:13,175 [DEBUG] deployer.env: Delta unit: rabbitmq-server/0 change:started
2015-02-18 07:52:40,339 [DEBUG] deployer.env: Connecting to environment...
2015-02-18 07:52:41,771 [DEBUG] deployer.env: Connected to environment
2015-02-18 07:52:41,773 [ERROR] deployer.import: Reached deployment timeout.. exiting
2015-02-18 07:52:41,774 [ERROR] oil_ci.deploy.oil_deployer: Deployment failed:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/oil_ci/deploy/oil_deployer.py", line 140, in deploy
    debug=True, verbose=True)
  File "/usr/lib/python2.7/dist-packages/oil_ci/juju/juju_deployer.py", line 85, in run_deployer
    importer.Importer(env, deploy, options=opts).run()
  File "/usr/lib/python2.7/dist-packages/deployer/action/importer.py", line 214, in run
    self.check_timeout()
  File "/usr/lib/python2.7/dist-packages/deployer/action/importer.py", line 165, in check_timeout
    raise ErrorExit()
ErrorExit
2015-02-18 07:52:41,793 [INFO] oil_ci.deploy.oil_deployer: Saving deployment artifacts to: ./artifacts
+ rc=1
+ echo 'Deployment returned: 1'
Deployment returned: 1

From juju_status.yaml:

  '2':
    agent-state: pending
    containers:
      2/lxc/0:
        instance-id: pending
        series: trusty
      2/lxc/1:
        instance-id: pending
        series: trusty
    dns-name: hayward-63.oil
    hardware: arch=amd64 cpu-cores=8 mem=65536M tags=hw-ok,oil-slave-3,hardware-sm15k
    instance-id: /MAAS/api/1.0/nodes/node-9eb864f4-c4cd-11e3-824b-00163efc5068/
    series: trusty

From maas logs:

ubuntu@maas-trusty-back-may22:~$ grep "Feb 18 06:" /var/log/maas/maas.log|grep hayward-63
.....
Feb 18 06:27:29 maas-trusty-back-may22 maas.node: [INFO] hayward-63 allocated to user oil-slave-6
Feb 18 06:27:35 maas-trusty-back-may22 maas.node: [INFO] hayward-63: Starting monitor: {u'deadline': datetime.datetime(2015, 2, 18, 7, 7, 35, 665510, tzinfo=<twisted.protocols.amp._FixedOffsetTZInfo object at 0x7f5e172e4d90>), u'id': u'node-9eb864f4-c4cd-11e3-824b-00163efc5068', u'context': {u'node_status': 9, u'timeout': 2400.0}}
Feb 18 06:27:36 maas-trusty-back-may22 maas.power: [INFO] Changing power state (on) of node: hayward-63 (node-9eb864f4-c4cd-11e3-824b-00163efc5068)
Feb 18 06:28:02 maas-trusty-back-may22 maas.power: [INFO] Changed power state (on) of node: hayward-63 (node-9eb864f4-c4cd-11e3-824b-00163efc5068)
Feb 18 06:31:44 maas-trusty-back-may22 maas.node: [INFO] hayward-63: Stopping monitor: node-9eb864f4-c4cd-11e3-824b-00163efc5068
Feb 18 06:31:44 maas-trusty-back-may22 maas.node: [ERROR] hayward-63: Marking node failed: Failed to get installation result.
Feb 18 06:36:31 maas-trusty-back-may22 maas.power: [INFO] hayward-63: Power state has changed from on to off.

Revision history for this message
Blake Rouse (blake-rouse) wrote :

This means the installation failed. This looks like incorrect behavior: it should mark the node as failed deployment, not mark it broken.

This actually means the curtin installation failed on the node. Can you check the installation results output for that node to get the actual error for why the node failed to install?

Changed in maas:
status: New → Triaged
importance: Undecided → Critical
milestone: none → next
Changed in maas:
status: Triaged → Incomplete
Revision history for this message
Andres Rodriguez (andreserl) wrote :

Larry, Jason,

Does this continue to be the cause of deployment failures?

IMHO, this is probably something in OIL itself. What I see in the OIL logs is the following:

2015-02-18 07:52:41,773 [ERROR] deployer.import: Reached deployment timeout.. exiting

This leads me to believe that the deployer failed for whatever reason and is releasing the environment, moving the machines from Deploying/Deployed back to Ready; hence we don't see any installation results on the machines, because there are none.

Have we confirmed that when we see this the machines are actually in 'Failed Deployment' state?

Revision history for this message
Raphaël Badin (rvb) wrote :

> This means the installation failed. This looks to be doing the incorrect behavior as it should mark the node
> as failed deployment not marking it broken.

Where do you see that the node got marked "Broken"? It seems the node ended up in "Failed deployment", as expected. Also, reading the API code, it seems it will DTRT (do the right thing) if it logs "Marking node failed: Failed to get installation result.".
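
For reference, the flow being described is roughly the following. This is a simplified illustrative sketch, not the actual MAAS source; the class, the mark_failed method and the status constants are assumptions based on the log messages in this report:

# Simplified, illustrative sketch of the monitor-expiry path; names and
# status values are assumptions, not the real MAAS code.
NODE_STATUS_DEPLOYING = 9           # matches "node_status: 9" in the monitor context above
NODE_STATUS_FAILED_DEPLOYMENT = 11  # hypothetical constant for "Failed deployment"

class Node(object):
    def __init__(self, hostname, status):
        self.hostname = hostname
        self.status = status

    def mark_failed(self, error_description):
        # Log the same message seen in maas.log, then move the node into the
        # corresponding failed state (Deploying -> Failed deployment), never Broken.
        print("%s: Marking node failed: %s" % (self.hostname, error_description))
        self.status = NODE_STATUS_FAILED_DEPLOYMENT

def handle_monitor_expired(node):
    # The deployment monitor timed out without getting an installation result.
    node.mark_failed("Failed to get installation result.")

handle_monitor_expired(Node("hayward-63", NODE_STATUS_DEPLOYING))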

Revision history for this message
Larry Michel (lmic) wrote :

Yes, a system would only get marked broken if we pulled it out of the queue and marked it broken before it got redeployed for another build pipeline. Also, the error is about failing to retrieve the installation file, so there likely won't be any log.

Andres, during the past week we averaged more than one failure per day. We had as many as 3 such failures on the 23rd alone, and averaged more than one per day between the 16th and the 23rd.

Feb 16 17:58:39 maas-trusty-back-may22 maas.node: [ERROR] hayward-27: Marking node failed: Failed to get installation result.
Feb 17 15:52:54 maas-trusty-back-may22 maas.node: [ERROR] hayward-45: Marking node failed: Failed to get installation result.
Feb 18 00:16:13 maas-trusty-back-may22 maas.node: [ERROR] sunset.local: Marking node failed: Failed to get installation result.
Feb 18 06:31:44 maas-trusty-back-may22 maas.node: [ERROR] hayward-63: Marking node failed: Failed to get installation result.
Feb 18 18:16:15 maas-trusty-back-may22 maas.node: [ERROR] hayward-7: Marking node failed: Failed to get installation result.
Feb 18 19:13:02 maas-trusty-back-may22 maas.node: [ERROR] hayward-24: Marking node failed: Failed to get installation result.
Feb 20 06:54:07 maas-trusty-back-may22 maas.node: [ERROR] hayward-36: Marking node failed: Failed to get installation result.
Feb 20 06:54:07 maas-trusty-back-may22 maas.node: [ERROR] bailey.local: Marking node failed: Failed to get installation result.
Feb 22 18:23:00 maas-trusty-back-may22 maas.node: [ERROR] hayward-17.oil: Marking node failed: Failed to get installation result.
Feb 23 19:56:18 maas-trusty-back-may22 maas.node: [ERROR] hayward-31: Marking node failed: Failed to get installation result.
Feb 23 19:56:40 maas-trusty-back-may22 maas.node: [ERROR] hayward-26: Marking node failed: Failed to get installation result.
Feb 23 19:57:45 maas-trusty-back-may22 maas.node: [ERROR] hayward-40: Marking node failed: Failed to get installation result.

Looking at past 3 months:

ubuntu@maas-trusty-back-may22:~$ grep "Failed to get installation result" /var/log/maas/maas.log|grep "Dec " |wc -l
75
ubuntu@maas-trusty-back-may22:~$ grep "Failed to get installation result" /var/log/maas/maas.log|grep "Jan " |wc -l
117
ubuntu@maas-trusty-back-may22:~$ grep "Failed to get installation result" /var/log/maas/maas.log|grep "Feb " |wc -l
33
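
For what it's worth, the same per-month breakdown can be produced in a single pass over the log. A small illustrative script, assuming the syslog-style "Mon DD HH:MM:SS" timestamps shown above:

# Count "Failed to get installation result" entries per month in one pass.
from collections import Counter

counts = Counter()
with open("/var/log/maas/maas.log") as log:
    for line in log:
        if "Failed to get installation result" in line:
            counts[line.split()[0]] += 1  # first field is the month, e.g. "Feb"

for month, total in sorted(counts.items()):
    print("%s: %d" % (month, total))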

Revision history for this message
Blake Rouse (blake-rouse) wrote :

Looking at the MAAS code, this ERROR is raised when the installation fails on the node. This is really a bad error and needs to be fixed. Do you have the installation results from these failed installations? I bet curtin will have thrown an error, causing this error to be reported.

Revision history for this message
Ashley Lai (alai) wrote :

We hit this error again on prodstack.

Error from console:
ERROR failed to bootstrap environment: bootstrap instance started but did not change to Deployed state: instance "/MAAS/api/1.0/nodes/node-f64f188e-ae16-11e3-b194-00163efc5068/" is started but not deployed

maas log:
Mar 23 14:54:40 maas-trusty-back-may22 maas.bootresources: [INFO] Importing images from source: http://maas.ubuntu.com/images/ephemeral-v2/daily/
Mar 23 14:54:44 maas-trusty-back-may22 maas.bootresources: [INFO] Finished importing of boot images from 1 source(s).
Mar 23 14:54:44 maas-trusty-back-may22 maas.import-images: [INFO] Started importing boot images.
Mar 23 14:54:45 maas-trusty-back-may22 maas.import-images: [INFO] Finished importing boot images, the region does not have any new images.
Mar 23 14:54:53 maas-trusty-back-may22 maas.node: [INFO] baubas.oil: Stopping monitor: node-3fea91d2-ae0b-11e3-b194-00163efc5068
Mar 23 14:54:53 maas-trusty-back-may22 maas.node: [ERROR] baubas.oil: Marking node failed: Failed to get installation result.
Mar 23 14:55:02 maas-trusty-back-may22 maas.lease_upload_service: [INFO] Uploading 141 DHCP leases to region controller.
Mar 23 14:55:06 maas-trusty-back-may22 maas.node: [INFO] shahbaz.local: Stopping monitor: node-4951e5e6-ae9b-11e3-9074-00163efc5068
Mar 23 14:55:06 maas-trusty-back-may22 maas.node: [ERROR] shahbaz.local: Marking node failed: Failed to get installation result.
Mar 23 14:55:12 maas-trusty-back-may22 maas.node: [INFO] reading.local: Stopping monitor: node-f64f188e-ae16-11e3-b194-00163efc5068
Mar 23 14:55:12 maas-trusty-back-may22 maas.node: [ERROR] reading.local: Marking node failed: Failed to get installation result.
Mar 23 14:55:32 maas-trusty-back-may22 maas.node: [INFO] gurley.local: Stopping monitor: node-cb79abaa-3f58-11e4-b38a-00163eca07b6
Mar 23 14:55:32 maas-trusty-back-may22 maas.node: [ERROR] gurley.local: Marking node failed: Failed to get installation result.
Mar 23 14:56:04 maas-trusty-back-may22 maas.node: [INFO] hayward-03: Stopping monitor: node-a1225948-c4cd-11e3-8102-00163efc5068
Mar 23 14:56:04 maas-trusty-back-may22 maas.node: [ERROR] hayward-03: Marking node failed: Failed to get installation result.

Revision history for this message
Larry Michel (lmic) wrote :

One thing to note in the logs above is that the system had just finished importing boot images when we hit this error on multiple systems.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

I caught a system in MAAS still in the failed deployment state after hitting this:

http://paste.ubuntu.com/10861546/

This is a fusionIO system, and the problem is that something in the dkms build/install of the package is removing everything in /tmp, including curtin's state, which breaks curtin.
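
If it helps to confirm that, one quick check is to snapshot /tmp around the package install inside the target. This is an illustrative sketch only; "fio-dkms" is a placeholder package name, not the real one:

# Snapshot /tmp before and after a dkms package install to see what gets wiped.
import os
import subprocess

def snapshot(path="/tmp"):
    return set(os.listdir(path))

before = snapshot()
subprocess.call(["apt-get", "install", "-y", "fio-dkms"])  # placeholder package
after = snapshot()

removed = sorted(before - after)
if removed:
    print("Removed from /tmp during install: %s" % ", ".join(removed))
else:
    print("Nothing was removed from /tmp.")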

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

It's likely there is more than one root cause for this. This error is really just "curtin reported that the installation failed", which can have a variety of causes. I think this bug should focus on correcting the misleading error message, and separate bugs should be opened for other sources of failure, like the iomemory/curtin/dkms issue in the log I posted above.

Revision history for this message
Raphaël Badin (rvb) wrote :

Just fixed the error message that shows up in the node event log: the original message was confusing; it said that the installation log could not be uploaded when it should have said that the installation simply failed.
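
In other words, the change amounts to something like the following. This is only an illustrative before/after of the wording; the exact new text is an assumption, not the actual diff:

# Illustrative before/after of the event-log wording change described above;
# the new text here is an assumption, not the real MAAS diff.
OLD_EVENT_MESSAGE = "Failed to get installation result."  # read as "could not upload the log"
NEW_EVENT_MESSAGE = "Installation failed."                # what actually happened

print("%s  ->  %s" % (OLD_EVENT_MESSAGE, NEW_EVENT_MESSAGE))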

Changed in maas:
status: Incomplete → Fix Committed
milestone: next → 1.8.0
assignee: nobody → Raphaël Badin (rvb)
Changed in maas:
status: Fix Committed → Fix Released