IPA fails to reboot node after image deployment

Bug #1500891 reported by Barry McLarnon
26
This bug affects 5 people
Affects Status Importance Assigned to Milestone
Ironic
Triaged
Medium
Unassigned

Bug Description

We are using agent_ipmitool with an Ironic Python Agent ramdisk to deploy images using a standalone Ironic (kilo) deployment. Around 60-70% of the time, when we are performing a deployment using 'ironic node-set-provision-state <node_uuid> active', the image is written to disk and reboots successfully. However, sometimes after the image deployment is finished, and the node is rebooted, we get 'deploy failed' and the following error:

  Error rebooting node <node_uuid>. Error: Failed to set node power state to power on.

Even from this 'deploy failed' state, manually powering on the node and booting to local disk shows that the image has been written successfully. The attached log states "Not going to change_node_power_state because current state = requested state = 'power off'", but then immediately fails with an Asynchronous exception where it fails to turn the power back on.

It seems that Conductor isn't giving the node enough time to power off before attempting to power it on again, so the state transition somehow fails.

Revision history for this message
Barry McLarnon (bmclarnon) wrote :
Revision history for this message
Barry McLarnon (bmclarnon) wrote :

I should mention that the environment Ironic is running on is Suse Linux Enterprise Server (SLES) 12.

Revision history for this message
Dmitry Tantsur (divius) wrote :

hi! Could you provide the complete conductor log for this issue? You're using the agent_ipmitool driver, right?

Changed in ironic:
status: New → Incomplete
Revision history for this message
Barry McLarnon (bmclarnon) wrote :

Sorry for the delay in responding. Yes, we're using agent_ipmitool. I have attached the full log.

Dmitry Tantsur (divius)
Changed in ironic:
status: Incomplete → Confirmed
importance: Undecided → Medium
tags: added: agent conductor
Revision history for this message
Barry McLarnon (bmclarnon) wrote :

We have recently had a recurrence of this bug with some of our baremetal nodes after not seeing it for a while. We are using Ironic 5.1.0 in standalone mode, and an IPA image based on the stable/mitaka branch (with a custom HW manager).

This time, we have both the Conductor log, and shipped systemd journals from the IPA image, both from a successful and a failed deployment (no clear difference in these logs from our analysis).

Changed in ironic:
status: Confirmed → Triaged
Revision history for this message
Andres Toomsalu (andres-active) wrote :

We are seeing exact same behaviour with RHOSP12 (Pike) and HP 460c Gen9 bladeservers.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.