Comment 6 for bug 1766763

Julia Kreger (juliaashleykreger) wrote :

The problem is not a failure to schedule, but a failure of the deployment itself, which leaves the instance in ERROR state shortly after the node spawn action occurs in nova. In the baremetal context we would never be able to re-use the port anyway, because it has changed: the binding profile has been set/asserted and changed... so realistically we might not be able to re-use the port or the IP allocation, which I believe it already does.
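
For illustration only, a minimal openstacksdk sketch (assuming auth is configured in the environment and "PORT_ID" stands in for the real port) of what "the binding has changed" looks like on the port:

# Hypothetical sketch; PORT_ID is a placeholder for the port in question.
import openstack

conn = openstack.connect()
port = conn.network.get_port("PORT_ID")

# Once nova/ironic have bound the port these fields are no longer empty,
# which is why simply re-attaching the same port to a new instance is unsafe.
print(port.binding_host_id)   # host the port was bound to
print(port.binding_vif_type)  # "unbound" before binding, something else after
print(port.binding_profile)   # e.g. local_link_information for baremetal ports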

Thinking about it a little further: with any complex network involving an ML2 driver, the port would have to be completely detached, unbound, and removed, and I think the only way to do that would be through deletion. ML2 drivers may have wired networks through switch fabrics to support an instance on a compute node, and just moving that port may orphan that configuration in the switch fabric.

That same issue actually exists for baremetal as well, with an ML2 driver mapping networks through to a compute host.

The only way to avoid that would be to detach, unbind, and then sanity-check the remaining metadata on the port (MAC address included, because a conflict is a hard failure for neutron) to ensure it matches the initial state; if something does not match the initial state, delete it. Then again, deleting it might just be easier... :|
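
To make that concrete, here is a rough sketch of that flow with openstacksdk. This is not the actual heat/nova code path, just an illustration; the "initial" values would have to be recorded at port creation time, and the function name is made up:

# Hypothetical sketch: unbind the port, then keep it only if its remaining
# metadata still matches what we recorded when we created it.
import openstack

conn = openstack.connect()

def detach_unbind_and_check(port_id, initial_mac, initial_fixed_ips):
    # Clear the binding so neutron/ML2 tears down any fabric-level wiring.
    port = conn.network.update_port(
        port_id,
        device_id="",
        device_owner="",
        binding_host_id="",
        binding_profile={},
    )
    # Sanity-check what is left; a MAC conflict is a hard failure for
    # neutron, so anything unexpected means the port gets deleted.
    if port.mac_address != initial_mac or port.fixed_ips != initial_fixed_ips:
        conn.network.delete_port(port_id)
        return None
    return port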

Anyway, I'm happy to add a check so replacement only happens if the prior instance is in ERROR state, which should still enable rollback in other cases, once rollback works again. (Granted, an instance in ERROR state may be recoverable through rebuild or some other action, but I'm not sure that is entirely in scope for an orchestration tool.)
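
For reference, the ERROR-state check itself is trivial with openstacksdk; sketch only, with "SERVER_ID" as a placeholder, since the real check would live wherever the replacement decision is made:

# Sketch: decide replacement vs. leaving things to rollback based on the
# prior instance's status.
import openstack

conn = openstack.connect()
server = conn.compute.get_server("SERVER_ID")

if server.status == "ERROR":
    # Prior deploy failed outright; replacing (delete + recreate) is safe.
    replace = True
else:
    # Leave other states alone so rollback can still handle them.
    replace = False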