No retry for removing instance in case of ironic service down
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Confirmed | Medium | Unassigned |
Bug Description
When the ironic service is briefly down (e.g. the ironic conductor is down), removing an instance immediately puts the instance into error state, without any status polling.
After investigation, it points to the code segment: https:/
When the conductor is down, the exception is raised, so ironic will not apply the configuration CONF.ironic.
Reproduce:
1. Boot a baremetal instance with nova boot.
2. Reboot the ironic conductor node (or stop the conductor service).
3. Remove the instance while it is spawning.
4. The instance goes into error state immediately, not after 2 minutes (the default value).
As a comparison, simply comment out L983-984 and reproduce. It seems that with L983-984 commented out, if the ironic conductor comes back up before nova marks the instance as error, then running nova delete again also deletes the ironic instance info. If not, the instance on the ironic node is not removed when the instance is removed from nova.
Still needs investigation.
Changed in nova: | |
assignee: | nobody → Wang KaiFeng (kaifeng) |
description: | updated |
Changed in nova: | |
assignee: | Wang KaiFeng (kaifeng) → nobody |
Changed in nova: | |
status: | New → Confirmed |
importance: | Undecided → Medium |
Thanks for confirming this bug. Since I do not have deep insight into nova, the following statements may not be accurate, but this is what I found, and I put it here for reference.
I think the case is that when the virt driver raises an exception during instance destroy, nova marks the instance as error, and when the user then deletes this instance, nova never calls the virt driver, so ironic has no chance to get cleaned up.
Polling the provision state is not the major problem; it is a consequence of the first issue. If the driver can't successfully send the request to the ironic API, waiting for 2 minutes is meaningless.
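To make the suspected sequence concrete, here is a toy model in plain Python. It is not nova code: `ToyDriver`, `DriverUnavailable` and `delete_instance` are hypothetical names, and the branch that skips the driver for an errored instance only encodes the behaviour I suspect above, which I have not confirmed in nova.

```python
class DriverUnavailable(Exception):
    """Hypothetical stand-in for the error raised while the conductor is down."""


class ToyDriver:
    """Toy virt driver that tracks whether the ironic node still holds the instance."""

    def __init__(self):
        self.available = False
        self.node_has_instance = True

    def destroy(self):
        if not self.available:
            raise DriverUnavailable("conductor is down")
        self.node_has_instance = False


def delete_instance(instance, driver):
    """Toy delete path following the description above (an assumption, not nova)."""
    if instance["state"] == "error":
        # Suspected behaviour: the driver is never called for an errored instance.
        instance["state"] = "deleted"
        return
    try:
        driver.destroy()
        instance["state"] = "deleted"
    except DriverUnavailable:
        instance["state"] = "error"


driver = ToyDriver()
instance = {"state": "active"}

delete_instance(instance, driver)  # conductor down: instance goes to error
state_after_first_delete = instance["state"]

driver.available = True            # conductor is back
delete_instance(instance, driver)  # second delete never reaches the driver
```

In this model the instance ends up deleted on the nova side while `driver.node_has_instance` is still true, which matches the leftover instance info on the ironic node.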
Possibly there are two ways to address this bug:
1. nova does not remove an instance in error state without calling the virt driver. When the user deletes the instance, the virt driver gets a chance to be called, so the provisioning request can be sent to the ironic API again. nova never deletes an instance without a success acknowledgement from the virt driver.
2. Add a retry mechanism to the provisioning request in the ironic driver.
I don't know if method 1 is reasonable, but it seems logical to me based on my current knowledge.
Method 2 is definitely a workaround, but it is easy to adopt and works when the service is unavailable only for a short time. This is what I do downstream.
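A minimal sketch of what method 2 could look like, in plain Python. This is not the actual ironic driver code or my downstream patch: `ServiceUnavailable`, `call_with_retry` and `flaky_unprovision` are hypothetical names, and the retry count and interval are illustrative values, not nova or ironic configuration options.

```python
import time


class ServiceUnavailable(Exception):
    """Hypothetical stand-in for the error seen while ironic is unreachable."""


def call_with_retry(func, max_retries=3, interval=2.0, sleep=time.sleep):
    """Call func(); on ServiceUnavailable, retry up to max_retries more times."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except ServiceUnavailable:
            if attempt == max_retries:
                # Retries exhausted: let the caller see the original error.
                raise
            sleep(interval)


# Usage: a fake provisioning request that fails twice, then succeeds.
calls = {"n": 0}


def flaky_unprovision():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ServiceUnavailable("conductor is down")
    return "deleted"


result = call_with_retry(flaky_unprovision, max_retries=5, interval=0)
```

The wrapper only helps when the outage is shorter than roughly `max_retries * interval`; beyond that the error still propagates and the instance would go to error state as before.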