Cleaning can restart in infinite loop in some hardware failure cases
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Ironic | Triaged | Medium | Unassigned | |
Bug Description
In an extreme edge case, when using hardware manager dynamic loading with evaluate_hardware_support, a hardware failure can cause cleaning to restart in an infinite loop.
Reproduction instructions:
1) Have a machine with an intermittently failing piece of hardware, like a disk that sometimes shows up in the OS and sometimes doesn't.
2) Implement a custom hardware manager whose evaluate_hardware_support only reports support when that piece of hardware is present.
3) As the hardware "flaps" in and out of the OS across reboots, IPA will load a different set of hardware managers each time the hardware appears or disappears. This triggers a cleaning restart due to the hardware manager version change.
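The pattern in steps 1-3 can be sketched as follows. This is a minimal, self-contained illustration, not code from IPA: the manager class and device path are hypothetical, and the HardwareSupport constants are reproduced inline so the sketch runs standalone.

```python
import os

# Stand-ins for ironic-python-agent's hardware.HardwareSupport levels,
# inlined here so the sketch is self-contained.
class HardwareSupport:
    NONE = 0
    GENERIC = 1
    MAINLINE = 2
    SERVICE_PROVIDER = 3

class FlakyRaidManager:
    """Hypothetical custom manager that only loads when a flaky device is visible."""
    DEVICE = '/dev/flaky_raid'  # hypothetical device node

    def evaluate_hardware_support(self):
        # On a boot where the device is missing, this manager is not loaded,
        # the combined hardware manager version changes, and the agent's next
        # heartbeat triggers a cleaning restart.
        if os.path.exists(self.DEVICE):
            return HardwareSupport.SERVICE_PROVIDER
        return HardwareSupport.NONE
```

Each reboot where the device's visibility differs from the previous boot produces a different manager set, hence a version change, hence another restart.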
In my testing, with custom hardware managers exposing about 8 steps and 3 reboots, I saw machines restart cleaning several times in a thirty-minute period.
I've thought of a few potential solutions:
1) Update documentation for hardware managers to stop encouraging dynamically loading them based on present hardware.
- Pros: Reliable behavior for any booted agent, regardless of hardware.
- Cons: Depending on the complexity of cleaning steps, may require different agents for different hardware.
2) Have Ironic keep track of how many times cleaning has restarted, and CLEANFAIL the node if cleaning has restarted $clean_restart_max times.
- Pros: Would prevent similar bugs in this same vein. Allows deployers to decide how many cleaning restarts are reasonable.
- Cons: It's perfectly reasonable for someone to want to deploy multiple agents while a node is in a cleaning cycle. This would invalidate that use case.
I'm not sure what the right path is, but this is a recipe for badness -- Ironic should be able to deal reasonably with hardware issues and in this case it does not.
I would do the latter with $clean_restart_max being pretty big (maybe even calculated based on a number of steps).
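The restart-cap idea in option 2 could look something like the sketch below. Everything here is hypothetical (the field names, the callback, and the per-step allowance); it just shows a cap that scales with the number of clean steps, as suggested above.

```python
def default_restart_max(num_clean_steps, per_step_allowance=3):
    # Hypothetical heuristic: scale the cap with the number of clean steps,
    # with a floor so short clean cycles still tolerate a few agent reboots.
    return max(10, num_clean_steps * per_step_allowance)

class Node:
    """Minimal stand-in for an Ironic node; not the real Node object."""
    def __init__(self, num_clean_steps):
        self.clean_restarts = 0
        self.clean_restart_max = default_restart_max(num_clean_steps)

def on_cleaning_restart(node):
    """Hypothetical hook run when a hardware manager version change restarts cleaning."""
    node.clean_restarts += 1
    if node.clean_restarts > node.clean_restart_max:
        return 'CLEANFAIL'   # operator intervention required
    return 'CLEANING'        # allow this restart and continue
```

With 8 clean steps the default cap would be 24 restarts, which keeps the legitimate multi-agent-deploy case working while still bounding the flapping-hardware loop described above.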