Cleaning can restart in infinite loop in some hardware failure cases
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Ironic | Triaged | Medium | Unassigned | |
Bug Description
In an extreme edge case, when using hardware manager dynamic loading with evaluate_hardware_support, a hardware failure can cause cleaning to restart in an infinite loop.
Reproduction instructions:
1) Have a machine with an intermittently failing piece of hardware, like a disk that sometimes shows up in the OS and sometimes doesn't.
2) Implement a custom hardware manager whose evaluate_hardware_support only reports support when that piece of hardware is present.
3) As the hardware "flaps" in and out of the OS across reboots, IPA will load a different set of hardware managers each time the hardware appears or disappears. This triggers a cleaning restart due to the hardware manager version change.
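The pattern in steps 1-3 can be sketched as follows. This is a minimal, self-contained illustration, not code from IPA: the manager class and device path are hypothetical, and the HardwareSupport constants are reproduced inline so the sketch runs standalone.

```python
import os

# Stand-ins for ironic-python-agent's hardware.HardwareSupport levels,
# inlined here so the sketch is self-contained.
class HardwareSupport:
    NONE = 0
    GENERIC = 1
    MAINLINE = 2
    SERVICE_PROVIDER = 3

class FlakyRaidManager:
    """Hypothetical custom manager that only loads when a flaky device is visible."""
    DEVICE = '/dev/flaky_raid'  # hypothetical device node

    def evaluate_hardware_support(self):
        # On a boot where the device is missing, this manager is not loaded,
        # the combined hardware manager version changes, and the agent's next
        # heartbeat triggers a cleaning restart.
        if os.path.exists(self.DEVICE):
            return HardwareSupport.SERVICE_PROVIDER
        return HardwareSupport.NONE
```

Each reboot where the device's visibility differs from the previous boot produces a different manager set, hence a version change, hence another restart.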
In my testing, with custom hardware managers exposing about 8 steps and 3 reboots, I saw machines restart cleaning several times in a thirty-minute period.
I've thought of a few potential solutions:
1) Update documentation for hardware managers to stop encouraging dynamically loading them based on present hardware.
- Pros: Reliable behavior for any booted agent, regardless of hardware.
- Cons: Depending on the complexity of cleaning steps, may require different agents for different hardware.
2) Have Ironic keep track of how many times cleaning has restarted, and CLEANFAIL the node if cleaning has restarted $clean_restart_max times.
- Pros: Would prevent similar bugs in this same vein. Allows deployers to decide how many cleaning restarts are reasonable.
- Cons: It's perfectly reasonable for someone to want to deploy multiple agents while a node is in a cleaning cycle. This would invalidate that use case.
I'm not sure what the right path is, but this is a recipe for badness -- Ironic should be able to deal reasonably with hardware issues and in this case it does not.
I would do the latter with $clean_restart_max being pretty big (maybe even calculated based on a number of steps).
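The restart-cap idea in option 2 could look something like the sketch below. Everything here is hypothetical (the field names, the callback, and the per-step allowance); it just shows a cap that scales with the number of clean steps, as suggested above.

```python
def default_restart_max(num_clean_steps, per_step_allowance=3):
    # Hypothetical heuristic: scale the cap with the number of clean steps,
    # with a floor so short clean cycles still tolerate a few agent reboots.
    return max(10, num_clean_steps * per_step_allowance)

class Node:
    """Minimal stand-in for an Ironic node; not the real Node object."""
    def __init__(self, num_clean_steps):
        self.clean_restarts = 0
        self.clean_restart_max = default_restart_max(num_clean_steps)

def on_cleaning_restart(node):
    """Hypothetical hook run when a hardware manager version change restarts cleaning."""
    node.clean_restarts += 1
    if node.clean_restarts > node.clean_restart_max:
        return 'CLEANFAIL'   # operator intervention required
    return 'CLEANING'        # allow this restart and continue
```

With 8 clean steps the default cap would be 24 restarts, which keeps the legitimate multi-agent-deploy case working while still bounding the flapping-hardware loop described above.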