Attempts to use same node multiple times when many nodes created simultaneously (race condition?)
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ironic | New | Undecided | Unassigned |
Bug Description
# Problem description
When instantiating a modest number (ten or so) of Ironic nodes simultaneously via Nova using Terraform, I find that sometimes a few instances immediately enter the Error state.
The final exception message is "Failure prepping block device.", which I suspect is a red herring, since further up the exception trace is the following:
nova.
node 228437bc-
the instance 168c5e97-
Here node 228437bc[...] is a (previously available) Ironic node which, upon inspection, is now running one of the other instances created in the batch. Instance 168c5e97[...] is the instance reporting this error.
Full stack trace here: https:/
Also found in a Nova log file is:
2023-06-22 08:46:39.796 217494 WARNING nova.scheduler.
My interpretation is that there may be a race condition that causes a single Ironic node to be offered to two instances created in quick succession. Only the first claim on the node succeeds, and the second instance goes into the Error state.
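To make the suspected failure pattern concrete, here is a toy model of it. This is only a sketch of my hypothesis, not Nova or Ironic code: the `Inventory` class, `schedule_batch` function, and the node and instance names are all hypothetical, and the assumption that both requests select from the same stale view of available nodes is exactly the part I have not confirmed.

```python
# Toy model only: not Nova or Ironic code. All names here are hypothetical.


class NodeInUse(Exception):
    """Raised when an instance tries to claim a node that is already taken."""


class Inventory:
    def __init__(self, node_ids):
        # Maps node id -> id of the instance holding it (None means available).
        self.owner = {node_id: None for node_id in node_ids}

    def available(self):
        # Snapshot of free nodes; it may be stale by the time it is used.
        return [n for n, inst in self.owner.items() if inst is None]

    def claim(self, node_id, instance_id):
        # The claim itself is checked; the race is in the earlier selection.
        if self.owner[node_id] is not None:
            raise NodeInUse(
                f"node {node_id} is already held by {self.owner[node_id]}"
            )
        self.owner[node_id] = instance_id


def schedule_batch(inventory, instance_ids):
    """Pick a node for every instance from one shared snapshot, then claim."""
    snapshot = inventory.available()  # every request sees the same view
    picks = {inst: snapshot[0] for inst in instance_ids}  # all pick node 0
    results = {}
    for inst, node in picks.items():
        try:
            inventory.claim(node, inst)
            results[inst] = "ACTIVE"
        except NodeInUse:
            results[inst] = "ERROR"
    return results


inventory = Inventory(["node-a", "node-b"])
results = schedule_batch(inventory, ["instance-1", "instance-2"])
print(results)
print(inventory.owner)
```

In this model the second instance fails even though `node-b` is still free, which matches what I see: the errored instance was pointed at a node that another instance from the same batch had just taken.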
# Reproduction steps
1. Set up an Ironic system with a modest number of nodes (15 in my case).
2. Simultaneously create several instances (again 15 in my case -- one
per available Ironic node). I used Terraform which internally calls
the Nova API. Unfortunately it would not be helpful to share my
Terraform script since it uses various internal Terraform modules.
3. Roughly 10% of instances will immediately enter the Error
state with an exception as described above. Which instances fail
appears to be a matter of random chance.
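Since I cannot share the Terraform script, the steps above can be approximated by a small harness that fires all create requests at once and reports the fraction that end up in Error. This is a minimal sketch, not what Terraform does internally: `create_instance` is a placeholder for whatever call creates one server through the Nova API and returns its final status, and `fake_create` is a stand-in so the snippet runs on its own.

```python
from concurrent.futures import ThreadPoolExecutor


def create_batch(create_instance, count):
    """Issue `count` create requests concurrently; return statuses in order."""
    names = [f"bm-{i}" for i in range(count)]
    with ThreadPoolExecutor(max_workers=count) as pool:
        # map() submits every call up front, using up to `count` threads.
        return list(pool.map(create_instance, names))


def error_fraction(statuses):
    """Fraction of instances whose final status is ERROR."""
    return sum(1 for s in statuses if s == "ERROR") / len(statuses)


def fake_create(name):
    # Hypothetical stand-in for a real Nova API call; always succeeds.
    return "ACTIVE"


statuses = create_batch(fake_create, 15)
print(error_fraction(statuses))
```

Replacing `fake_create` with a function that really creates a server and waits for it to reach either Active or Error should give a repeatable measure of how often the problem occurs.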
# Software Versions
Ironic: 06895641fb8a44c
with the following (unrelated) patches applied:
Nova: c9de185ea1ac1e8
with the following (unrelated) patches applied:
Ibd68fb72
Ifed0fa16
I5a399f1d