Attempts to use same node multiple times when many nodes created simultaneously (race condition?)

Bug #2024647 reported by Jonathan Heathcote
Affects: Ironic
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

# Problem description

When instantiating a modest number (ten or so) of Ironic nodes simultaneously via Nova using Terraform, I find that a few instances sometimes enter the Error state immediately.

The final exception message is "Failure prepping block device.", which I suspect is a red herring, since further up the exception trace is the following:

    nova.exception.InstanceDeployFailure: Failed to reserve
    node 228437bc-b091-4a8a-8b3d-7134214892e6 when provisioning
    the instance 168c5e97-bff2-4408-b12a-8158414fc5de

Here node 228437bc[...] is a (previously available) Ironic node which, upon inspection, is now running one of the other instances created in the batch. Instance 168c5e97[...] is the instance reporting this error.

Full stack trace here: https://paste.openstack.org/show/bSxVzPIVahJKYA5zGumX/

Also found in a Nova log file is:

    2023-06-22 08:46:39.796 217494 WARNING nova.scheduler.client.report [None req-8e6b5289-cd5c-4499-bad5-e6b576a70e88 ce20f804a02fc250ec44394149d69a682aed245fd50194cd917fd238c2dc52c1 ffaaa801d7ac4b7e94945aa8b4b9d245 - - default default] Failed to save allocation for d43c698f-5ee7-445f-9653-a8d3a24e71d1. Got HTTP 409: {"errors": [{"status": 409, "title": "Conflict", "detail": "There was a conflict when trying to complete your request.\n\n Unable to allocate inventory: Unable to create allocation for 'CUSTOM_R730' on resource provider '228437bc-b091-4a8a-8b3d-7134214892e6'. The requested amount would exceed the capacity. ", "code": "placement.undefined_code", "request_id": "req-6bd98596-0a8a-48af-b1af-6c884c81085a"}]}
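To check how often this conflict occurs and which nodes are involved, the warning above can be pulled out of a Nova log mechanically. The following is a minimal sketch, assuming the message wording is exactly as quoted; the function name `find_allocation_conflicts` is made up for illustration, and treating the "Failed to save allocation for" UUID as the allocation consumer is an assumption.

```python
import re

# Patterns are based only on the wording of the warning quoted above;
# other releases may phrase the message differently.
ALLOCATION_CONFLICT = re.compile(
    r"Unable to create allocation for '(?P<resource_class>[^']+)' "
    r"on resource provider '(?P<provider>[0-9a-f-]{36})'"
)
CONSUMER = re.compile(
    r"Failed to save allocation for (?P<consumer>[0-9a-f-]{36})"
)


def find_allocation_conflicts(lines):
    """Yield (consumer, resource_class, provider) for each conflict line."""
    for line in lines:
        conflict = ALLOCATION_CONFLICT.search(line)
        if conflict is None:
            continue  # not an allocation-conflict line
        consumer = CONSUMER.search(line)
        yield (
            consumer.group("consumer") if consumer else None,
            conflict.group("resource_class"),
            conflict.group("provider"),
        )
```

Passing an open log file (for example `find_allocation_conflicts(open(path))`, with `path` supplied by the reader) yields one tuple per conflict, so repeated provider UUIDs show which nodes were contended.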

My interpretation is that a race condition may be causing a single Ironic node to be offered to two instances created in quick succession, and only the first one to claim it succeeds.
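The suspected pattern can be illustrated with a toy model. This is not Nova, Placement, or Ironic code; `ToyInventory`, `pick`, and the node and instance names are all hypothetical. It only shows how two requests that read the same stale list of free nodes can select the same node, after which the second claim is rejected.

```python
class Conflict(Exception):
    """Stands in for the HTTP 409 returned when capacity would be exceeded."""


class ToyInventory:
    """Each node can hold exactly one instance (illustrative only)."""

    def __init__(self, nodes):
        self.owner = {node: None for node in nodes}

    def free_nodes(self):
        # A snapshot: it may be stale by the time the caller acts on it.
        return [n for n, o in self.owner.items() if o is None]

    def claim(self, node, instance):
        # The claim itself is checked, so a second claim on a node fails.
        if self.owner[node] is not None:
            raise Conflict(f"{node} already allocated to {self.owner[node]}")
        self.owner[node] = instance


def pick(snapshot):
    # Hypothetical selection policy: always take the first candidate.
    return snapshot[0]


inventory = ToyInventory(["node-a", "node-b"])

# Both requests read the free list before either has claimed anything.
view_1 = inventory.free_nodes()
view_2 = inventory.free_nodes()

inventory.claim(pick(view_1), "instance-1")
try:
    inventory.claim(pick(view_2), "instance-2")
    outcome = "both succeeded"
except Conflict as exc:
    outcome = f"instance-2 failed: {exc}"
```

In this model `node-b` stays free even though `instance-2` fails, which matches the observed symptom of an instance erroring out while enough nodes exist for the whole batch. Whether the real system retries on another node after such a conflict is outside what this sketch covers.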

# Reproduction steps

1. Set up an Ironic system with a modest number of nodes (15 in my case).
2. Simultaneously create several instances (again 15 in my case -- one
   per available Ironic node). I used Terraform which internally calls
   the Nova API. Unfortunately it would not be helpful to share my
   Terraform script since it uses various internal Terraform modules.
3. Something like 10% of instances will immediately enter the Error
   state with an exception as described above. This appears to be a
   matter of random chance.
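Since the Terraform configuration is not available, step 2 could be approximated with a small harness that fires all create requests at once and records which ones fail. This is only a sketch and not the setup used in this report: `create_instance` is a hypothetical callable the reader must supply (for example, a wrapper around whichever Nova client is in use that returns once the instance is active and raises if it enters the Error state).

```python
from concurrent.futures import ThreadPoolExecutor


def create_concurrently(create_instance, count):
    """Call create_instance(index) `count` times in parallel.

    Returns (results, errors): results maps index -> return value,
    errors maps index -> the exception raised for that index.
    """
    results, errors = {}, {}
    # One worker per request so that all requests are in flight together.
    with ThreadPoolExecutor(max_workers=count) as pool:
        futures = {i: pool.submit(create_instance, i) for i in range(count)}
        for i, future in futures.items():
            try:
                results[i] = future.result()
            except Exception as exc:  # record the failure and carry on
                errors[i] = exc
    return results, errors
```

Running this with `count` equal to the number of available nodes and inspecting `errors` gives the failure rate for one batch; repeating it several times would show whether the roughly 10% figure holds.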

# Software Versions

Ironic: 06895641fb8a44caf4574919bd518f0de76cba3d
        with the following (unrelated) patches applied:
        Ic9dea4f51d82866be8ac16242a79237c789b9745
        I2a6bca3550819b98adbaffe315f77427b8a43d62

Nova: c9de185ea1ac1e8d4435c5863b2ad7cefdb28c76
      with the following (unrelated) patches applied:
      Ibd68fb72957ca850f3be4e7b4ea68af038fad07c
      Ifed0fa16053228990a6a8df8d4c666521db7e329
      I5a399f1d3d702bfb76c067893e9c924904c8c360
