Ironic locking up completely under extreme load

Bug #2038438 reported by Dmitry Tantsur
This bug affects 1 person
Affects Status Importance Assigned to Milestone
In Progress
Dmitry Tantsur
Dmitry Tantsur

Bug Description

While testing Ironic on our scale environment (~3500 VMs in total, ~500 VMs at a time, 1 single-process Ironic backed by SQLite), we have noticed a pathological behavior. It only occurred after misconfiguring the testbed to have excessive bandwidth and network stability limitations. While we don't have a clear picture what happened (to a large extent because of log rotation - in this mode Ironic is generating hundreds of megabytes per hour), at the end of the run Ironic is only reporting NoFreeConductorWorker in response to anything, mostly heartbeats. The heartbeats are dominating the workload with around 100 of them coming each second (!). This is how I see the sequence:

1) Something caused a lot of nodes to fail heartbeating.
2) They all re-run heartbeats after 5 seconds (hardcoded delay in IPA), resulting in 100 heartbeats/second.
3) The only Ironic has a limit of 100 green threads, so all heartbeats + periodic tasks + API requests are racing for these threads.
4) In the end, no process ever finishes because anything requires a long race for green threads.

These are my observations based on this reasoning and the research of the code:

1) 100 green threads is a very low number. I've seen people on the internet talking about 100k "normal" OS threads and 1M of green threads. I don't know why we choose 100. It's probably extremely conservative.
2) We allow 250 parallel deployments (+50 possible cleanings). The number of threads has to be higher than that to prevent these processes to finish without racing for threads.
3) Back when migrating to Futurist, I misunderstood how Futurist is configured. As a result, we ended up with a pool of 100 threads and then 100 more requests for threads were allowed to queue. The latter has never been intentional. Given that Ironic often acquires an exclusive lock before asking for a thread, this behavior may result in a race for locks on top of the race for threads, potentially even in deadlocks.
4) Everything is even worse in the fast-track case where we may have thousands of available nodes idling - and heartbeating!

These are the solutions that I can see:
1) Bump the thread pool size to 300 and see if we can bump it further. But also prevent thread requests from ever being queued.
2) Move the heartbeat verification logic outside of the thread so that it happens synchronously. This involves verifying the provisioning state, trying to acquire an exclusive lock and updating the current agent information in driver_internal_info.
3) Split the worker pool to keep Ironic responsive even in the event of an extreme load. Create two pools with 90% and 10% of capacity (so, 270 and 30 threads by default). All spawn operations will be able to use the 1st (main) pool, but some operations (heartbeats, periodic tasks) will not be able to fall back to the 2nd (critical) pool. This way API will continue working and e.g. operators will be able to abort deployments.

Revision history for this message
Steve Baker (steve-stevebaker) wrote :

It sounds like we should also do something on the IPA side about thundering herd heartbeat retries?

Changed in ironic:
status: Triaged → In Progress
Revision history for this message
Dmitry Tantsur (divius) wrote :

> It sounds like we should also do something on the IPA side about thundering herd heartbeat retries?

We should, but we also cannot rely on the client side to always be reasonable.

Changed in ironic-python-agent:
importance: Undecided → Medium
assignee: nobody → Dmitry Tantsur (divius)
status: New → Triaged
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (master)

Submitter: "Zuul (22348)"
Branch: master

commit 224cdd726cc0ca1826273e46619769f87380f2d7
Author: Dmitry Tantsur <email address hidden>
Date: Mon Oct 2 18:42:07 2023 +0200

    Bump workers_pool_size to 300 and remove queueing of tasks

    Especially in a single-conductor environment, the number of threads
    should be larger than max_concurrent_deploy, otherwise the latter cannot
    be reached in practice or will cause issues with heartbeats.

    On the other hand, this change fixes an issue with how we use futurist.
    Due to a misunderstanding, we ended up setting the workers pool size to
    100 and then also allowing 100 more requests to be queued.

    To be it shortly, this change moves from 100 threads + 100 queued to
    300 threads and no queue.

    Partial-Bug: #2038438
    Change-Id: I1aeeda89a8925fbbc2dae752742f0be4bc23bee0

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.