Ironic locking up completely under extreme load
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Ironic | In Progress | Medium | Dmitry Tantsur | |
| ironic-python-agent | Triaged | Medium | Dmitry Tantsur | |
Bug Description
While testing Ironic on our scale environment (~3500 VMs in total, ~500 VMs at a time, one single-process Ironic backed by SQLite), we noticed pathological behavior. It only occurred after misconfiguring the testbed to impose excessive bandwidth and network stability limitations. We don't have a clear picture of what happened (largely because of log rotation - in this mode Ironic generates hundreds of megabytes of logs per hour), but at the end of the run Ironic was only reporting NoFreeConductor. My reconstruction of the sequence:
1) Something caused a lot of nodes to fail heartbeating.
2) They all re-run heartbeats after 5 seconds (a hardcoded delay in IPA), resulting in ~100 heartbeats/second.
3) Ironic has a limit of only 100 green threads, so all heartbeats, periodic tasks and API requests are racing for these threads.
4) In the end, no operation ever finishes because everything requires a long race for green threads.
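The arithmetic behind steps 2-4 can be sketched with a back-of-the-envelope capacity model. The node count, retry delay and pool size come from this report; the per-heartbeat thread hold time is an assumed parameter, not a measured value:

```python
# Hypothetical capacity model of the described overload (not Ironic code).

NODES_RETRYING = 500   # nodes that failed heartbeating (report's ~500 at a time)
RETRY_DELAY_S = 5      # hardcoded IPA retry delay
POOL_SIZE = 100        # Ironic's green-thread pool size

# Synchronized retries every 5 seconds -> steady arrival rate.
arrival_rate = NODES_RETRYING / RETRY_DELAY_S  # heartbeats per second

def pool_saturated(service_time_s: float) -> bool:
    """True if heartbeats arrive faster than the pool can drain them.

    service_time_s is how long each heartbeat holds a green thread
    (an assumption for illustration).
    """
    capacity = POOL_SIZE / service_time_s  # sustainable requests/second
    return arrival_rate > capacity

print(arrival_rate)         # 100.0 heartbeats/second
print(pool_saturated(0.5))  # False: pool can drain 200 req/s
print(pool_saturated(2.0))  # True: pool drains only 50 req/s, backlog grows
```

Once heartbeats alone exceed the pool's drain rate, periodic tasks and API requests join the same backlog, which matches the observed lockup.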
These are my observations based on this reasoning and the research of the code:
1) 100 green threads is a very low number. I've seen people on the internet talking about 100k "normal" OS threads and 1M green threads. I don't know why we chose 100; it's probably extremely conservative.
2) We allow 250 parallel deployments (plus up to 50 cleanings). The number of threads has to be higher than that to allow these processes to finish without racing for threads.
3) Back when migrating to Futurist, I misunderstood how Futurist is configured. As a result, we ended up with a pool of 100 threads, plus a queue where 100 more requests for threads are allowed to wait. The latter was never intentional. Given that Ironic often acquires an exclusive lock before asking for a thread, this behavior can result in a race for locks on top of the race for threads, potentially even in deadlocks.
4) Everything is even worse in the fast-track case where we may have thousands of available nodes idling - and heartbeating!
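The "never queue thread requests" behavior from observation 3 can be sketched with a submission wrapper that fails fast when all workers are busy. This is an illustrative stdlib sketch; `RejectingExecutor` and `NoFreeWorker` are hypothetical names, not Ironic's or Futurist's actual classes:

```python
# Sketch: reject submissions instead of queueing them (hypothetical names).
import threading
from concurrent.futures import ThreadPoolExecutor


class NoFreeWorker(Exception):
    """Raised when no worker is immediately available."""


class RejectingExecutor:
    """A pool that never queues: submit() either runs or raises."""

    def __init__(self, max_workers: int):
        self._sem = threading.Semaphore(max_workers)
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def submit(self, fn, *args, **kwargs):
        # Fail fast: a caller holding an exclusive node lock should
        # release it immediately rather than wait in a hidden queue.
        if not self._sem.acquire(blocking=False):
            raise NoFreeWorker()

        def run():
            try:
                return fn(*args, **kwargs)
            finally:
                self._sem.release()

        return self._pool.submit(run)
```

With queueing disabled, a request that cannot get a thread fails immediately, so it never sits in a queue while holding a lock that other work is waiting on.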
These are the solutions that I can see:
1) Bump the thread pool size to 300 and see if we can push it further. But also prevent thread requests from ever being queued.
2) Move the heartbeat verification logic out of the spawned thread so that it happens synchronously. This involves verifying the provisioning state, trying to acquire an exclusive lock and updating the current agent information in driver_
3) Split the worker pool to keep Ironic responsive even under extreme load. Create two pools with 90% and 10% of the capacity (so 270 and 30 threads by default). All spawn operations will be able to use the 1st (main) pool, but some operations (heartbeats, periodic tasks) will not be allowed to fall back to the 2nd (critical) pool. This way, the API will keep working and operators will, for example, still be able to abort deployments.
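A minimal sketch of the split-pool idea in option 3, assuming a 90/10 split and an `important` flag marking which operations may fall back to the reserved capacity. All names here are hypothetical, not Ironic's actual API:

```python
# Sketch of solution 3: main pool plus reserved "critical" pool.
import threading
from concurrent.futures import ThreadPoolExecutor


class NoFreeWorker(Exception):
    pass


class _BoundedPool:
    """Pool that rejects work when full instead of queueing it."""

    def __init__(self, size: int):
        self._sem = threading.Semaphore(size)
        self._pool = ThreadPoolExecutor(max_workers=size)

    def submit(self, fn, *args, **kwargs):
        if not self._sem.acquire(blocking=False):
            raise NoFreeWorker()

        def run():
            try:
                return fn(*args, **kwargs)
            finally:
                self._sem.release()

        return self._pool.submit(run)


class SplitPool:
    def __init__(self, total: int = 300, reserved_fraction: float = 0.1):
        reserved = int(total * reserved_fraction)  # 30 with the defaults
        self._main = _BoundedPool(total - reserved)  # 270 with the defaults
        self._critical = _BoundedPool(reserved)

    def spawn(self, fn, *args, important=False, **kwargs):
        try:
            return self._main.submit(fn, *args, **kwargs)
        except NoFreeWorker:
            if not important:
                raise  # heartbeats and periodic tasks stop here
            # API-driven operations (e.g. abort) may use reserved capacity.
            return self._critical.submit(fn, *args, **kwargs)
```

Under heartbeat-driven saturation of the main pool, non-important spawns fail fast while API-driven work still finds a worker in the reserved pool.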
Changed in ironic:
status: Triaged → In Progress
It sounds like we should also do something on the IPA side about thundering herd heartbeat retries?
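A common mitigation for this kind of thundering herd is to replace the fixed 5-second retry with full-jitter exponential backoff, so failed nodes spread their retries out instead of arriving in synchronized waves. A sketch of the idea (not IPA's actual code; all names and defaults are illustrative):

```python
# Sketch: full-jitter exponential backoff for heartbeat retries
# (hypothetical IPA-side mitigation, not the project's actual code).
import random


def heartbeat_retry_delays(base=5.0, cap=300.0, attempts=6, rng=random.random):
    """Yield one delay per failed attempt.

    Each delay is drawn uniformly from [0, window), where the window
    doubles per attempt up to `cap`, desynchronizing the retrying nodes.
    """
    for attempt in range(attempts):
        window = min(cap, base * (2 ** attempt))
        yield rng() * window


# With rng fixed at 0.5 the windows 5, 10, 20, 40, 80, 160 give
# delays 2.5, 5, 10, 20, 40, 80.
delays = list(heartbeat_retry_delays(rng=lambda: 0.5))
```

With randomized delays, 500 failed nodes no longer all come back in the same second, so the conductor sees a smoothed trickle of heartbeats instead of 100/second bursts.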