Sharded OpWQ drops suicide_grace after waiting for work
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ubuntu Cloud Archive | Fix Released | Medium | Dan Hill |
Queens | In Progress | Medium | Kellen Renshaw |
Rocky | Won't Fix | Medium | Dan Hill |
Stein | Won't Fix | Medium | Dan Hill |
Train | Fix Released | Medium | Dan Hill |
ceph (Ubuntu) | Fix Released | Medium | Dan Hill |
Bionic | Fix Committed | Medium | Kellen Renshaw |
Eoan | Won't Fix | Medium | Dan Hill |
Focal | Fix Released | Medium | Dan Hill |
Bug Description
[Impact]
The Sharded OpWQ will opportunistically wait for more work when processing an empty queue. While waiting, the heartbeat timeout and suicide_grace values are modified. The `threadpool_
After finding work, the original work queue grace/suicide_grace values are not re-applied. This can result in hung operations that do not trigger an OSD suicide recovery.
The missing suicide recovery was observed on Luminous 12.2.11. The environment was consistently hitting a known authentication race condition (issue#37778 [0]) due to repeated OSD service restarts on a node exhibiting MCEs from a faulty DIMM.
The auth race condition would stall PG operations. In some cases, the hung ops would persist for hours without suicide recovery.
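The behaviour described above can be illustrated with a small stand-alone model. This is only a sketch, not Ceph code: `HeartbeatHandle`, `QueueTimeouts`, `enter_idle_wait`, `resume_work_buggy` and `resume_work_fixed` are hypothetical names, and the assumption that the idle wait disarms the suicide timeout by setting it to zero is made purely for illustration.

```cpp
#include <cassert>
#include <chrono>

using std::chrono::seconds;

// Toy stand-in for a worker thread's heartbeat handle (hypothetical type).
// A value of 0 means the corresponding timeout is not armed.
struct HeartbeatHandle {
    seconds grace{0};
    seconds suicide_grace{0};
};

// Timeouts configured for the work queue (hypothetical type).
struct QueueTimeouts {
    seconds grace;
    seconds suicide_grace;
};

// Arm both timeouts on the handle.
void reset_timeout(HeartbeatHandle& hb, seconds grace, seconds suicide_grace) {
    assert(grace.count() >= 0 && suicide_grace.count() >= 0);
    hb.grace = grace;
    hb.suicide_grace = suicide_grace;
}

// Assumption for this sketch: while waiting on an empty queue, the worker
// runs with a relaxed grace and with the suicide timeout disarmed.
void enter_idle_wait(HeartbeatHandle& hb, seconds idle_grace) {
    reset_timeout(hb, idle_grace, seconds{0});
}

// Buggy pattern: work has arrived, but the queue's own values are never
// re-applied, so the handle keeps whatever the idle wait left behind.
void resume_work_buggy(HeartbeatHandle&, const QueueTimeouts&) {}

// Fixed pattern: re-apply the queue's grace and suicide_grace before the
// work item is processed.
void resume_work_fixed(HeartbeatHandle& hb, const QueueTimeouts& q) {
    reset_timeout(hb, q.grace, q.suicide_grace);
}

// True when a hung operation on this thread would eventually trigger suicide.
bool suicide_armed(const HeartbeatHandle& hb) {
    return hb.suicide_grace.count() > 0;
}
```

In this model, a worker that goes through `enter_idle_wait` followed by `resume_work_buggy` processes its next item with `suicide_armed` returning false, which corresponds to hung ops that never trigger suicide recovery. With `resume_work_fixed`, the handle carries the same values it had before the wait.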
[Test Case]
I have not identified a reliable reproducer. Currently testing the fix by exercising I/O.
Recommend letting this bake upstream before considering a back-port.
[Regression Potential]
This fix improves suicide_grace coverage of the Sharded OpWQ.
This change is made in a critical code path that drives client I/O. An OSD suicide will trigger a service restart, and repeated restarts (flapping) will adversely impact cluster performance.
The fix mitigates risk by keeping the applied suicide_grace value consistent with the value applied before entering `OSD::ShardedOp
- In-Progress -
Opened upstream tracker for issue#45076 [1] and fix pr#34575 [2]
[0] https:/
[1] https:/
[2] https:/
tags: | added: sts |
Changed in ceph (Ubuntu): | |
status: | New → Triaged |
assignee: | nobody → Dan Hill (hillpd) |
importance: | Undecided → Medium |
summary: |
- Ceph 12.2.11-0ubuntu0.18.04.2 doesn't honor suicide_grace
+ Sharded OpWQ drops suicide_grace after waiting for work |
Changed in ceph (Ubuntu Bionic): | |
status: | New → Confirmed |
assignee: | nobody → Dan Hill (hillpd) |
Changed in ceph (Ubuntu Eoan): | |
assignee: | nobody → Dan Hill (hillpd) |
Changed in ceph (Ubuntu Bionic): | |
importance: | Undecided → Medium |
Changed in ceph (Ubuntu Eoan): | |
importance: | Undecided → Medium |
status: | New → Confirmed |
Changed in ceph (Ubuntu Focal): | |
status: | Triaged → Confirmed |
description: | updated |
description: | updated |
description: | updated |
Changed in ceph (Ubuntu Focal): | |
status: | Confirmed → Fix Released |
Changed in ceph (Ubuntu): | |
status: | Confirmed → Fix Released |
Changed in cloud-archive: | |
assignee: | nobody → Dan Hill (hillpd) |
Changed in ceph (Ubuntu Bionic): | |
status: | Confirmed → In Progress |
Changed in cloud-archive: | |
status: | New → Fix Released |
Changed in cloud-archive: | |
importance: | Undecided → Medium |
@hillpd any update on this bug?