[SRU] error deleting cloned volumes and parent at the same time when using ceph

Bug #2083061 reported by Rodrigo Barbieri
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Cinder
Fix Released
Undecided
Unassigned
Ubuntu Cloud Archive
Invalid
Undecided
Unassigned
Antelope
Fix Released
Undecided
Unassigned
Bobcat
Fix Released
Undecided
Unassigned
Yoga
Fix Released
Undecided
Unassigned
Zed
Won't Fix
Undecided
Unassigned
cinder (Ubuntu)
Fix Released
Undecided
Unassigned
Jammy
Fix Released
Undecided
Unassigned

Bug Description

******* SRU TEMPLATE AT THE BOTTOM **********

Affects: bobcat and older

A race condition when deleting cloned volumes at the same time as their parent results in the volumes in error_deleting state. The reason it happens is because the code that looks for the parent in [3] may find the original volume or the "<volume>.deleted" renamed volume if the parent has been marked for deletion. The race happens because by running the deletion of both the parent and the child at the same time, the child may see the parent volume before it is marked for deletion, and then in [4] it fails to find it again because it is gone (renamed to "<volume>.deleted").

Steps to reproduce:

1) openstack volume create --size 1 v1

Wait for volume to be create and available

2) for i in {1..9}; do openstack volume create d$i --source v1 --size 1;done

Wait for all volumes to be created and available

3) openstack volume delete $(openstack volume list --format value -c ID | sort | xargs)

Some volumes may be in error_deleting state.

Workaround: Reset volume state and try to delete again.

Solutions:

a) The issue does not happen in caracal+ because of commit [1] which refactors the code. I tried to reproduce in Caracal with 50 volumes, including grandparent volumes, and I couldn't. If we could backport this fix as far back as Yoga this would address the problem for our users.

b) A single line of code in [2] can address the problem in bobcat and older releases by adding a retry:

    @utils.retry(rbd.ImageNotFound, 2)
    def delete_volume(self, volume: Volume) -> None:

The retry basically causes the ImageNotFound exception thrown at [4] to retry the delete_volume function, which will then find the "<volume>.deleted" at [3], solving the race condition. It is simpler than adding something more complex directly at [4] where the error happens.

[1] https://github.com/openstack/cinder/commit/1a675c9aa178c6d9c6ed10fd98f086c46d350d3f

[2] https://github.com/openstack/cinder/blob/5b3717f8bfa69c142778ffeabfc4ab91f1f23581/cinder/volume/drivers/rbd.py#L1371

[3] https://github.com/openstack/cinder/blob/5b3717f8bfa69c142778ffeabfc4ab91f1f23581/cinder/volume/drivers/rbd.py#L1401

[4] https://github.com/openstack/cinder/blob/5b3717f8bfa69c142778ffeabfc4ab91f1f23581/cinder/volume/drivers/rbd.py#L1337

===================================================
SRU TEMPLATE
============

[Impact]

Due to a race condition, attempting to delete multiple volumes where among them there is a parent and a child can result in one or more volumes being stuck in error_deleting. The reason is because the childs get updated as the parent is deleted, and if the code had already started deleting the child then the reference changes halfway through and fails. Later, the volumes can still be deleted by resetting the state and retrying, but the user experience is cumbersome.

Upstream has fixed the issue in Caracal by refactoring the delete method with significant behavioural changes (see comment #2), and has backported the refactor to Antelope. Also the refactor code applies to Yoga, it is preferred to implement a simpler fix to address this specific problem in Yoga. The simpler fix is a retry decorator which will force the delete method to re-run, picking up the updated reference of the parent being deleted and therefore succeeding deleting the childs.

[Test case]

1) Deploy Cinder with Ceph
2) Create a parent volume

openstack volume create --size 1 v1

3) Create the child volumes

for i in {1..9}; do openstack volume create d$i --source v1 --size 1;done

4) Wait for all volumes to be created and available

5) Delete all the volumes

for item in $(openstack volume list --format value -c ID | sort | xargs); do openstack volume delete $item & done

6) Check for volumes stuck in error_deleting, if None, repeat steps 2-5

7) Confirm error message rbd.ImageNotFound in the logs

8) Install fixed package

9) Repeat steps 2-5, confirm no volumes stuck in error_deleting and the following messages in the log:

"no longer exists in backend"

"Retrying cinder.volume.drivers.rbd.RBDDriver.delete_volume.<locals>._delete_volume_with_retry ... as it raised ImageNotFound: [errno 2] RBD image not found (error opening image ... at snapshot None)"

If messages are not in the logs after the new delete command, then retry steps 2-5 until messages appear while still not having any volumes stuck in error_deleting (in other words, all volume deleted successfully).

[Regression Potential]

For Bobcat and Antelope, there is reasonable regression potential because of the complexity of refactor [1] (see comment #2), however, discussions on previous upstream meetings and upstream CI runs of Caracal, Bobcat and Antelope backports which test the refactor provide some level of reassurance. For Yoga, we consider no regression potential with the simpler retry decorator fix.

[Other Info]

None.

Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

to backport [1] cleanly to yoga it is necessary to apply the following commits [2], [3] and [4], but [3] and [4] are not necessary because the methods are completely removed in [1]. Therefore only [2] is needed to avoid unit test conflicts.

[1] https://github.com/openstack/cinder/commit/1a675c9aa178c6d9c6ed10fd98f086c46d350d3f

[2] https://github.com/openstack/cinder/commit/e1138a126f80f9cf5a38cd49066133baba1a0fef

[3] https://github.com/openstack/cinder/commit/5179e4f6bf49795d2aa4c8d0807a502ee1561f60

[4] https://github.com/openstack/cinder/commit/b235048d6dc49fee0f3ad83d3216a42c23b69a20

Revision history for this message
Dan Hill (hillpd) wrote :

My main concern with backporting [1] is that it changes the volume delete behavior to use `trash_move()` as a fallback mechanism. Existing workflows and operators may not be aware that the trash namespace needs to be monitored for clean-up.

I'll also note that [1] changes delete behavior to perform additional flatten operations when necessary. This will make some delete operations much more costly in terms of performance. The implementation mitigates this risk by limiting the number of concurrent flatten operations, but it still introduces the potential for unexpected backend load.

After reviewing both options, I agree that the retry mechanism makes more sense as a backport candidate. Adding the retry decorator around `delete_volume` provides a straightforward solution to the reported race condition and avoids significant changes to the delete volume behavior that could potentially cause utilization and performance issues with existing deployments.

no longer affects: ubuntu
summary: - error deleting cloned volumes and parent at the same time when using
- ceph
+ [SRU] error deleting cloned volumes and parent at the same time when
+ using ceph
description: updated
Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

For reference this is the error and stack trace:

parent: 2078a5c0-4272-46f3-b95b-d89d62da67af
child: 12d02019-17b1-45a3-9026-aed36873edaf

2024-11-20 15:43:43.583 38606 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/cinder/volume/manager.py", line 981, in delete_volume
2024-11-20 15:43:43.583 38606 ERROR oslo_messaging.rpc.server self.driver.delete_volume(volume)
2024-11-20 15:43:43.583 38606 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/cinder/volume/drivers/rbd.py", line 1350, in delete_volume
2024-11-20 15:43:43.583 38606 ERROR oslo_messaging.rpc.server self._delete_clone_parent_refs(client, parent, parent_snap)
2024-11-20 15:43:43.583 38606 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/cinder/volume/drivers/rbd.py", line 1231, in _delete_clone_parent_refs
2024-11-20 15:43:43.583 38606 ERROR oslo_messaging.rpc.server parent_rbd = self.rbd.Image(client.ioctx, parent_name)
2024-11-20 15:43:43.583 38606 ERROR oslo_messaging.rpc.server File "rbd.pyx", line 2896, in rbd.Image.__init__
2024-11-20 15:43:43.583 38606 ERROR oslo_messaging.rpc.server rbd.ImageNotFound: [errno 2] RBD image not found (error opening image b'volume-2078a5c0-4272-46f3-b95b-d89d62da67af' at snapshot None)
2024-11-20 15:43:43.583 38606 ERROR oslo_messaging.rpc.server
2024-11-20 15:43:43.593 38606 DEBUG cinder.volume.drivers.rbd [req-5c4144c1-c310-4f05-ba15-b05dda6d61af 70958fca143047a583e91795ff460152 5c20c2e1c8ed4948923449807a40b3e7 - - -] volume is a clone so cleaning references delete_volume /usr/lib/python3/dist-packages/cinder/volume/drivers/rbd.py:1348

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (unmaintained/yoga)

Fix proposed to branch: unmaintained/yoga
Review: https://review.opendev.org/c/openstack/cinder/+/935976

description: updated
Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :
Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :
Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

forgot the DEP3 tags. Deleting debdiffs and re-creating

Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :
Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :
Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :
Revision history for this message
James Page (james-page) wrote :

Proposed updates for all targets uploaded for SRU team review.

I'll process UCA targets (which are also uploaded) once the SRU team accepted into proposed

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

The main devel task for cinder is still "New", is this fixed in plucky?

James Page (james-page)
Changed in cinder (Ubuntu):
status: New → Invalid
Changed in cloud-archive:
status: New → Invalid
Changed in cinder:
status: New → Invalid
Revision history for this message
Heitor Alves de Siqueira (halves) wrote :

This has already been fixed in Cinder for newer releases, I've adjusted the bug status accordingly.
It's also been sponsored for Jammy already, currently in the Unapproved queue.

Changed in cinder (Ubuntu):
status: Invalid → Fix Released
Changed in cinder:
status: Invalid → Fix Released
Changed in cinder (Ubuntu Jammy):
status: New → In Progress
Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :
Revision history for this message
Andreas Hasenack (ahasenack) wrote : Please test proposed package

Hello Rodrigo, or anyone else affected,

Accepted cinder into jammy-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cinder/2:20.3.1-0ubuntu1.6 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-jammy to verification-done-jammy. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-jammy. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in cinder (Ubuntu Jammy):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-jammy
Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :
tags: added: verification-done-jammy
removed: verification-needed-jammy
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cinder - 2:20.3.1-0ubuntu1.6

---------------
cinder (2:20.3.1-0ubuntu1.6) jammy; urgency=medium

  * d/p/lp1823445.patch: Fix "signature_verified" volume metadata
    from propagating to images and causing errors later when
    creating volumes from such images (LP: #1823445).
  * d/p/lp2083061.patch: Fix race condition when deleting cloned
    volumes and their parents at the same time. (LP: #2083061)

 -- Rodrigo Barbieri <email address hidden> Mon, 09 Dec 2024 15:37:22 +0000

Changed in cinder (Ubuntu Jammy):
status: Fix Committed → Fix Released
Revision history for this message
Timo Aaltonen (tjaalton) wrote : Update Released

The verification of the Stable Release Update for cinder has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

Still pending UCAs for Antelope and Bobcat

Revision history for this message
Guillaume Boutry (gboutry) wrote : Please test proposed package

Hello Rodrigo, or anyone else affected,

Accepted cinder into bobcat-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:bobcat-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-bobcat-needed to verification-bobcat-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-bobcat-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-bobcat-needed
Revision history for this message
Guillaume Boutry (gboutry) wrote :

Hello Rodrigo, or anyone else affected,

Accepted cinder into antelope-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:antelope-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-antelope-needed to verification-antelope-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-antelope-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-antelope-needed
Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :
Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :
tags: added: verification-antelope-done verification-bobcat-done
removed: verification-antelope-needed verification-bobcat-needed
Revision history for this message
Edward Hope-Morley (hopem) wrote :

@rodrigo-barbieri2010 if we are making a conscious decision not to backport to Zed can we please state why and mark it as Won't Fix. Thanks.

Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

marked Zed as won't fix. We don't backport minor issues to non-LTS releases.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Hello Rodrigo, or anyone else affected,

Accepted cinder into yoga-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:yoga-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-yoga-needed to verification-yoga-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-yoga-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-yoga-needed
Revision history for this message
Edward Hope-Morley (hopem) wrote : Update Released

The verification of the Stable Release Update for cinder has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

This bug was fixed in the package cinder - 2:23.0.0-0ubuntu1.4~cloud2
---------------

 cinder (2:23.0.0-0ubuntu1.4~cloud2) jammy-bobcat; urgency=medium
 .
   * d/p/lp1823445.patch: Fix "signature_verified" volume metadata
     from propagating to images and causing errors later when
     creating volumes from such images (LP: #1823445).
   * d/p/lp2083061.patch: Fix race condition when deleting cloned
     volumes and their parents at the same time. (LP: #2083061)

Revision history for this message
Edward Hope-Morley (hopem) wrote :

The verification of the Stable Release Update for cinder has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

This bug was fixed in the package cinder - 2:22.1.1-0ubuntu1.3~cloud3
---------------

 cinder (2:22.1.1-0ubuntu1.3~cloud3) jammy-antelope; urgency=medium
 .
   * d/p/lp2083061.patch: Fix race condition when deleting cloned
     volumes and their parents at the same time. (LP: #2083061)
   * d/p/lp1823445.patch: Fix "signature_verified" volume metadata
     from propagating to images and causing errors later when
     creating volumes from such images (LP: #1823445).

Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :
tags: added: verification-yoga-done
removed: verification-yoga-needed
Revision history for this message
Edward Hope-Morley (hopem) wrote :

The verification of the Stable Release Update for cinder has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

This bug was fixed in the package cinder - 2:20.3.1-0ubuntu1.6~cloud0
---------------

 cinder (2:20.3.1-0ubuntu1.6~cloud0) focal-yoga; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 cinder (2:20.3.1-0ubuntu1.6) jammy; urgency=medium
 .
   * d/p/lp1823445.patch: Fix "signature_verified" volume metadata
     from propagating to images and causing errors later when
     creating volumes from such images (LP: #1823445).
   * d/p/lp2083061.patch: Fix race condition when deleting cloned
     volumes and their parents at the same time. (LP: #2083061)

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.