Upload/Import image continues consuming glance host cpu/memory/network/disk resources even after the image is deleted

Bug #1417304 reported by Tushar Patil
16
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Glance
Incomplete
Undecided
Unassigned
OpenStack Security Advisory
Won't Fix
Undecided
Unassigned

Bug Description

This issue is related to the already fixed security issue 1398830.

Steps to reproduce this problem
1. Upload image by passing image location in the copy_from parameter using v1 version or use import task api for importing image using v2 version.
2. When the image is in the 'saving' status, delete the image. Presently, it allows you to delete the image in the 'saving' state.
3. Uploading of image is performed asynchronously in a separate eventlet thread so there is no way to stop it until upload is complete and later while updating the metadata it finds the image is deleted so it cleans up the image data properly.

If you keep on creating new images, uploading big size of images (5GB) and delete it immediately while images are is in 'saving' state, then it will consume glance host CPU/Memory/Network resources without putting any burden on the user (billing point of view). Also, at some point it is quite possible the total size of image data in the backend for a particular tenant could go beyond user_storage_quota for a temporarily period.

Expected behavior: Uploading of image should be stopped immediately after the image is deleted.
Problem: Image location is not available in the delete image method while it is in saving status as it is added only after
the image data is uploaded successfully to the backend.

Revision history for this message
Grant Murphy (gmurphy) wrote :

Since this report concerns a possible security risk, an incomplete security advisory task has been added while the core security reviewers for the affected project or projects confirm the bug and discuss the scope of any vulnerability along with potential solutions.

Changed in ossa:
status: New → Incomplete
Jeremy Stanley (fungi)
description: updated
Revision history for this message
Thierry Carrez (ttx) wrote :

Needs glance-coresec confirmation/debunking.

Revision history for this message
Nikhil Komawar (nikhil-komawar) wrote :

Hi Tushar,

Thanks for the detailed bug report! I find your concerns valid and this seems to be an issue.

However, this is a challenging task to be solved as the upload process and the delete operations are on disjoint threads (possibly different glance nodes). It seems like a design issue to be that we are not able to handle the stopping of data transfer on image status transitions. Do you have any initial thoughts on how to handle this scenario?

Else, I would like to ask everyone involved in this group to weigh in their comments on what information we can give to uses addressing this as a known issue. ( ? )

Thanks!

Changed in glance:
status: New → Triaged
importance: Undecided → Critical
milestone: none → kilo-3
status: Triaged → Opinion
Changed in glance:
importance: Critical → Undecided
Revision history for this message
Tristan Cacqueray (tristan-cacqueray) wrote :

Would this be already fixed by https://review.openstack.org/#/c/122427/ ?

Revision history for this message
Nikhil Komawar (nikhil-komawar) wrote :

I do not believe so. The intent of the author seems to indicate that they are interested in stopping the asynchronous operations when a delete call is made to the image. This however, cannot be guaranteed with asynchronous style operations being performed on the tasks.

The solution from a operator point of view seem to be to implement a rate-limiting kind of setup and possibly allow only trusted users(groups) to operate on /tasks resource if this is being considered a potential security flaw. This way it would be easier to get some sort of amendment from the user to guarantee that they will not use the resources for any malicious purposes and such.

The above mentioned approach can be implemented using tools like nginx.

Revision history for this message
Thierry Carrez (ttx) wrote :

Beware that "Opinion" in Launchpad is a closed state (meaning the reporter and the development team have a difference in opinion, which is just a gentle way of saying wontfix)

Changed in glance:
status: Opinion → Incomplete
Revision history for this message
Thierry Carrez (ttx) wrote :

This feels almost undistinguishable from random HTTP calls triggering activity. Externally-facing API nodes are by nature exposed to user activity that may trigger CPU/Disk resources to be consumed without anyone paying for those resources. Unless there is a significant amplification effect and that can easily trigger a DoS, I'm tempted to consider those just opportunities for strengthening than exploitable vulnerabilities.

Revision history for this message
Jeremy Stanley (fungi) wrote :

Yes, I'm leaning toward class D (security hardening opportunity) for this report. https://wiki.openstack.org/wiki/Vulnerability_Management#Incident_report_taxonomy

Revision history for this message
Tristan Cacqueray (tristan-cacqueray) wrote :

Agreed with class D, let's open this next Monday if no objections...

Revision history for this message
Nikhil Komawar (nikhil-komawar) wrote :

My opinion is that these asynchronous tasks cannot be guaranteed to be stopped (for example if the asynchronous operation is using eventlet threads). However, we can explore the possibility of using taskflow and their flows to stop a processing task (if that is permissible). If that is something that can come in class D, that's great. If not, how can we better state that this is by design and operators must deploy Glance with appropriate scripts if that is worrisome ?

For Glance, my inclination is towards Opinion/Won't Fix if the reporter has no more suggestions.

Thanks!

Tushar Patil (tpatil)
summary: - Upload/Import image continues consuming glance host cpu/memory/network
- resources even after the image is deleted
+ Upload/Import image continues consuming glance host
+ cpu/memory/network/disk resources even after the image is deleted
Revision history for this message
Tushar Patil (tpatil) wrote :

1) In the present code location is saved after uploading the image data.
We have added a new method in the glance_store.driver.Store to get the location where the image data should be stored, save it as image metadata then pass it to the store_add_to_backend method. This way, when user will call delete api, it will get the location and attempt to delete the file if the location status is in 'saving/active' state. If the image data file is deleted, then in the image data upload logic in each store, we can add code to check whether the image data file exists or not (Performance overhead). If not, raise the appropriate exception and break the upload logic. The above solution will work for backend file store but not sure about the rest of the backend stores.

2) As Nikhil has pointed out, there should be some way to break the taskflow if it's still running when the image is deleted.

Revision history for this message
Flavio Percoco (flaper87) wrote :

I agree this sounds like a class D bug. For what it's worth, this is already being hardened as part of Kilo development cycle. We're implementing a, hopefully, more robust async upload/import process that doesn't rely exactly on eventlet but uses taskflow which gives more control on how things happen in the system.

Revision history for this message
Thierry Carrez (ttx) wrote :

Class D it is. Opening up

Changed in ossa:
status: Incomplete → Won't Fix
information type: Private Security → Public
Thierry Carrez (ttx)
Changed in glance:
milestone: kilo-3 → kilo-rc1
Revision history for this message
Nikhil Komawar (nikhil-komawar) wrote :

It is doubtful for the entire solution related to this bug be implemented in Kilo.

1. @Tushar: Thanks for that nice suggestion. It does make sense however, we've some challenges to implement it. With multiple locations enabled, it becomes really difficult for Glance to determine whether all the locations need to be checked before deleting this Image. This can result into deletes being slow. They can be slow already in some of the backends (especially for larger Images). So, we are looking at some tradeoffs here (and thanks for adding the performance overhead tag to your suggestion). Also, passing around locations might not be a good idea for some of the stores like swift that have sensitive data.

Also, this kind of resource utilization is seen in case of the current Image upload logic (using REST API call and not asynchronous task process). If an Image in saving state is deleted it's not deleted right away and only after the API tries to change it's state to active that it find it to be deleted (NotFound) and breaks the Image creation workflow.

Having to check the state of Image record every so often during the upload process is an expensive operation. I'm not completely against it however, we need a cleaner and more performant way to fix this.

So, until we have a good solution, I recommend this to be deferred.

Can we remove the RC1 tag or anyone has any objections?

Thanks!

Changed in glance:
milestone: kilo-rc1 → liberty-1
Jeremy Stanley (fungi)
description: updated
Changed in glance:
milestone: liberty-1 → none
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.