[GLANCE][CEPH] deleting an image takes a long time

Bug #1456683 reported by Leontiy Istomin
Affects: Fuel for OpenStack | Status: Fix Released | Importance: High | Assigned to: Bartłomiej Piotrowski | Milestone: 6.1
Affects: 6.0.x | Status: Won't Fix | Importance: High | Assigned to: MOS Maintenance

Bug Description

api: '1.0'
astute_sha: 055b2d82fe8499b27c7047295e2e36a7a2c5d430
auth_required: true
build_id: 2015-04-16_21-30-10
build_number: '317'
feature_groups:
- mirantis
fuellib_sha: db5f39e96e7ab9f79691202755e547bf8242661f
fuelmain_sha: 0de2d2039e76839d339f977df45b111bef7200d6
nailgun_sha: 52d92c86e68602fb5dd2f3b8870a773d20a387ef
openstack_version: 2014.2-6.1
ostf_sha: b0991dbad159f53d335efa5d31cb94016ad5312e
production: docker
python-fuelclient_sha: 279ffc358e40dbdc162cfe76898cbd0874529f1f
release: '6.1'

Successfully deployed the following configuration with LMA plugin:
Ubuntu,IBP,HA, Neutron-gre,Ceph-all,Nova-debug,Nova-quotas, 6.1-317
Controllers:3 Computes:47

Performed a rally test. During the create_and_upload_volume_to_image scenario we got the following error:
from rally:
http://paste.openstack.org/show/228286/
When rally tried to delete an image from glance, it got a 504 from haproxy:
from messages (haproxy):
http://paste.openstack.org/show/228287/
Rally had requested this deletion a minute earlier.
from rally:
http://paste.openstack.org/show/228288/
The image was deleted 62 seconds after the request.
from glance-all:
http://paste.openstack.org/show/228319/

The rally test tries to create/delete 405 images with concurrency=5; volume size = 1 GB.
The test failed on iteration 134.

Here is Diagnostic Snapshot: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-05-19_12-47-27.tar.xz

description: updated
Changed in mos:
importance: Undecided → High
Revision history for this message
Leontiy Istomin (listomin) wrote :

glance logs from all controllers

Revision history for this message
Leontiy Istomin (listomin) wrote :

rally log

Changed in mos:
milestone: none → 6.1
Revision history for this message
Inessa Vasilevskaya (ivasilevskaya) wrote :

In order to fulfill a delete request, glance-api issues 3 HTTP requests to glance-registry: GET for image retrieval, PUT for 'deleted/pending_delete' state change and DELETE for metadata deletion (https://github.com/openstack/glance/blob/master/glance/api/v1/images.py#L1030)
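The three-request sequence above can be sketched as follows. This is a hypothetical illustration of the order of registry round-trips described for glance v1 image deletion, not glance's actual code; the function name is invented for this example.

```python
def registry_calls_for_delete(image_id):
    """Return the registry round-trips glance-api makes for one image DELETE
    (per the v1 flow described above): GET the record, PUT the state change,
    then DELETE the metadata."""
    path = "/images/{}".format(image_id)
    return [
        ("GET", path),     # retrieve the image record
        ("PUT", path),     # mark it deleted/pending_delete
        ("DELETE", path),  # remove the metadata
    ]

calls = registry_calls_for_delete("f66cade6-cea8-4cea-86b9-2ec8a728f51b")
```

Each of these is a separate HTTP round-trip through haproxy to glance-registry, so a delay on any one of them pushes the whole client-facing DELETE past the proxy timeout.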

The 504 appeared because the registry-requests were dispatched behind schedule: the very last registry DELETE metadata request was scheduled 62 seconds after the client initiated image deletion, and the very first, GET registry request didn't happen right away, but was scheduled only 20:38:45 - 20:38:17 = 32 seconds after the original delete initiation.

1. Client initiated image deletion at 20:38:17

(grep f66cade6-cea8-4cea-86b9-2ec8a728f51b ~/Downloads/create-and-upload-volume-to-image-rally-stdout.log | grep DELETE)

2015-05-18 20:38:17.772 11233 DEBUG glanceclient.common.http [-] curl -g -i -X DELETE -H 'Accept-Encoding: gzip, deflate' -H 'Accept: */*' -H 'User-Agent: python-glanceclient' -H 'Connection: keep-alive' -H 'X-Auth-Token: {SHA1}ea6c5ec55c9482b08b0f4edddbd6eee61bd7fd2c' -H 'Content-Type: application/octet-stream' http://172.16.45.101:9292/v1/images/f66cade6-cea8-4cea-86b9-2ec8a728f51b log_curl_request /opt/stack/.venv/lib/python2.7/site-packages/glanceclient/common/http.py:123

2. DELETE request from haproxy arrived at glance-api at 20:38:17

(grep f66cade6-cea8-4cea-86b9-2ec8a728f51b ../node2/var/log/glance-all.log.2 | grep DELETE)

<151>May 18 20:38:17 node-2 glance-api 2015-05-18 20:38:17.778 20990 DEBUG glance.api.middleware.version_negotiation [-] Determining version of request: DELETE /v1/images/f66cade6-cea8-4cea-86b9-2ec8a728f51b Accept: */* process_request /usr/lib/python2.7/dist-packages/glance/api/middleware/version_negotiation.py:44

3. GET request from haproxy arrived at glance-registry at 20:38:45

(grep f66cade6-cea8-4cea-86b9-2ec8a728f51b node1/var/log/glance-all.log.2 | grep GET)

<151>May 18 20:38:45 node-2 glance-api 2015-05-18 20:38:45.056 20990 DEBUG glance.registry.client.v1.client [-] Registry request GET /images/f66cade6-cea8-4cea-86b9-2ec8a728f51b HTTP 200 request id req-6bf25707-8dc6-4662-a8d2-58aff82d3555 do_request /usr/lib/python2.7/dist-packages/glance/registry/client/v1/client.py:122

4. PUT request from haproxy arrived at glance-registry only at 20:39:15 and was completed right away:

(grep f66cade6-cea8-4cea-86b9-2ec8a728f51b node1/var/log/glance-all.log.2 | grep PUT)

<151>May 18 20:39:15 node-1 glance-registry 2015-05-18 20:39:15.624 24353 DEBUG routes.middleware [30e47e58-eed8-42e4-8d39-5f728ed9ead2 719d38ea179943b9b5c070cc3f1b6bd8 74bfc9d863434550b299bebb20577073 - - -] Matched PUT /images/f66cade6-cea8-4cea-86b9-2ec8a728f51b __call__ /usr/lib/python2.7/dist-packages/routes/middleware.py:100
....
<150>May 18 20:39:15 node-1 glance-registry 2015-05-18 20:39:15.700 24353 INFO glance.wsgi.server [30e47e58-eed8-42e4-8d39-5f728ed9ead2 719d38ea179943b9b5c070cc3f1b6bd8 74bfc9d863434550b299bebb20577073 - - -] 192.168.0.2 - - [18/May/2015 20:39:15] "PUT /images/f66cade6-cea8-4cea-86b9-2ec8a728f51b HTTP/1.1" 200 984 30.423810

5. DELETE metadata request followed, which completed at 20:39:19 and resulted in 200 OK from glance-api...


Changed in mos:
status: New → Confirmed
Revision history for this message
Leontiy Istomin (listomin) wrote :

So, it seems haproxy waits a long time for the answer from glance-registry.
According to the following line, there was a slow request to the registry, ~17 sec:
<134>May 18 20:38:45 node-1 haproxy[17376]: 192.168.0.5:57567 [18/May/2015:20:38:28.367] glance-registry glance-registry/node-3 0/0/0/16905/16905 200 983 - - ---- 560/4/3/1/0 0/0 "GET /images/f66cade6-cea8-4cea-86b9-2ec8a728f51b HTTP/1.1
I attached a screenshot with atop at that time, but I don't think the node was that busy.
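The ~17 sec figure comes from the slash-separated timer field in the haproxy HTTP log line above (`0/0/0/16905/16905`, in milliseconds). A hedged sketch of pulling those timers out, assuming haproxy's documented Tq/Tw/Tc/Tr/Tt HTTP log layout and using a simple heuristic (first slash-separated token of five integers) rather than a full log parser:

```python
def parse_haproxy_timers(line):
    """Extract the Tq/Tw/Tc/Tr/Tt timer field (in ms) from an haproxy
    HTTP log line; returns a dict, or None if no timer token is found."""
    for token in line.split():
        parts = token.split("/")
        # the timer field is the first token of exactly five integers
        if len(parts) == 5 and all(p.lstrip("-").isdigit() for p in parts):
            names = ("Tq", "Tw", "Tc", "Tr", "Tt")
            return dict(zip(names, map(int, parts)))
    return None

line = ('192.168.0.5:57567 [18/May/2015:20:38:28.367] glance-registry '
        'glance-registry/node-3 0/0/0/16905/16905 200 983 - - ---- '
        '560/4/3/1/0 0/0 "GET /images/f66cade6-cea8-4cea-86b9-2ec8a728f51b HTTP/1.1"')
timers = parse_haproxy_timers(line)
```

Here `timers["Tr"]` is 16905 ms, i.e. the registry took ~17 seconds to produce its response while connect and queue times were zero, which points at the registry itself rather than the proxy.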

Revision history for this message
Inessa Vasilevskaya (ivasilevskaya) wrote :

I suspect a bottleneck in the glance-registry request queue. Each glance-registry request, once picked up by glance-registry, is processed in less than a second, but a single request waits a solid amount of time before it becomes "the chosen one":

17 seconds for GET (20:38:28 - 20:38:45), 30 seconds for PUT (20:38:45 - 20:39:15) and 4 seconds for DELETE (20:39:15 - 20:39:19).

Revision history for this message
Leontiy Istomin (listomin) wrote :

The same issue with create_and_delete_image rally scenario. The same build.
Baremetal,Ubuntu,IBP,HA, Neutron-gre,Ceph-all,Nova-debug,Nova-quotas, 6.1-317
Controllers:3 Computes:47

from rally.log:
http://paste.openstack.org/show/229681/

from messages (haproxy):
http://paste.openstack.org/show/229682/

The registry responded too late.
from glance-all.log
http://paste.openstack.org/show/229693/

snapshot is here: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-05-20_09-38-14.tar.xz

Revision history for this message
Leontiy Istomin (listomin) wrote :

The glance client timeout for this operation is 10 minutes, so we need to add a "timeout server" parameter with value 11m to the following files to avoid haproxy closing the sessions:
/etc/haproxy/conf.d/080-glance-api.cfg
/etc/haproxy/conf.d/090-glance-registry.cfg
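A sketch of what that change looks like in one of those files. The section name and bind line are illustrative (the exact contents of the Fuel-generated configs may differ); only the `timeout server 11m` line is the actual fix:

```
# /etc/haproxy/conf.d/080-glance-api.cfg (illustrative fragment)
listen glance-api
  bind 172.16.45.101:9292
  timeout server 11m   # must exceed the glance client's 10-minute timeout
```

With the server timeout longer than the client timeout, haproxy no longer returns 504 before the client itself would have given up.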

no longer affects: mos
Changed in fuel:
status: New → Triaged
tags: added: low-hanging-fruit
Changed in fuel:
importance: Undecided → High
assignee: nobody → Fuel Library Team (fuel-library)
milestone: none → 6.1
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Bartlomiej Piotrowski (bpiotrowski)
Changed in fuel:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/184770

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/184770
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=9e4bbca9e03b3bf032d8f1cec32453b027cc14a6
Submitter: Jenkins
Branch: master

commit 9e4bbca9e03b3bf032d8f1cec32453b027cc14a6
Author: Bartłomiej Piotrowski <email address hidden>
Date: Thu May 21 14:34:07 2015 +0200

    Close connections to glance-{api,registry} after 11 minutes

    Change-Id: I5d8ba80038dafca12b06bef7804b28cae8e57207
    Closes-Bug: 1456683

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Leontiy Istomin (listomin) wrote :

Has not been reproduced with builds 511 and 521.

Changed in fuel:
status: Fix Committed → Fix Released
Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

Won't Fix for 6.0-updates as there is no delivery channel for Fuel fixes in 6.0 updates
