[library] Large qcow2 images booted to and create volume fail.

Bug #1280399 reported by Francis Smith
This bug affects 2 people
Affects: Fuel for OpenStack
Status: Fix Committed
Importance: Medium
Assigned to: Aleksandr Didenko

Bug Description

Fuel 4.0 setup.
6 Node cluster, ceph for images and volumes, RadosGW enabled
3 controller + ceph
1 compute
2 compute + ceph

[root@node-7 ~]# glance image-list
+--------------------------------------+--------------+-------------+------------------+------------+--------+
| ID | Name | Disk Format | Container Format | Size | Status |
+--------------------------------------+--------------+-------------+------------------+------------+--------+
| a5e4ea4f-5ea6-4caa-9037-c25c6b3947f1 | Base_OEL | qcow2 | bare | 1452933120 | active |
| aa6c93d1-2bba-4ef7-895a-df5c097a9faa | Base_OEL_IMG | raw | bare | 4294967296 | active |
| 31021860-a663-44ec-bb85-89dcb017ec5d | Centos_base | qcow2 | bare | 793903104 | active |
| 19f79ec3-5cb7-4fed-b143-e47cafe25313 | TestVM | qcow2 | bare | 14811136 | active |
+--------------------------------------+--------------+-------------+------------------+------------+--------+

Boot from image (creates a new volume) with Base_OEL_IMG and a 40G volume: works.

Boot from image (creates a new volume) with Base_OEL and a 40G volume: fails.
==> The volume itself is created successfully.
==> Booting another instance from the volume created by the boot from image (creates a new volume) attempt (Base_OEL_IMG) succeeds.

Boot from image (creates a new volume) with Centos_base and a 40G volume: works.

This looks like a timeout: any large qcow2 image used to boot from image and create a volume seems to fail.

Francis Smith (fsmith-6) wrote :

More info:

When I boot more than 6 instances from my Centos_base image with volume creation, some of them fail intermittently as well. Raw images don't seem to have this restriction.

Andrew Woodward (xarses)
tags: added: 4.0 ceph
Mike Scherbakov (mihgen)
Changed in fuel:
milestone: none → 4.1
importance: Undecided → High
status: New → Confirmed
Mike Scherbakov (mihgen)
Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
Vladimir Kuklin (vkuklin) wrote :

Could you add a diagnostic snapshot of the environment, please?

Changed in fuel:
status: Confirmed → Incomplete
Dmitry Borodaenko (angdraug) wrote :

IRC discussion of this issue can be found here:
http://irclog.perlgeek.de/fuel/2014-02-14#i_8286110

Dmitry Borodaenko (angdraug) wrote :

IIRC the conclusion of the discussion was that the most likely cause of the failure would be a timeout while cinder volume manager waits for qemu-img convert to finish converting a qcow2 image to raw.

Andrew Woodward (xarses) wrote :

There is enough information here to test that this isn't a problem elsewhere. Two people in IRC discussed having variations of this issue.

Changed in fuel:
status: Incomplete → New
Vladimir Kuklin (vkuklin) wrote :

OK, we need some OpenStack logs to identify where the failure is happening. Without them, debugging the issue will take more time. So we can try to fix it in 4.1 if we get a diagnostic snapshot; otherwise we postpone it to 5.0, since we will need time to reproduce it and collect all the debug information.

Changed in fuel:
milestone: 4.1 → 5.0
Evgeniy L (rustyrobot)
Changed in fuel:
status: New → Confirmed
Changed in fuel:
importance: High → Medium
Changed in fuel:
milestone: 5.0 → 5.1
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Sergii Golovatiuk (sgolovatiuk)
tags: added: cinder
removed: 4.0
Changed in fuel:
assignee: Sergii Golovatiuk (sgolovatiuk) → Aleksandr Didenko (adidenko)
Dmitry Ilyin (idv1985)
summary: - Large qcow2 images booted to and create volume fail.
+ [puppet] Large qcow2 images booted to and create volume fail.
Dmitry Ilyin (idv1985)
summary: - [puppet] Large qcow2 images booted to and create volume fail.
+ [library] Large qcow2 images booted to and create volume fail.
Changed in fuel:
milestone: 5.1 → 6.0
Changed in fuel:
milestone: 6.0 → 6.1
Aleksandr Didenko (adidenko) wrote :

Managed to reproduce it. With large images that require large volumes, spawning a new instance may time out while waiting for block device mapping to finish. But cinder keeps creating the volume until it is complete:

root@node-6:~# cinder list
+--------------------------------------+-------------+--------------+------+-------------+----------+-------------+
| ID | Status | Display Name | Size | Volume Type | Bootable | Attached to |
+--------------------------------------+-------------+--------------+------+-------------+----------+-------------+
| d9fb1238-91e7-4ae5-99d2-23faa7aab550 | downloading | | 40 | None | false | |
+--------------------------------------+-------------+--------------+------+-------------+----------+-------------+

That's why you can use this volume later to spawn a new instance from it.
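
The behaviour described above can be modelled roughly as a bounded polling loop. This is only a simplified sketch, not nova's actual code: `wait_for_volume`, `fake_volume`, and the status strings are hypothetical names chosen for illustration, and the sleep function is injected so the example runs instantly.

```python
import itertools


def wait_for_volume(get_status, retries=60, interval=3, sleep=lambda s: None):
    """Poll get_status() until it reports 'available' or the retry budget runs out.

    Hypothetical model of a bounded wait; returns the attempt number that succeeded.
    """
    for attempt in range(1, retries + 1):
        status = get_status()
        if status == "available":
            return attempt
        if status == "error":
            raise RuntimeError("volume creation failed")
        sleep(interval)  # wait before the next poll
    raise TimeoutError(f"volume not available after {retries} attempts")


def fake_volume(ready_after):
    """Return a status function that says 'downloading' for the first ready_after polls."""
    counter = itertools.count()
    return lambda: "available" if next(counter) >= ready_after else "downloading"


# A volume that needs 100 polls exceeds a budget of 60 but fits within 300.
try:
    wait_for_volume(fake_volume(100), retries=60)
except TimeoutError as exc:
    print("instance spawn gives up:", exc)

print("succeeds on attempt", wait_for_volume(fake_volume(100), retries=300))
```

In this model the caller gives up once the budget is spent, while the volume (here the fake status function) would still go on to become available, which matches the volume being usable afterwards.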

So we just need to increase the timeout for block device mapping in the nova config, using the configuration parameters introduced in Juno (https://review.openstack.org/#/c/102891/). The default values are:

block_device_allocate_retries=60
block_device_allocate_retries_interval=3

That is 3 minutes (60 retries × 3 seconds). After increasing block_device_allocate_retries from 60 to 300 (15 minutes) in my local env, spawning new instances from a 1.8G qcow2 image with a 40G volume started to work just fine.
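
The arithmetic behind those two figures can be checked with a few lines. This is a rough upper bound only, assuming the total wait is simply retries multiplied by the interval in seconds; `allocation_timeout_seconds` is a hypothetical helper, not a nova function.

```python
def allocation_timeout_seconds(retries, interval):
    """Approximate total wait: number of retries times the interval in seconds."""
    return retries * interval


# Compare the default retry count (60) with the increased one (300).
for retries in (60, 300):
    seconds = allocation_timeout_seconds(retries, 3)
    print(f"retries={retries}: {seconds} s = {seconds // 60} min")
# prints:
# retries=60: 180 s = 3 min
# retries=300: 900 s = 15 min
```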

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/171233

Changed in fuel:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/171233
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=a3322f28b5548dd83e89b225b88cf067cf9da43d
Submitter: Jenkins
Branch: master

commit a3322f28b5548dd83e89b225b88cf067cf9da43d
Author: Aleksandr Didenko <email address hidden>
Date: Tue Apr 7 17:15:07 2015 +0300

    Make block device mapping timeout configurable

    When booting instances passing in block-device and increasing the
    volume size, instances can go into error state if the volume takes
    longer than 3 minutes to create (current default value).

    This change adds support for 2 new cluster configuration options:
    - block_device_allocate_retries (default: 300)
    - block_device_allocate_retries_interval (default: 3)

    These options are configurable via fuel cli or Hiera. They are added
    to nova.conf on compute nodes and allow adjusting the block device
    allocation timeout value.

    DocImpact

    Change-Id: If9148da046d31aab60a13bd70e94700331770000
    Closes-bug: #1280399
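
As an illustration of what the resulting settings might look like in nova.conf, the sketch below renders the two options with Python's standard `configparser`. It assumes the options live in the `[DEFAULT]` section; `render_nova_conf` is a hypothetical helper and is not how fuel-library (which uses Puppet) writes the file.

```python
import configparser
import io


def render_nova_conf(retries=300, interval=3):
    """Render an INI fragment with the two block device allocation options.

    Assumption: both options belong to the [DEFAULT] section of nova.conf.
    """
    conf = configparser.ConfigParser()
    conf["DEFAULT"] = {
        "block_device_allocate_retries": str(retries),
        "block_device_allocate_retries_interval": str(interval),
    }
    buf = io.StringIO()
    conf.write(buf)
    return buf.getvalue()


# Print the fragment using the defaults named in the commit message.
print(render_nova_conf())
```

Passing a different `retries` value shows how an operator-chosen setting would change the fragment.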

Changed in fuel:
status: In Progress → Fix Committed
Stanislav Makar (smakar)
tags: added: on-verifying