Fuel for OpenStack

[library] Large qcow2 images booted to and create volume fail.

Bug #1280399 reported by Francis Smith on 2014-02-14

This bug affects 2 people

Affects             Status         Importance  Assigned to        Milestone
Fuel for OpenStack  Fix Committed  Medium      Aleksandr Didenko  Fuel for OpenStack 6.1

Bug Description

Fuel 4.0 setup.
6 Node cluster, ceph for images and volumes, RadosGW enabled
3 controller + ceph
1 compute
2 compute + ceph

boot from image create volume (Base_OEL_IMG) choosing 40G volume works

boot from image create volume (Base_OEL) choosing 40G volume fails
==> the volume is created successfully
==> booting another instance but choosing the volume created from the boot from image create volume (Base_OEL_IMG) is successful.

boot from image create volume (Centos_base) choosing 40G volume works

It seems a timeout is happening here: any large qcow2 image used with boot from image (create volume) appears to fail.

Francis Smith (fsmith-6) wrote on 2014-02-14:

More info:

When I try to boot more than 6 instances from my Centos_base image with create volume, some of them may fail as well. Raw images don't seem to have this restriction.

Andrew Woodward (xarses) on 2014-02-14

tags:

added: 4.0 ceph

Mike Scherbakov (mihgen) on 2014-02-14

Changed in fuel:
milestone:	none → 4.1
importance:	Undecided → High
status:	New → Confirmed

Mike Scherbakov (mihgen) on 2014-02-18

Changed in fuel:
assignee:	nobody → Fuel Library Team (fuel-library)

Vladimir Kuklin (vkuklin) wrote on 2014-02-18:

Could you add a diagnostic snapshot of the environment, please?

Changed in fuel:
status:	Confirmed → Incomplete

Dmitry Borodaenko (angdraug) wrote on 2014-02-18:

IRC discussion of this issue can be found here:
http://irclog.perlgeek.de/fuel/2014-02-14#i_8286110

Dmitry Borodaenko (angdraug) wrote on 2014-02-18:

IIRC the conclusion of the discussion was that the most likely cause of the failure would be a timeout while cinder volume manager waits for qemu-img convert to finish converting a qcow2 image to raw.

Andrew Woodward (xarses) wrote on 2014-02-19:

There is enough information here to test that this isn't a problem elsewhere. Two people in IRC discussed having variations of this issue.

Changed in fuel:
status:	Incomplete → New

Vladimir Kuklin (vkuklin) wrote on 2014-02-20:

OK, guys, we need some OpenStack logs to identify where the failure is happening. Without them, debugging the issue will take more time. So we can try to fix it in 4.1 if we get a diagnostic snapshot; otherwise we postpone it to 5.0, as we need time to reproduce it and collect all the debug information.

Changed in fuel:
milestone:	4.1 → 5.0

Evgeniy L (rustyrobot) on 2014-02-27

Changed in fuel:
status:	New → Confirmed

Vladimir Kuklin (vkuklin) on 2014-04-03

Changed in fuel:
importance:	High → Medium

Vladimir Kuklin (vkuklin) on 2014-04-29

Changed in fuel:
milestone:	5.0 → 5.1

Sergii Golovatiuk (sgolovatiuk) on 2014-06-24

Changed in fuel:
assignee:	Fuel Library Team (fuel-library) → Sergii Golovatiuk (sgolovatiuk)

Dmitry Borodaenko (angdraug) on 2014-06-25

tags:

added: cinder
removed: 4.0

Sergii Golovatiuk (sgolovatiuk) on 2014-06-25

Changed in fuel:
assignee:	Sergii Golovatiuk (sgolovatiuk) → Aleksandr Didenko (adidenko)

Dmitry Ilyin (idv1985) on 2014-07-15

summary:

- Large qcow2 images booted to and create volume fail.
+ [puppet] Large qcow2 images booted to and create volume fail.

Dmitry Ilyin (idv1985) on 2014-07-15

summary:

- [puppet] Large qcow2 images booted to and create volume fail.
+ [library] Large qcow2 images booted to and create volume fail.

Vladimir Kuklin (vkuklin) on 2014-07-28

Changed in fuel:
milestone:	5.1 → 6.0

Vladimir Kuklin (vkuklin) on 2014-11-26

Changed in fuel:
milestone:	6.0 → 6.1

Aleksandr Didenko (adidenko) wrote on 2015-04-07:

Managed to reproduce it. With large images that require large volumes, spawning a new instance may time out while waiting for the block device mapping to finish. Cinder, however, continues creating the volume until it is complete.

That's why you can use this volume later to spawn a new instance from it.

So we just need to increase the timeout value for block device mapping in the nova config, using the configuration parameters introduced in Juno (https://review.openstack.org/#/c/102891/). The default values are:

block_device_allocate_retries=60
block_device_allocate_retries_interval=3

That is 3 minutes in total (60 retries × 3 seconds). After increasing block_device_allocate_retries from 60 to 300 (15 minutes) in my local env, spawning new instances from a 1.8G qcow2 image with a 40G volume started to work just fine.
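
To make the arithmetic above concrete, here is a minimal Python sketch of a retry-and-sleep polling loop. It is a simplified illustration, not nova's actual implementation: the function names `allocation_timeout_seconds`, `wait_for_volume` and `fake_volume` are hypothetical, and the volume status is faked so the example runs instantly.

```python
from typing import Callable, Iterator


def allocation_timeout_seconds(retries: int, interval: int) -> int:
    """Approximate upper bound on the wait: retries multiplied by interval."""
    return retries * interval


def wait_for_volume(
    get_status: Callable[[], str],
    retries: int,
    interval: int,
    sleep: Callable[[float], None] = lambda _seconds: None,
) -> int:
    """Poll get_status() until it reports "available"; return attempts used.

    Hypothetical helper for illustration only. Pass time.sleep as `sleep`
    to wait for real; the default is a no-op so the sketch runs instantly.
    """
    for attempt in range(1, retries + 1):
        if get_status() == "available":
            return attempt
        sleep(interval)
    raise TimeoutError(f"volume not available after {retries} attempts")


def fake_volume(polls_until_ready: int) -> Callable[[], str]:
    """Fake volume that reports "creating" for the first N polls."""
    statuses: Iterator[str] = iter(["creating"] * polls_until_ready)
    return lambda: next(statuses, "available")


if __name__ == "__main__":
    print(allocation_timeout_seconds(60, 3))    # old default, in seconds
    print(allocation_timeout_seconds(300, 3))   # increased value, in seconds
    # A volume that needs about 5 minutes (100 polls at 3 s) fits in 300 retries.
    print(wait_for_volume(fake_volume(100), retries=300, interval=3))
```

With the default of 60 retries the same fake volume raises `TimeoutError`, which mirrors the instance going to error state while the volume itself still finishes creating.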

OpenStack Infra (hudson-openstack) wrote on 2015-04-07: Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/171233

Changed in fuel:
status:	Confirmed → In Progress

OpenStack Infra (hudson-openstack) wrote on 2015-04-08: Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/171233
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=a3322f28b5548dd83e89b225b88cf067cf9da43d
Submitter: Jenkins
Branch: master

commit a3322f28b5548dd83e89b225b88cf067cf9da43d
Author: Aleksandr Didenko <email address hidden>
Date: Tue Apr 7 17:15:07 2015 +0300

Make block device mapping timeout configurable

    When booting instances passing in block-device and increasing the
    volume size, instances can go into error state if the volume takes
    longer to create than 3 minutes (current default value).

    This change adds support for 2 new cluster configuration options:
    - block_device_allocate_retries (default: 300)
    - block_device_allocate_retries_interval (default: 3)

    These options are configurable via fuel cli or Hiera. They are added
    to nova.conf on compute nodes and allow adjusting the block device
    allocation timeout value.

DocImpact

Change-Id: If9148da046d31aab60a13bd70e94700331770000
Closes-bug: #1280399

Changed in fuel:
status:	In Progress → Fix Committed
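
For readers who want to see what the resulting settings might look like, the following Python sketch renders a nova.conf fragment with the new defaults from the commit message. It assumes the options live in the `[DEFAULT]` section, which this report does not state, and `render_nova_fragment` is a hypothetical helper, not part of fuel-library.

```python
import configparser
import io


def render_nova_fragment(retries: int = 300, interval: int = 3) -> str:
    """Return an INI fragment with the two block device allocation options.

    Assumption: both options belong to the [DEFAULT] section of nova.conf.
    """
    parser = configparser.ConfigParser()
    parser["DEFAULT"] = {
        "block_device_allocate_retries": str(retries),
        "block_device_allocate_retries_interval": str(interval),
    }
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(render_nova_fragment())
```

In a Fuel deployment the values would come from the fuel cli or Hiera rather than from a script like this; the sketch only shows the shape of the resulting configuration.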

Stanislav Makar (smakar) on 2015-06-04

tags:

added: on-verifying
