VMware: spawning large amounts of VMs concurrently sometimes causes "VMDK lock" error

Bug #1230047 reported by Ryan Hsu
Affects                   Status        Importance  Assigned to   Milestone
OpenStack Compute (nova)  Fix Released  High        Gary Kotton
VMwareAPI-Team            Confirmed     High        Gary Kotton

Bug Description

When using the VMwareVCDriver, spawning large numbers of virtual machines concurrently causes some instances to end up in the ERROR state. The number of machines that fail to build is unpredictable, and sometimes all instances do spawn successfully.

The issue can be reproduced by running:

    nova boot --image debian-2.6.32-i686 --flavor 1 --num-instances 32 nameless

The number of instances needed to trigger the errors differs from environment to environment; start with 30-40. Two errors seen in the logs cause the instance spawn failures. The first is the ESX host not finding the image in the NFS datastore (even though it is there; otherwise no instances could be spawned at all). The second is the ESX host being unable to access the VMDK image because it is locked.

Image not found error:

Traceback (most recent call last):
  File "/opt/stack/nova/nova/compute/manager.py", line 1408, in _spawn
    block_device_info)
  File "/opt/stack/nova/nova/virt/vmwareapi/driver.py", line 609, in spawn
    admin_password, network_info, block_device_info)
  File "/opt/stack/nova/nova/virt/vmwareapi/vmops.py", line 440, in spawn
    vmdk_file_size_in_kb, linked_clone)
  File "/opt/stack/nova/nova/virt/vmwareapi/volumeops.py", line 71, in attach_disk_to_vm
    self._session._wait_for_task(instance_uuid, reconfig_task)
  File "/opt/stack/nova/nova/virt/vmwareapi/driver.py", line 795, in _wait_for_task
    ret_val = done.wait()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/event.py", line 116, in wait
    return hubs.get_hub().switch()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 187, in switch
    return self.greenlet.switch()
NovaException: File [ryan-nfs] vmware_base/e8c42ed8-05e7-45bc-90c3-49a34e5a37c6.vmdk was not found

Image locked error:

Traceback (most recent call last):
  File "/opt/stack/nova/nova/compute/manager.py", line 1407, in _spawn
    block_device_info)
  File "/opt/stack/nova/nova/virt/vmwareapi/driver.py", line 623, in spawn
    admin_password, network_info, block_device_info)
  File "/opt/stack/nova/nova/virt/vmwareapi/vmops.py", line 504, in spawn
    root_gb_in_kb, linked_clone)
  File "/opt/stack/nova/nova/virt/vmwareapi/volumeops.py", line 71, in attach_disk_to_vm
    self._session._wait_for_task(instance_uuid, reconfig_task)
  File "/opt/stack/nova/nova/virt/vmwareapi/driver.py", line 900, in _wait_for_task
    ret_val = done.wait()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/event.py", line 116, in wait
    return hubs.get_hub().switch()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 187, in switch
    return self.greenlet.switch()
NovaException: Unable to access file [ryan-nfs] vmware_base/f110bb94-2170-4a3a-ae0d-760f95eb8b47.0.

Environment information:

- 1 datacenter, 1 cluster, 7 hosts
- NFS shared datastore
- was able to spawn 7 instances before errors appeared
- screen log with tracebacks: http://paste.openstack.org/show/47410/

Ryan Hsu (rhsu)
description: updated
Revision history for this message
GuoHui Liu (guohliu) wrote :

Have you changed the task_poll_interval config option? The default is 5 seconds; in the NFS case, it might need to be increased to a more appropriate value.
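
For reference, a minimal sketch of what that tweak could look like in nova.conf on the compute node. The option group and the value are assumptions for illustration only; depending on the release, the VMware driver options live in a [vmware] section or as vmwareapi_-prefixed options in [DEFAULT], so check the config reference for your release.

    [vmware]
    # Interval in seconds between polls of remote vSphere tasks. The comment
    # above notes a default of 5 seconds; on NFS-backed datastores a larger
    # value may be more appropriate (10 is only an illustrative choice).
    task_poll_interval = 10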

Revision history for this message
Shawn Hartsock (hartsock) wrote :

I'll try and look at this myself later in the week.

Ryan Hsu (rhsu)
description: updated
summary: - VMware: errors spawning large amounts of VMs
+ VMware: spawning large amounts of VMs sometimes causes errors
Ryan Hsu (rhsu)
description: updated
Vui Lam (vui)
Changed in nova:
status: New → Confirmed
assignee: nobody → Vui Lam (vui)
Tracy Jones (tjones-i)
Changed in nova:
importance: Undecided → High
Revision history for this message
Maithem (maithem) wrote : Re: VMware: spawning large amounts of VMs sometimes causes errors

Spawning a small number of VMs also fails. For example, if you try to spawn 4 VMs at the same time, all will fail except for one. The error message for the other VMs (from the vSphere client) is:

Create virtual disk
[datastore1] vmware_base/4a7fd81b-9119-4a86-96d0-c1f6a7753883.vmdk
Cannot complete the operation because the file or folder [datastore1]
vmware_base/4a7fd81b-9119-4a86-96d0-c1f6a7753883.vmdk already
exists

It seems like multiple VMDKs with the same name are being created. I have a one-node setup and have tried both the ESX and VC drivers; I get the same error.

Revision history for this message
Vui Lam (vui) wrote :

Maithem, if you are talking about multiple concurrent spawns on a datastore that does not yet have the image cached, then that is a different issue (bug 1222948), and https://review.openstack.org/#/c/46400 is supposed to address it. In this case I believe the image is already cached and each spawn operation is just supposed to make use of it, but for some odd reason the cached VMDK is not found.
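
To make the distinction concrete, here is a minimal Python sketch (not nova code; the helpers passed in are hypothetical) of the check-then-copy race that bug 1222948 covers, as opposed to the missing-cached-VMDK symptom reported in this bug:

    # Hypothetical illustration of the image-cache race (bug 1222948).
    # 'file_exists' and 'copy_image_to_cache' stand in for the driver's real
    # datastore helpers; they are not actual nova functions.
    def ensure_cached_image(image_id, file_exists, copy_image_to_cache):
        cached_path = "vmware_base/%s.vmdk" % image_id
        # Check-then-act race: two concurrent spawns can both see the file as
        # missing and both attempt the copy, so the loser fails with a
        # "file already exists" error like the one quoted above.
        if not file_exists(cached_path):
            copy_image_to_cache(image_id, cached_path)
        return cached_path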

Revision history for this message
Tracy Jones (tjones-i) wrote :

Ryan - did this bug occur before you increased your RAM and pre-populated the cache with the image?

Revision history for this message
Ryan Hsu (rhsu) wrote :

Tracy, I pre-populated the cache before I ran the test to reproduce the issue. As for memory, increasing the RAM seemed to help on some systems but did not help others.

Tracy Jones (tjones-i)
tags: added: havana-backport-potential
Changed in openstack-vmwareapi-team:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Vui Lam (vui)
summary: - VMware: spawning large amounts of VMs sometimes causes errors
+ VMware: spawning large amounts of VMs concurrently sometimes causes
+ errors
Revision history for this message
Shawn Hartsock (hartsock) wrote : Re: VMware: spawning large amounts of VMs concurrently sometimes causes errors

Update: we have a possible root cause and a patch in internal development/testing on this problem.

Revision history for this message
Gary Kotton (garyk) wrote :
Changed in nova:
assignee: Vui Lam (vui) → Gary Kotton (garyk)
Changed in openstack-vmwareapi-team:
assignee: Vui Lam (vui) → Gary Kotton (garyk)
Changed in nova:
milestone: none → icehouse-1
Revision history for this message
dan wendlandt (danwent) wrote :

The patch for this seems to be stacked on top of a "feature change" (image aging) that will not be backported to Havana. Would it make sense to merge this change first, so that the backport for Havana is cleaner?

summary: VMware: spawning large amounts of VMs concurrently sometimes causes
- errors
+ "VMDK lock" error
Changed in nova:
milestone: icehouse-1 → icehouse-2
Changed in nova:
status: Confirmed → In Progress
Revision history for this message
Tracy Jones (tjones-i) wrote :

Tracking - This bug is being addressed by https://review.openstack.org/#/c/63933/ - Status: Proposed

Thierry Carrez (ttx)
Changed in nova:
milestone: icehouse-2 → icehouse-3
Thierry Carrez (ttx)
Changed in nova:
milestone: icehouse-3 → icehouse-rc1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/63933
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=bf882fbc4aa68cf0191bce70e622d81969617c02
Submitter: Jenkins
Branch: master

commit bf882fbc4aa68cf0191bce70e622d81969617c02
Author: Gary Kotton <email address hidden>
Date: Tue Dec 24 06:19:26 2013 -0800

    VMware: create datastore utility functions

    The patch moves datastore interfaces from vmops and vm_util
    to a separate file - ds_util.

    This is part of the blueprint vmware-image-cache-management.

    In the past the datastore methods would poll the task
    to determine that it had completed. This has been addressed by
    making the driver _poll_task return the task information.

    The patch set also solves the following bugs by checking if
    the FileAlreadyExistsException exception is raised. This could
    be due to another process or thread creating the file.

    Closes-bug: 1230047
    Closes-bug: 1254128

    Change-Id: I788e33dbcb3dedc41831b976137607274b1c02ca

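In outline, the fix the commit message describes amounts to treating a concurrent creation of the cached file as success rather than an error. A minimal sketch, with stand-in names rather than the real ds_util/error_util code:

    # Sketch only; the exception class and helper are hypothetical stand-ins
    # for the driver's FileAlreadyExistsException handling, not the merged code.
    class FileAlreadyExistsException(Exception):
        """Raised when the datastore file or folder already exists."""

    def cache_image_once(image_id, copy_image_to_cache):
        try:
            copy_image_to_cache(image_id)
        except FileAlreadyExistsException:
            # Another process or thread created the file first; the cached
            # copy is usable, so the spawn proceeds instead of erroring out.
            pass
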
Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: icehouse-rc1 → 2014.1