VMware: spawning large amounts of VMs concurrently sometimes causes "VMDK lock" error

Bug #1230047 reported by Ryan Hsu
Affects                   Status        Importance  Assigned to   Milestone
OpenStack Compute (nova)  Fix Released  High        Gary Kotton
VMwareAPI-Team            Confirmed     High        Gary Kotton

Bug Description

When using the VMwareVCDriver, spawning large numbers of virtual machines concurrently causes some instances to end up in the ERROR state. The number of machines that fail to build is unpredictable, and sometimes all instances do spawn successfully.

The issue can be reproduced by running:

    nova boot --image debian-2.6.32-i686 --flavor 1 --num-instances 32 nameless

The number of instances needed to trigger the errors differs from environment to environment; start with 30-40. Two errors seen in the logs cause the instance spawn failures. The first is the ESX host not finding the image in the NFS datastore (even though it is there; otherwise no instances could be spawned at all). The second is the ESX host being unable to access the VMDK image because it is locked.

Image not found error:

Traceback (most recent call last):
  File "/opt/stack/nova/nova/compute/manager.py", line 1408, in _spawn
    block_device_info)
  File "/opt/stack/nova/nova/virt/vmwareapi/driver.py", line 609, in spawn
    admin_password, network_info, block_device_info)
  File "/opt/stack/nova/nova/virt/vmwareapi/vmops.py", line 440, in spawn
    vmdk_file_size_in_kb, linked_clone)
  File "/opt/stack/nova/nova/virt/vmwareapi/volumeops.py", line 71, in attach_disk_to_vm
    self._session._wait_for_task(instance_uuid, reconfig_task)
  File "/opt/stack/nova/nova/virt/vmwareapi/driver.py", line 795, in _wait_for_task
    ret_val = done.wait()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/event.py", line 116, in wait
    return hubs.get_hub().switch()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 187, in switch
    return self.greenlet.switch()
NovaException: File [ryan-nfs] vmware_base/e8c42ed8-05e7-45bc-90c3-49a34e5a37c6.vmdk was not found

Image locked error:

Traceback (most recent call last):
  File "/opt/stack/nova/nova/compute/manager.py", line 1407, in _spawn
    block_device_info)
  File "/opt/stack/nova/nova/virt/vmwareapi/driver.py", line 623, in spawn
    admin_password, network_info, block_device_info)
  File "/opt/stack/nova/nova/virt/vmwareapi/vmops.py", line 504, in spawn
    root_gb_in_kb, linked_clone)
  File "/opt/stack/nova/nova/virt/vmwareapi/volumeops.py", line 71, in attach_disk_to_vm
    self._session._wait_for_task(instance_uuid, reconfig_task)
  File "/opt/stack/nova/nova/virt/vmwareapi/driver.py", line 900, in _wait_for_task
    ret_val = done.wait()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/event.py", line 116, in wait
    return hubs.get_hub().switch()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 187, in switch
    return self.greenlet.switch()
NovaException: Unable to access file [ryan-nfs] vmware_base/f110bb94-2170-4a3a-ae0d-760f95eb8b47.0.

Environment information:

- 1 datacenter, 1 cluster, 7 hosts
- NFS shared datastore
- was able to spawn 7 instances before errors appeared
- screen log with tracebacks: http://paste.openstack.org/show/47410/

Ryan Hsu (rhsu)
description: updated
Revision history for this message
GuoHui Liu (guohliu) wrote :

Have you changed the task_poll_interval config option? The default is 5 seconds; in the NFS case, it might need to be increased to a more appropriate value.
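
For reference, a minimal sketch of what that tweak could look like in nova.conf on the compute node. The option group and the value are assumptions for illustration only; depending on the release, the VMware driver options live in a [vmware] section or as vmwareapi_-prefixed options in [DEFAULT], so check the config reference for your release.

    [vmware]
    # Interval in seconds between polls of remote vSphere tasks. The comment
    # above notes a default of 5 seconds; on NFS-backed datastores a larger
    # value may be more appropriate (10 is only an illustrative choice).
    task_poll_interval = 10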

Revision history for this message
Shawn Hartsock (hartsock) wrote :

I'll try and look at this myself later in the week.

Ryan Hsu (rhsu)
description: updated
summary: - VMware: errors spawning large amounts of VMs
+ VMware: spawning large amounts of VMs sometimes causes errors
Ryan Hsu (rhsu)
description: updated
Vui Lam (vui)
Changed in nova:
status: New → Confirmed
assignee: nobody → Vui Lam (vui)
Tracy Jones (tjones-i)
Changed in nova:
importance: Undecided → High
Revision history for this message
Maithem (maithem) wrote : Re: VMware: spawning large amounts of VMs sometimes causes errors

Spawning a small number of VMs also fails. For example, if you try to spawn 4 VMs at the same time, all will fail except for one. The error message for the other VMs (from the vSphere client) is:

Create virtual disk
[datastore1] vmware_base/4a7fd81b-9119-4a86-96d0-c1f6a7753883.vmdk
Cannot complete the operation because the file or folder [datastore1]
vmware_base/4a7fd81b-9119-4a86-96d0-c1f6a7753883.vmdk already
exists

It seems like multiple VMDKs with the same name are being created. I have a one-node setup and have tried both the ESX and VC drivers; I get the same error.

Revision history for this message
Vui Lam (vui) wrote :

Maithem, if you are talking about multiple concurrent spawns on a datastore that does not yet have the image cached, then that is a different issue (bug 1222948), and https://review.openstack.org/#/c/46400 is supposed to address it. In this case I believe the image is already cached and each spawn operation is just supposed to make use of it, but for some odd reason the cached VMDK is not found.
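
To make the distinction concrete, here is a minimal Python sketch (not nova code; the helpers passed in are hypothetical) of the check-then-copy race that bug 1222948 covers, as opposed to the missing-cached-VMDK symptom reported in this bug:

    # Hypothetical illustration of the image-cache race (bug 1222948).
    # 'file_exists' and 'copy_image_to_cache' stand in for the driver's real
    # datastore helpers; they are not actual nova functions.
    def ensure_cached_image(image_id, file_exists, copy_image_to_cache):
        cached_path = "vmware_base/%s.vmdk" % image_id
        # Check-then-act race: two concurrent spawns can both see the file as
        # missing and both attempt the copy, so the loser fails with a
        # "file already exists" error like the one quoted above.
        if not file_exists(cached_path):
            copy_image_to_cache(image_id, cached_path)
        return cached_path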

Revision history for this message
Tracy Jones (tjones-i) wrote :

Ryan - did this bug occur before you increased your RAM and pre-populated the cache with the image?

Revision history for this message
Ryan Hsu (rhsu) wrote :

Tracy, I pre-populated the cache before I ran the test to reproduce the issue. As for memory, increasing the RAM seemed to help on some systems but did not help others.

Tracy Jones (tjones-i)
tags: added: havana-backport-potential
Changed in openstack-vmwareapi-team:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Vui Lam (vui)
summary: - VMware: spawning large amounts of VMs sometimes causes errors
+ VMware: spawning large amounts of VMs concurrently sometimes causes
+ errors
Revision history for this message
Shawn Hartsock (hartsock) wrote : Re: VMware: spawning large amounts of VMs concurrently sometimes causes errors

Update: we have a possible root cause and a patch in internal development/testing on this problem.

Revision history for this message
Gary Kotton (garyk) wrote :
Changed in nova:
assignee: Vui Lam (vui) → Gary Kotton (garyk)
Changed in openstack-vmwareapi-team:
assignee: Vui Lam (vui) → Gary Kotton (garyk)
Changed in nova:
milestone: none → icehouse-1
Revision history for this message
dan wendlandt (danwent) wrote :

The patch for this seems to be stacked on top of a "feature change" (image aging) that will not be backported to Havana. Would it make sense to merge this change first, so that the backport for Havana is cleaner?

summary: VMware: spawning large amounts of VMs concurrently sometimes causes
- errors
+ "VMDK lock" error
Changed in nova:
milestone: icehouse-1 → icehouse-2
Changed in nova:
status: Confirmed → In Progress
Revision history for this message
Tracy Jones (tjones-i) wrote :

Tracking - This bug is being addressed by https://review.openstack.org/#/c/63933/ - Status: Proposed

Thierry Carrez (ttx)
Changed in nova:
milestone: icehouse-2 → icehouse-3
Thierry Carrez (ttx)
Changed in nova:
milestone: icehouse-3 → icehouse-rc1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/63933
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=bf882fbc4aa68cf0191bce70e622d81969617c02
Submitter: Jenkins
Branch: master

commit bf882fbc4aa68cf0191bce70e622d81969617c02
Author: Gary Kotton <email address hidden>
Date: Tue Dec 24 06:19:26 2013 -0800

    VMware: create datastore utility functions

    The patch moves datastore interfaces from vmops and vm_util
    to a separate file - ds_util.

    This is part of the blueprint vmware-image-cache-management.

    In the past the datastore methods would poll the task
    to determine that it had completed. This has been addressed by
    making the driver _poll_task return the task information.

    The patch set also solves the following bugs by checking if
    the FileAlreadyExistsException exception is raised. This could
    be due to another process or thread creating the file.

    Closes-bug: 1230047
    Closes-bug: 1254128

    Change-Id: I788e33dbcb3dedc41831b976137607274b1c02ca

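In outline, the fix the commit message describes amounts to treating a concurrent creation of the cached file as success rather than an error. A minimal sketch, with stand-in names rather than the real ds_util/error_util code:

    # Sketch only; the exception class and helper are hypothetical stand-ins
    # for the driver's FileAlreadyExistsException handling, not the merged code.
    class FileAlreadyExistsException(Exception):
        """Raised when the datastore file or folder already exists."""

    def cache_image_once(image_id, copy_image_to_cache):
        try:
            copy_image_to_cache(image_id)
        except FileAlreadyExistsException:
            # Another process or thread created the file first; the cached
            # copy is usable, so the spawn proceeds instead of erroring out.
            pass
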
Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: icehouse-rc1 → 2014.1