ServerRescueTest may fail due to RESCUE taking too long

Bug #1260644 reported by Andrea Frittoli
This bug affects 3 people
Affects                    Status   Importance  Assigned to  Milestone
OpenStack Compute (nova)   Invalid  High        Unassigned
tempest                    Invalid  Undecided   Unassigned

Bug Description

In the grenade test run [0] for a blueprint I'm working on, the ServerRescueTestXML rescue_unrescue test failed because the VM did not get into RESCUE state in time. It seems that the test is flaky.

From the tempest log [1] I see the sequence: VM ACTIVE, RESCUE issued, WAIT, timeout, DELETE VM.
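
That sequence is consistent with a status waiter in the test polling the server until it reports RESCUE, giving up when the build timeout expires, and then deleting the server during cleanup. A minimal sketch of that pattern (the function name, client call and timeout value are illustrative assumptions, not the actual tempest code):

import time

def wait_for_rescue(client, server_id, timeout=300, interval=1):
    # Poll the server until it reports RESCUE or the timeout expires.
    start = time.time()
    while time.time() - start < timeout:
        if client.get_server(server_id)['status'] == 'RESCUE':
            return
        time.sleep(interval)
    # On timeout the test fails and its cleanup deletes the VM,
    # which is the DELETE at the end of the sequence above.
    raise RuntimeError('server %s did not reach RESCUE within %ss'
                       % (server_id, timeout))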

From the nova cpu log [2], following request ID req-6c20654c-c00c-4932-87ad-8cfec9866399, I see that the RESCUE RPC is received immediately by n-cpu; however, the request then starves for 3 minutes waiting for a "compute_resources" lock.

The VM is then deleted by the test, and when nova finally processes the RESCUE it throws an exception because the VM is no longer there:

bc-b27a-83c39b7566c8] Traceback (most recent call last):
bc-b27a-83c39b7566c8] File "/opt/stack/new/nova/nova/compute/manager.py", line 2664, in rescue_instance
bc-b27a-83c39b7566c8] rescue_image_meta, admin_password)
bc-b27a-83c39b7566c8] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 2109, in rescue
bc-b27a-83c39b7566c8] write_to_disk=True)
bc-b27a-83c39b7566c8] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 3236, in to_xml
bc-b27a-83c39b7566c8] libvirt_utils.write_to_file(xml_path, xml)
bc-b27a-83c39b7566c8] File "/opt/stack/new/nova/nova/virt/libvirt/utils.py", line 494, in write_to_file
bc-b27a-83c39b7566c8] with open(path, 'w') as f:
bc-b27a-83c39b7566c8] IOError: [Errno 2] No such file or directory: u'/opt/stack/data/nova/instances/a5099beb-f4a2-47bc-b27a-83c39b7566c8/libvirt.xml'
bc-b27a-83c39b7566c8]

There may be a problem in nova as well, as RESCUE is held for 3 minutes waiting on a lock.
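
If I read the code right, "compute_resources" is the lock the nova resource tracker takes around its accounting methods, so the rescue path queues behind whatever currently holds it. A self-contained sketch of that serialization pattern (a simplified stand-in, not nova's actual code):

import threading
import time

_compute_resources = threading.Lock()  # stands in for nova's "compute_resources" lock

def synchronized(lock):
    # Simplified stand-in for the synchronized decorator nova wraps these methods with.
    def decorator(func):
        def wrapper(*args, **kwargs):
            with lock:
                return func(*args, **kwargs)
        return wrapper
    return decorator

@synchronized(_compute_resources)
def slow_resource_update():
    time.sleep(180)  # e.g. a claim/usage update that takes minutes

@synchronized(_compute_resources)
def rescue_instance():
    # Queued behind slow_resource_update() for its full duration; meanwhile
    # the test timeout expires and the VM gets deleted.
    pass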

[0] https://review.openstack.org/#/c/60434/
[1] http://logs.openstack.org/34/60434/5/check/check-grenade-dsvm/1d2852d/logs/tempest.txt.gz
[2] http://logs.openstack.org/34/60434/5/check/check-grenade-dsvm/1d2852d/logs/new/screen-n-cpu.txt.gz?

Tags: gate-failure
Sean Dague (sdague) wrote :

It looks like the root cause is Nova exploding on the transition. I'm going to mark the Tempest side invalid.

Changed in nova:
status: New → Confirmed
importance: Undecided → High
Changed in tempest:
status: New → Invalid
description: updated
Matt Riedemann (mriedem) wrote :

Hit it here also:

http://logs.openstack.org/98/55798/12/check/check-grenade-dsvm/1d928a5/console.html

Anything we can use for an e-r query to identify this?

tags: added: gate-failure
Matt Riedemann (mriedem) wrote :

I see this in the nova compute log when I hit the failure:

2013-12-13 22:37:58.043 WARNING nova.virt.libvirt.driver [req-0b016f8d-41e1-4d02-aed3-669f9ac86c25 tempest.scenario.manager-tempest-1268694175-user tempest.scenario.manager-tempest-1268694175-tenant] [instance: 41ea1271-84f4-4d25-91e2-4a545480c1f7] File injection into a boot from volume instance is not supported

2013-12-13 22:37:58.046 WARNING nova.virt.disk.api [req-0b016f8d-41e1-4d02-aed3-669f9ac86c25 tempest.scenario.manager-tempest-1268694175-user tempest.scenario.manager-tempest-1268694175-tenant] Ignoring error injecting data into image ([Errno 2] No such file or directory: '/opt/stack/data/nova/instances/41ea1271-84f4-4d25-91e2-4a545480c1f7/disk')

2013-12-13 22:38:09.969 ERROR nova.compute.manager [req-c9e27115-7a1b-4a38-b0d8-84a66437470f ServerRescueTestXML-tempest-1077445745-user ServerRescueTestXML-tempest-1077445745-tenant] [instance: 1a65dda0-a949-414c-a43c-51bc60cd7095] Error trying to Rescue Instance

Seems that we could write an e-r query on "Error trying to Rescue Instance". I ran that as a query and got 9 hits in the last 7 days:

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiRXJyb3IgdHJ5aW5nIHRvIFJlc2N1ZSBJbnN0YW5jZVwiIEFORCBmaWxlbmFtZTpcImxvZ3Mvc2NyZWVuLW4tY3B1LnR4dFwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxMzg3MDU0NTE5MDAxfQ==

The problem is that 8 out of the 9 times it fails, it's failing due to a libvirt connection reset. In only one case did I see this:

IOError: [Errno 2] No such file or directory: u'/opt/stack/data/nova/instances/5a9b502f-7e4b-422a-9813-ff39666c73dc/libvirt.xml'

And that was in a check-tempest-dsvm-postgres-full build, so we can't filter on it just happening with the grenade test.

If we had multi-line/doc query support in elastic-recheck we could combine those two messages for a pretty solid query, but that isn't supported yet.

Matt Riedemann (mriedem) wrote :

Hmm, actually when I expand the logs, it looks like the failure I hit was also due to the libvirt connection getting reset:

2013-12-13 22:38:09.969 ERROR nova.compute.manager [req-c9e27115-7a1b-4a38-b0d8-84a66437470f ServerRescueTestXML-tempest-1077445745-user ServerRescueTestXML-tempest-1077445745-tenant] [instance: 1a65dda0-a949-414c-a43c-51bc60cd7095] Error trying to Rescue Instance
2013-12-13 22:38:09.969 10037 TRACE nova.compute.manager [instance: 1a65dda0-a949-414c-a43c-51bc60cd7095] Traceback (most recent call last):
2013-12-13 22:38:09.969 10037 TRACE nova.compute.manager [instance: 1a65dda0-a949-414c-a43c-51bc60cd7095] File "/opt/stack/new/nova/nova/compute/manager.py", line 2664, in rescue_instance
2013-12-13 22:38:09.969 10037 TRACE nova.compute.manager [instance: 1a65dda0-a949-414c-a43c-51bc60cd7095] rescue_image_meta, admin_password)
2013-12-13 22:38:09.969 10037 TRACE nova.compute.manager [instance: 1a65dda0-a949-414c-a43c-51bc60cd7095] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 2086, in rescue
2013-12-13 22:38:09.969 10037 TRACE nova.compute.manager [instance: 1a65dda0-a949-414c-a43c-51bc60cd7095] unrescue_xml = self._get_existing_domain_xml(instance, network_info)
2013-12-13 22:38:09.969 10037 TRACE nova.compute.manager [instance: 1a65dda0-a949-414c-a43c-51bc60cd7095] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 1272, in _get_existing_domain_xml
2013-12-13 22:38:09.969 10037 TRACE nova.compute.manager [instance: 1a65dda0-a949-414c-a43c-51bc60cd7095] xml = virt_dom.XMLDesc(0)
2013-12-13 22:38:09.969 10037 TRACE nova.compute.manager [instance: 1a65dda0-a949-414c-a43c-51bc60cd7095] File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 179, in doit
2013-12-13 22:38:09.969 10037 TRACE nova.compute.manager [instance: 1a65dda0-a949-414c-a43c-51bc60cd7095] result = proxy_call(self._autowrap, f, *args, **kwargs)
2013-12-13 22:38:09.969 10037 TRACE nova.compute.manager [instance: 1a65dda0-a949-414c-a43c-51bc60cd7095] File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 139, in proxy_call
2013-12-13 22:38:09.969 10037 TRACE nova.compute.manager [instance: 1a65dda0-a949-414c-a43c-51bc60cd7095] rv = execute(f,*args,**kwargs)
2013-12-13 22:38:09.969 10037 TRACE nova.compute.manager [instance: 1a65dda0-a949-414c-a43c-51bc60cd7095] File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 77, in tworker
2013-12-13 22:38:09.969 10037 TRACE nova.compute.manager [instance: 1a65dda0-a949-414c-a43c-51bc60cd7095] rv = meth(*args,**kwargs)
2013-12-13 22:38:09.969 10037 TRACE nova.compute.manager [instance: 1a65dda0-a949-414c-a43c-51bc60cd7095] File "/usr/lib/python2.7/dist-packages/libvirt.py", line 381, in XMLDesc
2013-12-13 22:38:09.969 10037 TRACE nova.compute.manager [instance: 1a65dda0-a949-414c-a43c-51bc60cd7095] if ret is None: raise libvirtError ('virDomainGetXMLDesc() failed', dom=self)
2013-12-13 22:38:09.969 10037 TRACE nova.compute.manager [instance: 1a65dda0-a949-414c-a43c-51bc60cd7095] libvirtError: Unable to read from monitor: Connection reset by peer
2013-12-13 22:38:09.969 10037 TRACE nova.compute.manager...


Matt Riedemann (mriedem) wrote :

From Andrea's grenade failure logs, it's actually not due to the libvirt connection reset; it's due to the domain XML not being found, so this is a valid, unique bug (distinct from bug 1255624).

2013-12-12 19:17:31.244 10417 TRACE nova.openstack.common.rpc.amqp InstanceNotRescuable: Instance a5099beb-f4a2-47bc-b27a-83c39b7566c8 cannot be rescued: Driver Error: [Errno 2] No such file or directory: u'/opt/stack/data/nova/instances/a5099beb-f4a2-47bc-b27a-83c39b7566c8/libvirt.xml'

Matt Riedemann (mriedem) wrote :

We should be able to write an e-r query on this:

message:"cannot be rescued: Driver Error: [Errno 2] No such file or directory:" AND filename:"logs/screen-n-cpu.txt"

That has both the 'cannot be rescued' piece and IOError pieces I was looking for to make this a unique query.

There is 1 hit in the last 7 days:

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiY2Fubm90IGJlIHJlc2N1ZWQ6IERyaXZlciBFcnJvcjogW0Vycm5vIDJdIE5vIHN1Y2ggZmlsZSBvciBkaXJlY3Rvcnk6XCIgQU5EIGZpbGVuYW1lOlwibG9ncy9zY3JlZW4tbi1jcHUudHh0XCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjEzODcwNTU2MjA0MjZ9

That's in a non-grenade test; the grenade runs probably aren't matched because of the path to the log filename. Maybe I can correct that with an OR in the query.
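
Assuming the grenade logs are indexed under the logs/new/ path (as in the n-cpu log link [2] in the description), the combined query could look something like:

message:"cannot be rescued: Driver Error: [Errno 2] No such file or directory:" AND (filename:"logs/screen-n-cpu.txt" OR filename:"logs/new/screen-n-cpu.txt")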

Matt Riedemann (mriedem) wrote :

Proposed an elastic-recheck query here: https://review.openstack.org/#/c/62192/

Andrea Frittoli (andrea-frittoli) wrote :

Matt, thanks for your investigation on this.

My understanding of the sequence of events from the nova logs is the following:
- n-cpu starts processing the RESCUE request
- it waits 3+ minutes for the compute_resources lock to become available - this could be a consequence of bug 1255624
- meanwhile the timeout in the test expires, and the test proceeds to delete the test VM (just a guess)
- the deletion of the VM is processed by n-cpu
- the execution of RESCUE resumes; it assumes that the VM is still in the same state, not noticing that it has been deleted already, and thus hits the issue with the domain file not being found.

So perhaps the additional error message in this case, compared to bug 1255624, is due to a race condition where the test deleted the VM successfully before RESCUE could resume.
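
A compact, self-contained sketch of that race (purely illustrative: paths, timings and names are made up, and this is not nova code):

import os
import shutil
import threading
import time

lock = threading.Lock()                          # stands in for "compute_resources"
instance_dir = '/tmp/fake-instances/a5099beb'    # stands in for /opt/stack/data/nova/instances/<uuid>
os.makedirs(instance_dir, exist_ok=True)

def other_resource_work():
    with lock:
        time.sleep(3)        # the 3+ minute work holding the lock when RESCUE arrives

def delete_instance():
    time.sleep(1)            # the test times out and deletes the VM while RESCUE is still queued
    shutil.rmtree(instance_dir)

def rescue_instance():
    with lock:               # resumes only once other_resource_work() releases the lock
        path = os.path.join(instance_dir, 'libvirt.xml')
        with open(path, 'w') as f:   # fails with [Errno 2] No such file or directory, as in the trace
            f.write('<domain/>')

for target in (other_resource_work, delete_instance, rescue_instance):
    threading.Thread(target=target).start()
    time.sleep(0.1)          # start in order: lock holder first, then delete, then rescue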

Joe Gordon (jogo) wrote :

No hits in a while; looks like this resolved itself or the fingerprint is out of date.

Changed in nova:
status: Confirmed → Incomplete
Sean Dague (sdague)
Changed in nova:
status: Incomplete → Invalid