ServerRescueTest may fail due to RESCUE taking too long

Bug #1260644 reported by Andrea Frittoli
This bug affects 3 people
Affects                    Status   Importance  Assigned to  Milestone
OpenStack Compute (nova)   Invalid  High        Unassigned
tempest                    Invalid  Undecided   Unassigned

Bug Description

In the grenade test run [0] for a blueprint I'm working on, the ServerRescueTestXML rescue_unrescue test failed because the VM did not get into RESCUE state in time. It seems that the test is flaky.

From the tempest log [1] I see the sequence: VM ACTIVE, RESCUE issued, WAIT, timeout, DELETE VM.
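
That sequence is consistent with a status waiter in the test polling the server until it reports RESCUE, giving up when the build timeout expires, and then deleting the server during cleanup. A minimal sketch of that pattern (the function name, client call and timeout value are illustrative assumptions, not the actual tempest code):

import time

def wait_for_rescue(client, server_id, timeout=300, interval=1):
    # Poll the server until it reports RESCUE or the timeout expires.
    start = time.time()
    while time.time() - start < timeout:
        if client.get_server(server_id)['status'] == 'RESCUE':
            return
        time.sleep(interval)
    # On timeout the test fails and its cleanup deletes the VM,
    # which is the DELETE at the end of the sequence above.
    raise RuntimeError('server %s did not reach RESCUE within %ss'
                       % (server_id, timeout))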

From the nova cpu log [2], following request ID req-6c20654c-c00c-4932-87ad-8cfec9866399, I see that the RESCUE RPC is received immediately by n-cpu; however, the request then starves for 3 minutes waiting for a "compute_resources" lock.

The VM is then deleted by the test, and when nova finally processes the RESCUE it throws an exception because the VM is no longer there:

bc-b27a-83c39b7566c8] Traceback (most recent call last):
bc-b27a-83c39b7566c8] File "/opt/stack/new/nova/nova/compute/manager.py", line 2664, in rescue_instance
bc-b27a-83c39b7566c8] rescue_image_meta, admin_password)
bc-b27a-83c39b7566c8] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 2109, in rescue
bc-b27a-83c39b7566c8] write_to_disk=True)
bc-b27a-83c39b7566c8] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 3236, in to_xml
bc-b27a-83c39b7566c8] libvirt_utils.write_to_file(xml_path, xml)
bc-b27a-83c39b7566c8] File "/opt/stack/new/nova/nova/virt/libvirt/utils.py", line 494, in write_to_file
bc-b27a-83c39b7566c8] with open(path, 'w') as f:
bc-b27a-83c39b7566c8] IOError: [Errno 2] No such file or directory: u'/opt/stack/data/nova/instances/a5099beb-f4a2-47bc-b27a-83c39b7566c8/libvirt.xml'
bc-b27a-83c39b7566c8]

There may be a problem in nova as well, as RESCUE is held for 3 minutes waiting on a lock.
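
If I read the code right, "compute_resources" is the lock the nova resource tracker takes around its accounting methods, so the rescue path queues behind whatever currently holds it. A self-contained sketch of that serialization pattern (a simplified stand-in, not nova's actual code):

import threading
import time

_compute_resources = threading.Lock()  # stands in for nova's "compute_resources" lock

def synchronized(lock):
    # Simplified stand-in for the synchronized decorator nova wraps these methods with.
    def decorator(func):
        def wrapper(*args, **kwargs):
            with lock:
                return func(*args, **kwargs)
        return wrapper
    return decorator

@synchronized(_compute_resources)
def slow_resource_update():
    time.sleep(180)  # e.g. a claim/usage update that takes minutes

@synchronized(_compute_resources)
def rescue_instance():
    # Queued behind slow_resource_update() for its full duration; meanwhile
    # the test timeout expires and the VM gets deleted.
    pass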

[0] https://review.openstack.org/#/c/60434/
[1] http://logs.openstack.org/34/60434/5/check/check-grenade-dsvm/1d2852d/logs/tempest.txt.gz
[2] http://logs.openstack.org/34/60434/5/check/check-grenade-dsvm/1d2852d/logs/new/screen-n-cpu.txt.gz?

Tags: gate-failure
Sean Dague (sdague) wrote :

It looks like the root cause is Nova exploding on the transition. I'm going to mark the Tempest side invalid.

Changed in nova:
status: New → Confirmed
importance: Undecided → High
Changed in tempest:
status: New → Invalid
description: updated
Matt Riedemann (mriedem) wrote :

Hit it here also:

http://logs.openstack.org/98/55798/12/check/check-grenade-dsvm/1d928a5/console.html

Anything we can use for an e-r query to identify this?

tags: added: gate-failure
Matt Riedemann (mriedem) wrote :

I see this in the nova compute log when I hit the failure:

2013-12-13 22:37:58.043 WARNING nova.virt.libvirt.driver [req-0b016f8d-41e1-4d02-aed3-669f9ac86c25 tempest.scenario.manager-tempest-1268694175-user tempest.scenario.manager-tempest-1268694175-tenant] [instance: 41ea1271-84f4-4d25-91e2-4a545480c1f7] File injection into a boot from volume instance is not supported

2013-12-13 22:37:58.046 WARNING nova.virt.disk.api [req-0b016f8d-41e1-4d02-aed3-669f9ac86c25 tempest.scenario.manager-tempest-1268694175-user tempest.scenario.manager-tempest-1268694175-tenant] Ignoring error injecting data into image ([Errno 2] No such file or directory: '/opt/stack/data/nova/instances/41ea1271-84f4-4d25-91e2-4a545480c1f7/disk')

2013-12-13 22:38:09.969 ERROR nova.compute.manager [req-c9e27115-7a1b-4a38-b0d8-84a66437470f ServerRescueTestXML-tempest-1077445745-user ServerRescueTestXML-tempest-1077445745-tenant] [instance: 1a65dda0-a949-414c-a43c-51bc60cd7095] Error trying to Rescue Instance

Seems that we could write an e-r query on "Error trying to Rescue Instance". I ran that as a query and got 9 hits in the last 7 days:

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiRXJyb3IgdHJ5aW5nIHRvIFJlc2N1ZSBJbnN0YW5jZVwiIEFORCBmaWxlbmFtZTpcImxvZ3Mvc2NyZWVuLW4tY3B1LnR4dFwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxMzg3MDU0NTE5MDAxfQ==

The problem is that 8 out of the 9 times it fails, it's failing due to a libvirt connection reset. In only one case did I see this:

IOError: [Errno 2] No such file or directory: u'/opt/stack/data/nova/instances/5a9b502f-7e4b-422a-9813-ff39666c73dc/libvirt.xml'

And that was in a check-tempest-dsvm-postgres-full build, so we can't filter on it just happening with the grenade test.

If we had multi-line/doc query support in elastic-recheck we could combine those two messages for a pretty solid query, but that isn't supported yet.

Matt Riedemann (mriedem) wrote :

Hmm, actually when I expand the logs, it looks like the failure I hit was also due to the libvirt connection getting reset:

2013-12-13 22:38:09.969 ERROR nova.compute.manager [req-c9e27115-7a1b-4a38-b0d8-84a66437470f ServerRescueTestXML-tempest-1077445745-user ServerRescueTestXML-tempest-1077445745-tenant] [instance: 1a65dda0-a949-414c-a43c-51bc60cd7095] Error trying to Rescue Instance
2013-12-13 22:38:09.969 10037 TRACE nova.compute.manager [instance: 1a65dda0-a949-414c-a43c-51bc60cd7095] Traceback (most recent call last):
2013-12-13 22:38:09.969 10037 TRACE nova.compute.manager [instance: 1a65dda0-a949-414c-a43c-51bc60cd7095] File "/opt/stack/new/nova/nova/compute/manager.py", line 2664, in rescue_instance
2013-12-13 22:38:09.969 10037 TRACE nova.compute.manager [instance: 1a65dda0-a949-414c-a43c-51bc60cd7095] rescue_image_meta, admin_password)
2013-12-13 22:38:09.969 10037 TRACE nova.compute.manager [instance: 1a65dda0-a949-414c-a43c-51bc60cd7095] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 2086, in rescue
2013-12-13 22:38:09.969 10037 TRACE nova.compute.manager [instance: 1a65dda0-a949-414c-a43c-51bc60cd7095] unrescue_xml = self._get_existing_domain_xml(instance, network_info)
2013-12-13 22:38:09.969 10037 TRACE nova.compute.manager [instance: 1a65dda0-a949-414c-a43c-51bc60cd7095] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 1272, in _get_existing_domain_xml
2013-12-13 22:38:09.969 10037 TRACE nova.compute.manager [instance: 1a65dda0-a949-414c-a43c-51bc60cd7095] xml = virt_dom.XMLDesc(0)
2013-12-13 22:38:09.969 10037 TRACE nova.compute.manager [instance: 1a65dda0-a949-414c-a43c-51bc60cd7095] File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 179, in doit
2013-12-13 22:38:09.969 10037 TRACE nova.compute.manager [instance: 1a65dda0-a949-414c-a43c-51bc60cd7095] result = proxy_call(self._autowrap, f, *args, **kwargs)
2013-12-13 22:38:09.969 10037 TRACE nova.compute.manager [instance: 1a65dda0-a949-414c-a43c-51bc60cd7095] File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 139, in proxy_call
2013-12-13 22:38:09.969 10037 TRACE nova.compute.manager [instance: 1a65dda0-a949-414c-a43c-51bc60cd7095] rv = execute(f,*args,**kwargs)
2013-12-13 22:38:09.969 10037 TRACE nova.compute.manager [instance: 1a65dda0-a949-414c-a43c-51bc60cd7095] File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 77, in tworker
2013-12-13 22:38:09.969 10037 TRACE nova.compute.manager [instance: 1a65dda0-a949-414c-a43c-51bc60cd7095] rv = meth(*args,**kwargs)
2013-12-13 22:38:09.969 10037 TRACE nova.compute.manager [instance: 1a65dda0-a949-414c-a43c-51bc60cd7095] File "/usr/lib/python2.7/dist-packages/libvirt.py", line 381, in XMLDesc
2013-12-13 22:38:09.969 10037 TRACE nova.compute.manager [instance: 1a65dda0-a949-414c-a43c-51bc60cd7095] if ret is None: raise libvirtError ('virDomainGetXMLDesc() failed', dom=self)
2013-12-13 22:38:09.969 10037 TRACE nova.compute.manager [instance: 1a65dda0-a949-414c-a43c-51bc60cd7095] libvirtError: Unable to read from monitor: Connection reset by peer
2013-12-13 22:38:09.969 10037 TRACE nova.compute.manager...


Matt Riedemann (mriedem) wrote :

From Andrea's grenade failure logs, it's actually not due to the libvirt connection reset; it's due to the domain XML not being found, so this is a valid, unique bug (distinct from bug 1255624).

2013-12-12 19:17:31.244 10417 TRACE nova.openstack.common.rpc.amqp InstanceNotRescuable: Instance a5099beb-f4a2-47bc-b27a-83c39b7566c8 cannot be rescued: Driver Error: [Errno 2] No such file or directory: u'/opt/stack/data/nova/instances/a5099beb-f4a2-47bc-b27a-83c39b7566c8/libvirt.xml'

Matt Riedemann (mriedem) wrote :

We should be able to write an e-r query on this:

message:"cannot be rescued: Driver Error: [Errno 2] No such file or directory:" AND filename:"logs/screen-n-cpu.txt"

That has both the 'cannot be rescued' piece and IOError pieces I was looking for to make this a unique query.

There is 1 hit in the last 7 days:

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiY2Fubm90IGJlIHJlc2N1ZWQ6IERyaXZlciBFcnJvcjogW0Vycm5vIDJdIE5vIHN1Y2ggZmlsZSBvciBkaXJlY3Rvcnk6XCIgQU5EIGZpbGVuYW1lOlwibG9ncy9zY3JlZW4tbi1jcHUudHh0XCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjEzODcwNTU2MjA0MjZ9

That's in a non-grenade test; the grenade runs probably aren't matched because of the path to the log filename. Maybe I can correct that with an OR in the query.
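
Assuming the grenade logs are indexed under the logs/new/ path (as in the n-cpu log link [2] in the description), the combined query could look something like:

message:"cannot be rescued: Driver Error: [Errno 2] No such file or directory:" AND (filename:"logs/screen-n-cpu.txt" OR filename:"logs/new/screen-n-cpu.txt")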

Matt Riedemann (mriedem) wrote :

Proposed an elastic-recheck query here: https://review.openstack.org/#/c/62192/

Andrea Frittoli (andrea-frittoli) wrote :

Matt, thanks for your investigation on this.

My understanding of the sequence of events from the nova logs is the following:
- n-cpu starts processing the RESCUE request
- it waits 3+ minutes for the compute_resources lock to become available - this could be a consequence of bug 1255624
- meanwhile the timeout in the test expires, and the test proceeds to delete the test VM (just a guess)
- the deletion of the VM is processed by n-cpu
- the execution of RESCUE resumes; it assumes that the VM is still in the same state, not noticing that it has been deleted already, and thus hits the issue with the domain file not being found.

So perhaps the additional error message in this case, compared to bug 1255624, is due to a race condition where the test deleted the VM successfully before RESCUE could resume.
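
A compact, self-contained sketch of that race (purely illustrative: paths, timings and names are made up, and this is not nova code):

import os
import shutil
import threading
import time

lock = threading.Lock()                          # stands in for "compute_resources"
instance_dir = '/tmp/fake-instances/a5099beb'    # stands in for /opt/stack/data/nova/instances/<uuid>
os.makedirs(instance_dir, exist_ok=True)

def other_resource_work():
    with lock:
        time.sleep(3)        # the 3+ minute work holding the lock when RESCUE arrives

def delete_instance():
    time.sleep(1)            # the test times out and deletes the VM while RESCUE is still queued
    shutil.rmtree(instance_dir)

def rescue_instance():
    with lock:               # resumes only once other_resource_work() releases the lock
        path = os.path.join(instance_dir, 'libvirt.xml')
        with open(path, 'w') as f:   # fails with [Errno 2] No such file or directory, as in the trace
            f.write('<domain/>')

for target in (other_resource_work, delete_instance, rescue_instance):
    threading.Thread(target=target).start()
    time.sleep(0.1)          # start in order: lock holder first, then delete, then rescue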

Joe Gordon (jogo) wrote :

No hits in a while; looks like this resolved itself or the fingerprint is out of date.

Changed in nova:
status: Confirmed → Incomplete
Sean Dague (sdague)
Changed in nova:
status: Incomplete → Invalid