libvirt live_snapshot periodically explodes on libvirt 1.2.2 in the gate

Bug #1334398 reported by Matt Riedemann
This bug affects 13 people
Affects: OpenStack Compute (nova)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

Seeing this here:

http://logs.openstack.org/70/97670/5/check/check-tempest-dsvm-postgres-full/7d4c7cf/console.html

2014-06-24 23:15:41.714 | tempest.api.compute.images.test_images_oneserver.ImagesOneServerTestJSON.test_create_image_specify_multibyte_character_image_name[gate]
2014-06-24 23:15:41.714 | ---------------------------------------------------------------------------------------------------------------------------------------
2014-06-24 23:15:41.714 |
2014-06-24 23:15:41.714 | Captured traceback-1:
2014-06-24 23:15:41.714 | ~~~~~~~~~~~~~~~~~~~~~
2014-06-24 23:15:41.715 | Traceback (most recent call last):
2014-06-24 23:15:41.715 | File "tempest/services/compute/json/images_client.py", line 86, in delete_image
2014-06-24 23:15:41.715 | resp, body = self.delete("images/%s" % str(image_id))
2014-06-24 23:15:41.715 | File "tempest/common/rest_client.py", line 224, in delete
2014-06-24 23:15:41.715 | return self.request('DELETE', url, extra_headers, headers, body)
2014-06-24 23:15:41.715 | File "tempest/common/rest_client.py", line 430, in request
2014-06-24 23:15:41.715 | resp, resp_body)
2014-06-24 23:15:41.715 | File "tempest/common/rest_client.py", line 474, in _error_checker
2014-06-24 23:15:41.715 | raise exceptions.NotFound(resp_body)
2014-06-24 23:15:41.715 | NotFound: Object not found
2014-06-24 23:15:41.715 | Details: {"itemNotFound": {"message": "Image not found.", "code": 404}}
2014-06-24 23:15:41.716 |
2014-06-24 23:15:41.716 |
2014-06-24 23:15:41.716 | Captured traceback:
2014-06-24 23:15:41.716 | ~~~~~~~~~~~~~~~~~~~
2014-06-24 23:15:41.716 | Traceback (most recent call last):
2014-06-24 23:15:41.716 | File "tempest/api/compute/images/test_images_oneserver.py", line 31, in tearDown
2014-06-24 23:15:41.716 | self.server_check_teardown()
2014-06-24 23:15:41.716 | File "tempest/api/compute/base.py", line 161, in server_check_teardown
2014-06-24 23:15:41.716 | 'ACTIVE')
2014-06-24 23:15:41.716 | File "tempest/services/compute/json/servers_client.py", line 173, in wait_for_server_status
2014-06-24 23:15:41.716 | raise_on_error=raise_on_error)
2014-06-24 23:15:41.717 | File "tempest/common/waiters.py", line 107, in wait_for_server_status
2014-06-24 23:15:41.717 | raise exceptions.TimeoutException(message)
2014-06-24 23:15:41.717 | TimeoutException: Request timed out
2014-06-24 23:15:41.717 | Details: (ImagesOneServerTestJSON:tearDown) Server 90c79adf-4df1-497c-a786-13bdc5cca98d failed to reach ACTIVE status and task state "None" within the required time (196 s). Current status: ACTIVE. Current task state: image_pending_upload.

Looks like it's trying to delete the image with UUID 518a32d0-f323-413c-95c2-dd8299716c19, which doesn't exist, presumably because it's still uploading?

This may be related to bug 1320617, which covers a general performance issue with Glance.

Looking in the glance registry log, the image is created here:

2014-06-24 22:51:23.538 15740 INFO glance.registry.api.v1.images [13c1b477-cd22-44ca-ba0d-bf1b19202df6 d01d4977b5cc4e20a99e1d7ca58ce444 207d083a31944716b9cd2ecda0f09ce7 - - -] Successfully created image 518a32d0-f323-413c-95c2-dd8299716c19

The image is deleted here:

2014-06-24 22:54:53.146 15740 INFO glance.registry.api.v1.images [7c29f253-acef-41a0-b62b-c3087f7617ef d01d4977b5cc4e20a99e1d7ca58ce444 207d083a31944716b9cd2ecda0f09ce7 - - -] Successfully deleted image 518a32d0-f323-413c-95c2-dd8299716c19

And the 'not found' is here:

2014-06-24 22:54:56.508 15740 INFO glance.registry.api.v1.images [c708cf1f-27a8-4003-9c29-6afca7dd9bb8 d01d4977b5cc4e20a99e1d7ca58ce444 207d083a31944716b9cd2ecda0f09ce7 - - -] Image 518a32d0-f323-413c-95c2-dd8299716c19 not found

Tags: libvirt
Revision history for this message
Matt Riedemann (mriedem) wrote :

The n-cpu logs have several errors for the libvirt connection being reset:

http://logs.openstack.org/70/97670/5/check/check-tempest-dsvm-postgres-full/7d4c7cf/logs/screen-n-cpu.txt.gz?level=TRACE#_2014-06-24_22_54_52_973

2014-06-24 22:54:52.973 16186 TRACE nova.compute.manager [instance: 90c79adf-4df1-497c-a786-13bdc5cca98d] Traceback (most recent call last):
2014-06-24 22:54:52.973 16186 TRACE nova.compute.manager [instance: 90c79adf-4df1-497c-a786-13bdc5cca98d] File "/opt/stack/new/nova/nova/compute/manager.py", line 352, in decorated_function
2014-06-24 22:54:52.973 16186 TRACE nova.compute.manager [instance: 90c79adf-4df1-497c-a786-13bdc5cca98d] *args, **kwargs)
2014-06-24 22:54:52.973 16186 TRACE nova.compute.manager [instance: 90c79adf-4df1-497c-a786-13bdc5cca98d] File "/opt/stack/new/nova/nova/compute/manager.py", line 2788, in snapshot_instance
2014-06-24 22:54:52.973 16186 TRACE nova.compute.manager [instance: 90c79adf-4df1-497c-a786-13bdc5cca98d] task_states.IMAGE_SNAPSHOT)
2014-06-24 22:54:52.973 16186 TRACE nova.compute.manager [instance: 90c79adf-4df1-497c-a786-13bdc5cca98d] File "/opt/stack/new/nova/nova/compute/manager.py", line 2819, in _snapshot_instance
2014-06-24 22:54:52.973 16186 TRACE nova.compute.manager [instance: 90c79adf-4df1-497c-a786-13bdc5cca98d] update_task_state)
2014-06-24 22:54:52.973 16186 TRACE nova.compute.manager [instance: 90c79adf-4df1-497c-a786-13bdc5cca98d] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 1532, in snapshot
2014-06-24 22:54:52.973 16186 TRACE nova.compute.manager [instance: 90c79adf-4df1-497c-a786-13bdc5cca98d] image_format)
2014-06-24 22:54:52.973 16186 TRACE nova.compute.manager [instance: 90c79adf-4df1-497c-a786-13bdc5cca98d] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 1631, in _live_snapshot
2014-06-24 22:54:52.973 16186 TRACE nova.compute.manager [instance: 90c79adf-4df1-497c-a786-13bdc5cca98d] domain.blockJobAbort(disk_path, 0)
2014-06-24 22:54:52.973 16186 TRACE nova.compute.manager [instance: 90c79adf-4df1-497c-a786-13bdc5cca98d] File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 179, in doit
2014-06-24 22:54:52.973 16186 TRACE nova.compute.manager [instance: 90c79adf-4df1-497c-a786-13bdc5cca98d] result = proxy_call(self._autowrap, f, *args, **kwargs)
2014-06-24 22:54:52.973 16186 TRACE nova.compute.manager [instance: 90c79adf-4df1-497c-a786-13bdc5cca98d] File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 139, in proxy_call
2014-06-24 22:54:52.973 16186 TRACE nova.compute.manager [instance: 90c79adf-4df1-497c-a786-13bdc5cca98d] rv = execute(f,*args,**kwargs)
2014-06-24 22:54:52.973 16186 TRACE nova.compute.manager [instance: 90c79adf-4df1-497c-a786-13bdc5cca98d] File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 77, in tworker
2014-06-24 22:54:52.973 16186 TRACE nova.compute.manager [instance: 90c79adf-4df1-497c-a786-13bdc5cca98d] rv = meth(*args,**kwargs)
2014-06-24 22:54:52.973 16186 TRACE nova.compute.manager [instance: 90c79adf-4df1-497c-a786-13bdc5cca98d] File "/usr/lib/python2.7/dist-packages/libvirt.py", line 646, in blockJobAbort
2014-06-24 22:54:52.973...

Revision history for this message
Matt Riedemann (mriedem) wrote :

Nova bug 1255624 is tracking libvirt connection reset errors; in that case it failed during virDomainSuspend, while here it fails during virDomainBlockJobAbort.

tags: added: libvirt
Revision history for this message
Matt Riedemann (mriedem) wrote :
Changed in nova:
importance: Undecided → High
no longer affects: glance
summary: - test_images_oneserver times out in tearDown during task_state
- "image_pending_upload"
+ snapshot hangs when libvirt connection is reset
Revision history for this message
Matt Riedemann (mriedem) wrote : Re: snapshot hangs when libvirt connection is reset

e-r patch: https://review.openstack.org/#/c/102608/

Looks like this really spiked on 6/24 and went down again on 6/25.

Revision history for this message
Sean Dague (sdague) wrote :

Going down on 6/25 is a mirage; we're just backed up on ES data.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Wondering if bug 1193146 has any interesting historical information.

Revision history for this message
Ken'ichi Ohmichi (oomichi) wrote :

I also faced this problem many times today, and most failures happened in ListImageFiltersTestXML.
I'm not sure why it does not happen in ListImageFiltersTestJSON.

Revision history for this message
Sean Dague (sdague) wrote :

It looks like the failure is very specifically tied to _live_snapshot. The bug only seems to trigger when we are in _live_snapshot, and not in any other case. (Modifying the elastic-recheck search string to filter on _live_snapshot returns the same number of hits.)

Changed in nova:
assignee: nobody → Sean Dague (sdague)
summary: - snapshot hangs when libvirt connection is reset
+ libvirt live_snapshot periodically explodes on libvirt 1.2.2 in the gate
Revision history for this message
Sean Dague (sdague) wrote :

It's worth noting that the _live_snapshot code path was never tested by us until the trusty update, as it was hidden behind a version check that meant we never ran it in the gate before.

Revision history for this message
Matt Riedemann (mriedem) wrote :

This should get us around the gate failures for now https://review.openstack.org/#/c/102643/

Revision history for this message
Matt Riedemann (mriedem) wrote :

@Ken'ichi, per comment 8, it's a pretty intensive setup:

http://git.openstack.org/cgit/openstack/tempest/tree/tempest/api/compute/images/test_list_image_filters.py#n29

The setup creates 2 servers and then 3 snapshots from those 2 servers. The JSON and XML test classes run concurrently, so we could be creating multiple snapshots at the same time, which in theory overloads libvirt/qemu and causes the libvirt connection reset.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/102643
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c1c159460de376a06b8479f90f42d9f62eace961
Submitter: Jenkins
Branch: master

commit c1c159460de376a06b8479f90f42d9f62eace961
Author: Sean Dague <email address hidden>
Date: Wed Jun 25 16:56:04 2014 -0400

    effectively disable libvirt live snapshotting

    As being seen in the gate, libvirt 1.2.2 doesn't appear to actually
    handle live snapshotting under any appreciable load (possibly
    related to parallel operations). It isn't a 100% failure, but it's
    currently being hit in a large number of runs.

    Effectively turn this off by increasing the
    MIN_LIBVIRT_LIVESNAPSHOT_VERSION to something that doesn't yet
    exist. This can get us back to a working state, then we can decide
    if live snapshotting is something that can be made to actually work
    under load.

    DocImpact

    Related-Bug: #1334398

    Change-Id: I9908b743df2093c8dd2723af89be51630eafc99f

Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

Thought I'd add the below.

I just created a simple test[1] which creates an external live snapshot[2] of a libvirt guest (with the versions affecting the gate -- libvirt 1.2.2 and QEMU 2.0 on Fedora 20), by executing the below command in a loop of 100 iterations (I also tested 1000 iterations; it ran just fine too). I ran the script against 3 virtual machines in parallel.

$ virsh snapshot-create-as --domain $DOMAIN \
    --name snap-$i \
    --description snap$i-desc \
    --disk-only \
    --diskspec hda,snapshot=external,file=$SNAPSHOTS_DIR/$DOMAIN-snap$i.qcow2 \
    --atomic

The result of a 100-iteration run is an image with a backing chain of 100 qcow2 images[3].

The above script just creates a snapshot, nothing more. Matt Riedemann pointed out on #openstack-nova that in the gate there could be other tests running concurrently that could be doing things like suspend/resume/rescue, etc.

  [1] https://github.com/kashyapc/ostack-misc/blob/master/libvirt-live-snapshots-stress.sh
  [2] "external live snapshot "meaning: Every time you take a snapshot, the current disk becomes a (read-only) 'backing file' and a new qcow2 overlay is created to track the current 'delta'.
  [3] http://kashyapc.fedorapeople.org/virt/guest-with-backing-chain-of-100-qcow2-images.txt
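
For reference, here is a rough Python equivalent of that stress loop (a sketch only -- the actual script [1] is shell; the domain name, snapshot directory and iteration count below are placeholders, not values from the original script):

    import subprocess

    DOMAIN = "vm1"                     # placeholder guest name
    SNAPSHOTS_DIR = "/var/tmp/snaps"   # placeholder target directory

    for i in range(1, 101):
        # Each iteration adds one external qcow2 overlay on top of the
        # previous one, growing the backing chain by one image.
        subprocess.check_call([
            "virsh", "snapshot-create-as",
            "--domain", DOMAIN,
            "--name", "snap-%d" % i,
            "--description", "snap%d-desc" % i,
            "--disk-only",
            "--diskspec",
            "hda,snapshot=external,file=%s/%s-snap%d.qcow2"
            % (SNAPSHOTS_DIR, DOMAIN, i),
            "--atomic",
        ])

Running one such process per guest reproduces the "3 virtual machines in parallel" part of the test.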

Revision history for this message
Daniel Berrange (berrange) wrote :

The interesting thing in the logs is the stack trace about virDomainBlockJobAbort failing.

Nova issues this API call right at the start of the snapshot function to validate that there's no old stale job left over:

            # Abort is an idempotent operation, so make sure any block
            # jobs which may have failed are ended. This operation also
            # confirms the running instance, as opposed to the system as a
            # whole, has a new enough version of the hypervisor (bug 1193146).
            try:
                virt_dom.blockJobAbort(disk_path, 0)

As the comment says, aborting the job is supposed to be a completely safe thing to do. We don't even expect any existing job to be running, so it should basically end up as a no-op inside QEMU.

Now, the error message libvirt reports when virDomainBlockJobAbort fails is:

libvirtError: Unable to read from monitor: Connection reset by peer

This is a generic message you get when QEMU crashes & burns unexpectedly, causing the monitor connection to be dropped.

We've not even got as far as running the libvirt snapshot API at this point when QEMU crashes & burns. This likely explains why Kashyap doesn't see the error in his test script, which just invokes the snapshot.

This all points the finger towards a flaw in QEMU of some kind, but there's no easy way to figure out what this might be from the libvirtd logs.

What we need here is the /var/log/libvirt/qemu/instanceNNNNNN.log files corresponding to the failed test run. If we are lucky there might be some assertion error printed from QEMU before it crashed & burned. Failing that, what we need is a stack trace from QEMU when it crashes & burns. IIUC, ubuntu has something similar to Fedora's abrt daemon which captures stack traces of any process which SEGVs or core dumps. We really need to try to get a stack trace of QEMU from Ubuntu's crash handler too.

Revision history for this message
Daniel Berrange (berrange) wrote :

This is the service I was talking about:

  https://wiki.ubuntu.com/Apport

We need to re-configure it to collect core dumps as it is disabled by default. Then capture any crashes in

/var/crash/

Revision history for this message
Daniel Berrange (berrange) wrote :

Sorry, I was looking at the wrong blockJobAbort call in the code earlier. The actual one that is failing is in this code:

            # NOTE (rmk): Establish a temporary mirror of our root disk and
            # issue an abort once we have a complete copy.
            domain.blockRebase(disk_path, disk_delta, 0,
                               libvirt.VIR_DOMAIN_BLOCK_REBASE_COPY |
                               libvirt.VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT |
                               libvirt.VIR_DOMAIN_BLOCK_REBASE_SHALLOW)

            while self._wait_for_block_job(domain, disk_path):
                time.sleep(0.5)

            domain.blockJobAbort(disk_path, 0)

So we've done a block rebase, and once we finish waiting for it, we abort the job, and at that point we see the crashed QEMU. The QEMU logs are still something useful to get.
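
For context, the _wait_for_block_job helper in that loop boils down to a check against virDomainGetBlockJobInfo (see the cur/end debug output later in this bug); a minimal sketch of that shape -- name and return convention assumed, not Nova's exact code -- is:

    def block_job_running(domain, disk_path):
        """Return True while the block job on disk_path is still copying.

        blockJobInfo() returns an empty result when no job exists,
        otherwise a dict including 'cur' and 'end' byte counters.
        """
        status = domain.blockJobInfo(disk_path, 0)
        if not status:
            # No job found: it never started or has already ended.
            return False
        # Still running until the mirror catches up with the source.
        return status.get('cur') != status.get('end')

Once that loop exits, the driver calls domain.blockJobAbort(disk_path, 0) to end the mirroring phase, and that is the call failing here.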

Revision history for this message
Daniel Berrange (berrange) wrote :

I'm actually beginning to wonder if there is a flaw in the tempest tests rather than in QEMU. The "Unable to read from monitor: Connection reset by peer" error message can actually indicate that a second thread has killed QEMU while the first thread is talking to it, so this is a potential alternative idea to explore versus my previous QEMU-SEGV bug theory.

I've been examining the screen-n-cpu.log file to see what happens with instance 90c79adf-4df1-497c-a786-13bdc5cca98d, which is the one with the virDomainBlockJobAbort error trace.

First I see the snapshot process starting

2014-06-24 22:51:24.314 INFO nova.virt.libvirt.driver [req-e4651efe-7c84-4a57-bbb1-88b107d4a282 ImagesOneServerTestJSON-967160715 ImagesOneServerTestJSON-32972017] [instance: 90c79adf-4df1-497c-a786-13bdc5cca98d] Beginning live snapshot process

Then I see something killing this very same instance:

2014-06-24 22:54:40.255 AUDIT nova.compute.manager [req-218dba14-516c-4805-9908-b55cd73a00e5 ImagesOneServerTestJSON-967160715 ImagesOneServerTestJSON-32972017] [instance: 90c79adf-4df1-497c-a786-13bdc5cca98d] Terminating instance

And a lifecycle event to show that it was killed

2014-06-24 22:54:51.033 16186 INFO nova.compute.manager [-] Lifecycle event 1 on VM 90c79adf-4df1-497c-a786-13bdc5cca98d

Then we see the snapshot process crash & burn:

2014-06-24 22:54:52.973 16186 TRACE nova.compute.manager [instance: 90c79adf-4df1-497c-a786-13bdc5cca98d] File "/usr/lib/python2.7/dist-packages/libvirt.py", line 646, in blockJobAbort
2014-06-24 22:54:52.973 16186 TRACE nova.compute.manager [instance: 90c79adf-4df1-497c-a786-13bdc5cca98d] if ret == -1: raise libvirtError ('virDomainBlockJobAbort() failed', dom=self)
2014-06-24 22:54:52.973 16186 TRACE nova.compute.manager [instance: 90c79adf-4df1-497c-a786-13bdc5cca98d] libvirtError: Unable to read from monitor: Connection reset by peer
2014-06-24 22:54:52.973 16186 TRACE nova.compute.manager [instance: 90c79adf-4df1-497c-a786-13bdc5cca98d]

So this looks very much to me like something in the test is killing the instance while the snapshot is still being done.

Now, as for why this doesn't affect non-live snapshots we were testing before...

For non-live snapshots, we issue a 'managedSave' call, which terminates the guest. Then we do the snapshot process. Then we start the guest up again from the managed-save image. My guess is that this racing 'Terminate instance' call happens while the guest is already shut down and hence does not cause a failure of the test suite when doing a non-live snapshot (or at least the window in which the race could hit is dramatically smaller).

So based on the sequence in the screen-n-cpu.log file, my money is currently on a race in the test scripts where something explicitly kills the instance while the snapshot is being taken, and the non-live snapshot code is simply not exposed to the race.
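
To make the comparison concrete, the cold (non-live) path described above is roughly this shape in libvirt-python terms (a sketch under the assumption of a plain file copy; Nova's real code goes through its image backend and qemu-img, and the helper name here is illustrative):

    import shutil

    def cold_snapshot(dom, disk_path, snapshot_path):
        # managedSave stops the guest and stashes its memory state, so
        # there is no running QEMU for a racing 'terminate' to upset.
        dom.managedSave(0)
        # Copy the now-quiescent disk (illustrative; Nova uses qemu-img).
        shutil.copyfile(disk_path, snapshot_path)
        # create() restarts the guest from the managed-save image.
        dom.create()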

Revision history for this message
Sean Dague (sdague) wrote :

We do kill the snapshot if it exceeds the timeout, which is currently 196s, because at some point we need to actually move on. When these are successful, they typically succeed in about 10s.
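
That timeout is just a poll loop; a simplified sketch of the tempest-style waiter (the get_status callable, the interval and the 196 s figure are illustrative, not tempest's exact code):

    import time

    class TimeoutException(Exception):
        pass

    def wait_for_server_status(get_status, timeout=196, interval=1):
        """Poll until the server is ACTIVE with no task state, else time out."""
        start = time.time()
        while time.time() - start < timeout:
            status, task_state = get_status()
            if status == "ACTIVE" and task_state is None:
                return
            time.sleep(interval)
        raise TimeoutException("server did not reach ACTIVE / task_state None "
                               "within %ds" % timeout)

When this raises, the test cleanup goes on to delete the server, which matches the virDomainDestroy call Daniel observes in the libvirtd log further down.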

Revision history for this message
Vish Ishaya (vishvananda) wrote :

OK, so it looks like the problem is that the snapshot is not completing in a reasonable amount of time. The timestamps suggest it took 2.5 minutes before it was killed, which aligns with the above. So it looks like the block mirror is not completing.

Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

After looking a little more at the '_live_snapshot' function in Nova[*], the below seems to be the precise equivalent sequence of (libvirt) operations to what happens in Nova's '_live_snapshot' function. Thanks to libvirt developer Eric Blake for reviewing this:

(0) Take the Libvirt guest's XML backup:

    $ virsh dumpxml --inactive vm1 > /var/tmp/vm1.xml

(1) Abort any failed/finished block operations:

    $ virsh blockjob vm1 vda --abort

(2) Undefine the running domain. (Note: undefining a running domain does not
      _kill_ the domain; it just converts it from persistent to transient.)

    $ virsh undefine vm1

(3) Invoke 'virsh blockcopy' (This will take time, depending on the size of disk image vm1):

    $ virsh blockcopy \
        --domain vm1 vda \
        /export/backups/vm1-copy.qcow2 \
        --wait \
        --verbose

(4) Abort any failed/finished block operations (as Dan pointed out in comment #17, this is the abort operation where QEMU
    might be failing):

    $ virsh blockjob vm1 vda --abort

   NOTE: If we use the '--finish' option in step 3, it is equivalent to the
   above command (consequently, step 4 can be skipped).

(5) Define the guest again (to make it persistent):

    $ virsh define /var/tmp/vm1.xml

(6) From the obtained new copy, convert the QCOW2 with a backing file to a flat (raw) image with no backing file:

    $ qemu-img convert -f qcow2 -O raw vm1.qcow2 conv-vm1.img

Notes (from Eric Blake):
    The _live_snapshot function concludes it all with redefining the
    domain (umm, that part looks fishy in the code - you undefine it
    only if it was persistent, but redefine the domain unconditionally;
    so if you call your function on a domain that is initially
    transient, you end up with a persistent domain at the end of your
    function).

  [*] https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L1593
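
Expressed against the libvirt Python bindings, the whole sequence is roughly the following (a condensed sketch of the kind of calls Nova's _live_snapshot makes -- not its actual code; error handling and the final qemu-img conversion are omitted, and the overlay file is assumed to be pre-created because of REUSE_EXT):

    import time
    import libvirt

    def live_copy(conn, name, disk_path, delta_path):
        dom = conn.lookupByName(name)

        # (0) Save the inactive XML so the guest can be re-defined later.
        xml = dom.XMLDesc(libvirt.VIR_DOMAIN_XML_INACTIVE)

        # (1) End any stale block job on the disk.
        dom.blockJobAbort(disk_path, 0)

        # (2) Make the running domain transient for the copy's duration.
        dom.undefine()
        try:
            # (3) Start a shallow copy into the pre-created qcow2 overlay.
            dom.blockRebase(disk_path, delta_path, 0,
                            libvirt.VIR_DOMAIN_BLOCK_REBASE_COPY |
                            libvirt.VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT |
                            libvirt.VIR_DOMAIN_BLOCK_REBASE_SHALLOW)

            # (4) Wait for the mirror to catch up, then cancel it.
            while True:
                info = dom.blockJobInfo(disk_path, 0)
                if not info or info.get('cur') == info.get('end'):
                    break
                time.sleep(0.5)
            dom.blockJobAbort(disk_path, 0)
        finally:
            # (5) Re-define the guest so it is persistent again.
            conn.defineXML(xml)
        # (6) qemu-img conversion of delta_path to a flat image happens after this.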

Revision history for this message
Dan Genin (daniel-genin) wrote :

FWIW, I have encountered libvirt connection reset errors when my DevStack VM ran low on memory. I first saw these when the more recent versions of DevStack started spawning numerous nova-api and nova-conductor instances, which ate up the relatively small RAM of the VM. When this happened, I was unable to boot any instances, with libvirt connection reset errors reported in the n-cpu log. Not sure that low memory is what's causing the errors here (maybe some other resource starvation), but they look awfully similar.

Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

Here's some investigation of what happens, at the libvirt level, when
_live_snapshot is invoked. I performed the live_snapshot test with current
git (after modifying MIN_LIBVIRT_LIVESNAPSHOT_VERSION=1.0.0 for testing
purposes) with the below log filters in /etc/libvirt/libvirtd.conf (and
restarted libvirtd):

    log_level = 1
    log_outputs="1:file:/var/tmp/libvirtd.log"

    # Find what QMP commands libvirt is sending to QEMU
    log_filters="1:qemu_monitor"

Libvirt call sequence
---------------------

(1) virDomainGetXMLDesc
(2) virDomainBlockJobAbort
(3) virDomainUndefine
(4) virDomainBlockRebase
     - NOTE (from libvirt documentation): By default, the copy job runs
       in the background and consists of two phases: (a) the block
       operation copies all data from the source, and during this phase
       the job can only be canceled to revert back to the source disk,
       with no guarantees about the destination. (b) After phase (a)
       completes, both the source and the destination remain mirrored
       until a call to the block operation with --abort.

(5) virDomainBlockJobAbort.

Test
----

Boot a new Nova instance:

    $ nova boot --flavor 1 --key_name oskey1 --image \
        cirros-0.3.0-x86_64-disk cvm1

Issue a snapshot (this should trigger the _live_snapshot code path):

    $ nova image-create --poll cvm1 snap1-cvm1

Ensure that "live snapshot" _did_ take place by searching the
'screen-n-cpu.log':

    $ grep -i "Beginning live snapshot process" ../data/new/screen-logs/screen-n-cpu.log
    2014-06-30 03:34:32.237 INFO nova.virt.libvirt.driver [req-b02ac5c5-7694-44ce-9185-129b91eaf6b5 admin admin] [instance: 4b62dab0-0efe-4125-84ec-3e24d3371082] Beginning live snapshot process
    $

Libvirt logs
------------

(1) Save a copy of the libvirt XML (virDomainGetXMLDesc):
----
2014-06-30 09:08:13.586+0000: 8470: debug : virDomainGetXMLDesc:4344 : dom=0x7f2994000fd0, (VM: name=instance-00000001, uuid=4b62dab0-0efe-4125-84ec-3e24d3371082), flags=0
----

(2) Issue a BlockJobAbort (virDomainBlockJobAbort), so that any prior active block operation on that disk will be cancelled:
----
2014-06-30 09:08:13.632+0000: 8470: debug : virDomainBlockJobAbort:19492 : dom=0x7f2994000af0, (VM: name=instance-00000001, uuid=4b62dab0-0efe-4125-84ec-3e24d3371082), disk=/home/kashyapc/src/openstack/data/nova/instances/4b62dab0-0efe-4125-84ec-3e24d3371082/disk, flags=0
----

(3) Undefining the running libvirt domain (virDomainUndefine), to make it transient[*]:
----
2014-06-30 09:08:14.069+0000: 8471: debug : virDomainUndefine:8683 : dom=0x7f29a0000910, (VM: name=instance-00000001, uuid=4b62dab0-0efe-4125-84ec-3e24d3371082)
2014-06-30 09:08:14.069+0000: 8471: info : qemuDomainUndefineFlags:6334 : Undefining domain 'instance-00000001'
----

We'll define the guest again further below, from the saved copy from step (1).

[*] Reasoning for making the domain transient: BlockRebase ('blockcopy')
jobs last forever until canceled, which implies that they should last
across domain restarts if the domain were persistent. But, QEMU doesn't
yet provide a way to restart a copy job on domain restart (while
mirroring is still intact). So the trick is to tempo...

Revision history for this message
Daniel Berrange (berrange) wrote :

I've added some more debugging to the live snapshot code in this change:

  https://review.openstack.org/#/c/103066/

When it failed in this test run:

  http://logs.openstack.org/66/103066/7/check/check-tempest-dsvm-postgres-full/02e97b1

I see

2014-06-30 11:55:55.398+0000: 18078: debug : virDomainGetBlockJobInfo:19415 : dom=0x7f13200018f0, (VM: name=instance-00000020, uuid=b1e3c5de-af31-4a4d-94dd-ef382936583b), disk=/opt/stack/data/nova/instances/b1e3c5de-af31-4a4d-94dd-ef382936583b/disk, info=0x7f1315ffa390, flags=0
2014-06-30 11:55:55.415 WARNING nova.virt.libvirt.driver [req-3e682ad2-5af5-47d2-a72d-9bac23e8c2bc ListImageFiltersTestJSON-1350287421 ListImageFiltersTestJSON-1927810029] blockJobInfo snapshot cur=0 end=25165824

2014-06-30 11:55:56.074+0000: 18071: debug : virDomainGetBlockJobInfo:19415 : dom=0x7f13200018f0, (VM: name=instance-00000020, uuid=b1e3c5de-af31-4a4d-94dd-ef382936583b), disk=/opt/stack/data/nova/instances/b1e3c5de-af31-4a4d-94dd-ef382936583b/disk, info=0x7f13256ea390, flags=0
2014-06-30 11:55:56.094 WARNING nova.virt.libvirt.driver [req-3e682ad2-5af5-47d2-a72d-9bac23e8c2bc ListImageFiltersTestJSON-1350287421 ListImageFiltersTestJSON-1927810029] blockJobInfo snapshot cur=25165824 end=25165824

This shows that as far as virDomainGetBlockJobInfo is concerned, the job has completed copying the data in not very much time at all, which seems reasonable considering it is a cirros image.

We then go into a virDomainBlockJobAbort call to finish the snapshot operation:

  2014-06-30 11:55:56.127+0000: 18070: debug : virDomainBlockJobAbort:19364 : dom=0x7f13200018f0, (VM: name=instance-00000020, uuid=b1e3c5de-af31-4a4d-94dd-ef382936583b), disk=/opt/stack/data/nova/instances/b1e3c5de-af31-4a4d-94dd-ef382936583b/disk, flags=0

This should take a fraction of a second, but after 3 minutes it still isn't done. Tempest gets fed up waiting and so issues a call to destroy the guest:

  2014-06-30 11:59:10.341+0000: 18090: debug : virDomainDestroy:2173 : dom=0x7f12f0002910, (VM: name=instance-00000020, uuid=b1e3c5de-af31-4a4d-94dd-ef382936583b)

Shortly thereafter QEMU is dead and the virDomainBlockJobAbort call returns, obviously with an error:

2014-06-30 11:59:21.279 17542 TRACE nova.compute.manager [instance: b1e3c5de-af31-4a4d-94dd-ef382936583b] libvirtError: Unable to read from monitor: Connection reset by peer

So, based on this debug info I think that Nova is doing the right thing, and this is probably a bug in QEMU (or possibly, but unlikely, a bug in libvirt). My inclination is that QEMU is basically hanging in the block job abort call, due to some fairly infrequently hit race condition.

Revision history for this message
Daniel Berrange (berrange) wrote :

I have managed to capture a failure with verbose libvirtd.log enabled.

http://logs.openstack.org/66/103066/11/check/check-tempest-dsvm-postgres-full/d8d3b5b/logs/libvirtd.txt.gz
http://logs.openstack.org/66/103066/11/check/check-tempest-dsvm-postgres-full/d8d3b5b/logs/screen-n-cpu.txt.gz

2014-07-09 11:05:50.701+0000: 21774: debug : virDomainBlockJobAbort:19364 : dom=0x7fc73c010700, (VM: name=instance-0000001d, uuid=6cfca6f5-8741-4844-a045-a14a644dd82d), disk=/opt/stack/data/nova/instances/6cfca6f5-8741-4844-a045-a14a644dd82d/disk, flags=0
2014-07-09 11:05:50.701+0000: 21774: debug : qemuDomainObjBeginJobInternal:1050 : Starting job: modify (async=none vm=0x7fc740009ee0 name=instance-0000001d)
2014-07-09 11:05:50.701+0000: 21774: debug : qemuDomainObjBeginJobInternal:1092 : Started job: modify (async=none vm=0x7fc740009ee0 name=instance-0000001d)
2014-07-09 11:05:50.701+0000: 21774: debug : qemuDomainObjEnterMonitorInternal:1278 : Entering monitor (mon=0x7fc73c010ad0 vm=0x7fc740009ee0 name=instance-0000001d)
2014-07-09 11:05:50.701+0000: 21774: debug : qemuMonitorBlockJob:3314 : mon=0x7fc73c010ad0, device=drive-virtio-disk0, base=<null>, bandwidth=0M, info=(nil), mode=0, modern=1
2014-07-09 11:05:50.701+0000: 21774: debug : qemuMonitorJSONCommandWithFd:264 : Send command '{"execute":"block-job-cancel","arguments":{"device":"drive-virtio-disk0"},"id":"libvirt-14"}' for write with FD -1
2014-07-09 11:05:50.701+0000: 21774: debug : qemuMonitorSend:959 : QEMU_MONITOR_SEND_MSG: mon=0x7fc73c010ad0 msg={"execute":"block-job-cancel","arguments":{"device":"drive-virtio-disk0"},"id":"libvirt-14"}
2014-07-09 11:05:50.705+0000: 21774: debug : qemuMonitorJSONCommandWithFd:269 : Receive command reply ret=0 rxObject=0x7fc76096cac0
2014-07-09 11:05:50.705+0000: 21774: debug : qemuDomainObjExitMonitorInternal:1301 : Exited monitor (mon=0x7fc73c010ad0 vm=0x7fc740009ee0 name=instance-0000001d)
2014-07-09 11:05:50.705+0000: 21774: debug : qemuDomainObjEnterMonitorInternal:1278 : Entering monitor (mon=0x7fc73c010ad0 vm=0x7fc740009ee0 name=instance-0000001d)
2014-07-09 11:05:50.705+0000: 21774: debug : qemuMonitorBlockJob:3314 : mon=0x7fc73c010ad0, device=drive-virtio-disk0, base=<null>, bandwidth=0M, info=0x7fc7572be9a0, mode=1, modern=1
2014-07-09 11:05:50.705+0000: 21774: debug : qemuMonitorJSONCommandWithFd:264 : Send command '{"execute":"query-block-jobs","id":"libvirt-15"}' for write with FD -1
2014-07-09 11:05:50.705+0000: 21774: debug : qemuMonitorSend:959 : QEMU_MONITOR_SEND_MSG: mon=0x7fc73c010ad0 msg={"execute":"query-block-jobs","id":"libvirt-15"}
2014-07-09 11:05:50.709+0000: 21774: debug : qemuMonitorJSONCommandWithFd:269 : Receive command reply ret=0 rxObject=0x7fc76099cce0
2014-07-09 11:05:50.709+0000: 21774: debug : qemuDomainObjExitMonitorInternal:1301 : Exited monitor (mon=0x7fc73c010ad0 vm=0x7fc740009ee0 name=instance-0000001d)
2014-07-09 11:05:50.759+0000: 21774: debug : qemuDomainObjEnterMonitorInternal:1278 : Entering monitor (mon=0x7fc73c010ad0 vm=0x7fc740009ee0 name=instance-0000001d)
2014-07-09 11:05:50.759+0000: 21774: debug : qemuMonitorBlockJob:3314 : mon=0x7fc73c010ad0, device=drive-virtio-disk0, base=<null>, bandwidth=0M, info=0x7...

Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

Ping.

I recall DanB noted somewhere in a mailing list thread -- and the QEMU devs I spoke to also suggested -- that one of the possible next steps to investigate this is to attach GDB to the hanging QEMU in the CI gate and get stack traces out of it.

Revision history for this message
Sean Dague (sdague) wrote :

We've basically just disabled this feature until someone can dig into why qemu doesn't work with it.

Changed in nova:
status: New → Confirmed
assignee: Sean Dague (sdague) → nobody
Chet Burgess (cfb-n)
Changed in nova:
assignee: nobody → Chet Burgess (cfb-n)
Revision history for this message
Chet Burgess (cfb-n) wrote :

This came up in the ML and in openstack-nova IRC chat today.

I'm going to look into adding some sane libvirt/qemu version checking to the feature, as well as wrapping the feature with some type of config option so that it's possible to conditionally enable it on different platforms.

Revision history for this message
Chet Burgess (cfb-n) wrote :

This came up again today in openstack-nova and recently on the ML.

I'll look at adding some version guarding around this feature. It definitely works on precise with libvirtd 1.1.3.5 and qemu 1.5, as we have been running it in production with those versions. So I think some basic version guarding should be sufficient.

mriedem has also requested the ability to conditionally enable it on different OS gates for easier testing so I will work on that as well.
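
The version-guard part is straightforward to sketch: compare libvirt's packed version number against a minimum tuple before selecting the live-snapshot path. The helper names and threshold below are illustrative, not the values Nova settled on:

    # libvirt packs its version as 1000000*major + 1000*minor + release.
    MIN_LIBVIRT_LIVESNAPSHOT_VERSION = (1, 3, 0)   # illustrative threshold

    def version_to_tuple(packed):
        major, rest = divmod(packed, 1000000)
        minor, release = divmod(rest, 1000)
        return (major, minor, release)

    def can_live_snapshot(conn):
        # conn is a libvirt.virConnect; getLibVersion() returns the packed int.
        return (version_to_tuple(conn.getLibVersion())
                >= MIN_LIBVIRT_LIVESNAPSHOT_VERSION)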

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/147332

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/150639

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/150639
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3f69bca12c484a08d8f80c1db581e12607f13f7a
Submitter: Jenkins
Branch: master

commit 3f69bca12c484a08d8f80c1db581e12607f13f7a
Author: Tony Breeds <email address hidden>
Date: Tue Jan 27 14:05:02 2015 -0800

    Use a workarounds group option to disable live snaphots.

    Create a workarounds option to disable live snapshotting rather than
    hack MIN_LIBVIRT_LIVESNAPSHOT_VERSION.

    DocImpact
    Related-Bug: #1334398
    Change-Id: Iee9afc0afaffa1de509c357fcb9fdb18c650a70b
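
For readers following along, the shape of that workarounds option in oslo.config terms is roughly the following sketch (the option name and help text are assumptions for illustration; the merged change above is the authoritative definition):

    from oslo_config import cfg

    workarounds_opts = [
        cfg.BoolOpt('disable_libvirt_livesnapshot',   # name assumed for illustration
                    default=True,
                    help='Skip the libvirt live-snapshot path and fall back '
                         'to cold snapshots.'),
    ]

    CONF = cfg.CONF
    CONF.register_opts(workarounds_opts, group='workarounds')

    def use_live_snapshot(libvirt_new_enough):
        # Live snapshotting requires both a new-enough libvirt/qemu and the
        # workaround toggle being off.
        return libvirt_new_enough and not CONF.workarounds.disable_libvirt_livesnapshot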

Revision history for this message
Matt Riedemann (mriedem) wrote :

https://review.openstack.org/#/c/171795/ adds the fedora 21 job to nova's experimental queue so we can test changes against newer libvirt/qemu than what's in ubuntu 14.04.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Joe Gordon (<email address hidden>) on branch: master
Review: https://review.openstack.org/147332
Reason: This review is > 4 weeks without comment and currently blocked by a core reviewer with a -2. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and contacting the reviewer with the -2 on this review to ensure you address their concerns.

Changed in nova:
assignee: Chet Burgess (cfb-n) → nobody
Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote : Cleanup

Solving an inconsistency: The bug is 'In Progress' but without an assignee. I set the status back to the last known status before the change to 'In Progress'.

Feel free to assign the bug to yourself. If you do so, please set it to 'In Progress'.

Changed in nova:
status: In Progress → Confirmed
Changed in nova:
assignee: nobody → Pranav Salunke (dguitarbite)
Changed in nova:
assignee: Pranav Salunke (dguitarbite) → nobody
Revision history for this message
Matt Riedemann (mriedem) wrote :

At some point soon in Newton we'll have ubuntu 16.04 in the gate jobs with libvirt 1.3.1; we should try turning this back on and see if it works.

Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote : Cleanup EOL bug report

This is an automated cleanup. This bug report has been closed because it
is older than 18 months and there is no open code change to fix this.
After this time it is unlikely that the circumstances which led to
the observed issue can be reproduced.

If you can reproduce the bug, please:
* reopen the bug report (set to status "New")
* AND add the detailed steps to reproduce the issue (if applicable)
* AND leave a comment "CONFIRMED FOR: <RELEASE_NAME>"
  Only still supported release names are valid (LIBERTY, MITAKA, OCATA, NEWTON).
  Valid example: CONFIRMED FOR: LIBERTY

Changed in nova:
importance: High → Undecided
status: Confirmed → Expired
Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

CONFIRMED FOR: NEWTON

Changed in nova:
status: Expired → Confirmed