libvirt: test_attach_volume_shelved_or_offload_server times out waiting for device detach (which fails)

Bug #1673483 reported by Matt Riedemann
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
In Progress
Undecided
Unassigned
libvirt (Ubuntu)
Triaged
Medium
Unassigned

Bug Description

http://logs.openstack.org/75/446175/1/gate/gate-tempest-dsvm-neutron-full-ubuntu-xenial/479c7bf/console.html#_2017-03-16_01_31_33_688563

The test teardown times out waiting for the volume to be detached from the server because the volume detach fails due to a libvirt i/o error during detach:

http://logs.openstack.org/75/446175/1/gate/gate-tempest-dsvm-neutron-full-ubuntu-xenial/479c7bf/logs/screen-n-cpu.txt.gz?#_2017-03-16_01_01_32_897

2017-03-16 01:01:32.897 7447 ERROR nova.compute.manager [req-ac93339e-9eb1-4104-8743-5e5626253ab8 tempest-AttachVolumeShelveTestJSON-116820638 tempest-AttachVolumeShelveTestJSON-116820638] [instance: 5f67c47b-f7d7-4063-88f1-8f7b6177fcb9] Failed to detach volume dc49318a-e51f-4812-97a3-64bf93e522eb from /dev/vdc
2017-03-16 01:01:32.897 7447 ERROR nova.compute.manager [instance: 5f67c47b-f7d7-4063-88f1-8f7b6177fcb9] Traceback (most recent call last):
2017-03-16 01:01:32.897 7447 ERROR nova.compute.manager [instance: 5f67c47b-f7d7-4063-88f1-8f7b6177fcb9] File "/opt/stack/new/nova/nova/compute/manager.py", line 4833, in _driver_detach_volume
2017-03-16 01:01:32.897 7447 ERROR nova.compute.manager [instance: 5f67c47b-f7d7-4063-88f1-8f7b6177fcb9] encryption=encryption)
2017-03-16 01:01:32.897 7447 ERROR nova.compute.manager [instance: 5f67c47b-f7d7-4063-88f1-8f7b6177fcb9] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 1334, in detach_volume
2017-03-16 01:01:32.897 7447 ERROR nova.compute.manager [instance: 5f67c47b-f7d7-4063-88f1-8f7b6177fcb9] live=live)
2017-03-16 01:01:32.897 7447 ERROR nova.compute.manager [instance: 5f67c47b-f7d7-4063-88f1-8f7b6177fcb9] File "/opt/stack/new/nova/nova/virt/libvirt/guest.py", line 414, in detach_device_with_retry
2017-03-16 01:01:32.897 7447 ERROR nova.compute.manager [instance: 5f67c47b-f7d7-4063-88f1-8f7b6177fcb9] _try_detach_device(conf, persistent, live)
2017-03-16 01:01:32.897 7447 ERROR nova.compute.manager [instance: 5f67c47b-f7d7-4063-88f1-8f7b6177fcb9] File "/opt/stack/new/nova/nova/virt/libvirt/guest.py", line 408, in _try_detach_device
2017-03-16 01:01:32.897 7447 ERROR nova.compute.manager [instance: 5f67c47b-f7d7-4063-88f1-8f7b6177fcb9] raise exception.DeviceNotFound(device=device)
2017-03-16 01:01:32.897 7447 ERROR nova.compute.manager [instance: 5f67c47b-f7d7-4063-88f1-8f7b6177fcb9] File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2017-03-16 01:01:32.897 7447 ERROR nova.compute.manager [instance: 5f67c47b-f7d7-4063-88f1-8f7b6177fcb9] self.force_reraise()
2017-03-16 01:01:32.897 7447 ERROR nova.compute.manager [instance: 5f67c47b-f7d7-4063-88f1-8f7b6177fcb9] File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2017-03-16 01:01:32.897 7447 ERROR nova.compute.manager [instance: 5f67c47b-f7d7-4063-88f1-8f7b6177fcb9] six.reraise(self.type_, self.value, self.tb)
2017-03-16 01:01:32.897 7447 ERROR nova.compute.manager [instance: 5f67c47b-f7d7-4063-88f1-8f7b6177fcb9] File "/opt/stack/new/nova/nova/virt/libvirt/guest.py", line 393, in _try_detach_device
2017-03-16 01:01:32.897 7447 ERROR nova.compute.manager [instance: 5f67c47b-f7d7-4063-88f1-8f7b6177fcb9] self.detach_device(conf, persistent=persistent, live=live)
2017-03-16 01:01:32.897 7447 ERROR nova.compute.manager [instance: 5f67c47b-f7d7-4063-88f1-8f7b6177fcb9] File "/opt/stack/new/nova/nova/virt/libvirt/guest.py", line 450, in detach_device
2017-03-16 01:01:32.897 7447 ERROR nova.compute.manager [instance: 5f67c47b-f7d7-4063-88f1-8f7b6177fcb9] self._domain.detachDeviceFlags(device_xml, flags=flags)
2017-03-16 01:01:32.897 7447 ERROR nova.compute.manager [instance: 5f67c47b-f7d7-4063-88f1-8f7b6177fcb9] File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 186, in doit
2017-03-16 01:01:32.897 7447 ERROR nova.compute.manager [instance: 5f67c47b-f7d7-4063-88f1-8f7b6177fcb9] result = proxy_call(self._autowrap, f, *args, **kwargs)
2017-03-16 01:01:32.897 7447 ERROR nova.compute.manager [instance: 5f67c47b-f7d7-4063-88f1-8f7b6177fcb9] File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 144, in proxy_call
2017-03-16 01:01:32.897 7447 ERROR nova.compute.manager [instance: 5f67c47b-f7d7-4063-88f1-8f7b6177fcb9] rv = execute(f, *args, **kwargs)
2017-03-16 01:01:32.897 7447 ERROR nova.compute.manager [instance: 5f67c47b-f7d7-4063-88f1-8f7b6177fcb9] File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 125, in execute
2017-03-16 01:01:32.897 7447 ERROR nova.compute.manager [instance: 5f67c47b-f7d7-4063-88f1-8f7b6177fcb9] six.reraise(c, e, tb)
2017-03-16 01:01:32.897 7447 ERROR nova.compute.manager [instance: 5f67c47b-f7d7-4063-88f1-8f7b6177fcb9] File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 83, in tworker
2017-03-16 01:01:32.897 7447 ERROR nova.compute.manager [instance: 5f67c47b-f7d7-4063-88f1-8f7b6177fcb9] rv = meth(*args, **kwargs)
2017-03-16 01:01:32.897 7447 ERROR nova.compute.manager [instance: 5f67c47b-f7d7-4063-88f1-8f7b6177fcb9] File "/usr/local/lib/python2.7/dist-packages/libvirt.py", line 1192, in detachDeviceFlags
2017-03-16 01:01:32.897 7447 ERROR nova.compute.manager [instance: 5f67c47b-f7d7-4063-88f1-8f7b6177fcb9] if ret == -1: raise libvirtError ('virDomainDetachDeviceFlags() failed', dom=self)
2017-03-16 01:01:32.897 7447 ERROR nova.compute.manager [instance: 5f67c47b-f7d7-4063-88f1-8f7b6177fcb9] libvirtError: End of file while reading data: Input/output error
2017-03-16 01:01:32.897 7447 ERROR nova.compute.manager [instance: 5f67c47b-f7d7-4063-88f1-8f7b6177fcb9]

It's not clear if we're trying the detach twice, and the first time it's OK and then the 2nd fails with the i/o error, which is not in the list of expected libvirt errors we handle in this case.

Revision history for this message
Matt Riedemann (mriedem) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/446601

Matt Riedemann (mriedem)
Changed in nova:
status: New → Confirmed
Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

(From an IRC interaction with Dan Berrangé [danpb] and Matthew Booth
[mdbooth].)

Dan has done some log analysis. Specifically from the below log message
(from libvirt debug logs), he could deduce some clues about root cause:

   [...]
   2017-03-16 01:01:33.096+0000: 26064: debug : virNetlinkStartup:138 : Running global netlink initialization
   [...]

[danpb]: "The above is showing libvirtd has been restarted, so
presumably it crashed, causing Nova to see the EOF. We should probably
make sure OpenStack CI does *not* restart things, as it just hides the
obvious failure!"

And after looking at the error from 'syslog':

    [...]
    Mar 16 01:01:32 ubuntu-xenial-infracloud-vanilla-7903827 libvirtd[17211]: *** Error in `/usr/sbin/libvirtd': malloc(): memory corruption: 0x0000562d21527f50 ***
    [...]

[danpb]: "So it's a bug in libvirt in Ubuntu. Which is probably fixed
sometime after 1.3.1."

NOTE (kashyap): There is an existing bug for the above memory corruption
here:

    https://bugs.launchpad.net/nova/+bug/1643911/ -- 'libvirt randomly
    crashes on xenial nodes with "*** Error in `/usr/sbin/libvirtd':
    malloc(): memory corruption:"'

After some code inspection, Dan pointed out two potential commits from
libvirt that are likely candidates to have fixed the issue:

(1) One possible candidate in 1.3.2 would be:

    'qemu: Process monitor EOF in a job' --
    https://libvirt.org/git/?p=libvirt.git;a=commit;h=8c9ff99

(2) This might be relevant, too, but less likely:

    ;qemu: Avoid calling qemuProcessStop without a job' --
    https://libvirt.org/git/?p=libvirt.git;a=commit;h=81f50cb

He notes further: "the first [libvirt] commit hash noted above (8c9ff99)
is something you'd want to encourage Ubuntu maintainers to add as I
don't see it in their patches right now."

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi Kashyap,
I appreciate your check with Dan.

I agree that 8c9ff99 is a reasonable fix in general.
The backporting of this series from Jiri can be done two ways.
A) On one hand I could either try to pick 8c9ff996 which implies at least 4d0c535a to work reasonably.
B) Or OTOH I could try to leverage the full series of Jiri - both changes recommended by Dan, but since that means more dependencies it is essentially all of the series = 17c4312c..81f50cb92

A) is less invasive, while B) is certainly more complete but might hide some of its dependencies.
I prepared both in PPAs and will drive some basic tests on them now.
Once they pass tests I would come back here recommend one or both of them to be tried in your context.

In preparation I wanted to ask to what extend you could test from a ppa in your environment?

Changed in libvirt (Ubuntu):
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Ok, basic verifications were all good, I faced no issues on automated tests on all sort of libvirt interactions as well as migrations from/to/in-between xenial. I also have not seen any arch difference.

This means I would have two ppa's for you to try:
- Backport of the minimal recommended fix: https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/2620
- Backport of the full series of related fixes: https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/2619

Is there a reasonable way to expose that in your infrastructure - although I'm afraid "2 hits in 7 days" make this hard to feel 100% sure that "0 hits in 7 days" is a safe validation. But you know your environment best - is there a way to test these to gain some confidence - in this being a fix as well as not causing another regression?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/446601
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=33d0b029f1927a455b6b00f41b4d83ed19a694ff
Submitter: Jenkins
Branch: master

commit 33d0b029f1927a455b6b00f41b4d83ed19a694ff
Author: Matt Riedemann <email address hidden>
Date: Thu Mar 16 11:35:04 2017 -0400

    libvirt: add debug logging in detach_device_with_retry

    When we hit failures in this method it can be hard to
    sort out from a traceback what actually happened, and
    if we failed from the intial detach or if we failed
    during the retry loop.

    This patch adds some simple debug logging statements
    to help give context to what's going on when we need
    to debug a failure in this operation.

    Change-Id: I392192f139ced077664e54469eead0b0df9ebf4a
    Related-Bug: #1673483

Revision history for this message
Matt Riedemann (mriedem) wrote :

(9:56:32 AM) mriedem: clarkb: sorry if you guys already talked about this, but in https://bugs.launchpad.net/nova/+bug/1673483 there are comments from ubuntu devs that backports have been made to some packages and are available for testing,
(9:56:34 AM) openstack: Launchpad bug 1673483 in libvirt (Ubuntu) "libvirt: test_attach_volume_shelved_or_offload_server times out waiting for device detach (which fails)" [Medium,Triaged]
(9:56:45 AM) mriedem: clarkb: it's not clear to me if we have a way to hook testing packages in our CI?
(9:56:48 AM) clarkb: mriedem: yes but that bug isn't a problem
(9:56:57 AM) clarkb: mriedem: we have 3 other libvirt bugs that are actually major issues
(9:57:12 AM) clarkb: mriedem: and I mentioned it would be simple to modify devstack to use a ppa or just wget and dpkg -i
(9:57:47 AM) mriedem: ok, and we don't have fixes identified to backport for 1.3.3 in xenial LTS for testing yet
(9:58:06 AM) jamespage: clarkb: can I get context please on those three bugs
(9:58:16 AM) jamespage: refs would be ok for now
jamespage jamespd
(9:58:30 AM) jamespage: guessing one is bug 1643911
(9:58:30 AM) mriedem: jamespage: https://bugs.launchpad.net/nova/+bug/1643911/
(9:58:31 AM) openstack: bug 1643911 in OpenStack Compute (nova) "libvirt randomly crashes on xenial nodes with "*** Error in `/usr/sbin/libvirtd': malloc(): memory corruption:"" [Medium,Confirmed] https://launchpad.net/bugs/1643911
(9:58:47 AM) clarkb: jamespage: its the three elastic recheck links I gave you above
(9:58:52 AM) clarkb: jamespage: they have links to launchpad
(9:58:56 AM) jamespage: clarkb: ta

Revision history for this message
Matt Riedemann (mriedem) wrote :
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I updated the other potentially related bug to encourage considering ways to test the PPAs mentioned in comment #5

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

The scope of "seems potentially related" are these bugs:
- bug 1646779
- bug 1643911
- bug 1638982
- bug 1673483 (= this one)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/ocata)

Related fix proposed to branch: stable/ocata
Review: https://review.openstack.org/472308

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/newton)

Related fix proposed to branch: stable/newton
Review: https://review.openstack.org/472344

Sean Dague (sdague)
Changed in nova:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/ocata)

Reviewed: https://review.openstack.org/472308
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=527834dea884d0663c1cf5f075c850e6f4a4eeaa
Submitter: Jenkins
Branch: stable/ocata

commit 527834dea884d0663c1cf5f075c850e6f4a4eeaa
Author: Matt Riedemann <email address hidden>
Date: Thu Mar 16 11:35:04 2017 -0400

    libvirt: add debug logging in detach_device_with_retry

    When we hit failures in this method it can be hard to
    sort out from a traceback what actually happened, and
    if we failed from the intial detach or if we failed
    during the retry loop.

    This patch adds some simple debug logging statements
    to help give context to what's going on when we need
    to debug a failure in this operation.

    Change-Id: I392192f139ced077664e54469eead0b0df9ebf4a
    Related-Bug: #1673483
    (cherry picked from commit 33d0b029f1927a455b6b00f41b4d83ed19a694ff)

tags: added: in-stable-ocata
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/newton)

Reviewed: https://review.openstack.org/472344
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=58e7f8c3253d6f22a60ed567d95bdd4de64da5f1
Submitter: Jenkins
Branch: stable/newton

commit 58e7f8c3253d6f22a60ed567d95bdd4de64da5f1
Author: Matt Riedemann <email address hidden>
Date: Thu Mar 16 11:35:04 2017 -0400

    libvirt: add debug logging in detach_device_with_retry

    When we hit failures in this method it can be hard to
    sort out from a traceback what actually happened, and
    if we failed from the intial detach or if we failed
    during the retry loop.

    This patch adds some simple debug logging statements
    to help give context to what's going on when we need
    to debug a failure in this operation.

    Change-Id: I392192f139ced077664e54469eead0b0df9ebf4a
    Related-Bug: #1673483
    (cherry picked from commit 33d0b029f1927a455b6b00f41b4d83ed19a694ff)
    (cherry picked from commit 0e21bbde99b765f97100c832d3a81df1fee5b23b)

tags: added: in-stable-newton
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.