Description of problem:

Short summary: if a QEMU/KVM VM hangs because of unresponsive storage (NFS server unreachable), after a random amount of time virDomainGetControlInfo() stops responding.

Packages:

qemu-kvm-tools-rhev-2.3.0-31.el7_2.14.x86_64
ipxe-roms-qemu-20130517-7.gitc4bce43.el7.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.14.x86_64
libvirt-daemon-driver-qemu-1.3.4-1.el7.x86_64
qemu-img-rhev-2.3.0-31.el7_2.14.x86_64
qemu-kvm-common-rhev-2.3.0-31.el7_2.14.x86_64
libvirt-daemon-driver-storage-1.3.4-1.el7.x86_64
libvirt-daemon-driver-interface-1.3.4-1.el7.x86_64
libvirt-debuginfo-1.3.4-1.el7.x86_64
libvirt-daemon-kvm-1.3.4-1.el7.x86_64
libvirt-daemon-config-nwfilter-1.3.4-1.el7.x86_64
libvirt-daemon-config-network-1.3.4-1.el7.x86_64
libvirt-client-1.3.4-1.el7.x86_64
libvirt-daemon-driver-lxc-1.3.4-1.el7.x86_64
libvirt-lock-sanlock-1.3.4-1.el7.x86_64
libvirt-daemon-1.3.4-1.el7.x86_64
libvirt-devel-1.3.4-1.el7.x86_64
libvirt-daemon-driver-secret-1.3.4-1.el7.x86_64
libvirt-daemon-lxc-1.3.4-1.el7.x86_64
libvirt-nss-1.3.4-1.el7.x86_64
libvirt-1.3.4-1.el7.x86_64
libvirt-daemon-driver-nodedev-1.3.4-1.el7.x86_64
libvirt-python-1.2.17-2.el7.x86_64
libvirt-daemon-driver-network-1.3.4-1.el7.x86_64
libvirt-login-shell-1.3.4-1.el7.x86_64
libvirt-daemon-driver-nwfilter-1.3.4-1.el7.x86_64
libvirt-docs-1.3.4-1.el7.x86_64

libvirt recompiled from git, qemu from RHEL.

Context:

Vdsm is the node management system of oVirt (http://www.ovirt.org) and uses libvirt to run and monitor VMs. We run QEMU/KVM VMs over shared storage.

Among the calls Vdsm runs periodically to monitor the VM state are:

virConnectGetAllDomainStats
virDomainListGetStats
virDomainGetBlockIoTune
virDomainBlockJobInfo
virDomainGetBlockInfo
virDomainGetVcpus

We know from experience that storage may become unresponsive/unreachable, so QEMU monitor calls can hang, which in turn makes the corresponding libvirt calls hang.

Vdsm does the monitoring using a thread pool. Should one of the worker threads become unresponsive, it is replaced. To avoid stalling libvirt, and to avoid leaking threads indefinitely, Vdsm has one additional protection layer: it inspects the libvirt state before issuing any call that goes down to QEMU, using code like:

    def isDomainReadyForCommands(self):
        try:
            state, details, stateTime = self._dom.controlInfo()
        except virdomain.NotConnectedError:
            # this method may be called asynchronously by periodic
            # operations. Thus, we must use a try/except block
            # to avoid racy checks.
            return False
        except libvirt.libvirtError as e:
            if e.get_error_code() == libvirt.VIR_ERR_NO_DOMAIN:
                return False
            else:
                raise
        else:
            return state == libvirt.VIR_DOMAIN_CONTROL_OK

Vdsm issues the potentially hanging call if and only if the method above returns True (hence virDomainGetControlInfo() reports state VIR_DOMAIN_CONTROL_OK).

When the NFS server is unreachable, this protection layer in Vdsm triggers, and Vdsm avoids sending the libvirt calls.
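For clarity, the guard is used roughly like this around each potentially hanging call (a simplified sketch, not the literal Vdsm code; the method name and the 'vda' disk target are illustrative):

    def queryBlockIoTune(self):
        # Issue the call that reaches the QEMU monitor only if the
        # domain just reported VIR_DOMAIN_CONTROL_OK.
        if not self.isDomainReadyForCommands():
            return None
        # 'vda' stands in for the domain's actual disk target.
        return self._dom.blockIoTune('vda')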
After a while, however, we see virDomainGetControlInfo() calls that never return, like the following (full log attached):

2016-05-18 06:01:45.920+0000: 3069: debug : virThreadJobSet:96 : Thread 3069 (virNetServerHandleJob) is now running job remoteDispatchDomainGetVcpus
2016-05-18 06:01:45.920+0000: 3069: info : virObjectNew:202 : OBJECT_NEW: obj=0x7f5a70004070 classname=virDomain
2016-05-18 06:01:45.920+0000: 3069: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a5c000ec0
2016-05-18 06:01:45.920+0000: 3069: debug : virDomainGetVcpus:7733 : dom=0x7f5a70004070, (VM: name=a1, uuid=048f8624-03fc-4729-8f4d-12cb4387f018), info=0x7f5a70002140, maxinfo=2, cpumaps=0x7f5a70002200, maplen=1
2016-05-18 06:01:45.920+0000: 3069: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a64009bf0
2016-05-18 06:01:45.920+0000: 3069: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a930f6f00
2016-05-18 06:01:45.920+0000: 3069: debug : virAccessManagerCheckDomain:234 : manager=0x7f5a930f6f00(name=stack) driver=QEMU domain=0x7f5a64012c40 perm=1
2016-05-18 06:01:45.920+0000: 3069: debug : virAccessManagerCheckDomain:234 : manager=0x7f5a930ebdf0(name=none) driver=QEMU domain=0x7f5a64012c40 perm=1
2016-05-18 06:01:45.920+0000: 3069: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7f5a930f6f00
2016-05-18 06:01:45.920+0000: 3069: debug : qemuGetProcessInfo:1486 : Got status for 3500/3505 user=1507 sys=209 cpu=1 rss=531128
2016-05-18 06:01:45.920+0000: 3069: debug : qemuGetProcessInfo:1486 : Got status for 3500/3506 user=1150 sys=55 cpu=0 rss=531128
2016-05-18 06:01:45.920+0000: 3069: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7f5a64009bf0
2016-05-18 06:01:45.920+0000: 3069: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7f5a70004070
2016-05-18 06:01:45.920+0000: 3069: info : virObjectUnref:261 : OBJECT_DISPOSE: obj=0x7f5a70004070
2016-05-18 06:01:45.920+0000: 3069: debug : virDomainDispose:313 : release domain 0x7f5a70004070 a1 048f8624-03fc-4729-8f4d-12cb4387f018
2016-05-18 06:01:45.920+0000: 3069: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7f5a5c000ec0
2016-05-18 06:01:45.920+0000: 3069: debug : virThreadJobClear:121 : Thread 3069 (virNetServerHandleJob) finished job remoteDispatchDomainGetVcpus with ret=0
2016-05-18 06:01:45.920+0000: 3069: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7f5a5c0009c0
2016-05-18 06:01:45.920+0000: 3069: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7f5a5c0009c0
2016-05-18 06:01:45.920+0000: 3069: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7f5a930ff630
2016-05-18 06:01:45.920+0000: 3069: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7f5a93110890
2016-05-18 06:01:45.922+0000: 3070: warning : qemuDomainObjBeginJobInternal:2180 : Cannot start job (query, none) for domain a1; current job is (query, none) owned by (3066 remoteDispatchDomainGetBlockIoTune, 0 ) for (30s, 0s)
2016-05-18 06:01:45.922+0000: 3070: error : qemuDomainObjBeginJobInternal:2192 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainGetBlockIoTune)
2016-05-18 06:01:45.922+0000: 3070: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7f5a30178fe0
2016-05-18 06:01:45.922+0000: 3070: debug : virCgroupGetValueStr:814 : Get value /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-qemu\x2d1\x2da1.scope/cpuacct.usage
2016-05-18 06:01:45.922+0000: 3070: debug : virFileClose:102 : Closed fd 26
2016-05-18 06:01:45.922+0000: 3070: debug : virCgroupGetValueStr:814 : Get value /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-qemu\x2d1\x2da1.scope/cpuacct.stat
2016-05-18 06:01:45.922+0000: 3070: debug : virFileClose:102 : Closed fd 26
2016-05-18 06:01:45.922+0000: 3070: debug : qemuGetProcessInfo:1486 : Got status for 3500/3505 user=1507 sys=209 cpu=1 rss=531128
2016-05-18 06:01:45.922+0000: 3070: debug : virFileClose:102 : Closed fd 26
2016-05-18 06:01:45.922+0000: 3070: debug : qemuGetProcessInfo:1486 : Got status for 3500/3506 user=1150 sys=55 cpu=0 rss=531128
2016-05-18 06:01:45.922+0000: 3070: debug : virFileClose:102 : Closed fd 26
2016-05-18 06:01:45.922+0000: 3070: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a30178fe0
2016-05-18 06:01:45.922+0000: 3070: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a30178fe0
2016-05-18 06:01:45.922+0000: 3070: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7f5a30178fe0
2016-05-18 06:01:45.938+0000: 3065: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a93110890
2016-05-18 06:01:45.938+0000: 3065: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a930ff630
2016-05-18 06:01:45.939+0000: 3068: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a5c0009c0
2016-05-18 06:01:45.939+0000: 3068: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a5c0009c0
2016-05-18 06:01:45.939+0000: 3068: debug : virThreadJobSet:96 : Thread 3068 (virNetServerHandleJob) is now running job remoteDispatchDomainGetControlInfo
2016-05-18 06:01:45.939+0000: 3068: info : virObjectNew:202 : OBJECT_NEW: obj=0x7f5a6c002b50 classname=virDomain
2016-05-18 06:01:45.939+0000: 3068: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a5c000ec0
2016-05-18 06:01:45.939+0000: 3068: debug : virDomainGetControlInfo:2526 : dom=0x7f5a6c002b50, (VM: name=a1, uuid=048f8624-03fc-4729-8f4d-12cb4387f018), info=0x7f5a79a50ad0, flags=0
2016-05-18 06:01:45.939+0000: 3068: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a64009bf0
2016-05-18 06:01:47.940+0000: 3065: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a93110890
2016-05-18 06:01:47.940+0000: 3065: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a930ff630
2016-05-18 06:01:47.940+0000: 3067: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a5c0009c0
2016-05-18 06:01:47.940+0000: 3067: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a5c0009c0
2016-05-18 06:01:47.940+0000: 3067: debug : virThreadJobSet:96 : Thread 3067 (virNetServerHandleJob) is now running job remoteDispatchDomainGetControlInfo

Please note that these calls begin and never end; previously, calls began and ended within a couple of milliseconds at most.

Version-Release number of selected component (if applicable):
1.3.4

How reproducible:
100% of the time, but the issue takes a random amount of time to surface. I've seen it happen after a few minutes or after a few hours.

Steps to Reproduce:
1. Create a QEMU domain on NFS shared storage (see the attached dom.xml for an example).
2. Periodically run monitoring calls, each one protected by virDomainGetControlInfo() as shown above (a minimal monitoring-loop sketch is appended at the end of this report).
3. Wait some time, at least a couple of hours, to have a good chance of seeing the failure.

Actual results:
After a random amount of time, virDomainGetControlInfo() stops responding.

Expected results:
virDomainGetControlInfo() keeps responding.

Additional info:
1. I can't tell whether this is an interplay between libvirt calls, so I can't tell whether calling only virDomainGetControlInfo() periodically is sufficient to make it hang.
2. When virDomainGetControlInfo() hangs, virsh stops responding as well (even "virsh -r list" hangs).
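To illustrate step 2 of the reproducer, here is a minimal monitoring loop (a rough sketch only, using the libvirt Python bindings; it assumes the attached dom.xml defines the domain 'a1' with a disk target 'vda', and it is not the actual Vdsm code, which drives these calls from a thread pool):

    import time
    import libvirt

    DISK = 'vda'  # assumption: the disk target defined in dom.xml

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('a1')

    while True:
        # Check the domain control state before issuing any call that
        # goes down to the QEMU monitor.
        state, details, stateTime = dom.controlInfo()
        if state == libvirt.VIR_DOMAIN_CONTROL_OK:
            # These calls can hang once the NFS backend becomes
            # unreachable.
            dom.blockIoTune(DISK)
            dom.vcpus()
            conn.getAllDomainStats()
        time.sleep(2)

This only illustrates the call pattern; the exact call set and timing in Vdsm differ.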