(From an IRC interaction with Dan Berrangé [danpb] and Matthew Booth
[mdbooth].)
Dan did some log analysis. From the log message below (taken from the
libvirt debug logs), he deduced some clues about the root cause:
[...]
2017-03-16 01:01:33.096+0000: 26064: debug : virNetlinkStartup:138 : Running global netlink initialization
[...]
[danpb]: "The above is showing libvirtd has been restarted, so
presumably it crashed, causing Nova to see the EOF. We should probably
make sure OpenStack CI does *not* restart things, as it just hides the
obvious failure!"
And after looking at the error from 'syslog':
[...]
Mar 16 01:01:32 ubuntu-xenial-infracloud-vanilla-7903827 libvirtd[17211]: *** Error in `/usr/sbin/libvirtd': malloc(): memory corruption: 0x0000562d21527f50 ***
[...]
[danpb]: "So it's a bug in libvirt in Ubuntu. Which is probably fixed
sometime after 1.3.1."
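The two log signatures discussed above (the glibc abort message in
syslog and the netlink initialization message in the libvirt debug log)
can be searched for mechanically. The following is only a minimal
sketch, not something from the IRC discussion: the function names and
the regular expression are assumptions made for illustration, and the
pattern is derived solely from the single syslog line quoted above, so
it may not match other glibc error formats.

```python
import re

# Pattern based on the one syslog line quoted in this report (assumption:
# other glibc abort messages follow the same "*** Error in `<path>': ..." shape).
CRASH_RE = re.compile(
    r"libvirtd\[(\d+)\]: \*\*\* Error in `[^']*libvirtd': (.+?): 0x[0-9a-f]+ \*\*\*"
)

# Message libvirtd logs (at debug level) while initializing netlink on startup,
# as seen in the debug log excerpt above.
STARTUP_MARKER = "Running global netlink initialization"


def find_crashes(syslog_lines):
    """Return a (pid, reason) tuple for each libvirtd glibc abort message."""
    hits = []
    for line in syslog_lines:
        match = CRASH_RE.search(line)
        if match:
            hits.append((int(match.group(1)), match.group(2)))
    return hits


def count_startups(debug_lines):
    """Count how many times the startup marker appears in the debug log."""
    return sum(1 for line in debug_lines if STARTUP_MARKER in line)


if __name__ == "__main__":
    import sys

    # Hypothetical usage: pass a syslog path and a libvirtd debug log path.
    with open(sys.argv[1], errors="replace") as f:
        print("crashes:", find_crashes(f))
    with open(sys.argv[2], errors="replace") as f:
        print("startups:", count_startups(f))
```

A startup marker that appears in the middle of a test run, shortly after
a crash message, would be consistent with the restart that Dan inferred;
on its own it does not prove that a crash occurred.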
NOTE (kashyap): There is an existing bug for the above memory corruption
here:
https://bugs.launchpad.net/nova/+bug/1643911/ -- 'libvirt randomly
crashes on xenial nodes with "*** Error in `/usr/sbin/libvirtd':
malloc(): memory corruption:"'
After some code inspection, Dan pointed out two potential commits from
libvirt that are likely candidates to have fixed the issue:
(1) One possible candidate in 1.3.2 would be:
'qemu: Process monitor EOF in a job' --
https://libvirt.org/git/?p=libvirt.git;a=commit;h=8c9ff99
(2) This might be relevant, too, but less likely:
'qemu: Avoid calling qemuProcessStop without a job' --
https://libvirt.org/git/?p=libvirt.git;a=commit;h=81f50cb
He notes further: "the first [libvirt] commit hash noted above (8c9ff99)
is something you'd want to encourage Ubuntu maintainers to add as I
don't see it in their patches right now."