Activity log for bug #1798110

Date Who What changed Old value New value Message
2018-10-16 14:37:19 Mauricio Faria de Oliveira bug added bug
2018-10-16 15:00:05 Ubuntu Kernel Bot linux (Ubuntu): status New Incomplete
2018-10-16 15:00:06 Ubuntu Kernel Bot tags xenial
2018-10-16 15:06:10 Mauricio Faria de Oliveira linux (Ubuntu): status Incomplete Confirmed
2018-10-16 15:28:48 Mauricio Faria de Oliveira description
Old value:
(I'll add the SRU template + testing steps and post to ML shortly.)
A customer reported a CPU soft lockup on Trusty HWE kernel from Xenial when detaching a virtio-scsi drive, and provided a crashdump that shows 2 things:
1) The soft locked up CPU is waiting for another CPU to finish something, and that does not happen because the other CPU is infinitely looping in virtscsi_target_destroy().
2) The loop happens because the 'tgt->reqs' counter is non-zero, and that probably happened due to a missing decrement in SCSI command requeue path, exercised when the virtio ring is full.
The reported problem itself happens because of a downstream/SAUCE patch, coupled with the problem of the missing decrement for the reqs counter. Introducing a decrement in the SCSI command requeue path resolves the problem, verified synthetically with QEMU+GDB and with test-case/loop provided by the customer as problem reproducer.
New value:
[Impact]
* Detaching virtio-scsi disk in Xenial guest can cause CPU soft lockup in guest (and take 100% CPU in host).
* It may prevent further progress on other tasks that depend on resources locked earlier in the SCSI target removal stack, and/or impact other SCSI functionality.
* The fix resolves a corner case in the requests counter in the virtio SCSI target, which impacts a downstream (SAUCE) patch in the virtio-scsi target removal handler that depends on the requests counter.
[Test Case]
* See LP #1798110 (this bug)'s comment #3 (too long for this section -- synthetic case with GDB+QEMU) and comment #4 (organic test case in cloud instance).
[Regression Potential]
* It seems low -- this only affects the SCSI command requeue path with regards to the reference counter, which is only used with real chance of problems in our downstream patch (which is now passing this testcase).
* The other less serious issue would be decrementing it to a negative / < 0 value, which is not possible with this driver logic (see commit message), because the reqs counter is always incremented before calling virtscsi_queuecommand(), where this decrement operation is inserted.
[Original Description]
A customer reported a CPU soft lockup on Trusty HWE kernel from Xenial when detaching a virtio-scsi drive, and provided a crashdump that shows 2 things:
1) The soft locked up CPU is waiting for another CPU to finish something, and that does not happen because the other CPU is infinitely looping in virtscsi_target_destroy().
2) The loop happens because the 'tgt->reqs' counter is non-zero, and that probably happened due to a missing decrement in SCSI command requeue path, exercised when the virtio ring is full.
The reported problem itself happens because of a downstream/SAUCE patch, coupled with the problem of the missing decrement for the reqs counter. Introducing a decrement in the SCSI command requeue path resolves the problem, verified synthetically with QEMU+GDB and with test-case/loop provided by the customer as problem reproducer.
2018-10-16 15:37:02 Mauricio Faria de Oliveira description
Old value:
[Impact]
* Detaching virtio-scsi disk in Xenial guest can cause CPU soft lockup in guest (and take 100% CPU in host).
* It may prevent further progress on other tasks that depend on resources locked earlier in the SCSI target removal stack, and/or impact other SCSI functionality.
* The fix resolves a corner case in the requests counter in the virtio SCSI target, which impacts a downstream (SAUCE) patch in the virtio-scsi target removal handler that depends on the requests counter.
[Test Case]
* See LP #1798110 (this bug)'s comment #3 (too long for this section -- synthetic case with GDB+QEMU) and comment #4 (organic test case in cloud instance).
[Regression Potential]
* It seems low -- this only affects the SCSI command requeue path with regards to the reference counter, which is only used with real chance of problems in our downstream patch (which is now passing this testcase).
* The other less serious issue would be decrementing it to a negative / < 0 value, which is not possible with this driver logic (see commit message), because the reqs counter is always incremented before calling virtscsi_queuecommand(), where this decrement operation is inserted.
[Original Description]
A customer reported a CPU soft lockup on Trusty HWE kernel from Xenial when detaching a virtio-scsi drive, and provided a crashdump that shows 2 things:
1) The soft locked up CPU is waiting for another CPU to finish something, and that does not happen because the other CPU is infinitely looping in virtscsi_target_destroy().
2) The loop happens because the 'tgt->reqs' counter is non-zero, and that probably happened due to a missing decrement in SCSI command requeue path, exercised when the virtio ring is full.
The reported problem itself happens because of a downstream/SAUCE patch, coupled with the problem of the missing decrement for the reqs counter. Introducing a decrement in the SCSI command requeue path resolves the problem, verified synthetically with QEMU+GDB and with test-case/loop provided by the customer as problem reproducer.
New value:
[Impact]
 * Detaching virtio-scsi disk in Xenial guest can cause
   CPU soft lockup in guest (and take 100% CPU in host).
 * It may prevent further progress on other tasks that
   depend on resources locked earlier in the SCSI target
   removal stack, and/or impact other SCSI functionality.
 * The fix resolves a corner case in the requests counter
   in the virtio SCSI target, which impacts a downstream
   (SAUCE) patch in the virtio-scsi target removal handler
   that depends on the requests counter value to be zero.
[Test Case]
 * See LP #1798110 (this bug)'s comment #3 (too long for
   this section -- synthetic case with GDB+QEMU) and
   comment #4 (organic test case in cloud instance).
[Regression Potential]
 * It seems low -- this only affects the SCSI command requeue
   path with regards to the reference counter, which is only
   used with real chance of problems in our downstream patch
   (which is now passing this testcase).
 * The other less serious issue would be decrementing it to
   a negative / < 0 value, which is not possible with this
   driver logic (see commit message), because the reqs counter
   is always incremented before calling virtscsi_queuecommand(),
   where this decrement operation is inserted.
[Original Description]
A customer reported a CPU soft lockup on Trusty HWE kernel from Xenial when detaching a virtio-scsi drive, and provided a crashdump that shows 2 things:
1) The soft locked up CPU is waiting for another CPU to finish something, and that does not happen because the other CPU is infinitely looping in virtscsi_target_destroy().
2) The loop happens because the 'tgt->reqs' counter is non-zero, and that probably happened due to a missing decrement in SCSI command requeue path, exercised when the virtio ring is full.
The reported problem itself happens because of a downstream/SAUCE patch, coupled with the problem of the missing decrement for the reqs counter. Introducing a decrement in the SCSI command requeue path resolves the problem, verified synthetically with QEMU+GDB and with test-case/loop provided by the customer as problem reproducer.
2018-10-16 18:37:23 Joseph Salisbury linux (Ubuntu): importance Undecided Medium
2018-10-16 18:37:26 Joseph Salisbury linux (Ubuntu): status Confirmed Triaged
2018-10-16 18:37:40 Joseph Salisbury nominated for series Ubuntu Xenial
2018-10-16 18:37:40 Joseph Salisbury bug task added linux (Ubuntu Xenial)
2018-10-16 18:37:45 Joseph Salisbury linux (Ubuntu Xenial): status New Triaged
2018-10-16 18:37:48 Joseph Salisbury linux (Ubuntu Xenial): importance Undecided Medium
2018-10-24 09:58:40 Kleber Sacilotto de Souza linux (Ubuntu Xenial): status Triaged Fix Committed
2018-10-25 08:04:39 Brad Figg tags xenial verification-needed-xenial xenial
2018-10-25 14:07:17 Mauricio Faria de Oliveira tags verification-needed-xenial xenial verification-done-xenial xenial
2018-10-25 14:41:31 David Coronel bug added subscriber David Coronel
2018-11-13 17:53:26 Launchpad Janitor linux (Ubuntu Xenial): status Fix Committed Fix Released
2018-11-13 17:53:26 Launchpad Janitor cve linked 2018-7755
2019-07-24 21:16:27 Brad Figg tags verification-done-xenial xenial cscc verification-done-xenial xenial