[Dell Vostro 430] Regression: Kernel 3.2.0-64 problems with USB3 controller

Bug #1330530 reported by Maciej Puzio
This bug affects 3 people
Affects: linux (Ubuntu)
Status: Triaged
Importance: Medium
Assigned to: Unassigned

Bug Description

On a Dell Vostro 430 with a HighPoint RocketU 1144C USB 3.0 controller and an Areca ARC-5040 USB 3.0 RAID enclosure, boot stalls in an error loop when all of the following conditions are met:
1. The system is booted with kernel 3.2.0-64.
2. The HighPoint RocketU 1144C controller is installed.
3. The Areca ARC-5040 is connected to that controller.

During the error loop, the following messages are logged repeatedly:
[ 34.084469] usb 8-1: reset SuperSpeed USB device number 2 using xhci_hcd
[ 34.101825] xhci_hcd 0000:05:00.0: xHCI xhci_drop_endpoint called with disabled ep ffff88042102e000
[ 34.101918] xhci_hcd 0000:05:00.0: xHCI xhci_drop_endpoint called with disabled ep ffff88042102e040
This continues for about 18 minutes, after which the filesystem on the Areca drive is mounted and the boot process continues successfully, as if nothing had happened. Afterwards the affected drive seems to work fine, although I experienced some system instability that ended in a total system freeze. At this point I am not sure whether this instability is related to the problem at hand.
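
For anyone trying to confirm the same loop from a saved dmesg capture, a minimal Python sketch (not part of the original report) could count the reset and xhci_drop_endpoint messages and estimate how long the loop lasted. The function name summarize_xhci_loop and the regular expressions are illustrative assumptions modeled on the three sample lines above; other kernels may word these messages differently.

```python
import re

# Patterns modeled on the sample lines in this report (an assumption).
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")
RESET_RE = re.compile(r"reset SuperSpeed USB device number \d+ using xhci_hcd")
DROP_RE = re.compile(r"xhci_drop_endpoint called with disabled ep")


def summarize_xhci_loop(lines):
    """Return (resets, drops, seconds_spanned) for the matching log lines."""
    resets = drops = 0
    first = last = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a timestamped kernel log line
        stamp, text = float(match.group(1)), match.group(2)
        is_reset = bool(RESET_RE.search(text))
        is_drop = bool(DROP_RE.search(text))
        if not (is_reset or is_drop):
            continue  # unrelated kernel message
        resets += is_reset
        drops += is_drop
        if first is None:
            first = stamp
        last = stamp
    span = 0.0 if first is None else last - first
    return resets, drops, span


# The three lines quoted in the bug description.
sample = [
    "[ 34.084469] usb 8-1: reset SuperSpeed USB device number 2 using xhci_hcd",
    "[ 34.101825] xhci_hcd 0000:05:00.0: xHCI xhci_drop_endpoint called with disabled ep ffff88042102e000",
    "[ 34.101918] xhci_hcd 0000:05:00.0: xHCI xhci_drop_endpoint called with disabled ep ffff88042102e040",
]
resets, drops, span = summarize_xhci_loop(sample)
print(f"resets={resets} drops={drops} span={span:.3f}s")
```

Run against a full capture (for example, the lines of a saved dmesg file), a span of roughly 1080 seconds would correspond to the 18-minute delay described here.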

I've attached a file generated by apport-cli -f -p linux --save filename.apport .

The problem did not appear if I booted an older kernel (e.g. 3.2.0-63), if the Areca enclosure was not attached, or if it was attached using another interface (USB2 or eSATA). The problem was also absent if I replaced the Areca enclosure with another USB3 device (a flash drive). The test machine's motherboard does not have a built-in USB3 controller, but I performed an additional test on another computer equipped with a NEC USB3 controller. That test was done with kernel 3.2.0-64 and the Areca enclosure, and did not replicate the problem. Thus I assume that it is the combination of the RocketU controller and a specific USB3 device that triggers the kernel regression.

Similar effects occur if the Areca enclosure is hot-plugged into a running system. In that case the OS boots fine (as the enclosure is absent during boot). After the Areca is plugged in, the drive is unavailable for 18 minutes, during which numerous errors like those above are logged. After the 18 minutes elapse, the drive is mounted and behaves normally.

Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1330530

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Maciej Puzio (maciej-puzio) wrote : Re: Regression: Kernel 3.2.0-64 problems with USB3 controller
Maciej Puzio (maciej-puzio) wrote :

A few notes regarding the contents of the apport file submitted above:
1. Times logged in syslog are incorrect and do not reflect the 18-minute delay. It appears that rsyslog is started after the delay and logs its startup time, not the real time of events.
2. Nouveau segfaults are not related to this bug report, and were occurring in older kernels as well.

As this bug has been replicated on various hardware, affects more than one user, and requested system information has been provided, I am changing status to 'confirmed'. When the nature of this problem is better understood, this bug may possibly be marked as a duplicate of bug 1328984.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Maciej Puzio (maciej-puzio) wrote :

Following the advice given for bug 1328984, I have tested the latest upstream kernel, and I am happy to report that the problem did not occur (neither during boot nor on hotplug).

Kernel tested: 3.16.0-031600rc1-generic x86_64
URL: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.16-rc1-utopic/

uname -a
Linux ubuntu 3.16.0-031600rc1-generic #201406160035 SMP Mon Jun 16 04:36:15 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest 3.2 upstream stable kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.2 stable kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.2.60-precise/

Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: kernel-da-key needs-bisect precise regression-update
penalvch (penalvch)
tags: added: apport-collected
tags: added: kernel-fixed-upstream-3.16-rc1 needs-reverse-bisect
removed: needs-bisect
penalvch (penalvch)
description: updated
tags: added: bios-outdated-2.4.0
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
penalvch (penalvch)
summary: - Regression: Kernel 3.2.0-64 problems with USB3 controller
+ [Dell Vostro 430] Regression: Kernel 3.2.0-64 problems with USB3
+ controller
Maciej Puzio (maciej-puzio) wrote :

I have tested the mainline kernel 3.2.60, and was able to reproduce the problem, with exactly the same symptoms as with kernel 3.2.0-64 (3.2.59).
Kernel URL: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.2.60-precise/

I also tested a Western Digital My Passport 2TB USB 3.0 drive (Part# WDBY8L0020BBL). This drive causes the same problems as the Areca ARC-5040 (with kernels 3.2.0-64 and 3.2.60), and no problems with kernel 3.16.0-031600rc1. Thus I have two USB3 devices that trigger the bug. As of now, the only constant element required for the bug to appear is the HighPoint RocketU 1144C USB3 controller.

penalvch (penalvch) wrote :

Maciej Puzio, the next step is to fully reverse commit bisect the kernel in order to identify the offending commit. Could you please do this following https://wiki.ubuntu.com/Kernel/KernelBisection#How_do_I_reverse_bisect_the_upstream_kernel.3F ?

Maciej Puzio (maciej-puzio) wrote :

I have tested 28 mainline kernels from 9 branches currently maintained (3.2, 3.4, 3.10, 3.11, 3.12, 3.13, 3.14, 3.15, 3.16), focusing on those that were built around the time the problematic commit was introduced (May-June 2014). The bug appears to affect the 3.2 branch exclusively. Thus I will now try to forward bisect commits from 3.2.58 (last good) to 3.2.59 (first bad).

Just for reference, here are the results of my testing. "Bad" means that the bug was reproducible in the given kernel, "good" that it was not.
3.2.58 good
3.2.59 bad
3.2.60 bad
3.4.89 good
3.4.90 good
3.4.91 good
3.4.92 good
3.4.93 good
3.4.94 good
3.10.44 good
3.11.10.11 good
3.12.22 good
3.13.11 good
3.13.11.1 good
3.13.11.2 good
3.13.11.3 good
3.14.8 good
3.15-rc1 good
3.15-rc2 good
3.15-rc3 good
3.15-rc4 good
3.15-rc5 good
3.15-rc6 good
3.15-rc7 good
3.15-rc8 good
3.15 good
3.15.1 good
3.16-rc1 good
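
The conclusion that only the 3.2 branch is affected can be checked mechanically by grouping results by branch. A minimal Python sketch, using only a few of the results listed above; the helper names branch_of and affected_branches are hypothetical and not part of any kernel tooling.

```python
from collections import defaultdict

# A subset of the results from the list above, as (version, verdict) pairs.
RESULTS = [
    ("3.2.58", "good"),
    ("3.2.59", "bad"),
    ("3.2.60", "bad"),
    ("3.4.94", "good"),
    ("3.10.44", "good"),
    ("3.14.8", "good"),
    ("3.15.1", "good"),
    ("3.16-rc1", "good"),
]


def branch_of(version):
    """Map a version to its branch, e.g. '3.2.58' -> '3.2', '3.16-rc1' -> '3.16'."""
    base = version.split("-")[0]          # drop any '-rcN' suffix
    major, minor = base.split(".")[:2]    # keep the first two components
    return f"{major}.{minor}"


def affected_branches(results):
    """Return the sorted branches that have at least one 'bad' kernel."""
    verdicts = defaultdict(set)
    for version, verdict in results:
        verdicts[branch_of(version)].add(verdict)
    return sorted(b for b, seen in verdicts.items() if "bad" in seen)


bad_branches = affected_branches(RESULTS)
print("Branches with a reproducible bug:", bad_branches)
```

With this subset, only the 3.2 branch is reported.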

Maciej Puzio (maciej-puzio) wrote :

I bisected commits between Ubuntu-3.2.0-63.95 and Ubuntu-3.2.0-64.97, and arrived at a specific xhci-related commit. However, manually modifying the relevant file to revert the effects of this commit yielded a kernel that still suffered from the regression. Further complicating the matter is the fact that the code in question is modified by two commits.

Since I was not able to verify the result of bisection, I am not posting it, to avoid confusion. Next week I will further debug the code, and post here when I get conclusive results.
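
For readers unfamiliar with the procedure, bisection is a binary search over an ordered commit range. Below is a minimal Python sketch under stated assumptions: commits are ordered oldest to newest, the first is known good and the last known bad, and the hypothetical is_bad callback stands in for building and booting a kernel at that commit. It is not the actual tooling (git bisect) and does not use the real commit list from this bug.

```python
def first_bad_commit(commits, is_bad):
    """Binary-search for the first bad commit.

    Assumes commits[0] is good, commits[-1] is bad, and there is a single
    good-to-bad transition. Returns (first_bad, number_of_tests).
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    tested = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tested += 1  # each test stands for one kernel build and boot
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi], tested


# Toy example: 16 placeholder commits, regression introduced at "c11".
commits = [f"c{i}" for i in range(16)]
culprit, builds = first_bad_commit(commits, lambda c: int(c[1:]) >= 11)
print(f"first bad commit: {culprit} after {builds} test builds")
```

A search like this assumes a single good-to-bad transition, so when more than one commit touches the same code, as noted above, the result still needs to be verified separately.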

Roman Shipovskij (roman-shipovskij) wrote :
Maciej Puzio (maciej-puzio) wrote :

After testing this thoroughly, I am confident in saying that the regression is caused by commit "usb: xhci: Prefer endpoint context dequeue pointer over stopped_trb". In the ubuntu-precise git repository this is commit f04e4b02bce3a0ce19f9673bbefde9b8c624c00a.
However, an equivalent commit is part of mainline kernel v3.16-rc1, where it does not cause problems. My guess is that this commit revealed a bug hidden somewhere else, in code that has been modified since kernel 3.2.

tags: removed: needs-reverse-bisect
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
penalvch (penalvch) wrote :

Maciej Puzio, while the initial regression commit is identified, the upstream fix commit has not been. Could you please provide this?

tags: added: bisect-done needs-reverse-bisect
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Maciej Puzio (maciej-puzio) wrote :

Christopher, due to the nature of this bug, I cannot perform the reverse bisect. I explained it already in comment #8. Just to be clearer: the regression has not been fixed upstream. There is no 3.x kernel branch which would contain the regression and the subsequent fix. The regression either is not there at all (all branches except 3.2), or remains unfixed (3.2). Thus I have no target for the bisection.

In somewhat happier news, I have spent the last few days debugging the kernel, and I got some results. Specifically, I have a patch that fixes the regression on 3.2.0-64 running on my particular hardware. The patch is rather ugly, and will probably cause problems on other hardware, but at least it points in some direction.

I will gladly discuss this matter, but I will need a little more attention shown by Ubuntu maintainers. Each of my posts corresponds to many hours of my work, and while I am grateful for any attention, its current level does not permit a constructive dialogue. I am very sorry to say this, and I mean no offense, but I will put more effort into writing here only when I see a chance that someone will read it.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
tags: removed: kernel-fixed-upstream-3.16-rc1 needs-reverse-bisect
Maciej Puzio (maciej-puzio) wrote :

I think that I may have found the bug, and since the newest upstream kernel 3.16.0-rc3 has the affected code essentially unmodified, I contacted the maintainer of the XHCI driver and the author of the problematic commit. I also asked for help on the linux-usb kernel mailing list:
http://permalink.gmane.org/gmane.linux.usb.general/110685
http://www.spinics.net/lists/linux-usb/msg109949.html

Maciej Puzio (maciej-puzio) wrote :

Julius Werner, the author of the commit in question, has found the problem and created a patch. The problem is in the place that I identified, but the specific regression-triggering details are different than I originally thought. The patch is available here:
https://lkml.org/lkml/2014/7/8/571

I tested this patch and can confirm that it fixes the regression in kernel 3.2.x. Newer kernels are not affected by the regression, as it is masked by another code change that has not been backported to 3.2. Here is the link to the discussion:
http://thread.gmane.org/gmane.linux.usb.general/110685

As I understand it, we are now waiting for Julius' patch to be merged into the mainline kernel.

Maciej Puzio (maciej-puzio) wrote :

The patch has been somewhat reworked and added to the upstream mainline kernel 3.17-rc3, and to the 3.16-stable tree.
https://lkml.org/lkml/2014/8/29/386
http://www.spinics.net/lists/stable/msg59724.html

penalvch (penalvch)
tags: added: cherry-pick
Changed in linux (Ubuntu):
status: Confirmed → Triaged
Maciej Puzio (maciej-puzio) wrote :

One more thing: the patch as included in the upstream mainline kernel (3.17) will fail to build in the 3.2 branch, because it removes the function find_trb_seg, which is still needed in the 3.2 kernel. This function should be left in place when the patch is applied to the 3.2 kernel. Further details are available here: http://thread.gmane.org/gmane.linux.kernel.stable/103514

By the way, I checked the upstream 3.2.y kernel (3.2.62), and unfortunately the patch has not yet been backported to it.
