[Dell PowerEdge R510] Regression: Kernel 3.2.0-64 fails to boot with USB3 controller card

Bug #1328984 reported by Maciej Puzio
This bug affects 2 people
Affects: linux (Ubuntu)
Status: Confirmed
Importance: High
Assigned to: Unassigned

Bug Description

A routine system update of Ubuntu 12.04 LTS to kernel 3.2.0-64 resulted in an unbootable system on two machines. Further testing revealed that the kernel fails while initializing the HighPoint RocketU 1144C USB 3.0 controller. This is a PCIe x4 add-in card that contains four USB 3.0 ports, each equipped with its own controller. The card worked, and still works, without any problems with kernel 3.2.0-63 and earlier. Prior to installing kernel 3.2.0-64 there were neither hardware nor software problems with either of the machines.

Steps to reproduce:
apt-get dist-upgrade
sync
reboot
Result: system fails to boot.

The workaround is to revert to kernel 3.2.0-63 or to remove the RocketU card.

Hardware description (same on both machines):
Dell PowerEdge R510
PERC6/i RAID controller
64GB RAM DDR3 ECC registered
Dual CPU: Intel Xeon X5660 2.80GHz
HighPoint RocketU 1144C 4-Port USB 3.0 PCIe 2.0 x4 HBA

Operating system (identical on both machines):
Ubuntu 12.04.4 LTS
Linux 3.2.0-64-generic x86_64

Drives:
sda - logical drive on PERC6/i, OS
sdb - logical drive on PERC6/i, data
sdc - Areca 5040 external RAID connected by USB3 to RocketU card
sdd - Areca 5040 external RAID connected by USB3 to RocketU card
sde - Areca 5040 external RAID connected by USB3 to RocketU card

Symptoms:
The system boots normally until initialization of the Areca drives connected to the RocketU card. The following messages are displayed on screen when booting without quiet and with debug options. These are the last messages of the "typical" part of the boot sequence. They are followed by a ~2 minute lag during which no messages are displayed.

[Please note that no trace of the boot progress gets recorded in system logs, and messages on screen scroll very fast. I had to record the boot progress with a high framerate camera, and even so some messages scrolled too fast and were not recorded. The following is a manual transcript of fragments of these videos; please forgive inevitable typos.]

[5.621523] scsi 5:0:0:0: Direct-Access Areca Areca5 PQ: 0 ANSI: 5
[5.622896] sd 5:0:0:0: Attached scsi generic sg4 type 0
[5.623230] sd 5:0:0:0: [sdc] Very big device. Trying to use READ CAPACITY(16).
[5.623668] sd 5:0:0:0: [sdc] 41015622144 512-byte logical blocks: (20.9 TB/19.0 TiB)
[5.741152] scsi 6:0:0:0: Direct-Access Areca Areca3 PQ: 0 ANSI: 5
[5.744003] sd 6:0:0:0: Attached scsi generic sg5 type 0
[5.744545] sd 6:0:0:0: [sdd] Very big device. Trying to use READ CAPACITY(16).
[5.744980] sd 6:0:0:0: [sdd] 41015622144 512-byte logical blocks: (20.9 TB/19.0 TiB)
[6.004526] scsi 7:0:0:0: Direct-Access Areca Areca7 PQ: 0 ANSI: 5
[6.006121] sd 7:0:0:0: Attached scsi generic sg6 type 0
[6.006488] sd 7:0:0:0: [sde] Very big device. Trying to use READ CAPACITY(16).
[6.006834] sd 7:0:0:0: [sde] 35156217552 512-byte logical blocks: (17.9 TB/16.3 TiB)
[7.133091] Adding 46874620k swap on /dev/sda3. Priority: -1 extents:1 across 46874620k
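
The capacity figures in the sd lines above can be checked from the block counts. The following is a minimal sketch, assuming the one-decimal values are truncated rather than rounded (an assumption that happens to match the transcript); the function names are made up for illustration.

```python
# Sketch: reproduce the "(20.9 TB/19.0 TiB)" style figures from the sd log lines.
# Assumption: the kernel truncates to one decimal place instead of rounding.

SECTOR_BYTES = 512  # "512-byte logical blocks" in the log


def one_decimal_truncated(num_bytes: int, unit: int) -> str:
    """Express num_bytes in the given unit with one decimal, truncated."""
    tenths = num_bytes * 10 // unit  # integer arithmetic avoids float error
    return f"{tenths // 10}.{tenths % 10}"


def describe(blocks: int) -> str:
    """Format a block count as decimal terabytes and binary tebibytes."""
    num_bytes = blocks * SECTOR_BYTES
    tb = one_decimal_truncated(num_bytes, 10**12)
    tib = one_decimal_truncated(num_bytes, 2**40)
    return f"{tb} TB/{tib} TiB"


if __name__ == "__main__":
    for name, blocks in (("sdc", 41015622144), ("sdd", 41015622144), ("sde", 35156217552)):
        print(name, describe(blocks))
```

With rounding instead of truncation, the first value would come out as 21.0 TB, so the transcribed figures are consistent with truncation.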

After a two minute delay, the following messages appear in an infinite loop. Please note that these messages appear in a somewhat random sequence, and not all messages appear on every boot. The only thing that works at this point is Ctrl-Alt-Delete.

udevd[632]: timeout: killing '/sbin/modprobe -bv acpi:ACPI000D:PMP0C01:' [774]
udevd[703]: timeout: killing '/sbin/modprobe -bv acpi:PMP0C014:' [776]
udevd[529]: timeout: killing '/sbin/modprobe -bv input:b0003v0557p2261e0110-e0,1,2,3,4,k110,111,112,r8,a0,1,m4,lsfw' [1642]
udevd[630]: timeout: killing '/sbin/modprobe -bv serio:ty06pr00id00ex00' [655]
udevd[508]: timeout: killing '/sbin/modprobe -bv pci:v00008086d0000342Esv00000000sd00000000bc00sc00i00' [512]
udevd[494]: timeout: killing '/sbin/modprobe -bv input:b0019v0000p0001e0000-r0,1,k74,ramlsfw' [771]
udevd[699]: timeout: killing '/sbin/modprobe -bv dmi:bvnDellInc.:bvr1.12.0:bd07/26/2013:svnDellInc.:pnPowerEdgeR510:pvr:rvnDellInc.:rm00HDP0:rvr002:cvnDellInc.:ct23:cvr:' [708]
udevd[529]: timeout: killing '/sbin/modprobe -bv input:b0003v0557p2261e0110-e0,1,2,3,4,k71,72,73,74,77,80,82,83,85,86,87,88,89,8A,8B,8C,8E,8F,90,96,98,9B,9C,9E,9F,A1,A3,A4,A5,A6,A7,A8,A9,AB,AC,AD,AE,B1,B2,B5,CE,CF,D0,D1,D2,D4,D8,D9,DB,E4,EA,EB,F1,100,161,162,166,16A,16E,172,174,176,178,179,17A,17B,17C,17D,17F,180,182,182,185,188,189,18C,18D,18E,18F,190,191,192,193,195,198,199,19A,1A9,1A1,1A2,1A3,1A4,1A5,1A6,1A7,1A8,1A9,1AA,1AB,1AC,1AD,1AE,1B0,1B1,1B7,1BA,r6,a20,m4,lsfw' [1678]
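
Because these messages scroll by quickly, it may help to pull out the fields of each line when working from a transcript. The following is a rough sketch, assuming the bracketed numbers are process IDs (the udevd worker first, the killed modprobe child last); the field names are made up for illustration.

```python
import re

# Matches lines of the form:
#   udevd[630]: timeout: killing '/sbin/modprobe -bv serio:...' [655]
UDEVD_TIMEOUT = re.compile(
    r"udevd\[(?P<worker>\d+)\]: timeout: killing "
    r"'/sbin/modprobe -bv (?P<modalias>[^']+)' \[(?P<child>\d+)\]"
)


def parse_timeout(line):
    """Return the fields of a udevd timeout line, or None if it does not match."""
    m = UDEVD_TIMEOUT.search(line)
    if m is None:
        return None
    modalias = m.group("modalias")
    return {
        "worker_pid": int(m.group("worker")),   # assumed: PID of the udevd worker
        "child_pid": int(m.group("child")),     # assumed: PID of the killed modprobe
        "subsystem": modalias.split(":", 1)[0],  # text before the first colon
        "modalias": modalias,
    }


if __name__ == "__main__":
    sample = "udevd[630]: timeout: killing '/sbin/modprobe -bv serio:ty06pr00id00ex00' [655]"
    print(parse_timeout(sample))
```

Grouping the parsed lines by subsystem shows that the timeouts hit unrelated device classes (acpi, input, serio, pci, dmi), not only USB devices.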

After pressing Ctrl-Alt-Delete, the above messages continue to appear for a few seconds, and after that the following messages are displayed:

An error occurred while mounting /mnt/sdb.
mountall: mount /mnt/sdb [1785] killed by KILL signal
mountall: Filesystem could not be mounted: /mnt/sdb
 * Killing all remaining processes... [Press
S to skip mounting or M for manual recovery
fail]
rpcbind: rpcbind terminating on signal. Restart with "rpcbind -w"
 * Deconfiguring network interfaces [ OK ]
 * Deactivating swap... [ OK ]
 * Unmounting local filesystems... [ OK ]
 * Will now restart
[184.341144] hub 4-0:1.0: hub_port_status failed (err = -110)
[184.341222] hub 4-0:1.0: hub_port_status failed (err = -110)
[201.324536] usb 16-1: device not accepting address 2, error -62
[201.380907] sd 7:0:0:0: [sde] Asking for cache data failed
[201.380980] sd 7:0:0:0: [sde] Assuming drive cache: write through
[201.381767] sd 7:0:0:0: [sde] Asking for cache data failed
[201.381840] sd 7:0:0:0: [sde] Assuming drive cache: write through
[201.382457] sd 7:0:0:0: [sde] Asking for cache data failed
[201.382530] sd 7:0:0:0: [sde] Assuming drive cache: write through
[211.880194] usb 12-1: device not accepting address 2, error -62
[211.936396] sd 6:0:0:0: [sdd] Asking for cache data failed
[211.936466] sd 6:0:0:0: [sdd] Assuming drive cache: write through
[222.435967] usb 10-1: device not accepting address 2, error -62

After the last message, the screen goes blank and the machine reboots.
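
For readers unfamiliar with the numeric codes in the hub and usb messages above, they are negated Linux errno values. The following is a small sketch that annotates such lines; the table is hard-coded with the Linux values for the two codes seen here, and the function name is made up for illustration.

```python
import re

# Linux errno values for the two codes in the transcript. Hard-coded because
# Python's errno module reflects the host platform, which may not be Linux.
LINUX_ERRNO = {
    62: ("ETIME", "Timer expired"),
    110: ("ETIMEDOUT", "Connection timed out"),
}

# Matches both "(err = -110)" and "error -62".
ERR_PATTERN = re.compile(r"(?:err = |error )-(\d+)")


def annotate(line: str) -> str:
    """Append the symbolic errno name to a kernel message, if one is recognised."""
    m = ERR_PATTERN.search(line)
    if m is None:
        return line
    entry = LINUX_ERRNO.get(int(m.group(1)))
    if entry is None:
        return line
    name, text = entry
    return f"{line}  # {name}: {text}"


if __name__ == "__main__":
    print(annotate("[184.341144] hub 4-0:1.0: hub_port_status failed (err = -110)"))
    print(annotate("[201.324536] usb 16-1: device not accepting address 2, error -62"))
```

Both codes are timeouts, which is consistent with the long delays seen during boot.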

Additional note:
Not sure if this is related, but while looking for existing bug reports, I have found several posts about kernel 3.2.0-64 regressing in USB 3.0 support:
https://bugs.launchpad.net/software-center/+bug/1328883
http://www.linuxquestions.org/questions/linux-software-2/sudden-loss-of-usb-3-0-on-ubuntu-12-04-64-bit-kernel-3-2-0-64-generic-4175507335/

Note about attachments:
Due to kernel 3.2.0-64 not being able to boot, the attached command output was obtained using kernel 3.2.0-63.

Revision history for this message
Maciej Puzio (maciej-puzio) wrote :
affects: software-center → linux-meta (Ubuntu)
Brad Figg (brad-figg)
affects: linux-meta (Ubuntu) → linux (Ubuntu)
Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1328984

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Maciej Puzio (maciej-puzio) wrote : Re: Regression: Kernel 3.2.0-64 fails to boot with USB3 controller card

I am unable to run the apport-collect command for two reasons:

1. The bug in question renders the machine unbootable. To boot the machine and run apport-collect, it is necessary to change either the software or hardware configuration. This would create an environment in which the bug is not reproducible.

2. Two affected machines are servers located behind a corporate proxy. Apport-collect does not allow transmission through a proxy.

As instructed, I am changing the status to 'Confirmed'.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
penalvch (penalvch) wrote :

Maciej Puzio:

1) The purpose of the apport-collect is to gather the necessary debug information about your hardware, not provide logs about the crash.

2) One may utilize https://help.ubuntu.com/community/ReportingBugs#Filing_bugs_when_off-line to attach the file to this report.

Changed in linux (Ubuntu):
importance: Undecided → High
status: Confirmed → Incomplete
Revision history for this message
Maciej Puzio (maciej-puzio) wrote :

Thank you for the information. I collected apport data, but I am currently unable to submit it, as the procedure outlined on the linked page still requires usage of tools that do not work with a proxy. Unless I run into more apport-related problems, I will be able to submit it later today or tomorrow.

In the meantime, I managed to replicate the regression on different hardware (still using HighPoint RocketU 1144C and Areca ARC-5040). Should I include my observations, error logs and apport-collected data from that machine in this bug report, or should I create a new bug?

Revision history for this message
Maciej Puzio (maciej-puzio) wrote :

I am unable to submit apport-collected data due to what appears to be numerous bugs in apport tools:

1. Submitting data directly from the affected machine is not possible due to apport not being able to connect through a proxy.

2. Following the instructions on the referenced page to submit a previously created .apport file using ubuntu-bug does not work, because ubuntu-bug ignores the file specified in the -c or --crash-file parameter, and proceeds to gather system information from the machine on which it is running. As this is the wrong machine, I had no choice but to cancel the submission.

3. In addition, ubuntu-bug generates the following error:
No packages found matching linux.
ERROR: hook /usr/share/apport/general-hooks/cloud_archive.py crashed

At this point debugging apport is not my priority. I would be grateful if you indicate an alternative way to proceed.

Revision history for this message
penalvch (penalvch) wrote :

Maciej Puzio, could you please boot into 14.04 Server via http://releases.ubuntu.com/trusty/ (or 10.04/other release if that doesn't work), gather the apport-collect file, use a USB stick to move the data to another computer, and attach the file manually to this report?

Revision history for this message
Maciej Puzio (maciej-puzio) wrote :

I'm attaching the file generated by command:
apport-cli -f -p linux --save bug1328984.apport
This was done with the machine running 12.04 Server with kernel 3.2.0-63.

May I ask for a reply to my question about the results of testing the problem on different hardware? (The regression has been reproduced, but symptoms are somewhat different, and I worry about confusion which may result from mixing discussion about two hardware configurations. On the other hand, I am reluctant to start a new bug report, as this appears to be a single bug.)

Revision history for this message
penalvch (penalvch) wrote :

Maciej Puzio, it's strongly preferred one reports on a per hardware basis, as it's cheap to mark something a duplicate of another if determined as such.

Revision history for this message
penalvch (penalvch) wrote :

Maciej Puzio, could you please test the latest upstream kernel available from the first line at the top page (not the daily folder) following https://wiki.ubuntu.com/KernelMainlineBuilds ? It will allow additional upstream developers to examine the issue. Once you've tested the upstream kernel, please comment on which kernel version specifically you tested. If this bug is fixed in the mainline kernel, please add the following tags:
kernel-fixed-upstream
kernel-fixed-upstream-VERSION-NUMBER

where VERSION-NUMBER is the version number of the kernel you tested. For example:
kernel-fixed-upstream-3.15

This can be done by clicking on the yellow circle with a black pencil icon next to the word Tags located at the bottom of the bug description. As well, please remove the tag:
needs-upstream-testing

If the mainline kernel does not fix this bug, please add the following tags:
kernel-bug-exists-upstream
kernel-bug-exists-upstream-VERSION-NUMBER

As well, please remove the tag:
needs-upstream-testing

Once testing of the upstream kernel is complete, please mark this bug's Status as Confirmed. Please let us know your results. Thank you for your understanding.
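
The tagging rules above can be summarized as a small decision function. This is only an illustrative sketch of the instructions in this comment, not a Launchpad API; the function name and return shape are made up.

```python
def upstream_tags(fixed_upstream: bool, version: str) -> dict:
    """Return the tags to add and remove after testing a mainline kernel.

    fixed_upstream: True if the bug does NOT occur on the tested mainline kernel.
    version: the mainline kernel version tested, e.g. "3.15".
    """
    prefix = "kernel-fixed-upstream" if fixed_upstream else "kernel-bug-exists-upstream"
    return {
        "add": [prefix, f"{prefix}-{version}"],
        # The testing request is satisfied either way, so this tag is removed.
        "remove": ["needs-upstream-testing"],
    }


if __name__ == "__main__":
    print(upstream_tags(True, "3.15"))
    print(upstream_tags(False, "3.15"))
```

In both cases the bug status is then set to Confirmed, as requested above.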

tags: added: 1.12.0 latest-bios- needs-bisect regression-update
removed: kernel udev usb
summary: - Regression: Kernel 3.2.0-64 fails to boot with USB3 controller card
+ [Dell PowerEdge R510] Regression: Kernel 3.2.0-64 fails to boot with
+ USB3 controller card
Revision history for this message
Bard Hemmer (bard-hemmer) wrote :

I had the same issue on a server that has been running Ubuntu 12.04 LTS without a single problem for over two years.

After upgrading to linux-image-3.2.0-64-generic on the system with a Supermicro X9SCM mainboard, two HighPoint RocketU 1144A USB 3.0 controllers and at least one USB 3.0 disk attached, I observed the same infinite udevd loop (although the message details were different). When I disconnected all USB 3.0 disks, the system booted.

My workaround was to revert to the previously installed kernel, linux-image-3.2.0-61-generic.

Revision history for this message
Bard Hemmer (bard-hemmer) wrote :

I observed the udevd loop with only one Western Digital My Passport 0748 2TB USB 3.0 disk connected.

Revision history for this message
penalvch (penalvch) wrote :

Bard Hemmer, thank you for your comment. So that your hardware and problem may be tracked, could you please file a new report with Ubuntu by executing the following in a terminal while booted into the default Ubuntu kernel (not a mainline one):
ubuntu-bug linux

For more on this, please read the official Ubuntu documentation:
Ubuntu Bug Control and Ubuntu Bug Squad: https://wiki.ubuntu.com/Bugs/BestPractices#X.2BAC8-Reporting.Focus_on_One_Issue
Ubuntu Kernel Team: https://wiki.ubuntu.com/KernelTeam/KernelTeamBugPolicies#Filing_Kernel_Bug_reports
Ubuntu Community: https://help.ubuntu.com/community/ReportingBugs#Bug_reporting_etiquette

When opening up the new report, please feel free to subscribe me to it.

Thank you for your understanding.

Helpful bug reporting tips:
https://wiki.ubuntu.com/ReportingBugs

Revision history for this message
Maciej Puzio (maciej-puzio) wrote :

I have created a new bug report describing this problem replicated on other hardware: bug 1330530
As that is a test machine entirely devoted to this issue, I will test the upstream kernel on it and post the results in bug 1330530.

Regarding testing of the upstream kernel on PowerEdge machines, these are production servers, and I need to schedule a maintenance window in order to take one of them offline. I am required to give an advance notification to users, so this is not something that can be done on a very short notice, or very often (once per week is max I can do). For this reason, may I ask if there are any other conceivable tests that I could run? It would speed things up considerably if I could use the maintenance window to do as many tests as possible, rather than do one at a time.

Revision history for this message
Maciej Puzio (maciej-puzio) wrote :

I have tested kernels 3.16.0-031600rc1-generic and 3.2.60-030260-generic. On the former, the problem does not appear, on the latter, the bug is replicated with similar symptoms as on 3.2.0-64. I used a flash drive with a vanilla Ubuntu 12.04 desktop install for all tests. To summarize kernels tested so far:
Good kernels: 3.2.0-63, 3.16.0-031600rc1
Bad kernels: 3.2.0-64, 3.2.60-030260

I also tested this issue on three additional machines, and the results were the same. So I have now five different hardware configurations (including one from bug 1330530) that are affected by this problem and show very similar symptoms. In fact, I was not able to find a computer that would not replicate this regression. If we also take into account Bard Hemmer's hardware, we can reasonably conclude that the issue is not related to motherboard/chipset/CPU/BIOS. It is however related to HighPoint RocketU 1144C add-in adapter that I used in all my tests.

I would like to note that symptoms are similar on various hardware, but not identical. The errors are generally similar (xhci, udev, modprobe), but it appears that timing differences cause the issue to occur at different parts of the boot process, depending on the hardware. So far I have seen:

1. Dropping to initramfs shell in the middle of the boot ("Gave up waiting for root device." ... "ALERT! [boot drive] does not exist! Dropping to shell!")

2. An error loop preventing the system from booting (as described in this report). In this case I am not sure whether this is an infinite loop, or if the system would boot after a long delay.

3. Boot is delayed by 18 minutes, during which time numerous errors are thrown. After 18 minutes, OS boots fine.

4. System boots to text console, rather than the graphical login screen. It is possible to log on to the console. Within seconds, xhci and/or udev errors start appearing in the syslog. After two minutes, screen goes blank, and the console seems unresponsive for another 16 minutes. Following that, the graphical login screen appears, and from this point system behaves fine.

5. As in 4, but after two minutes in the text console, incomplete graphical login screen appears. Password box is missing and the background is not fully loaded. After another 16 minutes, login screen loads missing parts, and system behaves OK. In this case it is possible to switch between text and graphical consoles during these 16 minutes, but the graphical console becomes a purple empty screen after the switch.

It is also worth noting that symptoms are highly dependent on the external device(s) attached to RocketU's ports. Here is a summary:

1. No device connected to RocketU adapter - no problems during boot
2. USB3 flash drives (tested two models) - no problems during boot
3. Areca ARC-5040 enclosure - bug is triggered
4. WD MyPassport 2TB USB 3.0 drive - bug is triggered
5. Transcend USB 3.0 SD card reader (TS-RDF5K) - bug is triggered with different symptoms: only a small delay (~15 seconds) and small number of xhci errors occur during boot, but the device does not work when OS is fully booted.

All the above devices work fine with "good" kernels. Note that I tested three RocketU controllers and five Ar...

tags: added: kernel-fixed-upstream kernel-fixed-upstream-3.16
tags: added: kernel-fixed-upstream-3.16-rc1
removed: kernel-fixed-upstream-3.16
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
penalvch (penalvch) wrote :

Maciej Puzio, the next step is to fully reverse commit bisect the kernel in order to identify the offending commit. Could you please do this following https://wiki.ubuntu.com/Kernel/KernelBisection#How_do_I_reverse_bisect_the_upstream_kernel.3F ?
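
As a rough illustration of what a reverse bisection does (the actual procedure on the linked wiki page uses git bisect on the kernel tree), the search can be sketched as a binary search for the first commit on which the bug no longer occurs. The commit list and the `is_fixed` test below are hypothetical placeholders; the sketch assumes the commits are ordered oldest to newest, that the newest one is known to be good, and that once fixed the bug stays fixed.

```python
def first_fixed(candidates, is_fixed):
    """Return the index of the first candidate for which is_fixed() is True.

    Assumptions: candidates are ordered oldest to newest, the last one is
    known to be fixed, and every candidate after the fix is also fixed.
    """
    lo, hi = 0, len(candidates) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_fixed(candidates[mid]):
            hi = mid        # the fix is at mid or earlier
        else:
            lo = mid + 1    # the fix is after mid
    return lo


if __name__ == "__main__":
    # Hypothetical history of 10 commits where the fix landed in commit 6.
    commits = list(range(10))
    tested = []

    def is_fixed(commit):
        tested.append(commit)  # each call stands for one build-and-boot test
        return commit >= 6

    print("first fixed commit:", commits[first_fixed(commits, is_fixed)])
    print("kernels tested:", len(tested))
```

Each call to the test function stands for one kernel build and one boot attempt, so the number of tests grows with the logarithm of the number of commits between the bad and the good kernel.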

tags: added: needs-reverse-bisect
removed: needs-bisect
tags: added: latest-bios-1.12.0
removed: 1.12.0 latest-bios-
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Maciej Puzio (maciej-puzio) wrote :

tl;dr: This bug report is a duplicate of bug 1330530.

I am going to focus my efforts on bug 1330530, and I do not intend to do any work on this bug report until bug 1330530 is resolved. This is because doing the same work twice is not a good use of time and effort of anybody involved. Please do not change the status to incomplete, and do not request same steps or actions as for bug 1330530. Instead, please mark this bug as a duplicate of 1330530. This will focus everybody's attention on the problem and minimize confusion. Thank you.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed