Dell CERC Adapter hangs temporarily until reset. REGRESSION

Bug #1552551 reported by Daniel A. Gauthier
This bug affects 3 people
Affects: linux (Ubuntu)
Status: Triaged
Importance: High
Assigned to: Unassigned

Bug Description

The HD CERC SATA 1.5 RAID controller (PCI-64 slot) hangs regularly for 10-60 seconds until reset.
The CERC card firmware is the newest I can find reference to: 4.1-0 [bld 7419]. The Dell Poweredge 1800 has newest BIOS (A07).
All hardware has been tested thoroughly OK except possibly the APIC component (Serverset Northbridge?).
14.04 LTS, with or without all updates, works fine with a perfectly consistent throughput of 77MB/s (slowest HD=66MB/s), RAID5.
15.10 has hangs of 10-60 seconds, including with the newest updates as of 26-FEB-2016.
I have changed PCI interrupts around and moved card slots to minimize sharing and removed cards - repeatedly.
All hard drives and cables have been thoroughly tested, but the drives are mixed: 1 of 66MB/s, 1 of 125 MB/s, and 3 of 100MB/s.
The drives not only work perfectly in 14.04LTS, but all Windows versions and DOS/BIOS-INT 19 as well (0x13).
I have tried all combinations of card RAID options Read ahead, Write cache, and other BIOS options.
Relevant dmesg lines (these were consecutive):
[ 1046.808027] aacraid: Host adapter abort request (4,0,0,0)
[ 1046.808123] aacraid: Host adapter reset request. SCSI hang ?
[ 1297.828038] aacraid: Host adapter abort request (4,0,0,0)
[ 1297.828168] aacraid: Host adapter reset request. SCSI hang ?
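For anyone comparing logs across kernels, the abort/reset pairs above can be pulled out of dmesg output and the time between resets measured. The following is only a minimal sketch: the function names and the regular expression are illustrative assumptions, written against the message format quoted above, and are not part of the aacraid driver or any Ubuntu tool.

```python
import re

# Matches kernel log lines in the format quoted in this report, e.g.
# "[ 1046.808027] aacraid: Host adapter abort request (4,0,0,0)"
LINE_RE = re.compile(
    r"^\[\s*(\d+\.\d+)\]\s+aacraid: Host adapter (abort|reset) request"
)

def parse_aacraid_events(log_text):
    """Return a list of (timestamp_seconds, kind) for abort/reset lines."""
    events = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            events.append((float(match.group(1)), match.group(2)))
    return events

def reset_intervals(events):
    """Return the seconds elapsed between consecutive reset requests."""
    resets = [stamp for stamp, kind in events if kind == "reset"]
    return [round(later - earlier, 3) for earlier, later in zip(resets, resets[1:])]

# The four consecutive lines from the description above.
SAMPLE = """\
[ 1046.808027] aacraid: Host adapter abort request (4,0,0,0)
[ 1046.808123] aacraid: Host adapter reset request. SCSI hang ?
[ 1297.828038] aacraid: Host adapter abort request (4,0,0,0)
[ 1297.828168] aacraid: Host adapter reset request. SCSI hang ?
"""

if __name__ == "__main__":
    print(reset_intervals(parse_aacraid_events(SAMPLE)))
```

On the sample above, the two resets are roughly 251 seconds apart. Feeding in the full output of `dmesg` would show whether the resets cluster under sustained reads.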

ProblemType: Bug
DistroRelease: Ubuntu 15.10
Package: linux-image-4.2.0-16-generic 4.2.0-16.19
ProcVersionSignature: Ubuntu 4.2.0-16.19-generic 4.2.3
Uname: Linux 4.2.0-16-generic x86_64
ApportVersion: 2.19.1-0ubuntu3
Architecture: amd64
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CasperVersion: 1.365
CurrentDesktop: Unity
Date: Thu Mar 3 05:08:57 2016
LiveMediaBuild: Ubuntu 15.10 "Wily Werewolf" - Release amd64 (20151021)
Lsusb:
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 004 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 003 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
MachineType: Dell Computer Corporation PowerEdge 1800
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 nouveaufb
ProcKernelCmdLine: BOOT_IMAGE=u15.10-64/casper/vmlinuz.efi file=u15.10-64/preseed/username.seed boot=casper netboot=nfs nfsroot=10.100.1.86:/data2/tftpboot/u15.10-64 initrd=u15.10-64/casper/initrd.lz ipv6.disable=1 net.ifnames=0
RelatedPackageVersions:
 linux-restricted-modules-4.2.0-16-generic N/A
 linux-backports-modules-4.2.0-16-generic N/A
 linux-firmware 1.149
RfKill:
 0: phy0: Wireless LAN
  Soft blocked: no
  Hard blocked: no
SourcePackage: linux
UdevLog: Error: [Errno 2] No such file or directory: '/var/log/udev'
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 09/29/2006
dmi.bios.vendor: Dell Computer Corporation
dmi.bios.version: A07
dmi.board.name: 0P8611
dmi.board.vendor: Dell Computer Corporation
dmi.board.version: A04
dmi.chassis.type: 17
dmi.chassis.vendor: Dell Computer Corporation
dmi.modalias: dmi:bvnDellComputerCorporation:bvrA07:bd09/29/2006:svnDellComputerCorporation:pnPowerEdge1800:pvr:rvnDellComputerCorporation:rn0P8611:rvrA04:cvnDellComputerCorporation:ct17:cvr:
dmi.product.name: PowerEdge 1800
dmi.sys.vendor: Dell Computer Corporation

Revision history for this message
Daniel A. Gauthier (fractal2010) wrote :

I forgot to add that although this is booted from the PXE at this time, I have previously installed and updated the 15.10.
I have tested both 32 and 64 bit versions with identical results, the server now has exactly 4G RAM (4x1GBxREGECC).
14.04LTS is currently installed as 2nd OS on HD, but it has a PCI NVIDIA card so the video doesn't work right (missing icons/menus/text in lists when moused over). I'm hoping the 16.04LTS will solve both issues simultaneously.
Both the 32 and 64 bit versions of 14.04LTS and earlier work OK.

The user is trying to use this server as a desktop workstation, and the onboard video isn't good enough, which is why it has the PCI card. It used to have 2 PCI video cards when he required DVI output; because the BIOS won't recognize most cards, there was no video until the OS was booted. I have also tried removing the PCI wireless card, which is the last remaining other card.
I do not specifically remember trying the drives with the onboard video only, but since it didn't matter if the interrupt was tied to the same line or not (tried in 2 different slots), I doubt that would make a difference.

Dan

Revision history for this message
Brad Figg (brad-figg) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Changed in linux (Ubuntu):
importance: Undecided → High
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.5 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.5-rc6-wily/

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
penalvch (penalvch)
tags: added: regression-release
removed: xeon
Revision history for this message
Daniel A. Gauthier (fractal2010) wrote :

Tag set as directed by the email from this bug report: the bug exists in the upstream kernel. I plan to test without SMP next, after which I'll add more details to the original post.

Dan

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
tags: added: kernel-bug-exists-upstream
Revision history for this message
Daniel A. Gauthier (fractal2010) wrote :

Last of details I can supply unless someone wants me to test 14.10 or 15.04 releases. I'd have to redownload or find a DVD and a drive that reads them to PXE it since the only DVD I could find won't read in the drives I have.

Ubuntu 9.10 works both 32 and 64 bit just fine.
15.10 hangs, both 32 and 64 bit.
Setting /sys/class/block/sd?/device/timeout to very small (5) or very large (90) doesn't help, but a small timeout seems to be less likely to return a read-error to userspace.
I tried the 4.5 rc7 candidate from upstream, as well as the rc6, and it still occurs there (64-bit only tested).
Booting with "nosmp" option at first seemed to help, I didn't get a hang until 95G had been read, but they were regular after that.
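For anyone wanting to repeat the timeout experiment, here is a minimal sketch of writing a value to every matching `device/timeout` file. The function name and the `sys_root` parameter are assumptions added for illustration (the latter so the sketch can be tried against a scratch directory). Writing to the real `/sys` tree requires root.

```python
import glob
import os

def set_scsi_timeouts(seconds, sys_root="/sys", pattern="sd?"):
    """Write `seconds` to <sys_root>/class/block/<pattern>/device/timeout.

    Returns a dict mapping each file touched to its previous value,
    so the old settings can be restored afterwards.
    """
    previous = {}
    glob_path = os.path.join(sys_root, "class", "block", pattern, "device", "timeout")
    for path in sorted(glob.glob(glob_path)):
        with open(path) as handle:
            previous[path] = handle.read().strip()
        with open(path, "w") as handle:
            handle.write(str(seconds))
    return previous

# Example (as root): try the small value tested above, keeping the old ones.
# old_values = set_scsi_timeouts(5)
```

The setting is not persistent across reboots, so each test run starts again from the kernel default.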

Dan

Revision history for this message
penalvch (penalvch) wrote :

Daniel A. Gauthier, the next step is to fully commit bisect from kernel 2.6.31 to 4.2 in order to identify the last good kernel commit, followed immediately by the first bad one. This will allow for a more expedited analysis of the root cause of your issue. Could you please do this following https://wiki.ubuntu.com/Kernel/KernelBisection ?

Please note, finding adjacent kernel versions is not fully commit bisecting.
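The idea behind a full commit bisect is a binary search over the ordered commit history. The sketch below illustrates only that search logic, under stated assumptions: the commit names are hypothetical placeholders, and `is_bad` stands in for the manual step of building, booting, and testing a kernel. It is not how `git bisect` is implemented internally.

```python
def first_bad_commit(commits, is_bad):
    """Binary-search an oldest-to-newest commit list.

    Assumes commits[0] is known good and commits[-1] is known bad.
    Returns (last_good, first_bad, number_of_tests_run).
    """
    good, bad = 0, len(commits) - 1  # invariant: commits[good] good, commits[bad] bad
    tests = 0
    while bad - good > 1:
        middle = (good + bad) // 2
        tests += 1
        if is_bad(commits[middle]):
            bad = middle
        else:
            good = middle
    return commits[good], commits[bad], tests

# Hypothetical history of 16 commits in which the regression lands at "c11".
history = ["c%d" % index for index in range(16)]
result = first_bad_commit(history, lambda name: int(name[1:]) >= 11)
```

Each test halves the remaining range, so even the thousands of commits between two kernel releases need only a dozen or so build-and-boot cycles, and the result is a pair of adjacent commits rather than a pair of adjacent versions.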

After the offending commit (not kernel version) has been identified, then please mark this report Status Confirmed.

Thank you for your understanding.

Helpful bug reporting tips:
https://wiki.ubuntu.com/ReportingBugs

tags: added: kernel-bug-exists-upstream-4.5-rc7 latest-bios-a07 needs-bisect
description: updated
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired
Changed in linux (Ubuntu):
status: Expired → Confirmed
Revision history for this message
Daniel A. Gauthier (fractal2010) wrote :

I have bisected the source of the bug and then made a brief attempt at a fix. Unfortunately the quick fix failed (bad CONST definitions in an .h file). I thought perhaps it was some kind of memory region locking error causing corruption, but it appears to be more complicated than that.

The bad commit is b836439 - 4KB Sector Support.
The previous good commit is 2f5d1f7.

I'm having difficulty integrating this compiled kernel into my PXE environment, hence the week-long delay in posting this. Unfortunately, I lowered my priority for the whole issue once I figured it was too late to get anything into the first run of the new LTS release, and the bug has since been marked as expired. christopher.m.penalver said to mark it as Confirmed, but I don't know if I can do that myself, or if I can do anything at all at this point other than post this info here for anyone who's interested.

If I do ever get time to look at the (significant) code changes for the 4K support, I'll post changes here as well.

Dan

Revision history for this message
penalvch (penalvch) wrote :

Daniel A. Gauthier, one last issue before upstreaming, could you please test the latest mainline kernel (4.6) and advise on the results?

tags: added: bisect-done
removed: needs-bisect
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
penalvch (penalvch) wrote :

Daniel A. Gauthier, regarding the latest mainline kernel, instructions are available at https://wiki.ubuntu.com/Kernel/MainlineBuilds.

Revision history for this message
Daniel A. Gauthier (fractal2010) wrote : [Bug 1552551] Re: Dell CERC Adapter hangs temporarily until reset. REGRESSION

Chris,

I tested the newest Ubuntu Mainline Kernel, 4.6.0-040600rc7-generic
#201605081830 and had the same "SCSI hang" errors. (amd64 version)

Dan

On 5/19/2016, "Christopher M. Penalver"
<email address hidden> wrote:

>Daniel A. Gauthier, one last issue before upstreaming, could you please
>test the latest mainline kernel (4.6) and advise to the results?

Revision history for this message
penalvch (penalvch) wrote :

Daniel Gauthier, to advise, 4.6 is a later release than 4.6-rc7.

Revision history for this message
Daniel A. Gauthier (fractal2010) wrote :

Chris,

rc7 was the newest kernel for Wily. I reinstalled the newest 16.04
(again 64 bit) and tested the 4.6.0 version from a week later
(4.6.0-040600-generic #201605151930) and it's still bad.

Dan

Revision history for this message
penalvch (penalvch) wrote :

Daniel A. Gauthier, the naming on the end can be confusing.

Despite this, could you please test the latest mainline kernel (4.7-rc1) and advise on the results?

tags: added: kernel-bug-exists-upstream-4.6
removed: kernel-bug-exists-upstream-4.5-rc7
Revision history for this message
Daniel A. Gauthier (fractal2010) wrote :

The problem still exists in 4.7.0rc3 and rc4, along with a few new messages I've never seen before:

At boot, on screen (slowly twice) before X is up: "AAC: Host adapter dead -1" (at modprobe perhaps)

This message seemed to predict the complete lack of an sdb device; only the SATA laptop hard drive I installed 16.04 onto for testing showed up, as sda. Reboot performed.

Now (unlike previous kernels), the AAC is sda and the SATA HD is sdb, and I've never seen the BLINK or panic lines before. I'll still try to test rc1, I had to reload to recover my initramfs config and haven't fixed the rc1 kernel yet.

[34779.808023] aacraid: Host adapter abort request (0,0,0,0)
[34779.808129] aacraid: Host adapter reset request. SCSI hang ?
[34841.832026] aacraid: Host adapter abort request (0,0,0,0)
[34841.832139] aacraid: Host adapter reset request. SCSI hang ?
[34911.816022] aacraid: Host adapter abort request (0,0,0,0)
[34911.816030] aacraid: Host adapter abort request (0,0,0,0)
[34911.816034] aacraid: Host adapter abort request (0,0,0,0)
[34911.816037] aacraid: Host adapter abort request (0,0,0,0)
[34911.816147] aacraid: Host adapter reset request. SCSI hang ?
[34912.024026] AAC: Host adapter BLINK LED 0xc7
[34913.028030] AAC0: adapter kernel panic'd c7.

Dan

penalvch (penalvch)
tags: added: kernel-bug-exists-upstream-4.7-rc4
removed: kernel-bug-exists-upstream-4.6
Revision history for this message
penalvch (penalvch) wrote :

Daniel A. Gauthier, the issue you are reporting is an upstream one. Could you please report this problem following the instructions verbatim at https://wiki.ubuntu.com/Bugs/Upstream/kernel to the appropriate mailing list (TO Mahesh Rajashekhara, James E.J. Bottomley, Adaptec OEM Raid Solutions, Harry Yang, Achim Leubner, Rajinikanth Pandurangan, and Rich Bono CC linux-scsi)?

Please provide a direct URL to your post to the mailing list when it becomes available so that it may be tracked.

Thank you for your understanding.

Changed in linux (Ubuntu):
status: Incomplete → Triaged
Revision history for this message
Brian Daniels (bd6string) wrote :

I've had the same issues after applying patches for 14.10. Had to use recovery to boot and decided to upgrade to fix. This did not help as 16.04 has same issue with "Host adapter" errors on Dell PowerEdge 2800 and CERC controller. OS will not boot.

Revision history for this message
Donc (zaber1) wrote :

I believe I am having the same issue on a Dell PowerVault 745N. The system worked fine on 14.04 but would not boot when upgraded to 16.04. It does still boot with kernel 3.13.0-95-generic.

Revision history for this message
Jon Schewe (jpschewe) wrote :

I have a Dell PowerEdge 1800 with Dell CERC 1.5/6ch RAID controller in it. Everything worked fine under Ubuntu 14.04.1. When I upgraded to 16.04.1 the system won't boot. It can't find the root filesystem. I see errors about host adapter dead on the screen many times. This is with kernel 4.4.0-57 and with 4.8.0-32. When I switch back to kernel 3.13.0-105 the system boots just fine.

Using intel_iommu=on doesn't help.

I have mptbios 5.06.04.
The CERC card says bios version 4.1-0 [Build 7403]

I can't figure out how to upgrade the firmware or BIOS on the system from Ubuntu either.

Revision history for this message
penalvch (penalvch) wrote :

Jon Schewe, it would help immensely if you filed a new report with Ubuntu, using the default repository kernel (not mainline/upstream/3rd party) via a terminal:
ubuntu-bug linux

Please feel free to subscribe me to it.

For more on why this is helpful, please see https://wiki.ubuntu.com/ReportingBugs.

Revision history for this message
Jon Schewe (jpschewe) wrote :

Christopher,

I tried to file a bug report and it won't let me. I can't boot into the 4.4.0 kernel, so I can't file a bug report from that version and it appears that I can't file the bug report from my 3.13.0 kernel either.

>ubuntu-bug linux

*** Collecting problem information

The collected information can be sent to the developers to improve the
application. This might take a few minutes.
.........................................

*** Problem in linux-image-3.13.0-105-generic

The problem cannot be reported:

This is not an official Ubuntu package. Please remove any third party package and try again.

Press any key to continue...

No pending crash reports. Try --help for more information.
