system hangs after strange errors - raid6 and xfs defective (lsi driver?)

Bug #599830 reported by Lars
This bug affects 2 people

Affects: ecs (Ubuntu)
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Hi there.

this is a report of a failure that has happened more than once, but I'm still unsure which part of the system bears the blame.

The setup is as follows:

SMP on a multi-core 64-bit system:
# uname -a
Linux speicher48 2.6.32-22-server #36-Ubuntu SMP Thu Jun 3 20:38:33 UTC 2010 x86_64 GNU/Linux

LSI controller:
08:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08)
09:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08)

# cat /proc/scsi/mptsas/10
ioc0: LSISAS1068E B3, FwRev=011e0000h, Ports=1, MaxQ=483
# cat /proc/scsi/mptsas/11
ioc2: LSISAS1068E B3, FwRev=011e0000h, Ports=1, MaxQ=483

lots of SATA II drives attached through Chenbro enclosures:

# cat /proc/scsi/scsi | egrep '(CHENBRO|Enclosure|Id: 24)'
Host: scsi10 Channel: 00 Id: 24 Lun: 00
  Vendor: CHENBRO Model: SASX36 B0 Rev: AB11
  Type: Enclosure ANSI SCSI revision: 03
Host: scsi11 Channel: 00 Id: 24 Lun: 00
  Vendor: CHENBRO Model: SASX36 B0 Rev: AB11
  Type: Enclosure ANSI SCSI revision: 03

# cat /proc/partitions
major minor #blocks name

   8 0 146523384 sda
   8 1 14651248 sda1
   8 2 14651280 sda2
   8 3 19535040 sda3
   8 16 146523384 sdb
   8 17 14651248 sdb1
   8 18 14651280 sdb2
   8 19 19535040 sdb3
   9 0 14651136 md0
   9 1 14651200 md1
   8 32 1953525166 sdc
   8 48 293036184 sdd
   8 64 293036184 sde
   8 80 293036184 sdf
   8 96 293036184 sdg
   8 112 293036184 sdh
   8 128 293036184 sdi
   8 144 293036184 sdj
   8 160 293036184 sdk
   8 176 293036184 sdl
   8 192 293036184 sdm
   8 208 293036184 sdn
   9 2 6446794112 md2
   8 224 293036184 sdo
   8 240 293036184 sdp
  65 0 293036184 sdq
  65 16 293036184 sdr
  65 32 293036184 sds
  65 48 293036184 sdt
  65 64 293036184 sdu
  65 80 293036184 sdv
  65 96 293036184 sdw
  65 112 293036184 sdx
  65 128 293036184 sdy
  65 144 293036184 sdz
  65 160 293036184 sdaa
  65 176 293036184 sdab
  65 192 293036184 sdac
  65 224 293036184 sdae
  65 208 293036184 sdad
  65 240 293036184 sdaf
  66 0 293036184 sdag
  66 16 293036184 sdah
  66 32 293036184 sdai
  66 48 293036184 sdaj
  66 64 293036184 sdak
  66 80 293036184 sdal
  66 96 293036184 sdam
  66 112 293036184 sdan
  66 128 293036184 sdao
  66 144 293036184 sdap
  66 160 293036184 sdaq
  66 176 293036184 sdar
  66 192 293036184 sdas
  66 193 14188543 sdas1
  66 224 293036184 sdau
  66 208 293036184 sdat
  66 240 293036184 sdav
  67 0 293036184 sdaw
  67 16 293036184 sdax
  67 32 293036184 sday
   9 3 6446794112 md3

sw-raid6:
# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md3 : active raid6 sdav[20] sdah[6] sdal[10] sdai[7] sdae[3] sdar[16] sdag[5] sdab[0] sdaj[8] sdak[9] sdas[17] sdao[13] sday[23] sdam[11] sdac[1] sdaf[4] sdat[18] sdaq[15] sdaw[21] sdax[22] sdad[2] sdap[14] sdau[19] sdan[12]
      6446794112 blocks level 6, 64k chunk, algorithm 2 [24/24] [UUUUUUUUUUUUUUUUUUUUUUUU]

md2 : active raid6 sdf[24] sdr[14] sdq[13] sdy[17] sdn[9] sdw[19] sdj[6] sdz[22] sdaa[23] sdm[10] sdx[20] sdv[18] sdu[21] sds[15] sdt[16] sdd[0] sdo[11] sdp[12] sdl[8] sdk[7] sdi[5] sdh[4] sdg[3] sde[1]
      6446794112 blocks level 6, 64k chunk, algorithm 2 [24/23] [UU_UUUUUUUUUUUUUUUUUUUUU]
      [=================>...] recovery = 87.4% (256342920/293036096) finish=36.6min speed=16676K/sec

md1 : active raid1 sda2[0] sdb2[1]
      14651200 blocks [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[1]
      14651136 blocks [2/2] [UU]

unused devices: <none>

XFS on raids:
# LANG=C df -ht xfs
Filesystem Size Used Avail Use% Mounted on
/dev/md2 6.1T 5.6T 466G 93% /backup2
/dev/md3 6.1T 4.8T 1.3T 79% /backup1

The system hangs after some days of operation. Sysrq no longer works at that point.
At least one of the sw-raid6 arrays can't be assembled after the system is rebooted, because at least one HDD is inaccessible (tested with smartctl -i /dev/sd__).

Only a halt and power-off with disconnection from the power grid for a few seconds makes all HDDs accessible again.

After this, really severe XFS errors occur.
Sometimes xfs_repair needs the «-P» option to get the filesystem working again.
I have already seen a segmentation fault from xfs_check before an xfs_repair.
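
For reference, the recovery after such a power cycle boils down to roughly this sequence (device names are only examples; xfs_repair runs on the unmounted filesystem, and «-P» disables its prefetching):

# mdadm --assemble --scan
# xfs_check /dev/md2
# xfs_repair /dev/md2        (add -P when the plain run fails)
# mount /dev/md2 /backup2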

The LSI controllers run the latest firmware:
1.30.00.00 18-DEC-09
from LSI:
http://www.lsi.com/storage_home/products_home/host_bus_adapters/sas_hbas/internal/sas3081e-r/index.html

Maybe the driver has some problems?

Sadly I can't run long debugging sessions because the server is backup-critical.

I'm stuck.

Kind regards
Lars

priya (priyamtk)
affects: ubuntu → ecs (Ubuntu)
Revision history for this message
Lars (lars-taeuber) wrote :

Hi,

here is a dmesg log of the system.
There are a lot of messages and errors from the LSI driver module (mptbase). Maybe someone knows how to interpret these.

[...]
[48087.552179] mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
[48087.552381] mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
[48087.552606] mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
[48087.552831] mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
[48578.283740] mptbase: ioc2: LogInfo(0x31110700): Originator={PL}, Code={Reset}, SubCode(0x0700)
[48579.771554] mptbase: ioc2: LogInfo(0x31110700): Originator={PL}, Code={Reset}, SubCode(0x0700)
[48579.771724] mptbase: ioc2: LogInfo(0x31110700): Originator={PL}, Code={Reset}, SubCode(0x0700)
[48579.771905] mptbase: ioc2: LogInfo(0x31110700): Originator={PL}, Code={Reset}, SubCode(0x0700)
[...]

Regards
Lars

Revision history for this message
Lars (lars-taeuber) wrote :

Hello again,

here are some other logs that might be connected to the failure:

[...]
Jul 2 17:52:25 speicher48 kernel: [17690.334458] sd 10:0:20:0: [sdx] CDB: Read(10): 28 00 00 00 00 00 00 00 08 00
Jul 2 17:52:25 speicher48 kernel: [17690.338722] sd 10:0:20:0: [sdx] Device not ready
Jul 2 17:52:25 speicher48 kernel: [17690.338723] sd 10:0:20:0: [sdx] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jul 2 17:52:25 speicher48 kernel: [17690.338726] sd 10:0:20:0: [sdx] Sense Key : Not Ready [current]
Jul 2 17:52:25 speicher48 kernel: [17690.338728] sd 10:0:20:0: [sdx] Add. Sense: Logical unit failed self-configuration
Jul 2 17:52:25 speicher48 kernel: [17690.338731] sd 10:0:20:0: [sdx] CDB: Read(10): 28 00 00 00 00 00 00 00 08 00
Jul 2 17:52:25 speicher48 kernel: [17690.342955] sd 10:0:20:0: [sdx] Device not ready
Jul 2 17:52:25 speicher48 kernel: [17690.342956] sd 10:0:20:0: [sdx] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jul 2 17:52:25 speicher48 kernel: [17690.342959] sd 10:0:20:0: [sdx] Sense Key : Not Ready [current]
Jul 2 17:52:25 speicher48 kernel: [17690.342961] sd 10:0:20:0: [sdx] Add. Sense: Logical unit failed self-configuration
Jul 2 17:52:25 speicher48 kernel: [17690.342964] sd 10:0:20:0: [sdx] CDB: Read(10): 28 00 00 00 10 00 00 00 08 00
Jul 2 17:52:25 speicher48 kernel: [17690.374555] sd 10:0:20:0: [sdx] Device not ready
Jul 2 17:52:25 speicher48 kernel: [17690.374562] sd 10:0:20:0: [sdx] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jul 2 17:52:25 speicher48 kernel: [17690.374569] sd 10:0:20:0: [sdx] Sense Key : Not Ready [current]
Jul 2 17:52:25 speicher48 kernel: [17690.374577] sd 10:0:20:0: [sdx] Add. Sense: Logical unit failed self-configuration
Jul 2 17:52:25 speicher48 kernel: [17690.374587] sd 10:0:20:0: [sdx] CDB: Read(10): 28 00 22 ee c0 80 00 00 08 00
Jul 2 17:52:25 speicher48 kernel: [17690.379051] sd 10:0:20:0: [sdx] Device not ready
Jul 2 17:52:25 speicher48 kernel: [17690.379055] sd 10:0:20:0: [sdx] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jul 2 17:52:25 speicher48 kernel: [17690.379061] sd 10:0:20:0: [sdx] Sense Key : Not Ready [current]
Jul 2 17:52:25 speicher48 kernel: [17690.379068] sd 10:0:20:0: [sdx] Add. Sense: Logical unit failed self-configuration
Jul 2 17:52:25 speicher48 kernel: [17690.379076] sd 10:0:20:0: [sdx] CDB: Read(10): 28 00 22 ee c0 80 00 00 08 00
Jul 2 17:52:25 speicher48 kernel: [17690.383570] sd 10:0:20:0: [sdx] Device not ready
Jul 2 17:52:25 speicher48 kernel: [17690.383575] sd 10:0:20:0: [sdx] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jul 2 17:52:25 speicher48 kernel: [17690.383581] sd 10:0:20:0: [sdx] Sense Key : Not Ready [current]
Jul 2 17:52:25 speicher48 kernel: [17690.383588] sd 10:0:20:0: [sdx] Add. Sense: Logical unit failed self-configuration
Jul 2 17:52:25 speicher48 kernel: [17690.383597] sd 10:0:20:0: [sdx] CDB: Read(10): 28 00 22 ee c1 20 00 00 08 00
Jul 2 17:52:25 speicher48 kernel: [17690.388115] sd 10:0:20:0: [sdx] Device not ready
Jul 2 17:52:25 speicher48 kernel: [17690.388120] sd 10:0:20:0: [sdx] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jul 2 17:52:25 speicher48 kernel: [17690.388126] sd 10:0:20:0: [sdx] Sens...


Revision history for this message
Jason Unrein (diabelek) wrote :

The mpt messages in your logs suggest that the firmware had an NCQ problem that required it to abort all the outstanding commands and have the OS retry them (see http://en.wikipedia.org/wiki/NCQ for what NCQ is). You can disable NCQ, usually at the cost of IO performance, to work around the issue (see https://ata.wiki.kernel.org/index.php/Libata_FAQ#Enabling.2C_disabling_and_checking_NCQ).
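
As a rough sketch, NCQ can be checked and disabled per device through sysfs (sdx is only an example device; a queue_depth of 1 effectively turns NCQ off):

# cat /sys/block/sdx/device/queue_depth
# echo 1 > /sys/block/sdx/device/queue_depth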

The problem is probably a bad drive, or on the off chance a bad cable or card. You might check each drive with smartctl to confirm its health. You might also want to watch /sys/block/sdX/device/ioerr_cnt for each device to help clue in on any problems (I've never used that file before, so I'd be curious whether it helps).
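
A quick way to run that check across all the drives might look like this (sd? and sd?? globs as in your listing; ioerr_cnt values are hex, and some SATA drives behind the SAS HBA may need an extra -d sat for smartctl):

# for d in /dev/sd? /dev/sd??; do echo -n "$d: "; smartctl -H $d | grep -i overall-health; done
# for b in /sys/block/sd*; do echo "$b: $(cat $b/device/ioerr_cnt)"; done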

Also, the xfs.log shows a panic from a null pointer. This is probably just a result of the problems with the fw<->drive communication.

Revision history for this message
Lars (lars-taeuber) wrote :

Hi Jason,

thanks for your hints.
I did a FW update of the LSI SAS controllers and reduced the fs content. Since the update, and with the filesystems less full, the error hasn't occurred again.

# cat /proc/scsi/mptsas/9
ioc1: LSISAS1068E B3, FwRev=011f0200h, Ports=1, MaxQ=483
# cat /proc/scsi/mptsas/10
ioc2: LSISAS1068E B3, FwRev=011f0200h, Ports=1, MaxQ=483

# ./sasflash -listall

 ****************************************************************************
    LSI Corporation SAS FLASH Utility.

    SASFlash Version 1.26.00.00 (2010.05.18)

    Copyright (c) 2006-2007 LSI Corporation. All rights reserved.
 ****************************************************************************

 Adapter Selected is a LSI SAS 1068E(B3):

 Num Ctlr FW Ver NVDATA x86-BIOS EFI-BSD PCI Addr
-----------------------------------------------------------------------

1 1068E(B3) 01.31.02.00 2d.03 06.32.00.00 No Image 00:08:00:00
2 1068E(B3) 01.31.02.00 2d.03 06.32.00.00 No Image 00:09:00:00

The filesystems look like this:
# LANG=C df -ht xfs
Filesystem Size Used Avail Use% Mounted on
/dev/md2 6.1T 2.2T 3.9T 36% /backup2
/dev/md3 6.1T 3.8T 2.3T 63% /backup1

Just for your interest:
# cat /sys/block/sd?/device/ioerr_cnt /sys/block/sd??/device/ioerr_cnt
0x358
0x358
0x53
0x48
0x47
0x46
0x59
0x55
0x55
0x60
0x63
0x62
0x60
0x5e
0x6c
0x62
0x60
0x67
0x68
0x6c
0x76
0x70
0x72
0x6e
0x6d
0x65
0xc3
0xbd
0xc5
0xca
0xf0
0x104
0x107
0x113
0x119
0x127
0x11b
0x127
0x126
0x12d
0x12f
0x13c
0x12e
0x142
0x17f
0x13a
0x141
0x144
0x13e
0x141

The first 2 drives are attached through a SATA controller (AHCI). I don't know what numbers are normal, but there is another server with drives that have an error count of more than 850 and work flawlessly.
I would disable NCQ only as a very last step, because throughput is important. The server has to feed 2 LTO tape drives at high write speed.

Is it possible to reopen bug reports? If so, I think you can close this one for now.
I'll report when problems occur again.

Thanks
Lars

Revision history for this message
Lars (lars-taeuber) wrote :

Ha, wait!

I forgot the most important fact.
I manually installed a much newer driver version, and I redo it after every kernel update.
It's the driver from the zip archive from LSI. It is actually packaged for Red Hat and SUSE, but the archive contains a source tarball.

http://lsi.com/storage_home/products_home/host_bus_adapters/sas_hbas/internal/sas3081e-r/index.html

# cat /sys/module/mptbase/version
4.24.00.00

Regards
Lars

Revision history for this message
Jason Unrein (diabelek) wrote :

That is good to hear. There have been some workarounds added to the LSI driver to handle problems similar to this. It might be that you're still having a problem, but the new fw/driver combination is able to mask it much better. I would keep an eye on your system for anything suspicious for a while so you don't have any downtime.

I was curious if you knew what driver you were using before you upgraded to the LSI driver. I could try to compare the two and see what the differences were (probably a lot but maybe a specific change will stand out). If you don't know the driver version, then your kernel version will help narrow it down.

Also, that driver should come with a DKMS package. It only has builds for Red Hat Enterprise Linux and SUSE Linux Enterprise, but you should be able to use DKMS to build it for your kernel, and then future updates of your kernel should rebuild it automatically. Here's a link to DKMS if you're not familiar: https://help.ubuntu.com/community/DKMS. You should only have to do the bottom half since LSI builds the dkms package for everyone.
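
A minimal sketch of those DKMS steps, assuming the LSI tarball unpacks to a tree named mptlinux-4.24.00.00 containing a dkms.conf (the actual name and version may differ):

# cp -r mptlinux-4.24.00.00 /usr/src/
# dkms add -m mptlinux -v 4.24.00.00
# dkms build -m mptlinux -v 4.24.00.00
# dkms install -m mptlinux -v 4.24.00.00
# dkms status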

Lastly, if you're satisfied with LSI driver, you might close the bug.

Revision history for this message
Lars (lars-taeuber) wrote :

Hi Jason,

the previous module versions were the ones shipped with Ubuntu amd64 server 10.04(.1) LTS (up to 2.6.32-25-server) and 10.10 (2.6.35-24-server).

I'll have a look at DKMS. I haven't used it yet.

If the server runs without related issues for the next 2 months, I'll close the bug if it's still open.

Thanks again.
Lars

Revision history for this message
BDV (bdv) wrote :

One of my servers has a similar problem.

Ubuntu 10.04.2 LTS 64 bit
kernel 2.6.32-31-server

The RAID controller is a Symbios Logic LSI MegaSAS 9260 (rev 03)
(default drivers)

Today I found the following error in dmesg:
"task xfssyncd: blocked for more than 120 seconds"

Revision history for this message
Lars (lars-taeuber) wrote :

Hi BDV,

I suggest updating the driver to the most recent one from LSI and installing it using dkms.
You might need to change the source a little, because there are version guards along the lines of
#if (LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,32)) .....

but they have to become
#if (LINUX_VERSION_CODE > KERNEL_VERSION(2,6,32)) .....

You'll find them easily.

Try to update your firmware too, if possible. There are Linux utilities from LSI available to do this.

Good luck.
Lars
