system hangs after strange errors - raid6 and xfs defective (lsi driver?)
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
ecs (Ubuntu) | New | Undecided | Unassigned |
Bug Description
Hi there.
This is a report of a failure that has happened more than once, but I'm still unsure which part of the system bears the blame.
The setup is as follows:
smp on multi core 64bit system:
# uname -a
Linux speicher48 2.6.32-22-server #36-Ubuntu SMP Thu Jun 3 20:38:33 UTC 2010 x86_64 GNU/Linux
LSI controller:
08:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08)
09:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08)
# cat /proc/scsi/
ioc0: LSISAS1068E B3, FwRev=011e0000h, Ports=1, MaxQ=483
# cat /proc/scsi/
ioc2: LSISAS1068E B3, FwRev=011e0000h, Ports=1, MaxQ=483
Lots of SATA II drives attached through Chenbro enclosures:
# cat /proc/scsi/scsi | egrep '(CHENBRO|
Host: scsi10 Channel: 00 Id: 24 Lun: 00
Vendor: CHENBRO Model: SASX36 B0 Rev: AB11
Type: Enclosure ANSI SCSI revision: 03
Host: scsi11 Channel: 00 Id: 24 Lun: 00
Vendor: CHENBRO Model: SASX36 B0 Rev: AB11
Type: Enclosure ANSI SCSI revision: 03
# cat /proc/partitions
major minor #blocks name
8 0 146523384 sda
8 1 14651248 sda1
8 2 14651280 sda2
8 3 19535040 sda3
8 16 146523384 sdb
8 17 14651248 sdb1
8 18 14651280 sdb2
8 19 19535040 sdb3
9 0 14651136 md0
9 1 14651200 md1
8 32 1953525166 sdc
8 48 293036184 sdd
8 64 293036184 sde
8 80 293036184 sdf
8 96 293036184 sdg
8 112 293036184 sdh
8 128 293036184 sdi
8 144 293036184 sdj
8 160 293036184 sdk
8 176 293036184 sdl
8 192 293036184 sdm
8 208 293036184 sdn
9 2 6446794112 md2
8 224 293036184 sdo
8 240 293036184 sdp
65 0 293036184 sdq
65 16 293036184 sdr
65 32 293036184 sds
65 48 293036184 sdt
65 64 293036184 sdu
65 80 293036184 sdv
65 96 293036184 sdw
65 112 293036184 sdx
65 128 293036184 sdy
65 144 293036184 sdz
65 160 293036184 sdaa
65 176 293036184 sdab
65 192 293036184 sdac
65 224 293036184 sdae
65 208 293036184 sdad
65 240 293036184 sdaf
66 0 293036184 sdag
66 16 293036184 sdah
66 32 293036184 sdai
66 48 293036184 sdaj
66 64 293036184 sdak
66 80 293036184 sdal
66 96 293036184 sdam
66 112 293036184 sdan
66 128 293036184 sdao
66 144 293036184 sdap
66 160 293036184 sdaq
66 176 293036184 sdar
66 192 293036184 sdas
66 193 14188543 sdas1
66 224 293036184 sdau
66 208 293036184 sdat
66 240 293036184 sdav
67 0 293036184 sdaw
67 16 293036184 sdax
67 32 293036184 sday
9 3 6446794112 md3
sw-raid6:
# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md3 : active raid6 sdav[20] sdah[6] sdal[10] sdai[7] sdae[3] sdar[16] sdag[5] sdab[0] sdaj[8] sdak[9] sdas[17] sdao[13] sday[23] sdam[11] sdac[1] sdaf[4] sdat[18] sdaq[15] sdaw[21] sdax[22] sdad[2] sdap[14] sdau[19] sdan[12]
6446794112 blocks level 6, 64k chunk, algorithm 2 [24/24] [UUUUUUUUUUUUUU
md2 : active raid6 sdf[24] sdr[14] sdq[13] sdy[17] sdn[9] sdw[19] sdj[6] sdz[22] sdaa[23] sdm[10] sdx[20] sdv[18] sdu[21] sds[15] sdt[16] sdd[0] sdo[11] sdp[12] sdl[8] sdk[7] sdi[5] sdh[4] sdg[3] sde[1]
6446794112 blocks level 6, 64k chunk, algorithm 2 [24/23] [UU_UUUUUUUUUUU
[
md1 : active raid1 sda2[0] sdb2[1]
14651200 blocks [2/2] [UU]
md0 : active raid1 sda1[0] sdb1[1]
14651136 blocks [2/2] [UU]
unused devices: <none>
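The mdstat output above shows md2 running with 23 of 24 members. As a rough sketch, a degraded array can be spotted automatically by comparing the two numbers in the `[total/active]` counter. The function name `degraded_arrays` and the shortened sample text are illustrative assumptions; the member lists are abbreviated and the truncated status strings from the paste are left out.

```python
import re

# Matches an array header such as "md2 : active raid6 sdf[24] ..."
ARRAY_RE = re.compile(r"^(md\d+) : (\w+)")
# Matches the "[total/active]" counter such as "[24/23]"
COUNT_RE = re.compile(r"\[(\d+)/(\d+)\]")


def degraded_arrays(mdstat_text):
    """Return {array: (total, active)} for arrays with fewer active than total members."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line.strip())
        if header:
            current = header.group(1)
            continue
        counts = COUNT_RE.search(line)
        if current and counts:
            total, active = int(counts.group(1)), int(counts.group(2))
            if active < total:
                degraded[current] = (total, active)
            current = None
    return degraded


# Abbreviated excerpt of the output above (member lists shortened).
SAMPLE = """\
Personalities : [linear] [multipath] [raid0] [raid1] [raid6]
md3 : active raid6 sdav[20] sdah[6]
      6446794112 blocks level 6, 64k chunk, algorithm 2 [24/24]
md2 : active raid6 sdf[24] sdr[14]
      6446794112 blocks level 6, 64k chunk, algorithm 2 [24/23]
md1 : active raid1 sda2[0] sdb2[1]
      14651200 blocks [2/2] [UU]
"""

if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
```

On a live system the text would come from reading /proc/mdstat instead of the sample string.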
XFS on raids:
# LANG=C df -ht xfs
Filesystem Size Used Avail Use% Mounted on
/dev/md2 6.1T 5.6T 466G 93% /backup2
/dev/md3 6.1T 4.8T 1.3T 79% /backup1
The system hangs after some days of operation. SysRq no longer works at that point.
After the reboot, at least one of the software RAID6 arrays can't be assembled, because at least one HDD is inaccessible (tested with smartctl -i /dev/sd__).
Only a halt and power-off, with the machine disconnected from the power grid for some seconds, makes all HDDs accessible again.
After this, severe XFS errors occur.
Sometimes xfs_repair needs the «-P» option to work at all.
I have also had xfs_check die with a segmentation fault before running xfs_repair.
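The per-disk accessibility test with smartctl -i can be looped over every whole disk listed in /proc/partitions. The following is only a sketch under assumptions: the helper names `whole_disks`, `smartctl_probe` and `inaccessible` are made up for illustration, a disk is treated as inaccessible whenever smartctl exits with a non-zero status, and smartmontools plus root privileges are required to run the real probe.

```python
import re
import subprocess

# Whole-disk names such as "sda" or "sdaa" (no trailing partition number).
DISK_RE = re.compile(r"^sd[a-z]+$")


def whole_disks(partitions_text):
    """Return /dev paths of whole sd* disks found in /proc/partitions text."""
    disks = []
    for line in partitions_text.splitlines():
        fields = line.split()
        if len(fields) == 4 and DISK_RE.match(fields[3]):
            disks.append("/dev/" + fields[3])
    return disks


def smartctl_probe(device):
    """Return True if `smartctl -i DEVICE` exits with status 0."""
    result = subprocess.run(
        ["smartctl", "-i", device],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def inaccessible(devices, probe=smartctl_probe):
    """Return the devices for which the probe reports a failure."""
    return [dev for dev in devices if not probe(dev)]


if __name__ == "__main__":
    with open("/proc/partitions") as handle:
        for dev in inaccessible(whole_disks(handle.read())):
            print("not accessible:", dev)
```

The probe is passed in as a parameter so the loop can be exercised without touching real hardware.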
The LSI controllers run the latest firmware:
1.30.00.00 18-DEC-09
from LSI:
http://
Maybe the driver has some problems?
Sadly, I can't run long debugging sessions because the server is critical for backups.
I'm stuck.
Kind regards
Lars
affects: | ubuntu → ecs (Ubuntu) |
Hi,
here is a dmesg log of the system.
There are a lot of messages and errors from the LSI driver module (mptbase). Maybe someone knows how to interpret these.
[...]
[48087.552179] mptbase: ioc0: LogInfo(0x31080000) : Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
[48087.552381] mptbase: ioc0: LogInfo(0x31080000) : Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
[48087.552606] mptbase: ioc0: LogInfo(0x31080000) : Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
[48087.552831] mptbase: ioc0: LogInfo(0x31080000) : Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
[48578.283740] mptbase: ioc2: LogInfo(0x31110700) : Originator={PL}, Code={Reset}, SubCode(0x0700)
[48579.771554] mptbase: ioc2: LogInfo(0x31110700) : Originator={PL}, Code={Reset}, SubCode(0x0700)
[48579.771724] mptbase: ioc2: LogInfo(0x31110700) : Originator={PL}, Code={Reset}, SubCode(0x0700)
[48579.771905] mptbase: ioc2: LogInfo(0x31110700) : Originator={PL}, Code={Reset}, SubCode(0x0700)
[...]
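As a partial aid to reading these messages, the two LogInfo values in the log (0x31080000 and 0x31110700) can be split into bit fields. The layout used here (top nibble bus type, next nibble originator, then an 8-bit code and a 16-bit subcode) is an assumption that matches how the driver's own decoded text lines up with both values; the name tables contain only what the log itself shows, and `decode_loginfo` is a hypothetical helper name.

```python
# Names taken only from the decoded text in the dmesg excerpt.
ORIGINATORS = {1: "PL"}
PL_CODES = {
    0x08: "SATA NCQ Fail All Commands After Error",
    0x11: "Reset",
}


def decode_loginfo(value):
    """Split a 32-bit LogInfo value into its assumed bit fields."""
    originator = (value >> 24) & 0xF
    code = (value >> 16) & 0xFF
    return {
        "bus_type": (value >> 28) & 0xF,   # assumed: top 4 bits
        "originator": ORIGINATORS.get(originator, "unknown"),
        "code": code,                      # assumed: bits 23..16
        "code_name": PL_CODES.get(code, "unknown"),
        "subcode": value & 0xFFFF,         # assumed: low 16 bits
    }


if __name__ == "__main__":
    for raw in (0x31080000, 0x31110700):
        print(hex(raw), decode_loginfo(raw))
```

Splitting the value this way at least makes it easy to group the messages by code and subcode when scanning a long dmesg log.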
Regards
Lars