read_all_sys test in ubuntu_ltp causes kernel panic and system reboot on node onibi
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
ubuntu-kernel-tests | New | Undecided | Unassigned |
Bug Description
This issue is not 100% reproducible, and it seems to be hardware related.
Issue found on AMD64 node onibi with X-4.4.0-256.290.
In my manual attempts, I had to restart the test about 4 times to trigger this issue.
Steps:
git clone https:/
cd ltp; make autotools; ./configure; make ; sudo make install
echo "read_all_sys read_all -d /sys -q -r 3" > /tmp/test
sudo /opt/ltp/runltp -f /tmp/test
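Since one run rarely triggers the panic, the last two steps can be wrapped in a retry loop. This is only a rough sketch: RUNLTP, SCENARIO, ATTEMPTS and the function name repeat_read_all_sys are hypothetical names introduced here, and the default of 10 attempts is an arbitrary assumption. Only the scenario line and the runltp invocation come from the steps above.

```shell
#!/bin/sh
# Sketch: rerun the read_all_sys scenario several times.
# RUNLTP, SCENARIO and ATTEMPTS are hypothetical knobs for this example;
# their defaults mirror the manual steps (ATTEMPTS=10 is an arbitrary guess).
RUNLTP="${RUNLTP:-sudo /opt/ltp/runltp}"
SCENARIO="${SCENARIO:-/tmp/test}"
ATTEMPTS="${ATTEMPTS:-10}"

repeat_read_all_sys() {
    # Same scenario file as in the manual steps.
    echo "read_all_sys read_all -d /sys -q -r 3" > "$SCENARIO"

    i=1
    while [ "$i" -le "$ATTEMPTS" ]; do
        echo "=== attempt $i of $ATTEMPTS ==="
        # Flush pending writes first: a panic reboots the node and
        # anything not yet on disk is lost.
        sync
        # $RUNLTP is left unquoted on purpose so "sudo /opt/ltp/runltp"
        # splits into command and argument.
        $RUNLTP -f "$SCENARIO" || echo "attempt $i: runltp exited non-zero"
        i=$((i + 1))
    done
}
```

Calling repeat_read_all_sys after sourcing the script starts the loop. If the node panics, the loop simply ends with the reboot, so the last "attempt" banner on the console shows how many runs were needed.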
Test output:
<<<test_start>>>
tag=read_all_sys stime=1719219587
cmdline="read_all -d /sys -q -r 3"
contacts=""
analysis=exit
<<<test_output>>>
incrementing stop
tst_test.c:1733: TINFO: LTP version: 20230929-
tst_test.c:1619: TINFO: Timeout per run is 0h 02m 10s
read_all.c:569: TINFO: Worker timeout set to 10% of max_runtime: 1000ms
read_all.c:449: TINFO: Worker 2242 (0): Stuck for 1000117us, restarting it
read_all.c:449: TINFO: Worker 2243 (1): Stuck for 1000048us, restarting it
read_all.c:449: TINFO: Worker 2244 (2): Stuck for 1000143us, restarting it
read_all.c:384: TINFO: Worker 2242 (0): Last popped '/sys/devices/
read_all.c:367: TINFO: Worker 2243 (1): Timeout waiting after kill
read_all.c:384: TINFO: Worker 2243 (1): Last popped '/sys/devices/
read_all.c:367: TINFO: Worker 2244 (2): Timeout waiting after kill
read_all.c:384: TINFO: Worker 2244 (2): Last popped '/sys/devices/
read_all.c:449: TINFO: Worker 2246 (0): Stuck for 1007574us, restarting it
read_all.c:384: TINFO: Worker 2246 (0): Last popped '/sys/devices/
read_all.c:449: TINFO: Worker 2247 (1): Stuck for 1001038us, restarting it
read_all.c:384: TINFO: Worker 2247 (1): Last popped '/sys/devices/
read_all.c:449: TINFO: Worker 2248 (2): Stuck for 1004113us, restarting it
read_all.c:384: TINFO: Worker 2248 (2): Last popped '/sys/devices/
(system reboots here)
Console output:
[ 8196.572925] LTP: starting read_all_sys (read_all -d /sys -q -r 3)
[ 8196.737991] mpt2sas_cm0: _ctl_host_
[ 8196.741766] mpt2sas_cm0: _ctl_host_
[ 8196.741994] mpt2sas_cm0: _ctl_BRM_
[ 8196.742186] mpt2sas_cm0: _ctl_host_
[ 8196.771177] mpt2sas_cm0: _ctl_BRM_
[ 8196.737991] m[ 8196.779105] mpt2sas_cm0: _ctl_host_
pt2sas_cm0: _ctl_host_
ace_buffer is not registered
[ 8196.741766] mpt2sas_cm0: _ctl_host_
[ 8196.741994] mpt2sas_cm0: _ctl_BRM_
[ 8196.742186] mpt2sas_cm0: _ctl_host_
[ 8196.771177] mpt2sas_cm0: _ctl_BRM_
[ 8196.779105] mpt2sas_cm0: _ctl_host_
[ 8196.924951] mpt2sas_cm0: fault_state(
[ 8196.929352] mpt2sas_cm0: sending diag reset !!
[ 8198.226444] mpt2sas_cm0: _config_request: timeout
[ 8198.226232] mpt2sas_cm0: diag reset: SUCCESS
[ 8198.226444] mpt2sas_cm0: _config_request: timeout
[ 8198.231246] mf:
[ 8198.231250] 04000001
[ 8198.231253] 00000000
[ 8198.231254] 00000000
[ 8198.231256] 00000000
[ 8198.231257] 00000000
[ 8198.231260] 08001a0a
[ 8198.231261] 1000004f
[ 8198.231263] d3000068
[ 8198.231264]
[ 8198.231267] 34eb7000
[ 8198.231269] 00000000
[ 8198.231271] 00000000
[ 8198.231275] mpt2sas_cm0: _config_request: attempting retry (1)
[ 8198.520262] mpt2sas_cm0: _ctl_host_
[ 8198.528695] mpt2sas_cm0: _ctl_BRM_
[ 8198.520262] m[ 8198.536511] mpt2sas_cm0: _ctl_host_
pt2sas_cm0: _ctl_host_
[ 8198.528695] mpt2sas_cm0: _ctl_BRM_
[ 8198.536511] mpt2sas_cm0: _ctl_host_
[ 8199.232921] mpt2sas_cm0: _config_request: waiting for operational state(count=1)
[ 8199.232926] mpt2sas_cm0: _config_request: ioc is operational
[ 8199.233040] mpt2sas_cm0: _config_request: retry (1) completed!!
[ 8199.234071] mpt2sas_cm0: log_info(
[ 8199.234108] mpt2sas_cm0: log_info(
[ 8199.234117] mpt2sas_cm0: LSISAS2008: FWVersion(
[ 8199.234120] mpt2sas_cm0: Protocol=(
[ 8199.234122] Initiator
[ 8199.234125] ,Target
[ 8199.234126] ),
[ 8199.234128] Capabilities=(
[ 8199.234130] Raid
[ 8199.234132] ,TLR
[ 8199.234133] ,EEDP
[ 8199.234136] ,Snapshot Buffer
[ 8199.234137] ,Diag Trace Buffer
[ 8199.234139] ,Task Set Full
[ 8199.234140] ,NCQ
[ 8199.234146] )
[ 8199.234660] mpt2sas_cm0: sending port enable !!
[ 8199.529930] mpt2sas_cm0: phy(0), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.530015] mpt2sas_cm0: phy(0), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.530425] mpt2sas_cm0: phy(0), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.530505] mpt2sas_cm0: phy(0), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.530656] mpt2sas_cm0: phy(0), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.530732] mpt2sas_cm0: phy(0), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.530835] mpt2sas_cm0: phy(0), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.530952] mpt2sas_cm0: phy(0), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.531031] mpt2sas_cm0: phy(0), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.531117] mpt2sas_cm0: phy(0), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.531189] mpt2sas_cm0: phy(0), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.531325] mpt2sas_cm0: phy(1), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.531398] mpt2sas_cm0: phy(0), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.531522] mpt2sas_cm0: phy(1), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.531771] mpt2sas_cm0: phy(1), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.531904] mpt2sas_cm0: phy(1), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.532027] mpt2sas_cm0: phy(1), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.532235] mpt2sas_cm0: phy(1), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.532314] mpt2sas_cm0: phy(1), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.532392] mpt2sas_cm0: phy(1), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.532469] mpt2sas_cm0: phy(1), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.532549] mpt2sas_cm0: phy(1), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.532625] mpt2sas_cm0: phy(1), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.532733] mpt2sas_cm0: phy(1), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.533461] mpt2sas_cm0: phy(2), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.533540] mpt2sas_cm0: phy(2), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.533979] mpt2sas_cm0: phy(2), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.534052] mpt2sas_cm0: phy(2), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.534354] mpt2sas_cm0: phy(2), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.534509] mpt2sas_cm0: phy(2), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.534548] mpt2sas_cm0: phy(2), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.534589] mpt2sas_cm0: phy(2), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.534663] mpt2sas_cm0: phy(2), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.534949] mpt2sas_cm0: phy(3), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.535025] mpt2sas_cm0: phy(2), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.535098] mpt2sas_cm0: phy(3), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.535336] mpt2sas_cm0: phy(2), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.535443] mpt2sas_cm0: phy(2), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.535519] mpt2sas_cm0: phy(3), ioc_status (0x0022), loginfo(
[ 8199.535615][ 8200.182253] {1}[Hardware Error]: event severity: fatal
mpt2sas_cm0: ph[ 8200.188920] {1}[Hardware Error]: Error 0, type: fatal
[ 8200.195624] {1}[Hardware Error]: section_type: PCIe error
y(3), ioc_status[ 8200.201372] {1}[Hardware Error]: port_type: 0, PCIe end point
[ 8200.208954] {1}[Hardware Error]: version: 1.0
(0x0022), login[ 8200.213549] {1}[Hardware Error]: command: 0x0407, status: 0x0010
[ 8200.221397] {1}[Hardware Error]: device_id: 0000:05:00.0
fo(0x310f0001)
[ 8200.227018] {1}[Hardware Error]: slot: 2
[ 8200.232669] {1}[Hardware Error]: secondary_bus: 0x00
[ 8200.238064] {1}[Hardware Error]: vendor_id: 0x1000, device_id: 0x0072
[ 8199.535904] m[ 8200.244761] {1}[Hardware Error]: class_code: 010700
[ 8200.251368] {1}[Hardware Error]: aer_uncor_status: 0x00000000, aer_uncor_mask: 0x00018000
pt2sas_cm0: phy([ 8200.259909] {1}[Hardware Error]: aer_uncor_severity: 0x000e7031
[ 8200.267537] {1}[Hardware Error]: TLP Header: 00000000 00000000 00000000 00000000
3), ioc_status ([ 8200.275300] Kernel panic - not syncing: Fatal hardware error!
0x0022), loginfo[ 8200.282806] Kernel Offset: disabled
I can reproduce this issue with an older version of our LTP fork (upstream head commit: cbc2d05684 mkdir03: Convert docs to docparse).
I can reproduce this with X 4.4.0-254-generic as well: one failure across multiple attempts.
It seems more likely that this is a corner case we were not lucky enough to catch in the past, as it appears to reproduce only on this hardware.