read_all_sys test in ubuntu_ltp causes kernel panic and system reboot on node onibi

Bug #2070262 reported by Po-Hsu Lin
This bug affects 1 person
Affects: ubuntu-kernel-tests
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

This issue is not 100% reproducible, and it seems to be hardware-related.

Issue found on AMD64 node onibi with X-4.4.0-256.290.

During my manual attempts, I had to restart the test about 4 times to trigger this issue.

Steps:
git clone https://git.launchpad.net/~canonical-kernel-team/+git/ltp
cd ltp; make autotools; ./configure; make ; sudo make install
echo "read_all_sys read_all -d /sys -q -r 3" > /tmp/test
sudo /opt/ltp/runltp -f /tmp/test

Test output:
<<<test_start>>>
tag=read_all_sys stime=1719219587
cmdline="read_all -d /sys -q -r 3"
contacts=""
analysis=exit
<<<test_output>>>
incrementing stop
tst_test.c:1733: TINFO: LTP version: 20230929-609-gbdcd225
tst_test.c:1619: TINFO: Timeout per run is 0h 02m 10s
read_all.c:569: TINFO: Worker timeout set to 10% of max_runtime: 1000ms
read_all.c:449: TINFO: Worker 2242 (0): Stuck for 1000117us, restarting it
read_all.c:449: TINFO: Worker 2243 (1): Stuck for 1000048us, restarting it
read_all.c:449: TINFO: Worker 2244 (2): Stuck for 1000143us, restarting it
read_all.c:384: TINFO: Worker 2242 (0): Last popped '/sys/devices/pci0000:00/0000:00:05.0/0000:05:00.0/vpd'
read_all.c:367: TINFO: Worker 2243 (1): Timeout waiting after kill
read_all.c:384: TINFO: Worker 2243 (1): Last popped '/sys/devices/pci0000:00/0000:00:05.0/0000:05:00.0/host4/target4:1:0/4:1:0:0/raid_devices/4:1:0:0/state'
read_all.c:367: TINFO: Worker 2244 (2): Timeout waiting after kill
read_all.c:384: TINFO: Worker 2244 (2): Last popped '/sys/devices/pci0000:00/0000:00:05.0/0000:05:00.0/host4/target4:1:0/4:1:0:0/raid_devices/4:1:0:0/state'
read_all.c:449: TINFO: Worker 2246 (0): Stuck for 1007574us, restarting it
read_all.c:384: TINFO: Worker 2246 (0): Last popped '/sys/devices/pci0000:00/0000:00:05.0/0000:05:00.0/host4/port-4:1/end_device-4:1/target4:0:1/4:0:1:0/evt_media_change'
read_all.c:449: TINFO: Worker 2247 (1): Stuck for 1001038us, restarting it
read_all.c:384: TINFO: Worker 2247 (1): Last popped '/sys/devices/pci0000:00/0000:00:1c.4/0000:02:00.0/config'
read_all.c:449: TINFO: Worker 2248 (2): Stuck for 1004113us, restarting it
read_all.c:384: TINFO: Worker 2248 (2): Last popped '/sys/devices/pci0000:00/0000:00:1c.4/0000:02:00.1/vpd'
(system reboots here)

Console output:
[ 8196.572925] LTP: starting read_all_sys (read_all -d /sys -q -r 3)
[ 8196.737991] mpt2sas_cm0: _ctl_host_trace_buffer_show: host_trace_buffer is not registered
[ 8196.741766] mpt2sas_cm0: _ctl_host_trace_buffer_show: host_trace_buffer is not registered
[ 8196.741994] mpt2sas_cm0: _ctl_BRM_status_show: BRM attribute is only for warpdrive
[ 8196.742186] mpt2sas_cm0: _ctl_host_trace_buffer_size_show: host_trace_buffer is not registered
[ 8196.771177] mpt2sas_cm0: _ctl_BRM_status_show: BRM attribute is only for warpdrive
[ 8196.779105] mpt2sas_cm0: _ctl_host_trace_buffer_size_show: host_trace_buffer is not registered
[ 8196.924951] mpt2sas_cm0: fault_state(0x0d04)!
[ 8196.929352] mpt2sas_cm0: sending diag reset !!
[ 8198.226232] mpt2sas_cm0: diag reset: SUCCESS
[ 8198.226444] mpt2sas_cm0: _config_request: timeout
[ 8198.231246] mf:
[ 8198.231250] 04000001 00000000 00000000 00000000 00000000 08001a0a 1000004f d3000068
[ 8198.231267] 34eb7000 00000000 00000000
[ 8198.231275] mpt2sas_cm0: _config_request: attempting retry (1)
[ 8198.520262] mpt2sas_cm0: _ctl_host_trace_buffer_show: host_trace_buffer is not registered
[ 8198.528695] mpt2sas_cm0: _ctl_BRM_status_show: BRM attribute is only for warpdrive
[ 8198.536511] mpt2sas_cm0: _ctl_host_trace_buffer_size_show: host_trace_buffer is not registered
[ 8199.232921] mpt2sas_cm0: _config_request: waiting for operational state(count=1)
[ 8199.232926] mpt2sas_cm0: _config_request: ioc is operational
[ 8199.233040] mpt2sas_cm0: _config_request: retry (1) completed!!
[ 8199.234071] mpt2sas_cm0: log_info(0x30030100): originator(IOP), code(0x03), sub_code(0x0100)
[ 8199.234108] mpt2sas_cm0: log_info(0x30030100): originator(IOP), code(0x03), sub_code(0x0100)
[ 8199.234117] mpt2sas_cm0: LSISAS2008: FWVersion(07.15.08.00), ChipRevision(0x03), BiosVersion(07.11.10.00)
[ 8199.234120] mpt2sas_cm0: Protocol=(Initiator,Target), Capabilities=(Raid,TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
[ 8199.234660] mpt2sas_cm0: sending port enable !!
[ 8199.529930] mpt2sas_cm0: phy(0), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.530015] mpt2sas_cm0: phy(0), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.530425] mpt2sas_cm0: phy(0), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.530505] mpt2sas_cm0: phy(0), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.530656] mpt2sas_cm0: phy(0), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.530732] mpt2sas_cm0: phy(0), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.530835] mpt2sas_cm0: phy(0), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.530952] mpt2sas_cm0: phy(0), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.531031] mpt2sas_cm0: phy(0), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.531117] mpt2sas_cm0: phy(0), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.531189] mpt2sas_cm0: phy(0), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.531325] mpt2sas_cm0: phy(1), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.531398] mpt2sas_cm0: phy(0), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.531522] mpt2sas_cm0: phy(1), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.531771] mpt2sas_cm0: phy(1), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.531904] mpt2sas_cm0: phy(1), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.532027] mpt2sas_cm0: phy(1), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.532235] mpt2sas_cm0: phy(1), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.532314] mpt2sas_cm0: phy(1), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.532392] mpt2sas_cm0: phy(1), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.532469] mpt2sas_cm0: phy(1), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.532549] mpt2sas_cm0: phy(1), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.532625] mpt2sas_cm0: phy(1), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.532733] mpt2sas_cm0: phy(1), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.533461] mpt2sas_cm0: phy(2), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.533540] mpt2sas_cm0: phy(2), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.533979] mpt2sas_cm0: phy(2), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.534052] mpt2sas_cm0: phy(2), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.534354] mpt2sas_cm0: phy(2), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.534509] mpt2sas_cm0: phy(2), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.534548] mpt2sas_cm0: phy(2), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.534589] mpt2sas_cm0: phy(2), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.534663] mpt2sas_cm0: phy(2), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.534949] mpt2sas_cm0: phy(3), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.535025] mpt2sas_cm0: phy(2), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.535098] mpt2sas_cm0: phy(3), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.535336] mpt2sas_cm0: phy(2), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.535443] mpt2sas_cm0: phy(2), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.535519] mpt2sas_cm0: phy(3), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.535615] mpt2sas_cm0: phy(3), ioc_status (0x0022), loginfo(0x310f0001)
[ 8199.535904] mpt2sas_cm0: phy(3), ioc_status (0x0022), loginfo(0x310f0001)
[ 8200.171467] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 32993
[ 8200.182253] {1}[Hardware Error]: event severity: fatal
[ 8200.188920] {1}[Hardware Error]: Error 0, type: fatal
[ 8200.195624] {1}[Hardware Error]: section_type: PCIe error
[ 8200.201372] {1}[Hardware Error]: port_type: 0, PCIe end point
[ 8200.208954] {1}[Hardware Error]: version: 1.0
[ 8200.213549] {1}[Hardware Error]: command: 0x0407, status: 0x0010
[ 8200.221397] {1}[Hardware Error]: device_id: 0000:05:00.0
[ 8200.227018] {1}[Hardware Error]: slot: 2
[ 8200.232669] {1}[Hardware Error]: secondary_bus: 0x00
[ 8200.238064] {1}[Hardware Error]: vendor_id: 0x1000, device_id: 0x0072
[ 8200.244761] {1}[Hardware Error]: class_code: 010700
[ 8200.251368] {1}[Hardware Error]: aer_uncor_status: 0x00000000, aer_uncor_mask: 0x00018000
[ 8200.259909] {1}[Hardware Error]: aer_uncor_severity: 0x000e7031
[ 8200.267537] {1}[Hardware Error]: TLP Header: 00000000 00000000 00000000 00000000
[ 8200.275300] Kernel panic - not syncing: Fatal hardware error!
[ 8200.282806] Kernel Offset: disabled

I can reproduce this issue with an older version of our LTP fork (upstream head commit: cbc2d05684 mkdir03: Convert docs to docparse).

Revision history for this message
Po-Hsu Lin (cypressyew) wrote:

I can reproduce this with X 4.4.0-254-generic as well: one failure across multiple attempts.
It seems more likely that this is a corner case we simply never hit before, since it can only be reproduced on this particular hardware.
