kernel crash power 8 bare metal

Bug #1354459 reported by Scott Moser
This bug affects 1 person
Affects          Status    Importance   Assigned to   Milestone
linux (Ubuntu)   Expired   High         Unassigned

Bug Description

I'm seeing crashes on POWER8 bare metal (PowerNV).
They happen intermittently, not on every boot.

[ 66.852889] Workqueue: events .work_for_cpu_fn
[ 66.852950] Call Trace:
[ 66.852977] [c000000fe6edae60] [c000000000016af0] .show_stack+0x170/0x290 (unreliable)
[ 66.853063] [c000000fe6edaf50] [c000000000966fc0] .dump_stack+0x88/0xb4
[ 66.853138] [c000000fe6edafd0] [c000000000111680] .rcu_check_callbacks+0x5b0/0x950
[ 66.853225] [c000000fe6edb100] [c0000000000ad2f8] .update_process_times+0x58/0xb0
[ 66.853311] [c000000fe6edb190] [c00000000011f890] .tick_sched_handle.isra.17+0x40/0xd0
[ 66.853397] [c000000fe6edb220] [c00000000011f984] .tick_sched_timer+0x64/0xa0
[ 66.853472] [c000000fe6edb2c0] [c0000000000cda50] .__run_hrtimer+0xa0/0x270
[ 66.853546] [c000000fe6edb360] [c0000000000ce948] .hrtimer_interrupt+0x148/0x330
[ 66.853633] [c000000fe6edb470] [c000000000020a00] .timer_interrupt+0x120/0x2c0
[ 66.853718] [c000000fe6edb520] [c0000000000023d8] decrementer_common+0x158/0x180
[ 66.853805] --- Exception: 901 at ._raw_spin_lock_irqsave+0xb0/0x110
[ 66.853805] LR = ._raw_spin_lock_irqsave+0xe8/0x110
[ 66.853916] [c000000fe6edb810] [c0000000000f0cb4] .finish_wait+0x74/0xb0 (unreliable)
[ 66.854008] [c000000fe6edb8a0] [d00000000eeaa4f4] .ipr_probe_ioa+0xd84/0x1370 [ipr]
[ 66.854096] [c000000fe6edb9d0] [d00000000eeb25c4] .ipr_probe+0x44/0x4c0 [ipr]
[ 66.854171] [c000000fe6edbac0] [c000000000516cfc] .local_pci_probe+0x4c/0xe0
[ 66.854245] [c000000fe6edbb40] [c0000000000bae68] .work_for_cpu_fn+0x38/0x60
[ 66.854319] [c000000fe6edbbc0] [c0000000000bf628] .process_one_work+0x1a8/0x4d0
[ 66.854405] [c000000fe6edbc60] [c0000000000c04fc] .worker_thread+0x38c/0x4a0
[ 66.854479] [c000000fe6edbd30] [c0000000000c98a0] .kthread+0x110/0x130
[ 66.854553] [c000000fe6edbe30] [c00000000000a460] .ret_from_kernel_thread+0x5c/0x7c
[ 72.412653] BUG: soft lockup - CPU#40 stuck for 22s! [kworker/40:0:209]
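When triaging a trace like the one above, it can help to pull out just the frames that belong to a loadable module, since those point at the driver involved (here, ipr). The following is a minimal sketch, not an official tool: the regular expression and the helper names `parse_frames` and `module_frames` are illustrative assumptions, and the pattern only targets the ppc64 frame layout seen in this report.

```python
import re

# Matches stack frames in the layout seen in this report, for example:
# [ 66.854008] [c000000fe6edb8a0] [d00000000eeaa4f4] .ipr_probe_ioa+0xd84/0x1370 [ipr]
# The pattern is an assumption based on these logs, not a general dmesg parser.
FRAME_RE = re.compile(
    r"\[\s*\d+\.\d+\]\s+"                 # timestamp
    r"\[[0-9a-f]+\]\s+"                   # stack pointer
    r"\[[0-9a-f]+\]\s+"                   # return address
    r"\.?(?P<symbol>[\w.]+)"              # symbol, optional leading dot
    r"\+0x[0-9a-f]+/0x[0-9a-f]+"          # offset/size
    r"(?:\s+\[(?P<module>\w+)\])?"        # optional module name
)


def parse_frames(log_text):
    """Return (symbol, module) pairs; module is None for built-in code."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append((match.group("symbol"), match.group("module")))
    return frames


def module_frames(frames):
    """Keep only the symbols that were attributed to a loadable module."""
    return [symbol for symbol, module in frames if module is not None]


# A few frames copied from the trace in this report.
SAMPLE = """\
[ 66.853916] [c000000fe6edb810] [c0000000000f0cb4] .finish_wait+0x74/0xb0 (unreliable)
[ 66.854008] [c000000fe6edb8a0] [d00000000eeaa4f4] .ipr_probe_ioa+0xd84/0x1370 [ipr]
[ 66.854096] [c000000fe6edb9d0] [d00000000eeb25c4] .ipr_probe+0x44/0x4c0 [ipr]
[ 66.854171] [c000000fe6edbac0] [c000000000516cfc] .local_pci_probe+0x4c/0xe0
"""

if __name__ == "__main__":
    for symbol in module_frames(parse_frames(SAMPLE)):
        print(symbol)
```

Lines that do not follow the frame layout, such as the "--- Exception" and "BUG: soft lockup" lines, are simply skipped.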

Scott Moser (smoser) wrote :
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1354459

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Joseph Salisbury (jsalisbury) wrote :

Do you have a list of steps to reproduce this bug?

Did this issue start happening after an update/upgrade? Was there a kernel version where you were not having this particular problem? This will help determine if the problem you are seeing is the result of the introduction of a regression, and when this regression was introduced. If this is a regression, we can perform a kernel bisect to identify the commit that introduced the problem.

Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-da-key ppc64el
tags: added: trusty
Scott Moser (smoser) wrote :
Anton Blanchard (anton-samba) wrote :

We took an EEH error:

[ 44.793204] pnv_pci_dump_phb_diag_data: Unrecognized ioType 33554432
[ 44.793267] EEH: Frozen PE#5 detected on PHB#3
[ 44.793318] CPU: 40 PID: 209 Comm: kworker/40:0 Not tainted 3.13.0-27-generic #50-Ubuntu
[ 44.793396] Workqueue: events .work_for_cpu_fn
[ 44.793458] Call Trace:
[ 44.793487] [c000000fe6edb540] [c000000000016af0] .show_stack+0x170/0x290 (unreliable)
[ 44.793575] [c000000fe6edb630] [c000000000966fc0] .dump_stack+0x88/0xb4
[ 44.793651] [c000000fe6edb6b0] [c0000000000364b0] .eeh_dev_check_failure+0x430/0x480
[ 44.793737] [c000000fe6edb760] [c000000000036584] .eeh_check_failure+0x84/0xe0
[ 44.793827] [c000000fe6edb7f0] [d00000000eea33e0] .ipr_mask_and_clear_interrupts+0x190/0x1d0 [ipr]
[ 44.793928] [c000000fe6edb8a0] [d00000000eeaa394] .ipr_probe_ioa+0xc24/0x1370 [ipr]
[ 44.794017] [c000000fe6edb9d0] [d00000000eeb25c4] .ipr_probe+0x44/0x4c0 [ipr]
[ 44.794093] [c000000fe6edbac0] [c000000000516cfc] .local_pci_probe+0x4c/0xe0
[ 44.794167] [c000000fe6edbb40] [c0000000000bae68] .work_for_cpu_fn+0x38/0x60
[ 44.794242] [c000000fe6edbbc0] [c0000000000bf628] .process_one_work+0x1a8/0x4d0
[ 44.794327] [c000000fe6edbc60] [c0000000000c04fc] .worker_thread+0x38c/0x4a0
[ 44.794401] [c000000fe6edbd30] [c0000000000c98a0] .kthread+0x110/0x130
[ 44.794476] [c000000fe6edbe30] [c00000000000a460] .ret_from_kernel_thread+0x5c/0x7c
[ 44.794572] EEH: Detected PCI bus error on PHB#3-PE#5
[ 44.794632] EEH: This PCI device has failed 1 times in the last hour
[ 44.794693] EEH: Notify device drivers to shutdown
[ 44.794749] Unable to handle kernel paging request for data at address 0x00000008
[ 44.794821] Faulting instruction address: 0xd00000000eea205c
[ 44.794883] Oops: Kernel access of bad area, sig: 11 [#1]
[ 44.794931] SMP NR_CPUS=2048 NUMA PowerNV
[ 44.794982] Modules linked in: ipr(+)
[ 44.795046] CPU: 9 PID: 810 Comm: eehd Not tainted 3.13.0-27-generic #50-Ubuntu
[ 44.795120] task: c000000fdf7066f0 ti: c000000fe33a4000 task.ti: c000000fe33a4000
[ 44.795192] NIP: d00000000eea205c LR: d00000000eea2a14 CTR: c00000000064f720
[ 44.795264] REGS: c000000fe33a75b0 TRAP: 0300 Not tainted (3.13.0-27-generic)
[ 44.795336] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 229d0028 XER: 20000000
[ 44.795641] CFAR: c000000000009318 DAR: 0000000000000008 DSISR: 40000000 SOFTE: 0
GPR00: d00000000eea2a14 c000000fe33a7830 d00000000eec4c58 c000000fda20cc60
GPR04: d00000000eebc178 0000000000000100 9000000100009033 ffffffffffffffff
GPR08: 0000000000000001 0000000000000000 0000000000000000 c00000000064f720
GPR12: d00000000eeb4978 c00000000fe41f80 c0000000000c9790 c000001fd8401600
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000000bf9528
GPR24: c000000000bf9500 0000000000000000 d00000000eebc178 0000000000000100
GPR28: 0000000000000000 c000000fda20c538 0000000000000000 c000000fda20c538
[ 44.797447] NIP [d00000000eea205c] .ipr_get_free_ipr_cmnd+0x2c/0x90 [ipr]
[ 44.797567] LR [d00000000eea2a14] ._ipr_initiate_ioa_reset+0xe4/0x130 [ipr]
[ 44.797683] Call Trace:
[ 44.797736] [c000000fe3...

Anton Blanchard (anton-samba) wrote :

I spoke to Gavin about this:

1. EEH has endian issues in 3.13; it should work in 3.16. (EEH is our I/O error recovery mechanism.)

2. There was an issue in the IPR driver with early EEH errors, fixed in 3.15 with commit 6270e5932a01 "[SCSI] ipr: Handle early EEH"

Would it be possible to update to the Utopic kernel?
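The two version thresholds above (3.15 for the ipr early-EEH fix, 3.16 for working EEH) can be checked against a running kernel's release string. This is a rough sketch under stated assumptions: the helper names are hypothetical, and it compares only the upstream major.minor version, so it cannot tell whether a distribution kernel has backported either fix.

```python
import re

# Thresholds taken from the comment above; everything else is illustrative.
IPR_EARLY_EEH_FIX = (3, 15)   # commit 6270e5932a01 "[SCSI] ipr: Handle early EEH"
EEH_ENDIAN_WORKING = (3, 16)  # EEH "should work in 3.16"


def kernel_version_tuple(release):
    """Extract (major, minor) from a release string such as '3.13.0-27-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))


def has_upstream_fixes(release):
    """True if the upstream version is at or past both thresholds.

    Heuristic only: a distribution kernel may carry backports that this
    comparison cannot see.
    """
    needed = max(IPR_EARLY_EEH_FIX, EEH_ENDIAN_WORKING)
    return kernel_version_tuple(release) >= needed


if __name__ == "__main__":
    import platform
    print(platform.release(), has_upstream_fixes(platform.release()))
```

For the kernel in this report, `has_upstream_fixes("3.13.0-27-generic")` is false, which is consistent with the suggestion to move to a newer kernel.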

Wendy Xiong (wenxiong) wrote :

Also, can you check the firmware/boot level on the SAS adapters? Thanks! Run iprconfig and select option 1 to see the detailed information for each adapter.

Tammy Yang (wanchingy) wrote :

Please retest with the neihu 140919-1 or later image, which uses kernel 3.13.0-36.63+hwe3.

tags: added: bdw-bug
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired
Joseph Salisbury (jsalisbury) wrote :

This looks very similar to bug 1425699.

Is this still an issue with the latest kernel?

Changed in linux (Ubuntu):
status: Expired → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired