MCE on shutdown when nouveau driver loaded

Bug #1908294 reported by dann frazier
This bug affects 1 person
Affects          Status      Importance   Assigned to   Milestone
linux (Ubuntu)   Confirmed   Undecided    Unassigned
Focal            Confirmed   Undecided    Unassigned
Groovy           Won't Fix   Undecided    Unassigned
Hirsute          Confirmed   Undecided    Unassigned

Bug Description

[Impact]
When rebooting with the focal kernel, my system always hits a machine check exception (MCE). Installing an nvidia driver, or simply blacklisting the nouveau driver, avoids the issue.

Sometimes it hard hangs the system, requiring a manual power cycle:

[ OK ] Reached target Reboot.
[ 402.489755] Disabling lock debugging due to kernel taint
[ 402.495319] mce: [Hardware Error]: CPU 24: Machine Check Exception: 5 Bank 6: bb80000000000e0b
[ 402.503924] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff9ead91c7> {intel_idle+0x87/0x130}
[ 402.512530] mce: [Hardware Error]: TSC 29fb4740af0 MISC d7000000
[ 402.518622] mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1601415822 SOCKET 1 APIC 40 microcode 2006906
[ 402.527998] mce: [Hardware Error]: Run the above through 'mcelog --ascii'

Other times it emits the MCE tombstone, but goes ahead and reboots itself:

[ OK ] Reached target Reboot.
[ 870.372933] Disabling lock debugging due to kernel taint
[ 870.378505] mce: [Hardware Error]: CPU 24: Machine Check Exception: 5 Bank 6: bb80000000000e0b
[ 870.387110] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8e2d4847> {intel_idle+0x87/0x130}
[ 870.395716] mce: [Hardware Error]: TSC 44e0f5e602c MISC d7000000
[ 870.401801] mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1589320331 SOCKET 1 APIC 40 microcode 2000064
[ 870.411185] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 870.420531] mce: [Hardware Error]: Machine check: Processor context corrupt
[ 870.427488] Kernel panic - not syncing: Fatal machine check
[ 870.433108] Kernel Offset: 0xc800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 871.054820] Rebooting in 30 seconds..
[ 900.901238] ACPI MEMORY or I/O RESET_REG.

Copyright(c) 2015 American Megatrends, Inc.
0x19 : Pre-memory SB Initialization.
Copyright(c) 2016 American Megatrends, Inc.
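
The status word in both logs above (bb80000000000e0b) can be unpacked by hand without mcelog. The following is a minimal sketch, assuming the architectural IA32_MCi_STATUS layout from the Intel SDM and assuming this CPU supports software error recovery (otherwise the S and AR bits carry no architectural meaning). The function names are made up for illustration, and only the compound "bus and interconnect" error-code form is handled.

```python
# Sketch: decode an IA32_MCi_STATUS value as printed on the
# "Machine Check Exception: ... Bank N: <status>" line.
# Assumption: architectural bit layout per the Intel SDM; S/AR are only
# meaningful on CPUs with software error recovery support.

STATUS_FLAGS = {
    63: "VAL",    # register contents are valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context corrupt
    56: "S",      # signaled via machine check
    55: "AR",     # action required
}

PARTICIPATION = {0: "SRC", 1: "RES", 2: "OBS", 3: "generic"}
REQUEST = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
           5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
MEMORY_IO = {0: "M", 1: "reserved", 2: "IO", 3: "other"}
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "LG"}


def decode_mci_status(status):
    """Split a 64-bit status word into flag names and error-code fields."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
    }


def decode_bus_interconnect(code):
    """Decode the compound form 0000 1PPT RRRR IILL, or return None."""
    if (code >> 11) != 1:
        return None  # some other error-code family; not handled in this sketch
    return {
        "participation": PARTICIPATION[(code >> 9) & 0x3],
        "timeout": bool((code >> 8) & 0x1),
        "request": REQUEST.get((code >> 4) & 0xF, "unknown"),
        "memory_io": MEMORY_IO[(code >> 2) & 0x3],
        "level": LEVEL[code & 0x3],
    }


if __name__ == "__main__":
    decoded = decode_mci_status(0xBB80000000000E0B)
    print(decoded["flags"])
    print(hex(decoded["mca_error_code"]))
    print(decode_bus_interconnect(decoded["mca_error_code"]))
```

Under those assumptions, the PCC flag is set in this value, which is consistent with the "Processor context corrupt" line in the second log, and the low 16 bits decode as a generic bus/interconnect error on an I/O transaction.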

dann frazier (dannf) wrote :
Changed in linux (Ubuntu):
status: New → Confirmed
Kai-Heng Feng (kaihengfeng) wrote :

Dann, is this for SRU purposes?

dann frazier (dannf) wrote :

@kaihengfeng - It is unclear where the issue lies. If it turns out to be a driver bug, and we find an SRUable fix, then I'd like to see it SRU'd. The issue is still reproducible with latest upstream kernels.

Changed in linux (Ubuntu Groovy):
status: New → Confirmed
Changed in linux (Ubuntu Focal):
status: New → Confirmed
Kai-Heng Feng (kaihengfeng) wrote :

Does turning IOMMU off help? Maybe nouveau is touching something outside of its DMA range.

Anyway this will need nouveau devs to take a look.

dann frazier (dannf) wrote :

Yes, it is still reproducible with the IOMMU disabled (intel_iommu=off). See attached log.

Brian Murray (brian-murray) wrote :

The Groovy Gorilla has reached end of life, so this bug will not be fixed for that release.

Changed in linux (Ubuntu Groovy):
status: Confirmed → Won't Fix