Radeon Pro W5500 in passthrough with vfio generates spurious NMI reason 25

Bug #1963893 reported by Ruben De Smet
This bug affects 1 person
Affects                          Status  Importance  Assigned to  Milestone
linux-signed-hwe-5.13 (Ubuntu)   New     Undecided   Unassigned

Bug Description

First encountered on the 5.4 kernel, and still present on the HWE 5.13 kernel.

Description: Ubuntu 20.04.4 LTS
Release: 20.04

We have three of these cards in three identical HP DL325 Gen10 servers with EPYC 7302P CPUs.

ruben@alpha:~$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-5.13.0-30-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro amd_iommu=on vfio-pci.ids=1002:7341,1002:ab38 nofb iommu=pt
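
To double-check which passthrough-related parameters are actually active, the command line above can be split into key/value pairs. This is only a minimal sketch: `parse_cmdline` is a hypothetical helper, not part of any existing tool, and it deliberately ignores the quoting rules that the kernel's own parser supports.

```python
def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into a dict (simplified: no quote handling)."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        # Flags without '=' (e.g. "ro", "nofb") are stored as True.
        params[key] = value if sep else True
    return params


# Command line copied from the report above.
CMDLINE = (
    "BOOT_IMAGE=/vmlinuz-5.13.0-30-generic "
    "root=/dev/mapper/ubuntu--vg-ubuntu--lv ro amd_iommu=on "
    "vfio-pci.ids=1002:7341,1002:ab38 nofb iommu=pt"
)

params = parse_cmdline(CMDLINE)
vfio_ids = params.get("vfio-pci.ids", "").split(",")

print("vfio-pci ids:", vfio_ids)
print("amd_iommu:", params.get("amd_iommu"), "iommu:", params.get("iommu"))
```

On a live host the string could be read from /proc/cmdline instead of being hard-coded.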

dmesg excerpt (with vendor-reset):

[ 412.868799] vfio-pci 0000:86:00.0: enabling device (0142 -> 0143)
[ 412.868980] vfio-pci 0000:86:00.0: AMD_NAVI14: version 1.1
[ 412.868982] vfio-pci 0000:86:00.0: AMD_NAVI14: performing pre-reset
[ 412.888842] vfio-pci 0000:86:00.0: AMD_NAVI14: performing reset
[ 412.925218] ATOM BIOS: 113-D3250100-102
[ 412.925221] vendor-reset-drm: atomfirmware: bios_scratch_reg_offset initialized to 4c
[ 413.171020] vfio-pci 0000:86:00.0: AMD_NAVI14: bus reset disabled? yes
[ 413.171028] vfio-pci 0000:86:00.0: AMD_NAVI14: SMU response reg: 0, sol reg: 0, mp1 intr enabled? no, bl ready? yes
[ 413.171035] vfio-pci 0000:86:00.0: AMD_NAVI14: performing post-reset
[ 413.208794] vfio-pci 0000:86:00.0: AMD_NAVI14: reset result = 0
[ 413.208971] vfio-pci 0000:86:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
[ 413.208985] vfio-pci 0000:86:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
[ 413.208990] vfio-pci 0000:86:00.0: vfio_ecap_init: hiding ecap 0x25@0x400
[ 413.208992] vfio-pci 0000:86:00.0: vfio_ecap_init: hiding ecap 0x26@0x410
[ 413.208994] vfio-pci 0000:86:00.0: vfio_ecap_init: hiding ecap 0x27@0x440
[ 413.228798] vfio-pci 0000:86:00.1: enabling device (0140 -> 0142)
[ 413.296899] vfio-pci 0000:86:00.0: AMD_NAVI14: version 1.1
[ 413.296904] vfio-pci 0000:86:00.0: AMD_NAVI14: performing pre-reset
[ 413.297096] vfio-pci 0000:86:00.0: AMD_NAVI14: performing reset
[ 413.333349] ATOM BIOS: 113-D3250100-102
[ 413.333351] vendor-reset-drm: atomfirmware: bios_scratch_reg_offset initialized to 4c
[ 413.579787] vfio-pci 0000:86:00.0: AMD_NAVI14: bus reset disabled? yes
[ 413.579793] vfio-pci 0000:86:00.0: AMD_NAVI14: SMU response reg: 0, sol reg: 0, mp1 intr enabled? no, bl ready? yes
[ 413.579797] vfio-pci 0000:86:00.0: AMD_NAVI14: performing post-reset
[ 413.616795] vfio-pci 0000:86:00.0: AMD_NAVI14: reset result = 0
[ 419.766917] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[ 419.766919] Do you have a strange power saving mode enabled?
[ 419.766920] Dazed and confused, but trying to continue
[ 436.498601] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[ 436.498604] Do you have a strange power saving mode enabled?
[ 436.498605] Dazed and confused, but trying to continue
[ 454.306951] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[ 454.306955] Do you have a strange power saving mode enabled?
[ 454.306955] Dazed and confused, but trying to continue
[ 456.237162] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[ 456.237165] Do you have a strange power saving mode enabled?
[ 456.237166] Dazed and confused, but trying to continue
[ 457.800596] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[ 457.800598] Do you have a strange power saving mode enabled?
[ 457.800599] Dazed and confused, but trying to continue
[ 474.068911] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[ 474.068914] Do you have a strange power saving mode enabled?
[ 474.068915] Dazed and confused, but trying to continue
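
To quantify how often the NMIs arrive, the timestamps can be pulled out of the log. The following is a rough sketch that works on the six NMI lines from the excerpt above; the regular expression and the `nmi_events` helper are illustrative assumptions and only match this exact message wording.

```python
import re

# The six NMI lines copied from the dmesg excerpt above.
SAMPLE = """\
[ 419.766917] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[ 436.498601] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[ 454.306951] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[ 456.237162] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[ 457.800596] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[ 474.068911] Uhhuh. NMI received for unknown reason 25 on CPU 0.
"""

NMI_RE = re.compile(
    r"^\[\s*(\d+\.\d+)\] Uhhuh\. NMI received for unknown reason "
    r"([0-9a-f]+) on CPU (\d+)\.",
    re.MULTILINE,
)


def nmi_events(log: str) -> list:
    """Return (timestamp_seconds, reason_string, cpu) for each NMI line."""
    return [(float(ts), reason, int(cpu)) for ts, reason, cpu in NMI_RE.findall(log)]


events = nmi_events(SAMPLE)
# Seconds between consecutive NMIs, rounded for readability.
gaps = [round(b[0] - a[0], 3) for a, b in zip(events, events[1:])]

print("NMI count:", len(events))
print("gaps between NMIs (s):", gaps)
```

Feeding the full output of `dmesg` into `nmi_events` instead of `SAMPLE` would give the same statistics for a complete session.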

This happens both with and without the vendor-reset workaround (https://github.com/gnif/vendor-reset/). The GPU works "fine" in a VM (OpenStack, KVM), but it frequently generates these spurious NMIs, especially while the VM is booting and when ROCm is used inside the VM (e.g. clinfo).
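
For context, and as an assumption that is not stated in this report: on x86 the "reason" in this message is normally the byte read from the NMI status port (0x61), printed in hexadecimal, and the kernel reports it as "unknown" when neither the SERR bit (0x80) nor the IOCHK bit (0x40) is set. The sketch below checks that for the value seen here; the constant names mirror the kernel's, while `classify_nmi_reason` is a hypothetical helper.

```python
# Bit masks as used by the x86 NMI handler (assumed from kernel headers).
NMI_REASON_SERR = 0x80   # PCI system error
NMI_REASON_IOCHK = 0x40  # I/O channel check


def classify_nmi_reason(reason_hex: str) -> dict:
    """Decode the two documented cause bits of an NMI reason byte."""
    reason = int(reason_hex, 16)  # the log prints the byte in hex
    serr = bool(reason & NMI_REASON_SERR)
    iochk = bool(reason & NMI_REASON_IOCHK)
    return {
        "value": reason,
        "serr": serr,
        "iochk": iochk,
        # Neither cause bit set: the kernel cannot attribute the NMI.
        "unknown": not (serr or iochk),
    }


result = classify_nmi_reason("25")
print(result)
```

If that reading is right, reason 25 says only that no SERR or IOCHK was latched; it does not identify the device that raised the NMI.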

I will now move one of these GPUs into an older Intel system and run it on bare metal, because a student needs to work with it. I'll also test passthrough on that machine to see whether it shows the same behaviour.

ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: linux-image-5.13.0-30-generic 5.13.0-30.33~20.04.1
ProcVersionSignature: Ubuntu 5.13.0-30.33~20.04.1-generic 5.13.19
Uname: Linux 5.13.0-30-generic x86_64
ApportVersion: 2.20.11-0ubuntu27.21
Architecture: amd64
CasperMD5CheckResult: skip
Date: Mon Mar 7 10:26:25 2022
InstallationDate: Installed on 2022-01-05 (60 days ago)
InstallationMedia: Ubuntu-Server 18.04.6 LTS "Bionic Beaver" - Release amd64 (20210915)
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: linux-signed-hwe-5.13
UpgradeStatus: Upgraded to focal on 2022-01-17 (48 days ago)

Revision history for this message
Ruben De Smet (ruben-de-smet) wrote :
summary:
- Radeon Pro W5500 in passthrough with vfio generates spuriour NMI reason 25
+ Radeon Pro W5500 in passthrough with vfio generates spurious NMI reason 25