arm64 AWS host hangs during modprobe nvidia on lunar and mantic

Bug #2029934 reported by Francis Ginther
This bug affects 1 person
Affects                                      Status  Importance  Assigned to  Milestone
linux-aws (Ubuntu)                           New     Undecided   Unassigned
nvidia-graphics-drivers-525 (Ubuntu)         New     Undecided   Unassigned
nvidia-graphics-drivers-525-server (Ubuntu)  New     Undecided   Unassigned
nvidia-graphics-drivers-535 (Ubuntu)         New     Undecided   Unassigned
nvidia-graphics-drivers-535-server (Ubuntu)  New     Undecided   Unassigned

Bug Description

Loading the nvidia driver DKMS modules with "modprobe nvidia" hangs the host, leaving it completely unusable. This was reproduced with both the linux generic and linux-aws kernels on lunar and mantic using an AWS g5g.xlarge instance.

To reproduce using the generic kernel:
# Deploy an arm64 host with an Nvidia GPU, such as an AWS g5g.xlarge.

# Install the linux generic kernel from lunar-updates:
$ sudo DEBIAN_FRONTEND=noninteractive apt-get install -y -o DPkg::Options::=--force-confold linux-generic

# Boot to the linux-generic kernel (this can be accomplished by removing the existing kernel, in this case it was the linux-aws 6.2.0-1008-aws kernel)
$ sudo DEBIAN_FRONTEND=noninteractive apt-get purge -y -o DPkg::Options::=--force-confold linux-aws linux-aws-headers-6.2.0-1008 linux-headers-6.2.0-1008-aws linux-headers-aws linux-image-6.2.0-1008-aws linux-image-aws linux-modules-6.2.0-1008-aws
$ reboot
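
# Before installing the driver, it is worth confirming that the generic kernel is actually the one running after the reboot (a quick sanity check; the exact version string will differ by release):

```shell
# Print the running kernel release; it should end in "-generic", not "-aws"
uname -r

# List any remaining installed linux-aws packages (should print nothing)
dpkg -l | grep '^ii.*linux.*aws' || echo "no linux-aws packages installed"
```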

# Install the Nvidia 535-server driver DKMS package:
$ sudo DEBIAN_FRONTEND=noninteractive apt-get install -y nvidia-driver-535-server

# Enable the driver
$ sudo modprobe nvidia

# At this point the system hangs and never returns.
# Rebooting instead of running modprobe results in a system that never fully boots. I was able to recover the console logs from such a system and found the following (the full captured log is attached):

[ 1.964942] nvidia: loading out-of-tree module taints kernel.
[ 1.965475] nvidia: module license 'NVIDIA' taints kernel.
[ 1.965905] Disabling lock debugging due to kernel taint
[ 1.980905] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 2.012067] nvidia-nvlink: Nvlink Core is being initialized, major device number 510
[ 2.012715]
[ 62.025143] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 62.025807] rcu: 3-...0: (14 ticks this GP) idle=c04c/1/0x4000000000000000 softirq=653/654 fqs=3301
[ 62.026516] (detected by 0, t=15003 jiffies, g=-699, q=216 ncpus=4)
[ 62.027018] Task dump for CPU 3:
[ 62.027290] task:systemd-udevd state:R running task stack:0 pid:164 ppid:144 flags:0x0000000e
[ 62.028066] Call trace:
[ 62.028273] __switch_to+0xbc/0x100
[ 62.028567] 0x228
Timed out for waiting the udev queue being empty.
Timed out for waiting the udev queue being empty.
[ 242.045143] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 242.045655] rcu: 3-...0: (14 ticks this GP) idle=c04c/1/0x4000000000000000 softirq=653/654 fqs=12303
[ 242.046373] (detected by 1, t=60008 jiffies, g=-699, q=937 ncpus=4)
[ 242.046874] Task dump for CPU 3:
[ 242.047146] task:systemd-udevd state:R running task stack:0 pid:164 ppid:144 flags:0x0000000f
[ 242.047922] Call trace:
[ 242.048128] __switch_to+0xbc/0x100
[ 242.048417] 0x228
Timed out for waiting the udev queue being empty.
Begin: Loading essential drivers ... [ 384.001142] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [modprobe:215]
[ 384.001738] Modules linked in: nvidia(POE+) crct10dif_ce video polyval_ce polyval_generic drm_kms_helper ghash_ce syscopyarea sm4 sysfillrect sha2_ce sysimgblt sha256_arm64 sha1_ce drm nvme nvme_core ena nvme_common aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher
[ 384.003513] CPU: 2 PID: 215 Comm: modprobe Tainted: P OE 6.2.0-26-generic #26-Ubuntu
[ 384.004210] Hardware name: Amazon EC2 g5g.xlarge/, BIOS 1.0 11/1/2018
[ 384.004715] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 384.005259] pc : smp_call_function_many_cond+0x1b4/0x4b4
[ 384.005683] lr : smp_call_function_many_cond+0x1d0/0x4b4
[ 384.006108] sp : ffff8000089a3a70
[ 384.006381] x29: ffff8000089a3a70 x28: 0000000000000003 x27: ffff00056d1fafa0
[ 384.006954] x26: ffff00056d1d76c8 x25: ffffc87cf18bdd10 x24: 0000000000000003
[ 384.007527] x23: 0000000000000001 x22: ffff00056d1d76c8 x21: ffffc87cf18c2690
[ 384.008086] x20: ffff00056d1fafa0 x19: ffff00056d1d76c0 x18: ffff80000896d058
[ 384.008645] x17: 0000000000000000 x16: 0000000000000000 x15: 617362755f5f0073
[ 384.009209] x14: 0000000000000001 x13: 0000000000000006 x12: 4630354535323145
[ 384.009779] x11: 0101010101010101 x10: ffffb78318e9c0e0 x9 : ffffc87ceeac7da4
[ 384.010339] x8 : ffff00056d1d76f0 x7 : 0000000000000000 x6 : 0000000000000000
[ 384.010894] x5 : 0000000000000004 x4 : 0000000000000000 x3 : ffff00056d1fafa8
[ 384.011464] x2 : 0000000000000003 x1 : 0000000000000011 x0 : 0000000000000000
[ 384.012030] Call trace:
[ 384.012241] smp_call_function_many_cond+0x1b4/0x4b4
[ 384.012635] kick_all_cpus_sync+0x50/0xa0
[ 384.012961] flush_module_icache+0x64/0xd0
[ 384.013294] load_module+0x4ec/0xb54
[ 384.013588] __do_sys_finit_module+0xb0/0x150
[ 384.013944] __arm64_sys_finit_module+0x2c/0x50
[ 384.014306] invoke_syscall+0x7c/0x124
[ 384.014613] el0_svc_common.constprop.0+0x5c/0x1cc
[ 384.015000] do_el0_svc+0x38/0x60
[ 384.015280] el0_svc+0x30/0xe0
[ 384.015540] el0t_64_sync_handler+0x11c/0x150
[ 384.015896] el0t_64_sync+0x1a8/0x1ac

This same procedure impacts the 525, 525-server, 535 and 535-server drivers. It does *not* hang a similarly configured host running focal or jammy.
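
As a stopgap while this is investigated, automatic loading of the module at boot can be prevented with a modprobe blacklist entry (a generic sketch, not a fix for the hang; the filename is arbitrary, and the initramfs must be regenerated with `sudo update-initramfs -u` after creating the file):

```
# /etc/modprobe.d/blacklist-nvidia.conf
# Keep the nvidia modules from being auto-loaded while the hang is debugged
blacklist nvidia
blacklist nvidia_drm
blacklist nvidia_modeset
blacklist nvidia_uvm
```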

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Although nvidia seems to be the trigger here, the crashing code appears to be pure generic linux: arch/arm64/kernel/syscall.c

tags: added: arm64 nvidia
tags: added: lunar mantic
summary: - Host hangs during modprobe nvidia on lunar and mantic
+ arm64 AWS host hangs during modprobe nvidia on lunar and mantic