BUG: kernel NULL pointer dereference, address: 0000000000000050

Bug #1922387 reported by dann frazier
This bug affects 1 person
Affects          Status      Importance  Assigned to  Milestone
linux (Ubuntu)   Incomplete  Undecided   Unassigned
Focal            Confirmed   Undecided   Unassigned
Groovy           Incomplete  Undecided   Unassigned
Hirsute          Incomplete  Undecided   Unassigned

Bug Description

I observed the following kernel panic with the 5.4.0-71.79-generic kernel while running kernel selftests:

blanka login: [ 1671.958400] mmiotrace: Error taking CPU253 down: -28
[ 1672.118199] mmiotrace: Error taking CPU254 down: -28
[ 1672.230306] mmiotrace: Error taking CPU255 down: -28
[ 2503.359753] BUG: kernel NULL pointer dereference, address: 0000000000000050
[ 2503.367527] #PF: supervisor read access in kernel mode
[ 2503.373257] #PF: error_code(0x0000) - not-present page
[ 2503.378989] PGD 0 P4D 0
[ 2503.381812] Oops: 0000 [#1] SMP NOPTI
[ 2503.385896] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G OE 5.4.0-71-generic #79-Ubuntu
[ 2503.395795] Hardware name: NVIDIA DGXA100 920-23687-2530-000/DGXA100, BIOS 0.33 01/19/2021
[ 2503.405027] RIP: 0010:trace_event_raw_event_wbt_timer+0x6f/0x100
[ 2503.411728] Code: 59 80 e5 02 0f 85 8f 00 00 00 4c 89 e6 ba 34 00 00 00 48 8d 7d a0 e8 d0 a4 ca ff 49 89 c4 48 85 c0 74 37 49 8b 87 b8 03 00 00 <48> 8b 70 50 48 85 f6 74 45 49 8d 7c 24 08 ba 20 00 00 00 e8 59 91
[ 2503.432683] RSP: 0018:ffffa8d6c0003d90 EFLAGS: 00010286
[ 2503.438513] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000080000100
[ 2503.446474] RDX: ffff9968a228f418 RSI: 0000000000000100 RDI: ffff9968a228f414
[ 2503.454436] RBP: ffffa8d6c0003df8 R08: ffff9968a228f414 R09: 0000000000000100
[ 2503.462394] R10: 0000000000000007 R11: 0000000000000007 R12: ffff9968a228f418
[ 2503.470353] R13: 00000000fffffffa R14: 0000000000000003 R15: ffff9a686f9b3000
[ 2503.478316] FS: 0000000000000000(0000) GS:ffff99690cc00000(0000) knlGS:0000000000000000
[ 2503.487342] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2503.493752] CR2: 0000000000000050 CR3: 0000007e08ad6000 CR4: 0000000000340ef0
[ 2503.501712] Call Trace:
[ 2503.504438] <IRQ>
[ 2503.506682] wb_timer_fn+0x1d6/0x3c0
[ 2503.510672] ? blk_stat_free_callback_rcu+0x30/0x30
[ 2503.516112] blk_stat_timer_fn+0x134/0x140
[ 2503.520683] call_timer_fn+0x32/0x130
[ 2503.524768] __run_timers.part.0+0x180/0x280
[ 2503.529535] ? trace_event_raw_event_softirq+0x5d/0xa0
[ 2503.535267] run_timer_softirq+0x2a/0x50
[ 2503.539644] __do_softirq+0xe1/0x2d6
[ 2503.543629] irq_exit+0xae/0xb0
[ 2503.547132] smp_apic_timer_interrupt+0x7b/0x140
[ 2503.552280] apic_timer_interrupt+0xf/0x20
[ 2503.556848] </IRQ>
[ 2503.559187] RIP: 0010:native_safe_halt+0xe/0x10
[ 2503.564239] Code: 7b ff ff ff eb bd 90 90 90 90 90 90 e9 07 00 00 00 0f 00 2d 66 dd 52 00 f4 c3 66 90 e9 07 00 00 00 0f 00 2d 56 dd 52 00 fb f4 <c3> 90 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 e8 cd cd 63 ff 65
[ 2503.585191] RSP: 0018:ffffffff94803e18 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[ 2503.593635] RAX: 000000000001e7c0 RBX: ffff996849080de8 RCX: 0000000000149022
[ 2503.601595] RDX: 0000000000149022 RSI: 0000000000000000 RDI: ffffffff948c5ba0
[ 2503.609556] RBP: ffffffff94803e38 R08: 00000000000002a8 R09: ffff9968a228f000
[ 2503.617516] R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000000
[ 2503.625475] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 2503.633440] ? default_idle+0x20/0x140
[ 2503.637623] arch_cpu_idle+0x15/0x20
[ 2503.641608] default_idle_call+0x23/0x30
[ 2503.645984] do_idle+0x1fb/0x270
[ 2503.649583] cpu_startup_entry+0x20/0x30
[ 2503.653960] rest_init+0xae/0xb0
[ 2503.657563] arch_call_rest_init+0xe/0x1b
[ 2503.662025] start_kernel+0x549/0x56a
[ 2503.666108] x86_64_start_reservations+0x24/0x26
[ 2503.671258] x86_64_start_kernel+0x75/0x79
[ 2503.675828] secondary_startup_64+0xa4/0xb0
[ 2503.680493] Modules linked in: sch_etf sch_fq dccp_ipv6 dccp_ipv4 dccp ip6table_nat iptable_nat xt_nat nf_nat algif_hash af_alg ip6table_filter xt_conntrack nf_conntrack nf_defrag_ipv4 ip6_tables nf_defrag_ipv6 ip_vti ip6_vti fou6 sit ipip tunnel4 geneve act_mirred cls_basic esp6 authenc echainiv iptable_filter xt_policy bpfilter veth esp4_offload esp4 xfrm_user xfrm_algo macsec fou vxlan ip6_udp_tunnel udp_tunnel vrf 8021q garp mrp bridge stp llc ip6_gre ip6_tunnel tunnel6 ip_gre ip_tunnel gre cls_u32 sch_htb dummy binfmt_misc nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua amd64_edac_mod edac_mce_amd kvm_amd kvm ipmi_ssif input_leds cdc_ether usbnet mii ccp k10temp ipmi_si ipmi_devintf ipmi_msghandler mac_hid sch_fq_codel knem(OE) ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 multipath linear ses enclosure ast crct10dif_pclmul drm_vram_helper crc32_pclmul ttm
[ 2503.680569] ghash_clmulni_intel aesni_intel mlx5_core(OE) crypto_simd pci_hyperv_intf drm_kms_helper tls syscopyarea cryptd raid0 glue_helper mlxfw(OE) hid_generic sysfillrect igb sysimgblt mpt3sas uas dca mdev(OE) fb_sys_fops raid_class i2c_algo_bit usbhid nvme scsi_transport_sas hid usb_storage drm mlx_compat(OE) nvme_core i2c_piix4 [last unloaded: trace_printk]
[ 2503.813546] CR2: 0000000000000050
[ 2503.817337] ---[ end trace ccd7c184afc3c422 ]---
[ 2503.933758] RIP: 0010:trace_event_raw_event_wbt_timer+0x6f/0x100
[ 2503.940458] Code: 59 80 e5 02 0f 85 8f 00 00 00 4c 89 e6 ba 34 00 00 00 48 8d 7d a0 e8 d0 a4 ca ff 49 89 c4 48 85 c0 74 37 49 8b 87 b8 03 00 00 <48> 8b 70 50 48 85 f6 74 45 49 8d 7c 24 08 ba 20 00 00 00 e8 59 91
[ 2503.961410] RSP: 0018:ffffa8d6c0003d90 EFLAGS: 00010286
[ 2503.967239] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000080000100
[ 2503.975200] RDX: ffff9968a228f418 RSI: 0000000000000100 RDI: ffff9968a228f414
[ 2503.983161] RBP: ffffa8d6c0003df8 R08: ffff9968a228f414 R09: 0000000000000100
[ 2503.991122] R10: 0000000000000007 R11: 0000000000000007 R12: ffff9968a228f418
[ 2503.999083] R13: 00000000fffffffa R14: 0000000000000003 R15: ffff9a686f9b3000
[ 2504.007044] FS: 0000000000000000(0000) GS:ffff99690cc00000(0000) knlGS:0000000000000000
[ 2504.016070] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2504.022479] CR2: 0000000000000050 CR3: 0000007e08ad6000 CR4: 0000000000340ef0
[ 2504.030442] Kernel panic - not syncing: Fatal exception in interrupt
[ 2504.038450] Kernel Offset: 0x12200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 2504.161847] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
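
A note on reading the oops above: the `<48>` in the `Code:` line marks the byte at RIP. The four bytes starting there, `48 8b 70 50`, encode a load from `0x50(%rax)` into `%rsi`, and the seven bytes before them (`49 8b 87 b8 03 00 00`) load `%rax` from `0x3b8(%r15)`. With RAX equal to 0 in the register dump, the access lands on address 0x50, which matches CR2. The Python sketch below only checks that arithmetic against the logged values; it is an illustration, not a disassembler, and the helper name `parse_code_line` is made up for this example.

```python
import re

# "Code:" line copied from the oops above; <48> marks the byte at RIP.
CODE_LINE = ("Code: 59 80 e5 02 0f 85 8f 00 00 00 4c 89 e6 ba 34 00 00 00 "
             "48 8d 7d a0 e8 d0 a4 ca ff 49 89 c4 48 85 c0 74 37 49 8b 87 "
             "b8 03 00 00 <48> 8b 70 50 48 85 f6 74 45 49 8d 7c 24 08 ba "
             "20 00 00 00 e8 59 91")


def parse_code_line(line):
    """Return (code_bytes, index_of_byte_at_RIP) for an oops 'Code:' line."""
    tokens = line.split("Code:", 1)[1].split()
    fault_index = None
    values = []
    for i, tok in enumerate(tokens):
        marked = re.fullmatch(r"<([0-9a-fA-F]{2})>", tok)
        if marked:
            fault_index = i          # position of the <xx> marker
            tok = marked.group(1)    # strip the angle brackets
        values.append(int(tok, 16))
    return bytes(values), fault_index


code, fault_index = parse_code_line(CODE_LINE)

# Faulting instruction bytes: REX.W prefix, opcode 8b, ModRM 70, disp8 50.
faulting = code[fault_index:fault_index + 4]
displacement = faulting[3]

rax = 0x0000000000000000   # RAX from the register dump
cr2 = 0x0000000000000050   # CR2 (faulting address) from the register dump

print("faulting bytes:", " ".join(f"{b:02x}" for b in faulting))
print("RAX + displacement == CR2:", rax + displacement == cr2)
```

In other words, the pointer loaded from offset 0x3b8 of the object in R15 was NULL, and the tracepoint then tried to read a field at offset 0x50 through it.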

Tags: focal
dann frazier (dannf)
Changed in linux (Ubuntu Focal):
status: New → Confirmed
Francis Ginther (fginther) wrote :

This panic occurred while running the ubuntu_kernel_selftests suite. The last lines of the log are:

13:33:20 DEBUG| [stdout] # selftests: ftrace: ftracetest
13:33:20 DEBUG| [stdout] # === Ftrace unit tests ===
13:33:28 DEBUG| [stdout] # [1] Basic trace file check [PASS]
13:37:04 DEBUG| [stdout] # [2] Basic test for tracers [PASS]
13:39:48 DEBUG| [stdout] # [3] Basic trace clock test [PASS]
13:39:56 DEBUG| [stdout] # [4] Basic event tracing check [PASS]
13:40:04 DEBUG| [stdout] # [5] Change the ringbuffer size [PASS]
13:40:20 DEBUG| [stdout] # [6] Snapshot and tracing setting [PASS]
13:40:35 DEBUG| [stdout] # [7] trace_pipe and trace_marker [PASS]
13:40:51 DEBUG| [stdout] # [8] Generic dynamic event - add/remove kprobe events [PASS]
13:41:07 DEBUG| [stdout] # [9] Generic dynamic event - add/remove synthetic events [PASS]
13:41:14 DEBUG| [stdout] # [10] Generic dynamic event - selective clear (compatibility) [PASS]
13:41:22 DEBUG| [stdout] # [11] Generic dynamic event - generic clear event [PASS]
13:41:46 DEBUG| [stdout] # [12] event tracing - enable/disable with event level files [PASS]
13:42:17 DEBUG| [stdout] # [13] event tracing - restricts events based on pid [PASS]
13:42:41 DEBUG| [stdout] # [14] event tracing - enable/disable with subsystem level files [PASS]
13:43:05 DEBUG| [stdout] # [15] event tracing - enable/disable with top level files [PASS]
13:43:14 DEBUG| [stdout] # [16] Test trace_printk from module [PASS]
13:43:56 DEBUG| [stdout] # [17] ftrace - function graph filters with stack tracer [PASS]
13:44:29 DEBUG| [stdout] # [18] ftrace - function graph filters [PASS]
13:45:49 DEBUG| [stdout] # [19] ftrace - function pid filters [PASS]
13:46:06 DEBUG| [stdout] # [20] ftrace - stacktrace filter command [PASS]
13:46:38 DEBUG| [stdout] # [21] ftrace - function trace with cpumask [PASS]
13:47:13 DEBUG| [stdout] # [22] ftrace - test for function event triggers [PASS]
13:47:21 DEBUG| [stdout] # [23] ftrace - function trace on module [PASS]
13:47:31 DEBUG| [stdout] # [24] ftrace - function profiling [PASS]
13:48:07 DEBUG| [stdout] # [25] ftrace - function profiler with function tracing [PASS]
13:48:25 DEBUG| [stdout] # [26] ftrace - test reading of set_ftrace_filter [PASS]
---- END OF MESSAGES ----

This job was run twice. The prior run also hung before completing, but we don't have a console log for that time period, so it's unclear whether it also panicked. Its last messages were:

04:44:27 DEBUG| [stdout] # selftests: timers: nsleep-lat
04:44:48 DEBUG| [stdout] # nsleep latency CLOCK_REALTIME [OK]
04:45:09 DEBUG| [stdout] # nsleep latency CLOCK_MONOTONIC [OK]
04:45:09 DEBUG| [stdout] # nsleep latency CLOCK_MONOTONIC_RAW [UNSUPPORTED]
04:45:09 DEBUG| [stdout] # nsleep latency CLOCK_REALTIME_COARSE [UNSUPPORTED]
04:45:09 DEBUG| [stdout] # nsleep latency CLOCK_MONOTONIC_COARSE [UNSUPPORTED]
04:45:30 DEBUG| [stdout] # nsleep latency CLOCK_BOOTTIME [OK]
04:45:52 DEBUG| [stdout] # nsleep latency CLOCK_REALTIME_ALARM [OK]
04:46:13 DEBUG| [stdout] # nsleep latency CLOCK_BOOTTIME_ALARM [OK]
04:46:34 DEBUG| [stdout] # nsleep latency CLOCK_TAI [OK]
04:46:34 DEBUG| [stdout] # # Pass 0 Fail 0 Xfail 0 Xpass 0 Skip 0 Error 0
04:46:34 DEBUG| [stdout] ok 3 selft...


Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1922387

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu Groovy):
status: New → Incomplete
tags: added: focal
Ian May (ian-may) wrote :

I ran the ubuntu_kernel_selftests ftrace tests manually on the 5.4.0-71.79-generic kernel. I was able to replicate the panic, though not on every run. Even on runs without a panic, dmesg reported several soft lockups.

After removing the MOFED DKMS modules, I was unable to replicate the panic or any of the soft lockups seen previously. I don't yet have evidence as to which MOFED module is potentially triggering the problem.
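
One way to narrow the list of candidates is to pull the flagged entries out of the "Modules linked in:" lines of the oops. In that list, `(OE)` is the kernel's per-module taint annotation: `O` means the module was built out of tree and `E` means it is unsigned. The Python sketch below extracts those entries from an abridged copy of the list; the function name `flagged_modules` is made up for this example, and the report does not say which of the flagged modules come from MOFED.

```python
import re

# Abridged from the "Modules linked in:" lines of the oops in the bug
# description; "..." marks modules omitted here for brevity.
MODULES_EXCERPT = (
    "... sch_fq_codel knem(OE) ip_tables x_tables ... aesni_intel "
    "mlx5_core(OE) crypto_simd ... glue_helper mlxfw(OE) hid_generic ... "
    "dca mdev(OE) fb_sys_fops ... drm mlx_compat(OE) nvme_core i2c_piix4"
)


def flagged_modules(modules_text):
    """Return {module_name: taint_flags} for entries such as 'mlx5_core(OE)'."""
    return dict(re.findall(r"([A-Za-z0-9_]+)\(([A-Z]+)\)", modules_text))


tainting = flagged_modules(MODULES_EXCERPT)
for name, flags in tainting.items():
    print(f"{name}: {flags}")
```

Running the same extraction on the full module list from a console log gives the set of out-of-tree modules that were loaded at the time of the crash.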

Ian May (ian-may) wrote :

Here are the steps I used to reproduce:

# If using the proposed pocket kernel, enable it first:
# https://wiki.ubuntu.com/Testing/EnableProposed

# deb-src must be enabled for proposed/updates for this to work
$ sudo apt update
$ sudo apt-get source linux

# After the source is pulled, build and run the ftrace selftests
$ sudo make -C linux-5.4.0/tools/testing/selftests TARGETS=ftrace run_tests

I also tested Ubuntu-5.4.0-70.78 and saw similar soft lockups, but have not yet replicated the crash there. That said, I don't think I have evidence that this is a kernel regression.
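
To compare runs (for example 5.4.0-70.78 against 5.4.0-71.79, or with and without MOFED), it can help to count soft-lockup reports per CPU in saved dmesg output. The Python sketch below assumes the usual watchdog wording, "BUG: soft lockup - CPU#N stuck for ...", which is not quoted anywhere in this report, so the pattern should be adjusted if the messages on the test machine read differently. The function name `count_soft_lockups` is made up for this example.

```python
import re
from collections import Counter

# Assumed message shape: "watchdog: BUG: soft lockup - CPU#12 stuck for 22s! ..."
# Only the "soft lockup - CPU#<n>" part is matched.
SOFT_LOCKUP = re.compile(r"soft lockup - CPU#(\d+)")


def count_soft_lockups(dmesg_text):
    """Count soft-lockup reports per CPU number in dmesg output."""
    return Counter(int(m.group(1)) for m in SOFT_LOCKUP.finditer(dmesg_text))
```

Usage would be something like saving `dmesg` to a file after each selftest run and calling `count_soft_lockups(open(path).read())` on each file, where `path` is wherever the output was saved.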

Ian May (ian-may) wrote :

Also worth mentioning: we are only seeing this on the A100. Neither our automated testing nor our manual testing of ftrace saw any issues on the DGX2.
