Devlink backport: Fix mlx5 driver hangs due to mlx5_sf_hw_table_init

Bug #2042455 reported by William Tu
This bug affects 1 person
Affects                   Status         Importance  Assigned to  Milestone
linux-bluefield (Ubuntu)  Invalid        Undecided   Unassigned
  Jammy                   Fix Committed  Undecided   Unassigned

Bug Description

Summary:
The machine hangs when loading the OFED 2310 mlx5 driver on BlueField.

How to reproduce:
# load the OFED driver

Reason:
The BlueField (BF) hangs, and the hung-task call trace shows "mlx5_sf_hw_table_init+0xf4/0x2d0 [mlx5_core]" blocked in mutex_lock() under devlink_resource_register().

dmesg from minicom:
[ 726.569928] INFO: task systemd-udevd:297 blocked for more than 604 seconds.
[ 726.576895] Tainted: G OE 5.15.0-1029-bluefield #31-Ubuntu
[ 726.584101] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 726.591913] task:systemd-udevd state:D stack: 0 pid: 297 ppid: 280 flags:0x0000000d
[ 726.600248] Call trace:
[ 726.602680] __switch_to+0xf8/0x150
[ 726.606159] __schedule+0x2b8/0x790
[ 726.609634] schedule+0x64/0x140
[ 726.612850] schedule_preempt_disabled+0x18/0x24
[ 726.617453] __mutex_lock.constprop.0+0x1a0/0x680
[ 726.622141] __mutex_lock_slowpath+0x40/0x90
[ 726.626396] mutex_lock+0x64/0x70
[ 726.629695] devlink_resource_register+0x50/0x1a0
[ 726.634386] mlx5_sf_hw_table_init+0xf4/0x2d0 [mlx5_core]
[ 726.639882] mlx5_init_one_devl_locked+0x1c8/0x784 [mlx5_core]
[ 726.645791] probe_one+0x300/0x5f0 [mlx5_core]
[ 726.650307] local_pci_probe+0x48/0xb4
[ 726.654043] pci_device_probe+0x18c/0x200
[ 726.658039] really_probe+0xd0/0x490
[ 726.661600] __driver_probe_device+0x148/0x190
[ 726.666029] driver_probe_device+0x48/0x180
[ 726.670198] __driver_attach+0x104/0x240
[ 726.674106] bus_for_each_dev+0x78/0xdc
[ 726.677927] driver_attach+0x2c/0x40
[ 726.681486] bus_add_driver+0x154/0x270
[ 726.685307] driver_register+0x80/0x13c
[ 726.689129] __pci_register_driver+0x4c/0x60
[ 726.693386] __init_backport+0xf0/0x1000 [mlx5_core]
[ 726.698425] do_one_initcall+0x4c/0x250
[ 726.702248] do_init_module+0x50/0x260
[ 726.705983] load_module+0x9fc/0xbe0
[ 726.709543] __do_sys_finit_module+0xa8/0x114
[ 726.713885] __arm64_sys_finit_module+0x28/0x3c
[ 726.718401] invoke_syscall+0x78/0x100
[ 726.722137] el0_svc_common.constprop.0+0x54/0x184
[ 726.726913] do_el0_svc+0x30/0xac
[ 726.730215] el0_svc+0x48/0x160
[ 726.733341] el0t_64_sync_handler+0xa4/0x130
[ 726.737597] el0t_64_sync+0x1a4/0x1a8
[ 847.401924] INFO: task systemd-udevd:297 blocked for more than 724 seconds.
[ 847.408891] Tainted: G OE 5.15.0-1029-bluefield #31-Ubuntu
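
One plausible reading of the trace above: mlx5_init_one_devl_locked() already holds the devlink instance lock, and devlink_resource_register() then tries to take the same non-recursive mutex, so the task blocks forever. The following is only a toy user-space model of that pattern in Python, not kernel code. The Devlink class, the timeout, and the "sf_hw_table" resource name are assumptions made for illustration; devl_resource_register is the name of the unlocked variant added by the series listed below.

```python
import threading


class Devlink:
    """Toy stand-in for a devlink instance guarded by one non-recursive lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self.resources = []


def devl_resource_register(devlink, name):
    """Unlocked variant: the caller must already hold devlink.lock."""
    assert devlink.lock.locked(), "caller must hold devlink.lock"
    devlink.resources.append(name)
    return True


def devlink_resource_register(devlink, name, timeout=0.1):
    """Locked variant: takes devlink.lock itself, then delegates.

    The timeout is an assumption of this model so the demo terminates;
    a real mutex_lock() would block indefinitely (task state D).
    """
    if not devlink.lock.acquire(timeout=timeout):
        return False  # would have hung
    try:
        return devl_resource_register(devlink, name)
    finally:
        devlink.lock.release()


def init_one_devl_locked(devlink, register):
    """Models an init path that runs with the devlink lock already held."""
    with devlink.lock:
        return register(devlink, "sf_hw_table")  # hypothetical resource name


dl = Devlink()
hung = init_one_devl_locked(dl, devlink_resource_register)
ok = init_one_devl_locked(dl, devl_resource_register)
print("locked variant under held lock succeeded:", hung)
print("unlocked variant under held lock succeeded:", ok)
```

In this model the locked variant fails when called with the lock held, while the unlocked variant succeeds, which is why an init path that already holds the lock needs the unlocked API.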

How to fix:
This is related to
https://bugs.launchpad.net/ubuntu/+source/linux-bluefield/+bug/2039869
and more patches from the same series need to be backported or cherry-picked.

The patches are listed below:
Backport: f655dacb59ac net: devlink: remove unused locked functions
Backport: 012ec02ae441 netdevsim: convert driver to use unlocked devlink API during init/fini
Cherry-pick: eb0e9fa2c635 net: devlink: add unlocked variants of devlink_region_create/destroy() functions
SKIP: 72a4c8c94efa mlxsw: convert driver to use unlocked devlink API during init/fini
Backport: 70a2ff89369d net: devlink: add unlocked variants of devlink_dpipe*() functions
Cherry-pick: 755cfa69c4ec net: devlink: add unlocked variants of devlink_sb*() functions
Cherry-pick: c223d6a4bf6d net: devlink: add unlocked variants of devlink_resource*() functions
Cherry-pick: 852e85a704c2 net: devlink: add unlocked variants of devling_trap*() functions
Cherry-pick: e26fde2f5bef net: devlink: avoid false DEADLOCK warning reported by lock
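
The list above can be turned into an ordered work plan. The sketch below is a hypothetical helper, not part of the actual fix: it parses the lines exactly as written above and prints one step per patch, in series order. It assumes that "Cherry-pick" means the commit applies cleanly with git cherry-pick -x, that "Backport" means the commit needs manual adaptation, and that "SKIP" means the commit is not applied.

```python
PATCH_LIST = """\
Backport: f655dacb59ac net: devlink: remove unused locked functions
Backport: 012ec02ae441 netdevsim: convert driver to use unlocked devlink API during init/fini
Cherry-pick: eb0e9fa2c635 net: devlink: add unlocked variants of devlink_region_create/destroy() functions
SKIP: 72a4c8c94efa mlxsw: convert driver to use unlocked devlink API during init/fini
Backport: 70a2ff89369d net: devlink: add unlocked variants of devlink_dpipe*() functions
Cherry-pick: 755cfa69c4ec net: devlink: add unlocked variants of devlink_sb*() functions
Cherry-pick: c223d6a4bf6d net: devlink: add unlocked variants of devlink_resource*() functions
Cherry-pick: 852e85a704c2 net: devlink: add unlocked variants of devling_trap*() functions
Cherry-pick: e26fde2f5bef net: devlink: avoid false DEADLOCK warning reported by lock
"""


def parse_patch_list(text):
    """Return (action, sha, subject) tuples in the order they appear."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split only on the first colon; subjects contain colons too.
        action, rest = line.split(":", 1)
        sha, subject = rest.strip().split(" ", 1)
        entries.append((action.strip().lower(), sha, subject))
    return entries


def plan_step(action, sha, subject):
    """Map one entry to a suggested step (interpretation is an assumption)."""
    if action == "cherry-pick":
        return f"git cherry-pick -x {sha}"
    if action == "backport":
        return f"# backport {sha} by hand: {subject}"
    return f"# skip {sha}: {subject}"


entries = parse_patch_list(PATCH_LIST)
for entry in entries:
    print(plan_step(*entry))
```

Keeping the entries in their original order matters because later patches in the series may depend on earlier ones.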

Thanks!

William Tu (wtu)
summary: - Devlink backport: Fix mlx5 driver hangs
+ Devlink backport: Fix mlx5 driver hangs due to mlx5_sf_hw_table_init
Changed in linux-bluefield (Ubuntu):
status: New → Invalid
Changed in linux-bluefield (Ubuntu Jammy):
status: New → Fix Committed
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-bluefield/5.15.0-1031.33 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-bluefield' to 'verification-done-jammy-linux-bluefield'. If the problem still exists, change the tag 'verification-needed-jammy-linux-bluefield' to 'verification-failed-jammy-linux-bluefield'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-bluefield-v2 verification-needed-jammy-linux-bluefield
tags: added: verification-done-jammy-linux-bluefield
removed: verification-needed-jammy-linux-bluefield