Activity log for bug #2019011

Date Who What changed Old value New value Message
2023-05-09 14:49:29 bugproxy bug added bug
2023-05-09 14:49:31 bugproxy tags architecture-s3903164 bugnameltc-202279 severity-high targetmilestone-inin---
2023-05-09 14:49:32 bugproxy ubuntu: assignee Skipper Bug Screeners (skipper-screen-team)
2023-05-09 14:49:39 bugproxy affects ubuntu linux (Ubuntu)
2023-05-10 14:56:46 Pedro Principeza bug added subscriber Pedro Principeza
2023-05-10 17:04:08 Marcelo Cerri nominated for series Ubuntu Focal
2023-05-10 17:04:08 Marcelo Cerri bug task added linux (Ubuntu Focal)
2023-05-10 17:04:13 Marcelo Cerri linux (Ubuntu Focal): status New In Progress
2023-05-10 17:04:53 Marcelo Cerri attachment added 0001-net-mlx5-cmdif-Avoid-skipping-reclaim-pages-if-FW-is.patch https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2019011/+attachment/5672245/+files/0001-net-mlx5-cmdif-Avoid-skipping-reclaim-pages-if-FW-is.patch
2023-05-10 17:05:05 Marcelo Cerri attachment added 0002-net-mlx5-Fix-handling-of-entry-refcount-when-command.patch https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2019011/+attachment/5672246/+files/0002-net-mlx5-Fix-handling-of-entry-refcount-when-command.patch
2023-05-10 20:13:43 Ubuntu Foundations Team Bug Bot tags architecture-s3903164 bugnameltc-202279 severity-high targetmilestone-inin--- architecture-s3903164 bugnameltc-202279 patch severity-high targetmilestone-inin---
2023-05-10 20:13:44 Ubuntu Foundations Team Bug Bot bug added subscriber Terry Rudd
2023-05-15 06:07:19 Frank Heimes bug task added ubuntu-z-systems
2023-06-02 10:09:34 bugproxy tags architecture-s3903164 bugnameltc-202279 patch severity-high targetmilestone-inin--- architecture-s3903164 bugnameltc-202279 patch severity-high targetmilestone-inin2004
2023-06-28 07:50:00 Frank Heimes linux (Ubuntu): status New Fix Released
2023-06-28 07:50:05 Frank Heimes ubuntu-z-systems: status New In Progress
2023-06-28 07:50:21 Frank Heimes ubuntu-z-systems: assignee Skipper Bug Screeners (skipper-screen-team)
2023-06-28 09:53:09 Frank Heimes description ---Problem Description--- Kernel panic with "refcount_t: underflow" in kernel log Contact Information = Rijoy.k@ibm.com, vineeth.vijayan@ibm.com ---uname output--- 5.4.0-128-generic Machine Type = s390x ---System Hang--- Kernel panic and stack-trace as below ---Debugger--- A debugger is not configured Stack trace output: [Sat Apr 8 17:52:21 UTC 2023] Call Trace: [Sat Apr 8 17:52:21 UTC 2023] ([<0000002a5939a286>] refcount_warn_saturate+0xce/0x140) [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805f861e>] cmd_ent_put+0xe6/0xf8 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805f9b6a>] mlx5_cmd_comp_handler+0x102/0x4f0 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805f9f8a>] cmd_comp_notifier+0x32/0x48 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecf0c6>] notifier_call_chain+0x4e/0xa0 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecf17e>] atomic_notifier_call_chain+0x2e/0x40 [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805fe4fc>] mlx5_eq_async_int+0x13c/0x200 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecf0c6>] notifier_call_chain+0x4e/0xa0 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecf17e>] atomic_notifier_call_chain+0x2e/0x40 [Sat Apr 8 17:52:21 UTC 2023] [<000003ff8061318e>] mlx5_irq_int_handler+0x2e/0x48 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f1455a>] __handle_irq_event_percpu+0x6a/0x250 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f14770>] handle_irq_event_percpu+0x30/0x78 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f1a0c8>] handle_percpu_irq+0x68/0xa0 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f134d2>] generic_handle_irq+0x3a/0x60 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e960ce>] zpci_floating_irq_handler+0xe6/0x1b8 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a594f54a6>] do_airq_interrupt+0x96/0x130 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f1455a>] __handle_irq_event_percpu+0x6a/0x250 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f14770>] handle_irq_event_percpu+0x30/0x78 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f1a0c8>] handle_percpu_irq+0x68/0xa0 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f134d2>] generic_handle_irq+0x3a/0x60 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e30e42>] do_IRQ+0x7a/0xb0 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a5972a408>] io_int_handler+0x12c/0x294 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e2752e>] enabled_wait+0x46/0xd8 [Sat Apr 8 17:52:21 UTC 2023] ([<0000002a58e2752e>] enabled_wait+0x46/0xd8) [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e278aa>] arch_cpu_idle+0x2a/0x40 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ee1536>] do_idle+0xee/0x1b0 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ee17a6>] cpu_startup_entry+0x36/0x40 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e3ab38>] smp_init_secondary+0xc8/0xe8 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e3a770>] smp_start_secondary+0x88/0x90 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a5972a09c>] kernel_thread_starter+0x0/0x10 [Sat Apr 8 17:52:21 UTC 2023] Last Breaking-Event-Address: [Sat Apr 8 17:52:21 UTC 2023] [<0000002a5939a286>] refcount_warn_saturate+0xce/0x140 [Sat Apr 8 17:52:21 UTC 2023] ---[ end trace 6ec6f9c6f666ca2d ]--- [Sat Apr 8 17:52:21 UTC 2023] specification exception: 0006 ilc:3 [#1] SMP [Sat Apr 8 17:52:21 UTC 2023] Modules linked in: sysdigcloud_probe(OE) vhost_net vhost macvtap macvlan tap rpcsec_gss_krb5 auth_rpcgss nfsv3 nfs_acl nfs lockd grace fscache ebtable_broute binfmt_misc nbd veth xt_statistic ipt_REJECT nf_reject_ipv4 ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs iptable_mangle ip6table_mangle ip6table_nat xt_mark sunrpc lcs ctcm fsm zfcp scsi_transport_fc 
dasd_fba_mod dasd_eckd_mod dasd_mod nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp xt_multiport xt_set ip_set_hash_net ip_set_hash_ip ip_set tcp_diag inet_diag xt_comment xt_nat act_gact sch_multiq act_mirred act_pedit act_tunnel_key cls_flower act_police cls_u32 vxlan ip6_udp_tunnel udp_tunnel dummy sch_ingress mlx5_ib ib_uverbs ib_core mlx5_core tls mlxfw ptp pps_core xt_MASQUERADE iptable_nat xt_addrtype xt_conntrack br_netfilter bridge stp llc aufs ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bpfilter xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo bonding s390_trng [Sat Apr 8 17:52:21 UTC 2023] vfio_ccw chsc_sch vfio_mdev mdev vfio_iommu_type1 eadm_sch vfio sch_fq_codel ip_tables x_tables overlay raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 linear nf_tables nf_nat nf_conntrack_netlink nfnetlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c qeth_l3 qeth_l2 pkey zcrypt crc32_vx_s390 ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common qeth ccwgroup qdio scsi_dh_emc scsi_dh_rdac scsi_dh_alua dm_multipath [last unloaded: sysdigcloud_probe] [Sat Apr 8 17:52:21 UTC 2023] CPU: 12 PID: 83893 Comm: kworker/u400:91 Kdump: loaded Tainted: G W OE 5.4.0-128-generic #144~18.04.1-Ubuntu [Sat Apr 8 17:52:21 UTC 2023] Hardware name: IBM 8562 GT2 A00 (LPAR) [Sat Apr 8 17:52:21 UTC 2023] Workqueue: mlx5e mlx5e_update_stats_work [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] Krnl PSW : 0404d00180000000 0000002a58ec51d8 (queue_work_on+0x30/0x70) [Sat Apr 8 17:52:21 UTC 2023] R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3 [Sat Apr 8 17:52:21 UTC 2023] Krnl GPRS: 1d721b7c57e8d7f5 0000000000000001 0000000000000200 0000006222a0e800 [Sat Apr 8 17:52:21 UTC 2023] 0000005a8e94a4e1 0000000000000000 0000000000000000 000003e016d23d08 [Sat Apr 8 17:52:21 UTC 2023] 0000005a8e94a4e1 0000006287800120 0000003b8dbbd740 0700003b8dbbd740 [Sat Apr 8 17:52:21 UTC 2023] 00000062690c6600 000003ff8069c808 000003e016d23ae0 000003e016d23aa8 [Sat Apr 8 17:52:21 UTC 2023] Krnl Code: 0000002a58ec51c6: f0a0a7190001 srp 1817(11,%r10),1,0 0000002a58ec51cc: e3b0f0a00004 lg %r11,160(%r15) #0000002a58ec51d2: eb11400000e6 laog %r1,%r1,0(%r4) >0000002a58ec51d8: 07e0 bcr 14,%r0 0000002a58ec51da: a7110001 tmll %r1,1 0000002a58ec51de: a7840016 brc 8,0000002a58ec520a 0000002a58ec51e2: a7280000 lhi %r2,0 0000002a58ec51e6: a7b20300 tmhh %r11,768 [Sat Apr 8 17:52:21 UTC 2023] Call Trace: [Sat Apr 8 17:52:21 UTC 2023] ([<000003e016d23ae0>] 0x3e016d23ae0) [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805fab0a>] cmd_exec+0x44a/0xab0 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805fb2b0>] mlx5_cmd_exec+0x40/0x70 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<000003ff80657cb0>] mlx5_eswitch_get_vport_stats+0xb0/0x2a0 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<000003ff80644602>] mlx5e_rep_update_hw_counters+0x52/0xb8 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<000003ff8061f1ec>] mlx5e_update_stats_work+0x44/0x58 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ec56f4>] process_one_work+0x274/0x4d0 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ec5998>] worker_thread+0x48/0x560 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecd014>] kthread+0x144/0x160 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a5972a094>] ret_from_fork+0x28/0x30 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a5972a09c>] kernel_thread_starter+0x0/0x10 [Sat Apr 8 17:52:21 UTC 2023] Last Breaking-Event-Address: 
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff805ee060>] 0x3ff805ee060 [Sat Apr 8 17:52:21 UTC 2023] Kernel panic - not syncing: Fatal exception: panic_on_oops Oops output: [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805ee060>] 0x3ff805ee060 [Sat Apr 8 17:52:21 UTC 2023] Kernel panic - not syncing: Fatal exception: panic_on_oops ------------ [Michael] I had a look into the dump from wdc3-qz1-sr2-rk086-s05: crash> sys The system was up and running since: UPTIME: 282 days, 02:16:10 There a a lot of martian source messages again like: [Sun Apr 16 11:09:28 UTC 2023] IPv4: martian source 11.44.203.141 from 11.21.133.2, on dev ipsec0 [Sun Apr 16 11:09:28 UTC 2023] ll header: 00000000: ff ff ff ff ff ff fe ff 0b 15 85 02 08 06 I hope that we get them suppressed soon. Then at the following time a first issue can be observed: NFS timeout [Sun Apr 16 11:09:39 UTC 2023] nfs: server ccistorwdc0751-sec-fz.service.softlayer.com not responding, timed out The reason could be a) the server b) the network c) the local network adapter Then about 1:05 hour later the first mlx5 related issues are reported [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:08.1 if0200023AF58D: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5 [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:07.0 if02000440845F: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5 [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:08.1 if0200023AF58D: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5 [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:07.0 if02000440845F: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5 [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:00.2 p0v0: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5 [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:00.3 p0v1: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5 [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:00.6 p0v4: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5 [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:08.2 p1v0: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5 [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:08.3 p1v1: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5 ? Then about 15 minutes later the NFS code performs a panic_on_oops ? [Sun Apr 16 12:32:34 UTC 2023] nfs: server ccistorwdc0751-sec-fz.service.softlayer.com not responding, timed out [Sun Apr 16 12:34:10 UTC 2023] Unable to handle kernel pointer dereference in virtual kernel address space [Sun Apr 16 12:34:10 UTC 2023] Failing address: 0000809f00008000 TEID: 0000809f00008803 [Sun Apr 16 12:34:10 UTC 2023] Fault in home space mode while using kernel ASCE. 
[Sun Apr 16 12:34:10 UTC 2023] AS:00000047431f4007 R3:0000000000000024 [Sun Apr 16 12:34:10 UTC 2023] Oops: 0038 ilc:3 [#1] SMP [Sun Apr 16 12:34:10 UTC 2023] Modules linked in: sysdigcloud_probe(OE) binfmt_misc nbd vhost_net vhost macvtap macvlan tap rpcsec_gss_krb5 auth_rpcgss nfsv3 nfs_acl nfs lockd grace fscache xt_s tatistic ipt_REJECT nf_reject_ipv4 ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs ip6table_mangle ip6table_nat ebt_redirect ebt_ip ebtable_broute sunrpc lcs ctcm fsm zfcp scsi_transport_fc dasd_fba_mod dasd_ eckd_mod dasd_mod nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp xt_multiport xt_set ip_set_hash_net ip_set_hash_ip ip_set tcp_diag inet_diag xt_comment xt_nat act_gact iptable_ mangle xt_mark veth sch_multiq act_mirred act_pedit act_tunnel_key cls_flower act_police cls_u32 vxlan ip6_udp_tunnel udp_tunnel dummy sch_ingress mlx5_ib ib_uverbs ib_core mlx5_core tls mlxfw p tp pps_core xt_MASQUERADE iptable_nat xt_addrtype xt_conntrack br_netfilter bridge stp llc aufs ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bpfilter xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo [Sun Apr 16 12:34:10 UTC 2023] s390_trng vfio_ccw vfio_mdev chsc_sch mdev vfio_iommu_type1 eadm_sch vfio sch_fq_codel ip_tables x_tables overlay raid10 raid456 async_raid6_recov async_memcpy as ync_pq async_xor async_tx xor raid6_pq raid1 raid0 linear nf_tables nf_nat nf_conntrack_netlink nfnetlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c qeth_l3 qeth_l2 pkey zcrypt crc32_v x_s390 ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common qeth ccwgroup qdio scsi_dh_emc scsi_dh_rdac scsi_dh_alua dm_multipath [la st unloaded: sysdigcloud_probe] [Sun Apr 16 12:34:10 UTC 2023] CPU: 4 PID: 32942 Comm: kubelet Kdump: loaded Tainted: G W OE 5.4.0-110-generic #124~18.04.1+hf334332v20220521b1-Ubuntu [Sun Apr 16 12:34:10 UTC 2023] Hardware name: IBM 8562 GT2 A00 (LPAR) [Sun Apr 16 12:34:10 UTC 2023] Krnl PSW : 0704f00180000000 000003ff8076304a (call_bind+0x3a/0xf8 [sunrpc]) [Sun Apr 16 12:34:10 UTC 2023] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:3 PM:0 RI:0 EA:3 [Sun Apr 16 12:34:10 UTC 2023] Krnl GPRS: 00000000000001dc 0000005d16d22400 00000041b9826500 000003e008637ad8 [Sun Apr 16 12:34:10 UTC 2023] 000003ff807794d6 0000004742e35898 0000000000000000 00000041b9826537 [Sun Apr 16 12:34:10 UTC 2023] 000003ff807ae63c 000003ff80763010 0000809f0000809f 00000041b9826500 [Sun Apr 16 12:34:10 UTC 2023] 00000015a0c80000 000003ff807a1d80 000003e008637a80 000003e008637a48 [Sun Apr 16 12:34:10 UTC 2023] Krnl Code: 000003ff8076303a: a7840041 brc 8,000003ff807630bc 000003ff8076303e: e31020c00004 lg %r1,192(%r2) #000003ff80763044: e3a010000004 lg %r10,0(%r1) >000003ff8076304a: e310a4070090 llgc %r1,1031(%r10) 000003ff80763050: a7110010 tmll %r1,16 000003ff80763054: a7740025 brc 7,000003ff8076309e 000003ff80763058: c418ffffe7d8 lgrl %r1,000003ff80760008 000003ff8076305e: 91021003 tm 3(%r1),2 [Sun Apr 16 12:34:10 UTC 2023] Call Trace: [Sun Apr 16 12:34:10 UTC 2023] ([<0000000000000000>] 0x0) [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80779454>] __rpc_execute+0x8c/0x488 [sunrpc] [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80779df2>] rpc_execute+0x8a/0x128 [sunrpc] [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80766d62>] rpc_run_task+0x132/0x180 [sunrpc] [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80766e00>] rpc_call_sync+0x50/0xa0 [sunrpc] [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80360e40>] nfs3_rpc_wrapper.constprop.12+0x48/0xe0 
[nfsv3] [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80361c5e>] nfs3_proc_getattr+0x6e/0xc8 [nfsv3] [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80aaeaa8>] __nfs_revalidate_inode+0x158/0x3b0 [nfs] [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80aaef9c>] nfs_getattr+0x1bc/0x388 [nfs] [Sun Apr 16 12:34:10 UTC 2023] [<0000004742161032>] vfs_statx+0xaa/0xf8 [Sun Apr 16 12:34:10 UTC 2023] [<0000004742161798>] __do_sys_newstat+0x38/0x60 [Sun Apr 16 12:34:10 UTC 2023] [<000000474277e802>] system_call+0x2a6/0x2c8 [Sun Apr 16 12:34:10 UTC 2023] Last Breaking-Event-Address: [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80779452>] __rpc_execute+0x8a/0x488 [sunrpc] [Sun Apr 16 12:34:10 UTC 2023] Kernel panic - not syncing: Fatal exception: panic_on_oops The network interfaces p0 and p1 are missing: crash> net | grep -P "p0 |p1 " 5b726fa000 macvtap0 It looks like the p0/p1 issues where the network interfaces have been lost but no recovery was attempted. There are no related recovery messages from the mlx5 kernel module. The kernel finally dumps in the area of the NFS/RPC code. That would be the related upstream commit: aaf2e65cac7f net/mlx5: Fix handling of entry refcount when command is not issued to FW ---- [Niklas] I agree that commit does sound like it could be the fix for exactly this issue. I checked the kernel tree at the tag Ubuntu-5.4.0-128.144 and that does not appear to have this fix. If I read things correctly this is again an issue that may occur during a recovery when the PCI device is isolated and thus doesn't respond. So it likely won't help with not losing the interface but it does sound like it could solve the kernel crash/refcount warning. ==================================================================================================== Summary: Looks like this patch (aaf2e65cac7f) is missing in 20.04 and could be reason for the crash. We would like to backport this to 20.04, 20.04 HWE, 22.04 and 22.04 HWE. aaf2e65cac7f net/mlx5: Fix handling of entry refcount when command is not issued to FW https://lore.kernel.org/netdev/20221122022559.89459-6-saeed@kernel.org/ ==================================================================================================== SRU Justification: ================== [ Impact ] * The mlx5 driver is causing a Kernel panic with "refcount_t: underflow". * This issue occurs during a recovery when the PCI device is isolated and thus doesn't respond. [ Fix ] * This issue got solved upstream with aaf2e65cac7f aaf2e65cac7f2e1ae729c2fbc849091df9699f96 "net/mlx5: Fix handling of entry refcount when command is not issued to FW" (upstream since 6.1-rc1) * But to get aaf2e65cac7f a backport of b898ce7bccf1 b898ce7bccf13087719c021d829dab607c175246 "net/mlx5: cmdif, Avoid skipping reclaim pages if FW is not accessible" is required on top (upstream since 5.10) [ Test Plan ] * An Ubuntu Server for s390x 20.04 LPAR or z/VM installation is needed that has Mellanox cards (RoCE Express 2.1) assigned, configured and enabled and that runs a 5.4 kernel with mlx5 driver. * Create some network traffic on (one of the) RoCE device (interface ens???[d?]) for testing (e.g. with stress-ng). * Make sure the module/driver mlx5 is loaded and in use. * Trigger a recovery (via the Support Element) that will render the adapter (ports) unresponsive for a moment and should provoke a similar situation. * Alternatively the interface itself can be removed for a moment and re-added again (but this may break further things on top). * Due to the lack of RoCE Express 2.1 hardware, the verification is on IBM. 
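As a rough illustration of the test steps above (the interface name ens1234 and the iperf3 peer address 192.0.2.10 are placeholders, not values from this bug, and iperf3 is used here simply as one possible traffic generator next to the stress-ng suggestion), the driver check and traffic generation could look roughly like this:
# confirm the mlx5 driver is loaded and bound to the RoCE interface (interface name is a placeholder)
lsmod | grep mlx5
ethtool -i ens1234 | grep ^driver        # expected output: driver: mlx5_core
# keep sustained traffic running over that interface while the recovery is triggered
SRC=$(ip -4 -o addr show dev ens1234 | awk '{print $4}' | cut -d/ -f1)
iperf3 -c 192.0.2.10 -B "$SRC" -t 600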
[ Where problems could occur ] * The modifications are limited to the Mellanox mlx5 driver only - no other network driver is affected. * The pre-required commit (aaf2e65cac7f) can have a bad impact on (re-)claiming pages if FW is not accessible, which could cause page leaks in case done wrong. But this commit is pretty save since it's upstream since v5.10. * The fix itself (aaf2e65cac7f) mainly changes the cmd_work_handler and mlx5_cmd_comp_handler functions in a way that instead of pci_channel_offline mlx5_cmd_is_down (introiduced by b898ce7bccf1). * Actually b898ce7bccf1 started with changing from pci_channel_offline to mlx5_cmd_is_down, but looks like a few cases (in the area of refcount increate/decrease) were missed, that are now covered by aaf2e65cac7f. * It fixes now on top refcounts are now always properly increment and decrement to achieve a symmetric state for all flows. * These changes may have an impact on all cases where the mlx5 device is not responding, which can happen in case of an offline channel, interface down, reset or recovery. [ Other Info ] * A lookup at the master-next git trees for jammy, kinetic and lunar showed that both fixes are already included, hence only focal is affected. __________ ---Problem Description--- Kernel panic with "refcount_t: underflow" in kernel log Contact Information = Rijoy.k@ibm.com, vineeth.vijayan@ibm.com ---uname output--- 5.4.0-128-generic Machine Type = s390x ---System Hang--- Kernel panic and stack-trace as below ---Debugger--- A debugger is not configured Stack trace output: [Sat Apr 8 17:52:21 UTC 2023] Call Trace: [Sat Apr 8 17:52:21 UTC 2023] ([<0000002a5939a286>] refcount_warn_saturate+0xce/0x140) [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805f861e>] cmd_ent_put+0xe6/0xf8 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805f9b6a>] mlx5_cmd_comp_handler+0x102/0x4f0 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805f9f8a>] cmd_comp_notifier+0x32/0x48 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecf0c6>] notifier_call_chain+0x4e/0xa0 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecf17e>] atomic_notifier_call_chain+0x2e/0x40 [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805fe4fc>] mlx5_eq_async_int+0x13c/0x200 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecf0c6>] notifier_call_chain+0x4e/0xa0 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecf17e>] atomic_notifier_call_chain+0x2e/0x40 [Sat Apr 8 17:52:21 UTC 2023] [<000003ff8061318e>] mlx5_irq_int_handler+0x2e/0x48 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f1455a>] __handle_irq_event_percpu+0x6a/0x250 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f14770>] handle_irq_event_percpu+0x30/0x78 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f1a0c8>] handle_percpu_irq+0x68/0xa0 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f134d2>] generic_handle_irq+0x3a/0x60 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e960ce>] zpci_floating_irq_handler+0xe6/0x1b8 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a594f54a6>] do_airq_interrupt+0x96/0x130 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f1455a>] __handle_irq_event_percpu+0x6a/0x250 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f14770>] handle_irq_event_percpu+0x30/0x78 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f1a0c8>] handle_percpu_irq+0x68/0xa0 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f134d2>] generic_handle_irq+0x3a/0x60 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e30e42>] do_IRQ+0x7a/0xb0 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a5972a408>] io_int_handler+0x12c/0x294 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e2752e>] enabled_wait+0x46/0xd8 [Sat Apr 8 
17:52:21 UTC 2023] ([<0000002a58e2752e>] enabled_wait+0x46/0xd8) [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e278aa>] arch_cpu_idle+0x2a/0x40 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ee1536>] do_idle+0xee/0x1b0 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ee17a6>] cpu_startup_entry+0x36/0x40 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e3ab38>] smp_init_secondary+0xc8/0xe8 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e3a770>] smp_start_secondary+0x88/0x90 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a5972a09c>] kernel_thread_starter+0x0/0x10 [Sat Apr 8 17:52:21 UTC 2023] Last Breaking-Event-Address: [Sat Apr 8 17:52:21 UTC 2023] [<0000002a5939a286>] refcount_warn_saturate+0xce/0x140 [Sat Apr 8 17:52:21 UTC 2023] ---[ end trace 6ec6f9c6f666ca2d ]--- [Sat Apr 8 17:52:21 UTC 2023] specification exception: 0006 ilc:3 [#1] SMP [Sat Apr 8 17:52:21 UTC 2023] Modules linked in: sysdigcloud_probe(OE) vhost_net vhost macvtap macvlan tap rpcsec_gss_krb5 auth_rpcgss nfsv3 nfs_acl nfs lockd grace fscache ebtable_broute binfmt_misc nbd veth xt_statistic ipt_REJECT nf_reject_ipv4 ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs iptable_mangle ip6table_mangle ip6table_nat xt_mark sunrpc lcs ctcm fsm zfcp scsi_transport_fc dasd_fba_mod dasd_eckd_mod dasd_mod nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp xt_multiport xt_set ip_set_hash_net ip_set_hash_ip ip_set tcp_diag inet_diag xt_comment xt_nat act_gact sch_multiq act_mirred act_pedit act_tunnel_key cls_flower act_police cls_u32 vxlan ip6_udp_tunnel udp_tunnel dummy sch_ingress mlx5_ib ib_uverbs ib_core mlx5_core tls mlxfw ptp pps_core xt_MASQUERADE iptable_nat xt_addrtype xt_conntrack br_netfilter bridge stp llc aufs ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bpfilter xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo bonding s390_trng [Sat Apr 8 17:52:21 UTC 2023] vfio_ccw chsc_sch vfio_mdev mdev vfio_iommu_type1 eadm_sch vfio sch_fq_codel ip_tables x_tables overlay raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 linear nf_tables nf_nat nf_conntrack_netlink nfnetlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c qeth_l3 qeth_l2 pkey zcrypt crc32_vx_s390 ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common qeth ccwgroup qdio scsi_dh_emc scsi_dh_rdac scsi_dh_alua dm_multipath [last unloaded: sysdigcloud_probe] [Sat Apr 8 17:52:21 UTC 2023] CPU: 12 PID: 83893 Comm: kworker/u400:91 Kdump: loaded Tainted: G W OE 5.4.0-128-generic #144~18.04.1-Ubuntu [Sat Apr 8 17:52:21 UTC 2023] Hardware name: IBM 8562 GT2 A00 (LPAR) [Sat Apr 8 17:52:21 UTC 2023] Workqueue: mlx5e mlx5e_update_stats_work [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] Krnl PSW : 0404d00180000000 0000002a58ec51d8 (queue_work_on+0x30/0x70) [Sat Apr 8 17:52:21 UTC 2023] R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3 [Sat Apr 8 17:52:21 UTC 2023] Krnl GPRS: 1d721b7c57e8d7f5 0000000000000001 0000000000000200 0000006222a0e800 [Sat Apr 8 17:52:21 UTC 2023] 0000005a8e94a4e1 0000000000000000 0000000000000000 000003e016d23d08 [Sat Apr 8 17:52:21 UTC 2023] 0000005a8e94a4e1 0000006287800120 0000003b8dbbd740 0700003b8dbbd740 [Sat Apr 8 17:52:21 UTC 2023] 00000062690c6600 000003ff8069c808 000003e016d23ae0 000003e016d23aa8 [Sat Apr 8 17:52:21 UTC 2023] Krnl Code: 0000002a58ec51c6: f0a0a7190001 srp 1817(11,%r10),1,0                                           0000002a58ec51cc: e3b0f0a00004 lg %r11,160(%r15)                                          
#0000002a58ec51d2: eb11400000e6 laog %r1,%r1,0(%r4)                                          >0000002a58ec51d8: 07e0 bcr 14,%r0                                           0000002a58ec51da: a7110001 tmll %r1,1                                           0000002a58ec51de: a7840016 brc 8,0000002a58ec520a                                           0000002a58ec51e2: a7280000 lhi %r2,0                                           0000002a58ec51e6: a7b20300 tmhh %r11,768 [Sat Apr 8 17:52:21 UTC 2023] Call Trace: [Sat Apr 8 17:52:21 UTC 2023] ([<000003e016d23ae0>] 0x3e016d23ae0) [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805fab0a>] cmd_exec+0x44a/0xab0 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805fb2b0>] mlx5_cmd_exec+0x40/0x70 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<000003ff80657cb0>] mlx5_eswitch_get_vport_stats+0xb0/0x2a0 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<000003ff80644602>] mlx5e_rep_update_hw_counters+0x52/0xb8 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<000003ff8061f1ec>] mlx5e_update_stats_work+0x44/0x58 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ec56f4>] process_one_work+0x274/0x4d0 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ec5998>] worker_thread+0x48/0x560 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecd014>] kthread+0x144/0x160 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a5972a094>] ret_from_fork+0x28/0x30 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a5972a09c>] kernel_thread_starter+0x0/0x10 [Sat Apr 8 17:52:21 UTC 2023] Last Breaking-Event-Address: [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805ee060>] 0x3ff805ee060 [Sat Apr 8 17:52:21 UTC 2023] Kernel panic - not syncing: Fatal exception: panic_on_oops Oops output: [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805ee060>] 0x3ff805ee060 [Sat Apr 8 17:52:21 UTC 2023] Kernel panic - not syncing: Fatal exception: panic_on_oops ------------ [Michael] I had a look into the dump from wdc3-qz1-sr2-rk086-s05: crash> sys The system was up and running since: UPTIME: 282 days, 02:16:10 There a a lot of martian source messages again like: [Sun Apr 16 11:09:28 UTC 2023] IPv4: martian source 11.44.203.141 from 11.21.133.2, on dev ipsec0 [Sun Apr 16 11:09:28 UTC 2023] ll header: 00000000: ff ff ff ff ff ff fe ff 0b 15 85 02 08 06 I hope that we get them suppressed soon. 
Then at the following time a first issue can be observed: NFS timeout [Sun Apr 16 11:09:39 UTC 2023] nfs: server ccistorwdc0751-sec-fz.service.softlayer.com not responding, timed out The reason could be a) the server b) the network c) the local network adapter Then about 1:05 hour later the first mlx5 related issues are reported [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:08.1 if0200023AF58D: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5 [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:07.0 if02000440845F: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5 [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:08.1 if0200023AF58D: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5 [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:07.0 if02000440845F: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5 [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:00.2 p0v0: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5 [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:00.3 p0v1: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5 [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:00.6 p0v4: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5 [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:08.2 p1v0: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5 [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:08.3 p1v1: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5 ? Then about 15 minutes later the NFS code performs a panic_on_oops ? [Sun Apr 16 12:32:34 UTC 2023] nfs: server ccistorwdc0751-sec-fz.service.softlayer.com not responding, timed out [Sun Apr 16 12:34:10 UTC 2023] Unable to handle kernel pointer dereference in virtual kernel address space [Sun Apr 16 12:34:10 UTC 2023] Failing address: 0000809f00008000 TEID: 0000809f00008803 [Sun Apr 16 12:34:10 UTC 2023] Fault in home space mode while using kernel ASCE. 
[Sun Apr 16 12:34:10 UTC 2023] AS:00000047431f4007 R3:0000000000000024 [Sun Apr 16 12:34:10 UTC 2023] Oops: 0038 ilc:3 [#1] SMP [Sun Apr 16 12:34:10 UTC 2023] Modules linked in: sysdigcloud_probe(OE) binfmt_misc nbd vhost_net vhost macvtap macvlan tap rpcsec_gss_krb5 auth_rpcgss nfsv3 nfs_acl nfs lockd grace fscache xt_s tatistic ipt_REJECT nf_reject_ipv4 ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs ip6table_mangle ip6table_nat ebt_redirect ebt_ip ebtable_broute sunrpc lcs ctcm fsm zfcp scsi_transport_fc dasd_fba_mod dasd_ eckd_mod dasd_mod nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp xt_multiport xt_set ip_set_hash_net ip_set_hash_ip ip_set tcp_diag inet_diag xt_comment xt_nat act_gact iptable_ mangle xt_mark veth sch_multiq act_mirred act_pedit act_tunnel_key cls_flower act_police cls_u32 vxlan ip6_udp_tunnel udp_tunnel dummy sch_ingress mlx5_ib ib_uverbs ib_core mlx5_core tls mlxfw p tp pps_core xt_MASQUERADE iptable_nat xt_addrtype xt_conntrack br_netfilter bridge stp llc aufs ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bpfilter xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo [Sun Apr 16 12:34:10 UTC 2023] s390_trng vfio_ccw vfio_mdev chsc_sch mdev vfio_iommu_type1 eadm_sch vfio sch_fq_codel ip_tables x_tables overlay raid10 raid456 async_raid6_recov async_memcpy as ync_pq async_xor async_tx xor raid6_pq raid1 raid0 linear nf_tables nf_nat nf_conntrack_netlink nfnetlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c qeth_l3 qeth_l2 pkey zcrypt crc32_v x_s390 ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common qeth ccwgroup qdio scsi_dh_emc scsi_dh_rdac scsi_dh_alua dm_multipath [la st unloaded: sysdigcloud_probe] [Sun Apr 16 12:34:10 UTC 2023] CPU: 4 PID: 32942 Comm: kubelet Kdump: loaded Tainted: G W OE 5.4.0-110-generic #124~18.04.1+hf334332v20220521b1-Ubuntu [Sun Apr 16 12:34:10 UTC 2023] Hardware name: IBM 8562 GT2 A00 (LPAR) [Sun Apr 16 12:34:10 UTC 2023] Krnl PSW : 0704f00180000000 000003ff8076304a (call_bind+0x3a/0xf8 [sunrpc]) [Sun Apr 16 12:34:10 UTC 2023] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:3 PM:0 RI:0 EA:3 [Sun Apr 16 12:34:10 UTC 2023] Krnl GPRS: 00000000000001dc 0000005d16d22400 00000041b9826500 000003e008637ad8 [Sun Apr 16 12:34:10 UTC 2023] 000003ff807794d6 0000004742e35898 0000000000000000 00000041b9826537 [Sun Apr 16 12:34:10 UTC 2023] 000003ff807ae63c 000003ff80763010 0000809f0000809f 00000041b9826500 [Sun Apr 16 12:34:10 UTC 2023] 00000015a0c80000 000003ff807a1d80 000003e008637a80 000003e008637a48 [Sun Apr 16 12:34:10 UTC 2023] Krnl Code: 000003ff8076303a: a7840041 brc 8,000003ff807630bc                                           000003ff8076303e: e31020c00004 lg %r1,192(%r2)                                          #000003ff80763044: e3a010000004 lg %r10,0(%r1)                                          >000003ff8076304a: e310a4070090 llgc %r1,1031(%r10)                                           000003ff80763050: a7110010 tmll %r1,16                                           000003ff80763054: a7740025 brc 7,000003ff8076309e                                           000003ff80763058: c418ffffe7d8 lgrl %r1,000003ff80760008                                           000003ff8076305e: 91021003 tm 3(%r1),2 [Sun Apr 16 12:34:10 UTC 2023] Call Trace: [Sun Apr 16 12:34:10 UTC 2023] ([<0000000000000000>] 0x0) [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80779454>] __rpc_execute+0x8c/0x488 [sunrpc] [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80779df2>] 
rpc_execute+0x8a/0x128 [sunrpc] [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80766d62>] rpc_run_task+0x132/0x180 [sunrpc] [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80766e00>] rpc_call_sync+0x50/0xa0 [sunrpc] [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80360e40>] nfs3_rpc_wrapper.constprop.12+0x48/0xe0 [nfsv3] [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80361c5e>] nfs3_proc_getattr+0x6e/0xc8 [nfsv3] [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80aaeaa8>] __nfs_revalidate_inode+0x158/0x3b0 [nfs] [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80aaef9c>] nfs_getattr+0x1bc/0x388 [nfs] [Sun Apr 16 12:34:10 UTC 2023] [<0000004742161032>] vfs_statx+0xaa/0xf8 [Sun Apr 16 12:34:10 UTC 2023] [<0000004742161798>] __do_sys_newstat+0x38/0x60 [Sun Apr 16 12:34:10 UTC 2023] [<000000474277e802>] system_call+0x2a6/0x2c8 [Sun Apr 16 12:34:10 UTC 2023] Last Breaking-Event-Address: [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80779452>] __rpc_execute+0x8a/0x488 [sunrpc] [Sun Apr 16 12:34:10 UTC 2023] Kernel panic - not syncing: Fatal exception: panic_on_oops The network interfaces p0 and p1 are missing: crash> net | grep -P "p0 |p1 "    5b726fa000 macvtap0 It looks like the p0/p1 issues where the network interfaces have been lost but no recovery was attempted. There are no related recovery messages from the mlx5 kernel module. The kernel finally dumps in the area of the NFS/RPC code. That would be the related upstream commit: aaf2e65cac7f net/mlx5: Fix handling of entry refcount when command is not issued to FW ---- [Niklas] I agree that commit does sound like it could be the fix for exactly this issue. I checked the kernel tree at the tag Ubuntu-5.4.0-128.144 and that does not appear to have this fix. If I read things correctly this is again an issue that may occur during a recovery when the PCI device is isolated and thus doesn't respond. So it likely won't help with not losing the interface but it does sound like it could solve the kernel crash/refcount warning. ==================================================================================================== Summary: Looks like this patch (aaf2e65cac7f) is missing in 20.04 and could be reason for the crash. We would like to backport this to 20.04, 20.04 HWE, 22.04 and 22.04 HWE. aaf2e65cac7f net/mlx5: Fix handling of entry refcount when command is not issued to FW https://lore.kernel.org/netdev/20221122022559.89459-6-saeed@kernel.org/ ====================================================================================================
2023-06-28 10:06:03 Frank Heimes linux (Ubuntu Focal): assignee Canonical Kernel Team (canonical-kernel-team)
2023-06-28 10:06:16 Frank Heimes linux (Ubuntu Focal): importance Undecided High
2023-06-28 10:06:18 Frank Heimes linux (Ubuntu): importance Undecided High
2023-06-28 10:06:22 Frank Heimes ubuntu-z-systems: importance Undecided High
2023-06-28 12:25:46 Frank Heimes description SRU Justification: ================== [ Impact ] * The mlx5 driver is causing a kernel panic with "refcount_t: underflow". * This issue occurs during a recovery when the PCI device is isolated and thus doesn't respond. [ Fix ] * This issue got solved upstream with aaf2e65cac7f aaf2e65cac7f2e1ae729c2fbc849091df9699f96 "net/mlx5: Fix handling of entry refcount when command is not issued to FW" (upstream since 6.1-rc1). * But to get aaf2e65cac7f, a backport of b898ce7bccf1 b898ce7bccf13087719c021d829dab607c175246 "net/mlx5: cmdif, Avoid skipping reclaim pages if FW is not accessible" is required on top (upstream since 5.10). [ Test Plan ] * An Ubuntu Server for s390x 20.04 LPAR or z/VM installation is needed that has Mellanox cards (RoCE Express 2.1) assigned, configured and enabled and that runs a 5.4 kernel with the mlx5 driver. * Create some network traffic on (one of) the RoCE devices (interface ens???[d?]) for testing (e.g. with stress-ng). * Make sure the module/driver mlx5 is loaded and in use. * Trigger a recovery (via the Support Element) that will render the adapter (ports) unresponsive for a moment and should provoke a similar situation. * Alternatively, the interface itself can be removed for a moment and re-added again (but this may break further things on top). * Due to the lack of RoCE Express 2.1 hardware, the verification is on IBM. [ Where problems could occur ] * The modifications are limited to the Mellanox mlx5 driver only - no other network driver is affected. * The pre-required commit (b898ce7bccf1) can have a bad impact on (re-)claiming pages if FW is not accessible, which could cause page leaks if done wrong. But this commit is pretty safe since it has been upstream since v5.10. * The fix itself (aaf2e65cac7f) mainly changes the cmd_work_handler and mlx5_cmd_comp_handler functions so that mlx5_cmd_is_down (introduced by b898ce7bccf1) is used instead of pci_channel_offline. * Actually, b898ce7bccf1 started the change from pci_channel_offline to mlx5_cmd_is_down, but it looks like a few cases (in the area of refcount increase/decrease) were missed; these are now covered by aaf2e65cac7f. * On top of that, refcounts are now always properly incremented and decremented to achieve a symmetric state for all flows. * These changes may have an impact on all cases where the mlx5 device is not responding, which can happen in case of an offline channel, interface down, reset or recovery. [ Other Info ] * A look at the master-next git trees for jammy, kinetic and lunar showed that both fixes are already included, hence only focal is affected. 
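To make the description of the fix above more concrete, here is a minimal, self-contained C sketch of the pattern it describes. It is illustrative only: every name below (fake_cmd_iface, fake_cmd_entry, fake_cmd_submit) is a made-up stand-in, and this is not the actual mlx5_core patch, which uses refcount_t helpers and the real mlx5_cmd_is_down()/pci_channel_offline() checks.
/* Illustrative stand-in only -- NOT the actual mlx5_core code. It sketches
 * the pattern described above: decide "command interface is down" from the
 * command-interface state (what mlx5_cmd_is_down() checks) rather than only
 * from pci_channel_offline(), and keep the entry refcount symmetric so the
 * completion path never drops a reference that was never taken. */
#include <stdbool.h>

struct fake_cmd_iface {
    bool pci_channel_offline;   /* what the old check looked at            */
    bool internal_error;        /* FW/command interface unusable           */
};

struct fake_cmd_entry {
    int refcount;               /* stands in for the kernel's refcount_t   */
};

static bool fake_cmd_is_down(const struct fake_cmd_iface *cmd)
{
    /* new-style check: PCI channel offline OR internal error state */
    return cmd->pci_channel_offline || cmd->internal_error;
}

/* Submission path: take a reference for the completion side in all cases,
 * so the later put (here or in the completion handler) is always balanced. */
static void fake_cmd_submit(struct fake_cmd_iface *cmd, struct fake_cmd_entry *ent)
{
    ent->refcount++;                      /* "get" for the completion side  */
    if (fake_cmd_is_down(cmd)) {
        ent->refcount--;                  /* complete locally: balanced put */
        return;                           /* command is never issued to FW  */
    }
    /* ...otherwise issue to FW; its completion later does the matching put */
}
The point being illustrated is only the symmetry: the reference taken at submission is dropped exactly once, either locally when the command interface is down or later by the firmware completion handler, which is what prevents the refcount_t underflow.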
__________ ---Problem Description--- Kernel panic with "refcount_t: underflow" in kernel log Contact Information = Rijoy.k@ibm.com, vineeth.vijayan@ibm.com ---uname output--- 5.4.0-128-generic Machine Type = s390x ---System Hang--- Kernel panic and stack-trace as below ---Debugger--- A debugger is not configured Stack trace output: [Sat Apr 8 17:52:21 UTC 2023] Call Trace: [Sat Apr 8 17:52:21 UTC 2023] ([<0000002a5939a286>] refcount_warn_saturate+0xce/0x140) [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805f861e>] cmd_ent_put+0xe6/0xf8 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805f9b6a>] mlx5_cmd_comp_handler+0x102/0x4f0 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805f9f8a>] cmd_comp_notifier+0x32/0x48 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecf0c6>] notifier_call_chain+0x4e/0xa0 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecf17e>] atomic_notifier_call_chain+0x2e/0x40 [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805fe4fc>] mlx5_eq_async_int+0x13c/0x200 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecf0c6>] notifier_call_chain+0x4e/0xa0 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecf17e>] atomic_notifier_call_chain+0x2e/0x40 [Sat Apr 8 17:52:21 UTC 2023] [<000003ff8061318e>] mlx5_irq_int_handler+0x2e/0x48 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f1455a>] __handle_irq_event_percpu+0x6a/0x250 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f14770>] handle_irq_event_percpu+0x30/0x78 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f1a0c8>] handle_percpu_irq+0x68/0xa0 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f134d2>] generic_handle_irq+0x3a/0x60 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e960ce>] zpci_floating_irq_handler+0xe6/0x1b8 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a594f54a6>] do_airq_interrupt+0x96/0x130 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f1455a>] __handle_irq_event_percpu+0x6a/0x250 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f14770>] handle_irq_event_percpu+0x30/0x78 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f1a0c8>] handle_percpu_irq+0x68/0xa0 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f134d2>] generic_handle_irq+0x3a/0x60 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e30e42>] do_IRQ+0x7a/0xb0 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a5972a408>] io_int_handler+0x12c/0x294 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e2752e>] enabled_wait+0x46/0xd8 [Sat Apr 8 17:52:21 UTC 2023] ([<0000002a58e2752e>] enabled_wait+0x46/0xd8) [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e278aa>] arch_cpu_idle+0x2a/0x40 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ee1536>] do_idle+0xee/0x1b0 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ee17a6>] cpu_startup_entry+0x36/0x40 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e3ab38>] smp_init_secondary+0xc8/0xe8 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e3a770>] smp_start_secondary+0x88/0x90 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a5972a09c>] kernel_thread_starter+0x0/0x10 [Sat Apr 8 17:52:21 UTC 2023] Last Breaking-Event-Address: [Sat Apr 8 17:52:21 UTC 2023] [<0000002a5939a286>] refcount_warn_saturate+0xce/0x140 [Sat Apr 8 17:52:21 UTC 2023] ---[ end trace 6ec6f9c6f666ca2d ]--- [Sat Apr 8 17:52:21 UTC 2023] specification exception: 0006 ilc:3 [#1] SMP [Sat Apr 8 17:52:21 UTC 2023] Modules linked in: sysdigcloud_probe(OE) vhost_net vhost macvtap macvlan tap rpcsec_gss_krb5 auth_rpcgss nfsv3 nfs_acl nfs lockd grace fscache ebtable_broute binfmt_misc nbd veth xt_statistic ipt_REJECT nf_reject_ipv4 ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs iptable_mangle ip6table_mangle ip6table_nat xt_mark sunrpc lcs ctcm fsm zfcp scsi_transport_fc dasd_fba_mod dasd_eckd_mod dasd_mod 
nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp xt_multiport xt_set ip_set_hash_net ip_set_hash_ip ip_set tcp_diag inet_diag xt_comment xt_nat act_gact sch_multiq act_mirred act_pedit act_tunnel_key cls_flower act_police cls_u32 vxlan ip6_udp_tunnel udp_tunnel dummy sch_ingress mlx5_ib ib_uverbs ib_core mlx5_core tls mlxfw ptp pps_core xt_MASQUERADE iptable_nat xt_addrtype xt_conntrack br_netfilter bridge stp llc aufs ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bpfilter xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo bonding s390_trng [Sat Apr 8 17:52:21 UTC 2023] vfio_ccw chsc_sch vfio_mdev mdev vfio_iommu_type1 eadm_sch vfio sch_fq_codel ip_tables x_tables overlay raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 linear nf_tables nf_nat nf_conntrack_netlink nfnetlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c qeth_l3 qeth_l2 pkey zcrypt crc32_vx_s390 ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common qeth ccwgroup qdio scsi_dh_emc scsi_dh_rdac scsi_dh_alua dm_multipath [last unloaded: sysdigcloud_probe] [Sat Apr 8 17:52:21 UTC 2023] CPU: 12 PID: 83893 Comm: kworker/u400:91 Kdump: loaded Tainted: G W OE 5.4.0-128-generic #144~18.04.1-Ubuntu [Sat Apr 8 17:52:21 UTC 2023] Hardware name: IBM 8562 GT2 A00 (LPAR) [Sat Apr 8 17:52:21 UTC 2023] Workqueue: mlx5e mlx5e_update_stats_work [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] Krnl PSW : 0404d00180000000 0000002a58ec51d8 (queue_work_on+0x30/0x70) [Sat Apr 8 17:52:21 UTC 2023] R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3 [Sat Apr 8 17:52:21 UTC 2023] Krnl GPRS: 1d721b7c57e8d7f5 0000000000000001 0000000000000200 0000006222a0e800 [Sat Apr 8 17:52:21 UTC 2023] 0000005a8e94a4e1 0000000000000000 0000000000000000 000003e016d23d08 [Sat Apr 8 17:52:21 UTC 2023] 0000005a8e94a4e1 0000006287800120 0000003b8dbbd740 0700003b8dbbd740 [Sat Apr 8 17:52:21 UTC 2023] 00000062690c6600 000003ff8069c808 000003e016d23ae0 000003e016d23aa8 [Sat Apr 8 17:52:21 UTC 2023] Krnl Code: 0000002a58ec51c6: f0a0a7190001 srp 1817(11,%r10),1,0                                           0000002a58ec51cc: e3b0f0a00004 lg %r11,160(%r15)                                          #0000002a58ec51d2: eb11400000e6 laog %r1,%r1,0(%r4)                                          >0000002a58ec51d8: 07e0 bcr 14,%r0                                           0000002a58ec51da: a7110001 tmll %r1,1                                           0000002a58ec51de: a7840016 brc 8,0000002a58ec520a                                           0000002a58ec51e2: a7280000 lhi %r2,0                                           0000002a58ec51e6: a7b20300 tmhh %r11,768 [Sat Apr 8 17:52:21 UTC 2023] Call Trace: [Sat Apr 8 17:52:21 UTC 2023] ([<000003e016d23ae0>] 0x3e016d23ae0) [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805fab0a>] cmd_exec+0x44a/0xab0 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805fb2b0>] mlx5_cmd_exec+0x40/0x70 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<000003ff80657cb0>] mlx5_eswitch_get_vport_stats+0xb0/0x2a0 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<000003ff80644602>] mlx5e_rep_update_hw_counters+0x52/0xb8 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<000003ff8061f1ec>] mlx5e_update_stats_work+0x44/0x58 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ec56f4>] process_one_work+0x274/0x4d0 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ec5998>] worker_thread+0x48/0x560 [Sat Apr 8 17:52:21 UTC 2023] 
[<0000002a58ecd014>] kthread+0x144/0x160 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a5972a094>] ret_from_fork+0x28/0x30 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a5972a09c>] kernel_thread_starter+0x0/0x10 [Sat Apr 8 17:52:21 UTC 2023] Last Breaking-Event-Address: [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805ee060>] 0x3ff805ee060 [Sat Apr 8 17:52:21 UTC 2023] Kernel panic - not syncing: Fatal exception: panic_on_oops Oops output: [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805ee060>] 0x3ff805ee060 [Sat Apr 8 17:52:21 UTC 2023] Kernel panic - not syncing: Fatal exception: panic_on_oops ------------ [Michael] I had a look into the dump from wdc3-qz1-sr2-rk086-s05: crash> sys The system was up and running since: UPTIME: 282 days, 02:16:10 There a a lot of martian source messages again like: [Sun Apr 16 11:09:28 UTC 2023] IPv4: martian source 11.44.203.141 from 11.21.133.2, on dev ipsec0 [Sun Apr 16 11:09:28 UTC 2023] ll header: 00000000: ff ff ff ff ff ff fe ff 0b 15 85 02 08 06 I hope that we get them suppressed soon. Then at the following time a first issue can be observed: NFS timeout [Sun Apr 16 11:09:39 UTC 2023] nfs: server ccistorwdc0751-sec-fz.service.softlayer.com not responding, timed out The reason could be a) the server b) the network c) the local network adapter Then about 1:05 hour later the first mlx5 related issues are reported [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:08.1 if0200023AF58D: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5 [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:07.0 if02000440845F: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5 [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:08.1 if0200023AF58D: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5 [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:07.0 if02000440845F: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5 [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:00.2 p0v0: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5 [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:00.3 p0v1: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5 [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:00.6 p0v4: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5 [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:08.2 p1v0: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5 [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:08.3 p1v1: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5 ? Then about 15 minutes later the NFS code performs a panic_on_oops ? [Sun Apr 16 12:32:34 UTC 2023] nfs: server ccistorwdc0751-sec-fz.service.softlayer.com not responding, timed out [Sun Apr 16 12:34:10 UTC 2023] Unable to handle kernel pointer dereference in virtual kernel address space [Sun Apr 16 12:34:10 UTC 2023] Failing address: 0000809f00008000 TEID: 0000809f00008803 [Sun Apr 16 12:34:10 UTC 2023] Fault in home space mode while using kernel ASCE. 
[Sun Apr 16 12:34:10 UTC 2023] AS:00000047431f4007 R3:0000000000000024 [Sun Apr 16 12:34:10 UTC 2023] Oops: 0038 ilc:3 [#1] SMP [Sun Apr 16 12:34:10 UTC 2023] Modules linked in: sysdigcloud_probe(OE) binfmt_misc nbd vhost_net vhost macvtap macvlan tap rpcsec_gss_krb5 auth_rpcgss nfsv3 nfs_acl nfs lockd grace fscache xt_s tatistic ipt_REJECT nf_reject_ipv4 ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs ip6table_mangle ip6table_nat ebt_redirect ebt_ip ebtable_broute sunrpc lcs ctcm fsm zfcp scsi_transport_fc dasd_fba_mod dasd_ eckd_mod dasd_mod nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp xt_multiport xt_set ip_set_hash_net ip_set_hash_ip ip_set tcp_diag inet_diag xt_comment xt_nat act_gact iptable_ mangle xt_mark veth sch_multiq act_mirred act_pedit act_tunnel_key cls_flower act_police cls_u32 vxlan ip6_udp_tunnel udp_tunnel dummy sch_ingress mlx5_ib ib_uverbs ib_core mlx5_core tls mlxfw p tp pps_core xt_MASQUERADE iptable_nat xt_addrtype xt_conntrack br_netfilter bridge stp llc aufs ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bpfilter xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo [Sun Apr 16 12:34:10 UTC 2023] s390_trng vfio_ccw vfio_mdev chsc_sch mdev vfio_iommu_type1 eadm_sch vfio sch_fq_codel ip_tables x_tables overlay raid10 raid456 async_raid6_recov async_memcpy as ync_pq async_xor async_tx xor raid6_pq raid1 raid0 linear nf_tables nf_nat nf_conntrack_netlink nfnetlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c qeth_l3 qeth_l2 pkey zcrypt crc32_v x_s390 ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common qeth ccwgroup qdio scsi_dh_emc scsi_dh_rdac scsi_dh_alua dm_multipath [la st unloaded: sysdigcloud_probe] [Sun Apr 16 12:34:10 UTC 2023] CPU: 4 PID: 32942 Comm: kubelet Kdump: loaded Tainted: G W OE 5.4.0-110-generic #124~18.04.1+hf334332v20220521b1-Ubuntu [Sun Apr 16 12:34:10 UTC 2023] Hardware name: IBM 8562 GT2 A00 (LPAR) [Sun Apr 16 12:34:10 UTC 2023] Krnl PSW : 0704f00180000000 000003ff8076304a (call_bind+0x3a/0xf8 [sunrpc]) [Sun Apr 16 12:34:10 UTC 2023] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:3 PM:0 RI:0 EA:3 [Sun Apr 16 12:34:10 UTC 2023] Krnl GPRS: 00000000000001dc 0000005d16d22400 00000041b9826500 000003e008637ad8 [Sun Apr 16 12:34:10 UTC 2023] 000003ff807794d6 0000004742e35898 0000000000000000 00000041b9826537 [Sun Apr 16 12:34:10 UTC 2023] 000003ff807ae63c 000003ff80763010 0000809f0000809f 00000041b9826500 [Sun Apr 16 12:34:10 UTC 2023] 00000015a0c80000 000003ff807a1d80 000003e008637a80 000003e008637a48 [Sun Apr 16 12:34:10 UTC 2023] Krnl Code: 000003ff8076303a: a7840041 brc 8,000003ff807630bc                                           000003ff8076303e: e31020c00004 lg %r1,192(%r2)                                          #000003ff80763044: e3a010000004 lg %r10,0(%r1)                                          >000003ff8076304a: e310a4070090 llgc %r1,1031(%r10)                                           000003ff80763050: a7110010 tmll %r1,16                                           000003ff80763054: a7740025 brc 7,000003ff8076309e                                           000003ff80763058: c418ffffe7d8 lgrl %r1,000003ff80760008                                           000003ff8076305e: 91021003 tm 3(%r1),2 [Sun Apr 16 12:34:10 UTC 2023] Call Trace: [Sun Apr 16 12:34:10 UTC 2023] ([<0000000000000000>] 0x0) [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80779454>] __rpc_execute+0x8c/0x488 [sunrpc] [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80779df2>] 
rpc_execute+0x8a/0x128 [sunrpc] [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80766d62>] rpc_run_task+0x132/0x180 [sunrpc] [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80766e00>] rpc_call_sync+0x50/0xa0 [sunrpc] [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80360e40>] nfs3_rpc_wrapper.constprop.12+0x48/0xe0 [nfsv3] [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80361c5e>] nfs3_proc_getattr+0x6e/0xc8 [nfsv3] [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80aaeaa8>] __nfs_revalidate_inode+0x158/0x3b0 [nfs] [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80aaef9c>] nfs_getattr+0x1bc/0x388 [nfs] [Sun Apr 16 12:34:10 UTC 2023] [<0000004742161032>] vfs_statx+0xaa/0xf8 [Sun Apr 16 12:34:10 UTC 2023] [<0000004742161798>] __do_sys_newstat+0x38/0x60 [Sun Apr 16 12:34:10 UTC 2023] [<000000474277e802>] system_call+0x2a6/0x2c8 [Sun Apr 16 12:34:10 UTC 2023] Last Breaking-Event-Address: [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80779452>] __rpc_execute+0x8a/0x488 [sunrpc] [Sun Apr 16 12:34:10 UTC 2023] Kernel panic - not syncing: Fatal exception: panic_on_oops The network interfaces p0 and p1 are missing: crash> net | grep -P "p0 |p1 "    5b726fa000 macvtap0 It looks like the p0/p1 issues where the network interfaces have been lost but no recovery was attempted. There are no related recovery messages from the mlx5 kernel module. The kernel finally dumps in the area of the NFS/RPC code. That would be the related upstream commit: aaf2e65cac7f net/mlx5: Fix handling of entry refcount when command is not issued to FW ---- [Niklas] I agree that commit does sound like it could be the fix for exactly this issue. I checked the kernel tree at the tag Ubuntu-5.4.0-128.144 and that does not appear to have this fix. If I read things correctly this is again an issue that may occur during a recovery when the PCI device is isolated and thus doesn't respond. So it likely won't help with not losing the interface but it does sound like it could solve the kernel crash/refcount warning. ==================================================================================================== Summary: Looks like this patch (aaf2e65cac7f) is missing in 20.04 and could be reason for the crash. We would like to backport this to 20.04, 20.04 HWE, 22.04 and 22.04 HWE. aaf2e65cac7f net/mlx5: Fix handling of entry refcount when command is not issued to FW https://lore.kernel.org/netdev/20221122022559.89459-6-saeed@kernel.org/ ==================================================================================================== SRU Justification: ================== [ Impact ]  * The mlx5 driver is causing a Kernel panic with    "refcount_t: underflow".  * This issue occurs during a recovery when the PCI device    is isolated and thus doesn't respond. [ Fix ]  * This issue got solved upstream with    aaf2e65cac7f aaf2e65cac7f2e1ae729c2fbc849091df9699f96    "net/mlx5: Fix handling of entry refcount when command    is not issued to FW" (upstream since 6.1-rc1)  * But to get aaf2e65cac7f a backport of b898ce7bccf1    b898ce7bccf13087719c021d829dab607c175246    "net/mlx5: cmdif, Avoid skipping reclaim pages if FW is    not accessible" is required on top (upstream since 5.10) [ Test Plan ]  * An Ubuntu Server for s390x 20.04 LPAR or z/VM installation    is needed that has Mellanox cards (RoCE Express 2.1)    assigned, configured and enabled and that runs a 5.4    kernel with mlx5 driver.  * Create some network traffic on (one of the) RoCE device    (interface ens???[d?]) for testing (e.g. with stress-ng).  * Make sure the module/driver mlx5 is loaded and in use.  
* Trigger a recovery (via the Support Element)    that will render the adapter (ports) unresponsive    for a moment and should provoke a similar situation.  * Alternatively the interface itself can be removed for    a moment and re-added again (but this may break further    things on top).  * Due to the lack of RoCE Express 2.1 hardware,    the verification is on IBM. [ Where problems could occur ]  * The modifications are limited to the Mellanox mlx5 driver    only - no other network driver is affected.  * The pre-required commit (aaf2e65cac7f) can have a bad    impact on (re-)claiming pages if FW is not accessible,    which could cause page leaks in case done wrong.    But this commit is pretty save since it's upstream    since v5.10.  * The fix itself (aaf2e65cac7f) mainly changes the    cmd_work_handler and mlx5_cmd_comp_handler functions    in a way that instead of pci_channel_offline    mlx5_cmd_is_down (introiduced by b898ce7bccf1).  * Actually b898ce7bccf1 started with changing from    pci_channel_offline to mlx5_cmd_is_down,    but looks like a few cases    (in the area of refcount increate/decrease) were missed,    that are now covered by aaf2e65cac7f.  * It fixes now on top refcounts are now always properly    increment and decrement to achieve a symmetric state    for all flows.  * These changes may have an impact on all cases where the    mlx5 device is not responding, which can happen in case    of an offline channel, interface down, reset or recovery. [ Other Info ]  * Looking at the master-next git trees for jammy, kinetic    and lunar showed that both fixes are already included,    hence only focal is affected. __________ ---Problem Description--- Kernel panic with "refcount_t: underflow" in kernel log Contact Information = Rijoy.k@ibm.com, vineeth.vijayan@ibm.com ---uname output--- 5.4.0-128-generic Machine Type = s390x ---System Hang--- Kernel panic and stack-trace as below ---Debugger--- A debugger is not configured Stack trace output: [Sat Apr 8 17:52:21 UTC 2023] Call Trace: [Sat Apr 8 17:52:21 UTC 2023] ([<0000002a5939a286>] refcount_warn_saturate+0xce/0x140) [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805f861e>] cmd_ent_put+0xe6/0xf8 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805f9b6a>] mlx5_cmd_comp_handler+0x102/0x4f0 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805f9f8a>] cmd_comp_notifier+0x32/0x48 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecf0c6>] notifier_call_chain+0x4e/0xa0 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecf17e>] atomic_notifier_call_chain+0x2e/0x40 [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805fe4fc>] mlx5_eq_async_int+0x13c/0x200 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecf0c6>] notifier_call_chain+0x4e/0xa0 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecf17e>] atomic_notifier_call_chain+0x2e/0x40 [Sat Apr 8 17:52:21 UTC 2023] [<000003ff8061318e>] mlx5_irq_int_handler+0x2e/0x48 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f1455a>] __handle_irq_event_percpu+0x6a/0x250 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f14770>] handle_irq_event_percpu+0x30/0x78 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f1a0c8>] handle_percpu_irq+0x68/0xa0 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f134d2>] generic_handle_irq+0x3a/0x60 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e960ce>] zpci_floating_irq_handler+0xe6/0x1b8 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a594f54a6>] do_airq_interrupt+0x96/0x130 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f1455a>] __handle_irq_event_percpu+0x6a/0x250 [Sat Apr 8 17:52:21 UTC 2023] 
[<0000002a58f14770>] handle_irq_event_percpu+0x30/0x78 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f1a0c8>] handle_percpu_irq+0x68/0xa0 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f134d2>] generic_handle_irq+0x3a/0x60 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e30e42>] do_IRQ+0x7a/0xb0 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a5972a408>] io_int_handler+0x12c/0x294 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e2752e>] enabled_wait+0x46/0xd8 [Sat Apr 8 17:52:21 UTC 2023] ([<0000002a58e2752e>] enabled_wait+0x46/0xd8) [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e278aa>] arch_cpu_idle+0x2a/0x40 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ee1536>] do_idle+0xee/0x1b0 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ee17a6>] cpu_startup_entry+0x36/0x40 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e3ab38>] smp_init_secondary+0xc8/0xe8 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e3a770>] smp_start_secondary+0x88/0x90 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a5972a09c>] kernel_thread_starter+0x0/0x10 [Sat Apr 8 17:52:21 UTC 2023] Last Breaking-Event-Address: [Sat Apr 8 17:52:21 UTC 2023] [<0000002a5939a286>] refcount_warn_saturate+0xce/0x140 [Sat Apr 8 17:52:21 UTC 2023] ---[ end trace 6ec6f9c6f666ca2d ]--- [Sat Apr 8 17:52:21 UTC 2023] specification exception: 0006 ilc:3 [#1] SMP [Sat Apr 8 17:52:21 UTC 2023] Modules linked in: sysdigcloud_probe(OE) vhost_net vhost macvtap macvlan tap rpcsec_gss_krb5 auth_rpcgss nfsv3 nfs_acl nfs lockd grace fscache ebtable_broute binfmt_misc nbd veth xt_statistic ipt_REJECT nf_reject_ipv4 ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs iptable_mangle ip6table_mangle ip6table_nat xt_mark sunrpc lcs ctcm fsm zfcp scsi_transport_fc dasd_fba_mod dasd_eckd_mod dasd_mod nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp xt_multiport xt_set ip_set_hash_net ip_set_hash_ip ip_set tcp_diag inet_diag xt_comment xt_nat act_gact sch_multiq act_mirred act_pedit act_tunnel_key cls_flower act_police cls_u32 vxlan ip6_udp_tunnel udp_tunnel dummy sch_ingress mlx5_ib ib_uverbs ib_core mlx5_core tls mlxfw ptp pps_core xt_MASQUERADE iptable_nat xt_addrtype xt_conntrack br_netfilter bridge stp llc aufs ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bpfilter xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo bonding s390_trng [Sat Apr 8 17:52:21 UTC 2023] vfio_ccw chsc_sch vfio_mdev mdev vfio_iommu_type1 eadm_sch vfio sch_fq_codel ip_tables x_tables overlay raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 linear nf_tables nf_nat nf_conntrack_netlink nfnetlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c qeth_l3 qeth_l2 pkey zcrypt crc32_vx_s390 ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common qeth ccwgroup qdio scsi_dh_emc scsi_dh_rdac scsi_dh_alua dm_multipath [last unloaded: sysdigcloud_probe] [Sat Apr 8 17:52:21 UTC 2023] CPU: 12 PID: 83893 Comm: kworker/u400:91 Kdump: loaded Tainted: G W OE 5.4.0-128-generic #144~18.04.1-Ubuntu [Sat Apr 8 17:52:21 UTC 2023] Hardware name: IBM 8562 GT2 A00 (LPAR) [Sat Apr 8 17:52:21 UTC 2023] Workqueue: mlx5e mlx5e_update_stats_work [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] Krnl PSW : 0404d00180000000 0000002a58ec51d8 (queue_work_on+0x30/0x70) [Sat Apr 8 17:52:21 UTC 2023] R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3 [Sat Apr 8 17:52:21 UTC 2023] Krnl GPRS: 1d721b7c57e8d7f5 0000000000000001 0000000000000200 0000006222a0e800 [Sat Apr 8 17:52:21 UTC 2023] 0000005a8e94a4e1 0000000000000000 
0000000000000000 000003e016d23d08 [Sat Apr 8 17:52:21 UTC 2023] 0000005a8e94a4e1 0000006287800120 0000003b8dbbd740 0700003b8dbbd740 [Sat Apr 8 17:52:21 UTC 2023] 00000062690c6600 000003ff8069c808 000003e016d23ae0 000003e016d23aa8 [Sat Apr 8 17:52:21 UTC 2023] Krnl Code: 0000002a58ec51c6: f0a0a7190001 srp 1817(11,%r10),1,0    0000002a58ec51cc: e3b0f0a00004 lg %r11,160(%r15)    #0000002a58ec51d2: eb11400000e6 laog %r1,%r1,0(%r4)    >0000002a58ec51d8: 07e0 bcr 14,%r0    0000002a58ec51da: a7110001 tmll %r1,1    0000002a58ec51de: a7840016 brc 8,0000002a58ec520a    0000002a58ec51e2: a7280000 lhi %r2,0    0000002a58ec51e6: a7b20300 tmhh %r11,768 [Sat Apr 8 17:52:21 UTC 2023] Call Trace: [Sat Apr 8 17:52:21 UTC 2023] ([<000003e016d23ae0>] 0x3e016d23ae0) [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805fab0a>] cmd_exec+0x44a/0xab0 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805fb2b0>] mlx5_cmd_exec+0x40/0x70 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<000003ff80657cb0>] mlx5_eswitch_get_vport_stats+0xb0/0x2a0 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<000003ff80644602>] mlx5e_rep_update_hw_counters+0x52/0xb8 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<000003ff8061f1ec>] mlx5e_update_stats_work+0x44/0x58 [mlx5_core] [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ec56f4>] process_one_work+0x274/0x4d0 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ec5998>] worker_thread+0x48/0x560 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecd014>] kthread+0x144/0x160 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a5972a094>] ret_from_fork+0x28/0x30 [Sat Apr 8 17:52:21 UTC 2023] [<0000002a5972a09c>] kernel_thread_starter+0x0/0x10 [Sat Apr 8 17:52:21 UTC 2023] Last Breaking-Event-Address: [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805ee060>] 0x3ff805ee060 [Sat Apr 8 17:52:21 UTC 2023] Kernel panic - not syncing: Fatal exception: panic_on_oops Oops output: [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805ee060>] 0x3ff805ee060 [Sat Apr 8 17:52:21 UTC 2023] Kernel panic - not syncing: Fatal exception: panic_on_oops ------------ [Michael] I had a look into the dump from wdc3-qz1-sr2-rk086-s05: crash> sys The system was up and running since: UPTIME: 282 days, 02:16:10 There are a lot of martian source messages again like: [Sun Apr 16 11:09:28 UTC 2023] IPv4: martian source 11.44.203.141 from 11.21.133.2, on dev ipsec0 [Sun Apr 16 11:09:28 UTC 2023] ll header: 00000000: ff ff ff ff ff ff fe ff 0b 15 85 02 08 06 I hope that we get them suppressed soon. 
Then, at the following time, a first issue can be observed: an NFS timeout [Sun Apr 16 11:09:39 UTC 2023] nfs: server ccistorwdc0751-sec-fz.service.softlayer.com not responding, timed out The reason could be a) the server b) the network c) the local network adapter Then about 1 hour and 5 minutes later the first mlx5-related issues are reported [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:08.1 if0200023AF58D: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5 [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:07.0 if02000440845F: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5 [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:08.1 if0200023AF58D: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5 [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:07.0 if02000440845F: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5 [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:00.2 p0v0: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5 [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:00.3 p0v1: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5 [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:00.6 p0v4: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5 [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:08.2 p1v0: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5 [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:08.3 p1v1: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5 ? Then about 15 minutes later the NFS code performs a panic_on_oops ? [Sun Apr 16 12:32:34 UTC 2023] nfs: server ccistorwdc0751-sec-fz.service.softlayer.com not responding, timed out [Sun Apr 16 12:34:10 UTC 2023] Unable to handle kernel pointer dereference in virtual kernel address space [Sun Apr 16 12:34:10 UTC 2023] Failing address: 0000809f00008000 TEID: 0000809f00008803 [Sun Apr 16 12:34:10 UTC 2023] Fault in home space mode while using kernel ASCE. 
[Sun Apr 16 12:34:10 UTC 2023] AS:00000047431f4007 R3:0000000000000024 [Sun Apr 16 12:34:10 UTC 2023] Oops: 0038 ilc:3 [#1] SMP [Sun Apr 16 12:34:10 UTC 2023] Modules linked in: sysdigcloud_probe(OE) binfmt_misc nbd vhost_net vhost macvtap macvlan tap rpcsec_gss_krb5 auth_rpcgss nfsv3 nfs_acl nfs lockd grace fscache xt_statistic ipt_REJECT nf_reject_ipv4 ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs ip6table_mangle ip6table_nat ebt_redirect ebt_ip ebtable_broute sunrpc lcs ctcm fsm zfcp scsi_transport_fc dasd_fba_mod dasd_eckd_mod dasd_mod nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp xt_multiport xt_set ip_set_hash_net ip_set_hash_ip ip_set tcp_diag inet_diag xt_comment xt_nat act_gact iptable_mangle xt_mark veth sch_multiq act_mirred act_pedit act_tunnel_key cls_flower act_police cls_u32 vxlan ip6_udp_tunnel udp_tunnel dummy sch_ingress mlx5_ib ib_uverbs ib_core mlx5_core tls mlxfw ptp pps_core xt_MASQUERADE iptable_nat xt_addrtype xt_conntrack br_netfilter bridge stp llc aufs ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bpfilter xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo [Sun Apr 16 12:34:10 UTC 2023] s390_trng vfio_ccw vfio_mdev chsc_sch mdev vfio_iommu_type1 eadm_sch vfio sch_fq_codel ip_tables x_tables overlay raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 linear nf_tables nf_nat nf_conntrack_netlink nfnetlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c qeth_l3 qeth_l2 pkey zcrypt crc32_vx_s390 ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common qeth ccwgroup qdio scsi_dh_emc scsi_dh_rdac scsi_dh_alua dm_multipath [last unloaded: sysdigcloud_probe] [Sun Apr 16 12:34:10 UTC 2023] CPU: 4 PID: 32942 Comm: kubelet Kdump: loaded Tainted: G W OE 5.4.0-110-generic #124~18.04.1+hf334332v20220521b1-Ubuntu [Sun Apr 16 12:34:10 UTC 2023] Hardware name: IBM 8562 GT2 A00 (LPAR) [Sun Apr 16 12:34:10 UTC 2023] Krnl PSW : 0704f00180000000 000003ff8076304a (call_bind+0x3a/0xf8 [sunrpc]) [Sun Apr 16 12:34:10 UTC 2023] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:3 PM:0 RI:0 EA:3 [Sun Apr 16 12:34:10 UTC 2023] Krnl GPRS: 00000000000001dc 0000005d16d22400 00000041b9826500 000003e008637ad8 [Sun Apr 16 12:34:10 UTC 2023] 000003ff807794d6 0000004742e35898 0000000000000000 00000041b9826537 [Sun Apr 16 12:34:10 UTC 2023] 000003ff807ae63c 000003ff80763010 0000809f0000809f 00000041b9826500 [Sun Apr 16 12:34:10 UTC 2023] 00000015a0c80000 000003ff807a1d80 000003e008637a80 000003e008637a48 [Sun Apr 16 12:34:10 UTC 2023] Krnl Code: 000003ff8076303a: a7840041 brc 8,000003ff807630bc    000003ff8076303e: e31020c00004 lg %r1,192(%r2)    #000003ff80763044: e3a010000004 lg %r10,0(%r1)    >000003ff8076304a: e310a4070090 llgc %r1,1031(%r10)    000003ff80763050: a7110010 tmll %r1,16    000003ff80763054: a7740025 brc 7,000003ff8076309e    000003ff80763058: c418ffffe7d8 lgrl %r1,000003ff80760008    000003ff8076305e: 91021003 tm 3(%r1),2 [Sun Apr 16 12:34:10 UTC 2023] Call Trace: [Sun Apr 16 12:34:10 UTC 2023] ([<0000000000000000>] 0x0) [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80779454>] __rpc_execute+0x8c/0x488 [sunrpc] [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80779df2>] 
rpc_execute+0x8a/0x128 [sunrpc] [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80766d62>] rpc_run_task+0x132/0x180 [sunrpc] [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80766e00>] rpc_call_sync+0x50/0xa0 [sunrpc] [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80360e40>] nfs3_rpc_wrapper.constprop.12+0x48/0xe0 [nfsv3] [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80361c5e>] nfs3_proc_getattr+0x6e/0xc8 [nfsv3] [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80aaeaa8>] __nfs_revalidate_inode+0x158/0x3b0 [nfs] [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80aaef9c>] nfs_getattr+0x1bc/0x388 [nfs] [Sun Apr 16 12:34:10 UTC 2023] [<0000004742161032>] vfs_statx+0xaa/0xf8 [Sun Apr 16 12:34:10 UTC 2023] [<0000004742161798>] __do_sys_newstat+0x38/0x60 [Sun Apr 16 12:34:10 UTC 2023] [<000000474277e802>] system_call+0x2a6/0x2c8 [Sun Apr 16 12:34:10 UTC 2023] Last Breaking-Event-Address: [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80779452>] __rpc_execute+0x8a/0x488 [sunrpc] [Sun Apr 16 12:34:10 UTC 2023] Kernel panic - not syncing: Fatal exception: panic_on_oops The network interfaces p0 and p1 are missing: crash> net | grep -P "p0 |p1 "    5b726fa000 macvtap0 It looks like the p0/p1 issue: the network interfaces have been lost, but no recovery was attempted. There are no related recovery messages from the mlx5 kernel module. The kernel finally dumps in the area of the NFS/RPC code. That would be the related upstream commit: aaf2e65cac7f net/mlx5: Fix handling of entry refcount when command is not issued to FW ---- [Niklas] I agree that commit does sound like it could be the fix for exactly this issue. I checked the kernel tree at the tag Ubuntu-5.4.0-128.144 and that does not appear to have this fix. If I read things correctly, this is again an issue that may occur during a recovery when the PCI device is isolated and thus doesn't respond. So it likely won't help with not losing the interface, but it does sound like it could solve the kernel crash/refcount warning. ==================================================================================================== Summary: It looks like this patch (aaf2e65cac7f) is missing in 20.04 and could be the reason for the crash. We would like to backport this to 20.04, 20.04 HWE, 22.04 and 22.04 HWE. aaf2e65cac7f net/mlx5: Fix handling of entry refcount when command is not issued to FW https://lore.kernel.org/netdev/20221122022559.89459-6-saeed@kernel.org/ ====================================================================================================
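Below is a minimal, illustrative command sketch of the verification steps described in the test plan above. It is an outline under assumptions, not part of the original report: the interface name (ens1234) and the stress-ng invocation are placeholders, any sustained traffic generator would do, and the recovery itself is still triggered from the Support Element as described.

  $ uname -r                               # expect the focal 5.4.0 kernel carrying the backport
  $ lsmod | grep mlx5                      # mlx5_core (and mlx5_ib) should be loaded and in use
  $ ip -br link                            # locate the RoCE interface(s), e.g. ens1234 (placeholder)
  $ stress-ng --sock 4 --timeout 10m &     # keep some network load running (illustrative choice)
  $ dmesg -wT | grep -Ei 'mlx5|refcount'   # watch for mlx5 errors and "refcount_t: underflow"
                                           # while the recovery is triggered on the Support Element

With the fixed kernel, the adapter going unresponsive may still produce mlx5 command errors, but no "refcount_t: underflow" warning or panic should follow.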
2023-07-07 13:50:39 Roxana Nicolescu linux (Ubuntu Focal): status In Progress Fix Committed
2023-07-07 15:22:07 Frank Heimes ubuntu-z-systems: status In Progress Fix Committed
2023-07-13 15:38:31 Ubuntu Kernel Bot tags architecture-s3903164 bugnameltc-202279 patch severity-high targetmilestone-inin2004 architecture-s3903164 bugnameltc-202279 kernel-spammed-focal-linux patch severity-high targetmilestone-inin2004 verification-needed-focal
2023-08-10 08:19:35 Frank Heimes tags architecture-s3903164 bugnameltc-202279 kernel-spammed-focal-linux patch severity-high targetmilestone-inin2004 verification-needed-focal architecture-s3903164 bugnameltc-202279 kernel-spammed-focal-linux patch severity-high targetmilestone-inin2004 verification-done-focal
2023-08-10 15:23:34 Launchpad Janitor linux (Ubuntu Focal): status Fix Committed Fix Released
2023-08-10 15:23:34 Launchpad Janitor cve linked 2020-36691
2023-08-10 15:23:34 Launchpad Janitor cve linked 2022-0168
2023-08-10 15:23:34 Launchpad Janitor cve linked 2022-1184
2023-08-10 15:23:34 Launchpad Janitor cve linked 2022-27672
2023-08-10 15:23:34 Launchpad Janitor cve linked 2022-4269
2023-08-10 15:23:34 Launchpad Janitor cve linked 2023-1611
2023-08-10 15:23:34 Launchpad Janitor cve linked 2023-2124
2023-08-10 15:23:34 Launchpad Janitor cve linked 2023-3090
2023-08-10 15:23:34 Launchpad Janitor cve linked 2023-3111
2023-08-10 15:23:34 Launchpad Janitor cve linked 2023-3141
2023-08-10 15:23:34 Launchpad Janitor cve linked 2023-32629
2023-08-10 15:23:34 Launchpad Janitor cve linked 2023-3390
2023-08-10 15:23:34 Launchpad Janitor cve linked 2023-35001
2023-08-10 17:15:49 Frank Heimes ubuntu-z-systems: status Fix Committed Fix Released
2023-08-18 23:31:31 Ubuntu Kernel Bot tags architecture-s3903164 bugnameltc-202279 kernel-spammed-focal-linux patch severity-high targetmilestone-inin2004 verification-done-focal architecture-s3903164 bugnameltc-202279 kernel-spammed-focal-linux kernel-spammed-focal-linux-ibm-v2 patch severity-high targetmilestone-inin2004 verification-done-focal verification-needed-focal-linux-ibm
2023-08-18 23:31:43 Ubuntu Kernel Bot tags architecture-s3903164 bugnameltc-202279 kernel-spammed-focal-linux kernel-spammed-focal-linux-ibm-v2 patch severity-high targetmilestone-inin2004 verification-done-focal verification-needed-focal-linux-ibm architecture-s3903164 bugnameltc-202279 kernel-spammed-focal-linux kernel-spammed-focal-linux-azure-v2 kernel-spammed-focal-linux-ibm-v2 patch severity-high targetmilestone-inin2004 verification-done-focal verification-needed-focal-linux-azure verification-needed-focal-linux-ibm
2023-08-23 16:17:37 Ubuntu Kernel Bot tags architecture-s3903164 bugnameltc-202279 kernel-spammed-focal-linux kernel-spammed-focal-linux-azure-v2 kernel-spammed-focal-linux-ibm-v2 patch severity-high targetmilestone-inin2004 verification-done-focal verification-needed-focal-linux-azure verification-needed-focal-linux-ibm architecture-s3903164 bugnameltc-202279 kernel-spammed-focal-linux kernel-spammed-focal-linux-azure-v2 kernel-spammed-focal-linux-bluefield-v2 kernel-spammed-focal-linux-ibm-v2 patch severity-high targetmilestone-inin2004 verification-done-focal verification-needed-focal-linux-azure verification-needed-focal-linux-bluefield verification-needed-focal-linux-ibm
2023-10-17 14:50:59 bugproxy tags architecture-s3903164 bugnameltc-202279 kernel-spammed-focal-linux kernel-spammed-focal-linux-azure-v2 kernel-spammed-focal-linux-bluefield-v2 kernel-spammed-focal-linux-ibm-v2 patch severity-high targetmilestone-inin2004 verification-done-focal verification-needed-focal-linux-azure verification-needed-focal-linux-bluefield verification-needed-focal-linux-ibm architecture-s3903164 bugnameltc-202279 kernel-spammed-focal-linux kernel-spammed-focal-linux-azure-v2 kernel-spammed-focal-linux-bluefield-v2 kernel-spammed-focal-linux-ibm-v2 patch severity-high targetmilestone-inin2004 verification-done-focal verification-needed-focal-linux-azure verification-needed-focal-linux-ibm