linux 4.15.0-109-generic network DoS regression vs -108
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Fix Released | Undecided | Thadeu Lima de Souza Cascardo |
Bionic | Fix Released | Critical | Thadeu Lima de Souza Cascardo |
Eoan | Fix Released | Undecided | Unassigned |
Focal | Fix Released | Undecided | Unassigned |
Groovy | Fix Released | Undecided | Thadeu Lima de Souza Cascardo |
Bug Description
[Impact]
On systems that use cgroups and sockets extensively, such as those running docker, kubernetes, lxd or libvirt, a crash may happen when using linux 4.15.0-109-generic.
[Fix]
Revert the patch that disables sk_alloc cgroup refcounting when tasks are added to a net_prio cgroup.
[Test case]
Verify that environments where the issue reproduces survive several hours of uptime. A different bug was reproduced with work-in-progress code, and it did not reproduce with the culprit commit reverted.
[Regression potential]
The reverted commit fixes a memory leak in similar scenarios, but a leak is better than a crash. Two other bugs have been opened to track a proper fix for both this issue and the leak.
-------
Reported from a user:
Several of our infrastructure VMs recently started crashing (oops
attached), after they upgraded to -109. -108 appears to be stable.
Analysing the crash, it appears to be a wild pointer access in a BPF
filter, which makes this (probably) a network-traffic triggered crash.
[ 696.396831] general protection fault: 0000 [#1] SMP PTI
[ 696.396843] Modules linked in: iscsi_target_mod target_core_mod ipt_MASQUERADE nf_nat_
[ 696.396966] CPU: 6 PID: 0 Comm: swapper/6 Not tainted 4.15.0-109-generic #110-Ubuntu
[ 696.396979] Hardware name: Xen HVM domU, BIOS 4.7.6-1.26 12/03/2018
[ 696.396993] RIP: 0010:__
[ 696.397005] RSP: 0018:ffff893fdc
[ 696.397015] RAX: 6d69546e6f697469 RBX: 0000000000000000 RCX: 0000000000000014
[ 696.397028] RDX: 0000000000000000 RSI: ffff893fd0360000 RDI: ffff893fb5154800
[ 696.397041] RBP: ffff893fdcb83ad0 R08: 0000000000000001 R09: 0000000000000000
[ 696.397058] R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000014
[ 696.397075] R13: ffff893fb5154800 R14: 0000000000000020 R15: ffff893fc6ba4d00
[ 696.397091] FS: 000000000000000
[ 696.397107] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 696.397119] CR2: 000000c0001b4000 CR3: 00000006dce0a004 CR4: 00000000003606e0
[ 696.397135] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 696.397152] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 696.397169] Call Trace:
[ 696.397175] <IRQ>
[ 696.397183] sk_filter_
[ 696.397191] tcp_v4_
[ 696.397199] ip_local_
[ 696.397208] ip_local_
[ 696.397215] ? ip_rcv_
[ 696.397223] ip_rcv_
[ 696.397230] ip_rcv+0x296/0x360
[ 696.397238] ? inet_del_
[ 696.397249] __netif_
[ 696.397261] ? skb_send_
[ 696.397271] ? tcp4_gro_
[ 696.397280] __netif_
[ 696.397290] ? __netif_
[ 696.397300] netif_receive_
[ 696.397309] napi_gro_
[ 696.397317] xennet_
[ 696.397325] net_rx_
[ 696.397334] __do_softirq+
[ 696.397344] irq_exit+0xc5/0xd0
[ 696.397352] xen_evtchn_
[ 696.397361] xen_hvm_
[ 696.397371] </IRQ>
[ 696.397378] RIP: 0010:native_
[ 696.397390] RSP: 0018:ffff94c486
[ 696.397405] RAX: ffffffff8efc1800 RBX: 0000000000000006 RCX: 0000000000000000
[ 696.397419] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 696.397435] RBP: ffff94c4862cbe80 R08: 0000000000000002 R09: 0000000000000001
[ 696.397449] R10: 0000000000100000 R11: 0000000000000397 R12: 0000000000000006
[ 696.397462] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 696.397479] ? __sched_
[ 696.397489] default_
[ 696.397499] arch_cpu_
[ 696.397507] default_
[ 696.397515] do_idle+0x172/0x1f0
[ 696.397522] cpu_startup_
[ 696.397530] start_secondary
[ 696.397538] secondary_
[ 696.397545] Code: 89 5d b0 49 29 cc 45 01 a7 80 00 00 00 44 89 e1 48 29 c8 48 89 4d a8 49 89 87 d8 00 00 00 89 d2 48 8d 84 d6 38 03 00 00 48 8b 00 <4c> 8b 70 10 4c 8d 68 10 4d 85 f6 0f 84 f6 00 00 00 49 8d 47 30
[ 696.397584] RIP: __cgroup_
[ 696.397607] ---[ end trace ec5c84424d511a6f ]---
[ 696.397616] Kernel panic - not syncing: Fatal exception in interrupt
[ 696.397876] Kernel Offset: 0xd600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000
We've correlated some of the other crashes, and the ASCII was a bit of a
red herring. All the others are a NULL pointer dereference in the same
place, so the problem is likely an out-of-bounds memory read (possibly a
use-after-free) of a piece of memory which is usually zero, but not always.
It is actually the control VMs for our test farms which were impacted,
one of which was reliably crashing every 5 minutes or so, and others on
more sporadic intervals up to about a day. In all cases, reverting to
the -108 kernel has resolved the crashes.
Unfortunately, attempts to repro this off our production environment
with a packet trace aren't going quite so well. We're still experimenting.
summary: "placeholder" → "linux 4.15.0-109-generic network DoS regression vs -108"
description: updated
Changed in linux (Ubuntu):
assignee: nobody → Thadeu Lima de Souza Cascardo (cascardo)
information type: Private Security → Public Security
description: updated
description: updated
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Groovy):
status: Invalid → In Progress
Changed in linux (Ubuntu Focal):
status: New → In Progress
Changed in linux (Ubuntu Eoan):
status: New → In Progress
tags: added: patch
Changed in linux (Ubuntu Eoan):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Focal):
status: In Progress → Fix Committed
So, sock_cgroup_data is overloaded: the same word holds either a cgroup pointer or the net_prio prioidx and net_cls classid. There are two small patches for those two subsystems in 4.15.0-109 compared with 4.15.0-108. At first glance they appear harmless, but we have little information on the workload that generates this crash.
I tried manipulating net_prio and net_cls while running a socket with a cgroup BPF program attached on ingress, but had little luck reproducing the crash. It seems the workload is more complicated than that.
It might be worth asking the reporter to try a test kernel with those two commits reverted. In the bionic tree, they are:
5eebba2159d707ae9533a52839e1ba71754c4426
ce9ac4c1036403b0a6d391f7ca3e9313430937c4
Cascardo.