Getting interrupts on isolated cores, which cause significant jitter during low-latency work

Bug #2023391 reported by Paweł Żak
Affects: linux-lowlatency (Ubuntu)
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Summary:
LOC, IWI, RES and CAL interrupts are observed on the isolated cores on which a low-latency benchmark is performed. The interrupts are caused by a simple Go application (printing "Hello world" every 1 second) which runs on different, non-isolated cores. A similar Python application does not cause such problems.

Tested on Ubuntu 22.04.2 LTS (GNU/Linux 5.15.0-68-lowlatency x86_64) (compiled with: "Full Dynticks System (tickless)" and "No Forced Preemption (Server)").
I would like to find out what causes this issue (Go itself? A kernel issue? A lack of proper kernel settings/parameters? Something else?). I am looking for help with hunting down the root cause!

Reason:
To run Go-based applications in environments where low-latency workloads are executed.

Details:

Hardware:
2 x Intel(R) Xeon(R) Gold 6438N (32 cores each)

BIOS:
Hyperthreading disabled

OS and configuration:
Ubuntu 22.04.2 LTS (GNU/Linux 5.15.0-68-lowlatency x86_64) (compiled with: "Full Dynticks System (tickless)" and "No Forced Preemption (Server)" from https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/jammy/log/?h=lowlatency-next)

irqbalance stopped and disabled:
systemctl stop irqbalance.service
systemctl disable irqbalance.service

Based on the workload type, experiments, and knowledge found on the Internet, the following kernel parameters were used:
cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-5.15.0-68-lowlatency root=UUID=5c9c2ea3-e0c6-4dd8-ae70-57e0c0af20d3 ro ro rhgb quiet ipv6.disable=1 audit=0 selinux=0 hugepages=256 hugepagesz=1G intel_iommu=on iommu=pt nmi_watchdog=0 mce=off tsc=reliable nosoftlockup hpet=disable skew_tick=1 acpi_pad.disable=1 nowatchdog nomce numa_balancing=disable irqaffinity=0 rcu_nocb_poll processor.max_cstate=0 clocksource=tsc nosmt nohz=on nohz_full=20-23 rcu_nocbs=20-23 isolcpus=nohz,domain,managed_irq,20-23
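
For reference, on a default Ubuntu install such parameters are typically made persistent through GRUB; a minimal sketch, assuming /etc/default/grub is used (the report does not show how the command line was actually set, and the parameter list is abbreviated here):

# /etc/default/grub - append the isolation parameters to the kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet nohz=on nohz_full=20-23 rcu_nocbs=20-23 isolcpus=nohz,domain,managed_irq,20-23 irqaffinity=0 ..."
# regenerate the GRUB configuration and reboot for the change to take effect
sudo update-grub
sudo reboot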

For every core/socket: Cx states (for x > 0) were disabled, a particular power governor was used, and fixed uncore values were set.
To achieve that, the power.py script from https://github.com/intel/CommsPowerManagement was used. Check "prepare_cpus.sh" for the particular commands and "cpu_prepared.png" for the results.
CPUs 20-23 are "isolated" (thanks to the kernel parameters above) - the benchmark/workload will be run on them.
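
For illustration, a minimal sketch of equivalent raw-sysfs commands for one CPU (the report used power.py instead; CPU20 and the "performance" governor are only example values):

# disable every idle state deeper than the shallowest one for CPU20 (repeat per CPU)
for state in /sys/devices/system/cpu/cpu20/cpuidle/state[1-9]*; do
    echo 1 | sudo tee "$state/disable"
done
# pin the frequency governor for CPU20
echo performance | sudo tee /sys/devices/system/cpu/cpu20/cpufreq/scaling_governor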

cat /sys/devices/virtual/workqueue/cpumask
ffffffff,ff0fffff
(unbound workqueue workers kept off CPUs 20-23)
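
A minimal sketch of how such a mask can be written (the value below clears bits 20-23, matching the output above):

# keep unbound workqueue work off CPUs 20-23
echo ffffffff,ff0fffff | sudo tee /sys/devices/virtual/workqueue/cpumask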

"get_irqs.sh" - script which checks which target CPUs are permitted for a given IRQ sources. "get_irqs_output.txt" contains output of mentioned script.
"lscpu_output.txt" - contains output of 'lscpu' command.

JITTER tool - Baseline
jitter is a benchmarking tool meant for measuring the "jitter" in execution time caused by the OS and/or the underlying architecture.

git clone https://github.com/FDio/archived-pma_tools
cd archived-pma_tools/jitter

Put "run_jitter.sh" script inside above directory.

Run:
make
./run_jitter.sh
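
A minimal sketch of what run_jitter.sh may do, assuming the benchmark is pinned to isolated CPU20 and its output is captured (the jitter tool's own options are omitted; the exact invocation is in the attached script):

#!/bin/bash
# pin the jitter benchmark to isolated CPU20 and save its output
taskset -c 20 ./jitter > jitter_base.txt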

Results:
- "jitter_base.txt" - output from "run_jitter.sh" script
- "jitter_base.png" - chart created from above output

Comment:
The jitter tool reports intervals and jitter in CPU core cycles. The benchmark is run on a 2000 MHz core, where one cycle takes 0.5 ns, so on the graph the values are divided by 2 and presented in nanoseconds.
Very stable results, no significant jitter (max jitter: 51 ns) during 335 seconds.
No interrupts occurred on isolated CPU20 during the benchmark.

JITTER tool - Python
"hello.py" - simple Python app which prints "Hello world" every 1 second
"run_python_hello.sh" - script to run python app on particular (non-isolated) core

python3 --version
Python 3.10.6

In the first console "./run_python_hello.sh" was started; in the second console "./run_jitter.sh" was run.

Results:
- "jitter_python.txt" - output from "run_jitter.sh" script
- "jitter_python.png" - chart created from above output

Comment:
Acceptable results: one noticeable jitter spike (1190 ns); the remaining jitter did not exceed 60 ns during 336 seconds.
No interrupts occurred on isolated CPU20 during the benchmark.

JITTER tool - Golang
"hello.go" - simple Golang app which prints "Hello world" every 1 second
"go.mod" - go module definition
"run_go_hello.sh" - script to run Go app on particular (non-isolated) core

go version
go version go1.20.5 linux/amd64

In the first console the Go app was built ("go build") and started ("./run_go_hello.sh"); in the second console "./run_jitter.sh" was run.

Results:
- "jitter_go.txt" - output from "run_jitter.sh" script
- "jitter_go.png" - chart created from above output

Comment:
34 significant jitter spikes (the worst: 44961 ns) during 335 seconds.
The following interrupts hit isolated CPU20 during the benchmark:
LOC: 67
IWI: 34
RES: 34
CAL: 34

It seems that a jitter spike occurs roughly every ~10 s.

What is also interesting is that no interrupts hit the idle and isolated CPU22 and CPU23 during the benchmark. CPU24 (not isolated) received only LOC interrupts (335283 of them).
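
For reference, per-CPU interrupt deltas like the ones above can be collected by snapshotting /proc/interrupts around a run; a minimal sketch (assumed; not necessarily the method used for the numbers above):

#!/bin/bash
# snapshot interrupt counters before and after the benchmark and compare them
cat /proc/interrupts > irq_before.txt
./run_jitter.sh
cat /proc/interrupts > irq_after.txt
diff irq_before.txt irq_after.txt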

Notes:
1. Instead of static isolation (using kernel parameters) I also tried cpuset with its shield turned on (a sketch of the cset commands appears after these notes). Unfortunately, the results were even worse (the jitter spikes were bigger and more interrupts hit the shielded cores); moreover, cset was not able to move kernel threads out of the shielded pool.
2. I also checked on the Realtime kernel (GNU/Linux 5.15.65-rt49 x86_64 -> https://mirrors.edge.kernel.org/pub/linux/kernel/v5.x/linux-5.15.65.tar.gz patched with https://cdn.kernel.org/pub/linux/kernel/projects/rt/5.15/older/patch-5.15.65-rt49.patch.gz) and the problem with interrupts and jitter caused by the Go app does not exist there. However, the RT kernel is not the best solution for everyone, and it would be great not to have these jitters on the lowlatency tickless kernel either.
3. I also did a lot of experiments with different kernel parameters; this combination seems to be the best (however, maybe I missed something).
4. The same situation occurs with the Go app built using Go 1.19.x and 1.20.2.
5. I'm aware that this kind of benchmark should be executed for hours, but for now these results are pretty meaningful.
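
Regarding note 1, a minimal sketch of the cpuset/cset shielding that was tried (assumed invocation; the exact commands used are not shown in the report):

# shield CPUs 20-23 and ask cset to move kernel threads out of the shield
sudo cset shield --cpu=20-23 --kthread=on
# run the benchmark inside the shield
sudo cset shield --exec -- ./run_jitter.sh
# tear the shield down afterwards
sudo cset shield --reset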

I'm aware of the bug submitted for the realtime kernel, https://bugs.launchpad.net/ubuntu-realtime/+bug/1992164, where https://launchpad.net/~jsalisbury assisted a lot. It helped me tune my parameters, but right now I'm stuck.

Paweł Żak (pawelzak) wrote:

The attached archive contains all files/scripts/results mentioned above.
