Getting interrupts on isolated cores, which causes significant jitter during low-latency work
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux-lowlatency (Ubuntu) | New | Undecided | Unassigned |
Bug Description
Summary:
LOC, IWI, RES and CAL interrupts are observed on isolated cores on which the low-latency benchmark is performed. The interrupts are caused by a simple Go application (printing "Hello world" every 1 second) which runs on different, non-isolated cores. A similar Python application doesn't cause such problems.
Tested on Ubuntu 22.04.2 LTS (GNU/Linux 5.15.0-
I would like to find out what causes this issue (Go itself? A kernel issue? Lack of proper kernel settings/
Reason:
To run Go-based applications in environments where low-latency workloads are executed.
Details:
Hardware:
2 x Intel(R) Xeon(R) Gold 6438N (32 cores each)
BIOS:
Hyperthreading disabled
OS and configuration:
Ubuntu 22.04.2 LTS (GNU/Linux 5.15.0-
irqbalance stopped and disabled:
systemctl stop irqbalance.service
systemctl disable irqbalance.service
Based on the workload type, experiments, and knowledge found on the Internet, the following kernel parameters were used:
cat /proc/cmdline
BOOT_IMAGE=
For every core/socket: Cx states (for x > 0) were disabled, a particular power governor was used, and fixed uncore values were set.
To achieve that, the power.py script from https:/
CPUs 20-23 are "isolated" (thanks to the proper kernel parameters) - the benchmark/workload will be run on them.
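For illustration only (the exact cmdline used is the truncated one shown above and is included in the attached archive; the values below are an assumption, not necessarily the ones applied here), isolating CPUs 20-23 is typically achieved with parameters along these lines:
isolcpus=20-23 nohz_full=20-23 rcu_nocbs=20-23 irqaffinity=0-19,24-63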
cat /sys/devices/
ffffffff,ff0fffff
(kernel threads moved away from CPUs 20-23)
"get_irqs.sh" - script which checks which target CPUs are permitted for a given IRQ sources. "get_irqs_
"lscpu_output.txt" - contains output of 'lscpu' command.
JITTER tool - Baseline
jitter is a benchmarking tool meant for measuring the "jitter" in execution time caused by the OS and/or the underlying architecture.
git clone https:/
cd archived-
Put "run_jitter.sh" script inside above directory.
Run:
make
./run_jitter.sh
Results:
- "jitter_base.txt" - output from "run_jitter.sh" script
- "jitter_base.png" - chart created from above output
Comment:
The jitter tool reports intervals and jitter in CPU core cycles. The benchmark is done on a 2000 MHz core, so on the graph the values are divided by 2 and presented in nanoseconds (at 2 GHz one cycle takes 0.5 ns, so e.g. 100 cycles correspond to 50 ns).
Very stable results, no significant jitters (max jitter: 51 ns) during 335 seconds.
No interrupts were raised on isolated CPU20 during the benchmark.
JITTER tool - Python
"hello.py" - simple Python app which prints "Hello world" every 1 second
"run_python_
python3 --version
Python 3.10.6
In first console "./run_
Results:
- "jitter_python.txt" - output from "run_jitter.sh" script
- "jitter_python.png" - chart created from above output
Comment:
An acceptable result: one noticeable jitter (1190 ns); the remaining jitters did not exceed 60 ns during 336 seconds.
No interrupts were raised on isolated CPU20 during the benchmark.
JITTER tool - Golang
"hello.go" - simple Golang app which prints "Hello world" every 1 second
"go.mod" - go module definition
"run_go_hello.sh" - script to run Go app on particular (non-isolated) core
go version
go version go1.20.5 linux/amd64
In the first console the Go app was built ("go build") and started: "./run_
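For reference, a minimal sketch of what such a "Hello world every 1 second" app looks like (the actual hello.go and go.mod used for the test are in the attached archive):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Print "Hello world" once per second, forever.
	for {
		fmt.Println("Hello world")
		time.Sleep(1 * time.Second)
	}
}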
Results:
- "jitter_go.txt" - output from "run_jitter.sh" script
- "jitter_go.png" - chart created from above output
Comment:
34 significant jitters (the worst: 44961 ns) during 335 seconds.
The following interrupts were raised on isolated CPU20 during the benchmark:
LOC: 67
IWI: 34
RES: 34
It seems that a jitter occurs roughly every ~10 s.
What is also interesting is that for the idle and isolated CPU22 and CPU23 no interrupts were raised during the benchmark. On CPU24 (not isolated) only LOC interrupts were raised (335283 of them).
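For completeness, a minimal Go sketch (an assumption about methodology, not the exact tooling used here) of how such per-CPU counters can be read from /proc/interrupts for a single CPU, e.g. the isolated CPU20:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	const cpu = "CPU20" // CPU column of interest (assumption: isolated core 20)

	f, err := os.Open("/proc/interrupts")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)

	// The first line is the header listing the CPU columns; find our CPU's index.
	if !scanner.Scan() {
		return
	}
	col := -1
	for i, name := range strings.Fields(scanner.Text()) {
		if name == cpu {
			col = i
		}
	}
	if col < 0 {
		fmt.Fprintf(os.Stderr, "%s not found in /proc/interrupts header\n", cpu)
		os.Exit(1)
	}

	// Each remaining line is "<label>: <per-CPU counts>... <description>".
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) < col+2 {
			continue // skip rows without per-CPU counters (e.g. ERR/MIS)
		}
		label := strings.TrimSuffix(fields[0], ":")
		fmt.Printf("%s %s: %s\n", cpu, label, fields[col+1])
	}
}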
Notes:
1. Instead of static isolation (using kernel parameters), I also tried cpuset with its shield turned on. Unfortunately, the results were even worse (jitters were larger and more interrupts were raised on the shielded cores); moreover, cset was not able to move kernel threads out of the shielded pool.
2. I also checked it on the realtime kernel (GNU/Linux 5.15.65-rt49 x86_64 -> https:/
3. I also did a lot of experiments with different kernel parameters; it seems that this combination was the best (however, maybe I missed something).
4. The same situation occurs with the Go app built using Go 1.19.x and 1.20.2.
5. I'm aware that this kind of benchmark should be executed for hours, but for now these results are pretty meaningful.
I'm aware of this bug submitted for the realtime kernel: https:/
The attached archive contains all files/scripts/results mentioned above.