Stress Tests failed during Server Certification test

Bug #2069674 reported by Amy Gou
Affects: Stress-ng
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Hi Ubuntu team,

We're testing our 8-socket server with Ubuntu 24.04. The configuration is as follows:

CPU: 8x Intel 8490H (60 Cores)
Mem: 8x 256G
Hard Disk: Total 5TB
NIC: Intel X710_10G

The Stress test case failed. The failing sub-cases are:

stress/cpu_stress_ng_test
stress/memory_stress_ng

Could you please take a look at the logs and help analyze them?

Many thanks.

Amy Gou (goujm1) wrote :
Amy Gou (goujm1) wrote :
Jürgen Gmach (jugmac00)
affects: launchpad → ubuntu
Amy Gou (goujm1) wrote :

Hi Ubuntu team,

Is there any update on this defect?

Many thanks.

Michael Reed (mreed8855) wrote :

It appears that the disk stress test also failed and the system crashed? That crash has been fixed in a later kernel, version 6.8.0-35.35. The version deployed in these results is 6.8.0-31.31. The fix is described in this bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2058557. Once the kernel is updated, run test-storage to see whether the disk stress-ng test passes.

Michael Reed (mreed8855) wrote :

The stress-ng CPU test appears to have timed out? Or did you terminate the test manually?

Amy Gou (goujm1) wrote :

Hi Michael,
The stress-ng CPU test failed with a timeout.

Michael Reed (mreed8855) wrote :

What version of stress-ng are you running?

Install devscripts:
sudo apt install devscripts

Find the version:
rmadison stress-ng | grep noble

Amy Gou (goujm1) wrote :

Hi Michael,

Regarding the stress-ng version: the system is no longer available in that state because it was reused for another OS test, so here is the version from the log for your reference:

name: stress-ng
version: 0.17.08-0~202405302230~ubuntu24.04.1

Michael Reed (mreed8855) wrote :

Thank you Amy for the stress-ng version information. One thing we can try is increasing the swap space to 2x the RAM on the system. There are occasions where this may help. After increasing the swap space, please run test-stress to run the memory and CPU stress-ng tests.
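
A minimal Python sketch of the sizing check suggested above, assuming the 2x rule is applied to the MemTotal and SwapTotal fields reported by /proc/meminfo. The function names and the sample figures are hypothetical; on a real system the text would come from reading /proc/meminfo.

```python
def parse_meminfo(text):
    """Return /proc/meminfo-style fields as a dict of integer kB values."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def swap_shortfall_kb(fields, factor=2):
    """kB of swap still missing to reach factor x MemTotal (0 if already enough)."""
    return max(0, factor * fields["MemTotal"] - fields["SwapTotal"])


# Illustrative numbers only, not taken from the system in this report.
sample = "MemTotal:       8000000 kB\nSwapTotal:      4000000 kB\n"
print(swap_shortfall_kb(parse_meminfo(sample)))  # prints 12000000
```

A result of 0 means the swap space already meets the 2x target.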

Amy Gou (goujm1) wrote :

Hi Michael,

For Disk Stress, the current latest kernel fixes the issue and the test now passes; see attachment "Disk-Test_2024-07-18T05.30.24.485316.html".

For CPU Stress and Mem Stress, we expanded the swap size and the tests are still running. With 60*8=480 physical CPU cores and 256GB*8=2TB of physical memory, the stress tests take longer than before (1-2 days). We will post an update once we have results.

Amy Gou (goujm1) wrote :
Amy Gou (goujm1) wrote :

Hi Michael,

The test case "test-stress" runs stress-ng and has been running for about 20 hours. CPU usage is about 100% and dmesg shows a CPU hard lockup:
ubuntu@SR950V3:/var/log$ cat dmesg | grep -i lockup
[ 17.361583] kernel: watchdog: Watchdog detected hard LOCKUP on cpu 817
[ 17.361583] kernel: ? watchdog_hardlockup_check+0x1cb/0x3b0

Because the system is responding very slowly, I will collect the logs once it becomes more responsive.
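
As a rough Python equivalent of the grep above, the following sketch pulls the CPU numbers out of hard-lockup messages so they can be counted or compared across runs. The function name is hypothetical, and the pattern is based only on the message format shown in this comment.

```python
import re

# Matches the watchdog message format quoted in this report.
LOCKUP_RE = re.compile(r"hard LOCKUP on cpu (\d+)", re.IGNORECASE)


def hard_lockup_cpus(dmesg_text):
    """Return the sorted, de-duplicated CPU numbers named in hard-lockup messages."""
    return sorted({int(m.group(1)) for m in LOCKUP_RE.finditer(dmesg_text)})


sample = (
    "[ 17.361583] kernel: watchdog: Watchdog detected hard LOCKUP on cpu 817\n"
    "[ 17.361583] kernel: ? watchdog_hardlockup_check+0x1cb/0x3b0\n"
)
print(hard_lockup_cpus(sample))  # prints [817]
```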

Amy Gou (goujm1) wrote :

Hi Michael,

With a swap file more than 2 times the memory size, the CPU/MEM stress test still fails with a timeout. See attachment "Stress-SR950V3-0722.zip" for details.

In addition, I ran the CPU and MEM stress tests separately. So far, the CPU stress test has still failed as shown below, and the MEM stress test is still running. I suspect the high core count is the root cause, so I will reduce the core count and check the result once the MEM stress test completes tomorrow:
ubuntu@SR950V3:~$ sudo /usr/lib/checkbox-provider-base/bin/stress_ng_test.py cpu -b 7200
Estimated total run time is 120 minutes

22 Jul 03:52: Running multiple stress-ng stressors in parallel for 7200
seconds...
** stress-ng timed out and was forcefully terminated

retval is 1
**************************************************************
** stress-ng test failed!
**************************************************************
ubuntu@SR950V3:~$ sudo /usr/lib/checkbox-provider-base/bin/stress_ng_test.py memory -b 300 -t 10 -s 0 -k
Minimum swap space is set to 0 GiB
Total memory is 2015.1 GiB
Constant run time is 300 seconds per stressor
Variable run time is 20451 seconds per stressor
Number of NUMA nodes is 8
Estimated total run time is 1473 minutes

22 Jul 07:13: Running stress-ng bsearch stressor for 300 seconds...
stress-ng: info: [1451389] setting to a 5 mins run per stressor
stress-ng: info: [1451389] dispatching hogs: 960 bsearch
stress-ng: warn: [1451389] WARNING! using HPET clocksource (refer to /sys/devices/system/clocksource/clocksource0), this may impact benchmarking performance
stress-ng: info: [1451389] skipped: 0
stress-ng: info: [1451389] passed: 959: bsearch (959)
stress-ng: info: [1451389] failed: 0
stress-ng: info: [1451389] metrics untrustworthy: 0
stress-ng: info: [1451389] successful run completed in 5 mins
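
The logged figures above are consistent with a simple rule: variable run time = base time + per-GiB time x total memory. The Python sketch below checks that arithmetic. It is inferred from the numbers in this log, not taken from the checkbox source; the function name and the reading of "-t 10" as seconds per GiB are assumptions.

```python
def variable_run_time(base_time_s, time_per_gib_s, total_mem_gib):
    """Per-stressor run time in seconds that scales with installed memory."""
    return round(base_time_s + time_per_gib_s * total_mem_gib)


# -b 300, -t 10, and "Total memory is 2015.1 GiB" from the log above.
print(variable_run_time(300, 10, 2015.1))  # prints 20451
```

If this reading is right, it explains why the memory test takes so long on a 2TB system: each memory-scaled stressor runs for more than 5.5 hours.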

Amy Gou (goujm1) wrote :
Amy Gou (goujm1) wrote :

Latest status update for CPU Stress:
1. With the CPU count reduced to 480 (from 60 cores to 30 cores per socket), the CPU stress test passes.
2. With 960 logical CPUs and an extended execution time, the test still fails at both 4 hours and 6 hours.
3. With 960 logical CPUs, running stress-ng directly with the command below shows that the --af-alg stressor fails:
sudo stress-ng --aggressive --verify --timeout 7200 --metrics-brief --tz --times --verbose --af-alg 960 --bsearch 960 --context 960 --cpu 960 --crypt 960 --hsearch 960 --longjmp 960 --lsearch 960 --matrix 960 --qsort 960 --str 960 --stream 960 --tsearch 960 --vecmath 960 --wcs 960 >StresssngTest.log
4. With 960 logical CPUs, I am now investigating the --af-alg stressor individually.
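
For repeating these runs with different instance counts or timeouts, the command in item 3 can be assembled programmatically. This is a minimal Python sketch, not the checkbox script itself; the function name and the constant are hypothetical, while the flags and stressor names are copied from the command above.

```python
import shlex

# Stressor names taken from the command in item 3.
CPU_STRESSORS = [
    "af-alg", "bsearch", "context", "cpu", "crypt", "hsearch", "longjmp",
    "lsearch", "matrix", "qsort", "str", "stream", "tsearch", "vecmath", "wcs",
]


def build_command(instances, timeout_s, stressors=CPU_STRESSORS):
    """Return the stress-ng argument list with the same instance count per stressor."""
    cmd = [
        "stress-ng", "--aggressive", "--verify", "--timeout", str(timeout_s),
        "--metrics-brief", "--tz", "--times", "--verbose",
    ]
    for name in stressors:
        cmd += ["--" + name, str(instances)]
    return cmd


print(shlex.join(build_command(960, 7200)))
```

Passing a shorter list, for example build_command(960, 7200, ["af-alg"]), isolates a single stressor.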

Amy Gou (goujm1) wrote :

Here is an update on the test status; we need help to move forward:
For CPU Stress with 960 logical CPUs, the af-alg stressor fails with the error below when multiple stressors run together. See attachment "CPUStress-MultipleStressor-Fail.zip" for reference.
stress-ng: fail: [256886] af-alg: xts(aes): read failed: errno=22 (Invalid argument)
stress-ng: debug: [256886] af-alg: [256886] exited (instance 32 on CPU 263)
stress-ng: info: [256780] for a 9620.77s run time:
stress-ng: info: [256780] 9235939.57s available CPU time
stress-ng: info: [256780] 5689357.95s user time ( 61.60%)
stress-ng: info: [256780] 3386150.07s system time ( 36.66%)
stress-ng: info: [256780] 9075508.02s total time ( 98.26%)
stress-ng: info: [256780] load average: 227.75 897.78 3301.64
stress-ng: info: [256780] skipped: 0
stress-ng: info: [256780] passed: 13994: af-alg (554) bsearch (960) context (960) cpu (960) crypt (960) hsearch (960) longjmp (960) lsearch (960) matrix (960) qsort (960) str (960) stream (960) tsearch (960) vecmath (960) wcs (960)
stress-ng: info: [256780] failed: 405: af-alg (405)
stress-ng: info: [256780] metrics untrustworthy: 0
stress-ng: info: [256780] unsuccessful run completed in 2 hours, 40 mins, 20.77 secs
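
To compare runs without reading the summary by eye, the per-stressor pass and fail counts can be extracted from the log. The Python sketch below assumes the summary line format shown above; the function name is hypothetical.

```python
import re

SUMMARY_RE = re.compile(r"\b(passed|failed):\s+\d+:\s+(.*)$")
COUNT_RE = re.compile(r"([\w-]+)\s+\((\d+)\)")


def parse_summary(log_text):
    """Return {'passed': {stressor: n}, 'failed': {stressor: n}} from a stress-ng log."""
    result = {"passed": {}, "failed": {}}
    for line in log_text.splitlines():
        m = SUMMARY_RE.search(line)
        if m:
            for name, count in COUNT_RE.findall(m.group(2)):
                result[m.group(1)][name] = int(count)
    return result


log = """\
stress-ng: fail: [256886] af-alg: xts(aes): read failed: errno=22 (Invalid argument)
stress-ng: info: [256780] passed: 13994: af-alg (554) bsearch (960) context (960) cpu (960) crypt (960) hsearch (960) longjmp (960) lsearch (960) matrix (960) qsort (960) str (960) stream (960) tsearch (960) vecmath (960) wcs (960)
stress-ng: info: [256780] failed: 405: af-alg (405)
"""
summary = parse_summary(log)
print(summary["failed"])  # prints {'af-alg': 405}
```

In this run only af-alg appears under "failed"; every other stressor passed on all 960 instances.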

Amy Gou (goujm1) wrote :
Jeff Lane  (bladernr)
affects: ubuntu → stress-ng
Jeff Lane  (bladernr) wrote :

Thank you Amy.

Unfortunately, we do not have anything that is 8-socket to try to recreate this on. The biggest machine I have is an SR850p 4-socket.

Questions:
1: for ease of re-testing/debugging, does this still happen if you reduce the --timeout from 7200 to 1800 or 3600 (cutting the re-run time down just to make debugging faster/easier)?
2: the actual test runs using 0 as the arg for each of those stressors (e.g. --af-alg 0), I see that you've changed these to 960... when you say "with reduced CPU 480" do you mean you're changing the CPU itself out, or are changing the command from, for example, `--af-alg 960` to `--af-alg 480` but leaving the physical CPUs the same?
3: does this fail if you run the stress-ng command separately, but ONLY run --af-alg (removing bsearch, context, cpu, crypt, etc)?
4: Remind me how much RAM the machine has, and how much swap space is being used? do you see memory consumption go way up during this test, and see it hitting a lot of swap space?
5: af-alg is a crypto algorithm stressor... can you send the output of this: `sudo stress-ng --af-alg-dump`?
6: This was not an issue on previous tests for this system (the older generation), IIRC this was successfully tested then, so is this perhaps only with the latest CPU family?

I moved this to stress-ng, rather than the generic Ubuntu target.

Amy Gou (goujm1) wrote :

Hi Jeff,

Here is the feedback:
1. With --timeout 3600, af-alg still failed; see attachment "test0729-3600.log".
stress-ng: info: [2452247] skipped: 0
stress-ng: info: [2452247] passed: 13453: af-alg (13) bsearch (960) context (960) cpu (960) crypt (960) hsearch (960) longjmp (960) lsearch (960) matrix (960) qsort (960) str (960) stream (960) tsearch (960) vecmath (960) wcs (960)
stress-ng: info: [2452247] failed: 946: af-alg (946)
stress-ng: info: [2452247] metrics untrustworthy: 0
stress-ng: info: [2452247] unsuccessful run completed in 1 hour, 23 mins, 50.90 secs.

2. After going through the script, I found that the value is the thread count, hence the 960 in my command. Also, after reducing the CPU cores in the BIOS to bring the total thread count from 960 down to 480, the CPU stress test passes.
Script reference: stressor_list = stressor_list + " {}".format(self.thread_count)

3. No. Running stress-ng with only the --af-alg stressor passed 2 out of 2 times; see attachment "test-1af-alg-pass.log" in CPUStress-MultipleStressor-Fail.zip.
#stress-ng --verify --timeout 7200 --metrics-brief --tz --times --verbose --af-alg 960

4. The swap size was 7GB for the earlier failed result. With swap increased to 4095GB (more than 2x the memory size of 2015GB), the MEM stress test passes in 36 hours, but the CPU stress test still fails with a timeout when using the test script "test-stress".
BTW, RAM total: 2.0T, RAM free: 1.9T, swap free: 4.0T.

5. Here is the output:
ubuntu@SR950V3:~$ sudo stress-ng --af-alg-dump
stress-ng: info: [2524177] defaulting to a 1 day run per stressor
stress-ng: error: [2524177] No stress workers invoked

6. It is an Intel Eagle Stream SPR processor, "Intel(R) Xeon(R) Platinum 8490H 60 Core 3.5GHz".

In addition, I have attached the following for reference:
Test-Stress-CPUFail-MEM-Pass.zip: the html log from test-stress and the dmesg log. CPU Stress Fail, and MEM Stress Pass.
Command-Stress-ng-CPU-af-alg-3600-Fail.zip: stress-ng command test with only the af-alg stressor, Fail.

Amy Gou (goujm1) wrote :
Colin Ian King (colin-king) wrote :

The af-alg stressor will force the kernel to load crypto algorithms, so with so many instances it may be that there is module loading contention occurring when loading in so many crypto modules with all that concurrency.

By the way, to dump the crypto af-alg info one should use:

stress-ng --af-alg 1 --af-alg-dump -t 1
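
Independently of stress-ng, the kernel lists its currently registered crypto algorithms in /proc/crypto, including the module that provides each one. Comparing that list before and after a run is one way to see which crypto modules were loaded. The Python sketch below parses that format; the function names are hypothetical and the sample entries are illustrative, not taken from the system in this report.

```python
def parse_proc_crypto(text):
    """Split /proc/crypto-style text into one dict per algorithm entry."""
    entries = []
    for block in text.strip().split("\n\n"):
        entry = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                entry[key.strip()] = value.strip()
        if entry:
            entries.append(entry)
    return entries


def loadable_modules(entries):
    """Names of modules backing the entries, excluding built-in ('kernel') ones."""
    return sorted({e["module"] for e in entries if e.get("module", "kernel") != "kernel"})


# Illustrative sample in the /proc/crypto layout.
sample = """\
name         : xts(aes)
driver       : xts-aes-aesni
module       : aesni_intel
type         : skcipher

name         : sha256
driver       : sha256-generic
module       : kernel
type         : shash
"""
entries = parse_proc_crypto(sample)
print(loadable_modules(entries))  # prints ['aesni_intel']
```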

Jeff Lane  (bladernr) wrote :

Thanks Colin. I wonder if this is one that needs to be run on all cores (it is just doing that by default) or if there's some other way this should be run that's more appropriate?

Amy Gou (goujm1) wrote :

Here is the dump file from the updated command "stress-ng --af-alg 1 --af-alg-dump -t 1".

Jeff Lane  (bladernr) wrote :

@Colin - if this is just a case of too many instances, perhaps we should cut how aggressive this one is? Do you think this is more a problem with having so many cores (8 way system) or more about just being too aggressive with the test case?

Amy Gou (goujm1) wrote :

Hi all,

Is there any update on this?

Amy Gou (goujm1) wrote :

Here is the attachment for the stress-ng CPU test passing with 480 cores.
