ubuntu_ltp_* tests unable to finish properly with B-azure-fips

Bug #2076241 reported by Portia Stephens
This bug affects 1 person
Affects: ubuntu-kernel-tests
Status: Fix Released
Importance: Undecided
Assigned to: Po-Hsu Lin
Milestone: (none)

Bug Description

In sru-s20240429 and sru-s20240610, the ubuntu_ltp_* tests were unable to finish properly with the B-azure-fips kernel, which eventually triggered the `sut-test` failure on them.

Here is the result from sru-s20240429:
* ubuntu_ltp
  - report cuts off at fs:fs_fill test, failed on Standard_D4_v4 only.
* ubuntu_ltp_controllers
  - report cuts off at memcg_test_3 test, failed on Standard_B1ms
  - report cuts off at memcg_stress test, failed on Standard_D4_v4, Standard_D4s_v3-gen2
* ubuntu_ltp_cve
  - report cuts off at cve-2016-8655 test, failed on Standard_B1ms, Standard_D4_v4
  - report cuts off at cve-2018-18559 test, failed on Standard_D4s_v3-gen2
* ubuntu_ltp_syscalls
  - report cuts off at setsockopt06 test, failed on Standard_B1ms
  - report cuts off at bind06 test, failed on Standard_D4_v4, Standard_D4s_v3-gen2

The result from sru-s20240610 is quite similar; the only difference is that ubuntu_ltp_cve cuts off at the cve-2016-8655 test on Standard_D4s_v3-gen2 this time.

Note that cve-2016-8655 is actually the setsockopt06 test, and cve-2018-18559 is the bind06 test.

I have done some experiments on Standard_D4s_v3-gen2 with the kernel in sru-s20240610 (4.15.0-2088-azure-fips):
* ubuntu_ltp_controllers:
  - If we skip the memcg_stress test, it is able to finish properly.
* ubuntu_ltp_cve:
  - If we skip the cve-2016-8655 and cve-2018-18559 tests, it is able to finish properly.
* ubuntu_ltp_syscalls:
  - If we skip the bind06 and writev03 tests, it is able to finish properly (setsockopt06 works fine in this case, not sure why).

Here is the change used to skip specific tests:
diff --git a/ubuntu_ltp_syscalls/control b/ubuntu_ltp_syscalls/control
index 4f93c546..684a8ed2 100644
--- a/ubuntu_ltp_syscalls/control
+++ b/ubuntu_ltp_syscalls/control
@@ -24,6 +24,9 @@ if result == 'GOOD':
                 # Special case for msgstress04 (lp:1943802 / lp:1943652)
                 if testcase == 'msgstress04':
                     timeout_threshold = 60*60
+                if testcase in ['bind06', 'writev03'] and platform.release() == '4.15.0-2088-azure-fips':
+                    print('skipping {} for testing purposes'.format(testcase))
+                    continue
                 job.run_test_detail(NAME, test_name=testcase, tag=testcase, timeout=timeout_threshold)
 else:
     print("ERROR: test failed to build, skipping all the sub tests")

In my manual test on Standard_D4_v4 with 4.15.0-2088-azure-fips, I noticed that an idle SSH session hangs after a certain period (I recorded one at about 7m21s). If the session is running something, like htop, it is fine.
The setsockopt06 and bind06 tests can also pass without any immediate crash, so I am not sure what causes the failure we see here.

It's also worth noting that "running something" seems to be limited to commands that keep generating output. Commands like "dmesg -w" and "tail -f /var/log/syslog" will hang too if there is no new output.

According to Magali, the last Bionic FIPS openssh update was in January, so this might be something else in the kernel.

== Original bug report ==
On azure-fips platforms, multiple tests in ubuntu_ltp, ubuntu_ltp_controllers, ubuntu_ltp_cve, and ubuntu_ltp_syscalls cause the system to become unresponsive. When run locally, the tests run to completion, but the system hangs some time after.

Po-Hsu Lin (cypressyew)
tags: added: azure bionic fips sru-s20240429 sru-s20240610 ubuntu-ltp ubuntu-ltp-controllers ubuntu-ltp-cve ubuntu-ltp-syscalls
summary: - ubuntu_ltp_* tests completing but causes system to hang
+ ubuntu_ltp_* tests unable to finish properly with B-azure-fips
Po-Hsu Lin (cypressyew) wrote :

Even this simple command, run from the host against the target system, fails:

$ time ssh $USER@$SUT_IP "date; sleep 300 ; date"
Thu Aug 8 15:49:46 UTC 2024
^C
real 7m50.194s
user 0m0.005s
sys 0m0.018s

260 seconds does not work either.
$ time sutssh 52.175.206.46 "date; sleep 260 ; date"
Thu Aug 8 16:06:48 UTC 2024
^C
real 10m8.938s
user 0m0.017s
sys 0m0.000s

250 seconds works.
$ time ssh $USER@$SUT_IP "date; sleep 250 ; date"
Thu Aug 8 16:10:55 UTC 2024
Thu Aug 8 16:15:05 UTC 2024

real 4m11.454s
user 0m0.020s
sys 0m0.002s

Now I wonder whether these test cases are really failing, or whether they simply run for too long without producing any output.
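
The 250-second and 260-second runs above bracket the idle limit, so it could be narrowed down with a bisection. The following is only a sketch: `ssh_survives_idle` and `longest_surviving_idle` are hypothetical helper names, the 60-second grace period is an arbitrary choice, and the search assumes that any idle time longer than a hanging one also hangs.

```python
import subprocess
from typing import Callable


def ssh_survives_idle(host: str, idle_seconds: int, grace: int = 60) -> bool:
    """Run `date; sleep N; date` over SSH and report whether it returned in time.

    `host` is whatever you would pass to ssh (for example user@address).
    The grace period is an assumption, not a value taken from the report.
    """
    cmd = ["ssh", host, "date; sleep {}; date".format(idle_seconds)]
    try:
        done = subprocess.run(cmd, capture_output=True,
                              timeout=idle_seconds + grace)
    except subprocess.TimeoutExpired:
        return False  # the session hung past the allowed time
    return done.returncode == 0


def longest_surviving_idle(survives: Callable[[int], bool],
                           works: int, hangs: int) -> int:
    """Bisect between a known-good and a known-bad idle time (in seconds)."""
    while hangs - works > 1:
        mid = (works + hangs) // 2
        if survives(mid):
            works = mid   # mid still works, search above it
        else:
            hangs = mid   # mid hangs, search below it
    return works
```

With the numbers observed here, a call such as `longest_surviving_idle(lambda n: ssh_survives_idle("user@sut", n), 250, 260)` would need at most four SSH probes; `"user@sut"` is a placeholder target.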

Magali Lemes do Sacramento (magalilemes) wrote :

Thank you, @cypressyew, for investigating (yet) another FIPS issue!
As for the command `time ssh $USER@$SUT_IP "date; sleep 300 ; date"`, which instance was used?
Also, is this command run after the tests have run or is the command run on a brand new untouched Bionic Azure FIPS instance without the tests?

Po-Hsu Lin (cypressyew) wrote :

Hi Magali,
It was tested on an Azure Standard_D4_v4 instance with 4.15.0-2088-azure-fips. The issue can be reproduced even on a freshly deployed Bionic Azure FIPS instance without running any other tests.

To rule out potential firewall and network issues, I verified this from two different endpoints: our server azure@obruchev and my own computer. SSH connections hang in both scenarios.

The oldest earlier kernel I could get is 4.15.0-2074-azure-fips. It is a bit old, but I can reproduce this issue there with the sleep command too.

Po-Hsu Lin (cypressyew)
Changed in ubuntu-kernel-tests:
assignee: nobody → Po-Hsu Lin (cypressyew)
status: New → In Progress
Po-Hsu Lin (cypressyew) wrote :

Investigation shows this is caused by the command we use to install the openssh-server package from fipsdevppa:
  DEBIAN_FRONTEND=noninteractive UCF_FORCE_CONFFNEW=1 apt-get install --yes --allow-downgrades libssl1.1 libssl1.1-hmac openssh-server openssh-server-hmac libssl-dev

This overwrites the existing config file with the one from the package, so the "ClientAliveInterval 120" setting in /etc/ssh/sshd_config (shipped with our cloud image) is lost, which causes the session timeout issue seen here.

Newer releases (Focal+) are not affected, as this setting is written into a file under /etc/ssh/sshd_config.d/.
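
A quick way to confirm whether the setting survived a package install is to scan the sshd configuration text for it. This is a minimal sketch, not the fix that landed in CKCT: `client_alive_interval` is a hypothetical helper, it only handles plain integer values in the global section (it stops at the first `Match` line and does not follow `Include` directives), and the sample config strings are made up for illustration.

```python
from typing import Optional


def client_alive_interval(config_text: str) -> Optional[int]:
    """Return the first global ClientAliveInterval value, or None if unset."""
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        # Keyword and argument are separated by whitespace or a single '='.
        parts = line.replace("=", " ", 1).split()
        if not parts:
            continue
        keyword = parts[0].lower()  # sshd keywords are case-insensitive
        if keyword == "match":
            break  # settings after a Match line are conditional
        if keyword == "clientaliveinterval" and len(parts) > 1:
            try:
                return int(parts[1])
            except ValueError:
                return None  # non-integer value, not handled by this sketch
    return None


# Illustrative inputs only, not copied from a real image.
packaged = "#ClientAliveInterval 0\nPasswordAuthentication no\n"
cloud_image = "PasswordAuthentication no\nClientAliveInterval 120\n"
print(client_alive_interval(packaged))     # None
print(client_alive_interval(cloud_image))  # 120
```

On a real system the text would come from reading /etc/ssh/sshd_config; a result of None after the install command above would match the behaviour described in this comment.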

Po-Hsu Lin (cypressyew) wrote :

The fix has landed in CKCT, so this can be closed now. Thanks!

Changed in ubuntu-kernel-tests:
status: In Progress → Fix Released