tripleo-ansible "tripleo_compute_node" molecule job failing on "tripleo_iscsid" task

Bug #2008822 reported by Douglas Viroel
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
tripleo
Won't Fix
Critical
Unassigned

Bug Description

tripleo-ansible-centos-stream-molecule-tripleo_compute_node job is failing since 2023-02-17:
https://zuul.opendev.org/t/openstack/builds?job_name=tripleo-ansible-centos-stream-molecule-tripleo_compute_node&project=openstack/tripleo-ansible

with the following error:

TASK [tripleo_iscsid : Create fcontext entry for iscsi] ************************
failed: [tripleo_compute_node] (item={'path': '/etc/iscsi', 'setype': 'container_file_t'}) => changed=false
  ansible_loop_var: item
  item:
    path: /etc/iscsi
    setype: container_file_t
  msg: SELinux is disabled on this host.
failed: [tripleo_compute_node] (item={'path': '/etc/target', 'setype': 'container_file_t'}) => changed=false
  ansible_loop_var: item
  item:
    path: /etc/target
    setype: container_file_t
  msg: SELinux is disabled on this host.
failed: [tripleo_compute_node] (item={'path': '/var/lib/iscsi', 'setype': 'container_file_t'}) => changed=false
  ansible_loop_var: item
  item:
    path: /var/lib/iscsi
    setype: container_file_t
  msg: SELinux is disabled on this host.

PLAY RECAP *********************************************************************
tripleo_compute_node : ok=106 changed=62 unreachable=0 failed=1 skipped=26 rescued=0 ignored=0

https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_815/874630/1/check/tripleo-ansible-centos-stream-molecule-tripleo_compute_node/815d07e/reports.html

Revision history for this message
Marios Andreou (marios-b) wrote :

This is likely related to https://review.opendev.org/c/openstack/tripleo-ansible/+/860611 which merged on Feb 16th

Looks like the job in question tripleo-ansible-centos-stream-molecule-tripleo_compute_node did not run in that patch so we didn't catch the problem there

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-ansible (master)
Changed in tripleo:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-ansible (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-ansible/+/875858
Committed: https://opendev.org/openstack/tripleo-ansible/commit/0eb375990192df42567e718d95659bfdb01c474d
Submitter: "Zuul (22348)"
Branch: master

commit 0eb375990192df42567e718d95659bfdb01c474d
Author: Brendan Shephard <email address hidden>
Date: Wed Mar 1 20:10:20 2023 +1000

    Only set selinux context when enabled

    If selinux is disabled, we don't need to, and can't set the fcontext on
    the list of files. This change adds a conditional statement to this task
    based on the status of selinux.

    Closes-Bug: #2008822

    Change-Id: Id0135cc2456b1716e3bedace16ca4b9cee89d7b4

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
Douglas Viroel (dviroel) wrote :

'molecule-tripleo_compute_node' job still failing[1], now in a different task:

TASK [tripleo_container_manage : Manage container systemd services and cleanup old systemd healthchecks for /var/lib/tripleo-config/container-startup-config/iscsid] ***
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: NoneType: None
fatal: [tripleo_compute_node]: FAILED! => changed=false
  msg: Service iscsid has not started yet

@bshephard, can you take a look again?
Can you also include this role to trigger this specific molecule job?[2]

[1] https://zuul.opendev.org/t/openstack/builds?job_name=tripleo-ansible-centos-stream-molecule-tripleo_compute_node&project=openstack/tripleo-ansible
[2] https://github.com/openstack/tripleo-ansible/blob/0eb375990192df42567e718d95659bfdb01c474d/zuul.d/molecule.yaml#L402-L405

Douglas Viroel (dviroel)
Changed in tripleo:
status: Fix Released → Triaged
Revision history for this message
Brendan Shephard (bshephar) wrote :

I haven't been able to reproduce this issue yet, in my local testing:

So, running the molecule job for the tripleo_compute_node role locally like:

    6 python -m virtualenv --system-site-packages "${HOME}/test-python"
    7 ${HOME}/test-python/bin/pip install -r requirements.txt -r test-requirements.txt -r molecule-requirements.txt
    8 source ${HOME}/test-python/bin/activate
    9 source ansible-test-env.rc
   10 cd tripleo_ansible/roles/tripleo_compute_node/
   11 molecule --base-config ../../../.config/molecule/config.yml test --all

That task seems to work fine:
TASK [tripleo_container_manage : Manage container systemd services and cleanup old systemd healthchecks for /var/lib/tripleo-config/container-startup-config/iscsid] ***
ok: [tripleo_compute_node]

So, I'm not too sure what is breaking it in the CI environment just yet.

Revision history for this message
Brendan Shephard (bshephar) wrote :

I thought it might have been because I had run the molecule job as root, but a re-run as a non-root user yields the same result. No errors during the execution.

So I'm still unsure. Where is this job defined? Can we add -vvv to the ansible-args defined in the job here?

. /home/zuul/test-python/bin/activate
. /home/zuul/src/opendev.org/openstack/tripleo-ansible/ansible-test-env.rc
pytest --color=no --html=/home/zuul/zuul-output/logs/reports.html --self-contained-html --ansible-args='' /home/zuul/src/opendev.org/openstack/tripleo-ansible/tests/test_molecule.py

So changing it to:
. /home/zuul/test-python/bin/activate
. /home/zuul/src/opendev.org/openstack/tripleo-ansible/ansible-test-env.rc
pytest --color=no --html=/home/zuul/zuul-output/logs/reports.html --self-contained-html --ansible-args='-vvv' /home/zuul/src/opendev.org/openstack/tripleo-ansible/tests/test_molecule.py

Revision history for this message
Brendan Shephard (bshephar) wrote :

Ah, right. I found it. Ok, hopefully this will give us the verbose output and some clues:
https://review.opendev.org/c/openstack/tripleo-ansible/+/876259

Revision history for this message
Brendan Shephard (bshephar) wrote :

Frustratingly, it works fine in that job:

[fedora@fedora ~]$ grep "tripleo_container_manage : Manage container systemd services and cleanup old systemd healthchecks for /var/lib/tripleo-config/container-startup-config/iscsid" ansible.log -A5
2023-03-03 09:49:42,127 p=38208 u=zuul n=ansible | TASK [tripleo_container_manage : Manage container systemd services and cleanup old systemd healthchecks for /var/lib/tripleo-config/container-startup-config/iscsid] ***
2023-03-03 09:49:42,135 p=38208 u=zuul n=ansible | host: tripleo_compute_node, task: TASK: tripleo_container_manage : meta

Maybe we should add the -vvv to one of the failing reviews instead and see if the error shows up there.

Revision history for this message
Roberto Alfieri (ralfieri) wrote :

Hey there, probably I found the problem. It seems a YAML indention issue in the container definition template file.

Analyzing the logs, the error returned by container_systemd module was `The error was: NoneType: None`

When the definition file was rendered, the healthcheck for iscsid service is incorrectly defined as none, even if a "test" is defined.

Hopefully this [1] should solve the problem.

[1] https://review.opendev.org/c/openstack/tripleo-ansible/+/876580

Changed in tripleo:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/tripleo-ansible/+/876580
Committed: https://opendev.org/openstack/tripleo-ansible/commit/539caf1a78c4f826ababf62e3120146a445bffaa
Submitter: "Zuul (22348)"
Branch: master

commit 539caf1a78c4f826ababf62e3120146a445bffaa
Author: Roberto Alfieri <email address hidden>
Date: Mon Mar 6 14:31:19 2023 +0100

    fixed indentation in iscsid container definition template file

    Signed-off-by: Roberto Alfieri <email address hidden>
    Closes-bug: #2008822
    Change-Id: I0848e207d2fb35ae258f7e5c22da8c3768628e13

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
Marios Andreou (marios-b) wrote :

thanks @Manoj for head's up - this seems to be back so moving the card back to triaged

see job history @ [1] so we have fails after the indentation fix from comment #10

tripleo-ansible-centos-stream-molecule-tripleo_compute_node openstack/tripleo-ansible master periodic-weekly 40 mins 1 sec 2023-03-11 08:02:28 FAILURE
tripleo-ansible-centos-stream-molecule-tripleo_compute_node openstack/tripleo-ansible master check 876589,2
38 mins 59 secs 2023-03-09 16:58:42 FAILURE
tripleo-ansible-centos-stream-molecule-tripleo_compute_node openstack/tripleo-ansible master check 874630,6
34 mins 34 secs 2023-03-09 13:40:10 FAILURE

eg one of those at [2] looks like:

2023-03-09 14:14:17.107124 | centos-9-stream | TASK [tripleo_container_manage : Manage container systemd services and cleanup old systemd healthchecks for /var/lib/tripleo-config/container-startup-config/iscsid] ***
2023-03-09 14:14:17.107128 | centos-9-stream | An exception occurred during task execution. To see the full traceback, use -vvv. The error was: NoneType: None
2023-03-09 14:14:17.107132 | centos-9-stream | fatal: [tripleo_compute_node]: FAILED! => changed=false
2023-03-09 14:14:17.107144 | centos-9-stream | msg: Service iscsid has not started yet

[1] https://zuul.opendev.org/t/openstack/builds?job_name=tripleo-ansible-centos-stream-molecule-tripleo_compute_node&project=openstack/tripleo-ansible

[2] https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_c9c/874630/6/check/tripleo-ansible-centos-stream-molecule-tripleo_compute_node/c9c15aa/job-output.txt

Changed in tripleo:
status: Fix Released → Triaged
Changed in tripleo:
status: Triaged → In Progress
Revision history for this message
Roberto Alfieri (ralfieri) wrote (last edit ):

I was able to reproduce the issue locally using the same playbooks as zuul:

`ANSIBLE_ROLES_PATH=zuul-jobs/roles ansible-playbook tripleo-ansible/zuul.d/playbooks/run.yml -i inventory -e ansible_user=${USER} -e ansible_user_dir=${HOME} -e tripleo_role_name=tripleo_compute_node`

a couple of times it failed, a couple of times it succeeded. It seems that sometime the tripleo_iscsid systemd service took more than 25 seconds to startup so it fails. That limit is the result of a couple of values hardcoded in the container_systemd plugin (5 attempts every 5 seconds). I proposed [1] to raise those values.

[1] https://review.opendev.org/c/openstack/tripleo-ansible/+/877305

Revision history for this message
Alan Bishop (alan-bishop) wrote :

It does seem like a timing issue, but from what I can tell the molecule job should be mocking a bunch of stuff [1].

[1] https://opendev.org/openstack/tripleo-ansible/src/branch/master/tripleo_ansible/roles/tripleo_compute_node/molecule/default/prepare.yml#L40

When I look at the failing job's logs, the timestamps show <1 second between the time the iscsid service is restarted and when they declare the restart failed.

2023-03-09 14:14:17.106941 | centos-9-stream | TASK [Run iscsid containers] **************************************************
2023-03-09 14:14:17.107124 | centos-9-stream | TASK [tripleo_container_manage : Manage container systemd services and cleanup old systemd healthchecks for /var/lib/tripleo-config/container-startup-config/iscsid] ***
2023-03-09 14:14:17.107128 | centos-9-stream | An exception occurred during task execution. To see the full traceback, use -vvv. The error was: NoneType: None
2023-03-09 14:14:17.107132 | centos-9-stream | fatal: [tripleo_compute_node]: FAILED! => changed=false
2023-03-09 14:14:17.107144 | centos-9-stream | msg: Service iscsid has not started yet

Revision history for this message
Roberto Alfieri (ralfieri) wrote :

Which logs are you looking?

I found this in ansible-execution.log:

2023-03-09 14:13:33,366 p=134527 u=zuul n=ansible | ok: [tripleo_compute_node]
2023-03-09 14:13:33,383 p=134527 u=zuul n=ansible | TASK [tripleo_container_manage : Manage container systemd services and cleanup old systemd healthchecks for /var/lib/tripleo-config/container-startup-config/iscsid] ***
2023-03-09 14:14:08,812 p=134527 u=zuul n=ansible | An exception occurred during task execution. To see the full traceback, use -vvv. The error was: NoneType: None
2023-03-09 14:14:08,813 p=134527 u=zuul n=ansible | fatal: [tripleo_compute_node]: FAILED! => changed=false
  msg: Service iscsid has not started yet

25 seconds as expected.

Revision history for this message
Roberto Alfieri (ralfieri) wrote :

Ok, found it. If you are looking at "job-execution.log" i think that the timing there aren't right. It keeps the same time from the line:

2023-03-09 14:14:17.093367 | centos-9-stream | ../../../tests/test_molecule.py F [100%]

<...>

to the lines:

<...>
2023-03-09 14:14:17.462963 | PLAY RECAP
2023-03-09 14:14:17.463060 | centos-9-stream | ok: 2 changed: 0 unreachable: 0 failed: 1 skipped: 1 rescued: 0 ignored: 0
2023-03-09 14:14:17.464601 |
2023-03-09 14:14:17.588315 | RUN END RESULT_NORMAL: [untrusted : opendev.org/openstack/tripleo-ansible/zuul.d/playbooks/run.yml@master]
2023-03-09 14:14:17.592542 | POST-RUN START: [trusted : opendev.org/opendev/base-jobs/playbooks/base/post.yaml@master]

Revision history for this message
Alan Bishop (alan-bishop) wrote :

Whoops, my bad. I was looking at the job-output.txt timestamps. I see the 25s time lapse in the ansible-execution.log file.

However, I also see the messages indicating the mocks I referenced are in place:

2023-03-09 13:53:07,036 p=33034 u=zuul n=ansible | TASK [Install packages that would otherwise be installed or needing to be mocked] ***
2023-03-09 13:53:13,984 p=33034 u=zuul n=ansible | changed: [tripleo_compute_node]
2023-03-09 13:53:13,995 p=33034 u=zuul n=ansible | TASK [Mock systemd-modules-load.service service] *******************************
2023-03-09 13:53:15,677 p=33034 u=zuul n=ansible | changed: [tripleo_compute_node]
2023-03-09 13:53:15,685 p=33034 u=zuul n=ansible | TASK [Mock sysctl] *************************************************************
2023-03-09 13:53:17,068 p=33034 u=zuul n=ansible | changed: [tripleo_compute_node]
2023-03-09 13:53:17,076 p=33034 u=zuul n=ansible | TASK [Mock chrony] *************************************************************
2023-03-09 13:53:20,972 p=33034 u=zuul n=ansible | changed: [tripleo_compute_node]

It seems maybe the mocks are not working as intended?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/tripleo-ansible/+/877305
Committed: https://opendev.org/openstack/tripleo-ansible/commit/7a1b531de5080ae2e78be2006b4a59ac523381e5
Submitter: "Zuul (22348)"
Branch: master

commit 7a1b531de5080ae2e78be2006b4a59ac523381e5
Author: Roberto Alfieri <email address hidden>
Date: Mon Mar 13 16:50:00 2023 +0100

    Raise wait time and retries limits for service to comes up

    Signed-off-by: Roberto Alfieri <email address hidden>
    Partial-bug: #2008822
    Change-Id: Iaff25c8cf0864a9d776e66b19906d5fbab0f1e2a

Revision history for this message
Marios Andreou (marios-b) wrote :

as posted in [1] the ci team has started the process of removing the master periodic rdo lines [2]

I am going to mark the bug as "won't fix" for the cix tracking.

[1] https://lists.openstack.org/pipermail/openstack-discuss/2023-March/033080.html
[2] https://review.rdoproject.org/r/c/config/+/48074

Changed in tripleo:
status: In Progress → Won't Fix
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.