standalone-upgrade-ussuri can't start containers during deployment container-puppet step2

Bug #1907769 reported by Marios Andreou
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned
Milestone: wallaby-2

Bug Description

As seen at [1][2][3], the Ussuri gate job tripleo-ci-centos-8-standalone-upgrade-ussuri is failing during deployment; specifically, it cannot start containers in step 2:

 2020-12-11 05:04:16.226666 | fa163ed5-5107-07e9-7cb3-000000000e66 | TIMING | Debug output for task: Run puppet host configuration for step 2 | standalone | 0:22:22.252103 | 0.06s
...
 2020-12-11 05:04:21.501837 | fa163ed5-5107-07e9-7cb3-000000000e73 | TIMING | Start containers for step 2 using paunch | standalone | 0:22:27.527279 | 0.26s
 2020-12-11 05:04:21.698224 | fa163ed5-5107-07e9-7cb3-000000000e74 | TASK | Wait for containers to start for step 2 using paunch
 2020-12-11 05:04:21.858922 | fa163ed5-5107-07e9-7cb3-000000000e74 | WAITING | Wait for containers to start for step 2 using paunch | standalone | 1200 retries left
...
 2020-12-11 06:07:05.369731 | fa163ed5-5107-07e9-7cb3-000000000e74 | FATAL | Wait for containers to start for step 2 using paunch | standalone | error={"ansible_job_id": "119721351989.98999", "attempts": 1200, "changed": false, "finished": 0, "started": 1}

This is an Ussuri gate blocker.

[1] https://872de5c590dd926ff0db-30e72828a36544d0c7466f2989d78bfe.ssl.cf1.rackcdn.com/766516/1/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/3072fa1/logs/undercloud/home/zuul/standalone_deploy.log
[2] http://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_cee/766657/3/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/cee1c62/logs/undercloud/home/zuul/standalone_deploy.log
[3] https://5328919e718f2b3f8d20-cbda4aa1c62aa706d6caf520e6cec7d3.ssl.cf5.rackcdn.com/762659/1/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/d196262/logs/undercloud/home/zuul/standalone_deploy.log
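
For anyone debugging this on a held node, a rough sketch of how to check what paunch did (or did not) start for step 2 follows; the config_id label value and the presence of the paunch CLI on the node are assumptions, not verified against these jobs:

 # containers paunch tagged for step 2 (label value is an assumption)
 sudo podman ps -a --filter label=config_id=tripleo_step2

 # what paunch itself thinks it applied, if the paunch CLI is installed on the node
 sudo paunch list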

tags: added: promotion-blocker
description: updated
Revision history for this message
Marios Andreou (marios-b) wrote :

I've been digging here for a while but can't find a root cause yet; adding some notes to avoid repeated effort later:

There are a few errors that are red herrings; for example, using [1] as a reference since it is a green job:

0x7f1f78c22898>): [Errno 2] No such file or directory: '/var/log/validations/bc764e20-1120-1a3a-1363-000000000008_deploy_steps_playbook_2020-12-10T18:08:49.691685Z.json'
Exception: Deployment failed

The errors.log is empty [2], and I can't find an error in the container outputs [3] that is not also duplicated in the 'good' run [1].

[1] https://7eeddc8c208d72a47f96-284a4b285d9a15fafe54e6a481b891ab.ssl.cf1.rackcdn.com/757821/12/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/e0fc8e9/logs/
[2] https://872de5c590dd926ff0db-30e72828a36544d0c7466f2989d78bfe.ssl.cf1.rackcdn.com/766516/1/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/3072fa1/logs/undercloud/var/log/extra/errors.txt.txt
[3] https://872de5c590dd926ff0db-30e72828a36544d0c7466f2989d78bfe.ssl.cf1.rackcdn.com/766516/1/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/3072fa1/logs/undercloud/var/log/containers/stdouts/index.html
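
A rough sketch of the kind of comparison meant above, assuming the stdout logs from the failing run [3] and the good run [1] have been downloaded into bad-run/ and good-run/ (the directory names and the plain 'error' grep are assumptions):

 grep -rhi error bad-run/ | sort -u > bad-errors.txt
 grep -rhi error good-run/ | sort -u > good-errors.txt
 # lines that only show up in the failing run
 comm -23 bad-errors.txt good-errors.txt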

Revision history for this message
Alex Schultz (alex-schultz) wrote :

It's a pacemaker mismatch from the CentOS 8.3 update. We'll need new containers.

wes hayutin (weshayutin)
Changed in tripleo:
milestone: none → wallaby-2
Revision history for this message
John Fulton (jfulton-org) wrote :

I'm seeing this using master on an undercloud I built last night:

2020-12-11 22:42:23.169488 | 24420151-53ff-3b09-e045-00000000824d | FATAL | Check containers status | oc0-controller-0 | error={"changed": false, "msg": "Failed container(s): ['mysql_wait_bundle'], check logs in /var/log/containers/stdouts/"}

 http://paste.openstack.org/show/800991/
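
For reference, a quick way to dig into that failed container on the node; the exact stdout log file name is an assumption, since the error message only gives the directory:

 sudo tail -n 100 /var/log/containers/stdouts/mysql_wait_bundle.log
 sudo podman logs mysql_wait_bundle
 sudo podman inspect mysql_wait_bundle --format '{{.State.ExitCode}}'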

Revision history for this message
John Fulton (jfulton-org) wrote :

pacemaker-2.0.3-5 [in container] vs. pacemaker-2.0.4-6 [on container host]:

[root@oc0-controller-0 ~]# podman exec -ti mysql_wait_bundle rpm -qa | grep pace
pacemaker-2.0.3-5.el8_2.1.x86_64
pacemaker-cluster-libs-2.0.3-5.el8_2.1.x86_64
pacemaker-schemas-2.0.3-5.el8_2.1.noarch
pacemaker-remote-2.0.3-5.el8_2.1.x86_64
pacemaker-libs-2.0.3-5.el8_2.1.x86_64
pacemaker-cli-2.0.3-5.el8_2.1.x86_64
[root@oc0-controller-0 ~]#

[root@oc0-controller-0 ~]# rpm -qa | grep pacemaker
pacemaker-libs-2.0.4-6.el8.x86_64
pacemaker-schemas-2.0.4-6.el8.noarch
ansible-pacemaker-1.0.4-0.20201118111731.accaf26.el8.noarch
pacemaker-2.0.4-6.el8.x86_64
pacemaker-remote-2.0.4-6.el8.x86_64
pacemaker-cluster-libs-2.0.4-6.el8.x86_64
pacemaker-cli-2.0.4-6.el8.x86_64
puppet-pacemaker-1.1.1-0.20201202163950.2f4b6fa.el8.noarch
[root@oc0-controller-0 ~]#
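
A small loop to make the same container-vs-host comparison across all running containers at once; this is only a sketch, and pacemaker is simply not installed in most images, hence the 2>/dev/null and the continue:

 host_ver=$(rpm -q pacemaker)
 for c in $(sudo podman ps --format '{{.Names}}'); do
     cont_ver=$(sudo podman exec "$c" rpm -q pacemaker 2>/dev/null) || continue
     echo "$c: $cont_ver (host: $host_ver)"
 done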

Revision history for this message
John Fulton (jfulton-org) wrote :

Using master and CentOS 8.3.2011 I was able to work around this by downgrading pacemaker.

[root@oc0-ceph-0 yum.repos.d]# dnf downgrade pacemaker
baseos 86 kB/s | 2.2 MB 00:26
Dependencies resolved.
===============================================================================================
 Package                 Arch    Version          Repository                        Size
===============================================================================================
Downgrading:
 pacemaker               x86_64  2.0.3-5.el8_2.1  tripleo-centos-highavailability  435 k
 pacemaker-cli           x86_64  2.0.3-5.el8_2.1  tripleo-centos-highavailability  327 k
 pacemaker-cluster-libs  x86_64  2.0.3-5.el8_2.1  AppStream                        125 k
 pacemaker-libs          x86_64  2.0.3-5.el8_2.1  AppStream                        649 k
 pacemaker-remote        x86_64  2.0.3-5.el8_2.1  tripleo-centos-highavailability  120 k
 pacemaker-schemas       noarch  2.0.3-5.el8_2.1  AppStream                         65 k
 sbd                     x86_64  1.4.1-3.el8      AppStream                         74 k

Transaction Summary
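
If the workaround needs to hold until fixed containers land, something like the following should keep dnf from re-upgrading pacemaker on the next update; the versionlock plugin package name is an assumption for CentOS 8 and this is only a sketch of the idea:

 sudo dnf downgrade -y pacemaker
 sudo dnf install -y python3-dnf-plugin-versionlock
 sudo dnf versionlock add 'pacemaker-2.0.3*' 'pacemaker-libs-2.0.3*' 'pacemaker-cli-2.0.3*'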

Revision history for this message
wes hayutin (weshayutin) wrote :

Using content providers ensures the base OS and dependency RPMs are at the same version across containers and nodes. We can guarantee that because the containers are now built for each patch/review upstream.

This bug is now resolved, as the remaining Ussuri repos have been migrated to content providers:
https://review.opendev.org/c/openstack/tripleo-common/+/757821
https://review.opendev.org/c/openstack/python-tripleoclient/+/757836

https://hackmd.io/YH9xtBmOQbSVqPu7fjgoog

Changed in tripleo:
status: Triaged → Fix Released
Revision history for this message
Marios Andreou (marios-b) wrote :

We are still hitting this in the component pipeline for the newly added tripleo-component standalone-upgrade-ussuri job at [1]. You can see the pacemaker version installed on the host in [2]:

        * pacemaker-libs.x86_64 2.0.4-6.el8 @quickstart-centos-appstreams

Unfortunately there are no content providers in that case, so we either have to use something like the patch at [3] or we need a *Train* promotion to get new containers pushed and available. The job is currently using the Train current-tripleo hash e59fa31c15c5563057de9cfe85bf3826, which is from the 4th [4].

[1] https://logserver.rdoproject.org/01/31201/2/check/periodic-tripleo-ci-centos-8-standalone-upgrade-tripleo-ussuri/9031b8a/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz
[2] https://logserver.rdoproject.org/01/31201/2/check/periodic-tripleo-ci-centos-8-standalone-upgrade-tripleo-ussuri/9031b8a/logs/undercloud/var/log/extra/package-list-installed.txt.gz
[3] https://review.opendev.org/c/openstack/tripleo-quickstart-extras/+/767082
[4] https://trunk.rdoproject.org/centos8-train/current-tripleo/e5/9f/e59fa31c15c5563057de9cfe85bf3826/
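
For completeness, a quick way to confirm which hash a current-tripleo repo actually points at (the delorean.repo name and layout on the trunk server are assumptions):

 curl -s https://trunk.rdoproject.org/centos8-train/current-tripleo/delorean.repo | grep '^baseurl'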

Revision history for this message
Marios Andreou (marios-b) wrote :

As a follow-up to comment #8: we had a Train promotion on 20 December and the Ussuri job is now green again at https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-8-standalone-upgrade-tripleo-ussuri

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-quickstart-extras (master)

Change abandoned by "chandan kumar <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/tripleo-quickstart-extras/+/767082
Reason: abandoning in favor of content provider
