tripleo-ci-centos-9-scenario001-standalone failed during step5 because gnocchi couldn't connect to redis

Bug #1978997 reported by John Fulton
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned

Bug Description

2022-06-16 17:47:41.106170 | fa163e58-38f7-4108-1b3e-000000004f72 | FATAL | Create containers managed by Podman for /var/lib/tripleo-config/container-startup-config/step_5 | standalone | error={"changed": false, "msg": "Failed containers: gnocchi_db_sync, ceilometer_gnocchi_upgrade"}

2022-06-16 17:42:12,816 [8] CRITICAL root: Traceback (most recent call last):

  File "/usr/lib/python3.9/site-packages/gnocchi/incoming/__init__.py", line 117, in NUM_SACKS
    self._num_sacks = int(self._get_storage_sacks())
  File "/usr/lib/python3.9/site-packages/gnocchi/incoming/redis.py", line 66, in _get_storage_sacks
    return self._client.hget(self.CFG_PREFIX, self.CFG_SACKS)
  File "/usr/lib/python3.9/site-packages/redis/client.py", line 3010, in hget
    return self.execute_command('HGET', name, key)
  File "/usr/lib/python3.9/site-packages/redis/client.py", line 898, in execute_command
    conn = self.connection or pool.get_connection(command_name, **options)
  File "/usr/lib/python3.9/site-packages/redis/connection.py", line 1192, in get_connection
    connection.connect()
  File "/usr/lib/python3.9/site-packages/redis/connection.py", line 567, in connect
    self.on_connect()
  File "/usr/lib/python3.9/site-packages/redis/connection.py", line 643, in on_connect
    auth_response = self.read_response()
  File "/usr/lib/python3.9/site-packages/redis/connection.py", line 739, in read_response
    response = self._parser.read_response()
  File "/usr/lib/python3.9/site-packages/redis/connection.py", line 324, in read_response
    raw = self._buffer.readline()
  File "/usr/lib/python3.9/site-packages/redis/connection.py", line 256, in readline
    self._read_from_socket()
  File "/usr/lib/python3.9/site-packages/redis/connection.py", line 201, in _read_from_socket
    raise ConnectionError(SERVER_CLOSED_CONNECTION_ERROR)
redis.exceptions.ConnectionError: Connection closed by server.

This container is still passed credentials for Ceph, though I'm not sure that's related. If it needs the credentials, nothing about how they are configured has changed (our client code to do this still runs during overcloud deployment, but we moved server deployment earlier). If there were something wrong with the shared ceph.conf and shared ceph.client.openstack.keyring, then glance, cinder, and nova (which also use the same files) would have had a problem earlier.

If it's Ceph-related I can help, but I'd like to get input from someone who works on gnocchi.

[1]

https://411deec051d768e42c33-554a55c2926ce345d8a8f0805ecfe993.ssl.cf5.rackcdn.com/846159/4/check/tripleo-ci-centos-9-scenario001-standalone/a9a4889/logs/undercloud/home/zuul/standalone_deploy.log

tags: added: alert promotion-blocker
Revision history for this message
Takashi Kajinami (kajinamit) wrote (last edit ):

There seems to be an issue with redis, and it is never promoted.

https://411deec051d768e42c33-554a55c2926ce345d8a8f0805ecfe993.ssl.cf5.rackcdn.com/846159/4/check/tripleo-ci-centos-9-scenario001-standalone/a9a4889/logs/undercloud/var/log/extra/pcs.txt
~~~
  * Container bundle: redis-bundle [198.72.124.73:5001/tripleomastercentos9/openstack-redis:pcmklatest]:
    * redis-bundle-0 (ocf:heartbeat:redis): Unpromoted standalone
~~~

ceilometer-manage is trying to connect to redis because redis is configured as its incoming storage, but it cannot reach it because no master is promoted.
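
A quick way to confirm both halves of that on the standalone node (a sketch; the gnocchi config path is the usual TripleO location and may differ per deployment):

~~~
# Sketch: confirm redis is the configured incoming driver and that the
# pacemaker bundle is stuck unpromoted.
sudo grep -A 2 '^\[incoming\]' \
    /var/lib/config-data/puppet-generated/gnocchi/etc/gnocchi/gnocchi.conf
sudo pcs status | grep -A 2 redis-bundle
~~~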

https://411deec051d768e42c33-554a55c2926ce345d8a8f0805ecfe993.ssl.cf5.rackcdn.com/846159/4/check/tripleo-ci-centos-9-scenario001-standalone/a9a4889/logs/undercloud/var/log/containers/haproxy/haproxy.log
~~~
Jun 16 17:24:28 standalone haproxy[7]: Server redis_be/standalone.ctlplane.localdomain is DOWN, reason: Layer4 connection problem, info: "Connection refused at step 1 of tcp-check (connect port 6379)", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jun 16 17:24:28 standalone haproxy[7]: backend redis_be has no server available!
...
Jun 16 17:37:45 standalone haproxy[7]: 192.168.24.3:58548 [16/Jun/2022:17:37:45.108] redis redis_be/<NOSRV> -1/-1/0 0 SC 2/1/0/0/0 0/0
Jun 16 17:37:45 standalone haproxy[7]: 192.168.24.3:58550 [16/Jun/2022:17:37:45.111] redis redis_be/<NOSRV> -1/-1/0 0 SC 2/1/0/0/0 0/0
Jun 16 17:37:45 standalone haproxy[7]: 192.168.24.3:58552 [16/Jun/2022:17:37:45.113] redis redis_be/<NOSRV> -1/-1/0 0 SC 2/1/0/0/0 0/0
Jun 16 17:37:45 standalone haproxy[7]: 192.168.24.3:58556 [16/Jun/2022:17:37:45.114] redis redis_be/<NOSRV> -1/-1/0 0 SC 2/1/0/0/0 0/0
Jun 16 17:37:45 standalone haproxy[7]: 192.168.24.3:58558 [16/Jun/2022:17:37:45.117] redis redis_be/<NOSRV> -1/-1/0 0 SC 2/1/0/0/0 0/0
Jun 16 17:37:45 standalone haproxy[7]: 192.168.24.3:58560 [16/Jun/2022:17:37:45.120] redis redis_be/<NOSRV> -1/-1/0 0 SC 2/1/0/0/0 0/0
Jun 16 17:37:45 standalone haproxy[7]: 192.168.24.3:58562 [16/Jun/2022:17:37:45.121] redis redis_be/<NOSRV> -1/-1/0 0 SC 2/1/0/0/0 0/0
Jun 16 17:37:45 standalone haproxy[7]: 192.168.24.3:58564 [16/Jun/2022:17:37:45.122] redis redis_be/<NOSRV> -1/-1/0 0 SC 2/1/0/0/0 0/0
Jun 16 17:37:45 standalone haproxy[7]: 192.168.24.3:58566 [16/Jun/2022:17:37:45.123] redis redis_be/<NOSRV> -1/-1/0 0 SC 2/1/0/0/0 0/0
Jun 16 17:37:45 standalone haproxy[7]: 192.168.24.3:58568 [16/Jun/2022:17:37:45.124] redis redis_be/<NOSRV> -1/-1/0 0 SC 2/1/0/0/0 0/0
Jun 16 17:37:45 standalone haproxy[7]: 192.168.24.3:58570 [16/Jun/2022:17:37:45.124] redis redis_be/<NOSRV> -1/-1/0 0 SC 2/1/0/0/0 0/0
Jun 16 17:37:45 standalone haproxy[7]: 192.168.24.3:58572 [16/Jun/2022:17:37:45.125] redis redis_be/<NOSRV> -1/-1/0 0 SC 2/1/0/0/0 0/0
Jun 16 17:37:45 standalone haproxy[7]: 192.168.24.3:58574 [16/Jun/2022:17:37:45.129] redis redis_be/<NOSRV> -1/-1/0 0 SC 2/1/0/0/0 0/0
Jun 16 17:37:45 standalone haproxy[7]: 192.168.24.3:58576 [16/Jun/2022:17:37:45.130] redis redis_be/<NOSRV> -1/-1/0 0 SC 2/1/0/0/0 0/0
Jun 16 17:37:45 standalone haproxy[7]: 192.168.24.3:58578 [16/Jun/2022:17:37:45.135] redis redis_be/<NOSRV> -1/-1/0 0 SC 2/1/0/0...


Revision history for this message
chandan kumar (chkumar246) wrote :

In passing job https://ac8676a2e4bb22cb45b3-6ec72dc5947dcd41834023b354511109.ssl.cf5.rackcdn.com/842370/6/gate/tripleo-ci-centos-9-scenario001-standalone/93b1333/logs/undercloud/var/log/extra/podman/containers/redis-bundle-podman-0/podman_info.log

```
pacemaker.x86_64 2.1.2-4.el9 @quickstart-centos-highavailability
pacemaker-cli.x86_64 2.1.2-4.el9 @quickstart-centos-highavailability
pacemaker-cluster-libs.x86_64 2.1.2-4.el9 @quickstart-centos-highavailability
pacemaker-libs.x86_64 2.1.2-4.el9 @quickstart-centos-highavailability
pacemaker-remote.x86_64 2.1.2-4.el9 @quickstart-centos-highavailability
pacemaker-schemas.noarch 2.1.2-4.el9 @quickstart-centos-highavailability
```

and in failed one
https://411deec051d768e42c33-554a55c2926ce345d8a8f0805ecfe993.ssl.cf5.rackcdn.com/846159/4/check/tripleo-ci-centos-9-scenario001-standalone/a9a4889/logs/undercloud/var/log/extra/podman/containers/redis-bundle-podman-0/podman_info.log

```
pacemaker.x86_64 2.1.3-2.el9 @quickstart-centos-highavailability
pacemaker-cli.x86_64 2.1.3-2.el9 @quickstart-centos-highavailability
pacemaker-cluster-libs.x86_64 2.1.3-2.el9 @quickstart-centos-highavailability
pacemaker-libs.x86_64 2.1.3-2.el9 @quickstart-centos-highavailability
pacemaker-remote.x86_64 2.1.3-2.el9 @quickstart-centos-highavailability
pacemaker-schemas.noarch 2.1.3-2.el9 @quickstart-centos-highavailability
```

Revision history for this message
Matthias Runge (mrunge) wrote :

We have seen redis issues with gnocchi in the past. The connection to redis can be unreliable, and gnocchi will retry in that case. However, it is a real problem if redis never selects a primary.

Revision history for this message
Matthias Runge (mrunge) wrote :

I wonder if the redis log message is a red herring here. In my more recent devstack setups I found that gnocchi does not work (well) with more recent sqlalchemy versions.

Revision history for this message
Matthias Runge (mrunge) wrote :

E.g. see here: https://ac8676a2e4bb22cb45b3-6ec72dc5947dcd41834023b354511109.ssl.cf5.rackcdn.com/842370/6/gate/tripleo-ci-centos-9-scenario001-standalone/93b1333/logs/undercloud/var/log/containers/gnocchi/app.log

2022-06-15 21:45:49,096 [17] WARNING py.warnings: /usr/lib/python3.9/site-packages/gnocchi/indexer/sqlalchemy.py:482: SAWarning: relationship 'ResourceHistory.metrics' will copy column resource_history.id to column metric.resource_id, which conflicts with relationship(s): 'Metric.resource' (copies resource.id to metric.resource_id), 'Resource.metrics' (copies resource.id to metric.resource_id). If this is not the intention, consider if these relationships should be linked with back_populates, or if viewonly=True should be applied to one or more if they are read-only. For the less common case that foreign key constraints are partially overlapping, the orm.foreign() annotation can be used to isolate the columns that should be written towards. To silence this warning, add the parameter 'overlaps="metrics,resource"' to the 'ResourceHistory.metrics' relationship. (Background on this error at: https://sqlalche.me/e/14/qzyx)
  resource_type = session.query(ResourceType).get(name)

Revision history for this message
Alfredo Moralejo (amoralej) wrote :

I'm seeing a similar issue in RDO. In case it helps, I'm adding redis logs for a passing and a failing job:

Last passing:

https://logserver.rdoproject.org/63/43563/2/check/rdoinfo-tripleo-master-testing-centos-9-scenario001-standalone/5aaa84a/logs/undercloud/var/log/containers/redis/redis.log.txt.gz

Failing:

https://logserver.rdoproject.org/73/42273/8/check/rdoinfo-tripleo-master-testing-centos-9-scenario001-standalone/e0f6746/logs/undercloud/var/log/containers/redis/redis.log.txt.gz

I see some log lines in the passing one that are not in the failing one:

115:M 13 Jun 2022 13:55:24.315 * Discarding previously cached master state.
115:M 13 Jun 2022 13:55:24.315 # Setting secondary replication ID to 213924cf53ab965342abadfa215441591e17c616, valid up to offset: 1. New replication ID is d57b4a27c96225a5dc1b3835d0fad08e271c70a7
115:M 13 Jun 2022 13:55:24.316 * MASTER MODE enabled (user request from 'id=13 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=8 name= age=0 idle=0 flags=U db=0 sub=0 psub=0 multi=-1 qbuf=34 qbuf-free=40920 argv-mem=12 obl=0 oll=0 omem=0 tot-mem=61476 events=r cmd=slaveof user=default redir=-1')
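
The "MASTER MODE enabled" line above is the resource agent promoting redis over its unix socket; roughly the equivalent command (a sketch based on the cmd=slaveof field in that log line):

~~~
# Rough equivalent of the promotion recorded in the passing job's log above;
# the failing job never reaches this step.
redis-cli -s /var/run/redis/redis.sock SLAVEOF NO ONE
~~~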

Revision history for this message
Damien Ciabrini (dciabrin) wrote :

I'm looking into it.

In response to #2, looking at chandan's logs for the passing and failing jobs ([1] and [2] respectively), the passing job has redis effectively promoted in pacemaker [1]:

Jun 15 21:29:29.080 standalone.localdomain pacemaker-attrd [63860] (attrd_peer_update) notice: Setting master-redis[standalone]: (unset) -> 1 | from standalone

Whereas the failing job never sets master-redis to 1 in pacemaker.

This setting is driven by the behaviour of the redis resource agent; I don't think pacemaker itself is at fault here.

I do see that in the passing job [1], redis-bundle is restarted once because of the way we set up the resource in pacemaker at creation time. It's not in the failing job [2]. But again, I don't see why this would interfere with the redis resource agent setting the master-redis flag to 1.

I am going to try to replicate this locally on a standalone environment to see what could make the redis resource agent misbehave.

[1] https://ac8676a2e4bb22cb45b3-6ec72dc5947dcd41834023b354511109.ssl.cf5.rackcdn.com/842370/6/gate/tripleo-ci-centos-9-scenario001-standalone/93b1333/logs/undercloud/var/log/pacemaker/pacemaker.log
[2] https://411deec051d768e42c33-554a55c2926ce345d8a8f0805ecfe993.ssl.cf5.rackcdn.com/846159/4/check/tripleo-ci-centos-9-scenario001-standalone/a9a4889/logs/undercloud/var/log/pacemaker/pacemaker.log
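
For reference, the master-redis promotion score mentioned above can be checked directly on the node; a quick sketch (attribute name taken from the pacemaker-attrd log line quoted above):

~~~
# Sketch: query the promotion score that attrd sets in the passing job;
# in the failing job it is never set to 1.
sudo crm_attribute --node standalone --name master-redis --lifetime reboot --query
~~~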

Revision history for this message
Takashi Kajinami (kajinamit) wrote :

So initially pacemaker started redis-bundle-podman-0, which represents the podman container.
This completed without any problem.
~~~
Jun 16 17:26:47.502 standalone.localdomain pacemaker-schedulerd[61881] (log_list_item) notice: Actions: Start redis-bundle-podman-0 ( standalone )
Jun 16 17:26:47.503 standalone.localdomain pacemaker-controld [61882] (te_rsc_command) notice: Initiating start operation redis-bundle-podman-0_start_0 locally on standalone | action 66
...
Jun 16 17:26:49.231 standalone.localdomain pacemaker-execd [61879] (log_finished) info: redis-bundle-podman-0 start (call 48, PID 92341) exited with status 0 (execution time 1.728s)
~~~

Later, pacemaker started the nested resources, but at that moment it tried to restart the root resource (redis-bundle-podman-0) because of a resource definition change.

~~~
Jun 16 17:26:55.128 standalone.localdomain pacemaker-schedulerd[61881] (log_list_item) notice: Actions: Restart redis-bundle-podman-0 ( standalone ) due to resource definition change
Jun 16 17:26:55.129 standalone.localdomain pacemaker-schedulerd[61881] (log_list_item) notice: Actions: Start redis-bundle-0 ( standalone )
Jun 16 17:26:55.129 standalone.localdomain pacemaker-schedulerd[61881] (log_list_item) notice: Actions: Start redis:0 ( redis-bundle-0 )
~~~

Then it successfully stopped the container.
~~~
Jun 16 17:26:56.030 standalone.localdomain pacemaker-controld [61882] (log_executor_event) notice: Result of stop operation for redis-bundle-podman-0 on standalone: ok | CIB update 150, graph action confirmed; call=51 key=redis-bundle-podman-0_stop_0 rc=0
...
~~~

Then all resources were started but the redis resource was not promoted at that time.
~~~
Jun 16 17:26:56.033 standalone.localdomain pacemaker-controld [61882] (te_rsc_command) notice: Initiating start operation redis-bundle-podman-0_start_0 locally on standalone | action 11
...
Jun 16 17:26:57.719 standalone.localdomain pacemaker-controld [61882] (log_executor_event) notice: Result of start operation for redis-bundle-podman-0 on standalone: ok | CIB update 152, graph action confirmed; call=52 key=redis-bundle-podman-0_start_0 rc=0
...
Jun 16 17:26:57.728 standalone.localdomain pacemaker-controld [61882] (te_rsc_command) notice: Initiating start operation redis-bundle-0_start_0 locally on standalone | action 71
...
Jun 16 17:26:58.264 standalone.localdomain pacemaker-controld [61882] (log_executor_event) notice: Result of start operation for redis-bundle-0 on standalone: ok | CIB update 161, graph action confirmed; call=8 key=redis-bundle-0_start_0 rc=0
...
Jun 16 17:26:58.314 standalone.localdomain pacemaker-schedulerd[61881] (rsc_action_default) info: Leave redis-bundle-podman-0...


Revision history for this message
chandan kumar (chkumar246) wrote :

846287: [DNM] Downgrade pacemaker and resource-agents | https://review.opendev.org/c/openstack/tripleo-common/+/846287

Ronelle Landy (rlandy)
Changed in tripleo:
importance: High → Critical
Revision history for this message
Takashi Kajinami (kajinamit) wrote (last edit ):

So the problem is reproduced if we downgrade only resource-agents.
 https://review.opendev.org/c/openstack/tripleo-common/+/846351
 https://zuul.opendev.org/t/openstack/build/72c725792946461f9a16760c6437fb8e

On the other hand, it is not reproduced when we downgrade pacemaker/pacemaker_remote
 https://review.opendev.org/c/openstack/tripleo-common/+/846352
 https://zuul.opendev.org/t/openstack/build/eed22d317f0e4c67b358a5f308b036af

So this is likely a problem with pacemaker_remote (as the pacemaker package itself is not used inside the container).

Good
~~~
pacemaker.x86_64 2.1.2-4.el9 @quickstart-centos-highavailability
pacemaker-cli.x86_64 2.1.2-4.el9 @quickstart-centos-highavailability
pacemaker-cluster-libs.x86_64 2.1.2-4.el9 @quickstart-centos-highavailability
pacemaker-libs.x86_64 2.1.2-4.el9 @quickstart-centos-highavailability
pacemaker-remote.x86_64 2.1.2-4.el9 @quickstart-centos-highavailability
pacemaker-schemas.noarch 2.1.2-4.el9 @quickstart-centos-highavailability
~~~

Bad
~~~
pacemaker.x86_64 2.1.3-2.el9 @quickstart-centos-highavailability
pacemaker-cli.x86_64 2.1.3-2.el9 @quickstart-centos-highavailability
pacemaker-cluster-libs.x86_64 2.1.3-2.el9 @quickstart-centos-highavailability
pacemaker-libs.x86_64 2.1.3-2.el9 @quickstart-centos-highavailability
pacemaker-remote.x86_64 2.1.3-2.el9 @quickstart-centos-highavailability
pacemaker-schemas.noarch 2.1.3-2.el9 @quickstart-centos-highavailability
~~~
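
For context, the DNM reviews above effectively pin the container build back to the "Good" versions; roughly the downgrade they exercise (a sketch, the exact NVRs come from the listing above and the real change lives in the tripleo-common reviews):

~~~
# Rough equivalent of what the DNM patches do during the container build.
dnf downgrade -y pacemaker-2.1.2-4.el9 pacemaker-remote-2.1.2-4.el9 resource-agents
~~~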

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-common (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-common/+/846287
Committed: https://opendev.org/openstack/tripleo-common/commit/bf87555420bfd17d8b90f70ea9616476741b1819
Submitter: "Zuul (22348)"
Branch: master

commit bf87555420bfd17d8b90f70ea9616476741b1819
Author: Chandan Kumar (raukadah) <email address hidden>
Date: Fri Jun 17 14:51:49 2022 +0530

    Downgrade pacemaker and resource-agents

    The patch downgrades:
    pacemaker pacemaker-remote resource-agents
    in container builds to avoid
    errors at deployment step 5 with latest versions.

    Related-Bug: #1978997
    Signed-off-by: Chandan Kumar (raukadah) <email address hidden>
    Change-Id: Ie5288864cd6f346a5bb5b481b4aa5dbd1abb9f47

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/846474

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/846557

Revision history for this message
Damien Ciabrini (dciabrin) wrote :

Ok the failure to promote the Redis resource seems to come from a change in the default output of crm_attribute in pacemaker.x86_64 2.1.3-2.el9.

Prior to that version, trying to fetch a non-existing attribute from the CIB would return an empty string. With the newer pacemaker, we get the "(null)" string instead. This breaks the logic implemented in the redis resource agent.

This is probably a regression. I've filed https://bugzilla.redhat.com/show_bug.cgi?id=2099331 to track that pacemaker issue separately.
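
To make the breakage concrete, here is a minimal sketch of the pattern that breaks (illustrative only, not the redis resource agent's actual code; the attribute name is a placeholder):

~~~
# Illustration of the regression: an attribute that has never been set used
# to come back as an empty string.
val=$(crm_attribute --type crm_config --name redis_REPL_INFO --query --quiet 2>/dev/null)
if [ -z "$val" ]; then
    echo "no replication info recorded yet; this node may bootstrap as master"
fi
# With pacemaker 2.1.3-2 the same query prints the literal string "(null)",
# so the -z test is false and the promotion path is never taken, matching
# the permanently unpromoted redis-bundle seen in the failing jobs.
~~~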

Revision history for this message
John Fulton (jfulton-org) wrote :

Still seeing symptoms of this bug. E.g.

  https://review.opendev.org/c/openstack/puppet-tripleo/+/845854/

Conjecture: the following needs to go through first

  https://review.opendev.org/c/openstack/tripleo-common/+/846287

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tripleo-common/+/847222

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-common (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-common/+/847222
Committed: https://opendev.org/openstack/tripleo-common/commit/0b6ae01e797f3f34c83d14fa03151b33ce6894bb
Submitter: "Zuul (22348)"
Branch: master

commit 0b6ae01e797f3f34c83d14fa03151b33ce6894bb
Author: Ronelle Landy <email address hidden>
Date: Wed Jun 22 17:16:13 2022 -0400

    Downgrade pacemaker, resource-agents - exact ver

    https://review.opendev.org/c/openstack/tripleo-common/+/846287
    downgrades pacemaker and resource-agents but does
    not specify the version. So the problem resurfaced when
    these rpms upgraded yet again.

    This patch specifies the downgrade version.

    Change-Id: If221c0c4cfe4b7a08568916400f4b50a72ab9e21
    Related-Bug: #1978997

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (master)

Change abandoned by "John Fulton <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/846474
Reason: fixed by https://review.opendev.org/c/openstack/tripleo-common/+/847222

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (stable/wallaby)

Change abandoned by "John Fulton <email address hidden>" on branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/846557
Reason: fixed by https://review.opendev.org/c/openstack/tripleo-common/+/847222

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/tripleo-common/+/847437

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-common (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/tripleo-common/+/847437
Committed: https://opendev.org/openstack/tripleo-common/commit/e8c7d086a48429c4eade1518955fb765a74d464f
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit e8c7d086a48429c4eade1518955fb765a74d464f
Author: Chandan Kumar (raukadah) <email address hidden>
Date: Fri Jun 17 14:51:49 2022 +0530

    Downgrade pacemaker and resource-agents

    The patch downgrades:
    pacemaker pacemaker-remote resource-agents
    in container builds to avoid
    errors at deployment step 5 with latest versions.

    This patch also specifies the downgrade version.

    Related-Bug: #1978997
    Signed-off-by: Chandan Kumar (raukadah) <email address hidden>
    Change-Id: Ie5288864cd6f346a5bb5b481b4aa5dbd1abb9f47

tags: added: in-stable-wallaby
Ronelle Landy (rlandy)
Changed in tripleo:
status: Triaged → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tripleo-common/+/850676

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-common (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-common/+/850676
Committed: https://opendev.org/openstack/tripleo-common/commit/d407c857a6e96683e9834136349d880d92f2f94d
Submitter: "Zuul (22348)"
Branch: master

commit d407c857a6e96683e9834136349d880d92f2f94d
Author: Takashi Kajinami <email address hidden>
Date: Fri Jul 22 02:11:46 2022 +0900

    Stop downgrading pacemaker

    This reverts the following two changes in a single commit.

    commit bf87555420bfd17d8b90f70ea9616476741b1819
    Downgrade pacemaker and resource-agents

    commit 0b6ae01e797f3f34c83d14fa03151b33ce6894bb
    Downgrade pacemaker, resource-agents - exact ver

    The bug[1] in pacemaker was already fixed and the new pacemaker package
    ( 2.1.4-2 ) with the fix was already released in CentOS Stream 9 repo.

    [1] https://bugzilla.redhat.com/show_bug.cgi?id=2099331

    Related-Bug: #1978997
    Change-Id: I480ff3878c2ed3ae8222fe3b8a47af790673bcc8

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/tripleo-common/+/865402

Revision history for this message
Sandeep Yadav (sandeepyadav93) wrote :

Reopening this bug until the wallaby cherry-pick merges.

Wallaby jobs are failing because the versions we are trying to downgrade to are no longer available in the repos.

https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_0a6/865397/1/check/tripleo-ci-centos-9-content-provider/0a6c693/logs/container-builds/5d4ead5b-f90f-45f5-b774-4366676dd58e/base/redis/redis-build.log

~~~
No package pacemaker-2.1.2-4.el9 available.
No package pacemaker-remote-2.1.2-4.el9 available.
No package resource-agents-4.10.0-17.el9 available.
Error: No packages marked for downgrade.
~~~

Changed in tripleo:
status: Fix Released → In Progress
milestone: zed-1 → antelope-1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-common (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/tripleo-common/+/865402
Committed: https://opendev.org/openstack/tripleo-common/commit/f7cd739c38611ae914a7bf3e48ce9f4368c89caf
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit f7cd739c38611ae914a7bf3e48ce9f4368c89caf
Author: Takashi Kajinami <email address hidden>
Date: Fri Jul 22 02:11:46 2022 +0900

    Stop downgrading pacemaker

    This reverts the following two changes in a single commit.

    commit bf87555420bfd17d8b90f70ea9616476741b1819
    Downgrade pacemaker and resource-agents

    commit 0b6ae01e797f3f34c83d14fa03151b33ce6894bb
    Downgrade pacemaker, resource-agents - exact ver

    The bug[1] in pacemaker was already fixed and the new pacemaker package
    ( 2.1.4-2 ) with the fix was already released in CentOS Stream 9 repo.

    [1] https://bugzilla.redhat.com/show_bug.cgi?id=2099331

    Related-Bug: #1978997
    Change-Id: I480ff3878c2ed3ae8222fe3b8a47af790673bcc8
    (cherry picked from commit d407c857a6e96683e9834136349d880d92f2f94d)

Revision history for this message
Sandeep Yadav (sandeepyadav93) wrote :
Changed in tripleo:
status: In Progress → Fix Released