Comment 0 for bug 1930293

Revision history for this message
Radosław Piliszek (yoctozepto) wrote : multinode rabbitmq unstable kolla ansible actions

Multinode rabbitmq kolla-ansible actions may fail depending on the order of container stops and starts.
The order can be randomly wrong, causing any change (update, upgrade, config change) to a multinode rabbitmq cluster to fail the run.

Example failure:

ara summary (it shows 'secondary1' was stopped last, yet 'secondary2' is the first to start):

Stopping all rabbitmq instances but the first node secondary1 kolla_docker 0:02:38 0:00:00 SKIPPED
Stopping all rabbitmq instances but the first node secondary2 kolla_docker 0:02:38 0:00:07 CHANGED
Stopping all rabbitmq instances but the first node primary kolla_docker 0:02:38 0:00:09 CHANGED
Stopping rabbitmq on the first node secondary2 kolla_docker 0:02:48 0:00:00 SKIPPED
Stopping rabbitmq on the first node primary kolla_docker 0:02:48 0:00:00 SKIPPED
Stopping rabbitmq on the first node secondary1 kolla_docker 0:02:48 0:00:17 CHANGED
Restart rabbitmq container secondary2 include_tasks 0:03:06 0:00:00 OK
Restart rabbitmq container secondary1 include_tasks 0:03:06 0:00:00 OK
Restart rabbitmq container primary include_tasks 0:03:06 0:00:00 OK
Restart rabbitmq container secondary2 kolla_docker 0:03:06 0:00:01 CHANGED
Waiting for rabbitmq to start secondary2 command 0:03:07 0:10:06 FAILED
Restart rabbitmq container secondary1 kolla_docker 0:13:14 0:00:01 CHANGED
Waiting for rabbitmq to start secondary1 command 0:13:15 0:00:05 CHANGED
Restart rabbitmq container primary kolla_docker 0:13:21 0:00:01 CHANGED
Waiting for rabbitmq to start primary command 0:13:23 0:00:07 CHANGED

docker logs for the failing rabbitmq (they show the order is the actual problem):

2021-05-31T13:48:33.608436819Z BOOT FAILED
2021-05-31T13:48:33.608444389Z ===========
2021-05-31T13:48:33.608571562Z Timeout contacting cluster nodes: [rabbit@primary,rabbit@secondary1].
2021-05-31T13:48:33.608687375Z
2021-05-31T13:48:33.608727786Z BACKGROUND
2021-05-31T13:48:33.608930872Z ==========
2021-05-31T13:48:33.608990003Z
2021-05-31T13:48:33.609201178Z This cluster node was shut down while other nodes were still running.
2021-05-31T13:48:33.609556107Z To avoid losing data, you should start the other nodes first, then
2021-05-31T13:48:33.609564438Z start this one. To force this node to start, first invoke
2021-05-31T13:48:33.609612299Z "rabbitmqctl force_boot". If you do so, any changes made on other
2021-05-31T13:48:33.609766853Z cluster nodes after this one was shut down may be lost.
2021-05-31T13:48:33.609805674Z
2021-05-31T13:48:33.609895306Z DIAGNOSTICS
2021-05-31T13:48:33.609953178Z ===========
2021-05-31T13:48:33.609981468Z
2021-05-31T13:48:33.610106611Z attempted to contact: [rabbit@primary,rabbit@secondary1]
2021-05-31T13:48:33.610173433Z
2021-05-31T13:48:33.610252235Z rabbit@primary:
2021-05-31T13:48:33.610450790Z * unable to connect to epmd (port 4369) on primary: address (cannot connect to host/port)
2021-05-31T13:48:33.610635545Z
2021-05-31T13:48:33.610760428Z rabbit@secondary1:
2021-05-31T13:48:33.610963233Z * unable to connect to epmd (port 4369) on secondary1: address (cannot connect to host/port)
2021-05-31T13:48:33.611150918Z
2021-05-31T13:48:33.611189209Z
2021-05-31T13:48:33.611298392Z Current node details:
2021-05-31T13:48:33.611434945Z * node name: rabbit@secondary2
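
The boot failure above follows from RabbitMQ's rule that the node shut down last holds the most recent cluster state and must therefore be started first. A minimal sketch (not kolla-ansible code; the function name and node lists are illustrative only) of checking a restart sequence against that rule, using the orders observed in the ara summary:

```python
# Sketch: RabbitMQ expects the last node stopped to be the first node
# started, because it holds the most recent cluster state.

def first_start_is_safe(stop_order, start_order):
    """Return True only if the node stopped last is started first."""
    return start_order[0] == stop_order[-1]

# Orders observed in the ara summary above:
stop_order = ["secondary2", "primary", "secondary1"]   # secondary1 stopped last
start_order = ["secondary2", "secondary1", "primary"]  # secondary2 started first

print(first_start_is_safe(stop_order, start_order))  # False -> boot timeout
```

This is why the run only fails sometimes: whenever the (effectively random) start order happens to begin with the node that was stopped last, the cluster boots fine.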