Hi,
I just found that kolla-ansible times out when starting/stopping services (only some of them). The cause is that when the Docker container of one of these services receives SIGTERM, it exits with return code 143 instead of 0.
I've already tested it and added some debug messages to the kolla_systemd_worker code; output below:
Unit kolla-haproxy-container.service - waiting for 5.
(the line above is repeated 24 times in total)
Unit kolla-haproxy-container.service timeouted for wait for dead after 120.
Waiting for unit -> wait_for_unit(service=kolla-proxysql-container.service, timeout=120, state=dead
Unit kolla-proxysql-container.service - waiting for 5.
Unit kolla-proxysql-container.service dead == dead, return.
Waiting for unit -> wait_for_unit(service=kolla-haproxy-container.service, timeout=120, state=running
Unit kolla-haproxy-container.service running == running, return.
Waiting for unit -> wait_for_unit(service=kolla-proxysql-container.service, timeout=120, state=running
Unit kolla-proxysql-container.service running == running, return.
The code checks for 'dead', but the unit ends up 'failed'. It is still a properly stopped service; it just exited with 143 instead of 0.
Then I tried stopping every service with systemctl stop $service; the resulting states were:
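To make the failure mode concrete, here is a minimal sketch of the kind of polling loop the debug output suggests: check the unit state, sleep 5 s, give up after 120 s. The function name, signature and injected `get_state`/`sleep` callables are hypothetical and this is not the real kolla_systemd_worker code; it only shows why waiting for 'dead' alone spins for the full timeout (24 polls × 5 s = 120 s) when the unit ends up 'failed'.

```python
import time
from typing import Callable, Iterable


def wait_for_unit(get_state: Callable[[], str],
                  accepted: Iterable[str],
                  timeout: float = 120,
                  interval: float = 5,
                  sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll get_state() until it reports one of the accepted states.

    Hypothetical simplification of the polling visible in the debug log;
    NOT the real kolla_systemd_worker implementation.
    Returns True once an accepted state is seen, False on timeout.
    """
    wanted = set(accepted)
    waited = 0.0
    while waited < timeout:
        if get_state() in wanted:
            return True
        sleep(interval)  # corresponds to one "waiting for 5." line
        waited += interval
    return False


if __name__ == "__main__":
    slept = []
    # A unit that stops with exit code 143 ends up 'failed', never 'dead':
    print(wait_for_unit(lambda: "failed", {"dead"}, sleep=slept.append),
          len(slept))  # False 24
    # Accepting 'failed' as a stopped state returns on the first poll:
    print(wait_for_unit(lambda: "failed", {"dead", "failed"},
                        sleep=slept.append))  # True
```

Widening the accepted set is the smallest change, but it also counts a unit that failed for an unrelated reason as stopped; for a stop operation that is arguably acceptable, since the unit is not running either way.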
kolla-cinder_api-container.service > Active: inactive (dead) since Thu 2024-01-04 19:00:43 UTC; 30s ago
kolla-cinder_scheduler-container.service > Active: inactive (dead) since Thu 2024-01-04 19:01:26 UTC; 30s ago
kolla-cron-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:02:06 UTC; 30s ago
kolla-designate_api-container.service > Active: inactive (dead) since Thu 2024-01-04 19:02:47 UTC; 30s ago
kolla-designate_backend_bind9-container.service > Active: inactive (dead) since Thu 2024-01-04 19:03:27 UTC; 30s ago
kolla-designate_central-container.service > Active: inactive (dead) since Thu 2024-01-04 19:04:10 UTC; 30s ago
kolla-designate_mdns-container.service > Active: inactive (dead) since Thu 2024-01-04 19:04:51 UTC; 30s ago
kolla-designate_producer-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:06:32 UTC; 30s ago
kolla-designate_sink-container.service > Active: inactive (dead) since Thu 2024-01-04 19:07:14 UTC; 30s ago
kolla-designate_worker-container.service > Active: inactive (dead) since Thu 2024-01-04 19:07:59 UTC; 30s ago
kolla-fluentd-container.service > Active: inactive (dead) since Thu 2024-01-04 19:08:41 UTC; 30s ago
kolla-glance_api-container.service > Active: inactive (dead) since Thu 2024-01-04 19:09:22 UTC; 30s ago
kolla-haproxy-container.service > Active: inactive (dead) since Thu 2024-01-04 19:10:03 UTC; 30s ago
kolla-haproxy_ssh-container.service > Active: inactive (dead) since Thu 2024-01-04 19:10:43 UTC; 30s ago
kolla-heat_api-container.service > Active: inactive (dead) since Thu 2024-01-04 19:11:25 UTC; 30s ago
kolla-heat_api_cfn-container.service > Active: inactive (dead) since Thu 2024-01-04 19:12:06 UTC; 30s ago
kolla-heat_engine-container.service > Active: inactive (dead) since Thu 2024-01-04 19:12:57 UTC; 30s ago
kolla-horizon-container.service > Active: inactive (dead) since Thu 2024-01-04 19:13:40 UTC; 30s ago
kolla-keepalived-container.service > Active: inactive (dead) since Thu 2024-01-04 19:14:21 UTC; 30s ago
kolla-keystone-container.service > Active: inactive (dead) since Thu 2024-01-04 19:15:03 UTC; 30s ago
kolla-keystone_fernet-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:15:44 UTC; 30s ago
kolla-keystone_ssh-container.service > Active: inactive (dead) since Thu 2024-01-04 19:16:24 UTC; 30s ago
kolla-kolla_toolbox-container.service > Active: inactive (dead) since Thu 2024-01-04 19:17:05 UTC; 30s ago
kolla-letsencrypt_lego-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:17:46 UTC; 30s ago
kolla-letsencrypt_webserver-container.service > Active: inactive (dead) since Thu 2024-01-04 19:18:27 UTC; 30s ago
kolla-magnum_api-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:19:07 UTC; 30s ago
kolla-magnum_conductor-container.service > Active: inactive (dead) since Thu 2024-01-04 19:19:57 UTC; 30s ago
kolla-mariadb-container.service > Active: inactive (dead) since Thu 2024-01-04 19:20:38 UTC; 30s ago
kolla-mariadb_clustercheck-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:21:18 UTC; 30s ago
kolla-memcached-container.service > Active: inactive (dead) since Thu 2024-01-04 19:21:59 UTC; 30s ago
kolla-neutron_bgp_dragent-container.service > Active: inactive (dead) since Thu 2024-01-04 19:22:45 UTC; 30s ago
kolla-neutron_dhcp_agent-container.service > Active: inactive (dead) since Thu 2024-01-04 19:23:27 UTC; 30s ago
kolla-neutron_l3_agent-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:24:08 UTC; 30s ago
kolla-neutron_metadata_agent-container.service > Active: inactive (dead) since Thu 2024-01-04 19:24:49 UTC; 30s ago
kolla-neutron_openvswitch_agent-container.service > Active: inactive (dead) since Thu 2024-01-04 19:25:53 UTC; 30s ago
kolla-neutron_server-container.service > Active: inactive (dead) since Thu 2024-01-04 19:27:31 UTC; 30s ago
kolla-nova_api-container.service > Active: inactive (dead) since Thu 2024-01-04 19:28:15 UTC; 30s ago
kolla-nova_conductor-container.service > Active: inactive (dead) since Thu 2024-01-04 19:29:28 UTC; 30s ago
kolla-nova_scheduler-container.service > Active: inactive (dead) since Thu 2024-01-04 19:30:14 UTC; 30s ago
kolla-nova_spicehtml5proxy-container.service > Active: inactive (dead) since Thu 2024-01-04 19:30:55 UTC; 30s ago
kolla-octavia_api-container.service > Active: inactive (dead) since Thu 2024-01-04 19:31:37 UTC; 30s ago
kolla-octavia_health_manager-container.service > Active: inactive (dead) since Thu 2024-01-04 19:32:18 UTC; 30s ago
kolla-octavia_housekeeping-container.service > Active: inactive (dead) since Thu 2024-01-04 19:33:00 UTC; 30s ago
kolla-octavia_worker-container.service > Active: inactive (dead) since Thu 2024-01-04 19:33:45 UTC; 30s ago
kolla-openvswitch_db-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:34:26 UTC; 30s ago
kolla-openvswitch_vswitchd-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:35:06 UTC; 30s ago
kolla-placement_api-container.service > Active: inactive (dead) since Thu 2024-01-04 19:35:48 UTC; 30s ago
kolla-proxysql-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:36:28 UTC; 30s ago
kolla-rabbitmq-container.service > Active: inactive (dead) since Thu 2024-01-04 19:37:16 UTC; 30s ago
kolla-redis-container.service > Active: inactive (dead) since Thu 2024-01-04 19:37:57 UTC; 30s ago
kolla-redis_sentinel-container.service > Active: inactive (dead) since Thu 2024-01-04 19:38:38 UTC; 30s ago
kolla-skyline_apiserver-container.service > Active: inactive (dead) since Thu 2024-01-04 19:39:19 UTC; 30s ago
kolla-skyline_console-container.service > Active: inactive (dead) since Thu 2024-01-04 19:40:00 UTC; 30s ago
Problem services (these exit with 143 every time):
kolla-cron-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:02:06 UTC; 30s ago
kolla-designate_producer-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:06:32 UTC; 30s ago
kolla-keystone_fernet-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:15:44 UTC; 30s ago
kolla-letsencrypt_lego-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:17:46 UTC; 30s ago
kolla-magnum_api-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:19:07 UTC; 30s ago
kolla-mariadb_clustercheck-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:21:18 UTC; 30s ago
kolla-neutron_l3_agent-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:24:08 UTC; 30s ago
kolla-openvswitch_db-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:34:26 UTC; 30s ago
kolla-openvswitch_vswitchd-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:35:06 UTC; 30s ago
kolla-proxysql-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:36:28 UTC; 30s ago
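The summary list is exactly the subset of the full listing whose Active state is failed rather than inactive (dead). A minimal filter sketch, assuming the "<unit> > Active: <state> (<detail>) since ..." line format used in this report; failed_units is a hypothetical helper. Note that the Active line only shows "Result: exit-code"; the numeric status (143 here) is what systemctl show -p ExecMainStatus <unit> reports.

```python
import re
from typing import Iterable, List

# Matches lines shaped like the listing in this report:
#   "<unit> > Active: <state> (<detail>) since ..."
STATUS_RE = re.compile(
    r"^(?P<unit>\S+\.service)\s*>\s*Active:\s*(?P<state>\S+)\s*\((?P<detail>[^)]*)\)"
)


def failed_units(lines: Iterable[str]) -> List[str]:
    """Hypothetical helper: return the units whose Active state is 'failed'."""
    result = []
    for line in lines:
        match = STATUS_RE.match(line.strip())
        if match and match.group("state") == "failed":
            result.append(match.group("unit"))
    return result


sample = [
    "kolla-cinder_api-container.service > Active: inactive (dead) since Thu 2024-01-04 19:00:43 UTC; 30s ago",
    "kolla-cron-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:02:06 UTC; 30s ago",
]
print(failed_units(sample))  # ['kolla-cron-container.service']
```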
Some background reading on exit code 143: https://www.groundcover.com/kubernetes-troubleshooting/exit-code-143 , https://<email address hidden>/msg30473.html
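The value 143 is not arbitrary: by the common shell convention, which Docker also follows for container exit codes, a process terminated by signal N is reported as 128 + N, so 143 = 128 + 15 (SIGTERM). A small decoding sketch; describe_exit_code is a hypothetical helper, and signal numbers come from Python's signal module for the host platform (SIGTERM is 15 on Linux).

```python
import signal


def describe_exit_code(code: int) -> str:
    """Hypothetical helper: explain a shell/Docker-style exit status.

    Statuses above 128 are decoded as 128 + signal number.
    """
    if code > 128:
        try:
            sig = signal.Signals(code - 128)
        except ValueError:  # not a valid signal number on this platform
            return f"exit status {code}"
        return f"terminated by {sig.name} ({code} = 128 + {sig.value})"
    return f"exit status {code}"


print(describe_exit_code(143))  # terminated by SIGTERM (143 = 128 + 15)
print(describe_exit_code(0))    # exit status 0
```

This also suggests why systemd marks these units failed. Assuming the unit's main process is a Docker client attached to the container (for example docker start -a), that client exits normally with status 143 instead of being killed by a signal, so systemd's default rule of treating SIGTERM as a clean stop does not apply. systemd's SuccessExitStatus=143 setting in the [Service] section is the standard way to declare such a status successful; whether the proposed patch uses it or changes the wait logic is not stated here.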
So I will send a patch to fix this, because 2 minutes to restart a service is really too much: with 5 controllers it adds up to 2 min × 5 controllers × 2 services (haproxy, proxysql) = 20 minutes instead of a few seconds.
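The 20-minute figure is a back-of-envelope product, assuming every affected restart burns the full 120 s wait-for-dead timeout and the controllers are handled one after another; worst_case_delay_min is a hypothetical helper.

```python
def worst_case_delay_min(timeout_s: int, controllers: int, services: int) -> float:
    """Hypothetical helper: extra wait in minutes if every affected unit
    restart hits the full timeout and restarts run serially."""
    return timeout_s * controllers * services / 60


# 120 s timeout x 5 controllers x 2 services (haproxy, proxysql)
print(worst_case_delay_min(120, 5, 2))  # 20.0
```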
Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/904805