Kolla-ansible times out after 120 seconds while stopping/starting services when systemd is used

Bug #2048130 reported by Michal Arbet
Affects         Status          Importance   Assigned to     Milestone
kolla-ansible   Fix Committed   High         Michal Arbet
Antelope        Fix Committed   High         Michal Arbet
Bobcat          Fix Committed   High         Michal Arbet
Caracal         Fix Committed   High         Michal Arbet

Bug Description

Hi,

I just found that kolla-ansible times out when starting/stopping services (only some of them). The reason is that when the docker container of some services receives SIGTERM, it returns RC 143 instead of RC 0.

I've already tested it and added some debug messages to the kolla_systemd_worker code, see below:

Unit kolla-haproxy-container.service - waiting for 5.
[... the same line repeats every 5 seconds until the timeout is reached ...]
Unit kolla-haproxy-container.service timeouted for wait for dead after 120.

Waiting for unit -> wait_for_unit(service=kolla-proxysql-container.service, timeout=120, state=deadUnit kolla-proxysql-container.service - waiting for 5.
Unit kolla-proxysql-container.service dead == dead, return.

Waiting for unit -> wait_for_unit(service=kolla-haproxy-container.service, timeout=120, state=runningUnit kolla-haproxy-container.service running == running, return.

Waiting for unit -> wait_for_unit(service=kolla-proxysql-container.service, timeout=120, state=runningUnit kolla-proxysql-container.service running == running, return.

The code checks for 'dead', but the unit's state is 'failed' (the service is still validly stopped, it just exited with 143 instead of 0).
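To make the failure mode concrete, the wait logic is roughly the following. This is a simplified sketch, not the actual kolla_systemd_worker code (the real module may query systemd differently, e.g. over D-Bus rather than shelling out to systemctl):

    import subprocess
    import time

    def get_unit_substate(service):
        # Ask systemd for the unit's SubState (e.g. 'running', 'dead', 'failed').
        out = subprocess.run(
            ['systemctl', 'show', '-p', 'SubState', '--value', service],
            capture_output=True, text=True, check=True)
        return out.stdout.strip()

    def wait_for_unit(service, timeout=120, state='dead', interval=5):
        # Poll until the unit reaches the expected SubState or the timeout expires.
        waited = 0
        while waited < timeout:
            if get_unit_substate(service) == state:
                return True
            # A container that exits with 143 leaves the unit in SubState
            # 'failed', never 'dead', so this loop runs for the full 120 s.
            time.sleep(interval)
            waited += interval
        return False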

Then I tried to stop all services with systemctl stop $service

kolla-cinder_api-container.service > Active: inactive (dead) since Thu 2024-01-04 19:00:43 UTC; 30s ago
kolla-cinder_scheduler-container.service > Active: inactive (dead) since Thu 2024-01-04 19:01:26 UTC; 30s ago
kolla-cron-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:02:06 UTC; 30s ago
kolla-designate_api-container.service > Active: inactive (dead) since Thu 2024-01-04 19:02:47 UTC; 30s ago
kolla-designate_backend_bind9-container.service > Active: inactive (dead) since Thu 2024-01-04 19:03:27 UTC; 30s ago
kolla-designate_central-container.service > Active: inactive (dead) since Thu 2024-01-04 19:04:10 UTC; 30s ago
kolla-designate_mdns-container.service > Active: inactive (dead) since Thu 2024-01-04 19:04:51 UTC; 30s ago
kolla-designate_producer-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:06:32 UTC; 30s ago
kolla-designate_sink-container.service > Active: inactive (dead) since Thu 2024-01-04 19:07:14 UTC; 30s ago
kolla-designate_worker-container.service > Active: inactive (dead) since Thu 2024-01-04 19:07:59 UTC; 30s ago
kolla-fluentd-container.service > Active: inactive (dead) since Thu 2024-01-04 19:08:41 UTC; 30s ago
kolla-glance_api-container.service > Active: inactive (dead) since Thu 2024-01-04 19:09:22 UTC; 30s ago
kolla-haproxy-container.service > Active: inactive (dead) since Thu 2024-01-04 19:10:03 UTC; 30s ago
kolla-haproxy_ssh-container.service > Active: inactive (dead) since Thu 2024-01-04 19:10:43 UTC; 30s ago
kolla-heat_api-container.service > Active: inactive (dead) since Thu 2024-01-04 19:11:25 UTC; 30s ago
kolla-heat_api_cfn-container.service > Active: inactive (dead) since Thu 2024-01-04 19:12:06 UTC; 30s ago
kolla-heat_engine-container.service > Active: inactive (dead) since Thu 2024-01-04 19:12:57 UTC; 30s ago
kolla-horizon-container.service > Active: inactive (dead) since Thu 2024-01-04 19:13:40 UTC; 30s ago
kolla-keepalived-container.service > Active: inactive (dead) since Thu 2024-01-04 19:14:21 UTC; 30s ago
kolla-keystone-container.service > Active: inactive (dead) since Thu 2024-01-04 19:15:03 UTC; 30s ago
kolla-keystone_fernet-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:15:44 UTC; 30s ago
kolla-keystone_ssh-container.service > Active: inactive (dead) since Thu 2024-01-04 19:16:24 UTC; 30s ago
kolla-kolla_toolbox-container.service > Active: inactive (dead) since Thu 2024-01-04 19:17:05 UTC; 30s ago
kolla-letsencrypt_lego-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:17:46 UTC; 30s ago
kolla-letsencrypt_webserver-container.service > Active: inactive (dead) since Thu 2024-01-04 19:18:27 UTC; 30s ago
kolla-magnum_api-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:19:07 UTC; 30s ago
kolla-magnum_conductor-container.service > Active: inactive (dead) since Thu 2024-01-04 19:19:57 UTC; 30s ago
kolla-mariadb-container.service > Active: inactive (dead) since Thu 2024-01-04 19:20:38 UTC; 30s ago
kolla-mariadb_clustercheck-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:21:18 UTC; 30s ago
kolla-memcached-container.service > Active: inactive (dead) since Thu 2024-01-04 19:21:59 UTC; 30s ago
kolla-neutron_bgp_dragent-container.service > Active: inactive (dead) since Thu 2024-01-04 19:22:45 UTC; 30s ago
kolla-neutron_dhcp_agent-container.service > Active: inactive (dead) since Thu 2024-01-04 19:23:27 UTC; 30s ago
kolla-neutron_l3_agent-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:24:08 UTC; 30s ago
kolla-neutron_metadata_agent-container.service > Active: inactive (dead) since Thu 2024-01-04 19:24:49 UTC; 30s ago
kolla-neutron_openvswitch_agent-container.service > Active: inactive (dead) since Thu 2024-01-04 19:25:53 UTC; 30s ago
kolla-neutron_server-container.service > Active: inactive (dead) since Thu 2024-01-04 19:27:31 UTC; 30s ago
kolla-nova_api-container.service > Active: inactive (dead) since Thu 2024-01-04 19:28:15 UTC; 30s ago
kolla-nova_conductor-container.service > Active: inactive (dead) since Thu 2024-01-04 19:29:28 UTC; 30s ago
kolla-nova_scheduler-container.service > Active: inactive (dead) since Thu 2024-01-04 19:30:14 UTC; 30s ago
kolla-nova_spicehtml5proxy-container.service > Active: inactive (dead) since Thu 2024-01-04 19:30:55 UTC; 30s ago
kolla-octavia_api-container.service > Active: inactive (dead) since Thu 2024-01-04 19:31:37 UTC; 30s ago
kolla-octavia_health_manager-container.service > Active: inactive (dead) since Thu 2024-01-04 19:32:18 UTC; 30s ago
kolla-octavia_housekeeping-container.service > Active: inactive (dead) since Thu 2024-01-04 19:33:00 UTC; 30s ago
kolla-octavia_worker-container.service > Active: inactive (dead) since Thu 2024-01-04 19:33:45 UTC; 30s ago
kolla-openvswitch_db-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:34:26 UTC; 30s ago
kolla-openvswitch_vswitchd-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:35:06 UTC; 30s ago
kolla-placement_api-container.service > Active: inactive (dead) since Thu 2024-01-04 19:35:48 UTC; 30s ago
kolla-proxysql-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:36:28 UTC; 30s ago
kolla-rabbitmq-container.service > Active: inactive (dead) since Thu 2024-01-04 19:37:16 UTC; 30s ago
kolla-redis-container.service > Active: inactive (dead) since Thu 2024-01-04 19:37:57 UTC; 30s ago
kolla-redis_sentinel-container.service > Active: inactive (dead) since Thu 2024-01-04 19:38:38 UTC; 30s ago
kolla-skyline_apiserver-container.service > Active: inactive (dead) since Thu 2024-01-04 19:39:19 UTC; 30s ago
kolla-skyline_console-container.service > Active: inactive (dead) since Thu 2024-01-04 19:40:00 UTC; 30s ago

Problem services (always exit with 143):

kolla-cron-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:02:06 UTC; 30s ago
kolla-designate_producer-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:06:32 UTC; 30s ago
kolla-keystone_fernet-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:15:44 UTC; 30s ago
kolla-letsencrypt_lego-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:17:46 UTC; 30s ago
kolla-magnum_api-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:19:07 UTC; 30s ago
kolla-mariadb_clustercheck-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:21:18 UTC; 30s ago
kolla-neutron_l3_agent-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:24:08 UTC; 30s ago
kolla-openvswitch_db-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:34:26 UTC; 30s ago
kolla-openvswitch_vswitchd-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:35:06 UTC; 30s ago
kolla-proxysql-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:36:28 UTC; 30s ago

Some reading about this: https://www.groundcover.com/kubernetes-troubleshooting/exit-code-143 , https://<email address hidden>/msg30473.html

So I will send a patch to fix this, as 2 minutes per service restart is really too much: with 5 controllers that means 2 minutes x 5 controllers x 2 services (haproxy, proxysql) = 20 minutes instead of a few seconds.

Changed in kolla-ansible:
assignee: nobody → Michal Arbet (michalarbet)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (master)
Changed in kolla-ansible:
status: New → In Progress
Changed in kolla-ansible:
importance: Undecided → High
Revision history for this message
Dr. Jens Harbott (j-harbott) wrote :

For containers running cron this seems plausible; cron doesn't expect to be stopped. Maybe we should consider replacing those with something based on systemd timers?

For the openstack containers I cannot reproduce the issue, e.g. kolla-designate_producer-container.service stops just cleanly for me. Can you check the log in /var/log/kolla for the affected service whether it shows some issue during shutdown?
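(On the systemd timers suggestion above: a minimal sketch of what a timer-based replacement for a cron-style container job could look like. The unit names, container name, and command are purely illustrative, this is not an existing kolla-ansible unit.)

    # /etc/systemd/system/kolla-example-job.service (illustrative)
    [Unit]
    Description=Example periodic job that currently lives in the cron container

    [Service]
    Type=oneshot
    # Placeholder command; the real job would exec into the relevant container
    ExecStart=/usr/bin/docker exec some_container /path/to/periodic-job.sh

    # /etc/systemd/system/kolla-example-job.timer (illustrative)
    [Unit]
    Description=Run the example job periodically

    [Timer]
    OnCalendar=hourly
    Persistent=true

    [Install]
    WantedBy=timers.target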

Revision history for this message
Michal Arbet (michalarbet) wrote :

Okay, now I know why I reported designate.

I was iterating through the systemd services and calling stop against them. For a specific reason, on controller0 I have designate's DB connection pointed at controller0.

BUT as I stopped mariadb before designate, designate no longer had mariadb available, so it failed. On controller1 it should be OK in my environment.

2024-01-04 19:47:39.177 7 INFO oslo_service.service [None req-c370c25f-22f5-4cf9-b663-001fa6d7b48c - - - - - -] Caught SIGTERM, stopping children
2024-01-04 19:47:39.183 7 INFO designate.service [None req-c370c25f-22f5-4cf9-b663-001fa6d7b48c - - - - - -] Stopping producer service
2024-01-04 19:47:39.186 7 INFO oslo_service.service [None req-c370c25f-22f5-4cf9-b663-001fa6d7b48c - - - - - -] Waiting on 5 children to exit
2024-01-04 19:47:40.578 20 INFO designate.service [None req-3c99cc92-ed1c-4018-a7e5-fe9834d9e8c5 - - - - - -] Stopping producer service
2024-01-04 19:47:40.580 20 ERROR oslo_service.threadgroup [None req-3c99cc92-ed1c-4018-a7e5-fe9834d9e8c5 - - - - - -] Error waiting on timer.: oslo_messaging.rpc.client.RemoteError: Remote error: DBConnectionError (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '192.168.205.10' ([Errno 111] ECONNREFUSED)")
[SQL: SELECT 1]
(Background on this error at: https://sqlalche.me/e/14/e3q8)
['Traceback (most recent call last):\n', ' File "/var/lib/kolla/venv/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1900, in _execute_context\n self.dialect.do_execute(\n', ' File "/var/lib/kolla/venv/lib/python3.11/site-packages/sqlalchemy/engine/default.py", line 736, in do_execute\n cursor.execute(statement, parameters)\n', ' File "/var/lib/kolla/venv/lib/python3.11/site-packages/pymysql/cursors.py", line 158, in execute\n result = self._query(query)\n ^^^^^^^^^^^^^^^^^^\n', ' File "/var/lib/kolla/venv/lib/python3.11/site-packages/pymysql/cursors.py", line 325, in _query\n conn.query(q)\n', ' File "/var/lib/kolla/venv/lib/python3.11/site-packages/pymysql/connections.py", line 549, in query\n self._affected_rows = self._read_query_result(unbuffered=unbuffered)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n', ' File "/var/lib/kolla/venv/lib/python3.11/site-packages/pymysql/connections.py", line 779, in _read_query_result\n result.read()\n', ' File "/var/lib/kolla/venv/lib/python3.11/site-packages/pymysql/connections.py", line 1157, in read\n first_packet = self.connection._read_packet()\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n', ' File "/var/lib/kolla/venv/lib/python3.11/site-packages/pymysql/connections.py", line 696, in _read_packet\n packet_header = self._read_bytes(4)\n ^^^^^^^^^^^^^^^^^^^\n', ' File "/var/lib/kolla/venv/lib/python3.11/site-packages/pymysql/connections.py", line 752, in _read_bytes\n raise err.OperationalError(\n', "pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')\n", '\nThe above exception was the direct cause of the following exception:\n\n', 'Traceback (most recent call last):\n', ' File "/var/lib/kolla/venv/lib/python3.11/site-packages/oslo_db/sqlalchemy/engines...

Revision history for this message
Michal Arbet (michalarbet) wrote :

But I am afraid that we still can't fix the non-OpenStack services ...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (master)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/904805
Committed: https://opendev.org/openstack/kolla-ansible/commit/b1fd2b40f7cd1c6c457bd42b25ca32dc1e5e0f4f
Submitter: "Zuul (22348)"
Branch: master

commit b1fd2b40f7cd1c6c457bd42b25ca32dc1e5e0f4f
Author: Michal Arbet <email address hidden>
Date: Thu Jan 4 22:26:13 2024 +0100

    Fix long service restarts while using systemd

    Some containers exiting with 143 instead of 0, but
    this is still OK. This patch just allows
    ExitCode 143 (SIGTERM) as fix. Details in
    bugreport.

    Services which exited with 143 (SIGTERM):

    kolla-cron-container.service
    kolla-designate_producer-container.service
    kolla-keystone_fernet-container.service
    kolla-letsencrypt_lego-container.service
    kolla-magnum_api-container.service
    kolla-mariadb_clustercheck-container.service
    kolla-neutron_l3_agent-container.service
    kolla-openvswitch_db-container.service
    kolla-openvswitch_vswitchd-container.service
    kolla-proxysql-container.service

    Partial-Bug: #2048130
    Change-Id: Ia8c85d03404cfb368e4013066c67acd2a2f68deb
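The commit message describes the approach (accept exit code 143 as success) but not the mechanics, and I have not verified the exact diff. In systemd terms the natural knob is the SuccessExitStatus directive; an illustrative unit fragment, not the actual kolla-ansible template:

    [Service]
    # Treat SIGTERM's conventional exit code 143 as a clean stop, so after
    # 'systemctl stop' the unit ends up 'inactive (dead)' instead of 'failed'.
    SuccessExitStatus=143

With something like that in place, the wait for the 'dead' SubState succeeds as soon as the container stops, instead of running into the 120-second timeout.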

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/2023.2)

Fix proposed to branch: stable/2023.2
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/904740

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/2023.1)

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/904841

no longer affects: kolla-ansible/yoga
no longer affects: kolla-ansible/zed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/904841
Committed: https://opendev.org/openstack/kolla-ansible/commit/09ca4bdcf5962a52edc95bdb015ca7e748443c0a
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit 09ca4bdcf5962a52edc95bdb015ca7e748443c0a
Author: Michal Arbet <email address hidden>
Date: Thu Jan 4 22:26:13 2024 +0100

    Fix long service restarts while using systemd

    Some containers exiting with 143 instead of 0, but
    this is still OK. This patch just allows
    ExitCode 143 (SIGTERM) as fix. Details in
    bugreport.

    Services which exited with 143 (SIGTERM):

    kolla-cron-container.service
    kolla-designate_producer-container.service
    kolla-keystone_fernet-container.service
    kolla-letsencrypt_lego-container.service
    kolla-magnum_api-container.service
    kolla-mariadb_clustercheck-container.service
    kolla-neutron_l3_agent-container.service
    kolla-openvswitch_db-container.service
    kolla-openvswitch_vswitchd-container.service
    kolla-proxysql-container.service

    Partial-Bug: #2048130
    Change-Id: Ia8c85d03404cfb368e4013066c67acd2a2f68deb
    (cherry picked from commit b1fd2b40f7cd1c6c457bd42b25ca32dc1e5e0f4f)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/2023.2)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/904740
Committed: https://opendev.org/openstack/kolla-ansible/commit/51ca1bc6967d56cd6534a0280c17a9cf6677408d
Submitter: "Zuul (22348)"
Branch: stable/2023.2

commit 51ca1bc6967d56cd6534a0280c17a9cf6677408d
Author: Michal Arbet <email address hidden>
Date: Thu Jan 4 22:26:13 2024 +0100

    Fix long service restarts while using systemd

    Some containers exiting with 143 instead of 0, but
    this is still OK. This patch just allows
    ExitCode 143 (SIGTERM) as fix. Details in
    bugreport.

    Services which exited with 143 (SIGTERM):

    kolla-cron-container.service
    kolla-designate_producer-container.service
    kolla-keystone_fernet-container.service
    kolla-letsencrypt_lego-container.service
    kolla-magnum_api-container.service
    kolla-mariadb_clustercheck-container.service
    kolla-neutron_l3_agent-container.service
    kolla-openvswitch_db-container.service
    kolla-openvswitch_vswitchd-container.service
    kolla-proxysql-container.service

    Partial-Bug: #2048130
    Change-Id: Ia8c85d03404cfb368e4013066c67acd2a2f68deb
    (cherry picked from commit b1fd2b40f7cd1c6c457bd42b25ca32dc1e5e0f4f)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (master)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on kolla-ansible (master)

Change abandoned by "Michal Nasiadka <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/905208
Reason: does not help

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (master)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/905121
Committed: https://opendev.org/openstack/kolla-ansible/commit/f1331fa8875678377dd2c54e58d3cbe8d2aec45f
Submitter: "Zuul (22348)"
Branch: master

commit f1331fa8875678377dd2c54e58d3cbe8d2aec45f
Author: Michal Nasiadka <email address hidden>
Date: Tue Jan 9 15:34:12 2024 +0100

    haproxy: Use -f configdir instead of for and xargs

    -f configdir has been supported in HaProxy since 1.7

    Partial-Bug: #2048130
    Change-Id: Icb95cb072cb3eb5c27ffd79a127069bfcf21a61a
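For context on this change (paths below are illustrative, not the exact kolla wrapper): when -f is given a directory, HAProxy loads every configuration file inside it, so the start script no longer has to assemble the flag list with a shell loop or xargs.

    # Before: build one -f flag per config file and pass them via xargs
    find /etc/haproxy/services.d/ -name '*.cfg' -printf '-f\n%p\n' \
        | xargs haproxy -W -db -f /etc/haproxy/haproxy.cfg

    # After (HAProxy >= 1.7): point -f at the directory itself
    haproxy -W -db -f /etc/haproxy/haproxy.cfg -f /etc/haproxy/services.d/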

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/905131
Committed: https://opendev.org/openstack/kolla-ansible/commit/4f031961a39b611586105ee707fdc485e55c9518
Submitter: "Zuul (22348)"
Branch: master

commit 4f031961a39b611586105ee707fdc485e55c9518
Author: Michal Nasiadka <email address hidden>
Date: Tue Jan 9 17:14:37 2024 +0100

    mariadb-clustercheck: Use socat wrapper

    Depends-On: https://review.opendev.org/c/openstack/kolla/+/904981

    Partial-Bug: #2048130
    Change-Id: Ib971b694c12ccb51e0893978dd5962040f2fbd22

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/905117
Committed: https://opendev.org/openstack/kolla-ansible/commit/5aac09c9a21918b499ecbded67704fbfe1e44349
Submitter: "Zuul (22348)"
Branch: master

commit 5aac09c9a21918b499ecbded67704fbfe1e44349
Author: Michal Nasiadka <email address hidden>
Date: Tue Jan 9 14:16:45 2024 +0100

    openvswitch: Stop using intermediate scripts

    This will fix exit codes - details in bug.

    Basically openvswitch treats TERM as something fatal (exits
    with 143) and the only solution for graceful exit is using
    ovs-appctl and sending exit command, just like ovs-ctl utility
    does.

    Depends-On: https://review.opendev.org/c/openstack/kolla/+/905189

    Partial-Bug: #2048130
    Change-Id: I523018cb98944de60d7b95404deb7cebd641f33a
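For reference, the graceful shutdown that ovs-ctl performs boils down to standard Open vSwitch commands like these (standard OVS tooling, not necessarily the exact commands used in the kolla change):

    # Ask each daemon to exit cleanly instead of sending it SIGTERM
    ovs-appctl -t ovs-vswitchd exit
    ovs-appctl -t ovsdb-server exit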
