Cascade delete with prom listener fails

Bug #2038529 reported by Maximilian Stinsky
Affects: octavia
Status: In Progress
Importance: Medium
Assigned to: Gregory Thiemonge

Bug Description

Greetings,

a couple of days ago we upgraded Octavia to Yoga (10.1.0) in our test environment.
We also upgraded our octavia-tempest-plugin version to 2.4.1 to get the new prometheus listener tests.

Since those upgrades, tempest fails in its tearDownClass in `octavia_tempest_plugin.tests.api.v2.test_listener.ListenerAPITest.*`.

As this fails 'almost' every time for us, I tried to debug it, and it seems to me that there could be a race condition in the cascade delete.

The traceback I am getting for why the cascade delete is not working is the following:
Traceback (most recent call last):
  File "/var/lib/kolla/venv/lib/python3.8/site-packages/taskflow/engines/action_engine/executor.py", line 53, in _execute_task
    result = task.execute(**arguments)
  File "/var/lib/kolla/venv/lib/python3.8/site-packages/octavia/controller/worker/v2/tasks/network_tasks.py", line 704, in execute
    self.network_driver.update_vip(loadbalancer, for_delete=True)
  File "/var/lib/kolla/venv/lib/python3.8/site-packages/octavia/network/drivers/neutron/allowed_address_pairs.py", line 644, in update_vip
    self._update_security_group_rules(load_balancer,
  File "/var/lib/kolla/venv/lib/python3.8/site-packages/octavia/network/drivers/neutron/allowed_address_pairs.py", line 221, in _update_security_group_rules
    self._create_security_group_rule(sec_grp_id, port_protocol[1],
  File "/var/lib/kolla/venv/lib/python3.8/site-packages/octavia/network/drivers/neutron/base.py", line 160, in _create_security_group_rule
    self.neutron_client.create_security_group_rule(rule)
  File "/var/lib/kolla/venv/lib/python3.8/site-packages/neutronclient/v2_0/client.py", line 1049, in create_security_group_rule
    return self.post(self.security_group_rules_path, body=body)
  File "/var/lib/kolla/venv/lib/python3.8/site-packages/neutronclient/v2_0/client.py", line 361, in post
    return self.do_request("POST", action, body=body,
  File "/var/lib/kolla/venv/lib/python3.8/site-packages/neutronclient/v2_0/client.py", line 297, in do_request
    self._handle_fault_response(status_code, replybody, resp)
  File "/var/lib/kolla/venv/lib/python3.8/site-packages/neutronclient/v2_0/client.py", line 272, in _handle_fault_response
    exception_handler_v20(status_code, error_body)
  File "/var/lib/kolla/venv/lib/python3.8/site-packages/neutronclient/v2_0/client.py", line 90, in exception_handler_v20
    raise client_exc(message=error_message,
neutronclient.common.exceptions.Conflict: Security group rule already exists. Rule id is 08bedc57-cc6e-41bb-8a13-597887980dc5.
Neutron server returns request_ids: ['req-f1bdc5cc-bfda-412d-952a-98eb4e18dc81']

This is getting triggered by the following flow:
Task 'delete_update_vip_8beed3b6-b8e8-472b-a9a4-883a52675176' (33c5a41f-f3ab-4406-831e-4175d353d585) transitioned into state 'FAILURE' from state 'RUNNING'

After digging through the code, it looks like the delete goes through the following code [1], which it should never go through on a delete task, should it?

If I downgrade the octavia-tempest-plugin to a version that does not include the Prometheus protocol, the delete always works without any issue, which makes me believe that there might be some race condition when the new prometheus listener is configured on a loadbalancer.

An LB that ends up in provisioning_status ERROR after a cascade delete can be deleted correctly by executing a cascade delete a second time on the loadbalancer.

Does anyone have an idea what this could be triggered by?

[1] https://github.com/openstack/octavia/blob/10.1.0/octavia/network/drivers/neutron/allowed_address_pairs.py#L220-L225
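
For illustration, here is a rough, self-contained sketch (hypothetical names, not the real octavia/neutronclient code) of the fetch -> diff -> create pattern that [1] follows, and of how two such tasks running at the same time can both decide the same rule is missing, so the slower one hits the "already exists" conflict:

import threading
import time


class FakeNeutron:
    """Stand-in for the Neutron API: creating a duplicate rule conflicts."""

    def __init__(self):
        self._rules = set()
        self._lock = threading.Lock()

    def list_rules(self, sec_grp_id):
        with self._lock:
            return set(self._rules)

    def create_rule(self, sec_grp_id, port):
        with self._lock:
            if port in self._rules:
                # plays the role of neutronclient.common.exceptions.Conflict
                raise RuntimeError("Security group rule already exists")
            self._rules.add(port)


def update_sg_rules(neutron, sec_grp_id, desired_ports):
    existing = neutron.list_rules(sec_grp_id)   # 1. fetch current rules
    missing = set(desired_ports) - existing     # 2. compute the delta
    time.sleep(0.1)                             # widen the race window
    for port in missing:                        # 3. create the "missing" rules
        neutron.create_rule(sec_grp_id, port)


def run_task(name, neutron):
    try:
        update_sg_rules(neutron, "sg-1", {80, 443})
        print(name, "succeeded")
    except RuntimeError as exc:
        print(name, "failed:", exc)


neutron = FakeNeutron()
tasks = [threading.Thread(target=run_task, args=("task-%d" % i, neutron))
         for i in range(2)]
for t in tasks:
    t.start()
for t in tasks:
    t.join()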

Revision history for this message
Gregory Thiemonge (gthiemonge) wrote :

a similar issue was reported (by me) for the scenario tests:

https://storyboard.openstack.org/#!/story/2010338

IIRC I investigated this issue but I didn't find anything

@maximilian is it 100% reproducible in your env?

in the CI, we run the API tests in noop mode (meaning we have dummy neutron/amphora/etc. drivers), so maybe we don't see this issue because it's hidden by the dummy drivers. We can try to create a job with the default drivers to test it.
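
(For anyone not familiar with "noop mode": conceptually the dummy drivers are just stubs, something like the sketch below, so the SG-rule calls never reach Neutron and a duplicate-rule conflict can never surface. This is only an illustration, not Octavia's actual noop driver.)

class NoopNetworkDriver:
    """Illustrative stub only: records the call and talks to nothing."""

    def update_vip(self, loadbalancer, for_delete=False):
        # a real driver would fetch/diff/create SG rules in Neutron here;
        # the noop driver does nothing, so bugs in that path stay hidden
        print("noop update_vip(loadbalancer=%r, for_delete=%r)"
              % (loadbalancer, for_delete))


NoopNetworkDriver().update_vip("lb-1", for_delete=True)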

--

a few notes from the storyboard:

It's weird that Octavia sends a POST request to Neutron during the deletion of a LB. Maybe on (cascade) DELETE requests, we could:

* either avoid adding/creating resources
* or ignore conflicts (see the sketch below)
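
Roughly what the "ignore conflicts" option could look like, assuming we catch the same exception class the traceback shows (the wrapper function is made up, not an existing driver method):

from neutronclient.common import exceptions as neutron_exc


def create_sg_rule_ignoring_conflict(neutron_client, rule):
    """Hypothetical helper: tolerate 'already exists' on (cascade) DELETE."""
    try:
        return neutron_client.create_security_group_rule(rule)
    except neutron_exc.Conflict:
        # another task already created the identical rule; the end state is
        # the same, so on a delete path the conflict can safely be ignored
        return None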

Revision history for this message
Maximilian Stinsky (mstinsky) wrote :

Hi Gregory,

yes, it's very reproducible. Almost every time I run `octavia_tempest_plugin.tests.api.v2.test_listener.ListenerAPITest.*` against our test env this happens in the tearDownClass. And like I said, only when the prometheus listener tests are also running, never without them.

Our test env is a fully functional openstack installation, so there are no noop drivers in place.

Is there anything you would suggest I provide to get more information about this issue?

Revision history for this message
Gregory Thiemonge (gthiemonge) wrote :

I just reproduced it in the CI (in an API job with default drivers)

https://zuul.opendev.org/t/openstack/build/f05b0aa44bfc4189a95be4b1872d6f33/log/controller/logs/screen-o-cw.txt#7922

Oct 05 12:44:19.237221 np0035422599 octavia-worker[119890]: WARNING octavia.controller.worker.v2.controller_worker [-] Task 'delete_update_vip_844703f4-f450-46af-8773-26f6040e68de' (d00c21cc-ccf6-43eb-8ac4-81aafd6c9c89) transitioned into state 'FAILURE' from state 'RUNNING'
Oct 05 12:44:19.237221 np0035422599 octavia-worker[119890]: 6 predecessors (most recent first):
Oct 05 12:44:19.237221 np0035422599 octavia-worker[119890]: Flow 'octavia-delete-listener_flow-844703f4-f450-46af-8773-26f6040e68de'
Oct 05 12:44:19.237221 np0035422599 octavia-worker[119890]: |__Flow 'listeners_delete_flow'
Oct 05 12:44:19.237221 np0035422599 octavia-worker[119890]: |__Atom 'octavia.controller.worker.v2.tasks.database_tasks.MarkLBAmphoraeHealthBusy' {'intention': 'EXECUTE', 'state': 'SUCCESS', 'requires': {'loadbalancer': {'admin_state_up': True, 'description': None, 'loadbalancer_id': '1aa6ea94-27cb-4575-ae69-ebf35878eaad', 'name': 'tempest-lb_member_lb1_listener-511914142', 'project_id': '242390148eb444ff83aaa0b3ac34bd2b', 'vip_address': 'fd05:8ba:c9a3:1::295', 'vip_network_id': '93188c95-a1bd-41a3-b1d9-25edfc27e0f9', 'vip_port_id': '48c4de05-a690-4a1d-8549-e52489fd54aa', 'vip_subnet_id': 'e1093e40-f811-452e-a5c8-59b683c94398', 'vip_qos_policy_id': None, 'availability_zone': None}}, 'provides': None}
Oct 05 12:44:19.237221 np0035422599 octavia-worker[119890]: |__Atom 'octavia.controller.worker.v2.tasks.compute_tasks.NovaServerGroupDelete' {'intention': 'EXECUTE', 'state': 'SUCCESS', 'requires': {'server_group_id': None}, 'provides': None}
Oct 05 12:44:19.237221 np0035422599 octavia-worker[119890]: |__Atom 'octavia.controller.worker.v2.tasks.lifecycle_tasks.LoadBalancerToErrorOnRevertTask' {'intention': 'EXECUTE', 'state': 'SUCCESS', 'requires': {'loadbalancer': {'admin_state_up': True, 'description': None, 'loadbalancer_id': '1aa6ea94-27cb-4575-ae69-ebf35878eaad', 'name': 'tempest-lb_member_lb1_listener-511914142', 'project_id': '242390148eb444ff83aaa0b3ac34bd2b', 'vip_address': 'fd05:8ba:c9a3:1::295', 'vip_network_id': '93188c95-a1bd-41a3-b1d9-25edfc27e0f9', 'vip_port_id': '48c4de05-a690-4a1d-8549-e52489fd54aa', 'vip_subnet_id': 'e1093e40-f811-452e-a5c8-59b683c94398', 'vip_qos_policy_id': None, 'availability_zone': None}}, 'provides': None}
Oct 05 12:44:19.237221 np0035422599 octavia-worker[119890]: |__Flow 'octavia-delete-loadbalancer-flow': neutronclient.common.exceptions.Conflict: Security group rule already exists. Rule id is a9e62944-c466-4a27-ac51-a057c8ac9646.
Oct 05 12:44:19.237221 np0035422599 octavia-worker[119890]: Neutron server returns request_ids: ['req-bd6fc2f2-e1a4-460a-8865-7c9b22a32660']
Oct 05 12:44:19.237221 np0035422599 octavia-worker[119890]: ERROR octavia.controller.worker.v2.controller_worker Traceback (most recent call last):
Oct 05 12:44:19.237221 np0035422599 octavia-worker[119890]: ERROR octavia.controller.worker.v2.controller_worker File "/usr/local/lib...


Revision history for this message
Gregory Thiemonge (gthiemonge) wrote (last edit ):

I may have found the root cause,

The deletion of the listeners is performed in parallel in a cascade delete:

https://opendev.org/openstack/octavia/src/branch/master/octavia/controller/worker/v2/flows/load_balancer_flows.py#L267-L271
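
For readers not familiar with the flow code, the structure boils down to something like this stripped-down taskflow sketch (task/flow names here are illustrative, not the real Octavia classes): the per-listener subflows live in an unordered flow, and on a parallel engine their update-VIP-style tasks all run at the same time.

from taskflow import engines, task
from taskflow.patterns import linear_flow, unordered_flow


class DeleteUpdateVIP(task.Task):
    def execute(self):
        print(self.name, "- fetch SG rules, diff, maybe create rules")


class DeleteListener(task.Task):
    def execute(self):
        print(self.name, "- delete listener")


listeners_delete = unordered_flow.Flow('listeners_delete_flow')
for listener_id in ('aaa', 'bbb', 'ccc'):
    sub = linear_flow.Flow('listener-delete-' + listener_id)
    sub.add(DeleteUpdateVIP(name='delete_update_vip_' + listener_id),
            DeleteListener(name='delete_listener_' + listener_id))
    listeners_delete.add(sub)

# with a parallel engine the subflows (and their update-VIP tasks) overlap
engines.run(listeners_delete, engine='parallel', max_workers=3)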

In a local run:

Oct 05 07:42:47 gthiemon-devstack octavia-worker[2318143]: DEBUG octavia.common.base_taskflow [-] Task 'delete_update_vip_22963119-46f2-4602-ad3e-383a901ef690' (05c388fb-869e-46c6-8262-253267b1d961) transitioned into state 'RUNNING' from state 'PENDING' {{(pid=2318143) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Oct 05 07:42:47 gthiemon-devstack octavia-worker[2318143]: DEBUG octavia.common.base_taskflow [-] Task 'delete_update_vip_206f37f9-9472-45b6-83ac-ca56728e29db' (31a421ad-f23e-4e27-a285-3521427e5776) transitioned into state 'RUNNING' from state 'PENDING' {{(pid=2318143) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Oct 05 07:42:47 gthiemon-devstack octavia-worker[2318143]: DEBUG octavia.common.base_taskflow [-] Task 'delete_update_vip_41f809ea-b22c-4640-899a-6283163a6644' (f19ef136-79f0-4002-8b6d-58df240f750e) transitioned into state 'RUNNING' from state 'PENDING' {{(pid=2318143) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
Oct 05 07:42:47 gthiemon-devstack octavia-worker[2318143]: DEBUG octavia.common.base_taskflow [-] Task 'delete_update_vip_b60a0b2c-eb69-425d-bf9e-b3932c21d25e' (c95ec23e-2641-4d76-8436-a147c028bb25) transitioned into state 'RUNNING' from state 'PENDING' {{(pid=2318143) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:190}}
[..]
Oct 05 07:42:52 gthiemon-devstack octavia-worker[2318143]: DEBUG octavia.common.base_taskflow [-] Task 'delete_update_vip_d7f478ef-0d2c-4ae1-9c0e-43dae2ed81fb' (84d40bae-c0a8-4696-880b-6785fcebe9a9) transitioned into state 'SUCCESS' from state 'RUNNING' with result 'None' {{(pid=2318143) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:178}}
Oct 05 07:42:53 gthiemon-devstack octavia-worker[2318143]: DEBUG octavia.common.base_taskflow [-] Task 'delete_update_vip_3385191a-22a5-482c-b446-a09db3fb7de9' (fe60f53f-ea74-4265-a53a-07152f99af29) transitioned into state 'SUCCESS' from state 'RUNNING' with result 'None' {{(pid=2318143) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:178}}
Oct 05 07:42:53 gthiemon-devstack octavia-worker[2318143]: DEBUG octavia.common.base_taskflow [-] Task 'delete_update_vip_14c1e5ee-ab93-4c3a-a880-a71bf30f5320' (ea2844ff-2c16-4ea6-a956-5234df516171) transitioned into state 'SUCCESS' from state 'RUNNING' with result 'None' {{(pid=2318143) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:178}}
Oct 05 07:42:54 gthiemon-devstack octavia-worker[2318143]: DEBUG octavia.common.base_taskflow [-] Task 'delete_update_vip_700e6723-8990-4494-933c-a8d94f6d3b36' (dd5f7980-88e1-4698-866f-4004c34fefce) transitioned into state 'SUCCESS' from state 'RUNNING' with result 'None' {{(pid=2318143) _task_receiver /opt/stack/taskflow/taskflow/listeners/logging.py:178}}

And each task fetches the current state of the SGs, then computes a delta and applies new SGs (if needed) when updating the VIP. So yes,...


Changed in octavia:
importance: Undecided → Medium
status: New → Confirmed
Revision history for this message
Gregory Thiemonge (gthiemonge) wrote :

Note: we don't need to call update_vip for each listener during the cascade delete flow, we only need to call it once, before deleting the listeners.
That would fix this bug and optimize the flow.
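
Conceptually (this is just a sketch of the idea, not necessarily how the proposed patch implements it), that means hoisting a single VIP update into the linear part of the cascade delete flow, before the unordered fan-out over the listeners:

from taskflow import task
from taskflow.patterns import linear_flow, unordered_flow


class DeleteUpdateVIP(task.Task):
    def execute(self):
        print(self.name, "- single VIP/SG update for the whole LB")


class DeleteListener(task.Task):
    def execute(self):
        print(self.name, "- delete listener")


cascade_delete = linear_flow.Flow('cascade-delete-lb-flow')
# one VIP update for the whole load balancer, done before any listener work
cascade_delete.add(DeleteUpdateVIP(name='delete_update_vip_once'))

listeners_delete = unordered_flow.Flow('listeners_delete_flow')
for listener_id in ('aaa', 'bbb', 'ccc'):
    listeners_delete.add(DeleteListener(name='delete_listener_' + listener_id))

# the linear parent flow guarantees the VIP update finishes first, so the
# concurrent create_security_group_rule calls against Neutron go away
cascade_delete.add(listeners_delete)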

Changed in octavia:
assignee: nobody → Gregory Thiemonge (gthiemonge)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to octavia (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/octavia/+/897466

Changed in octavia:
status: Confirmed → In Progress
Revision history for this message
Maximilian Stinsky (mstinsky) wrote :

Thanks Gregory for the quick proposed fix.

So the problem was not the prometheus listeners themselves, but the additional listeners added to tempest made the race condition trigger more often.

I will test your patch in our test env for the yoga release tomorrow and will give you feedback.

Revision history for this message
Maximilian Stinsky (mstinsky) wrote :

Hi Gregory,

I just backported your patch to yoga (10.1.0) in our test env and can confirm that this bug seems to be fixed with it.

Thanks for the work! I would appreciate it if, once this is merged to master, we could get backports to earlier releases :)

Revision history for this message
Gregory Thiemonge (gthiemonge) wrote :

Hi Maximilian, thanks for testing it! Could you CR+1 the patch if it works for you, please?

CI is failing for other reasons (grenade jobs broken), but a review would be appreciated
