Comment 0 for bug 2003359

Revision history for this message
Anton Kurbatov (akurbatov) wrote :

We found the issue when a created HA DVR router gets stuck in the backup state and does not go into primary state.
Preconditions:
1) there is no router with a specific external network yet
2) the router needs to go through a quick creation->deletion, and then the next creation of the router can get stuck in the backup state

The reason for such behavior is not removed fip-ns on the agent while the floatingip_agent_gateway port was removed.
Further is a demo with the help of which I managed to reproduce this behavior on a single node devstack setup with.

Сreate a router and quickly delete it while the l3 agent processes the external GW adding:

[root@devstack ~]# r_id=$(openstack router create r1 --distributed --ha -c id -f value); sleep 30 # give time to process
[root@devstack ~]# count_fip_requests() { journalctl -u <email address hidden> | grep 'FloatingIP agent gateway port received' | wc -l; }
[root@devstack ~]# # add an external gateway and then delete the router while the agent processes gw
[root@devstack ~]# fip_requests=$(count_fip_requests); openstack router set $r_id --external-gateway public; while :; do [[ $fip_requests == $(count_fip_requests) ]] && { echo "waiting before deletion..."; sleep 1; } || break; done; openstack router delete $r_id
waiting before deletion...
waiting before deletion...
[root@devstack ~]#

As a result fip-ns is not deleted even though the floatingip_agent_gateway port was removed:

[root@devstack ~]# ip netns
fip-8d4bc2d5-c6e7-44d0-99f7-1333bafa991f (id: 1)
[root@devstack ~]# openstack port list --network public -c ID -c device_owner -c status --long
<empty>
[root@devstack ~]#

Re-create the router together with external gw now:

[root@devstack ~]# openstack router create r1 --ha --distributed --external-gateway public

In the logs, one can see a traceback that the creation of this router failed initially, followed by a successful creation:

ERROR neutron.agent.l3.dvr_fip_ns Traceback (most recent call last):
ERROR neutron.agent.l3.dvr_fip_ns File "/opt/stack/neutron/neutron/agent/l3/dvr_fip_ns.py", line 152, in create_or_update_gateway_port
ERROR neutron.agent.l3.dvr_fip_ns self._update_gateway_port(
ERROR neutron.agent.l3.dvr_fip_ns File "/opt/stack/neutron/neutron/agent/l3/dvr_fip_ns.py", line 323, in _update_gateway_port
ERROR neutron.agent.l3.dvr_fip_ns self.driver.set_onlink_routes(
ERROR neutron.agent.l3.dvr_fip_ns File "/opt/stack/neutron/neutron/agent/linux/interface.py", line 193, in set_onlink_routes
ERROR neutron.agent.l3.dvr_fip_ns onlink = device.route.list_onlink_routes(constants.IP_VERSION_4)
ERROR neutron.agent.l3.dvr_fip_ns File "/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 633, in list_onlink_routes
ERROR neutron.agent.l3.dvr_fip_ns routes = self.list_routes(ip_version, scope='link')
ERROR neutron.agent.l3.dvr_fip_ns File "/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 629, in list_routes
ERROR neutron.agent.l3.dvr_fip_ns return list_ip_routes(self._parent.namespace, ip_version, scope=scope,
ERROR neutron.agent.l3.dvr_fip_ns File "/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 1585, in list_ip_routes
ERROR neutron.agent.l3.dvr_fip_ns routes = privileged.list_ip_routes(namespace, ip_version, device=device,
ERROR neutron.agent.l3.dvr_fip_ns File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 333, in wrapped_f
ERROR neutron.agent.l3.dvr_fip_ns return self(f, *args, **kw)
ERROR neutron.agent.l3.dvr_fip_ns File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 423, in __call__
ERROR neutron.agent.l3.dvr_fip_ns do = self.iter(retry_state=retry_state)
ERROR neutron.agent.l3.dvr_fip_ns File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 360, in iter
ERROR neutron.agent.l3.dvr_fip_ns return fut.result()
ERROR neutron.agent.l3.dvr_fip_ns File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 439, in result
ERROR neutron.agent.l3.dvr_fip_ns return self.__get_result()
ERROR neutron.agent.l3.dvr_fip_ns File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 391, in __get_result
ERROR neutron.agent.l3.dvr_fip_ns raise self._exception
ERROR neutron.agent.l3.dvr_fip_ns File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 426, in __call__
ERROR neutron.agent.l3.dvr_fip_ns result = fn(*args, **kwargs)
ERROR neutron.agent.l3.dvr_fip_ns File "/usr/local/lib/python3.9/site-packages/oslo_privsep/priv_context.py", line 271, in _wrap
ERROR neutron.agent.l3.dvr_fip_ns return self.channel.remote_call(name, args, kwargs,
ERROR neutron.agent.l3.dvr_fip_ns File "/usr/local/lib/python3.9/site-packages/oslo_privsep/daemon.py", line 215, in remote_call
ERROR neutron.agent.l3.dvr_fip_ns raise exc_type(*result[2])
ERROR neutron.agent.l3.dvr_fip_ns neutron.privileged.agent.linux.ip_lib.NetworkInterfaceNotFound: Network interface fg-b489f216-35not found in namespace fip-8d4bc2d5-c6e7-44d0-99f7-1333bafa991f.

The result is the following state:

[root@devstack ~]# ip netns
fip-8d4bc2d5-c6e7-44d0-99f7-1333bafa991f (id: 2)
qrouter-1f384e52-533c-49ed-b809-71f6358a2e5b
snat-1f384e52-533c-49ed-b809-71f6358a2e5b (id: 1)
[root@devstack ~]# openstack port list --network public -c ID -c device_owner -c status --long
+--------------------------------------+----------------------------------+--------+
| ID | Device Owner | Status |
+--------------------------------------+----------------------------------+--------+
| 17679644-d775-4182-b5b3-f2035e6483d9 | network:router_gateway | DOWN |
| b489f216-356a-456a-82ab-849e43a3226d | network:floatingip_agent_gateway | ACTIVE |
+--------------------------------------+----------------------------------+--------+
[root@devstack ~]#
[root@devstack ~]# cat /opt/stack/data/neutron/ha_confs/1f384e52-533c-49ed-b809-71f6358a2e5b/state
backup
[root@devstack ~]# stat /opt/stack/data/neutron/ha_confs/1f384e52-533c-49ed-b809-71f6358a2e5b/neutron-keepalived-state-change.log
...
Access: 2023-01-19 11:10:10.715245690 -0500
Modify: 2023-01-19 11:10:18.976208238 -0500
Change: 2023-01-19 11:10:18.976208238 -0500
 Birth: 2023-01-19 11:10:10.715245690 -0500
[root@devstack ~]# stat /var/run/netns/fip-8d4bc2d5-c6e7-44d0-99f7-1333bafa991f
...
Access: 2023-01-19 11:10:19.533205713 -0500
Modify: 2023-01-19 11:10:19.533205713 -0500
Change: 2023-01-19 11:10:19.533205713 -0500
 Birth: -
[root@devstack ~]#

By timestamp we can see that a keepalived monitoring started to work before the fip-ns was re-created after unsuccessful first attempt to create a router.
So keepalived monitoring is still bound to the FIP-ns that was created on the previously stuck namespace.

Adding an external gw and removing a router has a race condition and it's not always possible to get 100% reproduction. To achieve 100% reproduction, just add a small sleep with the following patch:

[root@devstack neutron]# git diff
diff --git a/neutron/agent/l3/dvr_local_router.py b/neutron/agent/l3/dvr_local_router.py
index 6e37c09511..d01eb0de9b 100644
--- a/neutron/agent/l3/dvr_local_router.py
+++ b/neutron/agent/l3/dvr_local_router.py
@@ -837,6 +837,8 @@ class DvrLocalRouter(dvr_router_base.DvrRouterBase):
                 self.agent.context, ex_gw_port['network_id'])
             LOG.debug("FloatingIP agent gateway port received from the "
                       "plugin: %s", fip_agent_port)
+ import time
+ time.sleep(5)
         self.fip_ns.create_or_update_gateway_port(fip_agent_port)

     def update_routing_table(self, operation, route):
[root@devstack neutron]#