We found an issue where a newly created HA DVR router gets stuck in the backup state and never transitions to the primary state.
Preconditions:
1) no router with a gateway on the given external network exists yet
2) a router is created and then quickly deleted; the next router created can then get stuck in the backup state
The cause is that the fip-ns is left behind on the agent even though the floatingip_agent_gateway port has been removed.
Below is a demo that reproduces this behavior on a single-node devstack setup.
Create a router and quickly delete it while the L3 agent is still processing the addition of the external gateway:
[root@devstack ~]# r_id=$(openstack router create r1 --distributed --ha -c id -f value); sleep 30 # give time to process
[root@devstack ~]# count_fip_requests() { journalctl -u <email address hidden> | grep 'FloatingIP agent gateway port received' | wc -l; }
[root@devstack ~]# # add an external gateway and then delete the router while the agent processes gw
[root@devstack ~]# fip_requests=$(count_fip_requests); openstack router set $r_id --external-gateway public; while :; do [[ $fip_requests == $(count_fip_requests) ]] && { echo "waiting before deletion..."; sleep 1; } || break; done; openstack router delete $r_id
waiting before deletion...
waiting before deletion...
[root@devstack ~]#
As a result, the fip-ns is not deleted even though the floatingip_agent_gateway port was removed:
[root@devstack ~]# ip netns
fip-8d4bc2d5-c6e7-44d0-99f7-1333bafa991f (id: 1)
[root@devstack ~]# openstack port list --network public -c ID -c device_owner -c status --long
<empty>
[root@devstack ~]#
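The check above (a fip-ns that exists with no matching floatingip_agent_gateway port) can be sketched as a small helper. This is only an illustration, not part of Neutron: the function name and its inputs are hypothetical, and it assumes the fip namespace is named fip-<external network ID> and that the caller already knows which network IDs still have an agent gateway port.

```python
def stale_fip_namespaces(ip_netns_output, networks_with_gateway_port):
    """Return fip-* namespaces whose network has no agent gateway port.

    ip_netns_output: text printed by `ip netns`.
    networks_with_gateway_port: set of external network IDs that still have
    a floatingip_agent_gateway port (hypothetical input, gathered elsewhere).
    """
    stale = []
    for line in ip_netns_output.splitlines():
        parts = line.split()
        if not parts:
            continue
        name = parts[0]  # drop the optional "(id: N)" suffix
        if not name.startswith("fip-"):
            continue
        network_id = name[len("fip-"):]
        if network_id not in networks_with_gateway_port:
            stale.append(name)
    return stale


# Output of `ip netns` from the transcript above; the port list was empty.
sample = "fip-8d4bc2d5-c6e7-44d0-99f7-1333bafa991f (id: 1)\n"
print(stale_fip_namespaces(sample, set()))
```

With the empty port list from the transcript, the single fip namespace is reported as stale.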
Now re-create the router, this time together with the external gateway:
[root@devstack ~]# openstack router create r1 --ha --distributed --external-gateway public
The logs show a traceback from the initial, failed attempt to create this router, followed by a successful creation:
ERROR neutron.agent.l3.dvr_fip_ns Traceback (most recent call last):
ERROR neutron.agent.l3.dvr_fip_ns File "/opt/stack/neutron/neutron/agent/l3/dvr_fip_ns.py", line 152, in create_or_update_gateway_port
ERROR neutron.agent.l3.dvr_fip_ns self._update_gateway_port(
ERROR neutron.agent.l3.dvr_fip_ns File "/opt/stack/neutron/neutron/agent/l3/dvr_fip_ns.py", line 323, in _update_gateway_port
ERROR neutron.agent.l3.dvr_fip_ns self.driver.set_onlink_routes(
ERROR neutron.agent.l3.dvr_fip_ns File "/opt/stack/neutron/neutron/agent/linux/interface.py", line 193, in set_onlink_routes
ERROR neutron.agent.l3.dvr_fip_ns onlink = device.route.list_onlink_routes(constants.IP_VERSION_4)
ERROR neutron.agent.l3.dvr_fip_ns File "/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 633, in list_onlink_routes
ERROR neutron.agent.l3.dvr_fip_ns routes = self.list_routes(ip_version, scope='link')
ERROR neutron.agent.l3.dvr_fip_ns File "/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 629, in list_routes
ERROR neutron.agent.l3.dvr_fip_ns return list_ip_routes(self._parent.namespace, ip_version, scope=scope,
ERROR neutron.agent.l3.dvr_fip_ns File "/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 1585, in list_ip_routes
ERROR neutron.agent.l3.dvr_fip_ns routes = privileged.list_ip_routes(namespace, ip_version, device=device,
ERROR neutron.agent.l3.dvr_fip_ns File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 333, in wrapped_f
ERROR neutron.agent.l3.dvr_fip_ns return self(f, *args, **kw)
ERROR neutron.agent.l3.dvr_fip_ns File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 423, in __call__
ERROR neutron.agent.l3.dvr_fip_ns do = self.iter(retry_state=retry_state)
ERROR neutron.agent.l3.dvr_fip_ns File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 360, in iter
ERROR neutron.agent.l3.dvr_fip_ns return fut.result()
ERROR neutron.agent.l3.dvr_fip_ns File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 439, in result
ERROR neutron.agent.l3.dvr_fip_ns return self.__get_result()
ERROR neutron.agent.l3.dvr_fip_ns File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 391, in __get_result
ERROR neutron.agent.l3.dvr_fip_ns raise self._exception
ERROR neutron.agent.l3.dvr_fip_ns File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 426, in __call__
ERROR neutron.agent.l3.dvr_fip_ns result = fn(*args, **kwargs)
ERROR neutron.agent.l3.dvr_fip_ns File "/usr/local/lib/python3.9/site-packages/oslo_privsep/priv_context.py", line 271, in _wrap
ERROR neutron.agent.l3.dvr_fip_ns return self.channel.remote_call(name, args, kwargs,
ERROR neutron.agent.l3.dvr_fip_ns File "/usr/local/lib/python3.9/site-packages/oslo_privsep/daemon.py", line 215, in remote_call
ERROR neutron.agent.l3.dvr_fip_ns raise exc_type(*result[2])
ERROR neutron.agent.l3.dvr_fip_ns neutron.privileged.agent.linux.ip_lib.NetworkInterfaceNotFound: Network interface fg-b489f216-35 not found in namespace fip-8d4bc2d5-c6e7-44d0-99f7-1333bafa991f.
The result is the following state:
[root@devstack ~]# ip netns
fip-8d4bc2d5-c6e7-44d0-99f7-1333bafa991f (id: 2)
qrouter-1f384e52-533c-49ed-b809-71f6358a2e5b
snat-1f384e52-533c-49ed-b809-71f6358a2e5b (id: 1)
[root@devstack ~]# openstack port list --network public -c ID -c device_owner -c status --long
+--------------------------------------+----------------------------------+--------+
| ID | Device Owner | Status |
+--------------------------------------+----------------------------------+--------+
| 17679644-d775-4182-b5b3-f2035e6483d9 | network:router_gateway | DOWN |
| b489f216-356a-456a-82ab-849e43a3226d | network:floatingip_agent_gateway | ACTIVE |
+--------------------------------------+----------------------------------+--------+
[root@devstack ~]#
[root@devstack ~]# cat /opt/stack/data/neutron/ha_confs/1f384e52-533c-49ed-b809-71f6358a2e5b/state
backup
[root@devstack ~]# stat /opt/stack/data/neutron/ha_confs/1f384e52-533c-49ed-b809-71f6358a2e5b/neutron-keepalived-state-change.log
...
Access: 2023-01-19 11:10:10.715245690 -0500
Modify: 2023-01-19 11:10:18.976208238 -0500
Change: 2023-01-19 11:10:18.976208238 -0500
Birth: 2023-01-19 11:10:10.715245690 -0500
[root@devstack ~]# stat /var/run/netns/fip-8d4bc2d5-c6e7-44d0-99f7-1333bafa991f
...
Access: 2023-01-19 11:10:19.533205713 -0500
Modify: 2023-01-19 11:10:19.533205713 -0500
Change: 2023-01-19 11:10:19.533205713 -0500
Birth: -
[root@devstack ~]#
The timestamps show that the keepalived state-change monitor started (11:10:10) before the fip-ns was re-created (11:10:19) after the failed first attempt to create the router.
So the keepalived monitor is still bound to the stale fip-ns left over from the previous router rather than to the re-created one.
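The ordering of the two events can be checked directly from the `stat` output. The following is a minimal sketch; `parse_stat_time` is a hypothetical helper, and the two timestamps are copied from the transcript (Birth of the state-change log and Change of the fip namespace file).

```python
from datetime import datetime


def parse_stat_time(value):
    """Parse a `stat` timestamp such as '2023-01-19 11:10:10.715245690 -0500'."""
    date, clock, offset = value.split()
    seconds, _, fraction = clock.partition(".")
    # strptime handles at most microseconds, so truncate the nanoseconds.
    micro = (fraction + "000000")[:6]
    return datetime.strptime(
        f"{date} {seconds}.{micro} {offset}", "%Y-%m-%d %H:%M:%S.%f %z"
    )


monitor_birth = parse_stat_time("2023-01-19 11:10:10.715245690 -0500")
fip_ns_change = parse_stat_time("2023-01-19 11:10:19.533205713 -0500")

# True means the monitor was started before the namespace was re-created.
print(monitor_birth < fip_ns_change)
print((fip_ns_change - monitor_birth).total_seconds())
```

The gap of just under nine seconds matches the window in which the first creation attempt failed and the namespace was rebuilt.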
Adding an external gateway and then removing the router is a race, so the issue does not reproduce every time. To reproduce it reliably, add a small sleep with the following patch:
[root@devstack neutron]# git diff
diff --git a/neutron/agent/l3/dvr_local_router.py b/neutron/agent/l3/dvr_local_router.py
index 6e37c09511..d01eb0de9b 100644
--- a/neutron/agent/l3/dvr_local_router.py
+++ b/neutron/agent/l3/dvr_local_router.py
@@ -837,6 +837,8 @@ class DvrLocalRouter(dvr_router_base.DvrRouterBase):
                 self.agent.context, ex_gw_port['network_id'])
             LOG.debug("FloatingIP agent gateway port received from the "
                       "plugin: %s", fip_agent_port)
+        import time
+        time.sleep(5)
         self.fip_ns.create_or_update_gateway_port(fip_agent_port)
 
     def update_routing_table(self, operation, route):