Floating IP is lost when a VM migrates between a dvr_snat agent and a dvr_no_external agent in Rocky

Bug #1992542 reported by Jin Ren
Affects: neutron
Status: New
Importance: Wishlist
Assigned to: Unassigned

Bug Description

We have 4 nodes: 3 control nodes and 1 compute node. Our control nodes and network nodes are installed together (combined control/network hosts), and we use Rocky.

[TestCase]
Internal IP: 172.16.135.206
Floating IP: 13.5.4.113

1. Create a VM with a floating IP on a control node
2. Shut down the VM
3. Migrate the VM from the control node to the compute node
4. Start the VM (example commands below)
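
For reference, the reproduction steps roughly correspond to the following nova CLI commands (the VM name is a placeholder, and plain 'nova migrate' lets the scheduler pick the destination host, so the intended compute node must be the only eligible target):

nova stop <vm>            # shut down the VM
nova migrate <vm>         # cold-migrate it to another host
nova resize-confirm <vm>  # confirm once the migration reaches VERIFY_RESIZE
nova start <vm>           # start the VM again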

[expected result]
The floating IP works; pinging the internet from the VM succeeds.

[actual result]
The floating IP does not work; the VM cannot reach the internet through it.

The NAT rules for the floating IP are present in the snat namespace:

[root@CRH-KZ-3 neutron]# ip netns exec snat-3597ff2f-60c9-4310-b11b-3d808e63c4b9 iptables -t nat -S
-A POSTROUTING -j neutron-postrouting-bottom
-A neutron-l3-agent-OUTPUT -d 13.5.4.113/32 -j DNAT --to-destination 172.16.135.206
-A neutron-l3-agent-PREROUTING -d 13.5.4.113/32 -j DNAT --to-destination 172.16.135.206

But I cannot find the floating IP 13.5.4.113 on any interface in the snat namespace:
[root@CRH-KZ-3 neutron]# ip netns exec snat-3597ff2f-60c9-4310-b11b-3d808e63c4b9 ip a
4: qg-ecb0baea-0a@if10454455: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether fa:16:3e:d7:bd:fb brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 13.5.4.232/32 brd 13.5.4.232 scope global qg-ecb0baea-0a
       valid_lft forever preferred_lft forever
    inet 13.5.4.32/32 brd 13.5.4.32 scope global qg-ecb0baea-0a
       valid_lft forever preferred_lft forever
    inet 13.5.4.4/32 scope global qg-ecb0baea-0a
       valid_lft forever preferred_lft forever
    inet6 2022:419:1710:eeee::389/64 scope global nodad
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fed7:bdfb/64 scope link
       valid_lft forever preferred_lft forever

It seems the floating IP is lost during the migration. If I manually add the floating IP address to qg-ecb0baea-0a, the floating IP works normally.
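
For reference, the manual workaround is roughly the following (a /32 address, matching how the L3 agent normally plumbs floating IPs onto qg- ports; this only patches the namespace by hand and does not fix the missing agent-side processing):

ip netns exec snat-3597ff2f-60c9-4310-b11b-3d808e63c4b9 \
    ip addr add 13.5.4.113/32 dev qg-ecb0baea-0a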

I tried to debug it.

Before the migration, the VM with the floating IP is on a control node, and all floating-IP traffic goes through interface fg-5ff577fd-8c (MAC fa:16:3e:ec:cc:f0) in the fip namespace.
Below is the ARP table on our switch:

ARM-R3-14-45U-SPINE-98.1>show arp | inc 13.5.4
Internet 13.5.4.113 1 fa16.3eec.ccf0 (an interface in the fip ns) ARPA vlan205 te0/7

Below is interface fg-5ff577fd-8c in the fip namespace:
[root@CRH-KZ-3 ~]# ip netns exec fip-d3840bac-d92c-4fa7-beb3-6e39c403af84 ip a
2: fg-5ff577fd-8c@if10454106: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether fa:16:3e:ec:cc:f0 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 13.5.4.70/24 brd 13.5.4.255 scope global fg-5ff577fd-8c
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:feec:ccf0/64 scope link
       valid_lft forever preferred_lft forever

After the migration, the same VM is on the compute node, and all floating-IP traffic should go through qg-ecb0baea-0a (MAC fa:16:3e:d7:bd:fb) in the snat namespace. But because the floating IP was not added to the snat namespace, when I ping the internet the outgoing traffic goes through the snat namespace while return traffic is still directed to the fip namespace.
Below is the ARP table on our switch; the MAC entry for the floating IP has not changed:
ARM-R3-14-45U-SPINE-98.1>show arp | inc 13.5.4
Internet 13.5.4.113 1 fa16.3eec.ccf0 (an interface in the fip ns) ARPA vlan205 te0/7
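
One way to double-check this from the host side is to send a gratuitous ARP for the floating IP out of the snat namespace once the address has been added there, and then see whether the switch entry moves to fa:16:3e:d7:bd:fb (this assumes the iputils arping tool is available; it is only a diagnostic step, not part of the normal agent flow):

ip netns exec snat-3597ff2f-60c9-4310-b11b-3d808e63c4b9 \
    arping -A -c 3 -I qg-ecb0baea-0a 13.5.4.113   # -A sends gratuitous ARP replies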

Below is our configuration.

l3_agent.ini on the control node

[root@CRH-KZ-3 neutron]# cat l3_agent.ini
[DEFAULT]
debug = True
interface_driver = neutron.agent.linux.interface.OVSInterfaceDriver
external_network_bridge =
ha_vrrp_auth_password = xxxx
interface_driver = openvswitch
agent_mode = dvr_snat
enable_metadata_proxy = false
ovs_use_veth = True
[agent]
extensions=fwaas,fip_qos
[ovs]

l3_agent.ini on the compute node
[root@CRH-JS-7 ~]# cat /etc/neutron/l3_agent.ini
[DEFAULT]
debug = True
interface_driver = openvswitch
external_network_bridge =
ha_vrrp_health_check_interval = 30
agent_mode = dvr_no_external
enable_metadata_proxy = false
ovs_use_veth = True
[agent]
extensions=fip_qos
[ovs]
ovsdb_debug = true

[root@CRH-KZ-3 ~]# neutron agent-list | grep CRH-KZ-3
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
| 1119b4f8-b60a-47a2-9550-c0ae9837d73a | DHCP agent | CRH-KZ-3 | nova | :-) | True | neutron-dhcp-agent |
| 7051de00-be68-4f85-81d6-78c73ece87ef | Open vSwitch agent | CRH-KZ-3 | | :-) | True | neutron-openvswitch-agent |
| 91fcfb25-f38c-4e75-a203-ea46ec797781 | L3 agent | CRH-KZ-3 | nova | :-) | True | neutron-l3-agent |
| 943429e6-e88c-4ea7-923f-91e36684d24d | Metadata agent | CRH-KZ-3 | | :-) | True | neutron-metadata-agent |
[root@CRH-KZ-3 ~]#
[root@CRH-KZ-3 ~]# nova service-list | grep CRH-KZ-3
| 5491120c-da2b-41ee-91f7-896cc918a637 | nova-conductor | CRH-KZ-3 | internal | enabled | up | 2022-10-12T03:11:31.000000 | - | False |
| 036adf98-aa04-4532-ab92-e168b273469c | nova-consoleauth | CRH-KZ-3 | internal | enabled | up | 2022-10-12T03:11:23.000000 | - | False |
| 11b8a94c-5171-48b3-a782-fcf0b63ba621 | nova-scheduler | CRH-KZ-3 | internal | enabled | up | 2022-10-12T03:11:31.000000 | - | False |
| f13fef0a-65c6-4a8a-9903-de85c13e7671 | nova-compute | CRH-KZ-3 | sfdev-az-02 | enabled | up | 2022-10-12T03:11:24.000000 | - | False |
[root@CRH-KZ-3 ~]#

[root@CRH-KZ-3 ~]# neutron agent-list | grep CRH-JS-7
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
| 01ab3e58-693b-448e-823c-3dd08355bdcf | Metadata agent | CRH-JS-7 | | :-) | True | neutron-metadata-agent |
| 042cf399-ae41-4a27-9810-2fdd9f6d787f | Open vSwitch agent | CRH-JS-7 | | :-) | True | neutron-openvswitch-agent |
| 5b2d6292-b249-4e2a-b0a2-2980c773b3a4 | DHCP agent | CRH-JS-7 | nova | :-) | True | neutron-dhcp-agent |
| aa176740-f463-417a-b12e-7be2ba6e2e00 | L3 agent | CRH-JS-7 | nova | :-) | True | neutron-l3-agent |
[root@CRH-KZ-3 ~]#
[root@CRH-KZ-3 ~]# nova service-list | grep CRH-JS-7
| b797cf8f-8037-47a8-b9fe-479c4d04a4ca | nova-compute | CRH-JS-7 | sfdev-az-02 | enabled | up | 2022-10-12T03:12:21.000000 | - | False |
[root@CRH-KZ-3 ~]#
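
To confirm which L3 agents are hosting the router after the migration, the same (deprecated) neutron CLI can be asked directly; the router ID below is inferred from the snat namespace name shown above:

neutron l3-agent-list-hosting-router 3597ff2f-60c9-4310-b11b-3d808e63c4b9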

Revision history for this message
LIU Yulong (dragon889) wrote :

So, your controller node is dvr_snat + compute. And your single compute node is dvr_no_external.

So this looks like a new bug from running mixed dvr_snat and compute nodes. We have marked such a configuration as not supported:
https://review.opendev.org/c/openstack/neutron/+/801503

There are still many other open problems with it:
https://bugs.launchpad.net/neutron/+bug/1934666
https://bugs.launchpad.net/neutron/+bug/1945306

Revision history for this message
Jin Ren (renforwards) wrote :

Thanks for your reply. You are right, our controller node is dvr_snat + compute; in fact our controller is control + network + compute combined. So should we select 'dvr' on our controller node instead?
Is this the best practice for us?

Revision history for this message
LIU Yulong (dragon889) wrote :

The best practice should be:
1. controller node for neutron-server only
2. N compute node(s) with L3 agent mode 'dvr' or 'dvr_no_external'
3. 3 or more network nodes with L3 agent mode 'dvr_snat'

In that case, you can test migrating a VM from a 'dvr' compute node to a 'dvr_no_external' compute node, and the floating IP should be set up on a network node with L3 agent mode 'dvr_snat'. Conversely, when the VM moves from 'dvr_no_external' to 'dvr', the floating IP should be moved from the network node to the 'dvr' compute node.
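
As a rough sketch of that layout, the l3_agent.ini files would differ mainly in agent_mode (illustration only, not a complete configuration; other options kept as in the files above):

# l3_agent.ini on a network node (3 or more of these)
[DEFAULT]
interface_driver = openvswitch
agent_mode = dvr_snat

# l3_agent.ini on a compute node
[DEFAULT]
interface_driver = openvswitch
agent_mode = dvr
# or: agent_mode = dvr_no_external, if the node has no external connectivity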

tags: added: l3-dvr-backlog
Changed in neutron:
importance: Undecided → Low
importance: Low → Wishlist