OVN: HA chassis group priority is different than gateway chassis priority

Bug #1995078 reported by Michal Nasiadka
This bug affects 6 people
Affects                        Status       Importance  Assigned to  Milestone
Networking ML2 Generic Switch  Triaged      High        Unassigned
neutron                        In Progress  High        Unassigned

Bug Description

OpenStack release affected - Wallaby, Xena and Yoga for sure
OVN version: 21.12 (from CentOS NFV SIG repos)
Host OS: CentOS Stream 8

Neutron creates External ports for bare metal instances and uses ha_chassis_group.
Neutron normally defines different priorities for a router's LRP gateway chassis and for the ha_chassis_group.

I have a router with two VLANs attached - external (used for internet connectivity - SNAT or DNAT/Floating IP) and internal VLAN network hosting bare metal servers (and some Geneve networks for VMs).

If the active chassis of an External port's HA chassis group differs from the active gateway chassis (external VLAN network), those bare metal servers have intermittent network connectivity for any traffic going through that router.

In a thread on ovs-discuss ML - Numan Siddique wrote that "it is recommended that the
same controller which is actively handling the gateway traffic also
handles the external ports"

More information in this thread - https://mail.openvswitch.org/pipermail/ovs-discuss/2022-October/052067.html
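
The mismatch described above can be checked mechanically. This is a minimal Python sketch, assuming the priorities have already been read from the OVN Northbound database into plain dicts; the function names and the sample chassis names are illustrative and not part of Neutron or OVN. In both tables the chassis with the highest priority is the preferred (active) one.

```python
def active_chassis(priorities):
    """Return the chassis name with the highest priority.

    `priorities` maps chassis name -> integer priority.
    """
    if not priorities:
        raise ValueError("no chassis configured")
    return max(priorities, key=priorities.get)


def actives_match(gateway_chassis, ha_chassis_group):
    """True when both tables elect the same active chassis."""
    return active_chassis(gateway_chassis) == active_chassis(ha_chassis_group)


# Hypothetical sample data: the router gateway port prefers net1,
# while the external port's HA chassis group prefers net2.
gateway = {"net1": 3, "net2": 2, "net3": 1}
ha_group = {"net1": 32766, "net2": 32767, "net3": 32765}
print(actives_match(gateway, ha_group))
```

A False result corresponds to the situation in which the bare metal servers lose connectivity.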

Bugzilla reference:
* (OSP17): https://bugzilla.redhat.com/show_bug.cgi?id=1826364
* (OSP17): https://bugzilla.redhat.com/show_bug.cgi?id=2259161

Tags: ovn
Revision history for this message
Elvira García Ruiz (elviragr) wrote :

I don't have a deployment with baremetal servers but this looks like a legit bug.

Changed in neutron:
status: New → Confirmed
importance: Undecided → High
tags: added: ovn
Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Michal:

It is quite difficult for me to create an environment with baremetal ports. Can you check what Numan is suggesting in the mail?

"That could be the issue. You can perhaps arp for the router ip from
your bare metal machine and see if you get 2 arp replies - one from
the controller which binds the external port and one from the gateway
chassis controller."

If I'm not wrong, the HA chassis group will assign the highest priority chassis in "HA_Chassis" table and will detect failovers. Other ports (not external ones) should use the same chassis. In your case you are using VLAN that implies we are explicitly sending this traffic to a centralized router port, that is in the chassis hosting the distributed GW port [1]. I would need to check this issue with Lucas.

Regards.

[1]https://github.com/openvswitch/ovs/commit/85706c34d53d4810f54bec1de662392a3c06a996

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello again:

The issue you are hitting here is similar to [1]. For distributed VLAN traffic, you need to configure this option [2] on the GW chassis. "external_ids:ovn-chassis-mac-mappings" is a list of key-value pairs: the key is the physnet and the value is the MAC address. The OVN controller will replace the local LRP MAC address with the defined one if a packet is for a distributed port. That will solve the issue you have in your deployment using VLAN networks.
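
To illustrate the shape of that option, here is a small Python sketch that parses such a value. It assumes the comma-separated "physnet:MAC" format shown in the ovn-controller manual [2]; the function name and the sample value are hypothetical.

```python
def parse_chassis_mac_mappings(value):
    """Parse 'physnet:MAC[,physnet:MAC...]' into {physnet: mac}."""
    mappings = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        # Split on the first ':' only, because the MAC itself contains colons.
        physnet, sep, mac = item.partition(":")
        if not sep or len(mac.split(":")) != 6:
            raise ValueError(f"malformed mapping: {item!r}")
        mappings[physnet] = mac.lower()
    return mappings


# Hypothetical value; each chassis needs its own MAC per physnet.
example = "physnet1:aa:bb:cc:dd:ee:ff,physnet2:a1:b2:c3:d4:e5:f6"
print(parse_chassis_mac_mappings(example))
```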

Please let me know if that helped you.

Regards.

[1]https://bugzilla.redhat.com/show_bug.cgi?id=1766930
[2]https://github.com/ovn-org/ovn/blob/main/controller/ovn-controller.8.xml#L239

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (master)
Changed in kolla-ansible:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (master)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/864510
Committed: https://opendev.org/openstack/kolla-ansible/commit/8bf8656dbad3def707eca2d8ddd2c9bfed389b86
Submitter: "Zuul (22348)"
Branch: master

commit 8bf8656dbad3def707eca2d8ddd2c9bfed389b86
Author: Bartosz Bezak <email address hidden>
Date: Tue Nov 15 11:08:15 2022 +0100

    Generate ovn-chassis-mac-mappings on ovn-controller group

    Previously ovn-chassis-mac-mappings [1] has been added only to
    ovn-controller-compute group. However external ports are being
    scheduled on network nodes, therefore we need also do that there.

    Closes-Bug: 1995078

    [1] https://github.com/ovn-org/ovn/blob/v22.09.0/controller/ovn-controller.8.xml#L239

    Change-Id: Ie62e9220bad56262cad602ca1480e6ca65827819

Changed in kolla-ansible:
status: In Progress → Fix Released
Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Michal:

Can we consider this bug in Neutron as not valid? So far, I see the problem you had was the definition of "ovn-chassis-mac-mappings" in the wrong group [1]; this is the parameter mentioned in c#4 that should be added to the GW chassis. Did that solve the issue?

Regards.

[1]https://review.opendev.org/c/openstack/kolla-ansible/+/864510

Revision history for this message
Michal Nasiadka (mnasiadka) wrote :

Hi Rodolfo,

At the moment it seems that it has fixed the issue. Basically we added that in the past not thinking about external ports.

I marked it as invalid in Neutron; if we see any issues, we'll reopen it in the future.
Thanks for your help!

Changed in neutron:
status: Confirmed → Invalid
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/yoga)

Fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/864482

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/864483

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/864484

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Michal:

Nice to read that!

Regards.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/yoga)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/864482
Committed: https://opendev.org/openstack/kolla-ansible/commit/86e2f2df428cc8e8157942b8776600fa9674ff99
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit 86e2f2df428cc8e8157942b8776600fa9674ff99
Author: Bartosz Bezak <email address hidden>
Date: Tue Nov 15 11:08:15 2022 +0100

    Generate ovn-chassis-mac-mappings on ovn-controller group

    Previously ovn-chassis-mac-mappings [1] has been added only to
    ovn-controller-compute group. However external ports are being
    scheduled on network nodes, therefore we need also do that there.

    Closes-Bug: 1995078

    [1] https://github.com/ovn-org/ovn/blob/v22.09.0/controller/ovn-controller.8.xml#L239

    Change-Id: Ie62e9220bad56262cad602ca1480e6ca65827819
    (cherry picked from commit 8bf8656dbad3def707eca2d8ddd2c9bfed389b86)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/864483
Committed: https://opendev.org/openstack/kolla-ansible/commit/55fb10b7820fa6c5de45774ca793427bc27a3e38
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 55fb10b7820fa6c5de45774ca793427bc27a3e38
Author: Bartosz Bezak <email address hidden>
Date: Tue Nov 15 11:08:15 2022 +0100

    Generate ovn-chassis-mac-mappings on ovn-controller group

    Previously ovn-chassis-mac-mappings [1] has been added only to
    ovn-controller-compute group. However external ports are being
    scheduled on network nodes, therefore we need also do that there.

    Closes-Bug: 1995078

    [1] https://github.com/ovn-org/ovn/blob/v22.09.0/controller/ovn-controller.8.xml#L239

    Change-Id: Ie62e9220bad56262cad602ca1480e6ca65827819
    (cherry picked from commit 8bf8656dbad3def707eca2d8ddd2c9bfed389b86)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/864484
Committed: https://opendev.org/openstack/kolla-ansible/commit/40a4240e88573b43e3b7a7b98a9949f8029a1d18
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 40a4240e88573b43e3b7a7b98a9949f8029a1d18
Author: Bartosz Bezak <email address hidden>
Date: Tue Nov 15 11:08:15 2022 +0100

    Generate ovn-chassis-mac-mappings on ovn-controller group

    Previously ovn-chassis-mac-mappings [1] has been added only to
    ovn-controller-compute group. However external ports are being
    scheduled on network nodes, therefore we need also do that there.

    Closes-Bug: 1995078

    [1] https://github.com/ovn-org/ovn/blob/v22.09.0/controller/ovn-controller.8.xml#L239

    Change-Id: Ie62e9220bad56262cad602ca1480e6ca65827819
    (cherry picked from commit 8bf8656dbad3def707eca2d8ddd2c9bfed389b86)

Revision history for this message
Michal Nasiadka (mnasiadka) wrote :

After making all those changes in kolla-ansible, we found that data throughput for External VLAN ports dropped to around 30-100 kB/s, compared to 200 MB/s before.
That is probably related to MAC flooding issues in OVN.

If we remove ovn-chassis-mac-mappings on network nodes - and ha_chassis_group active chassis and gateway_chassis active chassis are the same - everything works fine (except HA).

Any other ideas?

Changed in neutron:
status: Invalid → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 15.0.0.0rc1

This issue was fixed in the openstack/kolla-ansible 15.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 13.7.0

This issue was fixed in the openstack/kolla-ansible 13.7.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 14.7.0

This issue was fixed in the openstack/kolla-ansible 14.7.0 release.

no longer affects: kolla-ansible
Revision history for this message
Bartosz Bezak (bbezak) wrote :

In a centralised configuration (no DVR), this problem still persists: traffic from VLAN external ports (baremetal) does not reach the router because the external port's HA chassis group has a different active chassis than the gateway chassis (external VLAN network).

Traffic starts to flow normally once the chassis priorities are manually altered.

Tested on yoga with OVN 22.09

Changed in neutron:
assignee: nobody → Rodolfo Alonso (rodolfo-alonso-hernandez)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/872023

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/872033

Changed in neutron:
status: Confirmed → In Progress
description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/872023
Committed: https://opendev.org/openstack/neutron/commit/ac231c817473c018dde8fa31594b1c9a78a36c13
Submitter: "Zuul (22348)"
Branch: master

commit ac231c817473c018dde8fa31594b1c9a78a36c13
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Mon Jan 23 16:17:46 2023 +0100

    Improve "sync_ha_chassis_group" method

    The method "sync_ha_chassis_group" now creates (or retrieves) a
    HA Chassis Group register and updates the needed HA Chassis registers
    in a single transaction. That is possible using the new ovsdbapp
    release 2.2.1 (check the depends-on patch).

    Depends-On: https://review.opendev.org/c/openstack/ovsdbapp/+/871836

    Related-Bug: #1995078
    Change-Id: I936855214c635de0e89d5d13a86562f5b282633c

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/872033
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible wallaby-eol

This issue was fixed in the openstack/kolla-ansible wallaby-eol release.

Bartosz Bezak (bbezak)
no longer affects: kolla-ansible/wallaby
no longer affects: kolla-ansible/xena
no longer affects: kolla-ansible/yoga
no longer affects: kolla-ansible/zed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/2023.1)

Related fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/neutron/+/903897

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/903897
Committed: https://opendev.org/openstack/neutron/commit/8c26736027fd2c066eef6cd05c89ff2364a570c0
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit 8c26736027fd2c066eef6cd05c89ff2364a570c0
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Mon Jan 23 16:17:46 2023 +0100

    Improve "sync_ha_chassis_group" method

    The method "sync_ha_chassis_group" now creates (or retrieves) a
    HA Chassis Group register and updates the needed HA Chassis registers
    in a single transaction. That is possible using the new ovsdbapp
    release 2.2.1 (check the depends-on patch).

    Depends-On: https://review.opendev.org/c/openstack/ovsdbapp/+/871836

    Conflicts:
      neutron/common/ovn/utils.py
      neutron/tests/functional/plugins/ml2/drivers/ovn/mech_driver/ovsdb/test_impl_idl.py
      neutron/tests/functional/plugins/ml2/drivers/ovn/mech_driver/ovsdb/test_maintenance.py

    Related-Bug: #1995078
    Change-Id: I936855214c635de0e89d5d13a86562f5b282633c
    (cherry picked from commit ac231c817473c018dde8fa31594b1c9a78a36c13)

Revision history for this message
Austin Cormier (acormier86) wrote :

Hi Rodolfo,

We are hitting this issue in our environment. I'm assuming anyone who is attempting to use OVN with baremetal/external ports in VLAN tenant networks in a HA environment would hit this. I'm assuming the workaround here is to centralize all external traffic to a single node?

Does the following have any dependencies on other fixes?
 https://review.opendev.org/c/openstack/neutron/+/872033

We would be willing to test any patches you may have available to help.

Revision history for this message
Austin Cormier (acormier86) wrote :

It wasn't clear whether https://review.opendev.org/c/openstack/neutron/+/903897 was a blocker for this.

Changed in networking-generic-switch:
status: New → Triaged
importance: Undecided → High
description: updated
Revision history for this message
Graeme Moss (gramimoss) wrote :

We are still having problems with HA chassis groups not matching the gateway chassis. I have the following setup:

OS version: ubuntu 22.04
Deployment: Kolla-ansible
Openstack release: 2023.2
Neutron Version: 23.1.1.dev102 (git_version 17dc151171)
ovn-nbctl: 23.09.0
Open vSwitch Library: 3.2.0
DB Schema: 7.1.0

We are not using distributed FIP or anything DVR.

## How to replicate the problem

- Create a VLAN network in OpenStack.
- Create a router attached to the external network and attach the VLAN network you created.
- Create a baremetal node and attach it to the VLAN network with a floating IP.

We found that the HA chassis does not match the gateway chassis, and we are unable to ping the baremetal node, even from a VM in the same VLAN.

How we fixed the problem

# External network port on the router
root@osc5:~# ovn-nbctl list Gateway_Chassis | grep -A2 -B4 lrp-e2138195-d4ef-4c69-a59e-9d56c2f697c6

_uuid : eaf33035-3bc1-45b8-85f9-6e251dd99d3c
chassis_name : osc4
external_ids : {}
name : lrp-e2138195-d4ef-4c69-a59e-9d56c2f697c6_osc4
options : {}
priority : 2

_uuid : d330d371-fb21-4beb-87ac-6bbc0c9a4103
chassis_name : osc5
external_ids : {}
name : lrp-e2138195-d4ef-4c69-a59e-9d56c2f697c6_osc5
options : {}
priority : 3
--

_uuid : c5ebe89f-8dc6-4d0f-83b0-bcf95509c32e
chassis_name : osc3
external_ids : {}
name : lrp-e2138195-d4ef-4c69-a59e-9d56c2f697c6_osc3
options : {}
priority : 1

# Vlan network
root@osc5:~# ovn-nbctl ha-chassis-group-list | grep -A8 neutron-9ab88c92-010c-4259-a476-9212e8a1309a
d64d3416-b0ee-4ea2-8db1-49c8a14f766d (neutron-9ab88c92-010c-4259-a476-9212e8a1309a)
    4b17607b-85c9-4661-a3e1-e87e4446ae58 (osc3)
    priority 32766

    7fa38a26-91f3-4bac-9435-20ae120c3b91 (osc4)
    priority 32767

    e9a94171-5960-491d-bf26-a93457ac0ba5 (osc5)
    priority 32765

Commands to change the priority to match the HA-chassis-groups
root@osc5:~# ovn-nbctl set gateway_chassis eaf33035-3bc1-45b8-85f9-6e251dd99d3c priority=3
root@osc5:~# ovn-nbctl set gateway_chassis c5ebe89f-8dc6-4d0f-83b0-bcf95509c32e priority=2
root@osc5:~# ovn-nbctl set gateway_chassis d330d371-fb21-4beb-87ac-6bbc0c9a4103 priority=1

After which service returns to a working state.
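
The manual reordering above can be expressed as a small calculation. This Python sketch derives the new Gateway_Chassis priorities from the HA chassis group listing; the function name is made up for illustration, and applying the result still requires the ovn-nbctl commands shown earlier.

```python
def realign_gateway_priorities(ha_group):
    """Map chassis name -> new Gateway_Chassis priority.

    The chassis with the highest HA chassis group priority gets the
    highest gateway priority (the number of chassis), the next one gets
    one less, and so on down to 1.
    """
    ordered = sorted(ha_group, key=ha_group.get, reverse=True)
    top = len(ordered)
    return {name: top - index for index, name in enumerate(ordered)}


# HA chassis group priorities taken from the listing above.
ha_group = {"osc3": 32766, "osc4": 32767, "osc5": 32765}
print(realign_gateway_priorities(ha_group))
```

For the listing above this yields osc4=3, osc3=2, osc5=1, which is what the three "ovn-nbctl set gateway_chassis" commands apply.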

Revision history for this message
Graeme Moss (gramimoss) wrote :

After trying to debug the neutron code, I've found some code that I believe is the problem. This is mainly for 2023.2; I know there are changes in master, but that doesn't fix the problem that people have been having for so long now.

https://opendev.org/openstack/neutron/src/commit/15cf75b770c08787b5cba7d07a69f30e34fa15e6/neutron/common/ovn/utils.py#L1064

When this code runs for the first time, it can't find an HA_Chassis_Group and so creates one with random priorities; these have a 1 in X (X = number of chassis) chance of matching the router's gateway_chassis priorities. The ha-chassis-group priorities have to match the gateway_chassis priorities for this to work, and this function has nothing that even checks what the router's gateway_chassis are.
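
The "1 in X" figure can be checked by brute force. This Python sketch assumes the priorities are assigned uniformly at random and that only the highest-priority chassis needs to coincide with the gateway's active chassis; both assumptions are mine, and the function name is illustrative.

```python
from fractions import Fraction
from itertools import permutations


def chance_top_chassis_matches(chassis):
    """Fraction of priority orderings whose first (highest priority)
    chassis equals the gateway's active chassis."""
    gateway_active = chassis[0]  # any fixed choice gives the same result
    orderings = list(permutations(chassis))
    hits = sum(1 for order in orderings if order[0] == gateway_active)
    return Fraction(hits, len(orderings))


print(chance_top_chassis_matches(["osc3", "osc4", "osc5"]))
```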

I really don't know where to start on coding a fix, and I can't find documentation on the commands that would tell me how to look up the router from the network. I did find this https://opendev.org/openstack/neutron/src/commit/15cf75b770c08787b5cba7d07a69f30e34fa15e6/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/impl_idl_ovn.py#L480 but again I don't know how to call it from utils.py.

So this is at least an update to say that we are still having this problem and would greatly appreciate help getting a fix in place until something better comes along.

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

@Lucas,

1. Should we sync the order of selected chassis then? Is there a stable sorting order that we could consistently apply to lists? (Sort by id / hash from id?) so that independent lists end up with identical order?

2. How does this bug play with enable-chassis-as-extport-host which defines dedicated nodes for external ports?
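
To make question 1 concrete, here is a rough Python sketch of a stable ordering: priorities are derived from a hash of each chassis name, so independent callers end up with identical lists. The function name and the choice of SHA-256 are assumptions for illustration only, not existing Neutron code.

```python
import hashlib


def stable_priorities(chassis_names):
    """Assign priorities from a hash of each chassis name.

    Any caller that passes the same set of names gets the same
    priorities, regardless of input order.
    """
    def digest(name):
        return hashlib.sha256(name.encode("utf-8")).hexdigest()

    ordered = sorted(set(chassis_names), key=digest)
    # The first name in hash order receives the highest priority.
    top = len(ordered)
    return {name: top - index for index, name in enumerate(ordered)}


a = stable_priorities(["osc3", "osc4", "osc5"])
b = stable_priorities(["osc5", "osc3", "osc4"])
print(a == b)
```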

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Graeme:

The code you are referring to [1] doesn't exist in master anymore. Now we have a method to create an HA Chassis Group for networks [2] and another for routers [3].

But the issue you have is in the OVN L3 scheduler. This is why I started [4]. Currently, the OVN L3 scheduler selects the port chassis based on the load of the GW chassis (the default "leastloaded" scheduler). The goal of this scheduler is to balance the number of GW ports across the available GW chassis.

But what is needed with these baremetal ports is not to schedule the ports "manually" (by Neutron in this case) but to allow OVN to do it. That is done using an HA_Chassis_Group. It will take all the GW chassis, assign priorities and always schedule the GW ports on the same high-priority chassis, which will match the chassis with the baremetal ports.

This way of scheduling is not as efficient, in terms of load distribution, as the "leastloaded" scheduler, but will meet your requirements using baremetal ports. The scheduler is configurable.

The patch [4] was started some time ago but the code of the OVN L3 scheduler has changed a lot in the past year. Now I have no time to continue with it but you are more than welcome to rebase it and finish it.

Regards.

[1]https://opendev.org/openstack/neutron/src/commit/15cf75b770c08787b5cba7d07a69f30e34fa15e6/neutron/common/ovn/utils.py#L1064
[2]https://github.com/openstack/neutron/blob/19a6e8e626c06b4a592aec9967456cb82dddbd4d/neutron/common/ovn/utils.py#L1194
[3]https://github.com/openstack/neutron/blob/19a6e8e626c06b4a592aec9967456cb82dddbd4d/neutron/common/ovn/utils.py#L1174
[4]https://review.opendev.org/c/openstack/neutron/+/872033

Revision history for this message
Gabriel Samfira (gabriel-samfira) wrote :

Hi folks,

Just hit this issue as well. I think this issue is made worse when you want to connect 2 or more VLAN networks to the same virtual router.

Each VLAN network will have a HA Chassis Group, each with different priorities assigned to the GW chassis. The LRP could have its own priorities set to the Gateway Chassis.

To add a bit more entropy, a VLAN network may be connected to multiple virtual routers. So the question is: is there a sane way to sync priorities across all connected routers and networks?

It feels like it might be easier to just set a static priority for each chassis and just ensure HA if the highest priority chassis is to fail.
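
A rough Python sketch of that idea, assuming one static priority per chassis shared by every router and network, and a liveness set supplied by the caller. All names and values here are hypothetical.

```python
def elect_active(static_priorities, alive):
    """Pick the highest-priority chassis that is currently alive.

    Returns None when no configured chassis is alive.
    """
    candidates = [name for name in static_priorities if name in alive]
    if not candidates:
        return None
    return max(candidates, key=static_priorities.get)


# Hypothetical static priorities shared by every router and network.
static = {"osc3": 1, "osc4": 3, "osc5": 2}
print(elect_active(static, alive={"osc3", "osc4", "osc5"}))
print(elect_active(static, alive={"osc3", "osc5"}))
```

Because every port consults the same table, the gateway port and the external ports always land on the same chassis, and they fail over together when the top chassis dies.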

Changed in neutron:
assignee: Rodolfo Alonso (rodolfo-alonso-hernandez) → nobody
Revision history for this message
Gabriel Samfira (gabriel-samfira) wrote :

It turns out, we don't really need an l3 scheduler. The l3 scheduler deals with allocating a chassis for the external port set on a gateway. But to fix this it's enough to:

1) Check if the network has a HA chassis group created for it
2) Set ha_chassis_group on the lswitch port created for the VLAN network in the router

Right now this is being done only for external ports:

https://github.com/openstack/neutron/blob/f1d8fbdb15fceed599287868b43c44032ce2e289/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py#L712-L716

Do you see any issue in setting this for router interfaces as well?

Revision history for this message
Gabriel Samfira (gabriel-samfira) wrote :

To try this on a non-functioning setup, you can run the following:

NET_ID="<your VLAN network ID>"

GW_PORT_ID=$(openstack port list --network "$NET_ID" --device-owner network:router_interface -c id --format=value)

HA_CHASSIS_GRP=$(ovn-nbctl ha-chassis-group-list "neutron-$NET_ID" | head -n1 | awk '{print $1}')

ovn-nbctl set Logical_Router_Port "lrp-$GW_PORT_ID" ha_chassis_group="$HA_CHASSIS_GRP"

At this point you should have ping.
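
For anyone scripting this, the same two steps (take the UUID from the first line of the group listing, then build the "set" command) can be sketched in Python. The helper names are made up, and the sketch only builds the argument list; it does not run ovn-nbctl.

```python
def first_group_uuid(ha_chassis_group_list_output):
    """Extract the UUID from the first line of `ha-chassis-group-list`."""
    lines = ha_chassis_group_list_output.splitlines()
    if not lines or not lines[0].split():
        raise ValueError("no HA chassis group found")
    return lines[0].split()[0]


def build_set_lrp_group_cmd(gw_port_id, ha_chassis_group_uuid):
    """Return the argv for pointing a router port at an HA chassis group."""
    return [
        "ovn-nbctl", "set", "Logical_Router_Port",
        f"lrp-{gw_port_id}",
        f"ha_chassis_group={ha_chassis_group_uuid}",
    ]


# Sample listing in the format printed by ovn-nbctl; the port ID is a placeholder.
listing = (
    "d64d3416-b0ee-4ea2-8db1-49c8a14f766d "
    "(neutron-9ab88c92-010c-4259-a476-9212e8a1309a)\n"
    "    4b17607b-85c9-4661-a3e1-e87e4446ae58 (osc3)\n"
)
print(build_set_lrp_group_cmd("PORT-ID", first_group_uuid(listing)))
```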

Revision history for this message
Gabriel Samfira (gabriel-samfira) wrote :

Adding a bit more context here before proposing a PR on gerrit with a proof of concept.

Assumptions:

VLAN network: 10.31.1.0/24
Geneve network: 192.168.20.0/24

Both networks connected to a virtual router. The virtual router has an external gateway set.

Current neutron implementation:

Ports and scheduling:

* External router ports (gateway) are bound to a Gateway Chassis. If there are nodes with ovn-cms-options=enable-chassis-as-gw set, those nodes will be preferred for these types of ports
* External ports (bare metal) are also bound to nodes marked as either enable-chassis-as-gw or enable-chassis-as-extport-host
* Internal router ports are not bound to anything

Behavior in non VLAN/Flat networks:

When a VM needs to send packets to a different network and needs to go through its gateway, the chassis where that VM resides will replace the source MAC address of the packet with the MAC address of the internal router interface, then send it to the external port for processing. OVN takes care of this without issue.

Behavior on VLAN networks:

In a VLAN network, when a VM tries to send a packet to another network, the same flow takes place. The source MAC is replaced with the internal router port MAC, but due to ovn-chassis-mac-mappings, the internal port MAC is replaced with the MAC address of the interface local to the chassis where the VM is. Without ovn-chassis-mac-mappings, the packet is just dropped: we see it exiting the veth device, but it is then dropped. With the MAC mappings set, things work. The packet is put on the physical network with the destination MAC set to the external port.

However, when a baremetal (external) port is involved, the physical machine cannot send packets out. It can receive them (we see the echo requests come in), but echo replies go into the great beyond. If we look at the tcpdump output on any of the chassis marked as "enable-chassis-as-gw" while we ping the baremetal node (10.31.1.23) from a VM connected to 192.168.20.0/24, we see continuous ARP requests from the baremetal node trying to find 10.31.1.1. But because the internal router port is not bound to anything, nothing ever replies, so the baremetal node does not know where to send the packets.

When the external gateway is bound to same chassis as the baremetal external port, we get a reply. If the external port is bound to a different chassis, no ARP reply is received.

If however we bind the internal router port (10.31.1.1) to the same ha_chassis_group as the VLAN network (10.31.1.0/24), the same chassis is selected for both the external port and the internal router port. In this case it does not matter if the external port of the router is bound to the same chassis as the highest priority chassis in the ha_chassis_group of the VLAN network. Traffic flows normally.

This only seems to be needed for internal ports of VLAN networks. External ports get bound to a gateway chassis anyway as they need to respond to ARP requests made by the upstream gateway. The same thing seems to be true for internal ports of VLAN networks in the case of downstream external ports (bare metal nodes).
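
The observations above can be condensed into a small decision function. This Python sketch only encodes the behavior described in this comment; it is a simplified model with made-up names and is not derived from the OVN source.

```python
def arp_reply_expected(extport_chassis, gateway_chassis, internal_lrp_chassis=None):
    """Predict whether the bare metal node gets an ARP reply for its gateway IP.

    internal_lrp_chassis is None when the internal router port is unbound,
    which is the current behavior described above.
    """
    if internal_lrp_chassis is not None:
        # Internal router port bound through the network's HA chassis group:
        # it shares a chassis with the external port, so a reply is expected.
        return internal_lrp_chassis == extport_chassis
    # Unbound internal port: a reply is seen only when the gateway port
    # happens to share a chassis with the external port.
    return gateway_chassis == extport_chassis


print(arp_reply_expected("osc4", "osc5"))
print(arp_reply_expected("osc4", "osc5", internal_lrp_chassis="osc4"))
```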

On a Caracal installation, the patch here:

https://pa...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/931892

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/931892
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
Graeme Moss (gramimoss) wrote :

Hi,
Sorry I've been so quiet regarding this bug; I've had deadlines that needed to be met.

I have been able to create a patch related to this. I would by no means call it finished, but it has been working without any problems for our cloud. This patch is for 2023.2 and I have not yet tested it on 2024.1. I have not submitted a PR because I think Rodolfo's changes would be a better way.

I also have not been able to figure out how to write unit tests for this, as the tests are more complex than the code :) and I'm no coder.

I hope this at least helps in some way.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/939518

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Michal Nasiadka <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/939518
Reason: not needed anymore - see https://bugs.launchpad.net/neutron/+bug/2092271

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/939961

Revision history for this message
Vasyl Saienko (vsaienko) wrote :

Sounds like this bug is now a duplicate of https://bugs.launchpad.net/neutron/+bug/2052821 and was already fixed.

Revision history for this message
Michal Nasiadka (mnasiadka) wrote :

No, that bug is a different bug.
