[SRU] OVN: ovnmeta namespaces missing during scalability test causing DHCP issues

Bug #2017748 reported by Lucas Alvares Gomes
This bug affects 2 people
Affects                                       Status         Importance   Assigned to
Ubuntu Cloud Archive (status tracked in Epoxy)
  Antelope                                    New            Undecided    Unassigned
  Bobcat                                      New            Undecided    Unassigned
  Caracal                                     Fix Released   Undecided    Unassigned
  Dalmatian                                   Fix Released   Undecided    Unassigned
  Epoxy                                       Fix Released   Undecided    Unassigned
  Yoga                                        New            Undecided    Unassigned
  Zed                                         Won't Fix      Undecided    Unassigned
neutron (status tracked in Ussuri)
  Ussuri                                      Fix Released   High         Terry Wilson
  Victoria                                    New            Undecided    Unassigned
  Wallaby                                     New            Undecided    Unassigned
  Xena                                        New            Undecided    Unassigned
neutron (Ubuntu) (status tracked in Plucky)
  Focal                                       In Progress    Undecided    Unassigned
  Jammy                                       In Progress    Undecided    Hua Zhang
  Noble                                       Fix Released   Undecided    Unassigned
  Oracular                                    Fix Released   Undecided    Unassigned
  Plucky                                      Fix Released   Undecided    Unassigned

Bug Description

[Impact]

ovnmeta- namespaces are intermittently missing, which leaves the affected VMs unreachable.

[Test Case]
I was not able to reproduce this easily, so I ran charmed-openstack-tester instead. The results are below:

======
Totals
======
Ran: 469 tests in 4273.6309 sec.
 - Passed: 398
 - Skipped: 69
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 2
Sum of execute time for each test: 4387.2727 sec.

The 2 failed tests (tempest.api.object_storage.test_account_quotas.AccountQuotasTest and octavia_tempest_plugin.tests.scenario.v2.test_traffic_ops.TrafficOperationsScenarioTest) are not related to the fix.

[Where problems could occur]
These patches are related to the OVN metadata agent on compute nodes.
VM connectivity could be affected by this patch when OVN is used.
Binding a port to a datapath could be affected.

[Others]

== ORIGINAL DESCRIPTION ==

Reported at: https://bugzilla.redhat.com/show_bug.cgi?id=2187650

During a scalability test it was noted that a few VMs were having issues being pinged (2 out of ~5000 VMs in the test conducted). After some investigation it was found that the VMs in question did not receive a DHCP lease:

udhcpc: no lease, failing
FAIL
checking http://169.254.169.254/2009-04-04/instance-id
failed 1/20: up 181.90. request failed

And the ovnmeta- namespaces for the networks that the VMs were booting from were missing. Looking into the ovn-metadata-agent.log:

2023-04-18 06:56:09.864 353474 DEBUG neutron.agent.ovn.metadata.agent [-] There is no metadata port for network 9029c393-5c40-4bf2-beec-27413417eafa or it has no MAC or IP addresses configured, tearing the namespace down if needed _get_provision_params /usr/lib/python3.9/site-packages/neutron/agent/ovn/metadata/agent.py:495

Apparently, when the system is under stress (scalability tests), there are some edge cases where the metadata port information has not yet been propagated by OVN to the Southbound database. When the PortBindingChassisEvent event is handled and tries to find either the metadata port or the IP information on it (which is updated by ML2/OVN during subnet creation), it cannot be found and the handler fails silently with the error shown above.

Note that running the same tests with less concurrency did not trigger this issue, so it only happens when the system is overloaded.

Changed in neutron:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/881487
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
Terry Wilson (otherwiseguy) wrote : Re: OVN: ovnmeta namespaces missing during scalability test causing DHCP issues

We've done some internal testing and what we see is that ovsdb-server, when there is a backlog of event notifications to send out, batches updates and can merge "insert" and "update" operations that happen close together. This is intended behavior.

What this means is that our PortBindingUpdatedEvent (or PortBindingChassisCreatedEvent), which looks for "update" events, doesn't fire when we get a Port_Binding "create" that has the chassis field set.

I'm working on a fix.
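
To make the failure mode concrete, here is a minimal, self-contained sketch of the behavior described above. It is not ovsdb-server or neutron code: `Notification`, `coalesce` and `update_only_handler_fires` are hypothetical names used only for illustration, and the merging rule is a simplified assumption about what "merging insert and update" looks like from the client side.

```python
from dataclasses import dataclass, field


@dataclass
class Notification:
    """One row notification as a monitoring client would see it."""
    kind: str                      # "create" or "update"
    row_id: str
    columns: dict = field(default_factory=dict)


def coalesce(pending):
    """Merge an "update" into an earlier pending "create" of the same row.

    This only imitates the batching described above (an assumption for
    illustration); it is not how ovsdb-server is implemented.
    """
    merged = []
    creates = {}  # row_id -> Notification for a still-pending create
    for note in pending:
        earlier = creates.get(note.row_id)
        if note.kind == "update" and earlier is not None:
            # The update is folded into the create that has not been sent yet.
            earlier.columns.update(note.columns)
            continue
        copy = Notification(note.kind, note.row_id, dict(note.columns))
        if copy.kind == "create":
            creates[copy.row_id] = copy
        merged.append(copy)
    return merged


def update_only_handler_fires(note, our_chassis):
    """Stand-in for a handler that only watches "update" notifications."""
    return note.kind == "update" and note.columns.get("chassis") == our_chassis


backlog = [
    Notification("create", "port-1", {"chassis": None}),
    Notification("update", "port-1", {"chassis": "compute-1"}),
]

# No backlog: both notifications are delivered and the update is seen.
fired_unloaded = [update_only_handler_fires(n, "compute-1") for n in backlog]

# Backlog: the client sees a single "create" with the chassis already set,
# so the update-only handler never fires for this port.
delivered = coalesce(backlog)
fired_loaded = [update_only_handler_fires(n, "compute-1") for n in delivered]
```

In this sketch the handler fires for the separate update but never for the merged create, which matches the symptom reported in this bug.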

Changed in neutron:
assignee: Lucas Alvares Gomes (lucasagomes) → Terry Wilson (otherwiseguy)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/903796

Revision history for this message
yatin (yatinkarel) wrote : Re: OVN: ovnmeta namespaces missing during scalability test causing DHCP issues

<< What this means is that our PortBindingUpdatedEvent (or PortBindingChassisCreatedEvent) which looks for "update" events don't fire when we get a Port_Binding "create" that has the chassis field set.

The behavior looks similar to what we saw in https://bugzilla.redhat.com/show_bug.cgi?id=2214289 for some LogicalSwitch events

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/2023.2)

Fix proposed to branch: stable/2023.2
Review: https://review.opendev.org/c/openstack/neutron/+/904715

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/2023.1)

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/neutron/+/904716

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/903796
Committed: https://opendev.org/openstack/neutron/commit/a641e8aec09c1e33a15a34b19d92675ed2c85682
Submitter: "Zuul (22348)"
Branch: master

commit a641e8aec09c1e33a15a34b19d92675ed2c85682
Author: Terry Wilson <email address hidden>
Date: Fri Dec 15 21:00:43 2023 +0000

    Handle creation of Port_Binding with chassis set

    When there is a backlog of notifications to be sent, it is possible
    that ovsdb-server will merge insert and update notifications. Due
    to this, we need to handle the situation where we see a Port_Binding
    created with the chassis set.

    Closes-Bug: #2017748

    Change-Id: Idfae87cf6c60e9e18ede91ea20857cea5322738c

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/2023.2)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/904715
Committed: https://opendev.org/openstack/neutron/commit/e9cf2fd6cca8a3d5c06bcb073cb310cd61208b41
Submitter: "Zuul (22348)"
Branch: stable/2023.2

commit e9cf2fd6cca8a3d5c06bcb073cb310cd61208b41
Author: Terry Wilson <email address hidden>
Date: Fri Dec 15 21:00:43 2023 +0000

    Handle creation of Port_Binding with chassis set

    When there is a backlog of notifications to be sent, it is possible
    that ovsdb-server will merge insert and update notifications. Due
    to this, we need to handle the situation where we see a Port_Binding
    created with the chassis set.

    Closes-Bug: #2017748

    Change-Id: Idfae87cf6c60e9e18ede91ea20857cea5322738c
    (cherry picked from commit a641e8aec09c1e33a15a34b19d92675ed2c85682)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/904716
Committed: https://opendev.org/openstack/neutron/commit/b992d639b974f35612d6bb0057f35c452129aed3
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit b992d639b974f35612d6bb0057f35c452129aed3
Author: Terry Wilson <email address hidden>
Date: Fri Dec 15 21:00:43 2023 +0000

    Handle creation of Port_Binding with chassis set

    When there is a backlog of notifications to be sent, it is possible
    that ovsdb-server will merge insert and update notifications. Due
    to this, we need to handle the situation where we see a Port_Binding
    created with the chassis set.

    Closes-Bug: #2017748

    Change-Id: Idfae87cf6c60e9e18ede91ea20857cea5322738c
    (cherry picked from commit a641e8aec09c1e33a15a34b19d92675ed2c85682)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 24.0.0.0b1

This issue was fixed in the openstack/neutron 24.0.0.0b1 development milestone.

Revision history for this message
Seyeong Kim (seyeongkim) wrote (last edit ): Re: OVN: ovnmeta namespaces missing during scalability test causing DHCP issues

A customer has a similar issue, although I can't reproduce it in my local environment. I prepared a debdiff for Yoga.
Our support engineer pointed this out (patch 2) and it makes sense to backport.
As you can see in the description, it happens intermittently under high load. The customer has also faced this a few times and can't reproduce it even when they want to.

There are two commits inside the debdiff file

[PATCH 1/2] ovn-metadata: Refactor events
[PATCH 2/2] Handle creation of Port_Binding with chassis set

Patch 1 is needed because of massive conflicts.
Also, I dropped commit 2's changes to neutron/agent/ovn/extensions/qos_hwol.py.

This is the part I need to be careful with.

Releases above 2023.1 already have the above patches.

tags: added: sts
Changed in neutron (Ubuntu):
status: New → Fix Released
Seyeong Kim (seyeongkim)
description: updated
Seyeong Kim (seyeongkim)
Changed in neutron (Ubuntu Focal):
assignee: nobody → Seyeong Kim (seyeongkim)
Changed in neutron (Ubuntu Jammy):
assignee: nobody → Seyeong Kim (seyeongkim)
Hua Zhang (zhhuabj)
summary: - OVN: ovnmeta namespaces missing during scalability test causing DHCP
- issues
+ [SRU] OVN: ovnmeta namespaces missing during scalability test causing
+ DHCP issues
Seyeong Kim (seyeongkim)
Changed in neutron (Ubuntu Jammy):
assignee: Seyeong Kim (seyeongkim) → nobody
Changed in neutron (Ubuntu Focal):
assignee: Seyeong Kim (seyeongkim) → nobody
Revision history for this message
Hua Zhang (zhhuabj) wrote :

After carefully reviewing many patches, I finally backported the following 3 patches to Yoga.

[PATCH 1/3] 686698284b Update tap ip in metadata agent when metadata port ip updated
[PATCH 2/3] 6205158831 ovn-metadata: Refactor events
[PATCH 3/3] b992d639b9 Handle creation of Port_Binding with chassis set

I didn't backport the following patch edf48e46a1,

edf48e46a1 Improve agent provision performance for large networks

as doing so will introduce more dependent patches. Since we opted not to backport this patch, we have to address the code conflict within the provision_datapath function in patch 6205158831. The process of resolving code conflicts for this backport can be found here - https://paste.ubuntu.com/p/2V3SXVvsHx/

Due to the absence of a local reproducer, I only built a test package [1] and verified basic network functions. The test results [2] indicate successful functionality.

[1] https://launchpad.net/~zhhuabj/+archive/ubuntu/focal-yoga-test
[2] https://paste.ubuntu.com/p/m9vp3TJgyv/

Revision history for this message
Brian Haley (brian-haley) wrote :

Sorry, just clicked the wrong buttons, trying to get this targeted to the UCA back to Ussuri.

no longer affects: neutron
Hua Zhang (zhhuabj)
no longer affects: cloud-archive/zed
Revision history for this message
Hua Zhang (zhhuabj) wrote :

One of our customers helped test focal_yoga.debdiff on one isolated compute host in their test environment. After installation they created about 100 networks, routers and VMs, which were spawned on this isolated compute host. They haven't seen any issues so far with VM creation (all the VMs were created successfully).

Considering this backport involves code refactoring, I do not intend to backport it to Ussuri.

Revision history for this message
Hua Zhang (zhhuabj) wrote :

Please pause the current SRU work for now, as I encountered a TypeError (https://paste.ubuntu.com/p/bKh59QJJf8/) when testing the following 3 backported patches.

[PATCH 1/3] 686698284b Update tap ip in metadata agent when metadata port ip updated
[PATCH 2/3] 6205158831 ovn-metadata: Refactor events
[PATCH 3/3] b992d639b9 Handle creation of Port_Binding with chassis set

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 22.2.0

This issue was fixed in the openstack/neutron 22.2.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 23.2.0

This issue was fixed in the openstack/neutron 23.2.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (unmaintained/zed)

Fix proposed to branch: unmaintained/zed
Review: https://review.opendev.org/c/openstack/neutron/+/926666

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (unmaintained/yoga)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/926656
Committed: https://opendev.org/openstack/neutron/commit/952e960414e7c15d4d4351bf2300ce53a69e4051
Submitter: "Zuul (22348)"
Branch: unmaintained/yoga

commit 952e960414e7c15d4d4351bf2300ce53a69e4051
Author: Terry Wilson <email address hidden>
Date: Tue Aug 20 10:20:52 2024 -0500

    Handle creation of Port_Binding with chassis set

    When there is a backlog of notifications to be sent, it is possible
    that ovsdb-server will merge insert and update notifications. Due
    to this, we need to handle the situation where we see a Port_Binding
    created with the chassis set.

    Closes-Bug: #2017748
    Change-Id: Idfae87cf6c60e9e18ede91ea20857cea5322738c

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (unmaintained/zed)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/926666
Committed: https://opendev.org/openstack/neutron/commit/7bfbd4c88ff02000da73b1455cb43fb4f2c72107
Submitter: "Zuul (22348)"
Branch: unmaintained/zed

commit 7bfbd4c88ff02000da73b1455cb43fb4f2c72107
Author: Terry Wilson <email address hidden>
Date: Tue Aug 20 10:20:52 2024 -0500

    Handle creation of Port_Binding with chassis set

    When there is a backlog of notifications to be sent, it is possible
    that ovsdb-server will merge insert and update notifications. Due
    to this, we need to handle the situation where we see a Port_Binding
    created with the chassis set.

    Closes-Bug: #2017748
    Change-Id: Idfae87cf6c60e9e18ede91ea20857cea5322738c

tags: added: in-unmaintained-zed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/881487
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Hua Zhang (zhhuabj)
description: updated
Revision history for this message
Edward Hope-Morley (hopem) wrote :

Deleted the previously submitted debdiff, as per @haleyb, since it needs resubmitting now that the upstream backports have been done.

Hua Zhang (zhhuabj)
Changed in neutron (Ubuntu Oracular):
status: New → Fix Released
Changed in neutron (Ubuntu Noble):
status: New → Fix Released
Hua Zhang (zhhuabj)
Changed in neutron (Ubuntu Focal):
status: New → Won't Fix
Changed in neutron (Ubuntu Jammy):
assignee: nobody → Hua Zhang (zhhuabj)
status: New → In Progress
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote (last edit ):

Hi Joshua,

Thanks for your work on this!

Regarding the debdiff for SRU to Jammy in comment #27.

1) Please fix the version for stable releases (increment of ubuntu0.1, usually), not development series (increment of ubuntu1, usually).
See the versioning reference [1].

2) Please add DEP-3 headers to the .patch file.
See the DEP-3 reference [2].

In practice, if you start with a git formatted patch, as you did, which already has 'Subject:' (and some description in the commit message), and 'From:', you should only add these two headers:
Bug-Ubuntu: https://launchpad.net/bugs/<NUMBER>
Origin: [<upstream|backport>,] https://git-commit-url
(e.g., see the `Committed:` link in comment #23)

The optional (but recommended) keyword 'upstream' is for a commit taken from the upstream git repo/branch with no changes required, and 'backport' is for a commit that required changes. Note that an unchanged patch from an upstream stable/backport branch is still considered 'upstream', not 'backport' (regardless of upstream branch type).
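
As an illustration of the two headers described above, the following sketch only formats the `Bug-Ubuntu:` and `Origin:` lines. `dep3_headers` is a hypothetical helper, not part of any packaging tool, and where the lines go inside the patch file is left to the DEP-3 reference [2]. The commit URL in the example is the unmaintained/yoga commit linked earlier in this bug; treating it as 'upstream' assumes the patch is applied unchanged.

```python
def dep3_headers(bug_number, commit_url, kind=None):
    """Return the Bug-Ubuntu and Origin header lines for a patch.

    kind is the optional keyword from the comment above: "upstream" for
    an unchanged commit, "backport" for a modified one, or None to omit it.
    """
    if kind not in (None, "upstream", "backport"):
        raise ValueError("kind must be 'upstream', 'backport' or None")
    origin = commit_url if kind is None else f"{kind}, {commit_url}"
    return [
        f"Bug-Ubuntu: https://launchpad.net/bugs/{bug_number}",
        f"Origin: {origin}",
    ]


example = dep3_headers(
    2017748,
    "https://opendev.org/openstack/neutron/commit/"
    "952e960414e7c15d4d4351bf2300ce53a69e4051",
    kind="upstream",
)
print("\n".join(example))
```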

3) Regarding the 'Test Case' section, thanks for running a full regression test suite!

However, is there really no way to verify these code changes are behaving as expected?

Rodrigo recently confirmed that synthetic/mock tests can do that, and/or maybe you can
exercise OVN directly through commands (sorry, I don't know much about OVN) and verify
with logs (or attach debugger) that the different code path/behavior are as expected.

I'll mark Jammy as Incomplete for now -- please revert it once you address the above.

Thank you!

[1] https://wiki.ubuntu.com/SecurityTeam/UpdatePreparation#Update_the_packaging
[2] https://dep-team.pages.debian.net/deps/dep3/

Changed in neutron (Ubuntu Jammy):
status: In Progress → Incomplete
Revision history for this message
Hua Zhang (zhhuabj) wrote :

Hi Mauricio,

I have uploaded a new debdiff to address your comments. I now use the version 2:20.5.0-0ubuntu2.1 and have added DEP-3 headers. As for the 'Test Case' section, I have been trying to find a reproducer for the past two weeks, but all attempts have failed; it is quite difficult to reproduce the behavior where ovsdb merges insert and update notifications.

The essence of this fix patch is to address the problem through a retry mechanism.

Without this fix patch:
1, MetadataAgent#start calls sync, which then calls provision_datapath to trigger the initial creation of the metadata namespace.
2, only a ROW_UPDATE event in PortBindingChassisCreatedEvent can trigger provision_datapath a second time.
3, no further ROW_UPDATE event in PortBindingChassisCreatedEvent can trigger provision_datapath a third time, due to the condition (and not old.chassis); see comment [2].

class PortBindingChassisCreatedEvent(PortBindingChassisEvent):
    def __init__(self, metadata_agent):
        # Only "update" notifications are watched; inserts are ignored.
        events = (self.ROW_UPDATE,)
        super(PortBindingChassisCreatedEvent, self).__init__(metadata_agent, events)

    def match_fn(self, event, row, old):
        # Fires only when the chassis goes from unset to this agent's chassis.
        return (row.chassis[0].name == self.agent.chassis and not old.chassis)

As said in comment[2]:

What this means is that our PortBindingUpdatedEvent (or PortBindingChassisCreatedEvent) which looks for "update" events don't fire when we get a Port_Binding "create" that has the chassis field set.

Yes, that's because match_fn includes the condition 'not old.chassis',

1, openstack port create --network private --fixed-ip subnet=private_subnet $PORT_NAME
When creating a port, we can see an insert event via 'ovsdb monitor', but nothing from the neutron-metadata-agent log since PortBindingChassisCreatedEvent doesn't monitor ROW_INSERT

2, openstack server add port cirros-0.4.0-054427 $PORT_NAME
When adding a port to a VM, we can see an update event via 'ovsdb monitor' and logs from neutron-metadata-agent log as well because it meets the condition (row.chassis[0].name == self.agent.chassis and not old.chassis)

3, openstack port set $PORT_NAME --fixed-ip subnet=private_subnet,ip-address=192.168.21.$((RANDOM % 255 + 1))
When updating a port's IP, we can see an update event via 'ovsdb monitor', but nothing from neutron-metadata-agent log because it doesn't meet the condition (and not old.chassis)

In other words, the current condition (ROW_UPDATE and row.chassis[0].name == self.agent.chassis and not old.chassis) only gives provision_datapath one chance to run. If an issue occurs with ovsdb at this time, subsequent ovsdb update events like step 3 above will not give provision_datapath another chance to run.

The current fix patch [1] changes the condition from

ROW_UPDATE and row.chassis[0].name == self.agent.chassis and not old.chassis

to

(ROW_INSERT or ROW_UPDATE) and row.chassis[0].name == self.agent.chassis and not old.chassis

Thus, now provision_datapath has two chances to run, making this patch act as a retry mechanism, and of course it can solve the problem.
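
The two conditions quoted above can be compared side by side with a small standalone sketch. This is not the neutron patch itself: `match_before`, `match_after` and the `ROW_INSERT`/`ROW_UPDATE` strings are local stand-ins, rows are modelled with `SimpleNamespace`, and the handling of `old` on an insert (no previous row state to compare) is an assumption made for illustration.

```python
from types import SimpleNamespace

# Local stand-ins for the event kinds named in the comment above.
ROW_INSERT = "insert"
ROW_UPDATE = "update"


def bound_to_us(row, our_chassis):
    """True when the row has a chassis and it is this agent's chassis."""
    return bool(row.chassis) and row.chassis[0].name == our_chassis


def match_before(event, row, old, our_chassis):
    """Update-only condition, as quoted in the comment above."""
    return (event == ROW_UPDATE and bound_to_us(row, our_chassis)
            and not old.chassis)


def match_after(event, row, old, our_chassis):
    """Condition that also accepts an insert with the chassis already set."""
    if event == ROW_INSERT:
        # Assumption: an insert carries no previous row state, so only
        # the new row is checked.
        return bound_to_us(row, our_chassis)
    return match_before(event, row, old, our_chassis)


chassis = SimpleNamespace(name="compute-1")
bound = SimpleNamespace(chassis=[chassis])
unbound = SimpleNamespace(chassis=[])

# Merged notification: a single insert that already has the chassis set.
missed = match_before(ROW_INSERT, bound, None, "compute-1")
caught = match_after(ROW_INSERT, bound, None, "compute-1")

# Normal binding (chassis goes from unset to ours) still matches.
normal = match_after(ROW_UPDATE, bound, unbound, "compute-1")

# An update on an already bound port (like step 3 above) still does not.
rebound = match_after(ROW_UPDATE, bound, bound, "compute-1")
```

Under these assumptions the merged insert is missed by the old condition and caught by the new one, while the behavior for ordinary updates is unchanged.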

Anyway, since I cannot reproduce the problem, I can only theoretically say that this patch can solve the problem as mentioned above. Even if it cannot s...

Changed in neutron (Ubuntu Jammy):
status: Incomplete → In Progress
Changed in neutron (Ubuntu Focal):
status: Won't Fix → In Progress
Revision history for this message
Hua Zhang (zhhuabj) wrote :