[SRU] OVN: ovnmeta namespaces missing during scalability test causing DHCP issues

Bug #2017748 reported by Lucas Alvares Gomes
This bug affects 2 people
Affects                 Status                    Importance  Assigned to   Milestone
Ubuntu Cloud Archive    Fix Released              Undecided   Unassigned
  Antelope              Won't Fix                 Undecided   Unassigned
  Bobcat                Won't Fix                 Undecided   Unassigned
  Caracal               Fix Released              Undecided   Unassigned
  Dalmatian             Fix Released              Undecided   Unassigned
  Epoxy                 Fix Released              Undecided   Unassigned
  Yoga                  Fix Released              Undecided   Unassigned
  Zed                   Won't Fix                 Undecided   Unassigned
neutron                 Status tracked in Ussuri
  Ussuri                Fix Released              High        Terry Wilson
  Victoria              New                       Undecided   Unassigned
  Wallaby               New                       Undecided   Unassigned
  Xena                  New                       Undecided   Unassigned
neutron (Ubuntu)        Fix Released              Undecided   Unassigned
  Jammy                 Fix Released              Undecided   Hua Zhang
  Noble                 Fix Released              Undecided   Unassigned
  Oracular              Fix Released              Undecided   Unassigned
  Plucky                Fix Released              Undecided   Unassigned

Bug Description

[Impact]

During scalability tests where extreme load is generated by creating thousands
of VMs all at the same time, some VMs fail to get a DHCP lease and cannot be
pinged or sshed to after deployment.

The ovnmeta namespaces for networks that the VMs were created in are missing.
The following lines are present in neutron-ovn-metadata-agent.log:

2024-02-29 03:33:18.297 1080866 INFO neutron.agent.ovn.metadata.agent [-] Port 9a75c431-42c4-47bf-af0d-22e0d5ee11a8 in datapath 3be5f44d-39de-4c38-a77f-06c0d9ee42b0 bound to our chassis
2024-02-29 03:33:18.306 1080866 DEBUG neutron.agent.ovn.metadata.agent [-] There is no metadata port for network 3be5f44d-39de-4c38-a77f-06c0d9ee42b0 or it has no MAC or IP addresses configured, tearing the namespace down if needed _get_provision_params /usr/lib/python3/dist-packages/neutron/agent/ovn/metadata/agent.py:494

What is happening is that under extreme load, the metadata port information
has sometimes not yet been propagated by OVN to the Southbound database, where
it usually arrives as an update notification. When PortBindingChassisEvent is
triggered in ovn-metadata-agent, it only looks for update notifications, finds
none, and so doesn't know any metadata port or IP information. It fails, logs
the message above, and tears down the ovnmeta namespace for that VM's network.

Eventually ovsdb-server catches up, merges the insert and update notifications,
and sends them out as an insert notification, which PortBindingChassisEvent
currently ignores, so the metadata is never applied to the VM.

This is a race condition, and it doesn't happen under normal conditions, as the
metadata would just be delivered as an update notification.

The fix is to also listen for insert notifications, and act on them.
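
The intent of the fix can be illustrated with a minimal, self-contained sketch. This is not the neutron code: `Row`, `should_provision`, and the string event names are hypothetical stand-ins, and the sketch assumes only that an insert notification has no previous row to compare against.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical event names for this sketch; not taken from neutron or ovsdbapp.
INSERT = "insert"
UPDATE = "update"


@dataclass
class Row:
    """Simplified stand-in for a Port_Binding row; chassis is None when unbound."""
    logical_port: str
    chassis: Optional[str] = None


def should_provision(event: str, row: Row, old: Optional[Row],
                     our_chassis: str,
                     watched: Tuple[str, ...] = (INSERT, UPDATE)) -> bool:
    """Return True when the notification means the port was just bound to us."""
    if event not in watched:
        return False
    if row.chassis != our_chassis:
        return False
    if event == UPDATE:
        # An update only counts if the chassis was previously unset.
        return old is not None and old.chassis is None
    # An insert that already carries our chassis is the merged
    # insert+update case described above.
    return True


merged = Row("port-1", chassis="compute-1")
# Only updates watched: the merged insert is ignored.
print(should_provision(INSERT, merged, None, "compute-1", watched=(UPDATE,)))
# Inserts watched too: the merged insert triggers provisioning.
print(should_provision(INSERT, merged, None, "compute-1"))
```

The `watched` parameter exists only to contrast the two behaviours in one function; a real event class would fix its event list at construction time.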

[Test Case]

This can't be reproduced in the lab, even after many attempts.

A user sees this issue daily in production, where they run a scalability test
every night, in which they create a new tenant, create all necessary resources
(networks, subnets, routers, load balancers, etc.) and start several thousand
VMs. They then audit the deployment and verify that everything deployed
correctly.

Most days there are a small number of VMs that are unreachable, and those VMs
have the following messages in neutron-ovn-metadata-agent.log:

2024-02-29 03:33:18.306 1080866 DEBUG neutron.agent.ovn.metadata.agent [-] There is no metadata port for network 3be5f44d-39de-4c38-a77f-06c0d9ee42b0 or it has no MAC or IP addresses configured, tearing the namespace down if needed _get_provision_params /usr/lib/python3/dist-packages/neutron/agent/ovn/metadata/agent.py:494
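
When auditing a deployment, the message above can be searched for mechanically. The following is a rough sketch, assuming the agent logs at DEBUG level and that the message text matches the line quoted above; `affected_networks` is a hypothetical helper, not part of neutron. The same message can also appear for networks that legitimately have no metadata port, so a hit is a hint rather than proof of this bug.

```python
import re
from typing import Iterable, List

# Matches the DEBUG message quoted above and captures the network UUID.
NO_METADATA_PORT = re.compile(
    r"There is no metadata port for network "
    r"(?P<network>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})"
)


def affected_networks(lines: Iterable[str]) -> List[str]:
    """Return the unique network IDs, in first-seen order, that hit the message."""
    seen: List[str] = []
    for line in lines:
        match = NO_METADATA_PORT.search(line)
        if match and match.group("network") not in seen:
            seen.append(match.group("network"))
    return seen


sample = [
    "2024-02-29 03:33:18.306 1080866 DEBUG neutron.agent.ovn.metadata.agent [-] "
    "There is no metadata port for network 3be5f44d-39de-4c38-a77f-06c0d9ee42b0 "
    "or it has no MAC or IP addresses configured, tearing the namespace down "
    "if needed",
]
print(affected_networks(sample))
```

To scan a real log, pass an open file object instead of `sample`; the location of neutron-ovn-metadata-agent.log depends on the deployment.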

There are test packages available in:

https://launchpad.net/~mruffell/+archive/ubuntu/sf375454-updates

Some previous test packages have been running in the user's test environment for
several months, with zero metadata namespace issues since rollout. We issued
the user a hotfix and it has been running in production for the past month
and they have also had zero metadata namespace issues since rollout.

When this enters -proposed, it will be verified in the user's production
environment and subject to their nightly runs of their scalability tests, with
the results collected after a week or so of runs. After that we should be
confident the -proposed packages fix the issue.

Additionally, runs will be done with charmed-openstack-tester between -updates and -proposed to see if there are any differences in test execution.

[Where problems could occur]

We are changing ovn-metadata-agent in neutron, and any issues would be limited
to ovn-metadata-agent only. ovn-metadata-agent will now listen for both
insert and update notifications from ovsdb-server, instead of just update
notifications as before. It shouldn't impact any existing functionality.

If a regression were to occur, it would affect attaching metadata namespaces to
newly created VMs, preventing them from getting their initial metadata URL /
DHCP lease / IP address information and causing connectivity issues for
newly created VMs. It shouldn't impact any existing VMs.

There are no workarounds if a regression were to occur, other than to downgrade
the package.

[Other info]

This was fixed upstream by:

commit a641e8aec09c1e33a15a34b19d92675ed2c85682
From: Terry Wilson <email address hidden>
Date: Fri, 15 Dec 2023 21:00:43 +0000
Subject: Handle creation of Port_Binding with chassis set
Link: https://opendev.org/openstack/neutron/commit/a641e8aec09c1e33a15a34b19d92675ed2c85682

This patch landed in Caracal. The patch is also needed for Zed, Antelope and
Bobcat, but there it depends on the following commit:

commit 6801589510242affc78497660d34377603774074
From: Jakub Libosvar <email address hidden>
Date: Thu, 21 Sep 2023 19:40:36 +0000
Subject: ovn-metadata: Refactor events
Link: https://opendev.org/openstack/neutron/commit/6801589510242affc78497660d34377603774074

After some discussion, we (mruffell, brian-haley, hopem) decided that it would
be too much of a regression risk to backport "ovn-metadata: Refactor events"
to Zed, Antelope and Bobcat, so we marked these "Won't Fix".

The user is on yoga, so Brian Haley wrote a new backport that does not depend
on "ovn-metadata: Refactor events". It is the following commit in neutron yoga:

commit 952e960414e7c15d4d4351bf2300ce53a69e4051
From: Terry Wilson <email address hidden>
Date: Tue, 20 Aug 2024 10:20:52 -0500
Subject: Handle creation of Port_Binding with chassis set
Link: https://opendev.org/openstack/neutron/commit/952e960414e7c15d4d4351bf2300ce53a69e4051

This is what we are suggesting for SRU to jammy / yoga.

There is a low chance of an upgrade regression for users going from yoga -> zed
-> antelope -> bobcat -> caracal (fixed), because users are unlikely to run
heavy stress tests during a series upgrade and would more likely run them once
they land on caracal.

If we have to, we will consider zed, antelope and bobcat in the future, but for
now, yoga only.

== ORIGINAL DESCRIPTION ==

Reported at: https://bugzilla.redhat.com/show_bug.cgi?id=2187650

During a scalability test it was noted that a few VMs were having issues being pinged (2 out of ~5000 VMs in the test conducted). After some investigation it was found that the VMs in question did not receive a DHCP lease:

udhcpc: no lease, failing
FAIL
checking http://169.254.169.254/2009-04-04/instance-id
failed 1/20: up 181.90. request failed

And the ovnmeta- namespaces for the networks that the VMs were booting from were missing. Looking into the ovn-metadata-agent.log:

2023-04-18 06:56:09.864 353474 DEBUG neutron.agent.ovn.metadata.agent [-] There is no metadata port for network 9029c393-5c40-4bf2-beec-27413417eafa or it has no MAC or IP addresses configured, tearing the namespace down if needed _get_provision_params /usr/lib/python3.9/site-packages/neutron/agent/ovn/metadata/agent.py:495

Apparently, when the system is under stress (scalability tests) there are some edge cases where the metadata port information has not yet been propagated by OVN to the Southbound database. When the PortBindingChassisEvent event is handled and tries to find either the metadata port or the IP information on it (which is updated by ML2/OVN during subnet creation), it cannot be found, and the handler fails silently with the message shown above.

Note that running the same tests with less concurrency did not trigger this issue, so it only happens when the system is overloaded.


Changed in neutron:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/881487
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
Terry Wilson (otherwiseguy) wrote : Re: OVN: ovnmeta namespaces missing during scalability test causing DHCP issues

We've done some internal testing and what we see is that ovsdb-server, when there is a backlog sending out event notifications, batches updates and can merge "insert" and "update" operations that happen close together. This is intended behavior.

What this means is that our PortBindingUpdatedEvent (or PortBindingChassisCreatedEvent), which looks for "update" events, doesn't fire when we get a Port_Binding "create" that has the chassis field set.

I'm working on a fix.
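
The merging behaviour described in this comment can be pictured with a toy model. This is only a sketch of the idea, not how ovsdb-server is implemented: `coalesce` and the tuple layout are hypothetical, and deletes are ignored for brevity.

```python
from typing import Dict, List, Tuple

# A notification is (operation, row_uuid, columns); the layout is hypothetical.
Notification = Tuple[str, str, Dict[str, object]]


def coalesce(pending: List[Notification]) -> List[Notification]:
    """Merge queued notifications per row, as a backlogged sender might.

    An "insert" followed by "update"s for the same row collapses into a single
    "insert" carrying the latest column values.
    """
    merged: Dict[str, Tuple[str, Dict[str, object]]] = {}
    for op, uuid, columns in pending:
        if uuid in merged:
            first_op, current = merged[uuid]
            # Keep the first operation; fold in the newer column values.
            merged[uuid] = (first_op, {**current, **columns})
        else:
            merged[uuid] = (op, dict(columns))
    return [(op, uuid, cols) for uuid, (op, cols) in merged.items()]


backlog = [
    ("insert", "pb-1", {"logical_port": "port-1", "chassis": None}),
    ("update", "pb-1", {"chassis": "compute-1"}),
]
print(coalesce(backlog))
```

With the backlog above, a client sees one "insert" whose chassis is already set and never sees a separate "update", which is exactly the case an update-only event handler misses.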

Changed in neutron:
assignee: Lucas Alvares Gomes (lucasagomes) → Terry Wilson (otherwiseguy)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/903796

Revision history for this message
yatin (yatinkarel) wrote : Re: OVN: ovnmeta namespaces missing during scalability test causing DHCP issues

<< What this means is that our PortBindingUpdatedEvent (or PortBindingChassisCreatedEvent) which looks for "update" events don't fire when we get a Port_Binding "create" that has the chassis field set.

The behavior looks similar to what we saw in https://bugzilla.redhat.com/show_bug.cgi?id=2214289 for some LogicalSwitch events

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/2023.2)

Fix proposed to branch: stable/2023.2
Review: https://review.opendev.org/c/openstack/neutron/+/904715

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/2023.1)

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/neutron/+/904716

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/903796
Committed: https://opendev.org/openstack/neutron/commit/a641e8aec09c1e33a15a34b19d92675ed2c85682
Submitter: "Zuul (22348)"
Branch: master

commit a641e8aec09c1e33a15a34b19d92675ed2c85682
Author: Terry Wilson <email address hidden>
Date: Fri Dec 15 21:00:43 2023 +0000

    Handle creation of Port_Binding with chassis set

    When there is a backlog of notifications to be sent, it is possible
    that ovsdb-server will merge insert and update notifications. Due
    to this, we need to handle the situation where we see a Port_Binding
    created with the chassis set.

    Closes-Bug: #2017748

    Change-Id: Idfae87cf6c60e9e18ede91ea20857cea5322738c

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/2023.2)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/904715
Committed: https://opendev.org/openstack/neutron/commit/e9cf2fd6cca8a3d5c06bcb073cb310cd61208b41
Submitter: "Zuul (22348)"
Branch: stable/2023.2

commit e9cf2fd6cca8a3d5c06bcb073cb310cd61208b41
Author: Terry Wilson <email address hidden>
Date: Fri Dec 15 21:00:43 2023 +0000

    Handle creation of Port_Binding with chassis set

    When there is a backlog of notifications to be sent, it is possible
    that ovsdb-server will merge insert and update notifications. Due
    to this, we need to handle the situation where we see a Port_Binding
    created with the chassis set.

    Closes-Bug: #2017748

    Change-Id: Idfae87cf6c60e9e18ede91ea20857cea5322738c
    (cherry picked from commit a641e8aec09c1e33a15a34b19d92675ed2c85682)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/904716
Committed: https://opendev.org/openstack/neutron/commit/b992d639b974f35612d6bb0057f35c452129aed3
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit b992d639b974f35612d6bb0057f35c452129aed3
Author: Terry Wilson <email address hidden>
Date: Fri Dec 15 21:00:43 2023 +0000

    Handle creation of Port_Binding with chassis set

    When there is a backlog of notifications to be sent, it is possible
    that ovsdb-server will merge insert and update notifications. Due
    to this, we need to handle the situation where we see a Port_Binding
    created with the chassis set.

    Closes-Bug: #2017748

    Change-Id: Idfae87cf6c60e9e18ede91ea20857cea5322738c
    (cherry picked from commit a641e8aec09c1e33a15a34b19d92675ed2c85682)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 24.0.0.0b1

This issue was fixed in the openstack/neutron 24.0.0.0b1 development milestone.

Revision history for this message
Seyeong Kim (seyeongkim) wrote (last edit ): Re: OVN: ovnmeta namespaces missing during scalability test causing DHCP issues

A customer has a similar issue. Although I can't reproduce this in my local environment, I prepared a debdiff for yoga.
Our support engineer pointed this out (patch 2) and it makes sense to backport.
As you can see from the description, it happens intermittently under high load. The customer has also hit this a few times and can't reproduce it even when they want to.

There are two commits inside the debdiff file

[PATCH 1/2] ovn-metadata: Refactor events
[PATCH 2/2] Handle creation of Port_Binding with chassis set

Patch 1 is needed because of massive conflicts.
Also, I removed commit 2's neutron/agent/ovn/extensions/qos_hwol.py

This could be code I need to be careful with.

2023.1 and above already have the above patches.

tags: added: sts
Changed in neutron (Ubuntu):
status: New → Fix Released
Seyeong Kim (seyeongkim)
description: updated
Seyeong Kim (seyeongkim)
Changed in neutron (Ubuntu Focal):
assignee: nobody → Seyeong Kim (seyeongkim)
Changed in neutron (Ubuntu Jammy):
assignee: nobody → Seyeong Kim (seyeongkim)
Hua Zhang (zhhuabj)
summary: - OVN: ovnmeta namespaces missing during scalability test causing DHCP
- issues
+ [SRU] OVN: ovnmeta namespaces missing during scalability test causing
+ DHCP issues
Seyeong Kim (seyeongkim)
Changed in neutron (Ubuntu Jammy):
assignee: Seyeong Kim (seyeongkim) → nobody
Changed in neutron (Ubuntu Focal):
assignee: Seyeong Kim (seyeongkim) → nobody
Revision history for this message
Hua Zhang (zhhuabj) wrote :

After carefully reviewing many patches, I finally backported the following 3 patches to Yoga.

[PATCH 1/3] 686698284b Update tap ip in metadata agent when metadata port ip updated
[PATCH 2/3] 6205158831 ovn-metadata: Refactor events
[PATCH 3/3] b992d639b9 Handle creation of Port_Binding with chassis set

I didn't backport the following patch edf48e46a1,

edf48e46a1 Improve agent provision performance for large networks

as doing so will introduce more dependent patches. Since we opted not to backport this patch, we have to address the code conflict within the provision_datapath function in patch 6205158831. The process of resolving code conflicts for this backport can be found here - https://paste.ubuntu.com/p/2V3SXVvsHx/

Due to the absence of a local reproducer, I only built a test package [1] and verified basic network functions. The test results [2] indicate successful functionality.

[1] https://launchpad.net/~zhhuabj/+archive/ubuntu/focal-yoga-test
[2] https://paste.ubuntu.com/p/m9vp3TJgyv/

Revision history for this message
Brian Haley (brian-haley) wrote :

Sorry, just clicked the wrong buttons, trying to get this targeted to the UCA back to Ussuri.

no longer affects: neutron
Hua Zhang (zhhuabj)
no longer affects: cloud-archive/zed
Revision history for this message
Hua Zhang (zhhuabj) wrote :

One of our customers helped test focal_yoga.debdiff on one isolated compute host in their test env. Post installation, they created about 100 networks, routers and VMs that were spawned on this isolated compute host. They haven't seen any issues so far with VM creation (all the VMs were created successfully).

Considering this backport involves code refactoring, I do not intend to backport it to Ussuri.

Revision history for this message
Hua Zhang (zhhuabj) wrote :

Please pause the current SRU work for now, as I encountered a TypeError (https://paste.ubuntu.com/p/bKh59QJJf8/) when testing the following 3 backported patches.

[PATCH 1/3] 686698284b Update tap ip in metadata agent when metadata port ip updated
[PATCH 2/3] 6205158831 ovn-metadata: Refactor events
[PATCH 3/3] b992d639b9 Handle creation of Port_Binding with chassis set

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 22.2.0

This issue was fixed in the openstack/neutron 22.2.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 23.2.0

This issue was fixed in the openstack/neutron 23.2.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (unmaintained/zed)

Fix proposed to branch: unmaintained/zed
Review: https://review.opendev.org/c/openstack/neutron/+/926666

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (unmaintained/yoga)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/926656
Committed: https://opendev.org/openstack/neutron/commit/952e960414e7c15d4d4351bf2300ce53a69e4051
Submitter: "Zuul (22348)"
Branch: unmaintained/yoga

commit 952e960414e7c15d4d4351bf2300ce53a69e4051
Author: Terry Wilson <email address hidden>
Date: Tue Aug 20 10:20:52 2024 -0500

    Handle creation of Port_Binding with chassis set

    When there is a backlog of notifications to be sent, it is possible
    that ovsdb-server will merge insert and update notifications. Due
    to this, we need to handle the situation where we see a Port_Binding
    created with the chassis set.

    Closes-Bug: #2017748
    Change-Id: Idfae87cf6c60e9e18ede91ea20857cea5322738c

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (unmaintained/zed)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/926666
Committed: https://opendev.org/openstack/neutron/commit/7bfbd4c88ff02000da73b1455cb43fb4f2c72107
Submitter: "Zuul (22348)"
Branch: unmaintained/zed

commit 7bfbd4c88ff02000da73b1455cb43fb4f2c72107
Author: Terry Wilson <email address hidden>
Date: Tue Aug 20 10:20:52 2024 -0500

    Handle creation of Port_Binding with chassis set

    When there is a backlog of notifications to be sent, it is possible
    that ovsdb-server will merge insert and update notifications. Due
    to this, we need to handle the situation where we see a Port_Binding
    created with the chassis set.

    Closes-Bug: #2017748
    Change-Id: Idfae87cf6c60e9e18ede91ea20857cea5322738c

tags: added: in-unmaintained-zed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/881487
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Hua Zhang (zhhuabj)
description: updated
Revision history for this message
Edward Hope-Morley (hopem) wrote :

Deleted debdiff previously submitted as per @haleyb since they need resubmitting now that the upstream backports have been done.

Hua Zhang (zhhuabj)
Changed in neutron (Ubuntu Oracular):
status: New → Fix Released
Changed in neutron (Ubuntu Noble):
status: New → Fix Released
Hua Zhang (zhhuabj)
Changed in neutron (Ubuntu Focal):
status: New → Won't Fix
Changed in neutron (Ubuntu Jammy):
assignee: nobody → Hua Zhang (zhhuabj)
status: New → In Progress
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote (last edit ):

Hi Joshua,

Thanks for your work on this!

Regarding the debdiff for SRU to Jammy in comment #27.

1) Please fix the version for stable releases (increment of ubuntu0.1, usually), not development series (increment of ubuntu1, usually).
See the versioning reference [1].

2) Please add DEP-3 headers to the .patch file.
See the DEP-3 reference [2].

In practice, if you start with a git formatted patch, as you did, which already has 'Subject:' (and some description in the commit message), and 'From:', you should only add these two headers:
Bug-Ubuntu: https://launchpad.net/bugs/<NUMBER>
Origin: [<upstream|backport>,] https://git-commit-url
(e.g., see the `Committed:` link in comment #23)

The optional (but recommended) keyword 'upstream' is for no changes required to the commit taken from the upstream git repo/branch, and 'backport' is for any changes required. Note that an unchanged patch from an upstream stable/backport branch is still considered 'upstream', not 'backport' (regardless of upstream branch type).

3) Regarding the 'Test Case' section, thanks for running a full regression test suite!

However, is there really no way to verify these code changes are behaving as expected?

Rodrigo recently confirmed that synthetic/mock tests can do that, and/or maybe you can
exercise OVN directly through commands (sorry, I don't know much about OVN) and verify
with logs (or attach debugger) that the different code path/behavior are as expected.

I'll mark Jammy as Incomplete for now -- please revert it once you address the above.

Thank you!

[1] https://wiki.ubuntu.com/SecurityTeam/UpdatePreparation#Update_the_packaging
[2] https://dep-team.pages.debian.net/deps/dep3/
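
The header requirements from point 2 above can be checked mechanically before uploading. This is a minimal sketch, assuming a git-formatted patch whose header block ends at the first `---` or `diff ` line; `missing_headers` is a hypothetical helper, and the example values are assembled from this report rather than copied from the uploaded patch.

```python
from typing import List

# Headers named in the review above; the full DEP-3 specification defines more.
REQUIRED = ("Subject", "From", "Bug-Ubuntu", "Origin")


def missing_headers(patch_text: str) -> List[str]:
    """Return the required header names absent from the patch's header block."""
    present = set()
    for line in patch_text.splitlines():
        if line.startswith("---") or line.startswith("diff "):
            break  # the header block ends where the diff begins
        name, sep, _ = line.partition(":")
        if sep and name in REQUIRED:
            present.add(name)
    return [header for header in REQUIRED if header not in present]


# Illustrative values assembled from this report, not the uploaded patch itself.
example = """\
From: Terry Wilson <email address hidden>
Subject: Handle creation of Port_Binding with chassis set
Bug-Ubuntu: https://launchpad.net/bugs/2017748
Origin: upstream, https://opendev.org/openstack/neutron/commit/952e960414e7c15d4d4351bf2300ce53a69e4051

--- a/neutron/agent/ovn/metadata/agent.py
+++ b/neutron/agent/ovn/metadata/agent.py
"""
print(missing_headers(example))
```

The check only looks at header names, so it does not validate that the `Origin:` keyword ('upstream' or 'backport') is the right one for the patch.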

Changed in neutron (Ubuntu Jammy):
status: In Progress → Incomplete
Revision history for this message
Hua Zhang (zhhuabj) wrote :
Revision history for this message
Hua Zhang (zhhuabj) wrote :

Hi Mauricio,

I have uploaded a new debdiff to address your comments. I now use the version 2:20.5.0-0ubuntu2.1 and have added DEP-3 headers. As for the 'Test Case' section, I have been trying to find a reproducer for the past two weeks, but all attempts have failed. It's quite difficult to reproduce the behavior where ovsdb merges insert and update notifications.

The essence of this fix patch is to address the problem through a retry mechanism.

Without this fix patch:
1, MetadataAgent#start calls sync, which then calls provision_datapath to trigger the initial creation of the metadata namespace.
2, only a ROW_UPDATE event in PortBindingChassisCreatedEvent can trigger provision_datapath a second time.
3, no further ROW_UPDATE event in PortBindingChassisCreatedEvent can trigger provision_datapath a third time, due to the condition discussed in comment [2] (and not old.chassis).

class PortBindingChassisCreatedEvent(PortBindingChassisEvent):
    def __init__(self, metadata_agent):
        events = (self.ROW_UPDATE,)
        super(PortBindingChassisCreatedEvent, self).__init__(metadata_agent, events)

    def match_fn(self, event, row, old):
        return (row.chassis[0].name == self.agent.chassis and not old.chassis)

As said in comment[2]:

What this means is that our PortBindingUpdatedEvent (or PortBindingChassisCreatedEvent) which looks for "update" events don't fire when we get a Port_Binding "create" that has the chassis field set.

Yes, that's because match_fn includes the condition 'not old.chassis',

1, openstack port create --network private --fixed-ip subnet=private_subnet $PORT_NAME
When creating a port, we can see an insert event via 'ovsdb monitor', but nothing from the neutron-metadata-agent log since PortBindingChassisCreatedEvent doesn't monitor ROW_INSERT

2, openstack server add port cirros-0.4.0-054427 $PORT_NAME
When adding a port to a VM, we can see an update event via 'ovsdb monitor' and logs from neutron-metadata-agent log as well because it meets the condition (row.chassis[0].name == self.agent.chassis and not old.chassis)

3, openstack port set $PORT_NAME --fixed-ip subnet=private_subnet,ip-address=192.168.21.$((RANDOM % 255 + 1))
When updating a port's IP, we can see an update event via 'ovsdb monitor', but nothing from neutron-metadata-agent log because it doesn't meet the condition (and not old.chassis)

In other words, the current condition (ROW_UPDATE and row.chassis[0].name == self.agent.chassis and not old.chassis) only gives provision_datapath one chance to run. If an issue occurs with ovsdb at this time, subsequent ovsdb update events like step 3 above will not give provision_datapath another chance to run.

The current fix patch [1] changes the condition from

ROW_UPDATE and row.chassis[0].name == self.agent.chassis and not old.chassis

to

(ROW_INSERT or ROW_UPDATE) and row.chassis[0].name == self.agent.chassis and not old.chassis

Thus, now provision_datapath has two chances to run, making this patch act as a retry mechanism, and of course it can solve the problem.

Anyway, since I cannot reproduce the problem, I can only theoretically say that this patch can solve the problem as mentioned above. Even if it cannot s...


Changed in neutron (Ubuntu Jammy):
status: Incomplete → In Progress
Changed in neutron (Ubuntu Focal):
status: Won't Fix → In Progress
Revision history for this message
Hua Zhang (zhhuabj) wrote :
Revision history for this message
Edward Hope-Morley (hopem) wrote :

@zhhuabj Can you please provide debdiffs for Antelope and Bobcat. I see that Zed was marked as Won't Fix but I think we need to do that one as well?

Revision history for this message
Hua Zhang (zhhuabj) wrote :
Revision history for this message
Hua Zhang (zhhuabj) wrote :
Revision history for this message
Hua Zhang (zhhuabj) wrote :
Revision history for this message
Hua Zhang (zhhuabj) wrote :

Hi @hopem, Zed seems to have reached EOL in April 2024 according to the page [1]. Do we still need to do it? Anyway, I have already uploaded a debdiff for it as well.

[1] https://ubuntu.com/openstack/docs/supported-versions

Revision history for this message
Edward Hope-Morley (hopem) wrote :

On reviewing the A and B SRUs we have decided not to pursue them since they are significantly larger due to extra dependencies that had to be pulled in in order to do the backport. This is a result of some refactoring that happened in Antelope. Since this issue is hopefully not something that would be expected to interrupt an upgrade through these releases we can hopefully safely skip them and focus on Jammy/Yoga. If at any point this turns out to not be the case we can always re-review the other backports.

Revision history for this message
Brian Haley (brian-haley) wrote :

I agree with Ed's assessment and think we should just go forward with the jammy/yoga change. It is also much smaller in scope and simply addresses the bug in question by making sure we don't accidentally merge two notifications into one, causing issues with the port binding. Please refer to that debdiff or the Yoga change [0] for more information.

[0] https://review.opendev.org/c/openstack/neutron/+/926656

Revision history for this message
Matthew Ruffell (mruffell) wrote :

This has been sponsored to jammy/yoga.

The git repo for stable/yoga is:
https://code.launchpad.net/~mruffell/ubuntu/+source/neutron/+git/neutron/+ref/stable/yoga

The git tag for this merge is:
debian/2%20.5.0-0ubuntu2.1

Merge proposal:
https://code.launchpad.net/~mruffell/ubuntu/+source/neutron/+git/neutron/+merge/484628

I also agree with not backporting to zed (now unsupported), and antelope, bobcat
due to the necessary dependency commit "ovn-metadata: Refactor events" being too
large and likely to cause a regression.

There is a low chance of an upgrade regression for users going from yoga -> zed
-> antelope -> bobcat -> caracal (fixed), because users are unlikely to run
heavy stress tests during a series upgrade and would more likely run them once
they land on caracal.

Revision history for this message
Matthew Ruffell (mruffell) wrote :

Feedback for Joshua:

I have attached the final jammy/yoga debdiff for you to study. The same things
apply as when I last sponsored your octavia upload.

- debian/changelog: You need to describe your change and then follow with the
patch file, instead of just having a single line of the patch file. The
description I wrote is:

* Under heavy load, OVN metadata notifications can be held up
  leading to ovsdb-server merging insert and update notifications.
  This can lead to metadata port being missing for some VMs which
  breaks connectivity, e.g. missing DHCP leases. (LP: #2017748)
  - d/p/lp2017748-handle-creation-of-Port_Binding-with-chassis-set.patch

- I renamed the patch to "lp2017748-handle-creation-of-Port_Binding-with-chassis-set.patch"
to put the LP bug number in front of it.

- I refreshed the patch, and moved the dep3 tags to under the Subject block.
I also indented the Subject block to match dep3 requirements.

As for the SRU template, I think you really need to be more descriptive of what
the change does, e.g. the impact section needs to be more than one line of

> ovnmeta- namespaces are missing intermittently then can't reach to VMs

I sponsored for now due to the original description having the necessary
details for the SRU Team to make an informed decision.

For the "where problems could occur" section, I think you really need to consider
the impact to users if a regression were to occur, and what symptoms users would
likely see, and how they might be able to correct it / workaround it.

Revision history for this message
Hua Zhang (zhhuabj) wrote :

Hi Matthew, thanks for your feedback. Next time, I will follow the new standards.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Hi,

a) please reword the impact section. It currently says:

  ovnmeta- namespaces are missing intermittently then can't reach to VMs

That is very brief, and seems to indicate that under certain conditions (which ones?) VMs can't be reached (by what? Is the whole network down? Is it permanently down, or a blip?)

b) Please clarify the test plan

The test plan currently hints that the charmed-openstack-tester is the way to go to reproduce the issue. Is that a reliable reproducer? The two tests that failed, are those the ones we should observe and that must pass with the proposed package? Could you please point at the code for this tester, so at least we have it as a reference in the test plan?

Changed in neutron (Ubuntu Jammy):
status: In Progress → Incomplete
Hua Zhang (zhhuabj)
description: updated
Changed in neutron (Ubuntu Jammy):
status: Incomplete → New
Revision history for this message
Hua Zhang (zhhuabj) wrote :

Hi Andreas (@ahasenack),

I have reworded the following line

ovnmeta- namespaces are missing intermittently then can't reach to VMs

to

The ovn metadata namespace may be missing intermittently under certain conditions, such as high load. This prevents VMs from retrieving metadata (e.g., ssh keys), making them unreachable. The issue is not easily reproducible.

and reworded the test plan as follows.

This issue is theoretically reproducible under certain conditions, such as high load. However, in practice, it has proven extremely difficult to reproduce.

I first talked with the fix author, Brian, who confirmed that he does not have a reproducer. I then made almost 10 attempts to reproduce the issue, but was unsuccessful. I have copied the test results from Jira to pastebin for your reference - https://paste.ubuntu.com/p/H6vh8jycvC/

Given the lack of a reproducer, I continued to run the charmed-openstack-tester according to SRU standards to ensure no regressions were introduced. As of today (20250509), this fix has also been deployed in a customer env via hotfix, and no regressions have been observed so far. Of course, it remains unclear whether the fix actually resolves the original problem, as the issue itself is rare in the customer env as well. Currently, the customer is requesting feedback from their tenants, but no response has been received yet. But I can say for sure that there are no regressions.

Please let me know if this info is enough to keep the SRU moving forward. Thanks a lot.

Revision history for this message
Hua Zhang (zhhuabj) wrote :

Hi Andreas (@ahasenack),

I just added some details about how the customer is really impacted into [Test Case] as well, thanks.

I can't find a reproducer, so I have also done further research into how the issue was reproduced on the customer's side. The customer runs their nightly scripts a few times a day; these create a tenant and all its resources (networks, subnets, routers, VMs, LBs, etc.) and verify that they were created correctly. Intermittently, the customer notices that a VM is unreachable after creation, and in the last two cases they saw that the ovn namespace was missing on the compute host. Because of this, even though the VM was created, the metadata URL is not reachable. We do see similar logs in the customer's env, such as:

VM: 08af4c45-2755-41d6-960c-ce67ecb183cc on host sdtpdc41s100020.xxx.net
created: 2024-02-29T03:34:17Z
network: 3be5f44d-39de-4c38-a77f-06c0d9ee42b0
from neutron-ovn-metadata-agent.log.4.gz:
2024-02-29 03:33:18.297 1080866 INFO neutron.agent.ovn.metadata.agent [-] Port 9a75c431-42c4-47bf-af0d-22e0d5ee11a8 in datapath 3be5f44d-39de-4c38-a77f-06c0d9ee42b0 bound to our chassis
2024-02-29 03:33:18.306 1080866 DEBUG neutron.agent.ovn.metadata.agent [-] There is no metadata port for network 3be5f44d-39de-4c38-a77f-06c0d9ee42b0 or it has no MAC or IP addresses configured, tearing the namespace down if needed _get_provision_params /usr/lib/python3/dist-packages/neutron/agent/ovn/metadata/agent.py:494
2024-02-29 03:34:30.275 1080866 INFO neutron.agent.ovn.metadata.agent [-] Port 4869dcc4-e1dd-4d5c-94dc-f491b8b4211c in datapath 3be5f44d-39de-4c38-a77f-06c0d9ee42b0 bound to our chassis
2024-02-29 03:34:30.284 1080866 DEBUG neutron.agent.ovn.metadata.agent [-] There is no metadata port for network 3be5f44d-39de-4c38-a77f-06c0d9ee42b0 or it has no MAC or IP addresses configured, tearing the namespace down if needed _get_provision_params /usr/lib/python3/dist-packages/neutron/agent/ovn/metadata/agent.py:494
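The pattern in the excerpt above (a port bound to the chassis, immediately followed by the "no metadata port" message for the same network) can be searched for mechanically. The following is only a minimal sketch, assuming the log format matches the lines quoted in this bug; the function name and regular expressions are illustrative and are not part of neutron. A hit is a hint that a host may be affected, not proof.

```python
import re

# Patterns based only on the two log messages quoted in this bug report.
BOUND_RE = re.compile(
    r"Port (?P<port>[0-9a-f-]{36}) in datapath (?P<net>[0-9a-f-]{36}) "
    r"bound to our chassis"
)
NO_META_RE = re.compile(
    r"There is no metadata port for network (?P<net>[0-9a-f-]{36})"
)


def find_suspect_bindings(lines):
    """Return (port, network) pairs where a 'bound to our chassis' line is
    followed by a 'no metadata port' line for the same network.

    Hypothetical helper for log triage; not a neutron API.
    """
    suspects = []
    last_bound = None  # (port, network) from the most recent 'bound' line
    for line in lines:
        bound = BOUND_RE.search(line)
        if bound:
            last_bound = (bound.group("port"), bound.group("net"))
            continue
        missing = NO_META_RE.search(line)
        if missing and last_bound and last_bound[1] == missing.group("net"):
            suspects.append(last_bound)
            last_bound = None
    return suspects
```

It could be fed an opened log file, for example `find_suspect_bindings(open(path))`, where `path` points at a decompressed neutron-ovn-metadata-agent.log.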

BTW, the hotfix containing the fix has been deployed in the customer's env for a long time. The customer reported: "We applied the packages on one of the compute hosts and performed a test with creating 100 VMs in sequence but we did not see any failures in VM creation."

description: updated
description: updated
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

I would ask to include in the test plan a single run of the charmed-openstack-tester, just to confirm that the package in proposed has no regressions. I know you have run this many many many times with the test packages before this SRU (THANKS!), but I hope a single run is not too much extra effort, and it would be against the proposed packages which is what we will deliver to all jammy openstack users.

Changed in neutron (Ubuntu Jammy):
status: New → Fix Committed
tags: added: verification-needed verification-needed-jammy
Revision history for this message
Andreas Hasenack (ahasenack) wrote : Please test proposed package

Hello Lucas, or anyone else affected,

Accepted neutron into jammy-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/neutron/2:20.5.0-0ubuntu2.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-jammy to verification-done-jammy. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-jammy. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

description: updated
Revision history for this message
Guillaume Boutry (gboutry) wrote :

Hello Lucas, or anyone else affected,

Accepted neutron into yoga-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:yoga-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-yoga-needed to verification-yoga-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-yoga-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-yoga-needed
Revision history for this message
Hua Zhang (zhhuabj) wrote :

Verified neutron-ovn-metadata-agent=2:20.5.0-0ubuntu2.1 successfully in two ways:

1, Successfully passed a basic VM creation test involving metadata - https://paste.ubuntu.com/p/62DyJXXtGp/
2, Successfully passed the charmed-openstack-tester test - https://paste.ubuntu.com/p/GnV6x2qZ4b/

BTW, the four failed tests in charmed-openstack-tester were due to timeouts and are unrelated to metadata. Therefore, the charmed-openstack-tester test can be considered as passed.

======
Totals
======
Ran: 469 tests in 4083.4874 sec.
 - Passed: 395
 - Skipped: 70
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 4
Sum of execute time for each test: 5412.5994 sec.

$ grep -r '... FAILED' tmp.3VRsbOixAd-openstack-release-test-results
{1} heat_tempest_plugin.tests.api.test_heat_api.resources_delete_stack_with_resources.test_request [0.047527s] ... FAILED
{1} heat_tempest_plugin.tests.api.test_heat_api.environments_delete_envstack.test_request [0.058136s] ... FAILED
{1} heat_tempest_plugin.tests.api.test_heat_api.stacks_delete_stack.test_request [0.142534s] ... FAILED
{1} setUpClass (octavia_tempest_plugin.tests.scenario.v2.test_traffic_ops.TrafficOperationsScenarioTest) [0.000000s] ... FAILED

tags: added: verification-done-jammy
removed: verification-needed-jammy
Revision history for this message
Hua Zhang (zhhuabj) wrote :

Verified focal-yoga successfully - https://paste.ubuntu.com/p/PkgNDMmKrN/
I used this parameter (export TEST_DEPLOY_TIMEOUT=7200) this time, but the same 4 tests below still experienced timeouts.

======
Totals
======
Ran: 469 tests in 5010.4106 sec.
 - Passed: 395
 - Skipped: 70
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 4
$ grep -r '... FAILED' tmp.vcc6ybP4fL-openstack-release-test-results
{0} heat_tempest_plugin.tests.api.test_heat_api.resources_delete_stack_with_resources.test_request [0.152193s] ... FAILED
{0} heat_tempest_plugin.tests.api.test_heat_api.environments_delete_envstack.test_request [0.044985s] ... FAILED
{0} heat_tempest_plugin.tests.api.test_heat_api.stacks_delete_stack.test_request [0.052373s] ... FAILED
{1} setUpClass (octavia_tempest_plugin.tests.scenario.v2.test_traffic_ops.TrafficOperationsScenarioTest) [0.000000s] ... FAILED

I further checked the metadata-agent logs and found that these 4 timeouts were not caused by metadata; metadata worked very well - https://paste.ubuntu.com/p/X5VKMCV4Hq/

Then, I used cloud:focal-yoga (2:20.5.0-0ubuntu2~cloud0) instead of cloud:focal-yoga/proposed (2:20.5.0-0ubuntu2.1~cloud0) in the file tests/distro-regression/tests/bundles/focal-yoga.yaml to run charmed-openstack-tester again. The same four errors still occurred, which further indicates that these 4 errors were not caused by this fix - https://paste.ubuntu.com/p/NbyYqx8wz9/

======
Totals
======
Ran: 469 tests in 3822.6757 sec.
 - Passed: 395
 - Skipped: 70
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 4
$ grep -r '... FAILED' /tmp/tmp.x0e428f5JU-openstack-release-test-results
{1} heat_tempest_plugin.tests.api.test_heat_api.environments_delete_envstack.test_request [2.181656s] ... FAILED
{0} heat_tempest_plugin.tests.api.test_heat_api.resources_delete_stack_with_resources.test_request [2.557922s] ... FAILED
{0} heat_tempest_plugin.tests.api.test_heat_api.stacks_delete_stack.test_request [0.139749s] ... FAILED
{1} setUpClass (octavia_tempest_plugin.tests.scenario.v2.test_traffic_ops.TrafficOperationsScenarioTest) [0.000000s] ... FAILED
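The comparison above (the same four failures with and without the proposed package) can also be checked mechanically rather than by eye. This is a minimal sketch, assuming result lines have the shape shown in the grep output above; the helper names are hypothetical and are not part of charmed-openstack-tester or tempest.

```python
import re

# Matches result lines such as:
#   {1} some.test.id [0.047527s] ... FAILED
# The worker number in braces and the timing are ignored, since both
# differ between runs.
FAILED_RE = re.compile(
    r"\{\d+\}\s+(?P<test>.+?)\s+\[[\d.]+s\]\s+\.\.\.\s+FAILED\s*$"
)


def failed_tests(lines):
    """Return the set of test identifiers reported as FAILED."""
    found = set()
    for line in lines:
        match = FAILED_RE.search(line)
        if match:
            found.add(match.group("test"))
    return found


def new_failures(baseline_lines, proposed_lines):
    """Tests that fail with the proposed package but not in the baseline."""
    return failed_tests(proposed_lines) - failed_tests(baseline_lines)
```

An empty result from `new_failures` means every failure seen with the proposed package also occurs without it, which is the argument made in this comment. It does not explain why those tests fail.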

Revision history for this message
Trent Lloyd (lathiat) wrote :

Have reverted verification-done-jammy back to verification-needed-jammy; as per discussions held separately more verification on the above test is needed around the failing tests. Joshua is working on those now.

At a minimum we should run the same test without the patch, and see if they also fail then. Ideally we'd thoroughly investigate each of those failures and determine the cause specifically, and with evidence (e.g. log lines) - even though they don't appear related to this change, they could be broken by this upload for various reasons, and we should be sure about that as a minimum - but more ideally we'd just figure out why the tests are broken and fix them.

tags: added: verification-needed-jammy
removed: verification-done-jammy
Revision history for this message
Hua Zhang (zhhuabj) wrote :

After reading some of the source code of charmed-openstack-tester and zaza-openstack-tests, I finally set up a debug env by setting keep-workspace=true, which prevents the tempest workspace from being deleted after running charmed-openstack-tester. This allows me to quickly debug a single failed test by running a tempest command similar to the following.

1, tempest run --workspace zaza --regex 'octavia_tempest_plugin.tests.scenario.v2.test_traffic_ops.TrafficOperationsScenarioTest'

I got the following two kinds of logs

#the log from ~/.tempest/zaza/tempest.log
2025-06-13 03:29:56.706 310372 INFO tempest.lib.common.ssh [-] Creating ssh connection to '10.149.142.113:22' as 'cirros' with public key authentication
2025-06-13 03:29:56.708 310372 INFO paramiko.transport [-] Connected (version 2.0, client dropbear_2020.81)
2025-06-13 03:29:56.808 310372 INFO paramiko.transport [-] Authentication (publickey) successful!
2025-06-13 03:29:56.809 310372 INFO tempest.lib.common.ssh [-] ssh connection to cirros@10.149.142.113 successfully created
2025-06-13 03:29:56.816 310372 INFO octavia_tempest_plugin.tests.validators [-] Validate URL got exception: HTTPConnectionPool(host='10.149.142.113', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f961dec6860>: Failed to establish a new connection: [Errno 111] Connection refused')). Retrying
...
2025-06-13 03:44:54.182 310372 INFO octavia_tempest_plugin.tests.validators [-] Validate URL got exception: HTTPConnectionPool(host='10.149.142.113', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f961dec70a0>: Failed to establish a new connection: [Errno 111] Connection refused')). Retrying.
2025-06-13 03:44:59.188 310372 INFO tempest.test [-] <class 'tempest.lib.exceptions.TimeoutException'> raised in TrafficOperationsScenarioTest.setUpClass. Invoking tearDownClass.

#the log from /var/log/octavia/octavia-worker.log
...
                                                       |__Flow 'octavia-create-loadbalancer-flow': octavia.common.exceptions.ComputeWaitTimeoutException: Waiting for compute id 6246b204-53e3-4362-9351-1afdf99deb96 to go active timeout.
2025-06-13 02:15:47.862 198168 ERROR octavia.controller.worker.v2.controller_worker Traceback (most recent call last):
2025-06-13 02:15:47.862 198168 ERROR octavia.controller.worker.v2.controller_worker File "/usr/lib/python3/dist-packages/taskflow/engines/action_engine/executor.py", line 53, in _execute_task
2025-06-13 02:15:47.862 198168 ERROR octavia.controller.worker.v2.controller_worker result = task.execute(**arguments)
2025-06-13 02:15:47.862 198168 ERROR octavia.controller.worker.v2.controller_worker File "/usr/lib/python3/dist-packages/octavia/controller/worker/v2/tasks/compute_tasks.py", line 302, in execute
2025-06-13 02:15:47.862 198168 ERROR octavia.controller.worker.v2.controller_worker raise exceptions.ComputeWaitTimeoutException(id=compute_id)
2025-06-13 02:15:47.862 198168 ERROR octavia.controller.worker.v2.controller_worker octavia.common.exceptions.ComputeWaitTimeoutException: Waiting...

tags: added: verification-done-jammy verification-yoga-done
removed: verification-needed-jammy verification-yoga-needed
Revision history for this message
Edward Hope-Morley (hopem) wrote :

@zhhuabj thank you for confirming that those failures are unrelated to your patch. I am happy that the verification is complete so should be good for release.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package neutron - 2:20.5.0-0ubuntu2.1

---------------
neutron (2:20.5.0-0ubuntu2.1) jammy; urgency=medium

  * Under heavy load, OVN metadata notifications can be held up
    leading to ovsdb-server merging insert and update notifications.
    This can lead to metadata port being missing for some VMs which
    breaks connectivity, e.g. missing DHCP leases. (LP: #2017748)
    - d/p/lp2017748-handle-creation-of-Port_Binding-with-chassis-set.patch

 -- Zhang Hua <email address hidden> Tue, 26 Nov 2024 15:04:34 +0800

Changed in neutron (Ubuntu Jammy):
status: Fix Committed → Fix Released
Revision history for this message
Nick Rosbrook (enr0n) wrote : Update Released

The verification of the Stable Release Update for neutron has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

The verification of the Stable Release Update for neutron has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

This bug was fixed in the package neutron - 2:20.5.0-0ubuntu2.1~cloud0
---------------

 neutron (2:20.5.0-0ubuntu2.1~cloud0) focal-yoga; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 neutron (2:20.5.0-0ubuntu2.1) jammy; urgency=medium
 .
   * Under heavy load, OVN metadata notifications can be held up
     leading to ovsdb-server merging insert and update notifications.
     This can lead to metadata port being missing for some VMs which
     breaks connectivity, e.g. missing DHCP leases. (LP: #2017748)
     - d/p/lp2017748-handle-creation-of-Port_Binding-with-chassis-set.patch

Hua Zhang (zhhuabj)
no longer affects: neutron (Ubuntu Focal)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/956177

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/956177
Committed: https://opendev.org/openstack/neutron/commit/3056fa0877d52241df90f91a527f11823cf36374
Submitter: "Zuul (22348)"
Branch: master

commit 3056fa0877d52241df90f91a527f11823cf36374
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Jul 30 15:01:20 2025 +0000

    [OVN] Handle creation of Port_Binding with chassis set (2)

    This is the follow-up of [1], missing in the OVN agent "metadata"
    extension implementation.

    When there is a backlog of notifications to be sent, it is possible
    that ovsdb-server will merge insert and update notifications. Due
    to this, we need to handle the situation where we see a Port_Binding
    created with the chassis set.

    [1]https://review.opendev.org/c/openstack/neutron/+/903796

    Related-Bug: #2017748
    Signed-off-by: Rodolfo Alonso Hernandez <email address hidden>
    Change-Id: I333fc0d4e61bc9aace007ca5de413d9cdc12c4a5
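The situation described in the commit message above can be illustrated with a toy model. This is not the neutron patch and uses no neutron or ovsdbapp APIs; the event class, its fields, and the function name are assumptions made purely for illustration. The point it shows is that when ovsdb-server merges the insert and update notifications, the agent sees a single "create" event for a Port_Binding that already has its chassis set, so matching only on "update" events misses the binding.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PortBindingEvent:
    """Hypothetical, simplified view of a Port_Binding notification."""
    kind: str                          # "create" or "update"
    chassis: Optional[str]             # chassis on the row after the change
    old_chassis: Optional[str] = None  # chassis before the change (updates)


def bound_to_our_chassis(event: PortBindingEvent, our_chassis: str) -> bool:
    """Return True if this event means the port is now bound to our chassis."""
    if event.kind == "create":
        # Merged insert+update: the row arrives with the chassis already set.
        return event.chassis == our_chassis
    if event.kind == "update":
        # Usual case: the chassis column changes to ours on an existing row.
        return event.chassis == our_chassis and event.old_chassis != our_chassis
    return False
```

A handler that only implemented the "update" branch would return False for `PortBindingEvent("create", "host-1")` on chassis `host-1`, which in this toy model corresponds to the missed binding.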
