[SRU] OVN: ovnmeta namespaces missing during scalability test causing DHCP issues
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Ubuntu Cloud Archive | Fix Released | Undecided | Unassigned | |
| Antelope | Won't Fix | Undecided | Unassigned | |
| Bobcat | Won't Fix | Undecided | Unassigned | |
| Caracal | Fix Released | Undecided | Unassigned | |
| Dalmatian | Fix Released | Undecided | Unassigned | |
| Epoxy | Fix Released | Undecided | Unassigned | |
| Yoga | Fix Released | Undecided | Unassigned | |
| Zed | Won't Fix | Undecided | Unassigned | |
| neutron | Status tracked in Ussuri | | | |
| Ussuri | Fix Released | High | Terry Wilson | |
| Victoria | New | Undecided | Unassigned | |
| Wallaby | New | Undecided | Unassigned | |
| Xena | New | Undecided | Unassigned | |
| neutron (Ubuntu) | Fix Released | Undecided | Unassigned | |
| Jammy | Fix Released | Undecided | Hua Zhang | |
| Noble | Fix Released | Undecided | Unassigned | |
| Oracular | Fix Released | Undecided | Unassigned | |
| Plucky | Fix Released | Undecided | Unassigned | |
Bug Description
[Impact]
During scalability tests where extreme load is generated by creating thousands
of VMs all at the same time, some VMs fail to get a DHCP lease and cannot be
pinged or sshed to after deployment.
The ovnmeta namespaces for networks that the VMs were created in are missing.
The following lines are present in neutron-
2024-02-29 03:33:18.297 1080866 INFO neutron.
2024-02-29 03:33:18.306 1080866 DEBUG neutron.
What is happening is that under extreme load, sometimes the metadata port
information has not yet been propagated by OVN to the Southbound database, which
usually takes the form of an update notification, and when
PortBindingChas
for update notifications, finds none, so it doesn't know any metadata port or IP
information, fails, logs the message above, and tears down the ovnmetadata
namespace for that VM.
Eventually ovsdb-server catches up, merges the insert and update notifications,
and sends them out as an insert notification, which PortBindingChas
currently ignores, so the metadata is never applied to the VM.
This is a race condition, and it doesn't happen under normal conditions,
as the metadata would just be delivered as an update notification.
The fix is to also listen for insert notifications, and act on them.
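The race above can be illustrated with a small, self-contained simulation. This is only a sketch: the `Handler`, `PortBindingRow` and `replay` names are hypothetical and are not neutron's or ovsdbapp's actual classes. It shows only that a handler subscribed to update notifications misses a port whose insert and update were merged into one insert, while a handler subscribed to both event types does not.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical event names for this sketch; not the real agent's constants.
INSERT = "insert"
UPDATE = "update"


@dataclass
class PortBindingRow:
    """Simplified stand-in for a Southbound Port_Binding row."""
    logical_port: str
    chassis: Optional[str] = None


@dataclass
class Handler:
    """Runs `callback` for notifications whose type is in `events`."""
    events: Tuple[str, ...]
    callback: Callable[[PortBindingRow], None]

    def notify(self, event: str, row: PortBindingRow) -> None:
        # Act only on subscribed event types for rows that have a chassis set.
        if event in self.events and row.chassis is not None:
            self.callback(row)


def replay(handler_events: Tuple[str, ...], notifications: list) -> List[str]:
    """Return the ports for which metadata would have been provisioned."""
    provisioned: List[str] = []
    handler = Handler(handler_events,
                      lambda row: provisioned.append(row.logical_port))
    for event, row in notifications:
        handler.notify(event, row)
    return provisioned


# Normal load: the row is inserted first, then updated with its chassis.
normal = [
    (INSERT, PortBindingRow("port-1")),
    (UPDATE, PortBindingRow("port-1", chassis="hv-1")),
]
# Heavy load: insert and update arrive merged as one insert with chassis set.
merged = [(INSERT, PortBindingRow("port-2", chassis="hv-1"))]

print(replay((UPDATE,), normal))         # ['port-1']
print(replay((UPDATE,), merged))         # []  (the port is missed)
print(replay((INSERT, UPDATE), merged))  # ['port-2']
```

Subscribing to both event types keeps the normal path unchanged, because an insert without a chassis is still skipped and the later update is handled as before.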
[Test Case]
This can't be reproduced in the lab, even after many attempts.
A user sees this issue daily in production, where they run a scalability test
every night, in which they create a new tenant, create all necessary resources
(networks, subnets, routers, load balancers, etc.) and start several thousand
VMs. They then audit the deployment and verify that everything deployed
correctly.
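As an illustration of one check such an audit could include, the sketch below compares a list of network IDs against the namespaces reported on a hypervisor. It is not the user's actual audit tooling. It assumes namespaces are named `ovnmeta-` followed by the network ID and that the input text has the shape of `ip netns list` output; the function names and sample data are hypothetical.

```python
from typing import Iterable, List, Set

# Prefix taken from the report; the network-ID suffix is an assumption.
NS_PREFIX = "ovnmeta-"


def parse_netns_list(output: str) -> Set[str]:
    """Extract namespace names from `ip netns list`-style output.

    Each non-empty line is expected to look like 'name' or 'name (id: N)'.
    """
    names: Set[str] = set()
    for line in output.splitlines():
        line = line.strip()
        if line:
            names.add(line.split()[0])
    return names


def missing_metadata_namespaces(network_ids: Iterable[str],
                                netns_output: str) -> List[str]:
    """Return the network IDs that have no matching ovnmeta- namespace."""
    present = parse_netns_list(netns_output)
    return sorted(n for n in network_ids if NS_PREFIX + n not in present)


# Made-up sample data, for illustration only.
sample_output = "ovnmeta-net-a (id: 0)\novnmeta-net-c (id: 1)\n"
print(missing_metadata_namespaces(["net-a", "net-b", "net-c"], sample_output))
# ['net-b']
```

In practice the namespace list would come from the hypervisor hosting the affected VM, and the network IDs from the networks that have ports bound on that hypervisor.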
Most days there are a small number of VMs that are unreachable, and those VMs
have the following messages in neutron-
2024-02-29 03:33:18.306 1080866 DEBUG neutron.
There are test packages available in:
https:/
Some previous test packages have been running in the user's test environment for
several months, with zero metadata namespace issues since rollout. We also issued
the user a hotfix, which has been running in production for the past month,
again with zero metadata namespace issues since rollout.
When this enters -proposed, it will be verified in the user's production
environment and subject to their nightly runs of their scalability tests, with
the results collected after a week or so of runs. After that we should be
confident the -proposed packages fix the issue.
Additionally, runs will be done with charmed-
[Where problems could occur]
We are changing ovn-metadata-agent in neutron, and any issues would be limited
to ovn-metadata-agent only. ovn-metadata-agent will now listen for both
insert and update notifications from ovsdb-server, instead of only update
notifications as before. It shouldn't impact any existing functionality.
If a regression were to occur, it would affect attaching metadata namespaces to
newly created VMs, which would prevent them from getting their initial metadata
URL / DHCP lease / IP address information and cause connectivity issues for
those VMs. It shouldn't impact any existing VMs.
There are no workarounds if a regression were to occur, other than to downgrade
the package.
[Other info]
This was fixed upstream by:
commit a641e8aec09c1e3
From: Terry Wilson <email address hidden>
Date: Fri, 15 Dec 2023 21:00:43 +0000
Subject: Handle creation of Port_Binding with chassis set
Link: https:/
This patch landed in Caracal. The patch also applies to Zed, Antelope and
Bobcat, but it depends on the following commit:
commit 6801589510242af
From: Jakub Libosvar <email address hidden>
Date: Thu, 21 Sep 2023 19:40:36 +0000
Subject: ovn-metadata: Refactor events
Link: https:/
After some discussion, we (mruffell, brian-haley, hopem) decided that it would
be too much of a regression risk to backport "ovn-metadata: Refactor events"
to Zed, Antelope and Bobcat, so we marked those series "Won't Fix".
The user is on yoga, so Brian Haley wrote a new backport that does not
depend on "ovn-metadata: Refactor events". It is the following commit in
neutron yoga:
commit 952e960414e7c15
From: Terry Wilson <email address hidden>
Date: Tue, 20 Aug 2024 10:20:52 -0500
Subject: Handle creation of Port_Binding with chassis set
Link: https:/
This is what we are suggesting for SRU to jammy / yoga.
There is a low chance of an upgrade regression for users going from yoga -> zed
-> antelope -> bobcat -> caracal (fixed), because users are unlikely to run
heavy stress tests during a series upgrade and would more likely run them
once they land on caracal.
If we have to, we will consider zed, antelope and bobcat in the future, but for
now this is for yoga only.
== ORIGINAL DESCRIPTION ==
Reported at: https:/
During a scalability test it was noted that a few VMs were having issues being pinged (2 out of ~5000 VMs in the test conducted). After some investigation it was found that the VMs in question did not receive a DHCP lease:
udhcpc: no lease, failing
FAIL
checking http://
failed 1/20: up 181.90. request failed
And the ovnmeta- namespaces for the networks that the VMs were booting from were missing. Looking into the ovn-metadata-
2023-04-18 06:56:09.864 353474 DEBUG neutron.
Apparently, when the system is under stress (scalability tests) there are some edge cases where the metadata port information has not yet been propagated by OVN to the Southbound database and when the PortBindingChas
Note that running the same tests with less concurrency did not trigger this issue, so it only happens when the system is overloaded.
Related branches
- Guillaume Boutry (community): Approve
  Diff: 100 lines (+79/-0), 3 files modified
  debian/changelog (+10/-0)
  debian/patches/lp2017748-handle-creation-of-Port_Binding-with-chassis-set.patch (+68/-0)
  debian/patches/series (+1/-0)
- Ubuntu OpenStack uploaders: Pending requested
  Diff: 86 lines (+74/-0), 2 files modified
  debian/patches/handle-creation-of-Port_Binding-with-chassis-set.patch (+73/-0)
  debian/patches/series (+1/-0)
| Changed in neutron: | |
| status: | Triaged → In Progress |
| description: | updated |
| Changed in neutron (Ubuntu Focal): | |
| assignee: | nobody → Seyeong Kim (seyeongkim) |
| Changed in neutron (Ubuntu Jammy): | |
| assignee: | nobody → Seyeong Kim (seyeongkim) |
| summary: | OVN: ovnmeta namespaces missing during scalability test causing DHCP issues → [SRU] OVN: ovnmeta namespaces missing during scalability test causing DHCP issues |
| Changed in neutron (Ubuntu Jammy): | |
| assignee: | Seyeong Kim (seyeongkim) → nobody |
| Changed in neutron (Ubuntu Focal): | |
| assignee: | Seyeong Kim (seyeongkim) → nobody |
| no longer affects: | cloud-archive/zed |
| description: | updated |
| Changed in neutron (Ubuntu Oracular): | |
| status: | New → Fix Released |
| Changed in neutron (Ubuntu Noble): | |
| status: | New → Fix Released |
| Changed in neutron (Ubuntu Focal): | |
| status: | New → Won't Fix |
| Changed in neutron (Ubuntu Jammy): | |
| assignee: | nobody → Hua Zhang (zhhuabj) |
| status: | New → In Progress |
| description: | updated |
| Changed in neutron (Ubuntu Jammy): | |
| status: | Incomplete → New |
| description: | updated |
| description: | updated |
| no longer affects: | neutron (Ubuntu Focal) |
Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/881487
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.