Netplan is not setting up SRIOV Virtual Functions on Jammy Charmed OpenStack during boot

Bug #1977851 reported by Itai Levy
This bug affects 3 people
Affects              Status   Importance  Assigned to  Milestone
netplan.io (Ubuntu)  Triaged  Low         Unassigned

Bug Description

I am trying to deploy Charmed OpenStack (Yoga) on the Jammy series with OVN hardware offload.

# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04 LTS"

# uname -a
Linux node3 5.15.0-35-generic #36-Ubuntu SMP Sat May 21 02:24:07 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

# cat /etc/openstack-release
OPENSTACK_CODENAME=yoga

The charm bundle uses the following configuration:
  ovn-chassis:
    charm: ch:ovn-chassis
    # Please update the `bridge-interface-mappings` to values suitable for the
    # hardware used in your deployment. See the referenced documentation at the
    # top of this file.
    options:
      ovn-bridge-mappings: datacentre:br-ex
      bridge-interface-mappings: *data-port
      enable-hardware-offload: true
      sriov-numvfs: "ens1f1:8"
    channel: 22.03/stable
    bindings:
      "": *internal-space
      data: *overlay-space

The charm translates this into the following netplan file on the deployed node:
# cat /etc/netplan/150-charm-ovn.yaml
###############################################################################
# [ WARNING ]
# Configuration file maintained by Juju. Local changes may be overwritten.
# Config managed by ovn-chassis charm
###############################################################################
network:
  version: 2
  ethernets:
    ens1f1:
      virtual-function-count: 8
      embedded-switch-mode: switchdev
      delay-virtual-functions-rebind: true

However, after rebooting the deployed servers, the SR-IOV VFs are not enabled on the NVIDIA NIC:
# lspci | grep -i nox
08:00.0 Ethernet controller: Mellanox Technologies MT42822 BlueField-2 integrated ConnectX-6 Dx network controller
08:00.1 Ethernet controller: Mellanox Technologies MT42822 BlueField-2 integrated ConnectX-6 Dx network controller
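
A complementary check after boot is to read the kernel's SR-IOV counters for the interface directly from sysfs, which avoids relying on lspci output alone. The following is a minimal sketch, assuming the standard `sriov_numvfs` and `sriov_totalvfs` attributes are exposed under the interface's PCI device directory; the helper name `read_vf_counts` and its `sysfs_root` parameter are hypothetical and not part of netplan.

```python
from pathlib import Path
from typing import Tuple


def read_vf_counts(iface: str, sysfs_root: str = "/sys") -> Tuple[int, int]:
    """Return (enabled VFs, maximum VFs) for a network interface.

    Reads the sriov_numvfs and sriov_totalvfs attributes of the PCI device
    backing `iface`. Raises FileNotFoundError if the interface does not
    exist or its device does not expose SR-IOV attributes.
    `sysfs_root` is a hypothetical parameter used to point at a fake tree.
    """
    device = Path(sysfs_root) / "class" / "net" / iface / "device"
    enabled = int((device / "sriov_numvfs").read_text().strip())
    maximum = int((device / "sriov_totalvfs").read_text().strip())
    return enabled, maximum
```

On the node from this report, `read_vf_counts("ens1f1")` would be expected to report zero enabled VFs after a reboot and eight after a manual `netplan apply`, if the behaviour described here holds.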

When netplan is applied manually, the VFs are configured. The switch mode change fails because the NIC is already bound, which I believe is expected:

# netplan --debug apply
.
.
.
   ens1f1:
      delay-virtual-functions-rebind: true
      embedded-switch-mode: switchdev
      match:
        macaddress: 04:3f:72:9e:0b:a1
      mtu: 1500
      set-name: ens1f1
      virtual-function-count: 8
.
.
.

DEBUG:Found VFs of 0000:08:00.1: ['0000:08:02.3', '0000:08:02.4', '0000:08:02.5', '0000:08:02.6', '0000:08:02.7', '0000:08:03.0', '0000:08:03.1', '0000:08:03.2']
Error: mlx5_core: Can't change mode, E-Switch is busy.
kernel answers: Device or resource busy
Traceback (most recent call last):
  File "/usr/sbin/netplan", line 23, in <module>
    netplan.main()
  File "/usr/share/netplan/netplan/cli/core.py", line 50, in main
    self.run_command()
  File "/usr/share/netplan/netplan/cli/utils.py", line 247, in run_command
    self.func()
  File "/usr/share/netplan/netplan/cli/commands/apply.py", line 61, in run
    self.run_command()
  File "/usr/share/netplan/netplan/cli/utils.py", line 247, in run_command
    self.func()
  File "/usr/share/netplan/netplan/cli/commands/apply.py", line 245, in command_apply
    NetplanApply.process_sriov_config(config_manager, exit_on_error)
  File "/usr/share/netplan/netplan/cli/commands/apply.py", line 376, in process_sriov_config
    apply_sriov_config(config_manager)
  File "/usr/share/netplan/netplan/cli/sriov.py", line 492, in apply_sriov_config
    pcidev.devlink_set('eswitch', 'mode', eswitch_mode)
  File "/usr/share/netplan/netplan/cli/sriov.py", line 143, in devlink_set
    subprocess.check_call(
  File "/usr/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/sbin/devlink', 'dev', 'eswitch', 'set', 'pci/0000:08:00.1', 'mode', 'switchdev']' returned non-zero exit status 1.
root@node3:/home/ubuntu#
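
Since the failing step in the traceback is `devlink dev eswitch set ... mode switchdev`, it may help to record which e-switch mode the device is actually in before and after the manual apply. The following is a minimal sketch: `devlink dev eswitch show` is an existing subcommand, but the sample output line in the comment is an assumption about its format rather than output captured from this machine, and both helper names are hypothetical.

```python
import subprocess
from typing import Optional


def parse_eswitch_mode(show_output: str) -> Optional[str]:
    """Extract the value following the standalone 'mode' keyword.

    Assumed input format (not taken from this report), for example:
        pci/0000:08:00.1: mode legacy inline-mode none encap-mode basic
    Returns None if no 'mode' keyword is found.
    """
    for line in show_output.splitlines():
        tokens = line.split()
        if "mode" in tokens:
            idx = tokens.index("mode")
            if idx + 1 < len(tokens):
                return tokens[idx + 1]
    return None


def current_eswitch_mode(pci_address: str,
                         devlink: str = "/sbin/devlink") -> Optional[str]:
    """Run 'devlink dev eswitch show' for one PCI device and parse the mode."""
    result = subprocess.run(
        [devlink, "dev", "eswitch", "show", f"pci/{pci_address}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_eswitch_mode(result.stdout)
```

For the device in this report, the call would be `current_eswitch_mode("0000:08:00.1")`, run as root on the affected node.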

# lspci | grep -i nox
08:00.0 Ethernet controller: Mellanox Technologies MT42822 BlueField-2 integrated ConnectX-6 Dx network controller
08:00.1 Ethernet controller: Mellanox Technologies MT42822 BlueField-2 integrated ConnectX-6 Dx network controller
08:00.2 DMA controller: Mellanox Technologies MT42822 BlueField-2 SoC Management Interface
08:02.3 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
08:02.4 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
08:02.5 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
08:02.6 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
08:02.7 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
08:03.0 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
08:03.1 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
08:03.2 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
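
The before and after comparison of the lspci listings can also be done programmatically. The following is a minimal sketch that counts the lines describing a Virtual Function, using the listing above as sample input; the function name `count_virtual_functions` is hypothetical, and matching on the literal text "Virtual Function" is an assumption that holds for the device names shown in this report.

```python
SAMPLE_LSPCI = """\
08:00.0 Ethernet controller: Mellanox Technologies MT42822 BlueField-2 integrated ConnectX-6 Dx network controller
08:00.1 Ethernet controller: Mellanox Technologies MT42822 BlueField-2 integrated ConnectX-6 Dx network controller
08:00.2 DMA controller: Mellanox Technologies MT42822 BlueField-2 SoC Management Interface
08:02.3 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
08:02.4 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
08:02.5 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
08:02.6 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
08:02.7 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
08:03.0 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
08:03.1 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
08:03.2 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
"""


def count_virtual_functions(lspci_output: str) -> int:
    """Count lspci lines whose device description mentions a Virtual Function."""
    return sum(
        1 for line in lspci_output.splitlines() if "Virtual Function" in line
    )


print(count_virtual_functions(SAMPLE_LSPCI))  # prints 8
```

The result matches the `virtual-function-count: 8` value in the netplan file; the listing taken right after reboot would give 0.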

Tags: fr-2523
Itai Levy (etlvnvda)
affects: openvswitch (Ubuntu) → plan (Ubuntu)
Frode Nordahl (fnordahl)
affects: plan (Ubuntu) → netplan.io (Ubuntu)
Lukas Märdian (slyon)
tags: added: fr-2523
Itai Levy (etlvnvda) wrote :

Important note: after moving to a bond configuration I don't see this issue anymore. It seems to happen only when a single interface is used for the high-speed fabric.

Lukas Märdian (slyon) wrote :

Thank you for your report. We'll be tracking this at "Low" priority for now.
I've talked to an internal team that has access to such a setup and hardware, and they will try to reproduce the issue for further investigation.

Changed in netplan.io (Ubuntu):
status: New → Triaged
importance: Undecided → Low