Comment 0 for bug 1943863

Vladimir Grevtsev (vlgrevtsev) wrote:

== Env
focal/ussuri + ovn, latest stable charms
juju status: https://paste.ubuntu.com/p/2725tV47ym/

== Problem description

A DPDK instance can't be launched on a fresh deployment (focal/ussuri + OVN, latest stable charms); the launch fails with the error below:

$ os server show dpdk-test-instance -f yaml
OS-DCF:diskConfig: MANUAL
OS-EXT-AZ:availability_zone: ''
OS-EXT-SRV-ATTR:host: null
OS-EXT-SRV-ATTR:hypervisor_hostname: null
OS-EXT-SRV-ATTR:instance_name: instance-00000218
OS-EXT-STS:power_state: NOSTATE
OS-EXT-STS:task_state: null
OS-EXT-STS:vm_state: error
OS-SRV-USG:launched_at: null
OS-SRV-USG:terminated_at: null
accessIPv4: ''
accessIPv6: ''
addresses: ''
config_drive: 'True'
created: '2021-09-15T18:51:00Z'
fault:
  code: 500
  created: '2021-09-15T18:52:01Z'
  details: "Traceback (most recent call last):\n File \"/usr/lib/python3/dist-packages/nova/conductor/manager.py\"\
    , line 651, in build_instances\n scheduler_utils.populate_retry(\n File \"\
    /usr/lib/python3/dist-packages/nova/scheduler/utils.py\", line 919, in populate_retry\n\
    \ raise exception.MaxRetriesExceeded(reason=msg)\nnova.exception.MaxRetriesExceeded:\
    \ Exceeded maximum number of retries. Exceeded max scheduling attempts 3 for instance\
    \ 1bb2d1b7-e2e9-4d76-a346-a9b06ff22c73. Last exception: internal error: process\
    \ exited while connecting to monitor: 2021-09-15T18:51:53.485265Z qemu-system-x86_64:\
    \ -chardev socket,id=charnet0,path=/run/libvirt-vhost-user/vhu3ba44fdc-7c,server:\
    \ Failed to bind socket to /run/libvirt-vhost-user/vhu3ba44fdc-7c: No such file\
    \ or directory\n"
  message: 'Exceeded maximum number of retries. Exceeded max scheduling attempts 3
    for instance 1bb2d1b7-e2e9-4d76-a346-a9b06ff22c73. Last exception: internal error:
    process exited while connecting to monitor: 2021-09-15T18:51:53.485265Z qemu-system-x86_64:
    -chardev '
flavor: m1.medium.project.dpdk (4f452aa3-2b2c-4f2e-8465-5e3c2d8ec3f1)
hostId: ''
id: 1bb2d1b7-e2e9-4d76-a346-a9b06ff22c73
image: auto-sync/ubuntu-bionic-18.04-amd64-server-20210907-disk1.img (3851450e-e73d-489b-a356-33650690ed7a)
key_name: ubuntu-keypair
name: dpdk-test-instance
project_id: cdade870811447a89e2f0199373a0d95
properties: ''
status: ERROR
updated: '2021-09-15T18:52:01Z'
user_id: 13a0e7862c6641eeaaebbde1ae096f9e
volumes_attached: ''

For the record, "generic" instances (i.e. non-DPDK/non-SR-IOV) are scheduled and started without any issues.

== Steps to reproduce

openstack network create --external --provider-network-type vlan --provider-segment xxx --provider-physical-network dpdkfabric ext_net_dpdk
openstack subnet create --allocation-pool start=<redacted>,end=<redacted> --network ext_net_dpdk --subnet-range <redacted>/23 --gateway <redacted> --no-dhcp ext_net_dpdk_subnet

openstack aggregate create --zone nova dpdk
openstack aggregate set --property dpdk=true dpdk

openstack aggregate add host dpdk <fqdn>

openstack aggregate show dpdk --max-width=80

openstack flavor set --property aggregate_instance_extra_specs:dpdk=true --property hw:mem_page_size=large m1.medium.dpdk

openstack server create --config-drive true --network ext_net_dpdk --key-name ubuntu-keypair --image focal --flavor m1.medium.dpdk dpdk-test-instance
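Since the flavor sets hw:mem_page_size=large, scheduling also depends on hugepages actually being reserved on the target hypervisor. A quick sanity check on the compute host (a sketch, not part of the original reproduction steps):

```shell
# On the DPDK compute host: HugePages_Total must be non-zero, otherwise
# instances with hw:mem_page_size=large cannot be placed there at all.
grep Huge /proc/meminfo
```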

== Analysis
[before redeployment] nova-compute log: https://pastebin.canonical.com/p/FgPYNb3bPj/
[fresh deployment] juju crashdump: https://drive.google.com/file/d/1W_w3CAUq4ggp4alDnpCk08mSaCL6Uaxk/view?usp=sharing

<on hypervisor>

# ovs-vsctl get open_vswitch . other_config
{dpdk-extra="--pci-whitelist 0000:3e:00.0 --pci-whitelist 0000:40:00.0", dpdk-init="true", dpdk-lcore-mask="0x1000001", dpdk-socket-mem="4096,4096"}

# cat /etc/tmpfiles.d/nova-ovs-vhost-user.conf
# Create libvirt writeable directory for vhost-user sockets
d /run/libvirt-vhost-user 0770 libvirt-qemu kvm - -
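For reference, that `d` line tells systemd-tmpfiles to create the directory with mode 0770, owned by libvirt-qemu:kvm. A roughly equivalent shell sketch (a scratch path is used here so it runs unprivileged; the real path and owner are /run/libvirt-vhost-user and libvirt-qemu:kvm):

```shell
# Rough shell equivalent of the tmpfiles.d "d" line above (sketch).
DIR=/tmp/libvirt-vhost-user-demo
install -d -m 0770 "$DIR"   # mkdir + chmod 0770 in one step
stat -c '%a' "$DIR"         # prints 770
```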

In fact, none of the compute hosts have that file: https://paste.ubuntu.com/p/XJRFypbMQf/ (however, the error from this issue doesn't appear on non-DPDK hosts).

After running the command below, the missing /run/... directory appeared and the VM could be scheduled and started. However, even though it started, it wasn't reachable over the network.

# systemd-tmpfiles --create
# stat /run/libvirt-vhost-user
  File: /run/libvirt-vhost-user
  Size: 40 Blocks: 0 IO Block: 4096 directory
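Note that /run is a tmpfs, so the manually created directory will be gone after the next reboot unless the tmpfiles.d entry is in place as well. A small check covering both halves of the fix (a sketch; the function name is illustrative, the paths are the ones from this report):

```shell
# Sketch: verify both the persistent tmpfiles.d entry and the runtime
# socket directory exist on a DPDK compute host. Paths default to the
# ones from this report; they are parameterized for illustration only.
check_vhost_user_setup() {
    conf=${1:-/etc/tmpfiles.d/nova-ovs-vhost-user.conf}
    dir=${2:-/run/libvirt-vhost-user}
    [ -f "$conf" ] || { echo "missing $conf (dir will not survive a reboot)"; return 1; }
    [ -d "$dir" ]  || { echo "missing $dir (run: systemd-tmpfiles --create)"; return 1; }
    echo "ok"
}
# Usage on a compute host: check_vhost_user_setup
```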