qvo ports are not removed correctly when an instance is deleted immediately after creation

Bug #1711637 reported by Anatolii Neliubin
This bug affects 3 people
Affects              Status      Importance   Assigned to   Milestone
Mirantis OpenStack   Won't Fix   Low          Unassigned
9.x                  Won't Fix   Medium       Unassigned

Bug Description

Detailed bug description:
When a VM is deleted immediately after creation, there are situations where qvo ports remain on the compute node, marked with tag "4095". Please take a look at the example:
root@controller:~# ssh -q compute "ovs-vsctl show|grep -B1 4095"
root@controller:~# nova boot --image TestVM --flavor m1.micro --nic net-id=<skipped> --availability-zone nova:compute mirantis-test; nova delete mirantis-test
+--------------------------------------+-------------------------------------------------+
| Property | Value |
+--------------------------------------+-------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | - |
| OS-EXT-SRV-ATTR:hypervisor_hostname | - |
| OS-EXT-SRV-ATTR:instance_name | instance-00016ad3 |
| OS-EXT-STS:power_state | 0 |
| OS-EXT-STS:task_state | scheduling |
| OS-EXT-STS:vm_state | building |
| OS-SRV-USG:launched_at | - |
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| adminPass | TTLtU9Wrrbhp |
| config_drive | |
| created | 2017-08-18T13:53:29Z |
| flavor | m1.micro (<skipped>) |
| hostId | |
| id | 20d468f5-02af-4f14-9a73-afe0ad454c7e |
| image | TestVM (<skipped>) |
| key_name | - |
| metadata | {} |
| name | mirantis-test |
| os-extended-volumes:volumes_attached | [] |
| progress | 0 |
| security_groups | default |
| status | BUILD |
| tenant_id | <skipped>|
| updated | 2017-08-18T13:53:29Z |
| user_id | <skipped>|
+--------------------------------------+-------------------------------------------------+
Request to delete server mirantis-test has been accepted.
root@lxf824s001:~# ssh -q compute "ovs-vsctl show|grep -B1 4095"
        Port "qvod12f5eb3-f4"
            tag: 4095
root@controller:~#
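
For illustration (this helper is not part of the original report; it only wraps the same ovs-vsctl query and the 4095 dead-VLAN value shown above), stale ports can be enumerated explicitly instead of grepping the "ovs-vsctl show" output:

#!/usr/bin/env python3
# List OVS ports on this node that are still tagged with the dead VLAN (4095).
import subprocess

def stale_dead_vlan_ports():
    # "ovs-vsctl --bare --columns=name find Port tag=4095" prints one port
    # name per line for every port whose tag column equals 4095.
    out = subprocess.check_output(
        ["ovs-vsctl", "--bare", "--columns=name", "find", "Port", "tag=4095"],
        text=True,
    )
    return [line.strip() for line in out.splitlines() if line.strip()]

if __name__ == "__main__":
    for port in stale_dead_vlan_ports():
        print(port)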

Steps to reproduce:
Boot an instance and delete it immediately after the boot request (see the commands above), then check "ovs-vsctl show" on the compute node for ports with tag 4095.
Expected results:
There should not be any qvo ports with tag 4095
Reproducibility:
Not reproducible on environments based on virtual machines; hardware nodes only.
Workaround:
Add a pause between creating and deleting a VM; this is sometimes not possible because Heat stack scenarios are used.
Description of the environment:
- MOS 8.0 MU4
- Operating system: Ubuntu 14.04
- Network model: Neutron

tags: added: customer-found
Changed in mos:
importance: Undecided → High
Changed in mos:
milestone: none → 8.0-updates
Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Anatolii, the command you provided doesn't wait until a VM is spawned. Could you please try the same with the --poll option?

Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

It doesn't matter whether the command waits for the instance to finish booting or not.
Stale ports should not remain regardless of how the instance is booted and cleaned up.

Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Eugene, let's wait until Anatolii answers, ok?

Revision history for this message
Anatolii Neliubin (aneliubin) wrote :

Using the --poll option helps to eliminate the issue, so it can be used as a workaround when we launch a bunch of machines "manually". However, as I mentioned before, the customer heavily uses Heat stacks to launch and delete instances, so we need to fix the root cause of this issue: slowing down the Heat stacks (by adding WaitConditions?) is not acceptable.

Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Small update on the issue. I managed to reproduce it in my virtual lab using subsequent calls to boot and delete. It turned out that the deletion of the interface itself happens when nova receives the 'network-vif-deleted' event from neutron, but if the instance no longer exists at that moment, nova fails to delete the interface. The neutron-openvswitch-agent does not create or delete the instance's interfaces itself, but it does manage interface properties such as VLANs. If the instance does not exist (which is the case after deletion), it cannot bind the created interface to any meaningful VLAN and sets it to the DEAD_VLAN (4095). This is expected behavior. So what we see when an instance is deleted right after its creation was requested is, IMO, predictable: neutron does not manage to send the network-vif-deleted event in time after the port is destroyed, and nova drops that event. I'm not sure whether this is fixable at all, therefore I've asked Vladyslav Drok to kindly take a look at this issue.

Revision history for this message
Vladyslav Drok (vdrok) wrote :

So after doing some investigation I think the following happens:

1. A boot request comes in, the instance build greenthread is spawned, neutron ports are allocated, and the info is put into instance.info_cache.network_info (asynchronously).
2. The instance is deleted, the greenthread handling the deletion starts, task_state is set to 'deleting', and the greenthread is suspended.
3. The instance build greenthread gets to execute and sees that task_state has become 'deleting' instead of the expected 'spawning'. During this update of the task_state field an exception is thrown, which triggers the deallocate_for_instance method of NetworkAPI.
4. This method deletes the neutron ports that were created and cleans up the cached network info.
5. The deletion greenthread gets to execute. It goes all the way down to the libvirt driver's cleanup method, which tries to remove the compute NICs using the cached network info, which is now empty. The NICs are not removed.

I'll take a closer look to see whether there is a way to fix this properly, but it is not obvious to me right now :) Does the fact that these devices (qbr, qvb, qvo) are not deleted cause any trouble?
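
To make the ordering above concrete, here is a minimal, self-contained sketch of the race. Plain Python threads and events stand in for the build and delete greenthreads; none of this is actual nova code, and the dict below only mimics instance.info_cache.network_info:

import threading

# Toy stand-ins for nova state; not real nova objects.
instance = {"task_state": "scheduling", "network_info": []}
host_vifs = []  # qvo/qbr/qvb devices that exist on the compute host

ports_plugged = threading.Event()
marked_deleting = threading.Event()
build_aborted = threading.Event()

def build_greenthread():
    # Step 1: neutron ports allocated, vifs plugged, info cached.
    instance["network_info"].append("qvod12f5eb3-f4")
    host_vifs.append("qvod12f5eb3-f4")
    ports_plugged.set()
    marked_deleting.wait()
    # Step 3: task_state is 'deleting' instead of the expected 'spawning',
    # so the build aborts.
    if instance["task_state"] == "deleting":
        # Step 4: the abort path deallocates the neutron ports and also
        # empties the cached network info.
        instance["network_info"].clear()
        build_aborted.set()

def delete_greenthread():
    ports_plugged.wait()
    # Step 2: the delete request sets task_state to 'deleting'.
    instance["task_state"] = "deleting"
    marked_deleting.set()
    build_aborted.wait()
    # Step 5: driver cleanup unplugs only what the cached network info lists,
    # which is now nothing, so the devices stay behind on the host.
    for vif in list(instance["network_info"]):
        host_vifs.remove(vif)

b = threading.Thread(target=build_greenthread)
d = threading.Thread(target=delete_greenthread)
b.start(); d.start(); b.join(); d.join()
print("stale vifs left on the compute:", host_vifs)  # ['qvod12f5eb3-f4']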

Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

> Does the fact that these devices (qbr, qvb, qvo) are not deleted cause any trouble?
Vladyslav, no.

The main "trouble" is OVS agent complaining about unknown ports, which makes neutron red in monitoring dashboards. OVS agent doesn't detect non-ovs artifacts (qbr, qvb, tap) so it doesn't matter.

Still it would be great to have this issue fixed.

Revision history for this message
Jay Pipes (jaypipes) wrote :

This is not a High priority. This is Low priority. No data is corrupted. There is no loss of data plane connectivity and no loss of control plane connectivity. The only minor issue is some OVS virtual bridges left on the compute host, which don't hurt anything.

summary: - qvo ports are not removed correctly when a VN is deleted immediately
- after creation
+ qvo ports are not removed correctly when an instance is deleted
+ immediately after creation
Changed in mos:
importance: High → Low
Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

We don't fix Low impact bugs in MUs.

Changed in mos:
status: New → Won't Fix
Revision history for this message
Alexander Rubtsov (arubtsov) wrote :

sla2 for 9.0-updates

tags: added: sla2
Revision history for this message
Alexander Rubtsov (arubtsov) wrote :

sla1 for 9.0-updates

tags: added: sla1
removed: sla2
Revision history for this message
Dmitry Sutyagin (dsutyagin) wrote :

This bug becomes critical when DPDK is enabled, because of the hardcoded vhost socket limit in the DPDK driver:
http://dpdk.org/browse/dpdk/tree/lib/librte_vhost/socket.c#n94
This limit cannot be increased via configuration, which makes it essential that stale interfaces do not build up on the computes.
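
As an illustration of why the buildup matters, a compute could periodically compare the number of vhost-user ports against that limit. The sketch below is not from this report; the 1024 value and the dpdkvhostuser interface type are assumptions to be verified against the linked socket.c and the local OVS-DPDK build:

#!/usr/bin/env python3
# Warn when the number of vhost-user ports on this compute approaches the
# (assumed) hardcoded DPDK socket limit, since stale ports count against it.
import subprocess

# Assumed value; verify against the limit defined in the linked socket.c.
ASSUMED_MAX_VHOST_SOCKETS = 1024

def vhostuser_port_count():
    out = subprocess.check_output(
        ["ovs-vsctl", "--bare", "--columns=name", "find",
         "Interface", "type=dpdkvhostuser"],
        text=True,
    )
    return len([line for line in out.splitlines() if line.strip()])

if __name__ == "__main__":
    used = vhostuser_port_count()
    print(f"{used} vhost-user ports in use")
    if used > 0.9 * ASSUMED_MAX_VHOST_SOCKETS:
        print("WARNING: approaching the vhost socket limit")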

Revision history for this message
Denis Ipatov (dipatov) wrote :

This bug also causes roughly 1 GB of logs to be generated per hour.

Revision history for this message
Dmitry Sutyagin (dsutyagin) wrote :

I am able to reproduce this issue in a DPDK-enabled lab with a simple loop:

while true; do nova boot --flavor m1.tiny --image TestVM --nic net-id=59691fd6-a122-4e5a-90a0-8285721a18d4 test1 | tee | grep '^| id ' | awk '{print $4}' | tee | xargs nova delete; done

It took a minute or two to generate 12 stale DPDK socket ports / OVS ports with tag 4095. So the probability is very high, and the potential impact is high if DPDK is used: just a few hours of active VM creation and deletion can bring OpenStack down, i.e. users won't be able to spawn any new VMs.

We need to work further to find a solution. I'll investigate myself, but any help is highly appreciated.

Revision history for this message
Dmitry Sutyagin (dsutyagin) wrote :

I also confirm that the root cause is identical to what was reported earlier by Denis and Vladyslav: network_info is empty during cleanup, therefore no ports are deleted.

Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Setting as Won't Fix since there is no solution without refactoring the API.
