instance creation fails with "Failed to allocate the network(s), not rescheduling." because neutron-ovs-agent rpc_loop took too long
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Ubuntu Cloud Archive | New | Undecided | Unassigned | | |
neutron (Ubuntu) | Confirmed | Undecided | Unassigned | | |
Bug Description
Hi,
I'm running a cloud with 18.02 charms, and origin cloud:xenial-ocata. We use openvswitch and GRE tunnels for networking. We use l2pop.
Today I investigated an instance creation failure. The "fault" message in "nova show" was the following :
Build of instance 59ea33ee-
nova-compute log revealed the following :
2018-04-12 05:58:18.639 1852 ERROR nova.compute.
because the vif-plugged event never came in (below is the "preparing for event" message) :
2018-04-12 05:53:14.539 1852 DEBUG nova.compute.
I was a bit surprised, because everything else appeared to have worked. Looking at the neutron-ovs-agent logs, I found why the event was never sent: the rpc_loop iteration that was running when the instance was created lasted more than 5 minutes!
Beginning of rpc_loop :
2018-04-12 05:52:59.287 3796 DEBUG neutron.
End of rpc_loop :
2018-04-12 05:58:37.857 3796 DEBUG neutron.
As you can see, it lasted 338.570 seconds. Most of this time was spent running "conntrack -D" commands, such as :
2018-04-12 05:55:44.984 3796 DEBUG neutron.
There were 2 batches of those: a first one with 176 commands, and a second one with 194 commands. Each command took ~0.90 seconds, so it adds up. The machines are pretty heavily loaded, and it's the call to neutron-rootwrap that takes most of the time, not the conntrack deletion itself.
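As a rough sanity check of these numbers, the arithmetic can be sketched as below. The timestamps and command counts are the ones quoted in this report. The 300-second value is an assumption: it is nova's default `vif_plugging_timeout`, and this deployment may have set it differently.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Timestamps copied from the rpc_loop log lines quoted in this report.
loop_start = datetime.strptime("2018-04-12 05:52:59.287", FMT)
loop_end = datetime.strptime("2018-04-12 05:58:37.857", FMT)
loop_seconds = (loop_end - loop_start).total_seconds()

# Two batches of "conntrack -D" calls, each taking roughly 0.90 s.
commands = 176 + 194
per_command_seconds = 0.90  # approximate, observed on a loaded host
conntrack_seconds = commands * per_command_seconds

# Assumption: nova's vif_plugging_timeout is at its default of 300 s.
vif_plugging_timeout = 300

print(f"rpc_loop duration:      {loop_seconds:.3f} s")
print(f"conntrack -D estimate:  {conntrack_seconds:.1f} s for {commands} commands")
print(f"loop exceeds timeout:   {loop_seconds > vif_plugging_timeout}")
```

Under these assumptions the conntrack calls alone account for nearly all of the loop's duration, and the loop outlasts the time nova-compute is willing to wait for the vif-plugged event.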
These conntrack commands are created from /usr/lib/
for ip_version, current_ips in sg_members.items():
    add_ips, del_ips = self.ipset.
        sg_id, ip_version, current_ips)
    if devices and del_ips:
        # remove prefix from del_ips
        ips = [str(netaddr.
I believe it's trying to remove any existing conntrack entries when a member is removed from a secgroup.
And it looks like there's no safeguard in place to make sure that this runs in a limited amount of time, or across multiple rpc_loops.
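To illustrate what such a safeguard could look like, here is a minimal sketch of a time-budgeted work queue that spreads slow tasks across several loop iterations. This is not neutron code: `BudgetedQueue`, `FakeClock`, and the 10-second budget are hypothetical names and values chosen for the example, and the slow "conntrack -D" call is simulated by advancing a fake clock by 0.9 s.

```python
import time
from collections import deque


class BudgetedQueue:
    """Hypothetical helper: run queued callables within a time budget."""

    def __init__(self, budget_seconds, clock=time.monotonic):
        self.budget_seconds = budget_seconds
        self._clock = clock
        self._pending = deque()

    def add(self, task):
        self._pending.append(task)

    def run_once(self):
        """Run tasks until the budget is spent; return how many ran."""
        deadline = self._clock() + self.budget_seconds
        done = 0
        while self._pending and self._clock() < deadline:
            self._pending.popleft()()
            done += 1
        return done

    def __len__(self):
        return len(self._pending)


class FakeClock:
    """Simulated clock so the example runs instantly."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


clock = FakeClock()
queue = BudgetedQueue(budget_seconds=10.0, clock=clock)

# 370 simulated deletions, each "taking" 0.9 s of fake time.
for _ in range(370):
    queue.add(lambda: clock.advance(0.9))

iterations = 0
while len(queue):
    # In a real agent, the rest of the loop (for example reporting
    # ports as up) would run between these calls.
    queue.run_once()
    iterations += 1

print(f"drained in {iterations} iterations")
```

The trade-off of deferring work this way is that stale conntrack entries would stay in place longer after a member leaves the secgroup, so any real fix would need to weigh that against loop latency.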
So nova-compute times out waiting for the vif-plugged-event, and instance creation fails.
It should be easy to reproduce: create a bunch of instances in the same secgroup (we were at 28), delete one, and immediately create another (on the same compute node, I guess?). You can also add a small sleep of 0.5s to neutron-rootwrap to simulate the load.
You can find the full log of the rpc_loop iteration at https:/
Thanks !
tags: | added: sts |
Changed in neutron (Ubuntu): | |
status: | Confirmed → Fix Committed |
status: | Fix Committed → Confirmed |
Status changed to 'Confirmed' because the bug affects multiple users.