Timeout waiting for vif plugging callback for instance
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Fix Released | Undecided | Salvatore Orlando |
Icehouse | Fix Released | Undecided | Unassigned |
Bug Description
The neutron full job is exhibiting a rather high number of cases where network-vif-plugged timeouts are reported.
http://
95.78% of messages of this kind appear in the neutron full job. However, only a fraction of them cause build failures, and that is only because of the way the tests are executed.
This error is currently being masked by another bug as tempest tries to get the console log of a VM in error state: https:/
This bug will target both neutron and nova pending a better triage.
Fixing this is of paramount importance to get the full job running.
Note: This is different from https:/
Changed in neutron: | |
importance: | Undecided → High |
assignee: | nobody → Salvatore Orlando (salvatore-orlando) |
milestone: | none → juno-2 |
tags: | added: icehouse-backport-potential |
Changed in nova: | |
milestone: | none → juno-2 |
status: | Fix Committed → Fix Released |
tags: | removed: icehouse-backport-potential |
Changed in nova: | |
milestone: | juno-2 → 2014.2 |
It seems the root cause of this bug is a missing check in the server_external_events extension.
The server_external_events extension handles notifications from Neutron for VIF plug events.
When a VIF is plugged, a network-vif-plugged event is sent; when a VIF is unplugged, neutron sends a network-vif-unplugged.
In order to optimize communication between these two services, Neutron packs events together when possible.
On the nova side, however, if processing fails for any one of the events in a request, none of the remaining events are processed.
In the specific case of this bug, we always observe the network-vif-plugged event being correctly sent to nova. However, this event is not processed because the same request also contained a network-vif-unplugged event which failed. The traceback in nova looks like this [1].
The failure happens because the server-external-events controller tries to dispatch this event to a compute node; unfortunately, the instance for which the VIF is being unplugged does not have a host. This apparently odd condition can nevertheless occur in a few cases. Here it is triggered by a shelve action [2].
In this case the network-vif-unplugged event should not be processed, simply because there is nothing to do. In general, when several events are packed into a single call to server-external-events, a failure in processing one event should not stop processing of the others. This API extension is not supposed to have all-or-none behaviour, so it is OK to proceed in case of failures.
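To illustrate the intended behaviour, here is a minimal Python sketch. It is not the nova code: `dispatch_event`, `process_events`, `NoHostError` and the `instance_hosts` mapping are hypothetical names made up for this example, and the event dictionaries only mimic the event names mentioned above.

```python
import logging

LOG = logging.getLogger(__name__)


class NoHostError(Exception):
    """Hypothetical error: the instance is not on any compute host."""


def dispatch_event(event, instance_hosts):
    """Hypothetical stand-in for forwarding one event to a compute node."""
    host = instance_hosts.get(event["server_uuid"])
    if host is None:
        # e.g. a shelved instance: there is no host to notify
        raise NoHostError(event["server_uuid"])
    return host


def process_events(events, instance_hosts):
    """Dispatch each event independently; log failures and keep going."""
    dispatched = []
    for event in events:
        try:
            host = dispatch_event(event, instance_hosts)
        except Exception:
            # One bad event must not block the rest of the batch.
            LOG.exception("Failed to dispatch event %s for instance %s",
                          event["name"], event["server_uuid"])
            continue
        dispatched.append((event["name"], event["server_uuid"], host))
    return dispatched


# A batch like the one described above: the unplug event fails,
# the plug event for another instance must still go through.
batch = [
    {"name": "network-vif-unplugged", "server_uuid": "shelved-instance"},
    {"name": "network-vif-plugged", "server_uuid": "booting-instance"},
]
hosts = {"shelved-instance": None, "booting-instance": "compute-1"}
result = process_events(batch, hosts)
```

With a fail-fast loop, the first event would raise and the network-vif-plugged event would never reach its compute node. Here only the failing event is skipped.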
Q: So what are you going to do?
A: Log an exception in case of failure while dispatching an event and move on to the next event.
Q: Why are you doing this?
A: To prevent an error on one instance from affecting the correct spawning of another instance. Also to remove the top cause of failure of the neutron full job.
Q: Man, you're ignoring an exception. This is bad. In some communities you might be hanged for this!
A: If you look at the current code, there's no handling of any exception whatsoever, and neutron does not care whether it receives a 200 or a 500, so in my opinion exceptions are already ignored.
Q: Yeah, but it seems there might be some design flaw here, and you should address that rather than put some tape on it or sweep the dust under the carpet.
A: I agree with the first part of the statement, not really with the latter. Neutron packs events together only to minimize the number of calls to nova. It is a bug if a failure in one event prevents processing of the others. I do agree, however, that neutron should perhaps be informed of which events were processed successfully and which were not.
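As a rough idea of what informing the caller could look like, here is a hypothetical Python sketch that returns one result per event. The function name, the `processed` and `error` keys, and `fake_dispatch` are assumptions for illustration only; this is not what nova or neutron currently do.

```python
def process_events_with_report(events, dispatch):
    """Return one result per event so the caller can see what failed."""
    report = []
    for event in events:
        outcome = dict(event)  # copy, so the input batch is left untouched
        try:
            dispatch(event)
            outcome["processed"] = True
        except Exception as exc:
            outcome["processed"] = False
            outcome["error"] = str(exc)
        report.append(outcome)
    return report


def fake_dispatch(event):
    """Hypothetical dispatcher that fails for an instance without a host."""
    if event["server_uuid"] == "shelved-instance":
        raise RuntimeError("instance has no host")


batch = [
    {"name": "network-vif-unplugged", "server_uuid": "shelved-instance"},
    {"name": "network-vif-plugged", "server_uuid": "booting-instance"},
]
report = process_events_with_report(batch, fake_dispatch)
```

The sender could then decide per event whether to retry or give up, instead of guessing from a single status code for the whole batch.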
Q: How do I know you're not lying? Show me a logstash query!
A: This is not easy. It's hard to build a query that finds situations where more than one event is packed together and one of those events then fails. The failure itself is rather common, and in most cases it does not cause a build failure [3]. However, looking only at failed builds where this exception appears, it is very likely to find exactly this failure mode [4].
Q: I've seen it happen only with the full job. Why is that?
A:...