Live migration succeeds but instance-action-list still has unexpected Error status
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
OpenStack Compute (nova) |
Triaged
|
Undecided
|
Unassigned |
Bug Description
Brief Description
-----------------
Live migration succeeds but instance-
Severity
--------
standard
Steps to Reproduce
------------------
1. Launch instance from glance image with flavor
for eg. 3vcpu, 1GB ram, 14GB disk, hw:cpu_policy, hw:mem_
ID b457c0d6-
*If it scheduled on compute-0, migrate it away from compute-0 (cold migrate) to one of the other computes eg. compute-2
3. Attempt to live (block) migration away from compute-2
The instance schedules and appears to have completed on compute-1
Expected Behavior
------------------
Expect the instance to autoschedule (live migrate) on the compatible host
Did not expect nova instance-
Actual Behavior
----------------
Error is reported in the nova instance-
$ nova instance-
Action | Request_ID | Message | Start_Time |
Updated_At |
...
| live-migration | req-a9598f9c-
$ nova instance-action b457c0d6-
| Property | Value
| action | live-migration |
| events | [{u'event': u'compute_
| | u'finish_time': u'2019-
| | u'host': u'compute-1', |
| | u'hostId': u'5420963ef1662
| | u'result': u'Success', |
| | u'start_time': u'2019-
| | u'traceback': None}, |
| | {u'event': u'compute_
| | u'finish_time': u'2019-
| | u'host': u'compute-1', |
| | u'hostId': u'5420963ef1662
| | u'result': u'Success', |
| | u'start_time': u'2019-
| | u'traceback': None}, |
| | {u'event': u'compute_
| | u'finish_time': u'2019-
| | u'host': u'compute-2', |
| | u'hostId': u'cf35d970d7214
| | u'result': u'Success', |
| | u'start_time': u'2019-
| | u'traceback': None}, |
| | {u'event': u'compute_
| | u'finish_time': u'2019-
| | u'host': u'compute-2', |
| | u'hostId': u'cf35d970d7214
| | u'result': u'Success', |
| | u'start_time': u'2019-
| | u'traceback': None}, |
| | {u'event': u'compute_
| | u'finish_time': None, |
| | u'host': u'compute-1', |
| | u'hostId': u'5420963ef1662
| | u'result': None, |
| | u'start_time': u'2019-
| | u'traceback': None}, |
| | {u'event': u'compute_
| | u'finish_time': u'2019-
| | u'host': u'compute-0', |
| | u'hostId': u'4ab1941a19840
| | u'result': u'Success', |
| | u'start_time': u'2019-
| | u'traceback': u' File "/var/lib/
| | return function(self, context, *args, **kwargs) |
| | File "/var/lib/
| | kwargs[
| | File "/var/lib/
| | self.force_
| | File "/var/lib/
| | six.reraise(
| | File "/var/lib/
| | return function(self, context, *args, **kwargs) |
| | File "/var/lib/
| | block_migration, disk_over_commit) |
| | File "/var/lib/
| | self._compare_
| | File "/var/lib/
| | raise exception.
| | '}, |
| | {u'event': u'conductor_
| | u'finish_time': u'2019-
| | u'host': u'nova-
| | u'hostId': u'b94f767d471ef
| | u'result': u'Success', |
| | u'start_time': u'2019-
| | u'traceback': None}] |
| instance_uuid | b457c0d6-
| message | Error |
| project_id | 70c3556265ee461
| request_id | req-a9598f9c-
| start_time | 2019-04-
| updated_at | 2019-04-
| user_id | 319e82fc9a89424
+------
Reproducibility
---------------
yes
System Configuration
-------
2+3
(Lab yow-cgcs-
compute-0:
Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Applications
Processor 0 : 3-21,47-65
Processor 1 : 22-43,66-87
compute-1 and compute-2:
Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
Applications Processor 0 : 3-17,39-53
Processor 1 : 18-35,54-71
Branch/Pull Time/Commit
-------
BUILD_ID=
JOB="STX_
tags: | added: starlingx |
A bit weird, there are two events for compute_ check_can_ live_migrate_ destination with one that didn't finish (or the event wasn't finished anyway):
| | {u'event': u'compute_ check_can_ live_migrate_ destination' , | 8588a36ecea76a7 1540080b5f7a01d 77b37b276aad6' , | 04-11T20: 24:52.000000' , |
| | u'finish_time': None, |
| | u'host': u'compute-1', |
| | u'hostId': u'5420963ef1662
| | u'result': None, |
| | u'start_time': u'2019-
| | u'traceback': None}, |
The one that doesn't really make sense is this one which reports success but hit some failure - and was on compute-0 (which is the host you migrated away from):
| | {u'event': u'compute_ check_can_ live_migrate_ destination' , | 04-11T20: 24:52.000000' , | 3c0b8b1a26c1f9e 39536fdd884b4c4 d1a8ab381a30b' , | 04-11T20: 24:51.000000' , | openstack/ lib/python2. 7/site- packages/ nova/compute/ utils.py" , line 1323, in decorated_function | openstack/ lib/python2. 7/site- packages/ nova/compute/ manager. py", line 214, in decorated_function | \'instance\ '], e, sys.exc_info()) | openstack/ lib/python2. 7/site- packages/ oslo_utils/ excutils. py", line 220, in __exit__ | reraise( ) | openstack/ lib/python2. 7/site- packages/ oslo_utils/ excutils. py", line 196, in force_reraise | self.type_ , self.value, self.tb) | openstack/ lib/python2. 7/site- packages/ nova/compute/ manager. py", line 202, in decorated_function | openstack/ lib/python2. 7/site- packages/ nova/compute/ manager. py", line 6203, in check_can_ live_migrate_ destination | openstack/ lib/python2. 7/site- packages/ nova/virt/ libvirt/ driver. py", line 7195, in check_can_ live_migrate_ destination | cpu(None, source_cpu_info, instance) | openstack/ lib/python2. 7/site- packages/ nova/virt/ libvirt/ driver. py", line 7446, in _compare_cpu | InvalidCPUInfo( reason= m % {\'ret\': ret, \'u\': u}) |
| | u'finish_time': u'2019-
| | u'host': u'compute-0', |
| | u'hostId': u'4ab1941a19840
| | u'result': u'Success', |
| | u'start_time': u'2019-
| | u'traceback': u' File "/var/lib/
| | return function(self, context, *args, **kwargs) |
| | File "/var/lib/
| | kwargs[
| | File "/var/lib/
| | self.force_
| | File "/var/lib/
| | six.reraise(
| | File "/var/lib/
| | return function(self, context, *args, **kwargs) |
| | File "/var/lib/
| | block_migration, disk_over_commit) |
| | File "/var/lib/
| | self._compare_
| | File "/var/lib/
| | raise exception.
| | '}, |
I think what happened here, based on hosts and chronological order, is that you had:
1. create server - it lands on compute-0 live_migrate_ destination failed on compute-0 but the event wasn't marked as a failure for some weird reason.
2. cold migrate to compute-2
3. live migrate: the scheduler tries compute-0 first which fails with InvalidCPUInfo and then reschedules to compute-1 which is successful. So check_can_
Anyway, the reschedule is normal. And the live migration completed successfully on compute-1.
I guess the problem is that because one event failed, the overall action status is 'Error':
| message | Error |
The problem is the code that sets the overall action status is based on if any of the events in that action failed:
https:/ /github. com/openstack/ nova/blob/ 03322bb517925a9 f5a04ebdb41c3fd 31e7962440/ nova/db/ sqlalchemy/ api.py# L5249
But this doesn't take into account t...