Live migration succeeds but instance-action-list still has unexpected Error status

Bug #1824420 reported by Wendy Mitchell
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Triaged
Undecided
Unassigned

Bug Description

Brief Description
-----------------
Live migration succeeds but instance-action-list still has unexpected Error status

Severity
--------
standard

Steps to Reproduce
------------------

1. Launch instance from glance image with flavor
for eg. 3vcpu, 1GB ram, 14GB disk, hw:cpu_policy, hw:mem_page_size=2048
ID b457c0d6-ee31-43da-a481-bc3ad5663d7c

*If it scheduled on compute-0, migrate it away from compute-0 (cold migrate) to one of the other computes eg. compute-2

3. Attempt to live (block) migration away from compute-2
The instance schedules and appears to have completed on compute-1

Expected Behavior
------------------
Expect the instance to autoschedule (live migrate) on the compatible host
Did not expect nova instance-action-list to report Error (at least when the instance was autoscheduled and succeeded the migration)

Actual Behavior
----------------
Error is reported in the nova instance-action-list (even though it autoscheduled and moved to the host)

$ nova instance-action-list b457c0d6-ee31-43da-a481-bc3ad5663d7c
Action | Request_ID | Message | Start_Time |
Updated_At |
...
| live-migration | req-a9598f9c-0fab-4a2d-afbe-7d362ea9bb64 | Error | 2019-04-11T20:24:49.000000 | 2019-04-11T20:25:09.000000

$ nova instance-action b457c0d6-ee31-43da-a481-bc3ad5663d7c req-a9598f9c-0fab-4a2d-afbe-7d362ea9bb64
| Property | Value
| action | live-migration |
| events | [{u'event': u'compute_post_live_migration_at_destination', |
| | u'finish_time': u'2019-04-11T20:25:09.000000', |
| | u'host': u'compute-1', |
| | u'hostId': u'5420963ef16628588a36ecea76a71540080b5f7a01d77b37b276aad6', |
| | u'result': u'Success', |
| | u'start_time': u'2019-04-11T20:25:07.000000', |
| | u'traceback': None}, |
| | {u'event': u'compute_pre_live_migration', |
| | u'finish_time': u'2019-04-11T20:24:58.000000', |
| | u'host': u'compute-1', |
| | u'hostId': u'5420963ef16628588a36ecea76a71540080b5f7a01d77b37b276aad6', |
| | u'result': u'Success', |
| | u'start_time': u'2019-04-11T20:24:53.000000', |
| | u'traceback': None}, |
| | {u'event': u'compute_live_migration', |
| | u'finish_time': u'2019-04-11T20:24:53.000000', |
| | u'host': u'compute-2', |
| | u'hostId': u'cf35d970d7214ea427046101a98b0c29734e063dd82f9ac668bf061c', |
| | u'result': u'Success', |
| | u'start_time': u'2019-04-11T20:24:53.000000', |
| | u'traceback': None}, |
| | {u'event': u'compute_check_can_live_migrate_source', |
| | u'finish_time': u'2019-04-11T20:24:52.000000', |
| | u'host': u'compute-2', |
| | u'hostId': u'cf35d970d7214ea427046101a98b0c29734e063dd82f9ac668bf061c', |
| | u'result': u'Success', |
| | u'start_time': u'2019-04-11T20:24:52.000000', |
| | u'traceback': None}, |
| | {u'event': u'compute_check_can_live_migrate_destination', |
| | u'finish_time': None, |
| | u'host': u'compute-1', |
| | u'hostId': u'5420963ef16628588a36ecea76a71540080b5f7a01d77b37b276aad6', |
| | u'result': None, |
| | u'start_time': u'2019-04-11T20:24:52.000000', |
| | u'traceback': None}, |
| | {u'event': u'compute_check_can_live_migrate_destination', |
| | u'finish_time': u'2019-04-11T20:24:52.000000', |
| | u'host': u'compute-0', |
| | u'hostId': u'4ab1941a198403c0b8b1a26c1f9e39536fdd884b4c4d1a8ab381a30b', |
| | u'result': u'Success', |
| | u'start_time': u'2019-04-11T20:24:51.000000', |
| | u'traceback': u' File "/var/lib/openstack/lib/python2.7/site-packages/nova/compute/utils.py", line 1323, in decorated_function |
| | return function(self, context, *args, **kwargs) |
| | File "/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 214, in decorated_function |
| | kwargs[\'instance\'], e, sys.exc_info()) |
| | File "/var/lib/openstack/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__ |
| | self.force_reraise() |
| | File "/var/lib/openstack/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise |
| | six.reraise(self.type_, self.value, self.tb) |
| | File "/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 202, in decorated_function |
| | return function(self, context, *args, **kwargs) |
| | File "/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 6203, in check_can_live_migrate_destination |
| | block_migration, disk_over_commit) |
| | File "/var/lib/openstack/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7195, in check_can_live_migrate_destination |
| | self._compare_cpu(None, source_cpu_info, instance) |
| | File "/var/lib/openstack/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7446, in _compare_cpu |
| | raise exception.InvalidCPUInfo(reason=m % {\'ret\': ret, \'u\': u}) |
| | '}, |
| | {u'event': u'conductor_migrate_server', |
| | u'finish_time': u'2019-04-11T20:24:53.000000', |
| | u'host': u'nova-conductor-6f96f9bf6f-z7xqm', |
| | u'hostId': u'b94f767d471ef257391e8d24d33e3fee0c7b37a6f3c9871d6b6ebe00', |
| | u'result': u'Success', |
| | u'start_time': u'2019-04-11T20:24:49.000000', |
| | u'traceback': None}] |
| instance_uuid | b457c0d6-ee31-43da-a481-bc3ad5663d7c |
| message | Error |
| project_id | 70c3556265ee4617981d506b0f5bbfeb |
| request_id | req-a9598f9c-0fab-4a2d-afbe-7d362ea9bb64 |
| start_time | 2019-04-11T20:24:49.000000 |
| updated_at | 2019-04-11T20:25:09.000000 |
| user_id | 319e82fc9a89424a92787ab4d91d0161 |
+---------------+------------------------------------

Reproducibility
---------------
yes

System Configuration
--------------------
2+3
(Lab yow-cgcs-wildcat-92_98)

compute-0:
Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Applications
Processor 0 : 3-21,47-65
Processor 1 : 22-43,66-87

compute-1 and compute-2:
Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
Applications Processor 0 : 3-17,39-53
Processor 1 : 18-35,54-71

Branch/Pull Time/Commit
--------------------
BUILD_ID="20190410T013000Z"
JOB="STX_build_master_master"

Revision history for this message
Matt Riedemann (mriedem) wrote :
Download full text (4.0 KiB)

A bit weird, there are two events for compute_check_can_live_migrate_destination with one that didn't finish (or the event wasn't finished anyway):

| | {u'event': u'compute_check_can_live_migrate_destination', |
| | u'finish_time': None, |
| | u'host': u'compute-1', |
| | u'hostId': u'5420963ef16628588a36ecea76a71540080b5f7a01d77b37b276aad6', |
| | u'result': None, |
| | u'start_time': u'2019-04-11T20:24:52.000000', |
| | u'traceback': None}, |

The one that doesn't really make sense is this one which reports success but hit some failure - and was on compute-0 (which is the host you migrated away from):

| | {u'event': u'compute_check_can_live_migrate_destination', |
| | u'finish_time': u'2019-04-11T20:24:52.000000', |
| | u'host': u'compute-0', |
| | u'hostId': u'4ab1941a198403c0b8b1a26c1f9e39536fdd884b4c4d1a8ab381a30b', |
| | u'result': u'Success', |
| | u'start_time': u'2019-04-11T20:24:51.000000', |
| | u'traceback': u' File "/var/lib/openstack/lib/python2.7/site-packages/nova/compute/utils.py", line 1323, in decorated_function |
| | return function(self, context, *args, **kwargs) |
| | File "/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 214, in decorated_function |
| | kwargs[\'instance\'], e, sys.exc_info()) |
| | File "/var/lib/openstack/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__ |
| | self.force_reraise() |
| | File "/var/lib/openstack/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise |
| | six.reraise(self.type_, self.value, self.tb) |
| | File "/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 202, in decorated_function |
| | return function(self, context, *args, **kwargs) |
| | File "/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 6203, in check_can_live_migrate_destination |
| | block_migration, disk_over_commit) |
| | File "/var/lib/openstack/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7195, in check_can_live_migrate_destination |
| | self._compare_cpu(None, source_cpu_info, instance) |
| | File "/var/lib/openstack/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7446, in _compare_cpu |
| | raise exception.InvalidCPUInfo(reason=m % {\'ret\': ret, \'u\': u}) |
| | '}, |

I think what happened here, based on hosts and chronological order, is that you had:

1. create server - it lands on compute-0
2. cold migrate to compute-2
3. live migrate: the scheduler tries compute-0 first which fails with InvalidCPUInfo and then reschedules to compute-1 which is successful. So check_can_live_migrate_destination failed on compute-0 but the event wasn't marked as a failure for some weird reason.

Anyway, the reschedule is normal. And the live migration completed successfully on compute-1.

I guess the problem is that because one event failed, the overall action status is 'Error':

| message | Error |

The problem is the code that sets the overall action status is based on if any of the events in that action failed:

https://github.com/openstack/nova/blob/03322bb517925a9f5a04ebdb41c3fd31e7962440/nova/db/sqlalchemy/api.py#L5249

But this doesn't take into account t...

Read more...

Changed in nova:
status: New → Triaged
Revision history for this message
Matt Riedemann (mriedem) wrote :

This should probably either be low or medium severity given it's extremely latent behavior.

tags: added: db live-migration
Revision history for this message
Matt Riedemann (mriedem) wrote :

Another solution would be to change the instance action in post_live_migration once we are sure we had a successful live migration to drop the 'Error' message value from any events that may have failed. That maybe has an upside to doing it in the DB API with overwriting existing events because you'd (1) still have the event log of the various hosts that the event ran on so we can see those reschedules and (2) have context on why it's OK to do that rather than blindly just update / overwrite existing events for other operations where maybe that overwrite is not OK.

The other cases I could think of that might have this same issue are server create and resize / cold migrate since those have reschedule loops as well.

Matt Riedemann (mriedem)
tags: added: starlingx
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.