Live migration doesn't work properly for Windows VM

Bug #1657708 reported by Oleksandr Liemieshko
This bug affects 1 person
Affects: Mirantis OpenStack (status tracked in 10.0.x)

Series   Status    Importance   Assigned to            Milestone
10.0.x   Invalid   High         MOS Nova               -
8.0.x    Invalid   High         Oleksandr Liemieshko   -
9.x      Invalid   High         MOS Maintenance        -

Bug Description

When attempting a live migration of a Windows instance from one compute node to another, I ran into a situation where the instance failed to migrate properly.
The instance has "SHUTOFF" status in nova, is in "shut off" state on the old compute node, and is "running" on the new one.
Moreover, a VNC connection opened to the instance before the migration keeps working despite the states described above, but any new VNC connection to the instance fails with a "Failed to connect to server (code: 1006)" error.

In nova:
root@node-7:~# nova show dcd2d83c-2670-470c-8544-f7f858023162
+--------------------------------------+----------------------------------------------------------+
| Property | Value |
+--------------------------------------+----------------------------------------------------------+
| OS-DCF:diskConfig | AUTO |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | node-9.domain.tld |
| OS-EXT-SRV-ATTR:hypervisor_hostname | node-9.domain.tld |
| OS-EXT-SRV-ATTR:instance_name | instance-0000000c |
| OS-EXT-STS:power_state | 4 |
| OS-EXT-STS:task_state | - |
| OS-EXT-STS:vm_state | stopped |
| OS-SRV-USG:launched_at | 2017-01-18T14:55:34.000000 |
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| admin_internal_net network | 192.168.111.12 |
| config_drive | |
| created | 2017-01-18T14:53:27Z |
| flavor | m1.medium (3) |
| hostId | ce0eff158351c1c9ccb5742b968fe95700b9cfc9c4665a0795b28629 |
| id | dcd2d83c-2670-470c-8544-f7f858023162 |
| image | win2012r2 (7d7f4c95-ed75-4897-b7f3-21c5a93a94ed) |
| key_name | - |
| metadata | {} |
| name | orig |
| os-extended-volumes:volumes_attached | [] |
| security_groups | default |
| status | SHUTOFF |
| tenant_id | 3bdaac4958d248d3a775294181962df5 |
| updated | 2017-01-18T15:42:53Z |
| user_id | 6a8a2c38b7144e648aaa094464194dde |
+--------------------------------------+----------------------------------------------------------+

On an old compute node:
root@node-9:~# virsh list --all | grep "instance-0000000c"
 - instance-0000000c shut off

On a new compute node:
root@node-8:~# virsh list --all | grep "instance-0000000c"
 61 instance-0000000c running

After "Hard Reboot Instance" it is "running" on both compute nodes

root@node-8:~# date
Thu Jan 19 11:07:19 UTC 2017
root@node-8:~# virsh list --all | grep "instance-0000000c"
 61 instance-0000000c running

root@node-9:~# date
Thu Jan 19 11:07:13 UTC 2017
root@node-9:~# virsh list --all | grep "instance-0000000c"
 142 instance-0000000c running

root@node-7:~# nova show dcd2d83c-2670-470c-8544-f7f858023162
+--------------------------------------+----------------------------------------------------------+
| Property | Value |
+--------------------------------------+----------------------------------------------------------+
| OS-DCF:diskConfig | AUTO |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | node-9.domain.tld |
| OS-EXT-SRV-ATTR:hypervisor_hostname | node-9.domain.tld |
| OS-EXT-SRV-ATTR:instance_name | instance-0000000c |
| OS-EXT-STS:power_state | 1 |
| OS-EXT-STS:task_state | - |
| OS-EXT-STS:vm_state | active |
| OS-SRV-USG:launched_at | 2017-01-18T14:55:34.000000 |
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| admin_internal_net network | 192.168.111.12 |
| config_drive | |
| created | 2017-01-18T14:53:27Z |
| flavor | m1.medium (3) |
| hostId | ce0eff158351c1c9ccb5742b968fe95700b9cfc9c4665a0795b28629 |
| id | dcd2d83c-2670-470c-8544-f7f858023162 |
| image | win2012r2 (7d7f4c95-ed75-4897-b7f3-21c5a93a94ed) |
| key_name | - |
| metadata | {} |
| name | orig |
| os-extended-volumes:volumes_attached | [] |
| progress | 0 |
| security_groups | default |
| status | ACTIVE |
| tenant_id | 3bdaac4958d248d3a775294181962df5 |
| updated | 2017-01-19T11:02:21Z |
| user_id | 6a8a2c38b7144e648aaa094464194dde |
+--------------------------------------+----------------------------------------------------------+
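If the domain ends up defined and running on both hypervisors as shown above, one possible cleanup (not from the original report) is to stop and undefine the stale copy on the host that nova does not consider the owner (node-8 here, since OS-EXT-SRV-ATTR:host points at node-9); verify which copy actually serves the Ceph disks before doing this.

root@node-8:~# virsh destroy instance-0000000c     # power off the stale domain
root@node-8:~# virsh undefine instance-0000000c    # drop its libvirt definition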

Steps to reproduce:
    - Fuel 8.0
    - 1 controller, 2 computes
    - VLAN
    - Ceph for all
    - image for Windows (https://cloudbase.it/windows-cloud-images/)

Scenario:
1. Create a new VM from the Windows image
2. Open VNC connection to the instance before migration
3. Trigger "Live Migrate Instance" a few times (see the CLI sketch below)
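For clarity, the scenario above maps roughly to the following CLI calls. This is only a sketch: the instance name "orig", image "win2012r2" and flavor "m1.medium" are taken from the nova show output above, the target host argument is optional, and a network may need to be specified with --nic depending on the environment.

root@node-7:~# nova boot --image win2012r2 --flavor m1.medium orig        # 1. create the Windows VM
root@node-7:~# nova get-vnc-console orig novnc                            # 2. open a VNC console before migrating
root@node-7:~# nova live-migration orig node-8.domain.tld                 # 3. live-migrate; repeat a few times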

Changed in mos:
importance: Undecided → High
assignee: nobody → MOS Nova (mos-nova)
tags: added: cus
tags: added: customer-found
removed: cus
tags: added: support
tags: added: area-nova
Revision history for this message
Timofey Durakov (tdurakov) wrote :

Alexander, could you please provide nova logs from both compute nodes?

Changed in mos:
status: New → Incomplete
assignee: MOS Nova (mos-nova) → Alexander Lemeshko (oliemieshko)
Revision history for this message
Oleksandr Liemieshko (oliemieshko) wrote :
Changed in mos:
assignee: Alexander Lemeshko (oliemieshko) → Timofey Durakov (tdurakov)
tags: added: ct1
Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Cannot reproduce on 9.2, marking as Invalid for 9.x-series.

Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Alexander, I've tried to reproduce the issue on 8.0-MU-4 and failed. I did ~15 migrations in a row and there was no problem with this particular image. However, I remember there was a bug regarding live migrations of instances with huge ephemeral disks: https://bugs.launchpad.net/nova/+bug/1644248. Live migration can fail because the default migration progress timeout is only 150 seconds and nova has a problem counting down this timeout, so it may end up failing to migrate the instance. The fix is to increase the live_migration_progress_timeout option or to disable it entirely by setting it to 0.

If you're still facing the issue, please try to update to the latest MU and collect more information.
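For reference, a minimal nova.conf sketch of the workaround described above. The [libvirt] section name and the restart step are assumptions based on the Mitaka-era libvirt driver, not part of the original comment; check the exact option location for your release.

[libvirt]
# Abort the live migration when no progress is observed for this many seconds.
# The default is 150; setting the option to 0 disables the progress timeout.
live_migration_progress_timeout = 0

Apply the change on every compute node and restart the nova-compute service for it to take effect.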

Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Moving to Invalid after more than a month in Incomplete without feedback. Please feel free to reopen it if you face the issue again and have clear steps to reproduce.
