Mirantis OpenStack

[Nova] Live migration is not actually performed; instance stays on the same compute node

Bug #1544564 reported by Rodion Promyshlennikov on 2016-02-11

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Mirantis OpenStack	Status tracked in 10.0.x
10.0.x	Confirmed	Medium	Timofey Durakov	Mirantis OpenStack 10.0
8.0.x	Won't Fix	Medium	Timofey Durakov	Mirantis OpenStack 8.0-updates
9.x	Won't Fix	Medium	Timofey Durakov	Mirantis OpenStack 9.0

Bug Description

Live migration does not work reliably (it succeeds in at most 1 of 5 attempts).

Environment:
MOS 8.0, ISO #549

Steps to reproduce:
1. Deploy a standard environment with 3 controllers and 2 compute nodes.
2. Launch an instance from an image with a small flavor (I used a CirrOS image).
3. Live-migrate the instance to the second compute node with block_migration=True (you can do this from the CLI or Horizon; the result should be the same).

Expected Result:
The VM successfully migrates to another host.

Observed Result:
The VM did not migrate.

Diagnostic snapshot link:
https://drive.google.com/file/d/0B-QiiEr4w70UR2VCenRlWTI5N0U/view?usp=sharing

Rodion Promyshlennikov (rpromyshlennikov) on 2016-02-11

description:

updated

Roman Podoliaka (rpodolyaka) wrote on 2016-02-11:

Rodion, could you please elaborate on what specific error you see?

Timur Nurlygayanov (tnurlygayanov) on 2016-02-11

description:

updated

Roman Podoliaka (rpodolyaka) wrote on 2016-02-11:

We are taking a closer look at the environment right now.

tags:

added: area-nova
removed: nova

Roman Podoliaka (rpodolyaka) on 2016-02-11

tags:

added: release-notes

Roman Podoliaka (rpodolyaka) wrote on 2016-02-11:

OK. So this is a specific case of live migration (block_migrate=True), where the VM's ephemeral disks *are not* shared but are stored on local disks instead, which means they have to be transferred over the network when the VM is live migrated. This *can* be done, but it is generally not recommended: disks may be big, so it is inefficient and puts the network under high load.
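As a rough illustration of why local disks make block migration slow, the sketch below estimates how long it takes to copy a disk over a network link. The function name and the 0.8 link-efficiency factor are assumptions made for this example, not values taken from this environment.

```python
def estimate_transfer_seconds(disk_gib: float,
                              link_gbit_per_s: float,
                              efficiency: float = 0.8) -> float:
    """Rough estimate of the time needed to copy a local disk over the network.

    disk_gib        -- amount of disk data to copy, in GiB
    link_gbit_per_s -- nominal link speed, in Gbit/s
    efficiency      -- assumed fraction of the link usable for the copy
    """
    if disk_gib < 0:
        raise ValueError("disk_gib must not be negative")
    if link_gbit_per_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    disk_bits = disk_gib * 1024 ** 3 * 8
    usable_bits_per_s = link_gbit_per_s * 1e9 * efficiency
    return disk_bits / usable_bits_per_s
```

Under these assumptions, a 40 GiB disk on a 1 Gbit/s link needs roughly 7 minutes for the disk copy alone, before any memory transfer.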

Depending on the root disk size and network bandwidth, it may take a long time for a VM disk to be transferred. In Liberty a special option was introduced to the libvirt driver to abort stuck live migrations:

[libvirt]
live_migration_progress_timeout = 150
(IntOpt) Time to wait, in seconds, for migration to make forward progress in transferring data before aborting the operation. Set to 0 to disable timeouts.

And this is exactly what we see on Rodion's environment:

http://paste.openstack.org/show/486721/

nova-compute aborted the live migration because there was no progress during 150s.
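The behaviour of a progress timeout like this can be modelled with a small watchdog. This is a simplified sketch and not nova's actual code: the function name, the sample format, and the use of a single "data remaining" counter are assumptions made for illustration.

```python
from typing import Iterable, Optional, Tuple


def first_abort_time(samples: Iterable[Tuple[float, int]],
                     progress_timeout: float = 150) -> Optional[float]:
    """Return the timestamp at which the watchdog would abort, or None.

    samples          -- (timestamp_seconds, data_remaining_bytes) pairs,
                        in chronological order (hypothetical format)
    progress_timeout -- seconds allowed without forward progress;
                        0 disables the check
    """
    if progress_timeout == 0:
        return None
    watermark = None      # lowest "data remaining" value seen so far
    last_progress = None  # time at which the watermark last dropped
    for now, remaining in samples:
        if watermark is None or remaining < watermark:
            # Forward progress: less data remains than ever before.
            watermark = remaining
            last_progress = now
        elif now - last_progress > progress_timeout:
            # No new low for longer than the timeout: give up.
            return now
    return None
```

In this model, a counter that stops decreasing looks like a stalled migration even if bytes are still moving on the wire, and a timeout of 0 never aborts.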

It's unclear at this point why qemu/libvirt failed to report progress of the block device migration: we can see from the tcpdump logs that the disk was actually in the middle of migration when we stopped it. We'll take a closer look at this.

User impact is moderate: if block migration fails, the instance continues to run on the source host. The workaround is to increase the live_migration_progress_timeout value in nova.conf, or to set it to 0 to disable timeouts completely.
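One way to apply the workaround is sketched below: a small helper that sets an option in INI-style configuration text while leaving all other lines untouched. The helper name is hypothetical, and the value 600 in the usage note is only an illustrative choice; the usual location /etc/nova/nova.conf on the compute nodes is an assumption about the deployment.

```python
def set_option(conf_text: str, section: str, key: str, value: str) -> str:
    """Set key = value inside [section] of INI-style text.

    An existing uncommented assignment is replaced; otherwise the option
    is added at the end of the section, and the section is created at the
    end of the text if it does not exist.
    """
    out = []
    in_section = False
    done = False
    for line in conf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            if in_section and not done:
                # Leaving the target section without finding the key.
                out.append(f"{key} = {value}")
                done = True
            in_section = stripped[1:-1].strip() == section
        elif in_section and not done and not stripped.startswith(("#", ";")):
            if "=" in stripped and stripped.split("=", 1)[0].strip() == key:
                line = f"{key} = {value}"
                done = True
        out.append(line)
    if not done:
        if not in_section:
            out.append(f"[{section}]")
        out.append(f"{key} = {value}")
    return "\n".join(out) + "\n"
```

For example, `set_option(text, "libvirt", "live_migration_progress_timeout", "600")` raises the timeout, and passing "0" disables it. The nova-compute service would most likely need a restart to pick up the change.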

Roman Podoliaka (rpodolyaka) wrote on 2016-02-11:

I suggest we downgrade this to High and increase the timeout value in 8.0-mu1. So this should go to release notes in 8.0.

For 9.0, we'll need to look at whether block migration progress reporting can be improved on the qemu/libvirt side.

tags:

added: move-to-mu

Roman Podoliaka (rpodolyaka) wrote on 2016-02-11:

I actually think this should be Medium for 9.0 as block-migration is the use case we'd like to avoid and this is mostly mitigated by increasing the timeout value anyway.

Still, if we can improve qemu/libvirt that would be even better.

Olga Gusarenko (ogusarenko) on 2016-02-25

tags:

added: 8.0 release-notes-done
removed: release-notes

Roman Podoliaka (rpodolyaka) wrote on 2016-03-29:

Oops, forgot to update the importance for 8.0-updates.

Fuel Devops McRobotson (fuel-devops-robot) on 2016-04-12

Changed in mos:
status:	Confirmed → Won't Fix

Dina Belova (dbelova) wrote on 2016-04-13:

Added the move-to-10.0 tag because the bug was transferred from 9.0 to 10.0.

tags:

added: move-to-10.0

Vitaly Sedelnik (vsedelnik) wrote on 2016-05-24:

Won't Fix for 8.0-updates because of Medium importance

Yuri Shovkoplias (yuri-shovkoplias) wrote on 2016-07-10:

Guys, this is not Medium importance; we are experiencing this issue in customer deployments.

tags:

added: customer-found

Roman Podoliaka (rpodolyaka) wrote on 2016-07-13:

#10

Yuriy,

1) As we discussed, you were actually seeing a different issue, not this one.

2) The fact that it affects customer deployments does not directly affect the bug importance, which essentially denotes the user impact and whether the problem can easily be avoided by using workarounds.

tags:

added: 10.0-reviewed
removed: move-to-10.0 move-to-mu

Oleksandr Liemieshko (oliemieshko) on 2016-10-24

tags:

added: ct1
