Live migration's assigned ports conflict with ports already in use

Bug #1498196 reported by Aleksandr Shaposhnikov
Affects              Status    Importance  Assigned to  Milestone
Mirantis OpenStack   Invalid   High        Oleksii
7.0.x                Invalid   High        MOS Nova

Bug Description

It looks like during live migration the port generated for the incoming migration is not checked for already being in use, and there is no attempt to get a new port if the chosen one has meanwhile been occupied by someone else.
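
A minimal sketch of the kind of check-and-retry this description is asking for, assuming the destination node picks from libvirt's default migration port range (typically 49152-49215); the helper name and the hard-coded range are illustrative only, and a bind probe merely narrows the race window, since another process can still grab the port before QEMU binds it:

import socket

MIGRATION_PORT_MIN = 49152  # assumed libvirt default migration_port_min
MIGRATION_PORT_MAX = 49215  # assumed libvirt default migration_port_max

def pick_free_migration_port():
    """Return the first port in the range that is not currently in use."""
    for port in range(MIGRATION_PORT_MIN, MIGRATION_PORT_MAX + 1):
        sock = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
        try:
            sock.bind(("::", port))
            return port
        except socket.error:
            continue  # port occupied; re-get the next candidate
        finally:
            sock.close()
    raise RuntimeError("no free port left in the migration range")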

Here is an example of this behavior in nova-compute log files from the source compute node:

2015-09-20T06:25:21.701157+00:00 info: 2015-09-20 06:25:21.700 17037 INFO nova.virt.libvirt.driver [-] [instance: 60493be7-4b00-4f4f-a785-9d4aa6e74f58] Instance spawned successfully.
2015-09-20T06:25:21.828941+00:00 info: 2015-09-20 06:25:21.828 17037 INFO nova.compute.manager [req-8fdf447a-48c4-4b41-8276-9459ae9e5a65 - - - - -] [instance: 60493be7-4b00-4f4f-a785-9d4aa6e74f58] VM
Resumed (Lifecycle Event)
2015-09-20T06:25:37.349069+00:00 err: 2015-09-20 06:25:37.348 17037 ERROR nova.virt.libvirt.driver [req-8150b87f-f87b-4bec-8bab-561dd37605d5 820904596e1d422e9460f472b7b9672f 04ce0fe8f21a4a6b8535c5cefd9f8594 - - -] [instance: 60493be7-4b00-4f4f-a785-9d4aa6e74f58] Live Migration failure: internal error: early end of file from monitor: possible problem:
2015-09-20T06:25:37.116947Z qemu-system-x86_64: -incoming tcp:[::]:49152: Failed to bind socket: Address already in use
2015-09-20T06:25:37.354837+00:00 info: 2015-09-20 06:25:37.354 17037 INFO nova.virt.libvirt.driver [req-8150b87f-f87b-4bec-8bab-561dd37605d5 820904596e1d422e9460f472b7b9672f 04ce0fe8f21a4a6b8535c5cefd9f8594 - - -] [instance: 60493be7-4b00-4f4f-a785-9d4aa6e74f58] Migration running for 0 secs, memory 0% remaining; (bytes processed=0, remaining=0, total=0)
2015-09-20T06:25:37.856147+00:00 err: 2015-09-20 06:25:37.855 17037 ERROR nova.virt.libvirt.driver [req-8150b87f-f87b-4bec-8bab-561dd37605d5 820904596e1d422e9460f472b7b9672f 04ce0fe8f21a4a6b8535c5cefd9f8594 - - -] [instance: 60493be7-4b00-4f4f-a785-9d4aa6e74f58] Migration operation has aborted

Some environment details:

root@node-169:~# nova-compute --version
2015.1.1

root@node-169:~# dpkg -l |grep 'nova-compute '|awk '{print $3}'
1:2015.1.1-1~u14.04+mos19662

Steps to reproduce:

Actually, this happens during Rally testing of a pretty big environment (~200 nodes), roughly once per 200 iterations, so the chances of hitting it at scale are quite high. It should be reproducible under the following circumstances:
1. A very high rate of live migrations.
2. A lot of running VMs and other services occupying a large number of TCP ports.

Both of these factors increase the chances of a collision in the QEMU migration port allocation procedure.
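
As a rough way to gauge how crowded that allocation range already is on a busy compute node, here is a small probe; the 49152-49215 range is an assumption (libvirt's usual defaults for migration_port_min/migration_port_max) and should be adjusted if qemu.conf overrides it:

import socket

# Probe the assumed default migration port range and report which ports
# are already taken by other processes on this host.
occupied = []
for port in range(49152, 49216):
    sock = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    try:
        sock.bind(("::", port))
    except socket.error:
        occupied.append(port)  # something else already uses this port
    finally:
        sock.close()

print("occupied migration-range ports: %s" % occupied)

The more ports this reports, and the more migrations running concurrently, the likelier the destination QEMU's "-incoming tcp:[::]:PORT" bind is to fail exactly as in the log above.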

Diagnostic Snapshot: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-09-22_06-53-27.tar.xz

Tags: scale
affects: nova → mos
no longer affects: nova
Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

This requires more investigation, so I am changing the status to Incomplete. Please set it back to New when you have a clear problem statement and steps to reproduce.

Changed in mos:
status: New → Incomplete
assignee: nobody → Aleksandr Shaposhnikov (alashai8)
description: updated
Revision history for this message
Dina Belova (dbelova) wrote :

Vitaly, this bug was found during scale tests on 200 nodes (the NovaServers.boot_and_live_migrate_server Rally scenario). Steps to reproduce were added to the description.

>> Actually, this happens during Rally testing of a pretty big environment (~200 nodes), roughly once per 200 iterations, so the chances of hitting it at scale are quite high. It should be reproducible under the following circumstances:
1. A very high rate of live migrations.
2. A lot of running VMs and other services occupying a large number of TCP ports.

Both of these factors increase the chances of a collision in the QEMU migration port allocation procedure.

We have no other way to reproduce it right now. The problem is that the instance was reported as migrated, although in fact it was not :) That's the issue.

Changed in mos:
status: Incomplete → Confirmed
importance: Undecided → High
assignee: Aleksandr Shaposhnikov (alashai8) → nobody
Changed in mos:
assignee: nobody → MOS Nova (mos-nova)
milestone: none → 8.0
Revision history for this message
Timofey Durakov (tdurakov) wrote :

@dbelova, could you explain how it was determined that the instance had been migrated successfully? Does nova show {instance_uuid} show another host, or something else?

Changed in mos:
assignee: MOS Nova (mos-nova) → Dina Belova (dbelova)
Revision history for this message
Dina Belova (dbelova) wrote :

@tdurakov, the only way we could see it was the following traceback from Rally:

Traceback (most recent call last):
  File "/opt/stack/.venv/lib/python2.7/site-packages/rally/benchmark/runners/base.py", line 79, in _run_scenario_once
    method_name)(**kwargs) or scenario_output
  File "/opt/stack/.venv/lib/python2.7/site-packages/rally/benchmark/scenarios/nova/servers.py", line 440, in boot_and_live_migrate_server
    block_migration, disk_over_commit)
  File "/opt/stack/.venv/lib/python2.7/site-packages/rally/benchmark/scenarios/base.py", line 262, in func_atomic_actions
    f = func(self, *args, **kwargs)
  File "/opt/stack/.venv/lib/python2.7/site-packages/rally/benchmark/scenarios/nova/utils.py", line 647, in _live_migrate
    host_pre_migrate)
LiveMigrateException: Live Migration failed: Migration complete but instance did not change host: node-169.domain.tld
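
For context, the error above comes from Rally comparing the instance's host before and after the migration call. A minimal sketch of that check, assuming an already-authenticated python-novaclient handle nova and a server_id (both placeholders here), with the completion polling omitted:

# Rough sketch of the host-change check behind the Rally error above.
# Assumes nova is an authenticated python-novaclient client and server_id
# identifies the instance; both are placeholders.
server = nova.servers.get(server_id)
host_before = getattr(server, "OS-EXT-SRV-ATTR:host")

server.live_migrate(host=None, block_migration=False, disk_over_commit=False)

# ... poll until the server leaves the migrating state (omitted) ...

server = nova.servers.get(server_id)
host_after = getattr(server, "OS-EXT-SRV-ATTR:host")
if host_after == host_before:
    # This is the condition reported as
    # "Migration complete but instance did not change host".
    raise RuntimeError("instance did not change host: %s" % host_after)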

Changed in mos:
assignee: Dina Belova (dbelova) → Timofey Durakov (tdurakov)
Revision history for this message
Timofey Durakov (tdurakov) wrote :

@dbelova, ok, now I see. I need to check the snapshot first; thanks for the fast response.

Revision history for this message
Timofey Durakov (tdurakov) wrote :

Tried to reproduce with no results; moving the bug to Incomplete. If it reproduces again, please reopen.

Changed in mos:
status: Confirmed → Incomplete
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

No repro in a month, closing.

Changed in mos:
status: Incomplete → Invalid
Oleksii (ozhurba)
Changed in mos:
status: Invalid → Confirmed
Revision history for this message
Oleksii (ozhurba) wrote :

Reproduced on a customer environment. Error from Rally:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/rally/task/runner.py", line 66, in _run_scenario_once
    deprecated_output = getattr(scenario_inst, method_name)(**kwargs)
  File "/usr/local/lib/python2.7/dist-packages/rally/plugins/openstack/scenarios/nova/servers.py", line 619, in boot_server_from_volume_and_live_migrate
    block_migration, disk_over_commit)
  File "/usr/local/lib/python2.7/dist-packages/rally/task/atomic.py", line 84, in func_atomic_actions
    f = func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/rally/plugins/openstack/scenarios/nova/utils.py", line 733, in _live_migrate
    host_pre_migrate)
LiveMigrateException: Live Migration failed: Migration complete but instance did not change host: <host_name>

Environment: MOS 7.0 MU2, NFS shares for Cinder, Glance, and Nova.

Scenario:
  "concurrency": 10
  "times": 100
  "block_migration": false
  "flavor": {"name": "m1.tiny"}
  "force_delete": false
  "image": {"name": "TestVM"}
  "volume_size": 10
  "users_per_tenant": 2
  "tenants": 2

We got 7-10 failures per run of this test.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Oleksii, are you sure you are seeing the very same problem (QEMU fails to bind a socket when performing a live migration)? Could you please provide a diagnostic snapshot?

Changed in mos:
milestone: 8.0 → 8.0-updates
assignee: Timofey Durakov (tdurakov) → Oleksii (ozhurba)
status: Confirmed → Incomplete
Revision history for this message
Dina Belova (dbelova) wrote :

More than a month in the Incomplete state; moving to Invalid. Please move back to Confirmed if this is still reproducible, and please provide the information Roman requested.

Changed in mos:
status: Incomplete → Invalid
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

No feedback here, so I think this should also be closed as Invalid for 7.0. Feel free to re-open the bug if it is still reproducible.
