I noticed this here:
http://logs.openstack.org/74/657374/6/check/watcher-tempest-workload_balancing/3e34c3f/testr_results.html.gz
Live migration fails because the neutron port on the server is deleted while the server is being live migrated; it actually looks like the instance itself is being destroyed before the live migration is complete, which is what deletes the port.
http://logs.openstack.org/74/657374/6/check/watcher-tempest-workload_balancing/3e34c3f/compute1/logs/screen-n-cpu.txt.gz#_May_10_10_18_19_106717
May 10 10:18:19.106717 ubuntu-bionic-vexxhost-sjc1-0006079177 nova-compute[19758]: INFO nova.compute.manager [None req-0e809061-4c10-4e82-9c17-20ebf2ca3832 tempest-TestExecuteWorkloadBalancingStrategy-1723816204 tempest-TestExecuteWorkloadBalancingStrategy-1723816204] [instance: 27813a13-39dc-490d-86f3-c1877c07a010] Took 0.28 seconds to destroy the instance on the hypervisor.
http://logs.openstack.org/74/657374/6/check/watcher-tempest-workload_balancing/3e34c3f/compute1/logs/screen-n-cpu.txt.gz#_May_10_10_18_21_174978
May 10 10:18:21.174978 ubuntu-bionic-vexxhost-sjc1-0006079177 nova-compute[19758]: WARNING nova.virt.libvirt.driver [None req-4e88ad59-55d7-4cc2-9deb-532900c94ab6 None None] [instance: 27813a13-39dc-490d-86f3-c1877c07a010] Error monitoring migration: Failed to activate binding for port 67f1aaa7-8966-448b-9655-56e19a01fb62 and host ubuntu-bionic-vexxhost-sjc1-0006079176.: PortBindingActivationFailed: Failed to activate binding for port 67f1aaa7-8966-448b-9655-56e19a01fb62 and host ubuntu-bionic-vexxhost-sjc1-0006079176.
You can see the nova-api request to delete the server here:
http://logs.openstack.org/74/657374/6/check/watcher-tempest-workload_balancing/3e34c3f/controller/logs/screen-n-api.txt.gz#_May_10_10_18_18_867714
May 10 10:18:18.867714 ubuntu-bionic-vexxhost-sjc1-0006079176 <email address hidden>[6484]: INFO nova.api.openstack.requestlog [None req-0e809061-4c10-4e82-9c17-20ebf2ca3832 tempest-TestExecuteWorkloadBalancingStrategy-1723816204 tempest-TestExecuteWorkloadBalancingStrategy-1723816204] 38.108.68.233 "DELETE /compute/v2.1/servers/27813a13-39dc-490d-86f3-c1877c07a010" status: 204 len: 0 microversion: 2.1 time: 0.264338
The live migration of that server starts here:
http://logs.openstack.org/74/657374/6/check/watcher-tempest-workload_balancing/3e34c3f/controller/logs/screen-n-api.txt.gz#_May_10_10_17_58_008645
May 10 10:17:58.008645 ubuntu-bionic-vexxhost-sjc1-0006079176 <email address hidden>[6484]: DEBUG nova.compute.api [None req-908b9e12-08da-4f48-bbb2-f3b01c320d92 admin admin] [instance: 27813a13-39dc-490d-86f3-c1877c07a010] Going to try to live migrate instance to ubuntu-bionic-vexxhost-sjc1-0006079176 {{(pid=6485) live_migrate /opt/stack/nova/nova/compute/api.py:4540}}
So live migration starts at 10:17:58.008645, the server delete request is at 10:18:18.867714, the guest is deleted from the hypervisor by 10:18:19.106717, and live migration fails at 10:18:21.174978.
The test creates an audit template from the workload_stabilization strategy, creates an audit from the template, and then polls until the audit is finished - and it's the audit that would kick off the live migration, right? So I guess the race is in the workload_stabilization strategy not waiting for the live migration to complete?
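Something like this is what I mean by the test polling (a hand-written sketch - the audit client and its show_audit method are placeholders for whatever the watcher tempest plugin actually uses, only the shape of the loop matters):

from tempest.lib.common.utils import test_utils


def wait_for_audit_finished(audit_client, audit_uuid, timeout=300, interval=2):
    """Poll the audit until it reaches a terminal state (illustrative only)."""
    def _finished():
        _, audit = audit_client.show_audit(audit_uuid)
        return audit['state'] in ('SUCCEEDED', 'FAILED', 'CANCELLED')
    # call_until_true returns True if _finished returned True within 'timeout'
    return test_utils.call_until_true(_finished, timeout, interval)

call_until_true is the same tempest.lib helper that produces the "Call partial returns true" line in the job output quoted below.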
Looks like at this point the audit status is SUCCEEDED:
http://logs.openstack.org/74/657374/6/check/watcher-tempest-workload_balancing/3e34c3f/job-output.txt.gz#_2019-05-10_10_18_22_112059
2019-05-10 10:18:22.110668 | controller | 2019-05-10 10:18:09,489 12011 INFO [tempest.lib.common.rest_client] Request (TestExecuteWorkloadBalancingStrategy:test_execute_workload_stabilization): 200 GET http://38.108.68.233:9322/v1/audits/97456a4e-a115-45b9-98fe-92c5795fe443 0.134s
2019-05-10 10:18:22.110818 | controller | 2019-05-10 10:18:09,494 12011 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'Accept': 'application/json', 'X-Auth-Token': '<omitted>', 'Content-Type': 'application/json'}
2019-05-10 10:18:22.110854 | controller | Body: None
2019-05-10 10:18:22.111182 | controller | Response - Headers: {u'connection': 'close', u'server': 'Apache/2.4.29 (Ubuntu)', u'vary': 'OpenStack-API-Version', u'content-length': '1137', u'openstack-api-maximum-version': '1.1', u'openstack-api-version': 'infra-optim 1.0', u'openstack-api-minimum-version': '1.0', 'status': '200', u'date': 'Fri, 10 May 2019 10:18:09 GMT', u'content-type': 'application/json', 'content-location': 'http://38.108.68.233:9322/v1/audits/97456a4e-a115-45b9-98fe-92c5795fe443'}
2019-05-10 10:18:22.112059 | controller | Body: {"auto_trigger": false, "next_run_time": null, "audit_type": "ONESHOT", "interval": null, "uuid": "97456a4e-a115-45b9-98fe-92c5795fe443", "parameters": {"thresholds": {"cpu_util": 0.2}, "granularity": 300, "metrics": ["cpu_util"], "retry_count": 1, "weights": {"cpu_util_weight": 1.0}, "periods": {"node": 60000, "instance": 72000}, "host_choice": "retry", "aggregation_method": {"node": "mean", "instance": "mean"}, "instance_metrics": {"cpu_util": "compute.node.cpu.percent"}}, "goal_uuid": "39be02da-7999-45e0-a134-c2c6c6049b39", "created_at": "2019-05-10T10:18:05+00:00", "links": [{"href": "http://38.108.68.233:9322/v1/audits/97456a4e-a115-45b9-98fe-92c5795fe443", "rel": "self"}, {"href": "http://38.108.68.233:9322/audits/97456a4e-a115-45b9-98fe-92c5795fe443", "rel": "bookmark"}], "goal_name": "workload_balancing", "strategy_name": "workload_stabilization", "updated_at": "2019-05-10T10:18:08+00:00", "state": "SUCCEEDED", "strategy_uuid": "e93ded8c-2503-4629-819a-8360332ff929", "scope": [], "deleted_at": null, "hostname": "ubuntu-bionic-vexxhost-sjc1-0006079176", "name": "workload_stabilization-2019-05-10T10:18:04.810340"}
2019-05-10 10:18:22.112176 | controller | 2019-05-10 10:18:09,494 12011 DEBUG [tempest.lib.common.utils.test_utils] Call partial returns true in 4.405191 seconds
I'm confused because I don't see this in the watcher logs:
https://opendev.org/openstack/watcher/src/commit/64d841b3f2ed2a8844ec2d3db6b3c711ae00839e/watcher/common/nova_helper.py#L281
or when it's waiting for the live migration to complete:
https://opendev.org/openstack/watcher/src/commit/64d841b3f2ed2a8844ec2d3db6b3c711ae00839e/watcher/common/nova_helper.py#L308
https://opendev.org/openstack/watcher/src/commit/64d841b3f2ed2a8844ec2d3db6b3c711ae00839e/watcher/common/nova_helper.py#L330
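For illustration, the waiting I'd expect debug logging from is roughly this shape (my own sketch, not the actual nova_helper.py code - the helper name, timeout and poll interval are made up):

import time

from oslo_log import log as logging

LOG = logging.getLogger(__name__)


def wait_for_live_migration(nova, server_id, dest_hostname, timeout=300, interval=5):
    # Poll the server until it shows up ACTIVE on the destination host,
    # logging progress at DEBUG; OS-EXT-SRV-ATTR:host is admin-only.
    deadline = time.time() + timeout
    while time.time() < deadline:
        server = nova.servers.get(server_id)
        host = getattr(server, 'OS-EXT-SRV-ATTR:host', None)
        LOG.debug("Instance %s is %s on host %s", server_id, server.status, host)
        if server.status == 'ACTIVE' and host == dest_hostname:
            return True
        if server.status == 'ERROR':
            return False
        time.sleep(interval)
    return False

The point is that a loop like this would leave periodic DEBUG breadcrumbs while the migration is in flight.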
It looks like that's because of this configuration:
May 10 10:06:34.673237 ubuntu-bionic-vexxhost-sjc1-0006079176 watcher-applier[3306]: DEBUG watcher.common.service [-] default_log_levels = ['amqp=WARN', 'amqplib=WARN', 'qpid.messaging=INFO', 'oslo.messaging=INFO', 'sqlalchemy=WARN', 'keystoneclient=INFO', 'stevedore=INFO', 'eventlet.wsgi.server=WARN', 'iso8601=WARN', 'requests=WARN', 'neutronclient=WARN', 'glanceclient=WARN', 'watcher.openstack.common=WARN', 'apscheduler=WARN'] {{(pid=3306) log_opt_values /usr/local/lib/python2.7/dist-packages/oslo_config/cfg.py:2577}}
'watcher.openstack.common=WARN' means I won't see any debug logs from that nova_helper code.
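If we want those messages in a future run, the usual oslo.log knob is default_log_levels; a minimal sketch of bumping that namespace to DEBUG (the override itself is my suggestion, not something the job sets today):

from oslo_log import log as logging

# Drop any existing watcher.* entry from the stock defaults and re-add it at
# DEBUG so LOG.debug calls under that namespace actually get emitted.
log_levels = [level for level in logging.get_default_log_levels()
              if not level.startswith('watcher.')]
log_levels.append('watcher.openstack.common=DEBUG')
logging.set_defaults(default_log_levels=log_levels)

The same thing can be done without code by setting default_log_levels in the [DEFAULT] section of watcher.conf.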
Related fix proposed to branch: master
Review: https://review.opendev.org/658399