Stop deployment doesn't work with NovaFlat and Sahara

Bug #1467320 reported by Alexander Kurenyshev
This bug affects 1 person
Affects             Status   Importance  Assigned to        Milestone
Fuel for OpenStack  Invalid  High        Vladimir Sharshov
6.1.x               Invalid  High        Vladimir Sharshov
7.0.x               Invalid  High        Vladimir Sharshov

Bug Description

Steps to reproduce:
1. Create new environment
2. Choose nova-network, FlatDHCP Manager
3. Choose Sahara
4. Add 3 controllers
5. Add 1 compute
6. Add 1 cinder
7. Verify networks
8. Start deployment
9. Stop process during deployment

Expected result:
Deployment is stopped. Nodes return to "pending addition" and the Deploy Changes button is enabled.

Actual result:
Deployment is not stopped. The progress bar shows the status "stopping", while the top label says "Deployment of environment 'test' is done". See the attached screenshot.

Astute considers the deployment successful:
Casting message to Nailgun: {"method"=>"deploy_resp", "args"=>{"task_uuid"=>"031818e2-f558-4359-88b4-320ef512ed88", "status"=>"ready", "progress"=>100}}

Fuel used:
ISO #525 RC3

Alexander Kurenyshev (akurenyshev) wrote :
Changed in fuel:
importance: Undecided → High
status: New → Confirmed
Alexander Kislitsky (akislitsky) wrote :

The logs show that Astute did not receive the stop_deploy_task message.

Alexander Kislitsky (akislitsky) wrote :

RabbitMQ shows periodic connection problems:

STOMP detected missed client heartbeat(s) on connection 10.109.0.11:59775 -> 10.109.0.2:61613, closing it

Also, the naily_service queue holds 2 messages:

{u'respond_to': u'stop_deployment_resp', u'args': {u'engine': {u'url': u'http://10.109.0.2:80/cobbler_api', u'username': u'cobbler', u'password': u'******', u'master_ip': u'10.109.0.2'}, u'nodes': [{u'admin_ip': u'10.109.0.6', u'uid': u'6', u'roles': [u'compute'], u'slave_name': u'node-6'}, {u'admin_ip': u'10.109.0.5', u'uid': u'8', u'roles': [u'controller'], u'slave_name': u'node-8'}, {u'admin_ip': u'10.109.0.3', u'uid': u'4', u'roles': [u'cinder'], u'slave_name': u'node-4'}, {u'admin_ip': u'10.109.0.4', u'uid': u'2', u'roles': [u'controller'], u'slave_name': u'node-2'}, {u'admin_ip': u'10.109.0.7', u'uid': u'5', u'roles': [u'controller'], u'slave_name': u'node-5'}], u'stop_task_uuid': u'09444f2b-a981-48cb-9fe0-39a87b872b82', u'task_uuid': u'09a2e21a-07bb-4915-8d59-ce0aa5bc0bd9'}, u'method': u'stop_deploy_task', u'api_version': u'1.0'}
{u'respond_to': u'stop_deployment_resp', u'args': {u'engine': {u'url': u'http://10.109.0.2:80/cobbler_api', u'username': u'cobbler', u'password': u'******', u'master_ip': u'10.109.0.2'}, u'nodes': [{u'admin_ip': u'10.109.0.6', u'uid': u'6', u'roles': [u'compute'], u'slave_name': u'node-6'}, {u'admin_ip': u'10.109.0.5', u'uid': u'8', u'roles': [u'controller'], u'slave_name': u'node-8'}, {u'admin_ip': u'10.109.0.3', u'uid': u'4', u'roles': [u'cinder'], u'slave_name': u'node-4'}, {u'admin_ip': u'10.109.0.4', u'uid': u'2', u'roles': [u'controller'], u'slave_name': u'node-2'}, {u'admin_ip': u'10.109.0.7', u'uid': u'5', u'roles': [u'controller'], u'slave_name': u'node-5'}], u'stop_task_uuid': u'031818e2-f558-4359-88b4-320ef512ed88', u'task_uuid': u'09a2e21a-07bb-4915-8d59-ce0aa5bc0bd9'}, u'method': u'stop_deploy_task', u'api_version': u'1.0'}
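
The two dumps above are hard to compare by eye. As a rough aid, the sketch below parses such Python-repr messages with `ast.literal_eval` and pulls out the fields that identify each request. The strings in `RAW_MESSAGES` are abbreviated copies of the two messages (the `engine` and `nodes` entries are left out), and the helper name `summarize` is made up for this illustration; it is not part of Nailgun or Astute.

```python
import ast

# Abbreviated copies of the two queued messages: the 'engine' and 'nodes'
# entries are omitted; every value kept here is copied from the dump above.
RAW_MESSAGES = [
    "{u'respond_to': u'stop_deployment_resp', "
    "u'args': {u'stop_task_uuid': u'09444f2b-a981-48cb-9fe0-39a87b872b82', "
    "u'task_uuid': u'09a2e21a-07bb-4915-8d59-ce0aa5bc0bd9'}, "
    "u'method': u'stop_deploy_task', u'api_version': u'1.0'}",
    "{u'respond_to': u'stop_deployment_resp', "
    "u'args': {u'stop_task_uuid': u'031818e2-f558-4359-88b4-320ef512ed88', "
    "u'task_uuid': u'09a2e21a-07bb-4915-8d59-ce0aa5bc0bd9'}, "
    "u'method': u'stop_deploy_task', u'api_version': u'1.0'}",
]


def summarize(raw):
    """Parse one Python-repr message and keep only the identifying fields."""
    message = ast.literal_eval(raw)  # safe: evaluates literals only
    args = message["args"]
    return {
        "method": message["method"],
        "respond_to": message["respond_to"],
        "task_uuid": args["task_uuid"],
        "stop_task_uuid": args["stop_task_uuid"],
    }


summaries = [summarize(raw) for raw in RAW_MESSAGES]
deploy_task_uuids = {s["task_uuid"] for s in summaries}
stop_task_uuids = {s["stop_task_uuid"] for s in summaries}

if __name__ == "__main__":
    for summary in summaries:
        print(summary)
    print("distinct deployment tasks:", len(deploy_task_uuids))
    print("distinct stop tasks:", len(stop_task_uuids))
```

Both messages name the same task_uuid but carry different stop_task_uuid values, so the stop request appears to have been queued twice for the same deployment. The second stop_task_uuid (031818e2-f558-4359-88b4-320ef512ed88) is also the task_uuid in the deploy_resp message quoted in the bug description.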

Evgeniy L (rustyrobot) wrote :

It was really hard to figure out what happened.
We found that the environment was in "operational" status, which could happen if:
1. stop was run via the CLI after the deployment had finished; this can be fixed in Nailgun by filtering for "running" tasks [1]
2. stop was run almost at the end of the deployment; in this case we have no simple technical solution
3. the message got lost due to a temporary Astute <-> Nailgun connectivity problem; we were not able to find evidence for that

[1] https://github.com/stackforge/fuel-web/blob/master/nailgun/nailgun/task/manager.py#L579-L587

For all of the cases above we have a very simple **workaround**:
Delete the hung stop task with the CLI command "fuel task --delete --task 34 --force", and continue working with the environment.
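
The fix suggested for case 1 (only accept a stop request while a deployment task is actually "running") could look roughly like the sketch below. This is not Nailgun's code: `Task`, `NoRunningDeployment`, `find_running_deployment`, `request_stop` and the task names "deployment" and "stop_deployment" are all hypothetical and only illustrate the idea of filtering on task status.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical stand-in for a task record; not the Nailgun model."""
    name: str
    status: str  # assumed values: "running", "ready", "error"


class NoRunningDeployment(Exception):
    """Raised when a stop is requested but nothing is being deployed."""


def find_running_deployment(tasks):
    """Return the first deployment task that is still running, or None."""
    for task in tasks:
        if task.name == "deployment" and task.status == "running":
            return task
    return None


def request_stop(tasks):
    """Create a stop task only if a deployment is still in progress."""
    if find_running_deployment(tasks) is None:
        # Refusing here avoids a stop task that can never complete.
        raise NoRunningDeployment("no running deployment task to stop")
    stop_task = Task(name="stop_deployment", status="running")
    tasks.append(stop_task)
    return stop_task


if __name__ == "__main__":
    finished = [Task(name="deployment", status="ready")]
    try:
        request_stop(finished)
    except NoRunningDeployment as exc:
        print("stop rejected:", exc)
```

With a check of this kind, a stop issued after the deployment has finished is rejected up front instead of leaving a stop task stuck in "stopping". It does not help with cases 2 and 3.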

Alexander Kurenyshev (akurenyshev) wrote :

To clarify: the deployment was stopped via the Fuel UI, and not at the final stage. It was at the very start of the deployment on the controller node.

tags: added: module-astute
tags: added: feature-stop-deployment
Vladimir Sharshov (vsharshov) wrote :

For some reason Nailgun sent the message, but Astute never received it from the service exchange in RabbitMQ, although both work correctly with the naily queue (incoming tasks) and the nailgun exchange (outgoing task statuses).

Missed server heartbeats were also detected in the Astute log.

The naily_service queue is not used by Astute. More relevant are the temporary worker queues, which must be bound to the naily_service exchange. It looks like the messages were lost during the connection problem.
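
To make that hypothesis concrete, here is a toy in-memory model of an exchange that copies each message to whatever queues are bound at publish time. It does not talk to RabbitMQ, and it assumes that the worker queues are temporary and disappear when the connection closes; the class `ToyExchange` and the queue names `worker-tmp-1` and `worker-tmp-2` are invented for the illustration.

```python
class ToyExchange:
    """Copies each published message to every queue bound at publish time."""

    def __init__(self):
        self.queues = {}

    def bind(self, queue_name):
        # setdefault keeps messages already stored in an existing queue.
        return self.queues.setdefault(queue_name, [])

    def drop_queue(self, queue_name):
        # Models a temporary queue vanishing together with its connection.
        self.queues.pop(queue_name, None)

    def publish(self, message):
        for queue in self.queues.values():
            queue.append(message)
        return len(self.queues)


def simulate():
    service = ToyExchange()
    durable = service.bind("naily_service")  # stays bound, nobody consumes it
    service.bind("worker-tmp-1")             # hypothetical temporary worker queue

    # The connection drops, taking the temporary queue with it.
    service.drop_queue("worker-tmp-1")

    # The stop request is published while no worker queue is bound.
    service.publish({"method": "stop_deploy_task"})

    # The worker reconnects and gets a fresh, empty temporary queue.
    fresh = service.bind("worker-tmp-2")
    return {"naily_service": list(durable), "worker": list(fresh)}


if __name__ == "__main__":
    print(simulate())
```

In this model the stop request ends up only in naily_service, which matches the messages found sitting in that queue, while the reconnected worker never sees it. Whether the real queues were declared this way was not verified in this report.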

Vladimir Sharshov (vsharshov) wrote :

Could not reproduce. I also asked Alexander Kurenyshev to try; he could not reproduce it either.

Marking as Incomplete.

Vladimir Sharshov (vsharshov) wrote :

One month has passed with no updates or recurrences. Marking as Invalid.
