nova-compute is down after failure during init_host()

Bug #1529810 reported by Roman Podoliaka
Affects             Status        Importance  Assigned to       Milestone
Mirantis OpenStack  Fix Released  High        Roman Podoliaka
8.0.x               Fix Released  High        Denis Meltsaykin
9.x                 Fix Released  High        Roman Podoliaka

Bug Description

After completing the steps to reproduce, the nova-compute process stops reporting its state to nova-conductor and is marked as down in the nova service-list output.

root@node-4:/var/log# nova --version
2.30.2
root@node-4:/var/log# dpkg -l | grep nova
ii nova-common 2:12.0.0-1~u14.04+mos19 all OpenStack Compute - common files
ii nova-compute 2:12.0.0-1~u14.04+mos19 all OpenStack Compute - compute node
ii nova-compute-qemu 2:12.0.0-1~u14.04+mos19 all OpenStack Compute - compute node (QEmu)
ii python-nova 2:12.0.0-1~u14.04+mos19 all OpenStack Compute - libraries
ii python-novaclient 2:2.30.2-1~u14.04+mos3 all client library for OpenStack Compute API
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "361"
  build_id: "361"
  fuel-nailgun_sha: "53c72a9600158bea873eec2af1322a716e079ea0"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "7463551bc74841d1049869aaee777634fb0e5149"
  fuel-nailgun-agent_sha: "92ebd5ade6fab60897761bfa084aefc320bff246"
  astute_sha: "c7ca63a49216744e0bfdfff5cb527556aad2e2a5"
  fuel-library_sha: "ba8063d34ff6419bddf2a82b1de1f37108d96082"
  fuel-ostf_sha: "889ddb0f1a4fa5f839fd4ea0c0017a3c181aa0c1"
  fuel-mirror_sha: "8adb10618bb72bb36bb018386d329b494b036573"
  fuelmenu_sha: "824f6d3ebdc10daf2f7195c82a8ca66da5abee99"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "9f0ba4577915ce1e77f5dc9c639a5ef66ca45896"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "07d5f1c3e1b352cb713852a3a96022ddb8fe2676"
[root@nailgun ~]#

Steps to reproduce, first case (Neutron VLAN):
1. Create and deploy the following cluster: Neutron VLAN, cinder/swift, 3 controller, 2 compute, 1 cinder node
2. Run OSTF
3. Verify networks
4. Fill cinder storage up to 30%
5. Shut down all nodes as described in this manual - https://docs.mirantis.com/openstack/fuel/fuel-6.1/operations.html#howto-shut-down-the-whole-cluster
6. Wait 5 minutes
7. Start the cluster as described in this manual - https://docs.mirantis.com/openstack/fuel/fuel-6.1/operations.html#howto-shut-down-the-whole-cluster
8. Wait until the OSTF 'HA' suite passes
9. Run OSTF

Second case: the same as the first, but Neutron VXLAN is used instead of VLAN.
Actual result:
nova-manage service list shows nova-compute as XXX: http://paste.openstack.org/show/482783/

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

I tentatively marked this as Medium. If it is reproduced again, please feel free to raise it to High.

affects: fuel → mos
Changed in mos:
milestone: 9.0 → none
milestone: none → 9.0
description: updated
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

No longer fixing Medium bugs in 8.0

Changed in mos:
status: Confirmed → Won't Fix
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

New duplicate found - raising to High - https://bugs.launchpad.net/fuel/+bug/1530858

Changed in mos:
status: Won't Fix → Confirmed
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :
description: updated
summary: - nova-compute is stuck polling libvirt connection
+ nova-compute is down after failure during init_host()
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

This looks interesting. According to the stack traces of green/native threads attached above, nova-compute is in a weird state: the eventlet main loop is started and fully functional, and nova-compute is connected to rabbitmq/libvirt (libvirt keepalives go in both directions periodically), while at the same time it looks like the nova-compute RPC server has not started properly, so it neither waits for new messages nor executes its own periodic tasks.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Based on the implementation (https://github.com/openstack/oslo.service/blob/master/oslo_service/service.py), we start a service in one green thread and then wait for the service's correct termination (the self.done condition variable) from another green thread. So if the service thread fails, we will never know (this is what we see in #7).
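
To make the failure mode concrete, here is a minimal hand-rolled sketch of that pattern (not the actual oslo.service code; the Launcher and BrokenService classes below are invented for illustration): start() runs in its own green thread, while the main green thread only waits on a "done" event, so an exception raised in start() never reaches the waiter and the process simply keeps running.

import eventlet
from eventlet import event


class Launcher(object):
    def __init__(self, service):
        self.service = service
        self.done = event.Event()

    def launch(self):
        # The service is started in a separate green thread ...
        eventlet.spawn(self.service.start)

    def wait(self):
        # ... while the main green thread only waits for "done", which is
        # only ever sent on a clean stop(), never on a failed start().
        self.done.wait()


class BrokenService(object):
    def start(self):
        # Stands in for a service whose initialisation (e.g. init_host()) fails.
        raise RuntimeError("start() failed")


if __name__ == "__main__":
    launcher = Launcher(BrokenService())
    launcher.launch()
    launcher.wait()  # blocks forever: the failure above never unblocks this wait()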

Revision history for this message
Marian Horban (mhorban) wrote :

Do you mean the case when we start many green threads inside the service.start() procedure and then wait on self.done, and during that some of the green threads crash?

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Marian, service.start() itself is run in a green thread, and if it fails with an exception, the service will still be considered alive from the ServiceLauncher's point of view, so the launcher.wait() call will block, but nova-compute will not be able to do anything useful.

IMO, we need to fail early (exit with some error code) if service.start() fails. The problem is that it is currently fine for the start() call to finish successfully in its thread - its task is to set up an RPC server and perform other service initialisation steps - so this thread does not represent a running service and exits early on success.

Currently, I don't see a clean way to fail early with the existing oslo.service API. What I was able to achieve is to call service.stop() immediately after the service.start() call fails, which allows us to exit in service.wait(), although we still exit with a success code (i.e. 0), not a failure one.
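
A sketch of that partial workaround, again using an invented Launcher rather than the real oslo.service classes (here stop() lives on the launcher for brevity, whereas the workaround described above calls service.stop()): if start() raises, stop() is called right away so that wait() can return, but the process still finishes with exit status 0.

import sys

import eventlet
from eventlet import event


class Launcher(object):
    def __init__(self, service):
        self.service = service
        self.done = event.Event()

    def launch(self):
        def _run_service():
            try:
                self.service.start()
            except Exception:
                # Workaround: unblock wait() instead of hanging forever.
                self.stop()
        eventlet.spawn(_run_service)

    def stop(self):
        if not self.done.ready():
            self.done.send()

    def wait(self):
        self.done.wait()


class BrokenService(object):
    def start(self):
        raise RuntimeError("start() failed")


if __name__ == "__main__":
    launcher = Launcher(BrokenService())
    launcher.launch()
    launcher.wait()  # returns now instead of blocking ...
    sys.exit(0)      # ... but the exit status is still "success", not an error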

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/oslo.service (openstack-ci/fuel-8.0/liberty)

Fix proposed to branch: openstack-ci/fuel-8.0/liberty
Change author: Marian Horban <email address hidden>
Review: https://review.fuel-infra.org/16309

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

So I've uploaded a change for review which should fix this, but IMO it's too risky to introduce such a change into oslo.service this late in the release cycle: basically, we change how every single service starts and the error code it exits with. We need to test this thoroughly.

This is how OpenStack services have worked for ages. The last thing I want to do is make a quick fix and break even more things. The workaround is easy - just restart the affected service. The situation itself must be rare and only reproducible in CI / slow environments.

Tentatively moving this to 8.0-updates.

tags: added: release-notes
tags: added: move-to-mu
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on openstack/oslo.service (openstack-ci/fuel-8.0/liberty)

Change abandoned by Roman Podoliaka <email address hidden> on branch: openstack-ci/fuel-8.0/liberty
Review: https://review.fuel-infra.org/16309

tags: added: 8.0 release-notes-done
removed: release-notes
Anna Babich (ababich)
tags: added: on-verification
Revision history for this message
Anna Babich (ababich) wrote :

Verified on (3 controllers+2 computes, vlan, cinder lvm):

[root@nailgun ~]# shotgun2 short-report
cat /etc/fuel_build_id:
 154
cat /etc/fuel_build_number:
 154
cat /etc/fuel_release:
 9.0
cat /etc/fuel_openstack_version:
 mitaka-9.0
rpm -qa | egrep 'fuel|astute|network-checker|nailgun|packetary|shotgun':
 fuel-misc-9.0.0-1.mos8259.noarch
 python-fuelclient-9.0.0-1.mos301.noarch
 fuel-9.0.0-1.mos6325.noarch
 fuel-provisioning-scripts-9.0.0-1.mos8611.noarch
 fuelmenu-9.0.0-1.mos263.noarch
 fuel-ui-9.0.0-1.mos2641.noarch
 fuel-migrate-9.0.0-1.mos8259.noarch
 fuel-ostf-9.0.0-1.mos921.noarch
 shotgun-9.0.0-1.mos87.noarch
 python-packetary-9.0.0-1.mos130.noarch
 fuel-bootstrap-cli-9.0.0-1.mos272.noarch
 fuel-nailgun-9.0.0-1.mos8611.noarch
 fuel-setup-9.0.0-1.mos6325.noarch
 rubygem-astute-9.0.0-1.mos733.noarch
 fuel-mirror-9.0.0-1.mos130.noarch
 fuel-openstack-metadata-9.0.0-1.mos8611.noarch
 fuel-notify-9.0.0-1.mos8259.noarch
 fuel-release-9.0.0-1.mos6325.noarch
 nailgun-mcagents-9.0.0-1.mos733.noarch
 network-checker-9.0.0-1.mos72.x86_64
 fuel-utils-9.0.0-1.mos8259.noarch
 fuel-library9.0-9.0.0-1.mos8259.noarch
 fuel-agent-9.0.0-1.mos272.noarch
[root@nailgun ~]#

The verification was executed in accordance with the steps to reproduce, with a raise added to nova-compute's init_host() to emulate a crash.
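
As a hypothetical illustration of how such a crash can be emulated (the class below is only a stand-in for nova.compute.manager.ComputeManager, which is what the actual verification patched):

class ComputeManager(object):
    # Stand-in for nova.compute.manager.ComputeManager, shown only to
    # illustrate the idea of the verification patch.
    def init_host(self):
        # An unconditional raise at the top of init_host() makes nova-compute
        # fail during start-up, reproducing the condition this bug describes.
        raise RuntimeError("emulated init_host() failure")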

tags: removed: on-verification
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

This is effectively fixed in https://review.fuel-infra.org/#/c/20853/ ; otherwise it would require backporting a backwards-incompatible change to oslo.service (https://review.openstack.org/#/c/265163/).

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

I suggest we proceed with the upstart dependency fix in 8.0-updates ^

Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :
tags: added: on-veirification
tags: removed: on-veirification
tags: added: on-verification
Revision history for this message
TatyanaGladysheva (tgladysheva) wrote :

Verified on MOS 8.0 + MU2 updates.

tags: removed: on-verification