Host is not mapped to any cell when booting instance

Bug #1835002 reported by Radosław Piliszek
This bug affects 1 person
Affects          Status         Importance  Assigned to         Milestone
kolla-ansible    Fix Released   High        Mark Goddard
  Pike           New            High        Unassigned
  Queens         Fix Committed  High        Mark Goddard
  Rocky          Fix Committed  High        Mark Goddard
  Stein          Fix Released   High        Radosław Piliszek
  Train          Fix Released   High        Mark Goddard

Bug Description

Error when booting an instance: Host 'xyz' is not mapped to any cell

Example log with failure: http://logs.openstack.org/63/667363/11/check/kolla-ansible-centos-source-upgrade-ceph-1/640d468/primary/logs/ansible/test-openstack

Affected branches: all supported (incl. master)

Problem analysis:

The nova deployment does not wait for all compute services to be up before performing cells v2 host discovery.
See: https://opendev.org/openstack/kolla-ansible/src/tag/8.0.0.0rc1/ansible/roles/nova/tasks/discover_computes.yml#L22
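
To make the race concrete, here is a toy Python model of it. It is hypothetical: `ToyCloud`, `discover_hosts` and `deploy_waiting_for_one` are illustrative names, not nova or kolla-ansible APIs. It assumes that discovery maps only the hosts registered at the moment it runs and that no periodic discovery runs afterwards, which is the situation described above.

```python
"""Toy model of the cells v2 host discovery race (hypothetical, illustration only)."""
from dataclasses import dataclass, field


@dataclass
class ToyCloud:
    # Hosts whose nova-compute service has registered so far.
    registered: set = field(default_factory=set)
    # Hosts that host discovery has mapped into a cell.
    mapped: set = field(default_factory=set)

    def register(self, host: str) -> None:
        self.registered.add(host)

    def discover_hosts(self) -> None:
        # Discovery only maps the hosts that are registered when it runs.
        self.mapped |= self.registered

    def schedule(self, host: str) -> str:
        # Booting on an unmapped host fails with the error from this bug.
        if host not in self.mapped:
            raise RuntimeError(f"Host '{host}' is not mapped to any cell")
        return host


def deploy_waiting_for_one(hosts):
    """Old behaviour: run discovery as soon as one compute has registered."""
    cloud = ToyCloud()
    cloud.register(hosts[0])
    cloud.discover_hosts()
    for host in hosts[1:]:
        # Late registrations; with periodic discovery disabled nothing maps them.
        cloud.register(host)
    return cloud
```

With two hosts, `deploy_waiting_for_one(["compute1", "compute2"]).schedule("compute2")` raises `RuntimeError: Host 'compute2' is not mapped to any cell`, while scheduling on `compute1` succeeds.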

The problem has been exacerbated by merging the fix for a nova race condition:
https://bugs.launchpad.net/kolla-ansible/+bug/1832987
https://review.opendev.org/665554

There are two proposed fix approaches (both WIP):
(yoctozepto) https://review.opendev.org/668553 (abandoned)
(mgoddard) https://review.opendev.org/668623

Gerrit topic: bug/1835002
https://review.opendev.org/#/q/topic:bug/1835002

We should take care of ironic (bare metal) compute services too.

Changed in kolla-ansible:
status: New → In Progress
status: In Progress → Confirmed
status: Confirmed → In Progress
description: updated
Mark Goddard (mgoddard)
Changed in kolla-ansible:
importance: Undecided → High
Changed in kolla-ansible:
assignee: nobody → Mark Goddard (mgoddard)
description: updated
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (master)

Reviewed: https://review.opendev.org/668623
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=c38dd767118164aae37613d557ed23691813f617
Submitter: Zuul
Branch: master

commit c38dd767118164aae37613d557ed23691813f617
Author: Mark Goddard <email address hidden>
Date: Tue Jul 2 08:30:02 2019 +0100

    Wait for all compute services before cell discovery

    There is a race condition during nova deploy since we wait for at least
    one compute service to register itself before performing cells v2 host
    discovery. It's quite possible that other compute nodes will not yet
    have registered and will therefore not be discovered. This leaves them
    not mapped into a cell, and results in the following error if the
    scheduler picks one when booting an instance:

    Host 'xyz' is not mapped to any cell

    The problem has been exacerbated by merging a fix [1][2] for a nova race
    condition, which disabled the dynamic periodic discovery mechanism in
    the nova scheduler.

    This change fixes the issue by waiting for all expected compute services
    to register themselves before performing host discovery. This includes
    both virtualised compute services and bare metal compute services.

    [1] https://bugs.launchpad.net/kolla-ansible/+bug/1832987
    [2] https://review.opendev.org/665554

    Change-Id: I2915e2610e5c0b8d67412e7ec77f7575b8fe9921
    Closes-Bug: #1835002
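
The approach described in the commit message above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the kolla-ansible change itself (which is implemented as Ansible tasks): `wait_then_discover` is a hypothetical helper, `list_registered` stands in for querying the registered `nova-compute` services (for example with `openstack compute service list --service nova-compute`), and `discover` stands in for running `nova-manage cell_v2 discover_hosts`. The expected host set would contain both the virtualised and the bare metal compute hosts.

```python
"""Sketch: wait for all expected compute services, then run host discovery once.

Hypothetical helper for illustration; not the kolla-ansible implementation.
"""
import time
from typing import Callable, Iterable, Set


def wait_then_discover(
    expected: Iterable[str],
    list_registered: Callable[[], Set[str]],
    discover: Callable[[], None],
    retries: int = 20,
    delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll until every expected host has registered, then call discover()."""
    expected = set(expected)
    missing = expected
    for attempt in range(retries):
        # Hosts we expect that have not yet registered a compute service.
        missing = expected - list_registered()
        if not missing:
            discover()
            return
        if attempt < retries - 1:
            sleep(delay)
    # Fail loudly instead of discovering a partial set of hosts.
    raise TimeoutError(
        f"compute services not registered: {', '.join(sorted(missing))}"
    )
```

Injecting `sleep` keeps the helper testable without real delays; the default retry count and delay are arbitrary placeholders, not values taken from kolla-ansible.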

Changed in kolla-ansible:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/669321

OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/stein)

Reviewed: https://review.opendev.org/669321
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=a50549719c970cd3b9ff6324de2ef6a92dfdcb3d
Submitter: Zuul
Branch: stable/stein

commit a50549719c970cd3b9ff6324de2ef6a92dfdcb3d
Author: Mark Goddard <email address hidden>
Date: Tue Jul 2 08:30:02 2019 +0100

    Wait for all compute services before cell discovery

    There is a race condition during nova deploy since we wait for at least
    one compute service to register itself before performing cells v2 host
    discovery. It's quite possible that other compute nodes will not yet
    have registered and will therefore not be discovered. This leaves them
    not mapped into a cell, and results in the following error if the
    scheduler picks one when booting an instance:

    Host 'xyz' is not mapped to any cell

    The problem has been exacerbated by merging a fix [1][2] for a nova race
    condition, which disabled the dynamic periodic discovery mechanism in
    the nova scheduler.

    This change fixes the issue by waiting for all expected compute services
    to register themselves before performing host discovery. This includes
    both virtualised compute services and bare metal compute services.

    [1] https://bugs.launchpad.net/kolla-ansible/+bug/1832987
    [2] https://review.opendev.org/665554

    Change-Id: I2915e2610e5c0b8d67412e7ec77f7575b8fe9921
    Closes-Bug: #1835002
    (cherry picked from commit c38dd767118164aae37613d557ed23691813f617)

OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/669698

OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/669700

OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/rocky)

Reviewed: https://review.opendev.org/669698
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=1899feac9ba9c21c6cc76cf0609a4223d1210c55
Submitter: Zuul
Branch: stable/rocky

commit 1899feac9ba9c21c6cc76cf0609a4223d1210c55
Author: Mark Goddard <email address hidden>
Date: Tue Jul 2 08:30:02 2019 +0100

    Wait for all compute services before cell discovery

    There is a race condition during nova deploy since we wait for at least
    one compute service to register itself before performing cells v2 host
    discovery. It's quite possible that other compute nodes will not yet
    have registered and will therefore not be discovered. This leaves them
    not mapped into a cell, and results in the following error if the
    scheduler picks one when booting an instance:

    Host 'xyz' is not mapped to any cell

    The problem has been exacerbated by merging a fix [1][2] for a nova race
    condition, which disabled the dynamic periodic discovery mechanism in
    the nova scheduler.

    This change fixes the issue by waiting for all expected compute services
    to register themselves before performing host discovery. This includes
    both virtualised compute services and bare metal compute services.

    This patch also includes change I58f8fd0a6e82cb614e02fef6e5b271af1d1ce9af
    which was made to fix an issue with the original version of this patch
    running on Ansible < 2.8. See bug 1835817 for details.

    Change-Id: I2915e2610e5c0b8d67412e7ec77f7575b8fe9921
    Closes-Bug: #1835002
    (cherry picked from commit c38dd767118164aae37613d557ed23691813f617)

OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/queens)

Reviewed: https://review.opendev.org/669700
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=aa442450005ad3f0600369b65296ddff364e44a1
Submitter: Zuul
Branch: stable/queens

commit aa442450005ad3f0600369b65296ddff364e44a1
Author: Mark Goddard <email address hidden>
Date: Tue Jul 2 08:30:02 2019 +0100

    Wait for all compute services before cell discovery

    There is a race condition during nova deploy since we wait for at least
    one compute service to register itself before performing cells v2 host
    discovery. It's quite possible that other compute nodes will not yet
    have registered and will therefore not be discovered. This leaves them
    not mapped into a cell, and results in the following error if the
    scheduler picks one when booting an instance:

    Host 'xyz' is not mapped to any cell

    The problem has been exacerbated by merging a fix [1][2] for a nova race
    condition, which disabled the dynamic periodic discovery mechanism in
    the nova scheduler.

    This change fixes the issue by waiting for all expected compute services
    to register themselves before performing host discovery. This includes
    both virtualised compute services and bare metal compute services.

    This patch also includes change I58f8fd0a6e82cb614e02fef6e5b271af1d1ce9af
    which was made to fix an issue with the original version of this patch
    running on Ansible < 2.8. See bug 1835817 for details.

    Change-Id: I2915e2610e5c0b8d67412e7ec77f7575b8fe9921
    Closes-Bug: #1835002
    (cherry picked from commit c38dd767118164aae37613d557ed23691813f617)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 8.0.0.0rc2

This issue was fixed in the openstack/kolla-ansible 8.0.0.0rc2 release candidate.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 6.2.2

This issue was fixed in the openstack/kolla-ansible 6.2.2 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 7.1.2

This issue was fixed in the openstack/kolla-ansible 7.1.2 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 9.0.0.0rc1

This issue was fixed in the openstack/kolla-ansible 9.0.0.0rc1 release candidate.

Mark Goddard (mgoddard)
Changed in kolla-ansible:
milestone: 9.0.0 → none
Changed in kolla-ansible:
status: Fix Committed → Fix Released