SSH pubkeys handled per relation-id only

Bug #1468871 reported by Peter Sabaini
This bug affects 13 people
Affects / Status / Importance / Assigned to / Milestone:
  OpenStack Nova Cloud Controller Charm: Fix Released / Wishlist / Unassigned
  OpenStack Nova Compute Charm: Invalid / Undecided / Unassigned
  nova-cloud-controller (Juju Charms Collection): Invalid / High / Unassigned

Bug Description

I've got two named services, "compute-only" with N units and "nova-compute" with M units, both instances of the nova-compute charm. They're separated to support different configurations on the units.

Both are related to the nova-cloud-controller service via the "cloud-compute" relation. But since the services are distinct, the relations are as well, i.e. I get two relation ids, one for each named service.

This means that ssh pubkeys are handled separately for each named service (cf. hooks.nova_cc_hooks.compute_changed()) and as a consequence pubkeys and hostkeys are distributed separately for each named service.

This is undesirable. When deployed, the nova-compute units are all part of the same Nova service; however, depending on luck, ssh connections between units may or may not work for resize or migration operations.

Tested with lp:charms/trusty/nova-cloud-controller;revno=163
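
For illustration only, a minimal, self-contained Python sketch of the per-relation behaviour described above; the data structures and function name are hypothetical, and the real charm reads relation data via Juju hook tools (cf. hooks.nova_cc_hooks.compute_changed()) rather than plain dicts.

    # Two distinct applications of the same nova-compute charm produce two
    # separate relation ids on the cloud-compute relation.
    relations = {
        "cloud-compute:3": {"nova-compute/0": "ssh-rsa AAA...",
                            "nova-compute/1": "ssh-rsa BBB..."},
        "cloud-compute:7": {"compute-only/0": "ssh-rsa CCC...",
                            "compute-only/1": "ssh-rsa DDD..."},
    }

    def keys_per_relation(relations):
        """Collect pubkeys separately per relation id (the reported behaviour)."""
        out = {}
        for rid, units in relations.items():
            # Only keys from units on *this* relation id are distributed back to
            # it, so "nova-compute" units never learn the "compute-only" keys and
            # cross-application migration/resize over ssh can fail.
            out[rid] = sorted(units.values())
        return out

    print(keys_per_relation(relations))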

James Page (james-page)
Changed in nova-cloud-controller (Juju Charms Collection):
status: New → Triaged
importance: Undecided → Medium
milestone: none → 15.10
Changed in nova-cloud-controller (Juju Charms Collection):
milestone: 15.10 → 16.04
James Page (james-page)
Changed in nova-cloud-controller (Juju Charms Collection):
milestone: 16.04 → 16.07
tags: added: openstack
Revision history for this message
Peter Sabaini (peter-sabaini) wrote :

I hit this again today while migrating instances. A functioning ssh pubkey setup is typically required for migrating instances in our clouds.

Revision history for this message
Peter Sabaini (peter-sabaini) wrote :

Note that this is still an issue on Mitaka

Changed in nova-cloud-controller (Juju Charms Collection):
assignee: nobody → Jorge Niedbalski (niedbalski)
Revision history for this message
JuanJo Ciarlante (jjo) wrote :

FYI: worked around this with http://paste.ubuntu.com/21277008/
(covers both the ssh host keys and the root/nova user ssh pubkeys)

Liam Young (gnuoy)
Changed in nova-cloud-controller (Juju Charms Collection):
milestone: 16.07 → 16.10
Revision history for this message
Edward Hope-Morley (hopem) wrote :

While you could achieve what you want with a single nova-compute service and finer-grained placement rules, I think this is still a valid request. There are valid reasons to use separate services for different compute AZs (e.g. if you wanted each AZ to have its own set of Ceph credentials), and SSH keys for all compute units will need to be shared across services for cross-AZ migration to be possible.

Changed in nova-cloud-controller (Juju Charms Collection):
importance: Medium → High
tags: added: sts
James Page (james-page)
Changed in nova-cloud-controller (Juju Charms Collection):
milestone: 16.10 → 17.01
James Page (james-page)
Changed in charm-nova-cloud-controller:
assignee: nobody → Jorge Niedbalski (niedbalski)
importance: Undecided → High
status: New → Triaged
Changed in nova-cloud-controller (Juju Charms Collection):
status: Triaged → Invalid
Changed in charm-nova-cloud-controller:
assignee: Jorge Niedbalski (niedbalski) → nobody
Changed in nova-cloud-controller (Juju Charms Collection):
assignee: Jorge Niedbalski (niedbalski) → nobody
Liam Young (gnuoy)
Changed in charm-nova-cloud-controller:
assignee: nobody → Liam Young (gnuoy)
Liam Young (gnuoy)
Changed in charm-nova-cloud-controller:
status: Triaged → In Progress
Revision history for this message
Alvaro Uria (aluria) wrote :

A customer running xenial+mitaka on the 17.02 charm release has hit this issue.

Are there plans to support this feature in the near future?

Thank you.

Changed in charm-nova-cloud-controller:
milestone: none → 18.02
Ryan Beisner (1chb1n)
Changed in charm-nova-cloud-controller:
milestone: 18.02 → 18.05
David Ames (thedac)
Changed in charm-nova-cloud-controller:
milestone: 18.05 → 18.08
James Page (james-page)
Changed in charm-nova-cloud-controller:
milestone: 18.08 → 18.11
David Ames (thedac)
Changed in charm-nova-cloud-controller:
milestone: 18.11 → 19.04
Revision history for this message
Xav Paice (xavpaice) wrote :

Added https://bugs.launchpad.net/charm-nova-compute/+bug/1822027 as duplicate. This is preventing workloads from being migrated from an over-capacity AZ to a less used one, and therefore affecting the customer's ability to spawn new workloads without either manually adding the ssh keys, or re-building the workload.

Revision history for this message
Xav Paice (xavpaice) wrote :

Subscribed field-high

Liam Young (gnuoy)
Changed in charm-nova-cloud-controller:
assignee: Liam Young (gnuoy) → nobody
Revision history for this message
Chris Sanders (chris.sanders) wrote :

After discussing with the team today, I wanted to clarify. The urgency around this has only been seen on Queens; today we don't have any specific examples where we need this support prior to Queens, so a solution for Queens+ would be sufficient.

David Ames (thedac)
Changed in charm-nova-cloud-controller:
milestone: 19.04 → 19.07
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

I'm marking this as invalid for charm-nova-compute, as the work needs to be completed in nova-cloud-controller.

Changed in charm-nova-compute:
status: New → Invalid
Frode Nordahl (fnordahl)
Changed in charm-nova-cloud-controller:
status: In Progress → Triaged
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

The groundwork for this has been done in the ssh migration rework during the 19.07 cycle. The main issue to watch for is ensuring that, during upgrade, the charm 'flattens' the ssh keys across all relations. One way would be to delete all the existing knownhosts and start again, but that might take a long time on a large cloud ... which is probably the target for this bug.

With the 19.07 charms the knownhosts are cached (though caching may be disabled). I think an appropriate upgrade strategy would be to switch on caching in the existing charm, then upgrade to a version that includes this bug/feature, to minimise the upgrade time.
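
For illustration, a minimal Python sketch of the "flatten across relations" idea described above; the data structures and function name are hypothetical, not the charm's actual code.

    relations = {
        "cloud-compute:3": {"nova-compute/0": "ssh-rsa AAA...",
                            "nova-compute/1": "ssh-rsa BBB..."},
        "cloud-compute:7": {"compute-only/0": "ssh-rsa CCC...",
                            "compute-only/1": "ssh-rsa DDD..."},
    }

    def flattened_keys(relations):
        """Merge pubkeys from every relation id into one shared set."""
        merged = set()
        for units in relations.values():
            merged.update(units.values())
        # Every relation id now receives the same merged set, so units in
        # different nova-compute applications can ssh to each other for
        # migration and resize.
        return {rid: sorted(merged) for rid in relations}

    print(flattened_keys(relations))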

Changed in charm-nova-cloud-controller:
assignee: nobody → Alex Kavanagh (ajkavanagh)
David Ames (thedac)
Changed in charm-nova-cloud-controller:
milestone: 19.07 → 19.10
David Ames (thedac)
Changed in charm-nova-cloud-controller:
milestone: 19.10 → 20.01
Changed in charm-nova-cloud-controller:
status: Triaged → In Progress
Ryan Beisner (1chb1n)
Changed in charm-nova-cloud-controller:
status: In Progress → Triaged
assignee: Alex Kavanagh (ajkavanagh) → nobody
milestone: 20.01 → 20.02
Revision history for this message
Ryan Beisner (1chb1n) wrote :

Apologies - this was erroneously placed into an In-Progress state. There were one or more in-person discussions on this at recent planning sprints, and I'll do my best to get it down on paper:

The current behavior is by design. Having multiple distinct instances of the nova-compute charm, represented as unique applications, is a common way to represent different types of compute nodes. The logic behind the current implementation is that nova live migration is most often not possible across heterogeneous architectures. That, admittedly, is an aging view, which can certainly be invalidated with solid user stories / use cases.

As Alex has commented, some important prerequisite work has been completed, which is nice. But the bottom line is that this is feature work, and a feature request exists (via this bug, and in the product backlog) to develop cross-application nova live migration. To address this, that work needs to be prioritised as a non-trivial engineering effort.

Liam Young (gnuoy)
Changed in charm-nova-cloud-controller:
milestone: 20.02 → 20.05
Changed in charm-nova-cloud-controller:
importance: High → Wishlist
James Page (james-page)
Changed in charm-nova-cloud-controller:
importance: Wishlist → High
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Note, some preliminary work has been done in this patchset: https://review.opendev.org/#/c/715948

David Ames (thedac)
Changed in charm-nova-cloud-controller:
milestone: 20.05 → 20.08
James Page (james-page)
Changed in charm-nova-cloud-controller:
status: In Progress → Triaged
Revision history for this message
Ryan Beisner (1chb1n) wrote :

To summarize conversations over time in one succinct place:

1. This behavior was the intended design (each group of nova-compute applications is special in some way, otherwise they would all be in one application, and because they are different, the design assumed they would not be able to live migrate anyway). These earlier design assumptions can be changed with a new specification and a commitment to the development work.

2. Cross-application live migration for charm-nova-compute is a feature request which has been considered in the backlog by stakeholders in previous development cycles. Unfortunately, it has not secured a spot in the plan thus far.

3. As a new feature, it's not really a fit for an SLA escalation.

Changed in charm-nova-cloud-controller:
importance: High → Wishlist
milestone: 20.08 → none
Revision history for this message
James Page (james-page) wrote :

Please consider switching this from field-high -> field-medium

Revision history for this message
Paul Goins (vultaire) wrote :

I got curious about this and did a quick spike, and I have something that may work.

Working backwards, I just opened the spec MR for review: https://review.opendev.org/c/openstack/charm-specs/+/806997

I'm presuming new charm specs are proposed in the latest release's approved dir, and then moved to implemented when completed, but I am just guessing here; if this is wrong, guidance is appreciated.

And again, as I mentioned I'm working backwards by writing the spec now... I do have a proposed patch ready, although I am waiting on the formal review until the spec is approved. However, for the curious (or future follow-up by others), it is here: https://github.com/Vultaire/charm-nova-cloud-controller/commit/8edd4ffeff43569872502b89cf33e04577f73e19

Revision history for this message
James Page (james-page) wrote :

I made a general comment on the review - however it did occur to me that sharing of keys is just one part of the migration setup. Nova does pre-migration checks to ensure that the target host can take the instance - do we need to reflect any grouping in Nova via placement or suchlike to make this work consistently and without retry<->failure loops?

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Note I also did some early groundwork on this feature. https://review.opendev.org/c/openstack/charm-nova-cloud-controller/+/715948

I'd be very interested in being a critical-friend-reviewer on this one as I've spent quite a bit of time on improving performance, especially when divvying out keys amongst units as part of the key exchange.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-nova-cloud-controller (master)
Changed in charm-nova-cloud-controller:
status: Triaged → In Progress
Revision history for this message
Paul Goins (vultaire) wrote :

Hello, and thanks for the feedback on the review already.

@james-page: I replied to your comments; can you take a look?

@ajkavanagh: I went ahead and pushed my changes to a Gerrit MR, albeit premature since the spec is in flight: https://review.opendev.org/c/openstack/charm-nova-cloud-controller/+/807175. (I also looked at your WIP MR to see what had been done there.)

Revision history for this message
Paul Goins (vultaire) wrote :

Also, one more comment:

> sharing of keys is just one part of the migration setup. Nova does pre-migration checks to ensure that the target host can take the instance - do we need to reflect any grouping in Nova via placement or suchlike to make this work consistently and without retry<->failure loops?

Short answer: I don't know for sure.

There's no attempt to bring this in line with placement rules, host aggregates or anything like that; this just shares the SSH keys across multiple nova applications so that, if the computes are otherwise able to perform migrations, they shouldn't be blocked from doing so.

On the other hand, let's take this hypothetical case: if you had e.g. 6 identical compute hosts, 3 belonging to one nova-compute app and 3 belonging to another nova-compute app... Aside from the SSH key sharing, would there be anything else in how ncc and nova-compute work which would prevent migrations from working across the nova-compute apps?

Revision history for this message
Drew Freiberger (afreiberger) wrote :

Considering the pre-migration checks question from a different perspective: Nova can already attempt migration across nova-compute applications, since they are all (typically) deployed as hypervisors in the same cell, and we already see quite a few migration/retry failures due to ssh key problems when we add new units that need a different config, e.g. different paths to the ephemeral device on newer machines added to a cloud.

We currently already have to employ aggregate/flavor filters to handle e.g. SRIOV vs non-SRIOV type nodes.

Ultimately, adding the ssh key sync doesn't create any new bugs; it simply resolves this ssh key availability bug. James' question, however, does suggest a need for additional OpenStack operator documentation when configuring non-homogeneous hypervisors as migration peers.

Also, I agree with the spec comment from James: this should be a nova-compute configuration option or relation interface that communicates to n-c-c that ssh keys should be shared across any number of groups of migration-peer nova-compute applications.

Revision history for this message
Drew Freiberger (afreiberger) wrote :

Re-reading Alex's comment in the prior WIP branch:

# NOTE: The following functions are only used to support upgrades from a
# previous version of this charm that partitioned knownhosts/authorized_keys
# according to remote_service (essentially, the nova-compute application name
# that was related to this nova-cloud-controller instance). The current
# behaviour is to merge all of the remote_services so that the
# knownhosts/authorized_keys are shared among all nova-compute relations. See
# bug: #1468871

Does this mean that the issue we're solving for could be fixed just by removing and re-relating all the nova-compute applications to nova-cloud-controller to cross-multiply keys for legacy deployments? Do new deployments not suffer from this segregated key sharing if I'm using 21.04 n-c-c and nova-compute charms?

Haw Loeung (hloeung)
Changed in charm-nova-cloud-controller:
status: In Progress → Triaged
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-nova-cloud-controller (master)

Reviewed: https://review.opendev.org/c/openstack/charm-nova-cloud-controller/+/807175
Committed: https://opendev.org/openstack/charm-nova-cloud-controller/commit/aac2c2a178a86a8b686eb26b553283c0e9ba69f2
Submitter: "Zuul (22348)"
Branch: master

commit aac2c2a178a86a8b686eb26b553283c0e9ba69f2
Author: Paul Goins <email address hidden>
Date: Tue Aug 31 11:11:47 2021 -0700

    Sharing SSH pubkeys across nova-compute apps

    SSH keys from nova-compute are now shared across all
    nova-compute charm apps.

    Closes-Bug: #1468871
    Change-Id: Ia142eceff56bb763fcca8ddf5b74b83f84bf3539

Changed in charm-nova-cloud-controller:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-nova-cloud-controller (stable/21.10)
Revision history for this message
Paul Goins (vultaire) wrote :

For what it's worth, Rodrigo Barbieri gets credit for taking my patch, working with the OpenStack team to refine it and address issues, and getting it across the finish line. Thank you!

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-nova-cloud-controller (master)

Change abandoned by "Alex Kavanagh <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/charm-nova-cloud-controller/+/715948
Reason: Overtaken by events and a great patch here: https://review.opendev.org/c/openstack/charm-nova-cloud-controller/+/807175

Changed in charm-nova-cloud-controller:
status: Fix Committed → Fix Released
milestone: none → 21.10
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-nova-cloud-controller (stable/21.10)

Reviewed: https://review.opendev.org/c/openstack/charm-nova-cloud-controller/+/818077
Committed: https://opendev.org/openstack/charm-nova-cloud-controller/commit/a507992f9e3641ac88ddedc233f5ac0b2ba59e6d
Submitter: "Zuul (22348)"
Branch: stable/21.10

commit a507992f9e3641ac88ddedc233f5ac0b2ba59e6d
Author: Paul Goins <email address hidden>
Date: Tue Aug 31 11:11:47 2021 -0700

    Sharing SSH pubkeys across nova-compute apps

    SSH keys from nova-compute are now shared across all
    nova-compute charm apps.

    Closes-Bug: #1468871
    Change-Id: Ia142eceff56bb763fcca8ddf5b74b83f84bf3539
    (cherry picked from commit aac2c2a178a86a8b686eb26b553283c0e9ba69f2)
