Comment 3 for bug 1969971

Paul Goins (vultaire) wrote :

Hello Alex - I found this bug in the wake of an issue we had on another cloud. It manifested in a slightly different way this time, but the end result is the same: migrations fail because the SSH known_hosts file is not fully populated to allow prompt-less SSH access.

First, let me say: I think the problem is *partially* addressed by the config change and action you mention. However, it wasn't enough for this particular cloud; I have evidence that improvements may be needed.

On this cloud, after the migration problem was reported to us, we set cache-known-hosts=false to turn off hostname caching, and then ran the clear-unit-knownhost-cache action. That appears to have worked as expected. Here is a sanitized version of the output from the clear-unit-knownhost-cache action:

$ juju show-action-output 12345
UnitId: nova-cloud-controller/1
id: "97791"
results:
  Stderr: |
    # 10.1.2.15:22 SSH-2.0-OpenSSH_8.2p1 Ubuntu-4ubuntu0.4
    # site2-rack3-node15:22 SSH-2.0-OpenSSH_8.2p1 Ubuntu-4ubuntu0.4
    # 10.1.2.15:22 SSH-2.0-OpenSSH_8.2p1 Ubuntu-4ubuntu0.4
    # site2-rack3-node15:22 SSH-2.0-OpenSSH_8.2p1 Ubuntu-4ubuntu0.4
    [...]
  units-updated: '[{''nova-compute-kvm/1'': ''<REDACTED>''}, [...]
status: completed
timing:
  completed: 2023-08-29 17:20:20 +0000 UTC
  enqueued: 2023-08-29 17:19:39 +0000 UTC
  started: 2023-08-29 17:19:39 +0000 UTC

We can see clearly that the script pulled the private-address IP and also the hostname and created entries against both - which is exactly what we want.
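
For anyone wanting to check this on their own cloud, here is a rough sketch of how the scanned targets could be pulled out of the action's Stderr. It assumes the Stderr lines are ssh-keyscan style banner comments of the form "# host:port SSH-2.0-...", as in the output above; the function name scanned_targets is made up for illustration and is not part of the charm.

```python
import re

# Matches banner comment lines such as:
#   # 10.1.2.15:22 SSH-2.0-OpenSSH_8.2p1 Ubuntu-4ubuntu0.4
# Assumption: every scanned target appears as "# <host>:<port> SSH-...".
BANNER_RE = re.compile(r"^#\s+(\S+):(\d+)\s+SSH-")


def scanned_targets(stderr_text):
    """Return the unique hosts/IPs named in banner lines, in first-seen order."""
    targets = []
    for line in stderr_text.splitlines():
        match = BANNER_RE.match(line.strip())
        if match and match.group(1) not in targets:
            targets.append(match.group(1))
    return targets


# Sample taken from the sanitized action output above.
sample_stderr = """\
# 10.1.2.15:22 SSH-2.0-OpenSSH_8.2p1 Ubuntu-4ubuntu0.4
# site2-rack3-node15:22 SSH-2.0-OpenSSH_8.2p1 Ubuntu-4ubuntu0.4
# 10.1.2.15:22 SSH-2.0-OpenSSH_8.2p1 Ubuntu-4ubuntu0.4
# site2-rack3-node15:22 SSH-2.0-OpenSSH_8.2p1 Ubuntu-4ubuntu0.4
[...]
"""

print(scanned_targets(sample_stderr))
```

For the sample above this lists the IP and the short hostname only, which is the set of names the action created entries for.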

However, here's the nuance: the hostname matches neither what's in "openstack hypervisor list" nor what's in "openstack host list".

# Again, sanitized
$ openstack hypervisor list
+----+-------------------------+-----------------+---------------+-------+
| ID | Hypervisor Hostname     | Hypervisor Type | Host IP       | State |
+----+-------------------------+-----------------+---------------+-------+
| 1  | site2-rack3-node15.maas | QEMU            | 10.1.2.15     | up    |
[...]
+----+-------------------------+-----------------+---------------+-------+

$ openstack compute service list --service nova-compute
+----+--------------+-------------------------+---------------------+----------+-------+----------------------------+
| ID | Binary       | Host                    | Zone                | Status   | State | Updated At                 |
+----+--------------+-------------------------+---------------------+----------+-------+----------------------------+
| 28 | nova-compute | site2-rack3-node15.maas | availability-zone-3 | enabled  | up    | 2023-08-30T20:07:11.000000 |
[...]
+----+--------------+-------------------------+---------------------+----------+-------+----------------------------+

As you can see above, there's a .maas domain suffix. The suffixed name wouldn't have been pre-seeded - and indeed, instance migrations fail without those entries, since the hostname field in the relations doesn't match the hostnames used in OpenStack.
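
To make the mismatch concrete, here is a minimal sketch of the comparison I have in mind: take the host names OpenStack reports and see which ones have no entry in a known_hosts file. It is only an illustration. The function names are hypothetical, the key material is a placeholder, and it assumes plain (unhashed) known_hosts lines without [host]:port entries.

```python
def known_hosts_names(known_hosts_text):
    """Collect every host name or IP listed in unhashed known_hosts lines."""
    names = set()
    for line in known_hosts_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        fields = line.split()
        if fields[0].startswith("@"):
            fields = fields[1:]  # drop markers such as @cert-authority
        if not fields or fields[0].startswith("|1|"):
            continue  # hashed entries cannot be compared by name
        # The first field is a comma-separated list of host patterns.
        names.update(fields[0].split(","))
    return names


def missing_hosts(openstack_hosts, known_hosts_text):
    """Return the OpenStack host names that have no known_hosts entry."""
    present = known_hosts_names(known_hosts_text)
    return [host for host in openstack_hosts if host not in present]


# Placeholder keys; the names mirror the sanitized output above.
sample_known_hosts = """\
10.1.2.15 ssh-ed25519 AAAAplaceholderkey
site2-rack3-node15 ssh-ed25519 AAAAplaceholderkey
"""

print(missing_hosts(["site2-rack3-node15.maas"], sample_known_hosts))
```

With the sample data, the .maas name is reported as missing even though both the IP and the short hostname are present, which is the situation on this cloud.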

So - I think we have a bug here with regard to how hostnames are handled in the known_hosts file generation process.