Hello Alex - I found this bug in the wake of an issue we had on another cloud, and while it manifested in a slightly different way this time, the end result is the same: migrations failing because of issues with the SSH known_hosts file not being fully prepared to allow prompt-less SSH access.
First, let me say: I think the problem is *partially* addressed by the config change and action you mention. However, it wasn't enough for this particular cloud; I have evidence that improvements may be needed.
On this cloud, after the migration problem was reported to us, we set cache-known-hosts=false to turn off hostname caching, and followed that by the clear-unit-knownhost-cache action. And it looks like that works as expected. Here is a sanitized version of the output from the clear-unit-knownhost-cache action:
$ juju show-action-output 12345
UnitId: nova-cloud-controller/1
id: "97791"
results:
  Stderr: |
    # 10.1.2.15:22 SSH-2.0-OpenSSH_8.2p1 Ubuntu-4ubuntu0.4
    # site2-rack3-node15:22 SSH-2.0-OpenSSH_8.2p1 Ubuntu-4ubuntu0.4
    # 10.1.2.15:22 SSH-2.0-OpenSSH_8.2p1 Ubuntu-4ubuntu0.4
    # site2-rack3-node15:22 SSH-2.0-OpenSSH_8.2p1 Ubuntu-4ubuntu0.4
    [...]
  units-updated: '[{''nova-compute-kvm/1'': ''<REDACTED>''}, [...]
status: completed
timing:
  completed: 2023-08-29 17:20:20 +0000 UTC
  enqueued: 2023-08-29 17:19:39 +0000 UTC
  started: 2023-08-29 17:19:39 +0000 UTC
We can see clearly that the script pulled the private-address IP and also the hostname and created entries against both - which is exactly what we want.
However, here's the nuance: that hostname matches neither what's in "openstack hypervisor list" nor what's in "openstack host list".
# Again, sanitized
$ openstack hypervisor list
+----+-------------------------+-----------------+---------------+-------+
| ID | Hypervisor Hostname     | Hypervisor Type | Host IP       | State |
+----+-------------------------+-----------------+---------------+-------+
| 1  | site2-rack3-node15.maas | QEMU            | 10.1.2.15     | up    |
[...]
+----+-------------------------+-----------------+---------------+-------+
$ openstack compute service list --service nova-compute
+----+--------------+-------------------------+---------------------+----------+-------+----------------------------+
| ID | Binary       | Host                    | Zone                | Status   | State | Updated At                 |
+----+--------------+-------------------------+---------------------+----------+-------+----------------------------+
| 28 | nova-compute | site2-rack3-node15.maas | availability-zone-3 | enabled  | up    | 2023-08-30T20:07:11.000000 |
[...]
+----+--------------+-------------------------+---------------------+----------+-------+----------------------------+
As you can see above, there's a .maas domain suffix. That name wouldn't have been pre-seeded - and indeed, instance migrations fail without those entries, since the hostname field in the relations doesn't match the hostnames used in OpenStack.
So - I think we have a bug here with regard to how hostnames are handled in the known_hosts file generation process.
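To make the mismatch concrete, here is a minimal Python sketch of the kind of check I have in mind. This is not the charm's actual code: the function names, the sample known_hosts lines, and the placeholder keys are all my own assumptions for illustration. It only handles plain (unhashed) entries and ignores wildcard or [host]:port patterns. Given the names a migration might connect to, it reports which ones have no known_hosts entry.

```python
def known_hosts_names(lines):
    """Collect plain host names from known_hosts-style lines (hypothetical helper)."""
    names = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        fields = line.split()
        if fields[0].startswith("@"):
            fields = fields[1:]  # drop a marker such as @cert-authority
        if not fields or fields[0].startswith("|1|"):
            continue  # hashed entries cannot be compared by name
        names.update(fields[0].split(","))  # one entry may list several names
    return names


def missing_names(required, lines):
    """Return the required names that have no known_hosts entry."""
    return sorted(set(required) - known_hosts_names(lines))


# Assumed sample data mirroring the sanitized output above; keys are placeholders.
SAMPLE_KNOWN_HOSTS = [
    "10.1.2.15 ssh-rsa PLACEHOLDER_KEY",
    "site2-rack3-node15 ssh-rsa PLACEHOLDER_KEY",
]
REQUIRED = ["10.1.2.15", "site2-rack3-node15", "site2-rack3-node15.maas"]

print(missing_names(REQUIRED, SAMPLE_KNOWN_HOSTS))
# → ['site2-rack3-node15.maas']
```

With the IP and short hostname seeded but not the name OpenStack reports, the only missing entry is the .maas one, which is the name migrations on this cloud appear to use.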