[2.5rc2] Cannot communicate with KVM hosts if stale SSH fingerprint is cached

Bug #1807231 reported by Florian Guitton
This bug affects 1 person
Affects: MAAS
Status: Fix Released
Importance: Medium
Assigned to: Newell Jensen
Milestone: 2.5.1

Bug Description

Hello everybody !

I am experimenting with 2.5-RC2 and I am trying to deploy a node as a KVM pod.
The installation is fresh, with only 1 user, 1 domain, and a dozen nodes.
I have attempted the deployment multiple times, and it invariably fails and stops on this cloud-init message:

Cloud-init v. 18.4-0ubuntu1~18.04.1 running 'modules:final' at Thu, 06 Dec 2018 16:16:45 +0000. Up 31.00 seconds.
ci-info: no authorized ssh keys fingerprints found for user virsh.
Cloud-init v. 18.4-0ubuntu1~18.04.1 finished at Thu, 06 Dec 2018 16:29:15 +0000. Datasource DataSourceMAAS [http://10-80-0-0--16.maas-internal:5248/MAAS/metadata/]. Up 780.66 seconds

The Web UI then indicates the deployment has failed and the node is in an error state.
Everything seems to be installed, however, and the network is configured properly too.

Would anybody have any idea what is happening?

Best wishes,

Florian

Related branches

Revision history for this message
Andres Rodriguez (andreserl) wrote : Re: [2.5rc2] Deploy as KVM pod fails with missing key fingerprints for virsh

Hi Florian,

Could you please provide the full installation logs and the machine's cloud-init logs:

rsyslog: /var/log/maas/rsyslog/<machine-name>/date/messages
machine cloud-init logs: /var/log/cloud-init.log, /var/log/cloud-init-output.log # from the deployed system itself
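
For example, one way to collect these (assuming the default package-install paths; <machine-name> and the date directory are placeholders you would need to fill in):

    # On the MAAS controller:
    tar czf rsyslog-logs.tgz /var/log/maas/rsyslog/<machine-name>/
    # On the deployed machine itself:
    tar czf cloud-init-logs.tgz /var/log/cloud-init.log /var/log/cloud-init-output.log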

summary: - [2.5 RC2] : Deploy as KVM pod fails with missing key fingerprints for
- virsh
+ [2.5rc2] Deploy as KVM pod fails with missing key fingerprints for virsh
Changed in maas:
milestone: none → 2.5.1
status: New → Incomplete
Revision history for this message
Mike Pontillo (mpontillo) wrote :

I would assume this is just an informational message, and the deployment is failing for some other reason, because the `virsh` underprivileged user is intentionally deployed without any SSH keys. MAAS generates a random password to be used to communicate with the pod. (A default user - `ubuntu` - will also be configured with the normal SSH keys that exist in MAAS.)

You might also check the node event log to see if there is anything interesting in there.

Revision history for this message
Mike Pontillo (mpontillo) wrote :

To clarify: the informational message I am referring to is the "no authorized ssh keys" message you observed.

One last thing: I would check /var/log/maas/regiond.log during the time that the deployment was finishing to see if anything interesting has been logged.

Revision history for this message
Florian Guitton (f-guitton) wrote :
Revision history for this message
Florian Guitton (f-guitton) wrote :
Revision history for this message
Florian Guitton (f-guitton) wrote :

I am attaching here extracts from the log files taken while I was proceeding with the deployment.
It seems that it has indeed "Failed talking to pod: Failed to login to virsh console".
The deployed node, however, seems to be functional: all packages are installed and the network is configured appropriately.

System logs to follow...

Revision history for this message
Florian Guitton (f-guitton) wrote :
Revision history for this message
Florian Guitton (f-guitton) wrote :

It seems to remain at [0/1] of opening 'http://10-80-0-0--16.maas-internal:5248/MAAS/metadata/status/cfppes', but checking manually after deployment, this URL is reachable from the system.

Would you have any pointers?
Let me know if I can provide any further details.

Revision history for this message
Florian Guitton (f-guitton) wrote :

I should add that deploying the same node with the same configuration, but without setting it as a KVM host, works and completes successfully.

Revision history for this message
Mike Pontillo (mpontillo) wrote :

I think the relevant portion of this message is, as you said in a previous comment, the "Failed to login to virsh console" error.

I have a few theories about why that might happen:

(1) A cached SSH key fingerprint for the same IP address (for the `maas` user) caused the KVM driver to fail to log into the newly-deployed KVM host. If this is the cause, we could see this problem when the same IP address is assigned to a host that was previously used as a KVM host in MAAS. (A quick way to check for this is sketched at the end of this comment.)

(2) The IP address that MAAS chose to communicate with the newly-deployed machine was unreachable (could be due to firewall issues, etc.)

(3) A race condition occurred; cloud-init reported that it was finished, but the service wasn't yet ready for MAAS to talk to it.

It would be helpful if you could help us narrow down the cause. Note that the deploying KVM host sleeps for 10 seconds before cloud-init finishes, as a precaution to mitigate (3). So I think options (1) or (2) are most likely.
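
One quick way to check for (1) -- assuming a package install where the maas user's home (and therefore its default known_hosts) lives under /var/lib/maas, and with <kvm-host-ip> as a placeholder -- is to look for a cached entry for the KVM host's address:

    # Prints the cached known_hosts entry for that address, if one exists:
    sudo -u maas -H ssh-keygen -F <kvm-host-ip>

If that prints an entry recorded before the redeployment, a stale fingerprint is the likely cause.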

Revision history for this message
Florian Guitton (f-guitton) wrote :

I think you have found our culprit.
It seems to be (1). Indeed, doing the following allowed the node to deploy with no issue:

root@maas-controller-01:> sudo -u maas -H bash
maas@maas-controller-01:> echo -n > ~/.ssh/known_hosts
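
A more targeted variant of the same workaround, if you would rather not wipe the whole file, is to drop only the entries for the affected host (<kvm-host-ip> is a placeholder; ssh-keygen -R edits the maas user's default known_hosts when run from that shell):

maas@maas-controller-01:> ssh-keygen -R <kvm-host-ip>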

The physical nodes on our network are allocated static addresses, so when they get recommissioned/redeployed we could essentially fall into that pitfall again.

Maybe some clever bit of logic could be added to MAAS to clean up the known_hosts file appropriately. Or, if there is no obvious way of doing that, maybe add a note about this limitation to the documentation.

Thank you very much for the quick and precise answer!
Very best wishes,

Changed in maas:
status: Incomplete → Triaged
importance: Undecided → Medium
summary: - [2.5rc2] Deploy as KVM pod fails with missing key fingerprints for virsh
+ [2.5rc2] Cannot communicate with KVM hosts if stale SSH fingerprint is
+ cached
Revision history for this message
Mike Pontillo (mpontillo) wrote :

Thanks for your help testing MAAS.

This issue is a little contentious, since by cleaning up the key fingerprints we reduce the security of MAAS by opening up the possibility that an attacker who has infiltrated your network could launch a man-in-the-middle (MitM) attack against virsh-over-SSH.

However, MAAS never provides the option to check the host key to begin with, so the network could already be compromised. And if an attacker has the ability to do that, I guess one's security theater is already bankrupt. ;-)

Therefore, if power drivers in MAAS call `ssh` with the following additional parameters, we believe the user experience will be improved:

    -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
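
For illustration only (this is a sketch of the intended effect, not the exact MAAS code path; <kvm-host-ip> is a placeholder), the resulting connection attempts would behave like:

    # New host keys are accepted automatically and nothing is cached between runs:
    ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null virsh@<kvm-host-ip> true

    # Roughly the same effect when going through libvirt's ssh transport,
    # which exposes a no_verify URI parameter:
    virsh -c 'qemu+ssh://virsh@<kvm-host-ip>/system?no_verify=1' list --all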

A full solution to this problem would require users to manually confirm SSH key fingerprints any time they are first cached, or any time they change. And of course, we could also automatically clear out cached fingerprints if a new machine is deployed. But if the goal is to secure MAAS from MitM attacks, that would also require manual confirmation every time a new KVM host is deployed. I don't think that's the user experience MAAS users expect.

Revision history for this message
Florian Guitton (f-guitton) wrote :

Could MAAS remove the key fingerprint of a virsh pod as soon as the pod gets deleted, or the machine gets released, based on all known IPs of the node?

I believe that might be a reasonable approach. Would there be a scenario where automatic removal for a node explicitly decommissioned by the user wouldn't be appropriate? This wouldn't require further user input.
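
As a rough sketch of what that cleanup could look like (assuming the maas user's default ~/.ssh/known_hosts is the file in question, and with $NODE_IPS standing in for the node's known addresses):

    # On pod deletion or machine release, drop any cached fingerprints for the node's IPs:
    for ip in $NODE_IPS; do
        sudo -u maas -H ssh-keygen -R "$ip"
    done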

Changed in maas:
assignee: nobody → Newell Jensen (newell-jensen)
Changed in maas:
status: Triaged → In Progress
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released