Comment 2 for bug 1667735

Josh Wilsdon (jwilsdon) wrote :

"That doesn't really make sense. And re-trying timing out in 10 seconds and re-trying wouldn't really change anything."

Unfortunately, that's not the case for cloud-init in 14.04 (retrying there would fix the problem but miss keys), and it is only correct for cloud-init in 16.10 because that version has other problems.

With the Ubuntu 14.04 image's version of cloud-init (0.7.5-0ubuntu1.3) the read() is blocking with no timeout. If there were a timeout and the request were correctly retried every 10 seconds, things would eventually proceed once the metadata service became available. As it stands now, these instances need to be externally rebooted when they hang, as they will never recover on their own.

I've just tested 16.10 as well with cloud-init 0.7.8-68-gca3ae67-0ubuntu1~16.10.1. With that version the code has changed significantly, but the problem is the same. Even though the timeout has been added in the new version, its usefulness has been negated since the new code gets itself stuck in an infinite loop when we hit this case. You can also see some more details in:

https://smartos.org/bugview/IMAGE-1014

In my testing on 16.10, systemd did not kill cloud-init in the 30 minutes I waited.

"Its very arguable that the *right* thing to do is wait forever on the metadata service."

I agree with this. However, that's not what cloud-init is doing, in part because cloud-init's implementation of the metadata specification (https://eng.joyent.com/mdata/protocol.html) is incomplete. In particular:

 * It uses V2 without doing NEGOTIATE
 * It uses the KVM serial port without reading all the data from the buffer before writing
 * It does not write '\n' and wait for 'invalid command\n'
 * When a read() times out, it tries a read() again instead of starting over

What happens with cloud-init in 16.10 is that if metadata is unavailable when the instance boots, cloud-init will:

 1) write data into the socket (nobody's listening)
 2) do a select() on the socket looking for readable data (and timeout after 10 seconds)
 3) goto 2

The loop between 2 and 3 becomes infinite because even if metadata is enabled at this point, cloud-init never attempts to send any commands to it.
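
As a rough illustration of that pattern (a sketch only, not cloud-init's actual code), the loop has roughly the shape below. The function name, the max_polls bound, and the use of a plain socket in place of the real serial port are all assumptions made so the example terminates and can be run:

```python
import select
import socket


def wait_without_resend(sock, request, timeout, max_polls):
    """Hypothetical model of the stuck pattern: write once, then only poll.

    The real loop has no max_polls bound; it is added here so the sketch
    terminates.
    """
    sock.sendall(request)  # step 1: written once, possibly to nobody
    for _ in range(max_polls):
        # step 2: wait for readable data, give up after `timeout` seconds
        readable, _, _ = select.select([sock], [], [], timeout)
        if readable:
            return sock.recv(4096)
        # step 3: back to step 2; the request is never sent again
    return None
```

If the peer never saw the original request, nothing in this loop gives it a second chance to answer, no matter how long it runs.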

If instead it were to:

 1) open the socket
 2) read on the socket (with a timeout) and discard any data
 3) write '\n'
 4) read on the socket for 'invalid command\n' (with a timeout; on timeout, close the socket and go to 1)
 5) NEGOTIATE V2

before making any queries, it would be able to recover once the metadata service became available, even if the service was unavailable initially. If you disable the metadata service and run mdata-get under strace, and then enable metadata, you'll see that this is how mdata-get is able to recover in this case.
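
The five steps above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not cloud-init's or mdata-get's code: the helper names (read_line, drain, handshake, open_transport) are made up for illustration, a socket-like object stands in for the KVM serial port, max_attempts exists only so the example can terminate, and the 'NEGOTIATE V2\n' / 'V2_OK\n' exchange is my reading of the protocol document linked above.

```python
import select
import socket


def read_line(sock, timeout):
    """Read one '\\n'-terminated line; return None on timeout or EOF.

    The timeout applies to each byte, which is enough for a sketch.
    """
    buf = b""
    while not buf.endswith(b"\n"):
        readable, _, _ = select.select([sock], [], [], timeout)
        if not readable:
            return None
        chunk = sock.recv(1)
        if not chunk:
            return None
        buf += chunk
    return buf


def drain(sock, timeout):
    """Step 2: discard stale bytes until the line stays quiet for `timeout`."""
    while True:
        readable, _, _ = select.select([sock], [], [], timeout)
        if not readable or not sock.recv(4096):
            return


def handshake(open_transport, timeout=10.0, max_attempts=None):
    """Retry the whole handshake until the service answers.

    open_transport is a hypothetical callable returning a fresh socket-like
    object. max_attempts=None means retry forever.
    """
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        sock = open_transport()                # step 1
        drain(sock, timeout)                   # step 2
        sock.sendall(b"\n")                    # step 3
        if read_line(sock, timeout) != b"invalid command\n":
            sock.close()                       # step 4 timed out: start over
            continue
        sock.sendall(b"NEGOTIATE V2\n")        # step 5
        if read_line(sock, timeout) == b"V2_OK\n":
            return sock
        sock.close()
    return None
```

The important difference from the stuck loop is that every timeout leads back to step 1, so a service that comes up late still sees a fresh '\n' and can answer it.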

So in summary: I think we're in agreement that cloud-init should wait forever for the metadata service, but both 14.04 and 16.10 have different but related problems in their implementation which prevent cloud-init from ever actually knowing when metadata has become available. The consequence of this is that in both versions if metadata is unavailable when cloud-init is first run, the VMs will hang until rebooted externally.