At a high level, the issue is that cloud-init will mount and use an
available NoCloud drive on any boot[0]. If that drive includes a
changed instance ID in its metadata, then cloud-init will treat this as
a first instance boot and re-run every module: this means that all
configuration that is specified in the new NoCloud drive will be
applied. This is bad!
[0] This isn't _exactly_ true, as I'll discuss below.
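For reference, the metadata in question is the `meta-data` file on the NoCloud drive. A minimal example might look like this (the values here are illustrative; `instance-id` is the key cloud-init compares against its cached value):

```yaml
# Hypothetical meta-data on an attached NoCloud drive. If this
# instance-id differs from the one cached from the previous boot,
# cloud-init treats the boot as the first boot of a new instance.
instance-id: iid-example-001
local-hostname: example
```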
More specifically, here is the sequence which causes this to happen:
* cloud-init restores the NoCloud datasource instance from the previous
boot from its pickled form on disk: https://github.com/canonical/cloud-init/blob/f3bd42659efeed4b092ffcdfd5df7f24813f2d3e/cloudinit/stages.py#L215
* to ensure that this unpickled instance is still applicable for this
boot (an example of when it might not be applicable: if this instance
has been launched from an image that was captured from a
previously-booted instance), we call `ds.check_instance_id`: https://github.com/canonical/cloud-init/blob/f3bd42659efeed4b092ffcdfd5df7f24813f2d3e/cloudinit/stages.py#L230-L234
* we can see in the provided logs that the `else` is taken (search for
"cache invalid in datasource"), which means that cloud-init has
determined that the cached datasource object is no longer applicable
* we can see why this happens by looking at NoCloud's implementation of
`check_instance_id`
(https://github.com/canonical/cloud-init/blob/f3bd42659efeed4b092ffcdfd5df7f24813f2d3e/cloudinit/sources/DataSourceNoCloud.py#L209-L222);
we only check the kernel command line and seed directories (i.e.
something in the instance's filesystem at one of
/var/lib/cloud/seed/nocloud{,-net}/); we will never detect a cached
instance ID if a NoCloud drive was the source of configuration
* because the cache is invalid, cloud-init does its datasource
discovery process from scratch; it discovers the new NoCloud drive,
determines that the instance ID from this new drive does not match
the instance ID from the previous boot, and so assumes that this is
the first boot of a new instance. This means that it reads all the
configuration and performs all actions that aren't "per-once" (i.e.
it performs ~all of cloud-init's actions.)
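The cache-validation step at the heart of this sequence can be sketched as follows. This is a simplified, hypothetical rendering of NoCloud-style validation, not cloud-init's actual code; the function and variable names are mine, and the real implementation parses YAML and also consults the kernel command line:

```python
import os

# Seed directories that a NoCloud-style check consults. Note that an
# attached NoCloud *drive* is never consulted here, which is why a
# drive-seeded instance can never have its cached instance ID confirmed.
SEED_DIRS = [
    "/var/lib/cloud/seed/nocloud",
    "/var/lib/cloud/seed/nocloud-net",
]


def instance_id_from_seed_dirs(seed_dirs):
    """Return the instance ID from the first seed dir with meta-data, else None."""
    for seed_dir in seed_dirs:
        meta = os.path.join(seed_dir, "meta-data")
        if os.path.exists(meta):
            # Real cloud-init parses YAML; a plain-text scan suffices here.
            with open(meta) as f:
                for line in f:
                    if line.startswith("instance-id:"):
                        return line.split(":", 1)[1].strip()
    return None


def check_instance_id(cached_id, seed_dirs=SEED_DIRS):
    """True only if the cached ID can be re-confirmed from a local source.

    Returning False means the cache is declared invalid, which triggers
    full datasource discovery on this boot.
    """
    current = instance_id_from_seed_dirs(seed_dirs)
    return current is not None and current == cached_id
```

With no local seed present at all, this returns False regardless of the cached ID, which is exactly the "cache invalid in datasource" path seen in the logs.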
subiquity installations are not vulnerable by default because the
installer writes the NoCloud configuration into /var/lib/cloud/seed,
so cloud-init _can_ determine that the instance ID of the pickled
datasource is correct; it therefore doesn't re-attempt datasource
discovery, and so never even goes looking for the attached NoCloud
drive. If you remove /var/lib/cloud/seed, then a subiquity
installation _does_ become vulnerable. (I have tested this.)
(I haven't tested this, but I assume that if the metadata is provided
on the kernel command line then, similarly, we won't run into the
issue.)
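Putting the pieces together, the overall boot decision can be modelled like this. This is a deliberately simplified sketch with made-up names, not cloud-init's actual control flow (which lives in cloudinit/stages.py and the datasource modules):

```python
def decide_boot_mode(cached_id, local_seed_id, drive_id):
    """Return how cloud-init treats this boot.

    cached_id     -- instance ID restored from the pickled datasource
    local_seed_id -- instance ID readable from the kernel command line
                     or /var/lib/cloud/seed (None if neither exists)
    drive_id      -- instance ID on an attached NoCloud drive (or None)
    """
    # Step 1: validate the cache. Only local sources are consulted, so
    # a drive-seeded instance always fails this check.
    if local_seed_id is not None and local_seed_id == cached_id:
        return "cached"  # cache valid: no re-discovery, drive ignored

    # Step 2: cache invalid -> full datasource discovery, which *does*
    # find the attached drive.
    if drive_id is None:
        return "no-datasource"
    if drive_id == cached_id:
        return "existing-instance"  # same instance: per-boot modules only

    # Step 3: the IDs differ -> treated as the first boot of a new
    # instance, so ~all modules re-run with the drive's configuration.
    return "first-boot"
```

A subiquity install with its seed directory intact takes the first branch and never looks at the drive; remove the seed and attach a drive carrying a new instance ID, and the decision falls through to "first-boot".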