At a high level, the issue is that cloud-init will mount and use an
available NoCloud drive on any boot[0]. If that drive includes a
changed instance ID in its metadata, then cloud-init will treat this as
a first instance boot and re-run every module: this means that all
configuration that is specified in the new NoCloud drive will be
applied. This is bad!
[0] This isn't _exactly_ true, as I'll discuss below.
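For reference, the metadata in question is the `meta-data` file on the NoCloud drive. A minimal example might look like this (the values here are illustrative; `instance-id` is the key cloud-init compares against its cached value):

```yaml
# Hypothetical meta-data on an attached NoCloud drive. If this
# instance-id differs from the one cached from the previous boot,
# cloud-init treats the boot as the first boot of a new instance.
instance-id: iid-example-001
local-hostname: example
```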
More specifically, here is the sequence which causes this to happen:
* cloud-init restores the NoCloud datasource instance from the previous
boot from its pickled form on disk: https://github.com/canonical/cloud-init/blob/f3bd42659efeed4b092ffcdfd5df7f24813f2d3e/cloudinit/stages.py#L215
* to ensure that this unpickled instance is still applicable for this
boot (an example of when it might not be applicable: if this instance
has been launched from an image that was captured from a
previously-booted instance), we call `ds.check_instance_id`: https://github.com/canonical/cloud-init/blob/f3bd42659efeed4b092ffcdfd5df7f24813f2d3e/cloudinit/stages.py#L230-L234
* we can see in the provided logs that the `else` is taken (search for
"cache invalid in datasource"), which means that cloud-init has
determined that the cached datasource object is no longer applicable
* we can see why this happens by looking at NoCloud's implementation of
`check_instance_id`
(https://github.com/canonical/cloud-init/blob/f3bd42659efeed4b092ffcdfd5df7f24813f2d3e/cloudinit/sources/DataSourceNoCloud.py#L209-L222);
we only check the kernel command line and seed directories (i.e.
something in the instance's filesystem at one of
/var/lib/cloud/seed/nocloud{,-net}/); we will never detect a cached
instance ID if a NoCloud drive was the source of configuration
* because the cache is invalid, cloud-init does its datasource
discovery process from scratch; it discovers the new NoCloud drive,
determines that the instance ID from this new drive does not match
the instance ID from the previous boot, and so assumes that this is
the first boot of a new instance. This means that it reads all the
configuration and performs all actions that aren't "per-once" (i.e.
it performs ~all of cloud-init's actions.)
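The cache-validation step at the heart of this sequence can be sketched as follows. This is a simplified, hypothetical rendering of NoCloud-style validation, not cloud-init's actual code; the function and variable names are mine, and the real implementation parses YAML and also consults the kernel command line:

```python
import os

# Seed directories that a NoCloud-style check consults. Note that an
# attached NoCloud *drive* is never consulted here, which is why a
# drive-seeded instance can never have its cached instance ID confirmed.
SEED_DIRS = [
    "/var/lib/cloud/seed/nocloud",
    "/var/lib/cloud/seed/nocloud-net",
]


def instance_id_from_seed_dirs(seed_dirs):
    """Return the instance ID from the first seed dir with meta-data, else None."""
    for seed_dir in seed_dirs:
        meta = os.path.join(seed_dir, "meta-data")
        if os.path.exists(meta):
            # Real cloud-init parses YAML; a plain-text scan suffices here.
            with open(meta) as f:
                for line in f:
                    if line.startswith("instance-id:"):
                        return line.split(":", 1)[1].strip()
    return None


def check_instance_id(cached_id, seed_dirs=SEED_DIRS):
    """True only if the cached ID can be re-confirmed from a local source.

    Returning False means the cache is declared invalid, which triggers
    full datasource discovery on this boot.
    """
    current = instance_id_from_seed_dirs(seed_dirs)
    return current is not None and current == cached_id
```

With no local seed present at all, this returns False regardless of the cached ID, which is exactly the "cache invalid in datasource" path seen in the logs.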
subiquity installations are not vulnerable by default because the
installer writes the NoCloud configuration into /var/lib/cloud/seed,
so cloud-init _can_ determine that the instance ID of the pickled
datasource is correct; it therefore doesn't re-attempt datasource
discovery, and so never even goes looking for the attached NoCloud
drive. If you remove /var/lib/cloud/seed, then a subiquity
installation _does_ become vulnerable. (I have tested this.)
(I haven't tested this, but I assume that if the metadata is provided
on the kernel command line then, similarly, we won't run into the
issue.)
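Putting the pieces together, the overall boot decision can be modelled like this. This is a deliberately simplified sketch with made-up names, not cloud-init's actual control flow (which lives in cloudinit/stages.py and the datasource modules):

```python
def decide_boot_mode(cached_id, local_seed_id, drive_id):
    """Return how cloud-init treats this boot.

    cached_id     -- instance ID restored from the pickled datasource
    local_seed_id -- instance ID readable from the kernel command line
                     or /var/lib/cloud/seed (None if neither exists)
    drive_id      -- instance ID on an attached NoCloud drive (or None)
    """
    # Step 1: validate the cache. Only local sources are consulted, so
    # a drive-seeded instance always fails this check.
    if local_seed_id is not None and local_seed_id == cached_id:
        return "cached"  # cache valid: no re-discovery, drive ignored

    # Step 2: cache invalid -> full datasource discovery, which *does*
    # find the attached drive.
    if drive_id is None:
        return "no-datasource"
    if drive_id == cached_id:
        return "existing-instance"  # same instance: per-boot modules only

    # Step 3: the IDs differ -> treated as the first boot of a new
    # instance, so ~all modules re-run with the drive's configuration.
    return "first-boot"
```

A subiquity install with its seed directory intact takes the first branch and never looks at the drive; remove the seed and attach a drive carrying a new instance ID, and the decision falls through to "first-boot".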