Update to maas 3.4.0 snap changed partition ids which broke automated deployments
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
MAAS | Status tracked in 3.6 | | |
3.4 | Triaged | High | Unassigned |
3.5 | Triaged | High | Unassigned |
3.6 | Triaged | High | Unassigned |
Bug Description
We (Cert) just updated MAAS from 3.3.x to 3.4.0-RC1. In testflinger we keep default partition definitions that, because of how MAAS identifies disks and partitions, rely heavily on MAAS IDs for disk devices and partitions.
For example, prior to the move to 3.4.0, this was the definition for one server (these change and grow more or less complex depending on the number of disks in a machine):
default_disks:
- id: '216'
  name: nvme0n1
  parent_disk_blkid: '216'
  ptable: GPT
  type: disk
- device: '882'
  id: nvme0n1-part1
  number: '882'
  parent_disk: '216'
  parent_disk_blkid: '216'
  size: '536870912'
  type: partition
- fstype: fat32
  id: 882-format
  label: efi
  parent_disk: '216'
  parent_disk_blkid: '216'
  type: format
  volume: '882'
- device: 882-format
  id: 882-mount
  parent_disk: '216'
  parent_disk_blkid: '216'
  path: /boot/efi
  type: mount
- device: '883'
  id: nvme0n1-part2
  number: '883'
  parent_disk: '216'
  parent_disk_blkid: '216'
  size: '1599778848768'
  type: partition
- fstype: ext4
  id: 883-format
  label: root
  parent_disk: '216'
  parent_disk_blkid: '216'
  type: format
  volume: '883'
- device: 883-format
  id: 883-mount
  parent_disk: '216'
  parent_disk_blkid: '216'
  path: /
  type: mount
As you can see, this spells out the partitions on a disk with ID 216, where partition IDs 882 and 883 define the /boot/efi filesystem and the root filesystem respectively. These IDs were pulled from MAAS and reflect what one would get from 'maas <name> partitions read <system_id> <disk_id>'. This allows us to let users define their own partition scheme (e.g. set up something ceph-like, or bcache, or whatever) and then revert things to the default.
After the update, all testflinger deployments fail, apparently because the partition IDs have changed. Looking at a dump of this machine via the MAAS CLI, the disk ID has remained the same but the partition IDs are now all in the 16,000s:
bladernr@weavile:~$ maas bladernr partitions read 8pk6f8 216
Success.
Machine-readable output follows:
[
{
"uuid": "b838b3db-
"size": 1599778848768,
"bootable": false,
"tags": [],
"used_for": "ext4 formatted filesystem mounted at /",
"type": "partition",
"path": "/dev/disk/
"uuid": "21aa8167-
},
"id": 16153,
},
{
"uuid": "94256eca-
"size": 536870912,
"bootable": false,
"tags": [],
"used_for": "fat32 formatted filesystem mounted at /boot/efi",
"type": "partition",
"path": "/dev/disk/
"uuid": "1b93141c-
},
"id": 16152,
}
]
I am pretty sure that testflinger is failing because it expects to see a partition ID of 882 and 883 on disk 216, but those no longer exist.
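The suspected mismatch can be checked mechanically. Below is a minimal sketch, assuming the stored layout and the 'partitions read' output have already been loaded into Python structures. The function names `find_stale_partition_ids` and `remap_by_size` and the trimmed-down data are hypothetical illustrations, not testflinger's actual code; only the IDs and sizes come from the report above.

```python
def find_stale_partition_ids(layout, maas_partitions):
    """Return partition IDs named in the layout that MAAS no longer reports."""
    live_ids = {str(p["id"]) for p in maas_partitions}
    return sorted(
        entry["number"]
        for entry in layout
        if entry.get("type") == "partition" and entry["number"] not in live_ids
    )


def remap_by_size(layout, maas_partitions):
    """Map each layout partition ID to the live MAAS ID with the same size.

    Assumption: sizes are unique per disk. Ambiguous or missing sizes are
    skipped rather than guessed.
    """
    ids_by_size = {}
    for p in maas_partitions:
        ids_by_size.setdefault(p["size"], []).append(str(p["id"]))

    mapping = {}
    for entry in layout:
        if entry.get("type") != "partition":
            continue
        candidates = ids_by_size.get(int(entry["size"]), [])
        if len(candidates) == 1:
            mapping[entry["number"]] = candidates[0]
    return mapping


# Trimmed-down copies of the layout and CLI output shown in this report.
layout = [
    {"type": "disk", "id": "216"},
    {"type": "partition", "number": "882", "size": "536870912"},
    {"type": "partition", "number": "883", "size": "1599778848768"},
]
maas_partitions = [
    {"id": 16153, "size": 1599778848768},
    {"id": 16152, "size": 536870912},
]

stale = find_stale_partition_ids(layout, maas_partitions)
mapping = remap_by_size(layout, maas_partitions)
print(stale)    # IDs in the layout that MAAS does not know about
print(mapping)  # stale ID -> current ID, matched on partition size
```

Matching on size is only one possible heuristic; a real fix might prefer to look IDs up at deploy time instead of storing them in the layout at all.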
Should we expect the partition IDs to change every time MAAS is updated, or is this a weird bug this time around? (I don't think we've updated MAAS since we implemented the disk layout in testflinger, so it's possible this has always been the case and we just never had a problem with it before.)
Note that the only thing that changed on our end was the MAAS snap update to 3.4.0; we did not update anything in the testflinger agents between yesterday and today. So I'm reasonably certain this is the root cause, at least from what I have seen over the last 30 minutes or so of poking at this.
Changed in maas:
milestone: 3.4.x → 3.5.x
tags: added: bug-council
tags: removed: bug-council
We need to confirm that an upgrade from 3.3.x to 3.4.0-rc2 changes partition IDs, and then re-triage this issue based on the outcome.
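One way to run that confirmation is to save the JSON from 'maas <name> partitions read <system_id> <disk_id>' before and after the upgrade and compare the two snapshots. The sketch below assumes the partition "uuid" field survives the upgrade and can serve as a stable key, which is not verified here. The function name `diff_partition_ids` and the placeholder UUID strings are hypothetical; the real UUIDs are truncated in this report.

```python
import json


def diff_partition_ids(before_json, after_json):
    """Return {uuid: (old_id, new_id)} for partitions whose MAAS ID changed.

    Assumption: the partition "uuid" is unchanged across the upgrade.
    Partitions missing from either snapshot are ignored.
    """
    before = {p["uuid"]: p["id"] for p in json.loads(before_json)}
    after = {p["uuid"]: p["id"] for p in json.loads(after_json)}
    return {
        uuid: (old_id, after[uuid])
        for uuid, old_id in before.items()
        if uuid in after and after[uuid] != old_id
    }


# Placeholder snapshots; the IDs mirror the ones reported in this bug.
before_json = json.dumps([
    {"uuid": "example-efi", "id": 882},
    {"uuid": "example-root", "id": 883},
])
after_json = json.dumps([
    {"uuid": "example-efi", "id": 16152},
    {"uuid": "example-root", "id": 16153},
])

changed = diff_partition_ids(before_json, after_json)
for uuid, (old_id, new_id) in sorted(changed.items()):
    print(f"{uuid}: {old_id} -> {new_id}")
```

An empty result on a test upgrade would suggest the ID change is not reproducible from the upgrade alone.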