Deploying a server with bcache on top of HDD and mdadm can frequently fail
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| MAAS | Fix Released | Critical | Alexsander de Souza | |
| MAAS 3.3 | Triaged | Undecided | Unassigned | |
| MAAS 3.4 | Fix Released | Undecided | Unassigned | |
| curtin | Fix Committed | Undecided | Alexsander de Souza | |
Bug Description
Environment:
* MAAS 3.3 and 3.4
* Ubuntu 22.04
* deployment / commissioning OS: 20.04 and 22.04
* servers to deploy with slow drives such as HDDs
When deploying a server that uses bcache as the device for its rootfs, especially on top of software RAID (mdadm) and with slow drives such as hard drives, the storage configuration step of the Ubuntu installation can fail quite frequently.
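For context, such a layout corresponds to a curtin storage configuration along these lines, rendered here as the equivalent Python structure; the device names and sizes are illustrative only (they mirror the reproducer topology below), not taken from the affected machines:

```python
# Illustrative curtin storage config (v1 schema) for a bcache-on-mdadm
# rootfs; names and sizes are assumptions for the example, not the
# customer's actual configuration.
storage_config = [
    {"id": "vda", "type": "disk", "path": "/dev/vda", "ptable": "gpt"},
    {"id": "vdb", "type": "disk", "path": "/dev/vdb", "ptable": "gpt"},
    {"id": "vdc", "type": "disk", "path": "/dev/vdc"},  # fast cache disk
    {"id": "vda2", "type": "partition", "device": "vda", "size": "29GB"},
    {"id": "vdb2", "type": "partition", "device": "vdb", "size": "29GB"},
    {"id": "md1", "type": "raid", "name": "md1", "raidlevel": 1,
     "devices": ["vda2", "vdb2"]},
    # Slow RAID1 array as the backing device, fast disk as the cache:
    {"id": "bcache0", "type": "bcache", "name": "bcache0",
     "backing_device": "md1", "cache_device": "vdc",
     "cache_mode": "writeback"},
    {"id": "root_fmt", "type": "format", "fstype": "ext4",
     "volume": "bcache0"},
    {"id": "root_mnt", "type": "mount", "device": "root_fmt", "path": "/"},
]
```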
# Reproducer
It is possible to recreate the slow-drive environment with libvirt using the following setup:
1) Create around 6 or more VMs (see the attached "create-" script), each with:
* 3 vCPUs
* 4 GB of RAM
* 3 disks:
* 1 x 10 GB fast disk, for the bcache cache
* 2 x 30 GB with limited IOPS (150 IOPS, 30 MB/s top speed; a throttling sketch follows this list)
2) Set up the following disk topology (see the "reproducer-" attachment):
* /dev/vda --> 2 partitions
- 1GB for md0
- 29GB for md1
* /dev/vdb --> 2 partitions
- 1GB for md0
- 29GB for md1
* /dev/md0 --> ext4 for /boot
* /dev/vdc (fast drive) --> bcache0 cache set
* /dev/md1 --> bcache0 backing device
* /dev/bcache0 --> ext4 for /
3) Deploy Ubuntu 22.04 to all VMs
--> some of the VMs will fail with the same curtin error
4) (Optional) Not erasing the drives when releasing the server and immediately redeploying seems to greatly increase the likelihood of the deployment failing.
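The IOPS limits from step 1 can be applied with libvirt's block I/O tuning. Below is a minimal sketch using virsh blkdeviotune; the VM names (node1..node6) are placeholders, and vda/vdb are the two 30 GB disks from the topology above:

```python
# Minimal sketch: throttle the two HDD-like disks of each reproducer VM
# so they behave like slow hard drives. VM names are placeholders.
import subprocess

VMS = [f"node{i}" for i in range(1, 7)]  # hypothetical VM names
SLOW_DISKS = ["vda", "vdb"]              # the two 30 GB backing disks

for vm in VMS:
    for disk in SLOW_DISKS:
        subprocess.run(
            ["virsh", "blkdeviotune", vm, disk,
             "--total-iops-sec", "150",                   # cap at 150 IOPS
             "--total-bytes-sec", str(30 * 1024 * 1024),  # ~30 MB/s
             "--live", "--config"],
            check=True,
        )
```

The same limits can also be set statically with an <iotune> element on each disk in the domain XML.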
# logs
I'm attaching some more logs to the bug report:
* quick-summary-
* reproducer-
# theory
At first glance, this looks like a race condition, because when reusing the same server and retrying the Ubuntu deployment, it may work fine.
It may be triggered because the hard drives are already busy with mdadm syncing the disks, and become even slower when a change such as creating a bcache backing device is requested; curtin then hits the race and fails.
On a large deployment such as OpenStack, this makes the installation process cumbersome, as one or more servers may randomly fail to deploy.
Looking at the installation output logs from MAAS, curtin seems to fail to confirm the backing device.
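That failure mode is consistent with curtin polling sysfs for the newly registered bcache device and giving up before the heavily loaded kernel has finished setting it up. The sketch below is not curtin's actual code, only an illustration of the kind of wait loop involved; the sysfs path and timeout values are assumptions:

```python
# Illustration only (not curtin's actual code): after a backing device
# is registered, a "bcache" directory appears asynchronously under its
# sysfs node. Under heavy mdadm resync I/O this can take much longer
# than on an idle machine, so a short fixed timeout races.
import os
import time

def wait_for_bcache(backing_dev="md1", timeout=60.0, interval=0.5):
    """Poll sysfs until the backing device has been claimed by bcache."""
    sysfs_dir = f"/sys/class/block/{backing_dev}/bcache"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.isdir(sysfs_dir):
            return
        time.sleep(interval)
    raise TimeoutError(f"{sysfs_dir} did not appear within {timeout}s")
```

Retrying for longer, or re-triggering the registration after a timeout, makes this kind of loop robust on slow drives.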
# main differences
## working
2024-02-… [truncated log excerpt]
## non-working
2024-02-… [truncated log excerpt]
Related branches
- Server Team CI bot: Approve (continuous-integration)
- curtin developers: Pending requested
Diff: 21 lines (+2/-2), 1 file modified: curtin/block/bcache.py (+2/-2)
Changed in maas:
status: New → Triaged
importance: Undecided → Critical
assignee: nobody → Alexsander de Souza (alexsander-souza)
milestone: none → 3.5.0
Changed in curtin:
assignee: nobody → Alexsander de Souza (alexsander-souza)
Changed in maas:
status: Triaged → In Progress
Changed in curtin:
status: New → Fix Committed
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
milestone: 3.5.0 → 3.5.0-beta1
status: Fix Committed → Fix Released
Subscribed ~Field High
It greatly penalises an ongoing deployment for a customer relying on bcache on hard drives.