raid10: Block discard is very slow, causing severe delays for mkfs and fstrim operations
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Fix Released | Medium | Matthew Ruffell |
Bionic | Fix Released | Medium | Matthew Ruffell |
Focal | Fix Released | Medium | Matthew Ruffell |
Groovy | Fix Released | Medium | Matthew Ruffell |
Hirsute | Fix Released | Medium | Matthew Ruffell |
Bug Description
BugLink: https:/
[Impact]
Block discard is very slow on Raid10, so common operations that invoke block discard, such as mkfs and fstrim, take a very long time.
For example, on an i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices that support block discard, a mkfs.xfs operation on Raid 10 takes between 8 and 11 minutes, whereas the same mkfs.xfs operation on Raid 0 takes 4 seconds.
The bigger the devices, the longer it takes.
The cause is that Raid10 currently uses a 512k chunk size, and uses this for the discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the request into millions of 512k bio requests, even if the underlying device supports larger requests.
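To put "millions" into numbers, here is a rough sketch of the arithmetic. It assumes 1.9TB means 1.9 * 10^12 bytes (the report does not say which unit convention applies), and `discard_bio_count` is a hypothetical helper name, not a kernel or util-linux tool.

```shell
# Ceiling division: how many bios are needed to cover a discard of
# total_bytes when each bio may carry at most max_bytes.
discard_bio_count() {
    total_bytes=$1
    max_bytes=$2
    echo $(( (total_bytes + max_bytes - 1) / max_bytes ))
}

# Assumption: 1.9TB taken as 1.9 * 10^12 bytes.
discard_bio_count 1900000000000 524288          # with the 512k Raid10 limit
discard_bio_count 1900000000000 2199023255040   # with the NVMe limit quoted in this report
```

With the 512k limit the count comes out at a little over 3.6 million bios, whereas a single request suffices at the NVMe limit.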
For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once:
$ cat /sys/block/
2199023255040
$ cat /sys/block/
2199023255040
The Raid10 md device, by contrast, only supports 512k:
$ cat /sys/block/
524288
$ cat /sys/block/
524288
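The comparison above can be scripted for several devices at once. This is a minimal sketch: the full sysfs paths are cut off in this report, so the `queue/discard_max_bytes` location is assumed from the standard sysfs block layout, the device names are taken from the test case, and `show_discard_limits` and `SYSFS_BLOCK` are hypothetical names introduced for illustration.

```shell
# Print the discard limit for each named block device.
# SYSFS_BLOCK defaults to /sys/block; it is overridable only so the
# function can be exercised against a fake directory tree.
show_discard_limits() {
    for dev in "$@"; do
        f="${SYSFS_BLOCK:-/sys/block}/$dev/queue/discard_max_bytes"
        if [ -r "$f" ]; then
            printf '%s %s\n' "$dev" "$(cat "$f")"
        else
            printf '%s unavailable\n' "$dev"
        fi
    done
}

# Assumed device names, following the test case in this report.
show_discard_limits nvme0n1 nvme1n1 md0
```

On an affected kernel the md device would be expected to report a far smaller value than its member devices.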
If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11 minutes and if we examine the stack, it is stuck in blkdev_
$ sudo cat /proc/1626/stack
[<0>] wait_barrier+
[<0>] regular_
[<0>] raid10_
[<0>] raid10_
[<0>] md_handle_
[<0>] md_submit_
[<0>] __submit_
[<0>] submit_
[<0>] submit_
[<0>] __blkdev_
[<0>] blkdev_
[<0>] blk_ioctl_
[<0>] blkdev_
[<0>] blkdev_
[<0>] block_ioctl+
[<0>] __x64_sys_
[<0>] do_syscall_
[<0>] entry_SYSCALL_
[Fix]
Xiao Ni has developed a patchset which resolves the block discard performance problems. These commits have now landed in 5.13-rc1.
commit cf78408f937a67f
Author: Xiao Ni <email address hidden>
Date: Thu Feb 4 15:50:43 2021 +0800
Subject: md: add md_submit_
Link: https:/
commit c2968285925adb9
Author: Xiao Ni <email address hidden>
Date: Thu Feb 4 15:50:44 2021 +0800
Subject: md/raid10: extend r10bio devs to raid disks
Link: https:/
commit f2e7e269a752531
Author: Xiao Ni <email address hidden>
Date: Thu Feb 4 15:50:45 2021 +0800
Subject: md/raid10: pull the code that wait for blocked dev into one function
Link: https:/
commit d30588b2731fb01
Author: Xiao Ni <email address hidden>
Date: Thu Feb 4 15:50:46 2021 +0800
Subject: md/raid10: improve raid10 discard request
Link: https:/
commit 254c271da0712ea
Author: Xiao Ni <email address hidden>
Date: Thu Feb 4 15:50:47 2021 +0800
Subject: md/raid10: improve discard request for far layout
Link: https:/
One additional commit is required, which was merged after "md/raid10: improve raid10 discard request". It enables Raid10 to use large discards instead of splitting them into many bios, since the technical hurdles have now been removed.
commit ca4a4e9a55beeb1
Author: Mike Snitzer <email address hidden>
Date: Fri Apr 30 14:38:37 2021 -0400
Subject: dm raid: remove unnecessary discard limits for raid0 and raid10
Link: https:/
The commits cherry-pick almost cleanly to the 5.11, 5.8, 5.4 and 4.15 kernels, with the following minor backports:
1) submit_bio_noacct() needed to be renamed to generic_
commit ed00aabd5eb9fb4
Author: Christoph Hellwig <email address hidden>
Date: Wed Jul 1 10:59:44 2020 +0200
Subject: block: rename generic_
Link: https:/
2) In the 4.15, 5.4 and 5.8 kernels, trace_block_
commit 1c02fca620f7273
Author: Christoph Hellwig <email address hidden>
Date: Thu Dec 3 17:21:38 2020 +0100
Subject: block: remove the request_queue argument to the block_bio_remap tracepoint
Link: https:/
3) bio_split(), mempool_alloc(), bio_clone_fast() all needed their "address of" '&' removed for one of their arguments for the 4.15 kernel, due to changes made in:
commit afeee514ce7f4ca
Author: Kent Overstreet <email address hidden>
Date: Sun May 20 18:25:52 2018 -0400
Subject: md: convert to bioset_
Link: https:/
4) The 4.15 kernel does not need "dm raid: remove unnecessary discard limits for raid0 and raid10" due to not having the following commit, which was merged in 5.1-rc1:
commit 61697a6abd24acb
Author: Mike Snitzer <email address hidden>
Date: Fri Jan 18 14:19:26 2019 -0500
Subject: dm: eliminate 'split_
Link: https:/
5) The 4.15 kernel needed bio_clone_
commit db6638d7d177a8b
Author: Dennis Zhou <email address hidden>
Date: Wed Dec 5 12:10:35 2018 -0500
Subject: blkcg: remove bio->bi_css and instead use bio->bi_blkg
Link: https:/
[Testcase]
You will need a machine with at least 4x NVMe drives which support block discard. I use an i3.8xlarge instance on AWS, since it has all of these things.
$ lsblk
xvda 202:0 0 8G 0 disk
└─xvda1 202:1 0 8G 0 part /
nvme0n1 259:2 0 1.7T 0 disk
nvme1n1 259:0 0 1.7T 0 disk
nvme2n1 259:1 0 1.7T 0 disk
nvme3n1 259:3 0 1.7T 0 disk
Create a Raid10 array:
$ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
Format the array with XFS:
$ time sudo mkfs.xfs /dev/md0
real 11m14.734s
$ sudo mkdir /mnt/disk
$ sudo mount /dev/md0 /mnt/disk
Optionally, run fstrim:
$ time sudo fstrim /mnt/disk
real 11m37.643s
There are test kernels for 5.8, 5.4 and 4.15 available in the following PPA:
https:/
If you install a test kernel, we can see that performance dramatically improves:
$ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
$ time sudo mkfs.xfs /dev/md0
real 0m4.226s
user 0m0.020s
sys 0m0.148s
$ sudo mkdir /mnt/disk
$ sudo mount /dev/md0 /mnt/disk
$ time sudo fstrim /mnt/disk
real 0m1.991s
user 0m0.020s
sys 0m0.000s
The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim from 11 minutes down to 2 seconds.
Performance Matrix (AWS i3.8xlarge):
Kernel | mkfs.xfs | fstrim
---|---|---
4.15 | 7m23.449s | 7m20.678s
5.4 | 8m23.219s | 8m23.927s
5.8 | 2m54.990s | 8m22.010s
4.15-test | 0m4.286s | 0m1.657s
5.4-test | 0m6.075s | 0m3.150s
5.8-test | 0m2.753s | 0m2.999s
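To read the matrix above as speedup factors, a small sketch like the following converts the `time`-style durations to seconds and divides them. `to_seconds` and `speedup` are hypothetical helper names, and the inputs are copied from the matrix.

```shell
# Convert a `time`-style duration such as 7m23.449s into seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Ratio of the stock-kernel time to the test-kernel time, rounded
# to the nearest whole number.
speedup() {
    awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
        'BEGIN { printf "%.0f\n", a / b }'
}

speedup 7m23.449s 0m4.286s   # 4.15 mkfs.xfs, stock vs. test kernel
speedup 8m23.927s 0m3.150s   # 5.4 fstrim, stock vs. test kernel
```

For these two rows the improvement works out to roughly two orders of magnitude.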
The test kernel also changes the discard_max_bytes to the underlying hardware limit:
$ cat /sys/block/
2199023255040
[Where problems can occur]
A problem has occurred once before, with the previous revision of this patchset. It is documented in bug 1907262, and in the worst case caused data loss for some users, in this particular case on the second and subsequent disks. This was due to two faults: the first was an incorrectly calculated start offset for block discard on the second and subsequent disks; the second was an incorrect stripe size for far layouts.
The kernel team was forced to revert the patches in an emergency, the faulty kernel was removed from the archive, and community users were urged to avoid the faulty kernel.
These bugs and a few other minor issues have now been corrected, and we have been testing the new patches since mid-February. The patches have been tested against the testcase in bug 1907262 and do not cause the disks to become corrupted.
The regression potential is still the same for this patchset though. If a regression were to occur, it could lead to data loss on Raid10 arrays backed by NVMe or SSD disks that support block discard.
If a regression happens, users need to disable the fstrim systemd service as soon as possible, plan an emergency maintenance window, and downgrade the kernel to a previous release, or upgrade to a corrected kernel.
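As a sketch of the first step, assuming the default Ubuntu setup in which periodic trims are driven by `fstrim.timer` triggering `fstrim.service`, the service could be disabled as follows. `disable_fstrim` is a hypothetical wrapper, and its runner argument exists only so the commands can be previewed without being executed.

```shell
# Stop periodic and in-flight trims. The first argument is the command
# used to run systemctl; it defaults to sudo, and passing `echo` prints
# the commands instead of running them.
disable_fstrim() {
    runner="${1:-sudo}"
    "$runner" systemctl disable --now fstrim.timer
    "$runner" systemctl stop fstrim.service
}

disable_fstrim echo   # dry run: show what would be executed
```

Calling `disable_fstrim` with no argument runs the same commands through sudo.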
Changed in linux (Ubuntu Bionic): | |
status: | New → In Progress |
Changed in linux (Ubuntu Focal): | |
status: | New → In Progress |
Changed in linux (Ubuntu Groovy): | |
status: | New → In Progress |
Changed in linux (Ubuntu Bionic): | |
importance: | Undecided → Medium |
Changed in linux (Ubuntu Focal): | |
importance: | Undecided → Medium |
Changed in linux (Ubuntu Groovy): | |
importance: | Undecided → Medium |
Changed in linux (Ubuntu Bionic): | |
assignee: | nobody → Matthew Ruffell (mruffell) |
Changed in linux (Ubuntu Focal): | |
assignee: | nobody → Matthew Ruffell (mruffell) |
Changed in linux (Ubuntu Groovy): | |
assignee: | nobody → Matthew Ruffell (mruffell) |
tags: | added: sts |
description: | updated |
Changed in linux (Ubuntu Groovy): | |
status: | In Progress → Fix Committed |
Changed in linux (Ubuntu Focal): | |
status: | In Progress → Fix Committed |
Changed in linux (Ubuntu Bionic): | |
status: | In Progress → Fix Committed |
Changed in linux (Ubuntu): | |
status: | Fix Released → In Progress |
Changed in linux (Ubuntu Bionic): | |
status: | Fix Released → In Progress |
Changed in linux (Ubuntu Focal): | |
status: | Fix Released → In Progress |
Changed in linux (Ubuntu Groovy): | |
status: | Fix Released → In Progress |
description: | updated |
tags: | removed: verification-done-bionic verification-done-focal verification-done-groovy |
Changed in linux (Ubuntu Hirsute): | |
status: | New → In Progress |
importance: | Undecided → Medium |
assignee: | nobody → Matthew Ruffell (mruffell) |
description: | updated |
Changed in linux (Ubuntu Bionic): | |
status: | In Progress → Fix Committed |
Changed in linux (Ubuntu Focal): | |
status: | In Progress → Fix Committed |
Changed in linux (Ubuntu Groovy): | |
status: | In Progress → Fix Committed |
Changed in linux (Ubuntu Hirsute): | |
status: | In Progress → Fix Committed |
This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.
If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.
See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!