data loss when disk fails during reshaping and re-added afterwards

Bug #2013280 reported by lvm
This bug affects 1 person

Affects: mdadm
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

When a disk fails while the array is being reshaped, but later recovers and is re-added to the array after the reshape has completed, the fact that the data on this disk was never fully rearranged is not taken into account: only the blocks marked in the write-intent bitmap are synced, which results in massive filesystem corruption.

Consider the following scenario:

* a new disk is added to an md RAID6 array, --raid-devices is increased accordingly, and reshaping starts
* one of the pre-existing disks in the array fails during the reshape; reshaping of the now-degraded array continues and completes successfully
* investigation shows that the disk failure was caused by a bad cable connection; the cable is reseated/replaced, and the failed disk comes back online, healthy
* the failed disk is added back to the array (mdadm -a /dev/md1 /dev/sda1)
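The sequence above can be sketched with mdadm commands (the device names /dev/md1, /dev/sda1, /dev/sde1 and the member count are illustrative only; do not run this against an array holding data):

```shell
# Add a new disk and grow the array; this starts the reshape.
mdadm /dev/md1 --add /dev/sde1
mdadm --grow /dev/md1 --raid-devices=5

# Mid-reshape, a pre-existing member (here /dev/sda1) drops out;
# the degraded reshape continues and eventually completes.
# After fixing the cabling, the disk is added back:
mdadm /dev/md1 --add /dev/sda1

# /proc/mdstat then shows the disk being re-added (a quick
# bitmap-based resync) rather than undergoing a full recovery.
cat /proc/mdstat
```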

The expected behaviour at this point is that the failed disk is treated as a new device and a full array recovery starts. Instead, it is re-added (as reported in /proc/mdstat), and synchronization takes an unrealistically short time - a couple of minutes, whereas a full recovery on the same hardware takes 30+ hours - presumably because only the blocks marked in the write-intent bitmap are synced. The result is massive filesystem corruption. If this disk is failed again, correct data is reconstructed from the parity on the disks that were fully reshaped, and the filesystem recovers. The workaround is to zero the superblock on the failed disk before re-adding it, which triggers a full recovery.
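The workaround described above, as a sketch (device names are illustrative):

```shell
# Wipe the md superblock so the disk is no longer recognized as a
# former array member and will be treated as a brand-new device.
mdadm --zero-superblock /dev/sda1

# Adding it back now triggers a full recovery instead of a
# bitmap-based resync.
mdadm /dev/md1 --add /dev/sda1
```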

This is by no means the latest version, so the issue may already be fixed, but since it can result in a complete loss of data on the array I'd rather report it.

mdadm - v4.1-rc1 - 2018-03-22
Linux 5.4.0-146-generic #163~18.04.1-Ubuntu SMP Mon Mar 20 15:02:59 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
