mdadm resyncs imsm raid in "Normal" state

Bug #1320402 reported by Martin Stjernholm
This bug affects 17 people
Affects          Status         Importance   Assigned to   Milestone
mdadm (Fedora)   Fix Released   Medium
mdadm (Ubuntu)   Confirmed      Undecided    Unassigned

Bug Description

I've got an imsm raid1 which I don't boot from. Whenever the bios reports the raid as being in "Normal" state, mdadm starts a resync of it after boot. If a resync is already underway (the bios reports it as being in "Verify" state), it continues where it left off.

This appears to be very similar to a problem in Fedora: https://bugzilla.redhat.com/show_bug.cgi?id=753335 It has apparently been fixed there for some time now. (The workaround mentioned in comment 41 in that ticket doesn't work in Ubuntu, but I guess that's due to distro differences in initramfs and systemd.)

Using mdadm 3.2.5-5ubuntu4 in a fresh Trusty install.

Tags: patch
Revision history for this message
Martin Stjernholm (msub) wrote :

Another observation: This system is a dual boot with Windows (why else use imsm?), and if I shut down from Windows with the raid in healthy state, mdadm doesn't start a resync. It is only if I shut down normally/cleanly from Ubuntu that that happens.

Revision history for this message
The Setlaz (dam-brouard) wrote :

I will watch that carefully, as I noticed pretty much the same thing!
At first, on Linux, the RAID was in Initialize state. You can power down and up, and Ubuntu will continue the re-sync from where it stopped (expected behavior).
After the full re-sync with Ubuntu, the BIOS displayed Normal status for the RAID1.

From there, I booted into Windows, and iRST was doing a Verification from scratch. That put the RAID in Verify state in the bios.
Booting back into Ubuntu started a full re-sync.

I need to check whether rebooting into Ubuntu straight after it completes the full re-sync (when the BIOS displays Normal state) triggers a new re-sync or not.

mdadm 3.2.5 on Trusty.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in mdadm (Ubuntu):
status: New → Confirmed
Revision history for this message
The Setlaz (dam-brouard) wrote :

Windows finished the Verification and the RAID1 went into Normal state. I could reboot into both Ubuntu and Windows several times while the RAID1 stayed in Normal state.

When I created a file on the RAID array in Ubuntu, though, it started a full resync again after the next reboot.

dmesg | grep md:
[ 3.261354] md/raid1:md126: not clean -- starting background reconstruction
[ 3.261357] md/raid1:md126: active with 2 out of 2 mirrors
[ 3.261369] md126: detected capacity change from 0 to 3000590401536
[ 3.265360] md126: p1 p2
[ 3.282575] md: md126 switched to read-write mode.
[ 3.282620] md: resync of RAID array md126

Revision history for this message
The Setlaz (dam-brouard) wrote :

Alright, my bug is slightly different from yours. See http://ubuntuforums.org/showthread.php?t=2224874

Ubuntu is not systematically resync-ing the RAID1 array after each reboot for me.

Revision history for this message
Martin Stjernholm (msub) wrote :

I can confirm that if the raid stays in auto-read-only state (i.e. isn't written to), then it won't start a resync on the next boot. I verified that by not mounting any of the file systems (for me it's enough to mount a file system to cause a write - it doesn't have to be a file write). That being the case, is there any significant difference from your case, dam-brouard?
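
(A quick way to check whether an array is still in that auto-read-only state; the device name below is an example, not from the original report:)

    cat /proc/mdstat                                # "(auto-read-only)" shows next to the array
    sudo mdadm --detail /dev/md126 | grep -i state  # e.g. "State : clean"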

Also, if I stop the raid and the container manually before reboot, it stays clean on the next boot as well.

I found this Gentoo bug which is similar as well: https://bugs.gentoo.org/show_bug.cgi?id=395203 It pointed out that mdmon should be kept running during the shutdown to write out the external metadata after the raid is stopped. So I tried a similar fix in /etc/init.d/killprocs to spare any mdmon processes. That didn't help, but I haven't verified that mdmon is kept running long enough - it could be a necessary but not sufficient fix.
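
(For reference, a sketch of what such a killprocs tweak might look like; this is an assumed reconstruction, not the poster's actual edit. killall5's -o flag omits the given pids from the killing spree:)

    # Build an omit list so killall5 spares any running mdmon instances
    OMITPIDS=""
    for pid in $(pidof mdmon); do
        OMITPIDS="$OMITPIDS -o $pid"
    done
    killall5 -15 $OMITPIDS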

Revision history for this message
The Setlaz (dam-brouard) wrote :

Hi Martin,

My RAID array is automatically mounted through fstab and stays in "auto-read-only" state as you mentioned. Even mounting it, unmounting it manually, and re-mounting does not cause it to go into "write" state for me. That's the only difference.

Thanks for pointing out the Gentoo bug. Our issues might actually be the same; given the behavior, I wouldn't be surprised.

Revision history for this message
Martin Stjernholm (msub) wrote :

I devised a workaround for this by mounting and unmounting my raid separately. The key is that I don't need it during boot, nor do I start any daemons that keep files open on it.

First, I added the "noauto" option to all the mounts on the raid in /etc/fstab.

Then I added an upstart script as below. Note that the raid device and the mount points are hardcoded, so anyone using this needs to adapt them. It also doesn't hook into plymouth (aka bootsplash) like the normal mountall stuff does, so there aren't any nice messages and prompts if the mounts fail.

/etc/init/local-mountraid.conf:

description "Mount the imsm raid separately to work around LP #1320402"

start on filesystem
stop on runlevel [!23]
task

post-start script
    mdadm --assemble /dev/md/vol0 || :
    mount /more
    mount /d
end script

post-stop script
    umount /d
    umount /more
    mdadm --stop /dev/md/vol0
end script

Revision history for this message
The Setlaz (dam-brouard) wrote :

Hi Martin,

Thanks for that piece of script !

Are you also facing the bug where any write operation on the RAID1 file system under Ubuntu will trigger a Verification of the array on Windows?

Damien

Revision history for this message
Martin Stjernholm (msub) wrote :

Yes, the bios reports the raid as normal, but a resync is started regardless of whether I boot Ubuntu or Windows. So I think it's pretty clear that the problem is in writing a clean state down to the metadata block during the Ubuntu shutdown.
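
(One way to inspect that metadata state directly; the device name is an example, point --examine at a raid member disk:)

    sudo mdadm --examine /dev/sda    # dumps the on-disk imsm metadata
    sudo mdadm --detail-platform     # shows the platform's imsm capabilities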

Revision history for this message
Matthew Joyner (matthewjoyner) wrote :

I also have a similar problem. I had a lot of trouble setting up mdadm raid.

I have mdadm for my root partition as well as others, so I cannot use the workaround displayed above.

I believe the problem is something to do with the initramfs not closing down mdmon properly on shutdown.

I also have the problem of the desktop showing not only the imsm mdadm devices but also the partitions of the raw disk devices. It does not do it every time, though, so I end up with three copies of the same partition showing up: one for each mirrored device and one for the virtual array.

I have two mirrored Western Digital Red drives, using Intel Matrix raid. I dual boot with Windows.

I'm not sure whether I should switch back to dmraid; it worked fine on 12.04.

Any suggestions or progress on this issue?

Revision history for this message
ChrisMN (chris.mn) wrote :

I also experienced the same problem with an IMSM RAID 1. I believe the problem is essentially identical to the Gentoo problem noted earlier. I put in the following workaround, which seems to work (on a clean install of Ubuntu 14.04.1 with mdadm version 3.2.5). At least, I have not seen a resync on boot since putting in this fix.

1. Edit /etc/init.d/sendsigs (to prevent mdmon from being killed on shutdown):
 OMITPIDS=`pidof mdmon | sed -e 's/ /,/g' -e 's/./-o \0'`
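
(Note: the second sed expression above is unterminated and errors out, as later comments report. A guess at the intended form, producing killall5-style omit arguments such as "-o 123,456":)

    OMITPIDS=`pidof mdmon | sed -e 's/ /,/g' -e 's/^/-o /'`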

2. Edit /etc/init.d/umountroot (to force the system to wait for the RAID to be clean before shutdown):

##########################
#FIRST: Add the following function:
check_raid() {
 log_action_begin_msg "Checking RAIDs"
 if mdadm --wait-clean --scan ; then
  log_action_end_msg 0
 else
  log_action_end_msg 1
 fi
}

#SECOND: change the call to do_stop:
#stop)
# do_stop
# ;;

stop)
 check_raid
 do_stop
 check_raid
 ;;
##########################

It isn't clear that check_raid has to be called both before and after do_stop (probably after will work fine, but I didn't test it).

Revision history for this message
MatthewHawn (steamraven) wrote :

Comment #12 worked for me! Thanks Chris.mn.

However, I believe the more canonical way of omitting the PIDs to kill would be to use the /run/sendsigs.omit.d directory. Interestingly, /etc/init.d/mdadm already adds mdmon to this, but only on stop, not start. I don't understand why, and it may be a mistake.

In any case, I have attached a patch to /etc/init.d/mdadm that adds the mdmon PID to the sendsigs.omit.d directory on start.

The check_raid in umountroot is also needed for me, and I have included that in the patch.
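
(For anyone unfamiliar with that mechanism: sendsigs reads pid lists from files in /run/sendsigs.omit.d and spares those processes, so conceptually the patch boils down to something like the following. This is a sketch of the idea, not the literal patch content:)

    mkdir -p /run/sendsigs.omit.d
    pidof mdmon > /run/sendsigs.omit.d/mdmon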

Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote :

The attachment "mdadm.patch" seems to be a patch. If it isn't, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-reviewers, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issues please contact him.]

tags: added: patch
Revision history for this message
John Center (john-center) wrote :

I have suffered from the same problem, running 12.10, 13.10 and now 14.04. I just tried your patches, Matthew, and they appeared to work. I only rebooted once, however; I'll test some more later.

Thanks!

    -John

Revision history for this message
Dorijan (dmailj) wrote :

Hi to all..
I have the same problem with a fresh install of Ubuntu 14.04.

Can somebody help me with how to apply this patch? It seems I am missing the files Downloads/init.d/mdadm and Downloads/init.d/umountroot?
Thank you...
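
(The "missing files" message suggests patch was run from a directory that doesn't match the paths stored in the patch header. A hedged example of applying it, assuming the patch was saved to ~/Downloads; adjust the -p level to whatever --dry-run reports:)

    cd /etc
    sudo patch -p0 --dry-run < ~/Downloads/mdadm.patch   # check the target paths first
    sudo patch -p0 < ~/Downloads/mdadm.patch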

Revision history for this message
Eren (erent) wrote :

I have a similar issue with Ubuntu Server 14.04.2, and I confirm that the patch in #13 is working flawlessly. The disks are no longer synced every time I boot the server.

Are the maintainers going to fix this issue in the next release?

Revision history for this message
Gabriel Devenyi (ace-staticwave) wrote :

Just being bitten by this now. This is a serious issue for mdadm root filesystems on RAID1: if there were actual disk issues, this could completely destroy a system, the complete opposite of what RAID is supposed to do!

Maintainers, the users fixed the problem for you, please integrate the fix.

Revision history for this message
Tim Kosem (timkosem) wrote :

The patch in comment #13 worked like a charm for me on 14.04.1 LTS. Syncs on reboot stopped cold (after running to completion, of course) once the patch was applied and the system rebooted.

Revision history for this message
Darin Avery (darin-avery) wrote :

This problem just started happening for me too. I tried only the changes in #13, but it still resyncs. Are the changes in #12 necessary as well?

I tried the pidof command on its own and got this: "sed: -e expression #2, char 9: unterminated `s' command"

Also ps ax|grep mdmon shows nothing, and mdmon /dev/md0 gives "mdmon: md0 is not a container - cannot monitor" so it's not clear to me when, if ever, mdmon is running or how to make it run.

1. Is mdmon supposed to be running all the time?
2. Does this work for an mdadm array on any hardware, or only intel? (imsm means intel matrix storage, correct?)
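
(For what it's worth: imsm does stand for Intel Matrix Storage, and mdmon is only needed for external-metadata arrays like imsm and ddf; native md metadata is handled by the kernel. mdmon also operates on the container device, not the member volume, which is why "mdmon /dev/md0" fails with "not a container". The device names below are examples:)

    cat /proc/mdstat      # an imsm setup shows a container (e.g. md127, inactive)
                          # plus the actual volume (e.g. md126)
    sudo mdmon md127      # monitor the container, not the volume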

In case anyone else is having this, instead of shutting down, try suspend. That doesn't trigger the resync for me.

Thanks.

Revision history for this message
Stefano Torresi (stefanotorresi) wrote :

I am also experiencing this problem when dual booting Windows 10 and Ubuntu 15.10, which supposedly has the fix implemented in /etc/init.d/mdadm-waitidle.

Whenever I reboot the system from one OS to the other, even if the raid1 array appears to be in "normal" status in the bios POST, the array gets re-verified by either mdadm or the Windows Intel Rapid Storage manager tool.

Revision history for this message
Elias Kouskoumvekakis (eliaskousk) wrote :

I also had the exact same problem with an Ubuntu 15.10 and Windows 10 dual-boot setup, and now that I have upgraded to Ubuntu 16.04 LTS it is unfortunately still there. mdadm or the Intel tool in Windows will always try to resync the array on each reboot.

Revision history for this message
Ivan (ivus-b) wrote :

I have also bumped into the "imsm raid always resyncs after boot" issue under Ubuntu 16.04 LTS.
I can see that the mdadm-waitidle script really is present in /etc/init.d/, but at the same time there are no symbolic links to this script in the /etc/rcX.d dirs, i.e. by default the script never executes. I have tried to restore the Kxxmdadm-waitidle symlinks in /etc/rc0.d etc., but I cannot say yet whether this helps or not; my raid is currently resyncing.
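
(To regenerate the rcX.d symlinks from the script's LSB header instead of creating them by hand, something like the following should work:)

    sudo update-rc.d mdadm-waitidle defaults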

Revision history for this message
Elias Kouskoumvekakis (eliaskousk) wrote :

It seems that the resync doesn't happen anymore when I reboot my 16.04 system. Maybe a recent update (in the last 2 weeks) fixed it. Does anybody else still have the problem?

Also, mdadm-waitidle now has the symlinks in /etc/rc0.d and /etc/rc6.d by default; I didn't install them myself. Maybe the update did just that, and I didn't have them before, hence the problem.

Revision history for this message
Ivan (ivus-b) wrote :

My playing with rc0.d/rc6.d was unsuccessful; the script was not executed at all. After some digging it turned out that Ubuntu 16.04 LTS does not use the scripts from init.d/rcN.d at all; systemd services are used now. I tried to create a service, but it always executes before umount, or doesn't execute at all. So I was forced to add explicit commands to the script to stop the known services which use the raid and to unmount /mnt/raid directly. It executes, but again I don't know yet whether it is successful or not; the raid is in resync state.
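
(On a systemd system, one late-shutdown hook that sidesteps the unit-ordering problem is the /lib/systemd/system-shutdown/ directory: systemd-shutdown runs every executable in it very late, after services are killed and filesystems are unmounted. A minimal, untested sketch; the filename is arbitrary, and it assumes mdmon is still alive at that point - recent mdmon marks itself with a leading "@" precisely so systemd's killing spree spares it:)

    #!/bin/sh
    # /lib/systemd/system-shutdown/mdadm-wait-clean (name is an example)
    # $1 is "halt", "poweroff", "reboot" or "kexec"; remember to chmod +x.
    /sbin/mdadm --wait-clean --scan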

Revision history for this message
Martin Stjernholm (msub) wrote :

I've had this problem since wily (15.10) - the workarounds in this ticket stopped working then. It's still the same in xenial (upgraded yesterday, so it's fresh).

Fwiw I can say it's still a problem at shutdown rather than startup: If I reboot to Windows it's detected as unclean and a resync is done, and if I shut down Windows with the raid clean it starts clean in Ubuntu.

Revision history for this message
Roger Lawhorn (rll-m) wrote :

I am running Linux Mint 18 64-bit with the Cinnamon desktop.
I never had this issue with Mint 17 (Ubuntu 14.04).
Mint 18 does it on every single reboot.
I have implemented the fix from #12.
I have yet to test.
It takes 4hrs for my raid 1 mirror to resync.
I cannot live with this bug.

Revision history for this message
Roger Lawhorn (rll-m) wrote :

I have tried two different fixes for this (#12 and #13).
It won't stop.
At this point I can no longer use raid at all.
If my boot drive was raid 5 I'd be screwed as I could not turn it off at all.
I use raid 1 so I am lucky.

Can someone post a command to FORCE the raid to be clean?
Perhaps a forced immediate sync?
The 'mdadm --wait-clean --scan' only tells mdadm to do it as soon as possible.
It does not work either.

I have resorted to running "echo 5000 > /proc/sys/dev/raid/speed_limit_max" on boot to gain control of my machine back. It slows the resync to 5 MB/sec instead of 120 MB/sec. However, with this temp fix I never have a full raid mirror backup.
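
(To make that throttle persistent instead of re-running it each boot, a sysctl drop-in is one option; the value is in KB/s and the filename is an example:)

    echo 'dev.raid.speed_limit_max = 5000' | sudo tee /etc/sysctl.d/90-raid-throttle.conf
    sudo sysctl --system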

I use my machine to do real work.
This is beyond unbearable and unthinkable that this bug has no cure yet.

Revision history for this message
Roger Lawhorn (rll-m) wrote :

Just a note on #12: your change to sendsigs is broken, at least on Mint 18.
The process mdmon will not be found.
It is actually /sbin/mdadm --monitor ....

I set OMITPIDS to OMITPIDS=`pidof /sbin/mdadm`
When using pidof, the full path of the executable should be specified.

Your second sed expression gives an error for me.

I just made this change and have to wait 4 hrs for my resync to finish before I can reboot to test. I will repost the results.

I have done a lot of research on this, and mdmon is the issue. It must NOT be killed during a shutdown, according to its own docs.

#1. Edit /etc/init.d/sendsigs (to prevent mdmon from being killed on shutdown):
# OMITPIDS=`pidof mdmon | sed -e 's/ /,/g' -e 's/./-o \0'`

Revision history for this message
Roger Lawhorn (rll-m) wrote :

Small update: at the 23% sync mark I rebooted.
If my change had failed, I'd have gone back to 0% like usual.
It didn't; it stayed at the 23% mark.
Partial confirmation that the fix to /etc/init.d/sendsigs does work.
:-D

Revision history for this message
Roger Lawhorn (rll-m) wrote :

Update results (using my fix):
Reboot with resync in progress = starts where it left off (e.g. 23%).
Reboot after the full resync is done = goes back to 0% and resyncs all 2 TB all over again.
Why is this bug not given importance or assigned to anyone?
It is destroying my life. That's all. I have work to do.

Revision history for this message
Roger Lawhorn (rll-m) wrote :

I found this on the net after searching for "kernel 4.4 mdadm" (a page about mdadm; follow the first link):
"Kernel versions in series 4.1 through 4.4.1 have a broken RAID implementation. Use a kernel with version at or above 4.4.2."

I am using kernel 4.4.0-24, therefore I have a broken raid.

Hope this helps someone else.

Revision history for this message
Roger Lawhorn (rll-m) wrote :

Confirmed. Kernel 4.4.2 or higher fixes this nasty, nasty, life-ruining raid bug. Thank God, it's gone!

Revision history for this message
Martin Stjernholm (msub) wrote :

Roger, many thanks for finding that. At first it looked like it wouldn't work for me, but it turns out that both a kernel >= 4.4.2 AND the script fixes in comment #13 are necessary.

I have also tested and found that both fixes in #13 are still required:

- /etc/init.d/mdadm: The mdmon pid is put into sendsigs.d at stop, but that's not enough -
   it has to happen already at start. MatthewHawn commented on that already in #13.

- /etc/init.d/umountroot: The "mdadm --wait-clean --scan" kludge there is also still
   required. I note there's a script /etc/init.d/mdadm-waitidle that looks like it's
   supposed to do this, but it doesn't actually work. Probably some kind of sequencing
   problem, but I haven't dug into that.

So, the recipe to work around this bug in wily and xenial ought to be:

1. Download and apply MatthewHawn's patch in #13
    (https://launchpadlibrarian.net/191488884/mdadm.patch).

2. Go to e.g. http://kernel.ubuntu.com/~kernel-ppa/mainline/ and pick a kernel >= 4.4.2
    (I've tested with 4.4.2 and the latest 4.4 atm, 4.4.17). Download the .debs for the
    architecture you need and install them using dpkg -i. See
    https://wiki.ubuntu.com/Kernel/MainlineBuilds for more detailed instructions.
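
(For step 2, the installation itself is just dpkg over the downloaded .debs; the version strings below are placeholders for whatever you actually fetched:)

    cd ~/Downloads
    sudo dpkg -i linux-headers-4.4.2-*_all.deb \
                 linux-headers-4.4.2-*-generic_amd64.deb \
                 linux-image-4.4.2-*-generic_amd64.deb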

Revision history for this message
John Center (john-center) wrote :

Since I upgraded to 16.04, the mdadm resync has started again. How do these changes map to the new systemd configuration? What is the process that systemd uses to start up mdadm? Will this change going forward to 16.10, etc.?

Revision history for this message
Charles Joseph Christie II (sonicbhoc) wrote :

I have been affected by this bug on 16.04.

I will have to either switch to a distribution that has this fix or lets me fix it myself, or cease Linux usage entirely until the fix is backported to 16.04.

Revision history for this message
Sergio Callegari (callegar) wrote :

See also https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1587142

mdadm with imsm metadata seems to be completely broken in Ubuntu.

Having something as common as RAID1 on Intel boards broken is not very nice.

Until this can be fixed, it would be good to at least add a very visible warning in the mdadm documentation and in the distro release notes.

Revision history for this message
Sergio Callegari (callegar) wrote :

To me, this is highly critical. If you cannot use mdadm to manage RAID on dual boot systems that need the imsm metadata (and obviously you cannot if it means paying the price of constant resyncing), then you need to resort to dmraid, which is very badly maintained and fails to properly finalize the dm state after a resync whenever the bios thinks that the array is in a "verify" state (whatever that means).

Now, please do not underestimate the need to dual boot. Even machines that are on Linux 99.9% of the time may need to boot the OS they were sold with to do things like bios updates (either for the motherboard or for some of its components) or to be able to call in the manufacturer for assistance in case of failures.

Revision history for this message
Sergio Callegari (callegar) wrote :

With respect to #34, isn't the Ubuntu 4.4.0 kernel meant to pick up important security updates and fixes from the mainline 4.4.x stable branch?

Revision history for this message
Clemens Steinkogler (etc-v) wrote :

Tried kernel 4.7.5 from mainline with the mdadm patch from #13; no help. Still resyncing after every reboot/start.

Revision history for this message
Martin Stjernholm (msub) wrote :

#34 worked for me for all of about two weeks, then no more. I suspect an update, but I couldn't find an update of any package that seemed relevant.

I've now resorted to my first workaround in #8, i.e. to mount and unmount the raid partitions separately from the ordinary boot and shutdown process. That of course only works when the raid filesystems aren't used during boot.

Since upstart is gone I converted my workaround to an old-school init.d script, which works with systemd. I've attached it in case it helps someone, but note that MDDEV and MOUNTS will need to be changed for the local configuration.

While trying the script manually, I did notice that umount sometimes failed due to the filesystem being busy. I couldn't find any process with open files, though, and during shutdown such things should be cleaned up earlier anyway. That might be a hint about this latest failure.

I usually stick to technical issues in bug reports, but I have to say that this problem has gotten quite tedious and the lack of action - even for an open source project - is disappointing.

Revision history for this message
Clemens Steinkogler (etc-v) wrote :

Thank you for the script, Martin Stjernholm. It seems to be working, although the unmounting doesn't seem to be done by the script when I shut down or reboot my PC; I added a few logger statements and I'm only seeing the mount messages in my logs. Another negligible problem is that autologin doesn't work anymore.

This bug is more than 2 years old and there is no official solution? Things like this are Linux's show stoppers!

Revision history for this message
Maksim (swinger) wrote :

Hm, I used Server 16.04.1 LTS with the patch from post #13 and additionally upgraded the kernel to 4.7 or 4.8.
NOTE: when using only mdadm.patch, the raid1 resyncs on every boot.

Now I am using the newest kernel with raid1, and all works fine:
Linux srv-design 4.8.0-040800-generic #201610022031 SMP Mon Oct 3 00:32:57 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Changed in mdadm (Fedora):
importance: Unknown → Medium
status: Unknown → Fix Released