Comment 64 for bug 557429

Steve Langasek (vorlon) wrote : Re: [Bug 557429] Re: array with conflicting changes is assembled with data corruption/silent loss

Dustin,

On Tue, Apr 20, 2010 at 03:33:15PM -0000, Dustin Kirkland wrote:
> I agree with Philip's assessment.

> While this is very easy to reproduce in a VM (by just removing/adding
> backing disk files), in practice and on real hardware, I think this is
> definitely less likely.

> When a real hardware disk fails, it should be removed from the system,
> and not come back until it's replaced with new hardware, in which case
> this bug will not be triggered. As Philip explained, this would only
> happen if an admin is adding and removing and booting with just one
> disk, and then the other, and then both. Don't do that.

Have I misunderstood the nature of this bug, or couldn't it be triggered by
a flaky SATA cable causing intermittent connections to the drives? If one
port flakes on one boot, the other port flakes on the next, and both ports
are available on the third, wouldn't that trigger this same bogus
reassembly?

In fact, if the admin is trying to debug the problem, maybe the system comes
up two out of five times without seeing any drives at all, or they've
physically swapped which disk is on which port *because the cable is
unreliable*, and by the fifth time they've thought to replace the cable and
things are reliable again - and *then* the perfectly-good disks get
corrupted because of this bug.
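
To make that sequence concrete, here is a toy Python model of a two-disk
mirror. It is only a sketch and not mdadm's actual logic: the per-disk write
log and the `boot` and `diverged` helpers are invented for illustration. Each
boot records a write on whichever disks happened to be visible, and two disks
count as diverged when neither history is simply an older copy of the other.

```python
def boot(disks, visible, label):
    """Record one boot's worth of writes on the disks that were detected."""
    for name in visible:
        disks[name].append(label)


def diverged(a, b):
    """True when the two histories disagree within their common length.

    If one log is just a prefix of the other, that disk is merely stale
    and can be resynced from its peer. If they disagree, both disks hold
    changes the other never saw.
    """
    n = min(len(a), len(b))
    return a[:n] != b[:n]


# Hypothetical device names; the flaky cable hides a different disk each boot.
disks = {"sda": [], "sdb": []}
boot(disks, ["sda"], "boot1")   # first boot: only one port works
boot(disks, ["sdb"], "boot2")   # second boot: only the other port works

# Third boot: both disks are visible again. Check before assembling.
print(diverged(disks["sda"], disks["sdb"]))
```

In this model the third boot is the dangerous one: both disks look healthy,
yet each carries writes the other lacks, so there is no safe choice of which
copy to treat as authoritative without asking the admin.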

So while it doesn't appear to be a recent regression, and is not a
high-frequency occurrence, it does look like a data loss bug that can occur
through no fault of the admin, and I certainly think our users need to be
warned of this in the release notes.

--
Steve Langasek                   Give me a lever long enough and a Free OS
Debian Developer                   to set it on, and I can move the world.
Ubuntu Developer                                    http://www.debian.org/
<email address hidden>                              <email address hidden>