Default failback value is badly chosen

Bug #1641124 reported by Jon Skarpeteig
This bug affects 1 person
Affects                   Status  Importance  Assigned to  Milestone
multipath-tools (Ubuntu)  New     Wishlist    Unassigned

Bug Description

As described by:

https://help.ubuntu.com/lts/serverguide/multipath-setting-up-dm-multipath.html

The default value for failback is set to manual instead of immediate. This effectively defeats the purpose of multipath, which is meant to allow upgrading, e.g., the SAN A side first and then the B side once A is complete.

With this set to manual, the system effectively halts. On a system with 4 paths to a block device, you'll see this in the logs during a SAN firmware upgrade:

November 11th 2016, 13:58:05.000 3 systemd dev-disk-by\x2did-wwn\x2d0x600a098038303731702b486638665456.device: Dev dev-disk-by\x2did-wwn\x2d0x600a098038303731702b486638665456.device appeared twice with different sysfs paths /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/0000:04:02.0/0000:07:00.0/host0/rport-0:0-4/target0:0:3/0:0:3:0/block/sde and /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/0000:04:02.0/0000:07:00.0/host0/rport-0:0-2/target0:0:1/0:0:1:0/block/sda
November 11th 2016, 13:58:05.000 3 systemd dev-disk-by\x2did-scsi\x2d3600a098038303731702b486638665456.device: Dev dev-disk-by\x2did-scsi\x2d3600a098038303731702b486638665456.device appeared twice with different sysfs paths /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/0000:04:02.0/0000:07:00.0/host0/rport-0:0-4/target0:0:3/0:0:3:0/block/sde and /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/0000:04:02.0/0000:07:00.0/host0/rport-0:0-2/target0:0:1/0:0:1:0/block/sda
November 11th 2016, 13:58:05.000 3 systemd dev-disk-by\x2did-scsi\x2d3600a098038303731702b486638665456.device: Dev dev-disk-by\x2did-scsi\x2d3600a098038303731702b486638665456.device appeared twice with different sysfs paths /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/0000:04:03.0/0000:08:00.0/host7/rport-7:0-1/target7:0:0/7:0:0:0/block/sdc and /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/0000:04:02.0/0000:07:00.0/host0/rport-0:0-2/target0:0:1/0:0:1:0/block/sda
November 11th 2016, 13:58:05.000 4 kernel [585264.496735] sd 0:0:1:0: Asymmetric access state changed
November 11th 2016, 13:58:05.000 3 systemd dev-disk-by\x2did-scsi\x2d3600a098038303731702b486638665456.device: Dev dev-disk-by\x2did-scsi\x2d3600a098038303731702b486638665456.device appeared twice with different sysfs paths /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/0000:04:03.0/0000:08:00.0/host7/rport-7:0-1/target7:0:0/7:0:0:0/block/sdc and /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/0000:04:02.0/0000:07:00.0/host0/rport-0:0-2/target0:0:1/0:0:1:0/block/sda
November 11th 2016, 13:58:05.000 3 systemd dev-disk-by\x2did-wwn\x2d0x600a098038303731702b486638665456.device: Dev dev-disk-by\x2did-wwn\x2d0x600a098038303731702b486638665456.device appeared twice with different sysfs paths /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/0000:04:03.0/0000:08:00.0/host7/rport-7:0-1/target7:0:0/7:0:0:0/block/sdc and /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/0000:04:02.0/0000:07:00.0/host0/rport-0:0-2/target0:0:1/0:0:1:0/block/sda
November 11th 2016, 13:58:05.000 3 systemd dev-disk-by\x2did-wwn\x2d0x600a098038303731702b486638665456.device: Dev dev-disk-by\x2did-wwn\x2d0x600a098038303731702b486638665456.device appeared twice with different sysfs paths /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/0000:04:03.0/0000:08:00.0/host7/rport-7:0-1/target7:0:0/7:0:0:0/block/sdc and /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/0000:04:02.0/0000:07:00.0/host0/rport-0:0-2/target0:0:1/0:0:1:0/block/sda

The end result is that the filesystem is no longer available. If the root partition is on a multipath (SAN) device, then /bin etc. are gone.

Suggested fix: set failback to immediate as the default value.

This would ensure that what you would expect to happen actually happens: when you upgrade SAN firmware, machines continue running as if nothing happened, thanks to the redundant paths.
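
In the meantime, the setting can be overridden locally. A minimal sketch of such an override, assuming the configuration lives in /etc/multipath.conf and that the storage has a working prioritizer (neither is confirmed for every setup):

```
# /etc/multipath.conf (assumed location)
defaults {
    # Fail back to the highest-priority path group as soon as
    # it contains active paths again.
    failback immediate
}
```

After editing, multipathd has to re-read its configuration (for example with `multipathd reconfigure`), and `multipathd show config` should then list the effective value.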

Nish Aravamudan (nacc) wrote :

Hello and thank you for reporting this bug! Changing default values is a scary proposition (to me), because we have to think about every possible environment.

What are the implications of failback=immediate over failback=manual?

Also, the base configuration for multipath "works" for all cases (I guess unless you are installing during a failover?); if you know your setup for multipath should use a different setting, you are able to manage that easily (as you noted) by changing the configuration file.

The concerning part from the manpage is:

Tell multipathd how to manage path group failback.
To select immediate or a value, it's mandatory that the device
has support for a working prioritizer.

We do not know that every device of every Ubuntu Server instance already using multipath-tools has a "working prioritizer", do we? I'm not sure what that even is, but I can guess from context in multipath :)

I'm also going to unsubscribe Ubuntu Server, as right now there is not (in my opinion) anything for the Server Team to change -- multipath policy is fraught with danger :) Honestly, we take the default from upstream, because it 'just works'. If you want to see that change, I would work with the upstream community (maybe 'immediate to manual', which uses 'immediate' if it is detected that it can be used ('working prioritizer') or somesuch), but that's outside the scope of the package in Ubuntu.

Changed in multipath-tools (Ubuntu):
importance: Undecided → Wishlist
Chris Hofstaedtler (zeha) wrote :

Which failback value (immediate or manual) is correct depends on the actual setup. The *default* value is also hardware dependent, as can be seen in libmultipath/hwtable.c. If you have a SAN setup not covered by the built-in defaults, I would suggest sending corrected values for *that* specific SAN upstream.
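
Until such values land upstream, a local per-device override might look roughly like the sketch below. The vendor and product strings are placeholders, not values taken from this report, and the prio and path_grouping_policy lines are assumptions for an ALUA-capable array; all of them would need to match what the actual hardware reports.

```
devices {
    device {
        # Placeholder identification strings; replace with the
        # vendor/product the array actually reports.
        vendor  "EXAMPLE_VENDOR"
        product "EXAMPLE_PRODUCT"
        # Assumed: the array supports ALUA, which gives multipathd
        # a prioritizer to rank path groups with.
        prio alua
        path_grouping_policy group_by_prio
        failback immediate
    }
}
```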
