As described in:
https://help.ubuntu.com/lts/serverguide/multipath-setting-up-dm-multipath.html
The default value for failback is manual instead of immediate. This effectively breaks the idea of multipath, which should allow you to upgrade, e.g., the SAN A side and then upgrade the B side once A is complete.
With failback set to manual, the system effectively halts. On a system with 4 paths to a block device, the following appears in the logs during a SAN firmware upgrade:
November 11th 2016, 13:58:05.000 3 systemd dev-disk-by\x2did-wwn\x2d0x600a098038303731702b486638665456.device: Dev dev-disk-by\x2did-wwn\x2d0x600a098038303731702b486638665456.device appeared twice with different sysfs paths /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/0000:04:02.0/0000:07:00.0/host0/rport-0:0-4/target0:0:3/0:0:3:0/block/sde and /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/0000:04:02.0/0000:07:00.0/host0/rport-0:0-2/target0:0:1/0:0:1:0/block/sda
November 11th 2016, 13:58:05.000 3 systemd dev-disk-by\x2did-scsi\x2d3600a098038303731702b486638665456.device: Dev dev-disk-by\x2did-scsi\x2d3600a098038303731702b486638665456.device appeared twice with different sysfs paths /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/0000:04:02.0/0000:07:00.0/host0/rport-0:0-4/target0:0:3/0:0:3:0/block/sde and /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/0000:04:02.0/0000:07:00.0/host0/rport-0:0-2/target0:0:1/0:0:1:0/block/sda
November 11th 2016, 13:58:05.000 3 systemd dev-disk-by\x2did-scsi\x2d3600a098038303731702b486638665456.device: Dev dev-disk-by\x2did-scsi\x2d3600a098038303731702b486638665456.device appeared twice with different sysfs paths /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/0000:04:03.0/0000:08:00.0/host7/rport-7:0-1/target7:0:0/7:0:0:0/block/sdc and /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/0000:04:02.0/0000:07:00.0/host0/rport-0:0-2/target0:0:1/0:0:1:0/block/sda
November 11th 2016, 13:58:05.000 4 kernel [585264.496735] sd 0:0:1:0: Asymmetric access state changed
November 11th 2016, 13:58:05.000 3 systemd dev-disk-by\x2did-scsi\x2d3600a098038303731702b486638665456.device: Dev dev-disk-by\x2did-scsi\x2d3600a098038303731702b486638665456.device appeared twice with different sysfs paths /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/0000:04:03.0/0000:08:00.0/host7/rport-7:0-1/target7:0:0/7:0:0:0/block/sdc and /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/0000:04:02.0/0000:07:00.0/host0/rport-0:0-2/target0:0:1/0:0:1:0/block/sda
November 11th 2016, 13:58:05.000 3 systemd dev-disk-by\x2did-wwn\x2d0x600a098038303731702b486638665456.device: Dev dev-disk-by\x2did-wwn\x2d0x600a098038303731702b486638665456.device appeared twice with different sysfs paths /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/0000:04:03.0/0000:08:00.0/host7/rport-7:0-1/target7:0:0/7:0:0:0/block/sdc and /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/0000:04:02.0/0000:07:00.0/host0/rport-0:0-2/target0:0:1/0:0:1:0/block/sda
November 11th 2016, 13:58:05.000 3 systemd dev-disk-by\x2did-wwn\x2d0x600a098038303731702b486638665456.device: Dev dev-disk-by\x2did-wwn\x2d0x600a098038303731702b486638665456.device appeared twice with different sysfs paths /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/0000:04:03.0/0000:08:00.0/host7/rport-7:0-1/target7:0:0/7:0:0:0/block/sdc and /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/0000:04:02.0/0000:07:00.0/host0/rport-0:0-2/target0:0:1/0:0:1:0/block/sda
The end result is that the filesystem is no longer available. If the root partition is on a multipath (SAN) device, then /bin etc. are gone.
Suggested fix: set failback to immediate as the default value.
This would ensure that what you expect to happen actually happens: when you upgrade SAN firmware, machines keep running as if nothing happened, thanks to the redundant paths.
Hello and thank you for reporting this bug! Changing default values is a scary proposition (to me), because we have to think about every possible environment.
What are the implications of failback=immediate over failback=manual?
Also, the base configuration for multipath "works" for all cases (I guess unless you are installing during a failover?); if you know your setup for multipath should use a different setting, you are able to manage that easily (as you noted) by changing the configuration file.
The concerning part from the manpage is:
Tell multipathd how to manage path group failback.
To select immediate or a value, it's mandatory that the device has support for a working prioritizer.
We do not know that every device of every Ubuntu Server instance already using multipath-tools has a "working prioritizer", do we? I'm not sure what that even is, but I can guess from context in multipath :)
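To make the "working prioritizer" requirement more concrete, here is a toy model of the failback decision. It is a sketch, not multipathd's implementation. It assumes the current path group still has active paths and that a higher-priority group has just recovered. The names `PathGroup` and `failback_decision` and the priority numbers are hypothetical and chosen only for illustration. The point it shows is that "immediate" or a numeric deferral only works if the groups can be ranked by priority. Without priorities there is no "better" group to fail back to.

```python
from dataclasses import dataclass
from typing import List, Optional, Union

@dataclass
class PathGroup:
    name: str
    priority: Optional[int]  # hypothetical; None stands in for "no working prioritizer"
    has_active_paths: bool

Policy = Union[str, int]  # "manual", "immediate", or a deferral in seconds

def highest_priority_group(groups: List[PathGroup]) -> Optional[PathGroup]:
    """Best group with active paths, or None if the groups cannot be ranked."""
    alive = [g for g in groups if g.has_active_paths]
    if not alive or any(g.priority is None for g in alive):
        return None
    return max(alive, key=lambda g: g.priority)

def failback_decision(current: PathGroup, groups: List[PathGroup],
                      policy: Policy, seconds_since_recovery: float = 0.0) -> PathGroup:
    """Group I/O should use after a higher-priority group comes back (toy model)."""
    if policy == "manual":
        return current  # stay on the current group until an administrator intervenes
    best = highest_priority_group(groups)
    if best is None:
        return current  # no usable priorities: nothing to rank, so no failback
    if policy == "immediate":
        return best
    if isinstance(policy, int) and policy > 0:
        # Deferred failback: switch only once the recovered group has been up long enough.
        return best if seconds_since_recovery >= policy else current
    raise ValueError(f"unsupported failback policy: {policy!r}")

if __name__ == "__main__":
    a = PathGroup("controller-A", priority=50, has_active_paths=True)
    b = PathGroup("controller-B", priority=10, has_active_paths=True)
    # I/O moved to B while A was being upgraded; A is now back.
    print(failback_decision(b, [a, b], "manual").name)     # controller-B
    print(failback_decision(b, [a, b], "immediate").name)  # controller-A
    print(failback_decision(b, [a, b], 30, seconds_since_recovery=5).name)  # controller-B
```

In this model, "manual" never returns I/O to the preferred group on its own. "immediate" and a numeric value both depend on `highest_priority_group` being able to rank the groups. That dependency is the reason the manpage ties those settings to a working prioritizer.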
I'm also going to unsubscribe Ubuntu Server, as right now there is not (in my opinion) anything for the Server Team to change -- multipath policy is fraught with danger :) Honestly, we take the default from upstream, because it 'just works'. If you want to see that change, I would work with the upstream community (maybe 'immediate to manual', which uses 'immediate' if it is detected that it can be used ('working prioritizer') or somesuch), but that's outside the scope of the package in Ubuntu.