Comment 21 for bug 1435706

Tore Anderson (toreanderson) wrote :

OK, so I did some more testing. It appears that the problem isn't specific to the dev_loss_tmo and fast_io_fail_tmo settings, as the terminal log below shows. In multipath.conf (which we know for certain is being read, as the created multipath map gets the correct alias), I instruct it to use the ALUA hardware handler for all devices. However, for some reason this is ignored, and the EMC hardware handler is used instead:

=====
root@ucstest-osl2:~# cat /etc/multipath.conf
devices {
        device {
                vendor ".*"
                product ".*"
                hardware_handler "1 alua"
        }
}

multipaths {
        multipath {
                wwid 3600601603a71320022967e0a1f38e411
                alias bootvolume
        }
}
root@ucstest-osl2:~# multipath -v 2
create: bootvolume (3600601603a71320022967e0a1f38e411) undef DGC,VRAID
size=50G features='1 queue_if_no_path' hwhandler='1 emc' wp=undef
|-+- policy='round-robin 0' prio=1 status=undef
| |- 0:0:0:0 sda 8:0 undef ready running
| `- 1:0:1:0 sdd 8:48 undef ready running
`-+- policy='round-robin 0' prio=0 status=undef
  |- 0:0:1:0 sdb 8:16 undef ready running
  `- 1:0:0:0 sdc 8:32 undef ready running
=====

This does *NOT* happen on RHEL-based distros - on those, changing the hardware_handler in multipath.conf in this way works as expected.

So why does it use the EMC hardware_handler? Well, there's a built-in default device section that matches the array in question. So this appears to override my user-specified config from multipath.conf:

=====
root@ucstest-osl2:~# multipathd -k'show config' | grep -B10 -A4 '1 emc'
 device {
  vendor "DGC"
  product ".*"
  product_blacklist "LUNZ"
  path_grouping_policy group_by_prio
  getuid_callout "/lib/udev/scsi_id --whitelisted --device=/dev/%n"
  path_selector round-robin 0
  path_checker emc_clariion
  checker emc_clariion
  features "1 queue_if_no_path"
  hardware_handler "1 emc"
  prio emc
  failback immediate
  no_path_retry 60
 }
=====
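As an alternative to the grep above, the device sections can be picked apart structurally. The following is only a rough sketch: it assumes the output of «multipathd -k'show config'» has the simple "device { key value ... }" layout shown in the log, and the function names parse_device_blocks and devices_matching are made up for illustration. It also assumes vendor/product are treated as unanchored regular expressions, which I have not verified against the multipath-tools source.

```python
import re


def parse_device_blocks(config_text):
    """Parse 'device { ... }' sections into a list of dicts.

    Assumes one 'key value' pair per line, as in the log above.
    Surrounding double quotes around values are removed.
    """
    devices = []
    current = None
    for raw in config_text.splitlines():
        line = raw.strip()
        if re.fullmatch(r"device\s*\{", line):
            current = {}                      # start of a device section
        elif line == "}" and current is not None:
            devices.append(current)           # end of a device section
            current = None
        elif current is not None and line:
            key, _, value = line.partition(" ")
            current[key] = value.strip().strip('"')
    return devices


def devices_matching(devices, vendor, product):
    """Return the sections whose vendor/product patterns match the strings.

    Assumption: patterns are unanchored regexes (hence re.search).
    """
    return [
        d for d in devices
        if re.search(d.get("vendor", ""), vendor)
        and re.search(d.get("product", ""), product)
    ]
```

Feeding it the saved output of the show config command and calling devices_matching(devices, "DGC", "VRAID") should list every section that could apply to this array, which makes it easier to spot a built-in entry competing with the one from multipath.conf.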

If I copy the entire default device config into /etc/multipath.conf and only change the hardware_handler setting, then it starts working:

=====
root@ucstest-osl2:~# cat /etc/multipath.conf
devices {
        device {
                vendor "DGC"
                product ".*"
                product_blacklist "LUNZ"
                path_grouping_policy group_by_prio
                getuid_callout "/lib/udev/scsi_id --whitelisted --device=/dev/%n"
                path_selector "round-robin 0"
                path_checker emc_clariion
                checker emc_clariion
                features "1 queue_if_no_path"
                hardware_handler "1 alua"
                prio emc
                failback immediate
                no_path_retry 60
        }
}

multipaths {
        multipath {
                wwid 3600601603a71320022967e0a1f38e411
                alias bootvolume
        }
}
root@ucstest-osl2:~# multipath -v 2
create: bootvolume (3600601603a71320022967e0a1f38e411) undef DGC,VRAID
size=50G features='1 queue_if_no_path' hwhandler='1 alua' wp=undef
|-+- policy='round-robin 0' prio=1 status=undef
| |- 0:0:0:0 sda 8:0 undef ready running
| `- 1:0:1:0 sdd 8:48 undef ready running
`-+- policy='round-robin 0' prio=0 status=undef
  |- 0:0:1:0 sdb 8:16 undef ready running
  `- 1:0:0:0 sdc 8:32 undef ready running
=====

It would appear that, for some reason, overriding default device settings in Ubuntu requires an *exact* string match between the user-supplied «vendor» and «product» settings and those of the built-in device section. If I change e.g. «product» in multipath.conf to ".*.*", then it starts using the built-in defaults again, ignoring multipath.conf. I consider this behaviour very dangerous: consider an admin who has a working config (due to exactly matching vendor/product settings), and then the package gets updated and extends the built-in defaults to incorporate some new model matching the same profile/settings. At that point the admin's working config will stop being used, possibly causing disruptive problems. I therefore strongly suggest you figure out why it behaves differently in Ubuntu and RHEL, and adopt the RHEL behaviour, which really is the only sensible one.
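To make the suspected difference concrete, here is a toy model of the two behaviours. It is purely a sketch of my hypothesis and not taken from the multipath-tools source: the table layout, the function names, and the "user entries first" rule for the RHEL-like case are all assumptions made for illustration.

```python
import re

# Hypothetical, trimmed-down stand-in for the built-in device table.
BUILTIN = [
    {"vendor": "DGC", "product": ".*", "hardware_handler": "1 emc"},
]


def matches(entry, vendor, product):
    """True if the entry's vendor/product regexes match the device strings."""
    return (re.search(entry["vendor"], vendor) is not None
            and re.search(entry["product"], product) is not None)


def build_table_exact(builtin, user):
    """Model of the behaviour observed on Ubuntu (an assumption).

    A user entry overrides a built-in one only when its vendor and product
    strings are character-for-character identical; otherwise it is appended
    after the built-ins.
    """
    table = [dict(e) for e in builtin]
    for u in user:
        for i, b in enumerate(table):
            if (b["vendor"], b["product"]) == (u["vendor"], u["product"]):
                table[i] = {**b, **u}   # user keys win on an exact match
                break
        else:
            table.append(dict(u))       # no identical entry: goes last
    return table


def build_table_user_first(builtin, user):
    """Model of the expected behaviour: user entries are consulted first."""
    return [dict(u) for u in user] + [dict(b) for b in builtin]


def lookup(table, vendor, product):
    """Return the hardware_handler of the first entry that matches."""
    for entry in table:
        if matches(entry, vendor, product):
            return entry["hardware_handler"]
    return None


catch_all = [{"vendor": ".*", "product": ".*", "hardware_handler": "1 alua"}]
exact = [{"vendor": "DGC", "product": ".*", "hardware_handler": "1 alua"}]

print(lookup(build_table_exact(BUILTIN, catch_all), "DGC", "VRAID"))       # 1 emc
print(lookup(build_table_exact(BUILTIN, exact), "DGC", "VRAID"))           # 1 alua
print(lookup(build_table_user_first(BUILTIN, catch_all), "DGC", "VRAID"))  # 1 alua
```

In the exact-match model the catch-all entry never gets a chance, because the built-in "DGC" section is found first. That is the same outcome as in my first terminal log, and it is also what would happen if a package update changed the built-in vendor/product strings out from under a previously working config.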

In any case, now that I know how to ensure my multipath.conf settings are being used, I re-tried adding dev_loss_tmo and fast_io_fail_tmo, but it still doesn't work:

=====
root@ucstest-osl2:~# cat /etc/multipath.conf
devices {
        device {
                vendor "DGC"
                product ".*"
                product_blacklist "LUNZ"
                path_grouping_policy group_by_prio
                getuid_callout "/lib/udev/scsi_id --whitelisted --device=/dev/%n"
                path_selector "round-robin 0"
                path_checker emc_clariion
                checker emc_clariion
                features "1 queue_if_no_path"
                hardware_handler "1 alua"
                prio emc
                failback immediate
                no_path_retry 60
                fast_io_fail_tmo 3
                dev_loss_tmo 2147483647
        }
}

multipaths {
        multipath {
                wwid 3600601603a71320022967e0a1f38e411
                alias bootvolume
        }
}
root@ucstest-osl2:~# multipath -v 2
Aug 29 10:39:57 | bootvolume failed to set /class/fc_remote_ports/rport-0:0-1/dev_loss_tmo
create: bootvolume (3600601603a71320022967e0a1f38e411) undef DGC,VRAID
size=50G features='1 queue_if_no_path' hwhandler='1 alua' wp=undef
|-+- policy='round-robin 0' prio=1 status=undef
| |- 0:0:0:0 sda 8:0 undef ready running
| `- 1:0:1:0 sdd 8:48 undef ready running
`-+- policy='round-robin 0' prio=0 status=undef
  |- 0:0:1:0 sdb 8:16 undef ready running
  `- 1:0:0:0 sdc 8:32 undef ready running
root@ucstest-osl2:~# grep . /sys/class/fc_remote_ports/rport-*/*tmo
/sys/class/fc_remote_ports/rport-0:0-0/dev_loss_tmo:30
/sys/class/fc_remote_ports/rport-0:0-0/fast_io_fail_tmo:off
/sys/class/fc_remote_ports/rport-0:0-1/dev_loss_tmo:30
/sys/class/fc_remote_ports/rport-0:0-1/fast_io_fail_tmo:off
/sys/class/fc_remote_ports/rport-0:0-2/dev_loss_tmo:30
/sys/class/fc_remote_ports/rport-0:0-2/fast_io_fail_tmo:off
/sys/class/fc_remote_ports/rport-1:0-0/dev_loss_tmo:30
/sys/class/fc_remote_ports/rport-1:0-0/fast_io_fail_tmo:off
/sys/class/fc_remote_ports/rport-1:0-1/dev_loss_tmo:30
/sys/class/fc_remote_ports/rport-1:0-1/fast_io_fail_tmo:off
/sys/class/fc_remote_ports/rport-1:0-2/dev_loss_tmo:30
/sys/class/fc_remote_ports/rport-1:0-2/fast_io_fail_tmo:off
=====

The *_tmo settings were read and understood by the config file parser, as I can see them appear in the output from «multipathd -k'show config'». They are also recognised as supported options: if I add another «foo» option with the value of «bar» right below them, that one does *not* show up in «multipathd -k'show config'», so the config parser doesn't just blindly read in any setting it encounters.
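For repeated testing, the manual grep over sysfs can be replaced by a small check that compares every remote port against the values requested in multipath.conf. This is just a sketch: the function names are made up, the expected values are the ones from my config above, and I have not verified how the kernel renders a very large dev_loss_tmo when it is read back, so the string comparison may need adjusting.

```python
import glob
import os


def read_rport_timeouts(sysfs_root="/sys/class/fc_remote_ports"):
    """Return {rport_name: {"dev_loss_tmo": str, "fast_io_fail_tmo": str}}."""
    result = {}
    for rport in sorted(glob.glob(os.path.join(sysfs_root, "rport-*"))):
        values = {}
        for attr in ("dev_loss_tmo", "fast_io_fail_tmo"):
            try:
                with open(os.path.join(rport, attr)) as f:
                    values[attr] = f.read().strip()
            except OSError:
                values[attr] = None  # attribute missing or unreadable
        result[os.path.basename(rport)] = values
    return result


def mismatches(timeouts, expected):
    """List (rport, attr, actual) for every value that differs from expected."""
    return [
        (name, attr, vals.get(attr))
        for name, vals in timeouts.items()
        for attr, want in expected.items()
        if vals.get(attr) != want
    ]


if __name__ == "__main__":
    # Values taken from the multipath.conf in the log above.
    expected = {"fast_io_fail_tmo": "3", "dev_loss_tmo": "2147483647"}
    for name, attr, actual in mismatches(read_rport_timeouts(), expected):
        print(f"{name}: {attr} is {actual}, expected {expected[attr]}")
```

Run after «multipath -v 2», it prints one line per attribute that was not applied, and prints nothing once the settings actually take effect.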

So the *_tmo settings are parsed, but they clearly are not applied. In any case, if you need it I'd be happy to give you access to this test machine so you can see for yourself, Mathieu. Find me on the NetworkManager IRC channel if you're interested in that.

Tore