Comment 3 for bug 1462970

Josh Durgin (jdurgin) wrote :

There aren't good docs for this behavior of librados. Basically, anything that talks to the cluster may block and retry internally, to hide the complexity of retrying everything from library users. This could happen for many reasons. Network connections can flap. Switches can stop working. Daemons can be restarted. When a disk fails, an I/O needs to be resent to a different OSD. An object can be unavailable for writes if not enough copies of it are available. The cluster could be temporarily full. In each of these cases (and many more), librados retries on behalf of the library user. Layers built on top of rados can still apply their own higher-level timeouts, and leaving timeouts out of librados by default avoids the extra configuration needed to make librados-level timeouts match the higher levels.

The timeout options for librados were added to simplify use cases where library users want to fail eventually rather than block until the cluster becomes available. It makes sense to time out in cases like updating storage usage, which doesn't need to be strictly up to date and will be re-run by a higher level (in this case a periodic task). I'm not sure what other cases would make sense in cinder off the top of my head. Config issues like a missing or incorrect ceph key tend to be immediate failures anyway.
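
As a rough illustration of the storage-usage case, here is a minimal sketch in Python. It is not cinder's actual code. The helper name get_usage, the 5-second values, the conffile path and the "cinder" user are all assumptions for the example, and the rados module is passed in as an argument only so the function can be exercised without a cluster. The option names rados_osd_op_timeout, rados_mon_op_timeout and client_mount_timeout are the librados client options I have in mind; check their defaults against your Ceph release.

```python
# Client-side timeouts in seconds, passed as strings to librados.
# The 5-second values are arbitrary choices for this sketch.
TIMEOUT_CONF = {
    "rados_osd_op_timeout": "5",   # give up on OSD operations after 5s
    "rados_mon_op_timeout": "5",   # give up on monitor commands after 5s
    "client_mount_timeout": "5",   # give up on the initial connect after 5s
}


def get_usage(rados_module, conffile="/etc/ceph/ceph.conf", user="cinder"):
    """Return the cluster stats dict, or None if the cluster is unavailable.

    rados_module is the imported python-rados module (hypothetical
    injection point for this sketch). Returning None is acceptable here
    because the periodic task will simply try again on its next run.
    """
    client = rados_module.Rados(conffile=conffile, rados_id=user,
                                conf=TIMEOUT_CONF)
    try:
        client.connect()
        return client.get_cluster_stats()
    except rados_module.Error:
        # Timed out or otherwise failed; report "unknown" and move on.
        return None
    finally:
        client.shutdown()
```

With python-rados installed this would be called as get_usage(rados) after import rados. Anything that must not be skipped (for example writing volume data) should keep the default blocking behavior instead.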

Did you have particular things in mind for retrying, or a particular failure condition where connect() wouldn't eventually work?