object-replicator goes into bad state when lockup timeout < rsync timeout

Bug #1575277 reported by Caleb Tennis
18
This bug affects 3 people
Affects Status Importance Assigned to Milestone
OpenStack Object Storage (swift)
Confirmed
Medium
Unassigned

Bug Description

http://paste.openstack.org/show/9mwyQhMHVRYztd0c037G/
http://paste.openstack.org/show/JFyHkmdIKMqyT9XIsPAq/

We have a long running rsync process as it attempts to move a lot of data across a WAN link. Rsync timeout has been increased to 60 minutes (from the default of 15). Lockup timeout was left alone (30 minutes).

It seems like once we hit this "lockup timeout" during a long running rsync, things go awry. We're moving data, but the replication partition count hasn't increased. Once this occurs we basically never get replication going again. Furthermore, we have to 'kill -9' the object replicator processes, they don't respond to normal signals.

The easiest fix was to just set the lockup timeout to a higher value than the rsync timeout, but I'm opening this ticket as Clay G. thought there might be something strange with the lockup detection and recovery that needs to be looked at.

Revision history for this message
clayg (clay-gerrard) wrote :

I think the logs from this series are pretty damning:

http://paste.openstack.org/show/JFyHkmdIKMqyT9XIsPAq/

It's basically like

1) things are moving (slowly)
2) meh, too slow for me - "Lockup detected.. killing live coros"
3) no progress is made ever again until you manually restart daemons

Changed in swift:
status: New → Confirmed
Revision history for this message
clayg (clay-gerrard) wrote :

I think we can close this bug when we deprecate the python replicators in favor of h9d - which can not happen soon enough!

Revision history for this message
Christian Schwede (cschwede) wrote :

I think I might have seen too, but that was a long time ago...

I can't reproduce this in my lab setting. Here's what I did:

- configured object-replicator as following:

 [object-replicator]
 sync_method = rsync
 rsync_timeout = 60
 lockup_timeout = 5
 stats_interval = 2

- uploaded some multiple GB-sized objects
- deleted all but one replica of these objects, and hashes.pkl too
- replicator runs, I see the "killing live coros" - and it continues?

Not sure how to reproduce this yet.

Looking at the code I think there is another possibility to run into this in case the top-level replicator loop raises a Timeout (https://github.com/openstack/swift/blob/master/swift/obj/replicator.py#L759-L760).

Revision history for this message
clayg (clay-gerrard) wrote :

Ran into this again! Very annoying!

Changed in swift:
importance: Undecided → High
clayg (clay-gerrard)
Changed in swift:
importance: High → Medium
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.