Container Sync might lose the right x-sync-point2 resulting in not syncing objects

Bug #1565834 reported by Oshrit Feder
This bug affects 1 person
Affects: OpenStack Object Storage (swift)
Status: Confirmed
Importance: Medium
Assigned to: Unassigned

Bug Description

container sync scenario - error path

Consider the case where syncing a row fails at line #357 (links to the referenced lines are at the end of this report), so next_sync_point keeps a pointer to the problematic row.
In lines #362-363 we fetch the next row, then update and persist sync_point_2 (so sync_point_2 > next_sync_point).
The condition at line #344 is still satisfied, so we go on to sync more objects.

Line #364, outside the while loop, is aware of next_sync_point and should persist it as the desired sync_point_2 so that failed objects are retried.
Now assume the process stops just before line #364 runs (the node fails or the container sync daemon is stopped).
next_sync_point holds a pointer to the object that failed to replicate and is therefore missing from the target container, but because next_sync_point is not persisted, that pointer is lost.

On restart of the service, the persisted sync_point_2 (read at line #322) is further advanced, so the failed object(s) indicated by next_sync_point are not retried (next_sync_point < persisted sync_point_2). I suspect this could leave objects that are never synced to the target(?)
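The scenario can be reproduced with a toy model. This is only a sketch of the pattern described in this report, not the code in swift/container/sync.py: ToyBroker, Crash and sync_pass are hypothetical names, rows are plain integers, and the model assumes next_sync_point records the last persisted point before the first failed row.

```python
class ToyBroker:
    """Hypothetical stand-in for the container DB; holds the persisted sync point."""

    def __init__(self):
        self.sync_point2 = -1  # persisted value: survives a daemon restart


class Crash(Exception):
    """Simulates the node failing or the daemon being stopped mid-pass."""


def sync_pass(broker, rows, failing, crash_before_rollback=False):
    """Toy sync pass; returns the rows that were synced in this pass."""
    synced = []
    next_sync_point = None  # kept in memory only, lost on a crash
    for row in rows:
        if row <= broker.sync_point2:
            continue  # already behind the persisted pointer
        if row in failing:
            if next_sync_point is None:
                # Assumption of this model: remember the point before the failure.
                next_sync_point = broker.sync_point2
        else:
            synced.append(row)
        # The pointer is persisted after every row, even past a failed one.
        broker.sync_point2 = row
    if crash_before_rollback:
        raise Crash()  # the rollback below never runs
    if next_sync_point is not None:
        broker.sync_point2 = next_sync_point  # roll back so the failure is retried
    return synced
```

With rows [1, 2, 3, 4] and row 2 failing, a normal pass leaves the persisted pointer at 1, so the next pass retries row 2. If the pass is interrupted before the rollback, the persisted pointer stays at 4 and a later pass finds nothing left to sync, so row 2 is never sent by this replica.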

For an object to be lost this way, every replica has to fail before setting x_container_sync_point2 back to the failure point. This is rare, but it can happen.

It would be better to persist a value only when it is the right one; keeping the correct value only in memory risks losing critical information. Also, if syncing a row fails, it might be a good idea to return from the method rather than continue to the next rows.
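A minimal sketch of that suggestion follows. It is an illustration under assumptions, not necessarily what the patch under review does: ToyBroker and sync_pass_stop_on_failure are hypothetical names and rows are plain integers. The persisted pointer only advances past rows that were actually synced, and the pass returns at the first failure.

```python
class ToyBroker:
    """Hypothetical stand-in for the container DB; holds the persisted sync point."""

    def __init__(self):
        self.sync_point2 = -1  # persisted value: survives a daemon restart


def sync_pass_stop_on_failure(broker, rows, failing):
    """Toy sync pass that never persists a pointer beyond an unsynced row."""
    synced = []
    for row in rows:
        if row <= broker.sync_point2:
            continue  # already behind the persisted pointer
        if row in failing:
            # Return immediately: the persisted pointer still precedes the
            # failed row, so there is nothing held only in memory to lose.
            return synced
        synced.append(row)
        broker.sync_point2 = row  # persist only after a successful sync
    return synced
```

Because no in-memory rollback is needed, a crash at any point leaves the persisted pointer before the first unsynced row. The trade-off in this sketch is that a row that keeps failing holds back all later rows until it succeeds.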

The issue is fixed and does not exist in an in-review patch by Eran:
https://review.openstack.org/#/c/225338/

https://github.com/openstack/swift/blob/master/swift/container/sync.py#L322
https://github.com/openstack/swift/blob/master/swift/container/sync.py#L357
https://github.com/openstack/swift/blob/master/swift/container/sync.py#L362
https://github.com/openstack/swift/blob/master/swift/container/sync.py#L364

Tags: container sync
Oshrit Feder (oshritf)
tags: added: container sync
Alistair Coles (alistair-coles) wrote :

The line references are out of date with respect to the master branch, but I believe I can see the bug as described in these lines [1]:

The sync of row x fails, so next_sync_point is set to x.
The loop continues and successfully syncs row y, then z; the broker sync_point2 is updated to y, then z.

The intention is that once the while loop completes, the broker sync_point2 is rolled back to the in-memory next_sync_point, but if the process dies before that happens, the broker sync_point2 is never rolled back to the failed row.

[1] https://github.com/openstack/swift/blob/671254224a4a4710e7556535ee68bd999536ab8d/swift/container/sync.py#L396-L407

Changed in swift:
status: New → Confirmed
importance: Undecided → Medium
importance: Medium → Low
Alistair Coles (alistair-coles) wrote :

Changed to medium importance because, as noted in the report, all replicas' container servers would need to experience the same failure for the failed row to never be retried.

Changed in swift:
importance: Low → Medium
Oshrit Feder (oshritf) wrote :

Known issue, discussed at the summit; it is a good idea to document it too. It was fixed in the following patch (Add thread level concurrency to container sync).
