better handling of large files by setting delta-compression threshold

Bug #607268 reported by Parth Malwankar
This bug affects 1 person
Affects   Status      Importance   Assigned to   Milestone
Bazaar    Confirmed   Low          Unassigned
Breezy    Triaged     Medium       Unassigned

Bug Description

Handling of large files could be improved by providing a user-configurable threshold for delta compression, as suggested by John Arbash Meinel on the mailing list [1].

Partial Quote:
From: John Arbash Meinel <john <at> arbash-meinel.com>
Subject: Re: large files and storage formats

I'll note that if all you want is for content objects that are greater
than some threshold to not be delta-compressed, you can do:

=== modified file 'bzrlib/groupcompress.py'
--- bzrlib/groupcompress.py 2010-05-20 02:57:52 +0000
+++ bzrlib/groupcompress.py 2010-07-09 05:41:30 +0000
@@ -1721,12 +1721,7 @@
                                                nostore_sha=nostore_sha)
             # delta_ratio = float(len(bytes)) / (end_point - start_point)
             # Check if we want to continue to include that text
-            if (prefix == max_fulltext_prefix
-                and end_point < 2 * max_fulltext_len):
-                # As long as we are on the same file_id, we will fill at least
-                # 2 * max_fulltext_len
-                start_new_block = False
-            elif end_point > 4*1024*1024:
+            if end_point > 4*1024*1024:
                 start_new_block = True
             elif (prefix is not None and prefix != last_prefix
                   and end_point > 2*1024*1024):

That will leave you with repositories that are considered valid 2a
format repositories, just not as 'packed' as we would normally make them.

I would guess there will be other places where our memory will be larger
than you might like. But at least for the 'compressing 2 large blobs
together takes too much memory' case, it would sidestep it.

'large' in this case is >4MB.

You could probably even do a little bit better, by checking the length
of the content before calling 'self._compressor.compress()', and
choosing to start a new block right away.

[1] http://article.gmane.org/gmane.comp.version-control.bazaar-ng.general/68744
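
For illustration, here is a rough standalone sketch of the block-threshold check described in the quote above. It is not actual bzrlib code; the function and constant names are made up, and the 4MB figure simply mirrors the 'end_point > 4*1024*1024' test in the quoted patch:

    # Illustrative sketch only, not bzrlib code; all names are hypothetical.
    LARGE_CONTENT_THRESHOLD = 4 * 1024 * 1024  # 4MB, as in the quoted patch

    def wants_new_block(content_length, block_end_point,
                        threshold=LARGE_CONTENT_THRESHOLD):
        """Return True if the compressor should flush and start a new block.

        Two checks, matching the ideas in the quoted mail: the incoming
        text is itself larger than the threshold, so it should be written
        out without delta-compression; or the current block has already
        grown past the threshold, which is what the patch's end_point
        test does.
        """
        if content_length > threshold:
            # Check the length before compressing, as suggested at the end
            # of the quote, so a huge text never gets delta-compressed.
            return True
        if block_end_point > threshold:
            # The current block is already 'large'; close it first.
            return True
        return False

The point of checking the content length up front is that the compressor never has to hold a large existing block and a large new text in memory at the same time, which is exactly the 'compressing 2 large blobs together' case mentioned above.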

Tags: memory
Revision history for this message
John A Meinel (jameinel) wrote :

related to bug #109114

I don't think the exact code here will work, but something along the lines of "if content is greater than X, don't delta, just write it out" would be an option. Of course, it makes the size on disk terrible in some circumstances.
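
If the threshold were made user-configurable, as the bug description asks, one possible shape (a sketch only, with an invented option name) would be to read it from bzr's global configuration:

    from bzrlib import config

    def get_delta_threshold(default=4 * 1024 * 1024):
        # Hypothetical option name: bzr does not actually define
        # 'large_file_delta_threshold'. get_user_option() returns the raw
        # string from bazaar.conf, or None if the option is unset.
        value = config.GlobalConfig().get_user_option(
            'large_file_delta_threshold')
        if value is None:
            return default
        try:
            return int(value)
        except ValueError:
            # Ignore a malformed value rather than breaking commit.
            return default

Whatever the mechanism, the trade-off noted above still applies: texts over the threshold are stored without deltas, so on-disk size can grow considerably.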

Changed in bzr:
importance: Undecided → Low
status: New → Confirmed
Parth Malwankar (parthm)
tags: added: memory
Jelmer Vernooij (jelmer)
tags: added: check-for-breezy
Jelmer Vernooij (jelmer)
Changed in brz:
status: New → Triaged
importance: Undecided → Medium
tags: removed: check-for-breezy