Comment 12 for bug 109114

John A Meinel (jameinel) wrote : Re: commit holds whole files in memory

Just trying to add a little bit of debugging to this issue.

1) file.readlines() has a significant amount of memory overhead versus file.read().

To generate a 'large' file, I grabbed a dump from something else and copied it until I had a 100MB file. This meant I had a pure text file, but it also means the average line length is only 21 bytes.

A 'str' object in Python has 24 bytes of Python object overhead. (Not to mention any waste due to the allocator, etc.)

So for my particular test case, I have a 108MiB file, with 5.1Mi lines. Because we use 'readlines()' we end up with:
  a) a 21MiB list object (a list with 5.1Mi * 4 bytes of references); note that on 64-bit this roughly doubles to 41MiB because each ref is 8 bytes
  b) 5.1Mi * 24 bytes of string overhead, or 122MiB of Python object overhead. Again, on 64-bit a 'str' object goes up to around 40 bytes each, which would be 204.6MiB (or ~ 2x the size of the *content*)
  c) 108MiB actual file content
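
A quick way to sanity-check those numbers (purely illustrative: it needs a
Python with sys.getsizeof, i.e. 2.6 or later, the exact overhead varies with
the interpreter and with 32-bit vs 64-bit, and the file name is just a
placeholder):

    import sys

    f = open('big_file.txt', 'rb')
    lines = f.readlines()
    f.close()

    content = sum(len(line) for line in lines)              # the text itself
    str_overhead = sum(sys.getsizeof(line) for line in lines) - content
    list_overhead = sys.getsizeof(lines)                     # the list of refs

    MiB = 1024.0 * 1024
    print("%d lines" % len(lines))
    print("content:       %.1f MiB" % (content / MiB))
    print("str overhead:  %.1f MiB" % (str_overhead / MiB))
    print("list overhead: %.1f MiB" % (list_overhead / MiB))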

2) KnitVersionedFile._add

   a) The first thing this function does is:
              line_bytes = ''.join(lines)
      which obviously doubles memory consumption right there. Well, in my case
      it is only 50% bigger rather than 2x, because a single string removes
      all of the per-string overhead.

   b) After a bit, it then checks to see if the content ends in a newline. If
      it *doesn't* it then:
            if lines[-1][-1] != '\n':
                # copy the contents of lines.
                lines = lines[:]
                options.append('no-eol')
                lines[-1] = lines[-1] + '\n'
                line_bytes += '\n'
      This creates a new 'list' object (in my case, another 21MiB), but it
      *also* generates a new string via the "+= '\n'", which briefly means 2
      copies of the joined content in memory.
      So at that point we have 2 large list objects, 1 copy of the text split
      across many str objects, and 2 copies as large string objects. (The
      count quickly drops back to 1 large string, since the new string
      replaces the original line_bytes.)

   c) We then call '_record_to_data' which does:
        bytes = ''.join(chain(
            ["version %s %d %s\n" % (key[-1],
                                     len(lines),
                                     digest)],
            dense_lines or lines,
            ["end %s\n" % key[-1]]))
      This uses 'dense_lines or lines', so we shouldn't end up with an extra
      large list, but it does mean that we have yet one more copy of the file
      content.

So as far as I can see, there is a minimum of 3 copies of the content in
memory, not to mention a bit of overhead here and there.

I'll poke around a bit and see if I can't get rid of some of those internal
copies. We may decide to enable that code path only when the length of the file
is large, to avoid slowing down when committing lots of small content.
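
As a strawman for what that might look like (the names and the cutoff value
here are made up, and this ignores everything else _add has to do, so it's a
sketch of the direction rather than a patch):

    from itertools import chain

    _BIG_TEXT_THRESHOLD = 4 * 1024 * 1024   # hypothetical cutoff, needs tuning

    def _record_bytes_sketch(key, lines, digest):
        """Build the knit record bytes with a single join.

        Sketch only: it just shows where the extra list/string copies from
        2a/2b/2c could be dropped for large texts.
        """
        no_eol = bool(lines) and not lines[-1].endswith('\n')
        if no_eol:
            # Copy the list and fix up only the last line, instead of also
            # rebuilding the whole content string with line_bytes += '\n'.
            # The caller would append 'no-eol' to options when no_eol is True.
            lines = lines[:]
            lines[-1] = lines[-1] + '\n'
        header = "version %s %d %s\n" % (key[-1], len(lines), digest)
        footer = "end %s\n" % key[-1]
        # One join straight to the final record: a single extra copy of the
        # content on top of 'lines' itself.
        return ''.join(chain([header], lines, [footer])), no_eol

The big-text check would then just be something like 'take this path when the
total size is over _BIG_TEXT_THRESHOLD, otherwise keep the current code', so
lots of small texts don't pay for any extra bookkeeping.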