Posted to dev@subversion.apache.org by Martin Scharrer <ma...@madmarty.de> on 2006/12/10 02:48:19 UTC

Size of revs file when deleting lines in a big text file

Hi list,

I already posted this on the users list but without a final result, and I now
think it can't be solved by a non-developer.
For reference: my users-list post was message #59447 with the
subject 'Size of revs file when deleting lines in a big text file - Bug?'.

Now the issue:
I noticed the following using svn 1.4.2 (r22196) with FSFS under Linux:
A text file (an mbox with mails) of about 5 MB is already in Subversion.
I then deleted one email, i.e. a couple of hundred lines, located in the first
quarter of the file. The diff is about 16 kBytes. After checking in this and
other small changes, I noticed that the corresponding file in the 'db/revs'
dir of the repository is over 3 MB in size.
'svn diff -rN:M | wc -c' showed me only <0.5 MB.

I then tested this with a test repository, repeating this case several
times by checking in a copy with the deletion and then a copy without it.
I repeated this about 50 times with the same repository using a script.

The result shows me that the 'delete' revs have very big sizes which are much
bigger than the resulting diff (e.g. a 500k rev with a 16k diff) and that this
size depends on the position of the change. The size is bigger for a deletion
nearer the start of the file and smaller for a deletion nearer the end,
e.g. 1.7MB right at the beginning, 660kB at the 50% mark, 2k at the end.
BUT the revs where the lines get added back are _all_ very small, about 2k.
The sizes should be about the same, since the diffs are.

I tested different orders of deletion/addition to make sure it's not caused by
skip-deltas, as mentioned by Rob Hubbard on the users list.

I have now written a test script in Perl so you can reproduce this easily. The
script generates a test repository and a test file and then makes a couple of
check-ins and prints the resulting sizes.
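
In outline, it does something like the following (shown here as a Python
sketch rather than the actual Perl; the file contents, sizes, and paths are
only illustrative):

    import os, subprocess, tempfile

    def run(*cmd):
        subprocess.check_call(cmd)

    top = tempfile.mkdtemp()
    repo = os.path.join(top, 'repo')
    wc = os.path.join(top, 'wc')
    run('svnadmin', 'create', repo)
    run('svn', 'checkout', 'file://' + repo, wc)

    # 'complete' is a ~5 MB mbox-like file; 'short' has a block of
    # lines removed from its first quarter.
    lines = ['line %06d: some mbox-like filler text\n' % i
             for i in range(120000)]
    complete = ''.join(lines)
    short = ''.join(lines[:30000] + lines[30300:])

    path = os.path.join(wc, 'testfile')
    for i, text in enumerate([complete, short] * 10):
        with open(path, 'w') as f:
            f.write(text)
        if i == 0:
            run('svn', 'add', path)
        run('svn', 'commit', '-m', 'commit %d' % i, wc)

    # Print the size of each revision file under db/revs.
    for dirpath, _, names in os.walk(os.path.join(repo, 'db', 'revs')):
        for name in sorted(names):
            print(name, os.path.getsize(os.path.join(dirpath, name)))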

thanks,
Martin

Re: Size of revs file when deleting lines in a big text file

Posted by Daniel Berlin <db...@dberlin.org>.
On 12/11/06, Martin Scharrer <ma...@madmarty.de> wrote:
> Hi Malcolm,
>
> thanks for the detailed explanation. This makes most things much clearer
> for me.
>
> On Monday 11 December 2006 21:33, Malcolm Rowe wrote:
> > algorithm.
> OK, I understand the operation of the window-oriented delta algorithm now, and
> why it is needed for binary files. But is it not possible to fall back on a
> diff-like patch in the case of small changes to a large text file, or is that
> not compatible with the overall delta format etc.?

1. We don't know the size of the file when we start performing the
delta (because everything is stream-oriented).
Thus, windows *have* to be used.
2. Given a window size equal to the file size, the delta algorithm
we use will produce smaller results than a diff-like patch.

IOW, changing to a patch-like format would not solve anything here.  If
we knew the size of the files we were performing deltas on ahead of
time, we could make some reasonable estimate as to what window size to
use and get good delta performance.

There are times we know this information, but don't pass it along,
which is kinda sad, but c'est la vie.
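
To make point 2 concrete: with a single window spanning the whole file, a
deletion needs only a couple of copy instructions, while a diff-like patch
has to carry the removed text itself. A toy sketch (the tuple encoding is
illustrative, not the real svndiff instruction format):

    # With one window covering the whole source, deleting `dlen`
    # bytes at offset `pos` needs just two source-copy ops, whose
    # encoded size is a few bytes each regardless of file size.
    def whole_window_delete_ops(size, pos, dlen):
        return [('copy', 0, pos),                         # data before the cut
                ('copy', pos + dlen, size - pos - dlen)]  # data after it

    print(whole_window_delete_ops(5461065, 1300000, 9105))

A unified diff of the same change is far larger (around 16k in Martin's
mbox case), because it includes the deleted lines themselves.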


Re: Size of revs file when deleting lines in a big text file

Posted by Martin Scharrer <ma...@madmarty.de>.
Hi Malcolm,

thanks for the detailed explanation. This makes most things much clearer
for me.

On Monday 11 December 2006 21:33, Malcolm Rowe wrote:
> You can see this if you remove the calls to the other_revs function
> (there's no change in the output) or if you initially commit your
> 'short' file and then continue 'complete', 'short', ... (you'll see then
> that the _added_ files now have larger revisions: everything is now
> delta'd against the short file).
Yes, I checked that, and that's what happens. I also made a script with
three slightly different files, which resulted in different revs sizes
depending on the commit order.

> In order that it can deal with arbitrarily large files, Subversion's
> delta algorithm deals with each source file as a series of windows, each
> of 100k.  The delta algorithm reads 100k of source data and 100k of
> target data and constructs the delta.
>
> [...]
>
> You have a 'short' file that's 5451960 bytes, or 54 100k windows.  The
> first 27 windows will be simple copies (about 500 bytes in total), and
> the second 27 will be dominated by the 9105 bytes that were brought
> forward from the next source window.  27 x 9105 is 245835 bytes,
> slightly below what I see for each revision's size in an svndiff0
> repository.
>
> How can we improve this?  The best way would be to increase the delta
> window size significantly, though we have some backward-compatibility
> issues (and probably memory constraints) to work around in order to do
> that.  Even so, unless we were able to change the delta logic so that we
> use variable-sized windows and resynchronise where possible, we're always
> going to take a hit on this kind of change with a window-oriented delta
> algorithm.
OK, I understand the operation of the window-oriented delta algorithm now, and
why it is needed for binary files. But is it not possible to fall back on a
diff-like patch in the case of small changes to a large text file, or is that
not compatible with the overall delta format etc.?

> > I have now written a test script in Perl so you can reproduce this easily.
> > The script generates a test repository and a test file and then makes a
> > couple of check-ins and prints the resulting sizes.
>
> Thank you _very_ much for giving us a reproduction recipe: it made it
> significantly easier to see what was going on.
You're welcome. I know myself how hard it is to reproduce something like this
properly without a script or similar.

Thanks a lot,
Martin


Re: Size of revs file when deleting lines in a big text file

Posted by Malcolm Rowe <ma...@farside.org.uk>.
Hi Martin,

On Sun, Dec 10, 2006 at 02:48:19AM +0000, Martin Scharrer wrote:
> A text file (an mbox with mails) of about 5 MB is already in Subversion.
> I then deleted one email, i.e. a couple of hundred lines, located in the first
> quarter of the file. The diff is about 16 kBytes. After checking in this and
> other small changes, I noticed that the corresponding file in the 'db/revs'
> dir of the repository is over 3 MB in size.
> 'svn diff -rN:M | wc -c' showed me only <0.5 MB.
> 

The size of the textual diff against the previous revision and the size
of the delta (which isn't necessarily against the previous revision)
aren't related.

> The result shows me that the 'delete' revs have very big sizes which are much
> bigger than the resulting diff (e.g. a 500k rev with a 16k diff)

> I tested different orders of deletion/addition to make sure it's not caused by
> skip-deltas, as mentioned by Rob Hubbard on the users list.
> 

The details are below, but in summary:

a. You _are_ seeing the result of our skip-delta algorithm, at least in
   your reproduction script.
b. We're doing the best we can given the way we move through the two
   different files (reference and version; or 'source and target' if
   you prefer), that is, window-by-window.  We will always see large
   deltas when you delete or insert content midway through a large file.


So the behaviour you're seeing in your reproduction recipe is partly due
to the fact that we use a skip-list-like method to choose the delta
base.  The reason is simple:

a. We choose a delta base based upon the 'version number' of the file
   itself, regardless of how many commits have occurred to other files.
   More precisely, we use the predecessor count, but for simple cases,
   that's the same as a per-file version number that starts at zero.

b. Your recipe commits a 'complete' file (the full file contents), then
   a 'short' one (with some content removed), then the 'complete' one
   again, and so on: complete, short, complete, short, complete, short...

   (You also do commits to other files, but we ignore those when working
   out the delta base).

c. We choose the delta base by taking the predecessor count as a binary
   string and unsetting the rightmost '1' bit.  (This is the skip-list
   part).  By doing so, we'll _always_ end up with an even-numbered
   predecessor as a delta base.

So because we always use an even predecessor as a base, we'll always use
the 'complete' file, and so all the revisions that want to produce that
complete file end up with very short deltas (ones that are essentially
minimum size: a stream of whole-window copy commands).
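
The 'unsetting the rightmost 1 bit' step is just n & (n - 1); a quick
Python sketch shows why the chosen base is always even:

    # Skip-delta base selection: clear the lowest set bit of the
    # predecessor count.  Whenever bit 0 is set, it is the bit that
    # gets cleared, so the base is always even -- and in the
    # complete/short/complete/... recipe, even predecessors are
    # always 'complete' versions.
    for n in range(1, 17):
        base = n & (n - 1)
        print('version %2d deltas against version %d' % (n, base))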

You can see this if you remove the calls to the other_revs function
(there's no change in the output) or if you initially commit your
'short' file and then continue 'complete', 'short', ... (you'll see then
that the _added_ files now have larger revisions: everything is now
delta'd against the short file).


You might be wondering why deleting some content ends up producing such
a large delta (c.240k with svndiff0 / --pre-1.4-compatible
repositories).  After all, you've _deleted_ data, not added it, right?
And if you try the experiment I described above (reversing the order),
you'll see that the 'deleted' version is comparable in size to what you
get when you add a similar amount of text.

So why does deleting data from the middle of a file (you deleted 9105
bytes from approximately the middle of a 5461065-byte file) produce a
large delta?  You hit the root of the problem here:

>                                                              and that this
> size depends on the position of the change. The size is bigger for a deletion
> nearer the start of the file and smaller for a deletion nearer the end,
> e.g. 1.7MB right at the beginning, 660kB at the 50% mark, 2k at the end.

In order that it can deal with arbitrarily large files, Subversion's
delta algorithm deals with each source file as a series of windows, each
of 100k.  The delta algorithm reads 100k of source data and 100k of
target data and constructs the delta.

So when you remove 9k from the middle of a file, we end up with the
following:

* each window prior to the removal is simply a single copy command.
* each window after that point contains a copy command for the data
  that's common in both source and target, plus an insertion of about
  9k of data that's been brought forward from the next source window.

That new data (from the next window) is inserted directly into the delta
stream (though in svndiff1 repositories it is at least zlib-compressed),
and appears for each window past that point.
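
As a toy model of that misalignment (taking the '100k' window to be
102400 bytes, which is an assumption about the exact constant):

    # Per-window shape of the delta after a deletion at offset `pos`:
    # windows wholly before the deletion are still aligned with the
    # source and need a single copy op; every later window is shifted
    # and must also insert the bytes brought forward from the next
    # source window.
    W = 102400  # assumed size of a '100k' window

    def window_ops(size, pos):
        return ['copy' if s + W <= pos else 'copy+insert'
                for s in range(0, size, W)]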

You have a 'short' file that's 5451960 bytes, or 54 100k windows.  The
first 27 windows will be simple copies (about 500 bytes in total), and
the second 27 will be dominated by the 9105 bytes that were brought
forward from the next source window.  27 x 9105 is 245835 bytes,
slightly below what I see for each revision's size in an svndiff0
repository.
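
The same arithmetic, written out (the window constant is again assumed to
be 102400 bytes; the trend also matches the 1.7MB / 660kB / 2k observation
qualitatively, though those numbers came from a different file):

    # Rough delta-size estimate: every window at or past the deletion
    # carries the ~9105 bytes re-inserted from the next source window.
    size, dlen, W = 5451960, 9105, 102400
    windows = -(-size // W)               # ceiling division -> 54
    for frac in (0.0, 0.25, 0.5, 1.0):    # position of the cut in the file
        affected = windows - int(frac * windows)
        print('cut at %3d%% -> ~%d bytes' % (frac * 100, affected * dlen))
    # at 50%: 27 * 9105 = 245835 bytes, the estimate given above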

How can we improve this?  The best way would be to increase the delta
window size significantly, though we have some backward-compatibility
issues (and probably memory constraints) to work around in order to do
that.  Even so, unless we were able to change the delta logic so that we use
variable-sized windows and resynchronise where possible, we're always
going to take a hit on this kind of change with a window-oriented delta
algorithm.

> 
> I have now written a test script in Perl so you can reproduce this easily. The
> script generates a test repository and a test file and then makes a couple of
> check-ins and prints the resulting sizes.
> 

Thank you _very_ much for giving us a reproduction recipe: it made it
significantly easier to see what was going on.

Regards,
Malcolm