Posted to dev@subversion.apache.org by "Jonathan S. Shapiro" <sh...@eros-os.org> on 2000/08/16 21:47:39 UTC

Reality of delta benefit

In case this is useful to the delta storage debate...

The other day, in the process of figuring out how big a farm to buy, I ran a
size analysis on the EROS source repository. Basically, I expanded every
version ever and examined the disk consumption in blocks under various
compression schemes: RCS, gzip, bzip2, etc.
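
Roughly, the analysis looks like the sketch below.  This is a
reconstruction rather than the exact script I ran; it assumes RCS's
rlog and co are on the PATH, and it approximates du -sk by rounding
each result up to 1K blocks:

    import os, subprocess, gzip, bz2

    def revisions(rcs_file):
        # Ask rlog for every revision number recorded in the ,v file.
        out = subprocess.run(["rlog", rcs_file],
                             capture_output=True, text=True).stdout
        return [line.split()[1] for line in out.splitlines()
                if line.startswith("revision ")]

    def blocks(nbytes):
        # Round up to 1K blocks, roughly what du -sk reports.
        return (nbytes + 1023) // 1024

    totals = {"plain": 0, "gzip": 0, "bzip2": 0}
    for dirpath, _, files in os.walk("."):
        for name in files:
            if not name.endswith(",v"):
                continue
            path = os.path.join(dirpath, name)
            for rev in revisions(path):
                # co -p prints the expanded text of one revision.
                text = subprocess.run(["co", "-q", "-p" + rev, path],
                                      capture_output=True).stdout
                totals["plain"] += blocks(len(text))
                totals["gzip"]  += blocks(len(gzip.compress(text, 9)))
                totals["bzip2"] += blocks(len(bz2.compress(text)))
    print(totals)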

Somewhat to my surprise, RCS is only very slightly better than gzip (normal
compression -- nothing fancy) on this data set and roughly the same as
bzip2.

Is this comparable to other measurements people have seen? If so, it strikes
me that storing deltas in the repository itself may multiply I/O's and file
operations for very little benefit.

shap

Re: Reality of delta benefit

Posted by Karl Fogel <kf...@galois.collab.net>.
Jim Blandy <ji...@savonarola.red-bean.com> writes:
> I'm not especially fond of the way CVS does merges into the working
> directory, either.  As you point out, it's not only that the
> repository doesn't store everything you wish it did --- it's that the
> working directory management is messy, too.
> 
> You're not the only one who has expressed a wish for a more coherent
> way to manage merge participants than renaming the original to
> `.#foo'.  But unless Karl has surprises up his sleeve, I think we're
> going to imitate CVS up to 1.0.  Hopefully, we'll then have a decent
> code base to experiment with other merge disciplines.

The main issue is that CVS does not record all the ancestry involved
in a merge.

However, recording complete ancestry, and taking it into account
during merges and updates, is a well-understood problem (at least, we
thought we understood it some months ago), and the solution Subversion
plans to use is documented in the spec.  In fact, it's one of the few
things in that spec that's not been obsoleted by code already. :-)

1.0 probably won't implement it, only because 1.0's goal is to be a
CVS replacement -- it'll get you out of Egypt, but it won't get you to
the promised land just yet.

Re: Reality of delta benefit

Posted by "Jonathan S. Shapiro" <sh...@eros-os.org>.
> I'm not especially fond of the way CVS does merges into the working
> directory, either...

I thought this was unpleasant too until I tried to implement a merge that
takes two configurations (by "configuration" I mean the set of files
contained by some particular version of a branch) and produces a new
configuration that is the merge of the previous two.

I got the whole implementation successfully performing the merge to files in
a temporary directory. Then I decided to worry about the last step, which
was creating the new configuration and its member entities on the server.

At which point I hit a stumbling block, which is that one may not have write
access to the server.

In addition, if you implement merge into the working area you get the "cvs
update" behavior for free.

shap

Re: Reality of delta benefit

Posted by Jim Blandy <ji...@savonarola.red-bean.com>.
I'm not especially fond of the way CVS does merges into the working
directory, either.  As you point out, it's not only that the
repository doesn't store everything you wish it did --- it's that the
working directory management is messy, too.

You're not the only one who has expressed a wish for a more coherent
way to manage merge participants than renaming the original to
`.#foo'.  But unless Karl has surprises up his sleeve, I think we're
going to imitate CVS up to 1.0.  Hopefully, we'll then have a decent
code base to experiment with other merge disciplines.

I think these issues are more important than the time complexity of
version retrieval.

Re: Reality of delta benefit

Posted by Zack Weinberg <za...@wolery.cumb.org>.
On Fri, Aug 25, 2000 at 04:58:19PM -0500, Jim Blandy wrote:
> 
> > At the risk of broken-record-ness, the SCCS weave format is
> > consistently smaller than RCS format for the same set of versions, and
> > takes the same time to extract any revision.
> 
> In the original RCS paper, they explain why they think RCS is superior
> to SCCS.  Essentially, the time to extract any revision from SCCS is
> proportional to the total number of pieces in the SCCS weave, whereas
> the time to extract a revision from RCS depends only on how distant
> that version is from the trunk --- not on the total number of
> revisions present.  Since most accesses are to recent versions, this
> is the right tradeoff.

Only in the absence of branches.  The time to extract a revision in
RCS is proportional to the distance from the HEAD revision, not just
the trunk.  To get the tip of a branch, you have to execute reverse
diffs back to the branchpoint, then forward diffs from the branchpoint
all the way to the tip of branch.  Creating a branch involves
rewriting most of the body.

SCCS, in contrast, can extract the tip of any branch in the same time
as it takes to extract the tip of the trunk.  (Which is the same time
it takes to extract any revision at all.)  It can create a branch at
any position
by adding one metadata record, and do a trivial merge between any two
revisions (one for which rcsmerge would generate no conflicts) by
adding one metadata record.
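
To make the weave point concrete, here is a toy model (my own
illustration -- real SCCS interleaves ^AI/^AD/^AE control records and
tracks ancestry sets, but the cost behavior is the same):

    # Each woven line records which revision inserted it and which
    # revision, if any, deleted it.
    weave = [
        (1, None, "alpha"),
        (1, 3,    "beta"),     # deleted in revision 3
        (2, None, "gamma"),    # added in revision 2
        (3, None, "beta v2"),  # replacement added in revision 3
    ]

    def extract(rev):
        # One pass over the whole weave, no matter which revision is
        # wanted; with branches the test becomes ancestry-set
        # membership instead of <=, but it is still a single pass.
        return [text for ins, dele, text in weave
                if ins <= rev and (dele is None or dele > rev)]

    print(extract(1))  # ['alpha', 'beta']
    print(extract(3))  # ['alpha', 'gamma', 'beta v2']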

If you choose an implementation that gives you dirt cheap branches,
you discover all sorts of neat things you can do with them.  For
example, the equivalent of a 'cvs update' operation in Bitkeeper, when
you have local changes, looks like this:

0) You have a tree like so:

     1.1 -- 1.2 -- 1.3 -- 1.4 (head)

1) Commit your own patches to your private copy of the s.files.

     1.1 -- 1.2 -- 1.3 -- 1.4 -- 1.5

2) Push those patches onto a new branch.

     1.1 -- 1.2 -- 1.3 -- 1.4
                           +--- 1.4.1.1

3) Acquire the new 1.5 revision.

     1.1 -- 1.2 -- 1.3 -- 1.4 -- 1.5
                           \--- 1.4.1.1

4) Merge 1.4.1.1 back into 1.5, producing 1.6.

     1.1 -- 1.2 -- 1.3 -- 1.4 -- 1.5 ------ 1.6
                           \--- 1.4.1.1 ----/

This has huge advantages over the 'mangle everything together in the
working files' approach CVS takes: your original changes are still
available as revision 1.4.1.1.  If you ever need to go back and
disentangle your changes from my changes from Ed's changes, you have
all the information you need to do it.  If your patch is not
immediately accepted to the main repository, you just keep going; the
next time you update, the "1.6" revision will get pushed onto the
1.4.1.x branch.

     1.1 -- 1.2 -- 1.3 -- 1.4 -- 1.5 --\--- 1.6 -- 1.7 ...
                           \--- 1.4.1.1 -- 1.4.1.2 -- 1.4.1.3 ...

And you never have to do that chunk of the merge again; the history
remembers it for you.  (Just like we say SVN will, on the web page,
for inter-repository merges.)
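
To make "it remembers" concrete, here is a toy sketch of the graph
bookkeeping (my own illustration, nothing to do with BitKeeper's
actual data structures): a merge revision simply records both
parents, so a later update can pick up from the recorded merge point.

    # Toy revision graph: revision -> list of parent revisions.
    # This replays steps 1-4 above.
    parents = {"1.1": [], "1.2": ["1.1"], "1.3": ["1.2"], "1.4": ["1.3"]}

    parents["1.5"] = ["1.4"]                 # 1) commit local patch
    parents["1.4.1.1"] = parents.pop("1.5")  # 2) push it onto a branch
    parents["1.5"] = ["1.4"]                 # 3) acquire upstream 1.5
    parents["1.6"] = ["1.5", "1.4.1.1"]      # 4) merge, recording both
                                             #    parents permanently

    print(parents["1.6"])  # ['1.5', '1.4.1.1']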

Or if the changes in 1.5 turned out to be broken, you can go back to
1.4.1.1 and keep working from there until the main tree is functional
again.

  1.1 -- 1.2 -- 1.3 -- 1.4 -- 1.5 --- 1.6 -- 1.7 ...
                       |      \--------- 1.5.1.1 (busted)
                       \--- 1.4.1.1 ---/--- 1.4.1.2 -- 1.4.1.3 ...

I might also add that implementation details of both RCS and SCCS
swamp the intrinsic performance characteristics of the file formats.
Theoretically, RCS's complexity is 

   RCS(HEAD) = O(number of lines in HEAD)
   RCS(rev)  = O(RCS(rev+1) + PATCH(rev+1, rev))
   PATCH(a, b) >= O(max(lines in a, lines in b)).

So RCS winds up being O(m*n) where m is the distance to HEAD, and n is
the number of lines in the longest revision between the one you want
and the HEAD.  SCCS is just 

   SCCS(rev) = O(total number of lines in s.file)

Ideal implementations of RCS would therefore be faster than SCCS for
very short distances from the HEAD.

In practice, RCS reads the entire ,v file whether it needs to or not,
which means it is never faster than a good implementation of SCCS.
However, the SCCS that comes with SysV Unix has a huge constant
factor on its operations, so RCS can *appear* to be faster than SCCS
for the same task.

zw

Re: Reality of delta benefit

Posted by Jim Blandy <ji...@savonarola.red-bean.com>.
> At the risk of broken-record-ness, the SCCS weave format is
> consistently smaller than RCS format for the same set of versions, and
> takes the same time to extract any revision.

In the original RCS paper, they explain why they think RCS is superior
to SCCS.  Essentially, the time to extract any revision from SCCS is
proportional to the total number of pieces in the SCCS weave, whereas
the time to extract a revision from RCS depends only on how distant
that version is from the trunk --- not on the total number of
revisions present.  Since most accesses are to recent versions, this
is the right tradeoff.
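
To spell out the cost model: RCS stores the most recent trunk
revision in full, plus a chain of reverse deltas, and extraction
applies deltas one at a time walking backward.  A toy sketch (my own
illustration, not RCS's actual patch format):

    # Head text stored whole; each older revision is a patch against
    # its successor.
    head = ["line 1", "line 2 (new)", "line 3"]
    reverse_deltas = {
        2: [("replace", 1, "line 2")],  # turn rev 3's text into rev 2's
        1: [("delete", 2, None)],       # turn rev 2's text into rev 1's
    }

    def extract(rev, newest=3):
        text = list(head)
        # Walk backward from the head; cost is proportional to the
        # number of deltas applied, i.e. the distance from the head.
        for r in range(newest - 1, rev - 1, -1):
            for op, pos, line in reverse_deltas[r]:
                if op == "replace":
                    text[pos] = line
                elif op == "delete":
                    del text[pos]
                elif op == "insert":
                    text.insert(pos, line)
        return text

    print(extract(3))  # the head itself: no patching at all
    print(extract(1))  # two patch applications

Extracting the head applies no deltas at all; each step back adds one
patch application, which is exactly the tradeoff described above.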

Re: Reality of delta benefit

Posted by "Jonathan S. Shapiro" <sh...@eros-os.org>.
> Did you try compressing the ,v files?  It seems likely that delta
> encoding and then general compression would do better than compressing
> each version individually.

I hadn't, but I have now, and you are right. In case anyone is interested,
here are the actual numbers. If anyone wants me to look at other options
(within reason) I'll be glad to do so. I'd also be interested to run the
same analysis on other/larger repositories. Since I seem to have the space
at the moment, please let me know if you'd be okay with having me duplicate
a complete repository in order to do the analysis.

In the EROS repository, there are a total of 34260 versions across 9858 RCS
files.

Output of du -sk . at the root of the tree:

Blocks    Reduced to    Description
 67480    18.9%         gzip -9 of RCS files
132594    37.3%         RCS files
179276    50.4%         bzip2 each version
184288    51.8%         gzip -9 each version
355548    100%          uncompressed

I started looking at this because I wanted to understand the value of delta
coding both on the wire and in the repository. These numbers may not
generalize to other projects, but here are my initial reactions to them:

Going from individually compressed versions to uncompressed RCS files
saves about 25%. In some cases it might be worth giving that saving up:
formats that separate every version are more robust -- the fewer versions
you stream through memory in each operation, the fewer versions there are
to corrupt.

However, the further compressibility of RCS files is impressive and
surprising. It suggests (to me) that there is a huge amount of repetition
across files in this repository, and that a compression scheme that built
an initial alphabet by looking across files might do considerably better
than gzip. This can certainly be done by an offline compressor.

If done by an offline compressor, the format can (with care) continue to be
self-describing in the style of normal compressed files. It's a question of
dictionary encoding.
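
As a concrete illustration of the dictionary idea: zlib already supports
preset dictionaries, so one could seed the compressor with bytes
harvested from sibling files. A minimal sketch of how that might look,
not something I've measured:

    import zlib

    def make_dict(sibling_texts):
        # zlib only looks at the last 32KB of a preset dictionary,
        # with the most useful strings placed at the end.
        return b"".join(sibling_texts)[-32768:]

    def compress(data, zdict):
        c = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS, zdict=zdict)
        return c.compress(data) + c.flush()

    def decompress(blob, zdict):
        # The receiver must hold the identical dictionary.
        d = zlib.decompressobj(zlib.MAX_WBITS, zdict=zdict)
        return d.decompress(blob) + d.flush()

    shared = make_dict([b"#include <stdio.h>\n", b"int main(void)\n{\n"])
    body = b"#include <stdio.h>\nint main(void)\n{\n    return 0;\n}\n"
    blob = compress(body, shared)
    assert decompress(blob, shared) == body
    print(len(body), "->", len(blob))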

In the context of wire transmission, this has advantages over deltas. If
the server ships the client a delta but the client doesn't have the base
version, the server must recurse to ship the base as well, and it soon
becomes better to have shipped the whole file compressed in the first
place. If the compressed content is as efficient as the delta, it's
better to ship standalone compressed entities.

Indeed, one can think of this as metadeltas -- the mutually shared
dictionary is in effect a preconditioned delta system.

Mostly just thinking out loud.

shap

Re: Reality of delta benefit

Posted by Zack Weinberg <za...@wolery.cumb.org>.
On Wed, Aug 16, 2000 at 05:47:39PM -0400, Jonathan S. Shapiro wrote:
> In case this is useful to the delta storage debate...
> 
> The other day, in the process of figuring out how big a farm to buy, I ran a
> size analysis on the EROS source repository. Basically, I expanded every
> version ever and examined the disk consumption in blocks under various
> compression schemes: RCS, gzip, bzip2, etc.
> 
> Somewhat to my surprise, RCS is only very slightly better than gzip (normal
> compression -- nothing fancy) on this data set and roughly the same as
> bzip2.
> 
> Is this comparable to other measurements people have seen? If so, it strikes
> me that storing deltas in the repository itself may multiply I/O's and file
> operations for very little benefit.

Did you try compressing the ,v files?  It seems likely that delta
encoding and then general compression would do better than compressing
each version individually.

At the risk of broken-record-ness, the SCCS weave format is
consistently smaller than RCS format for the same set of versions, and
takes the same time to extract any revision.

zw