Posted to users@subversion.apache.org by Thomas Harold <th...@nybeta.com> on 2014/12/03 16:46:07 UTC

Re: Efficiency of rep-sharing (deduplication) in 1.8 and later (chunking?)

> 
> Representation cache is based on the sha of the rep.  So it does not
> matter what the filename is or where it is stored.  If it has the same
> sha as an existing rep, then it will be shared.
> 
> The small improvement in 1.8 was simply to do this for files being added
> within the same revision, but the other scenario was already supported.
> 
> I think it is worth pointing out that a rep is not necessarily a "file".
> It is the specific delta that SVN would be storing in the repository DB.
> 

One improvement that I'd like to suggest is that files over 1 MiB (or
perhaps 4 or 8 MiB) be "chunked" before rep-sharing is calculated.

http://blog.clearpathsg.com/blog/bid/254076/Understanding-Variable-Length-Deduplication

My thinking is that there might be storage gains to be made if
rep-sharing is done at a lower level than the file level for files
over a particular size.  For instance, if you commit a few hundred
mid-size files (5-15 MB or larger), there is probably a lot of
identical data between them (if the files are not already compressed).
Those identical chunks could possibly be found via a variable-length
deduplication algorithm and deduped across the repository.
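
To make the idea concrete, here is a minimal Python sketch of
variable-length (content-defined) chunking followed by SHA-1 based
sharing of the chunks.  It is not how Subversion's rep cache works
today, and it ignores deltification entirely.  The chunk-size
parameters, the gear-style rolling hash, and the function names
chunks() and dedup_stats() are all assumptions made for illustration.

```python
import hashlib
import random

# Hypothetical tuning values, chosen only for illustration.
MIN_CHUNK = 2 * 1024                 # never cut before 2 KiB
MAX_CHUNK = 64 * 1024                # always cut by 64 KiB
CUT_MASK = ((1 << 13) - 1) << 51     # top 13 bits zero: a cut roughly every 8 KiB

# Gear table: one pseudo-random 64-bit value per byte value.
# A fixed seed keeps chunk boundaries reproducible between runs.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]


def chunks(data):
    """Yield variable-length chunks whose boundaries depend on content."""
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Rolling hash: older bytes are shifted out after 64 steps,
        # so the value depends only on the most recent bytes.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        size = i - start + 1
        if size < MIN_CHUNK:
            continue
        if (h & CUT_MASK) == 0 or size >= MAX_CHUNK:
            yield data[start:i + 1]
            start = i + 1
            h = 0
    if start < len(data):
        yield data[start:]      # trailing partial chunk


def dedup_stats(files):
    """Return (total_bytes, unique_bytes) when chunks are shared by SHA-1."""
    seen = set()
    total = unique = 0
    for data in files:
        for chunk in chunks(data):
            total += len(chunk)
            key = hashlib.sha1(chunk).digest()
            if key not in seen:
                seen.add(key)
                unique += len(chunk)
    return total, unique
```

Because the cut points are derived from the data rather than from fixed
offsets, two files that contain the same large block at different
offsets should end up with mostly identical chunks, which fixed-size
blocks would miss.  Comparing the two numbers returned by
dedup_stats() over a sample of real files would give a rough idea of
whether the extra savings are worth the complexity.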

IIRC when I moved our repos from 1.6 to 1.8 format, space usage went
down by 10-15% from rep-sharing.  I wouldn't mind having another 5-10%
space savings.