Posted to users@subversion.apache.org by Daniel Shahaf <d....@daniel.shahaf.name> on 2014/12/06 12:17:05 UTC

Re: Efficiency of rep-sharing (deduplication) in 1.8 and later

Mark Phippard wrote on Fri, Sep 12, 2014 at 11:24:43 -0400:
> On Fri, Sep 12, 2014 at 11:17 AM, Thomas Harold <th...@nybeta.com>
> wrote:
> 
> > I have a question about how efficient SVN is at de-duplication within a
> > repository with regards to files that appear in multiple locations, but
> > which have the same content.
> >
> > I know a small improvement was made in 1.8...
> >
> > http://subversion.apache.org/docs/release-notes/1.8.html#fsfs-enhancements
> >
> > > When representation sharing has been enabled, Subversion 1.8 will now
> > > be able to detect files and properties with identical contents within
> > > the same revision and only store them once. This is a common
> > > situation when you for instance import a non-incremental dump file or
> > > when users apply the same change to multiple branches in a single
> > > commit.
> >
> > #1 - If a commit puts files A, B and C into the repository, and a later
> > commit puts files B, C and D into the repository at a different
> > location, is SVN smart enough to realize that B and C are already stored
> > in the repository?
> >
> > In other words, does it track each individual file separately, even if
> > they were all part of one big revision?
> >
> 
> Representation cache is based on the sha of the rep.  So it does not matter
> what the filename is or where it is stored.  If it has the same sha as an
> existing rep, then it will be shared.
> 
> The small improvement in 1.8 was simply to do this for files being added
> within the same revision, but the other scenario was already supported.
> 
> I think it is worth pointing out that a rep is not necessarily a "file".
>  It is the specific delta that SVN would be storing in the repository DB.

The sha1 of the rep itself doesn't matter.  The rep-cache.db file is a
cache of (sha1 of fulltext ↦ location of rep generating that fulltext).
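
To make that mapping concrete, here is a toy sketch in Python. It is not
the FSFS code and not the real rep-cache.db schema: the class name
RepCache, the store() method and the (revision, index) location tuple
are all made up for illustration. The only thing it is meant to show is
that the key is the sha1 of the fulltext, so path and filename never
enter the lookup.

```python
import hashlib


class RepCache:
    """Toy model of a cache keyed by sha1(fulltext).

    Hypothetical illustration only; the names and the location format
    are assumptions, not Subversion's actual data structures.
    """

    def __init__(self):
        # hex sha1 of fulltext -> (revision, index within that revision)
        self._by_sha1 = {}
        # revision -> next free index (stand-in for a real on-disk offset)
        self._next_index = {}

    def store(self, revision, fulltext):
        """Return (location, shared) for the given fulltext bytes.

        shared is True when an existing rep already produces this
        fulltext, in which case nothing new is stored.
        """
        key = hashlib.sha1(fulltext).hexdigest()
        hit = self._by_sha1.get(key)
        if hit is not None:
            return hit, True

        index = self._next_index.get(revision, 0)
        self._next_index[revision] = index + 1
        location = (revision, index)
        self._by_sha1[key] = location
        return location, False


# Thomas's scenario: r1 adds A, B, C; r2 adds B, C, D somewhere else.
cache = RepCache()
r1 = {name: cache.store(1, name.encode()) for name in ("A", "B", "C")}
r2 = {name: cache.store(2, name.encode()) for name in ("B", "C", "D")}
```

In this sketch the r2 entries for B and C come back with shared set to
True and point at the locations recorded in r1; only D gets a new
location.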

As to the idea of doing the sha1 at chunk level rather than at file
level: I suggest discussing that on dev@.  Some backend devs might
otherwise miss the discussion.

Cheers,

Daniel