Posted to users@subversion.apache.org by Philip Martin <ph...@wandisco.com> on 2016/02/01 11:06:19 UTC

Re: Svn 1.9 repository 20% bigger than svn 1.8 repository

Stefan Fuhrmann <st...@apache.org> writes:

> So, all user content is there and merely the deduplication failed
> (as already being investigated elsewhere in this thread).

I suppose format 7 might allow us to implement a system that fixes
missing deduplication during packing.

-- 
Philip Martin
WANdisco

Re: Svn 1.9 repository 20% bigger than svn 1.8 repository

Posted by Stefan Fuhrmann <st...@apache.org>.

On 01.02.2016 11:11, Stefan Sperling wrote:
> On Mon, Feb 01, 2016 at 10:06:19AM +0000, Philip Martin wrote:
>> Stefan Fuhrmann <st...@apache.org> writes:
>>
>>> So, all user content is there and merely the deduplication failed
>>> (as already being investigated elsewhere in this thread).
>>
>> I suppose format 7 might allow us to implement a system that fixes
>> missing deduplication during packing.

At least we can scan the file for representation info
in noderevs and update the rep-cache.db accordingly.
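
A rough sketch of that idea, for illustration only. The table layout
below is a simplified assumption, not the real rep-cache.db schema, and
the tuples in `reps` are made-up stand-ins for whatever a scan of the
noderevs would yield. It only shows the "add what is missing, keep what
is already known" step:

```python
import sqlite3

def update_rep_cache(conn, reps):
    """Record each representation hash once; return how many rows were added.

    `reps` is an iterable of (hash, revision, item, size) tuples, a
    hypothetical result of scanning noderevs for representation info.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS rep_cache ("
        " hash TEXT PRIMARY KEY, revision INTEGER, item INTEGER, size INTEGER)"
    )
    before = conn.total_changes
    # INSERT OR IGNORE keeps the first location seen for a given hash.
    conn.executemany("INSERT OR IGNORE INTO rep_cache VALUES (?, ?, ?, ?)", reps)
    conn.commit()
    return conn.total_changes - before

conn = sqlite3.connect(":memory:")
reps = [
    ("aaa", 10, 3, 100),
    ("bbb", 11, 5, 200),
    ("aaa", 12, 7, 100),  # same content again in a later revision
]
added = update_rep_cache(conn, reps)
```

Running the scan a second time over the same data would add nothing,
which is the property a repair step during pack would want.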

> And perhaps get rid of sqlite in the repository while at it?

Format 7 assumes that there is a 1:1 relationship between
logical ID and physical locations.  So, you can't simply
make two entries in the L2P index point to the same phys.
item without breaking the P2L index.
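
To make the 1:1 constraint concrete, here is a toy model. The real
indexes are binary on-disk structures, so the plain dicts and the
`build_p2l` helper are assumptions made purely for illustration:

```python
# Toy model only: l2p maps a logical item ID to a physical offset,
# and p2l is the reverse mapping from offset back to item ID.

def build_p2l(l2p):
    """Invert an L2P mapping; fail if two logical IDs share one offset."""
    p2l = {}
    for item_id, offset in l2p.items():
        if offset in p2l:
            raise ValueError(
                f"offset {offset} is claimed by items {p2l[offset]} and {item_id}"
            )
        p2l[offset] = item_id
    return p2l

# One physical item per logical ID: the reverse index exists.
ok = build_p2l({1: 0, 2: 120, 3: 480})

# Items 2 and 3 hold identical content, so try pointing both at offset 120.
try:
    build_p2l({1: 0, 2: 120, 3: 120})
    deduplicated = True
except ValueError:
    deduplicated = False
```

In this model the second call fails, because an offset can name only one
logical item in the reverse index.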

Since we can't rewrite the references in all future reps
that point to any redundant one, we need to stick with the
same number of logical and physical items.  A format 8,
however, could allow for "duplicate" P2L entries where
N-1 items get flagged as "shared".  That would be a low-
risk bookkeeping change.
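
A sketch of what such "duplicate" P2L entries could look like. Format 8
does not exist, so the `P2LEntry` type, its `shared` field and the rule
that the lowest item ID owns the physical item are all hypothetical:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class P2LEntry:
    offset: int    # physical location of the item
    item_id: int   # logical ID that refers to it
    shared: bool   # True for the N-1 extra references to the same offset

def build_p2l_with_sharing(l2p):
    """Build P2L entries, allowing several logical IDs per offset."""
    by_offset = defaultdict(list)
    for item_id, offset in sorted(l2p.items()):
        by_offset[offset].append(item_id)

    entries = []
    for offset in sorted(by_offset):
        owner, *others = by_offset[offset]
        entries.append(P2LEntry(offset, owner, shared=False))
        entries.extend(P2LEntry(offset, i, shared=True) for i in others)
    return entries

# Items 2, 3 and 4 all point at the same physical item.
entries = build_p2l_with_sharing({1: 0, 2: 120, 3: 120, 4: 120})
```

The number of logical items is unchanged (four entries), while only two
distinct offsets are in use and two entries carry the "shared" flag.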

That said, there are limitations to that approach:  Cache
contents are logically addressed, i.e. even if two IDs
pointed to the same location, the item would be cached twice.
So, we would simply save some disk space.  Depending on how
many active branches there are, the lower cache efficiency
may not be an issue.
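
The caching limitation can be shown with a small model. The dict-based
cache and the `get_item` / `read_physical` helpers are illustrative
assumptions and not Subversion's actual caching code:

```python
# Two logical IDs that resolve to the same physical offset.
l2p = {2: 120, 3: 120}
disk = {120: b"identical file content"}

reads = []   # records every physical read
cache = {}   # keyed by logical item ID, not by physical offset

def read_physical(offset):
    reads.append(offset)
    return disk[offset]

def get_item(item_id):
    """Return item content, consulting the logically addressed cache first."""
    if item_id not in cache:
        cache[item_id] = read_physical(l2p[item_id])
    return cache[item_id]

get_item(2)
get_item(3)   # same bytes on disk, but a cache miss: the key differs
get_item(2)   # cache hit
```

The shared item is read from disk twice and occupies two cache slots, so
in this model sharing saves disk space but not cache memory.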

Another problem with "pack" replacing the rep-cache.db is
that deduplication often happens as a result of merges
and those often cross 1k shard boundaries.

One option would be to defer deduplication to the pack
phase and use the rep-cache.db exclusively during that
operation.

-- Stefan^2.

RE: Svn 1.9 repository 20% bigger than svn 1.8 repository

Posted by Bert Huijben <be...@qqmail.nl>.


> -----Original Message-----
> From: Stefan Sperling [mailto:stsp@elego.de]
> Sent: maandag 1 februari 2016 11:11
> To: Philip Martin <ph...@wandisco.com>
> Cc: Stefan Fuhrmann <st...@apache.org>; Gert Kello
> <ge...@gmail.com>; users@subversion.apache.org
> Subject: Re: Svn 1.9 repository 20% bigger than svn 1.8 repository
> 
> On Mon, Feb 01, 2016 at 10:06:19AM +0000, Philip Martin wrote:
> > Stefan Fuhrmann <st...@apache.org> writes:
> >
> > > So, all user content is there and merely the deduplication failed
> > > (as already being investigated elsewhere in this thread).
> >
> > I suppose format 7 might allow us to implement a system that fixes
> > missing deduplication during packing.
> 
> And perhaps get rid of sqlite in the repository while at it?

I think at least that last part will require a format 8. Optimizing pack
should (theoretically) be possible without a format bump. We could even
backport changes that allow this, but I'm still waiting on real world test
experience with format 7. 

There are still far too many users delaying their upgrades to 1.9 waiting
for others to switch :(

Waiting for the ASF to perform a major upgrade is one reason I hear quite
often...

	Bert

Re: Svn 1.9 repository 20% bigger than svn 1.8 repository

Posted by Stefan Sperling <st...@elego.de>.

On Mon, Feb 01, 2016 at 10:06:19AM +0000, Philip Martin wrote:
> Stefan Fuhrmann <st...@apache.org> writes:
> 
> > So, all user content is there and merely the deduplication failed
> > (as already being investigated elsewhere in this thread).
> 
> I suppose format 7 might allow us to implement a system that fixes
> missing deduplication during packing.

And perhaps get rid of sqlite in the repository while at it?