Posted to dev@subversion.apache.org by Philip Martin <ph...@wandisco.com> on 2015/05/27 20:14:29 UTC

Populating the rep-cache

Julian Foad <ju...@gmail.com> writes:

> Stefan Fuhrmann wrote:
>> * clear the rep-cache.db
>
> Clearing the cache and continuing operation may make subsequent
> commits much larger than they should be, and there is no easy way to
> undo that if it happens.

I've been thinking of writing some code to populate the rep-cache from
existing revisions.  This code would parse the revision, a bit like
verify, identify checksums in that revision and add any that are found
to the rep-cache.  This would be time consuming if run on the whole
repository but would run perfectly well in a separate process while the
repository remains live.  It could also be run over a revision range
rather than just the whole repository, and running on a single revision
such as HEAD would be fast.
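
A minimal sketch of that population loop, not actual Subversion code. It
assumes a SQLite table shaped roughly like the FSFS rep-cache (hash,
revision, offset, size, expanded_size) and a hypothetical read_reps()
callback standing in for the code that parses a revision and yields the
checksums found in it. Existing entries are left alone, so the loop can be
rerun over any revision range.

```python
import sqlite3

# Assumed table layout, modelled on the FSFS rep-cache; illustrative only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS rep_cache (
  hash TEXT NOT NULL PRIMARY KEY,
  revision INTEGER NOT NULL,
  "offset" INTEGER NOT NULL,
  size INTEGER NOT NULL,
  expanded_size INTEGER NOT NULL
)
"""


def populate(db, read_reps, start, end):
    """Add reps found in revisions start..end; return the number of new rows.

    read_reps(rev) is a hypothetical parser yielding
    (sha1, offset, size, expanded_size) tuples for one revision.
    """
    db.execute(SCHEMA)
    before = db.total_changes
    for rev in range(start, end + 1):
        with db:  # one transaction per revision, so readers are not blocked for long
            for sha1, offset, size, expanded in read_reps(rev):
                # Keep the first (oldest) entry for a checksum; skip duplicates.
                db.execute(
                    "INSERT OR IGNORE INTO rep_cache VALUES (?, ?, ?, ?, ?)",
                    (sha1, rev, offset, size, expanded),
                )
    return db.total_changes - before


def fake_reps(rev):
    # Made-up data standing in for real revision parsing.
    data = {1: [("aa", 0, 10, 20)], 2: [("aa", 5, 10, 20), ("bb", 30, 7, 9)]}
    return data.get(rev, [])


db = sqlite3.connect(":memory:")
first = populate(db, fake_reps, 1, 2)   # whole range
second = populate(db, fake_reps, 2, 2)  # rerun on the last revision only
```

Running on a single revision is just the case start == end, which is why
the HEAD-only run stays cheap.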

I believe the code will be relatively straightforward; if anything, it is
the API that is more of a problem.

 - We could add a public svn_fs_rep_cache().  This is backend specific
   but there is precedent: we have svn_fs_berkeley_logfiles() and
   svn_fs_pack().

 - We could add a more general svn_fs_optimize().  This would do
   backend-specific optimizations that may change in future versions.
   Perhaps passing backend-specific flags?

 - We could add the behaviour to svn_fs_recover() by revving the function
   with a revision range.  This would "recover" the rep-cache after the
   existing recovery.  At present recover is fast, so to preserve that
   the compatibility function would pass a revision range that is just
   HEAD.

 - We could avoid a public API and call some FSFS function from svnfsfs.
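
To illustrate the compatibility idea in the third option, here is a rough
sketch in Python rather than C. The names recover2() and recover() and
the dict standing in for a filesystem object are hypothetical, not
existing Subversion functions. The revved function takes a revision
range, and the old entry point keeps its call shape by passing HEAD only.

```python
def recover2(fs, start_rev, end_rev):
    """Hypothetical revved recover: existing recovery, then rep-cache work.

    Returns the list of steps it would perform, for illustration only.
    """
    steps = ["recover"]
    steps += [f"rep-cache r{rev}" for rev in range(start_rev, end_rev + 1)]
    return steps


def recover(fs):
    """Compatibility wrapper: same call shape as before, range is just HEAD."""
    return recover2(fs, fs["head"], fs["head"])


fs = {"head": 7}  # stand-in for a filesystem whose youngest revision is 7
quick = recover(fs)
full = recover2(fs, 0, fs["head"])
```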

I'll probably go with the last option initially.  Any comments?

I should note that WANdisco has an interest in this code being
developed.

-- 
Philip Martin | Subversion Committer
WANdisco // *Non-Stop Data*

Re: Populating the rep-cache

Posted by Julian Foad <ju...@gmail.com>.
Philip Martin wrote:
> I've been thinking of writing some code to populate the rep-cache from
> existing revisions.  This code would parse the revision, a bit like
> verify, identify checksums in that revision and add any that are found
> to the rep-cache.  This would be time consuming if run on the whole
> repository but would run perfectly well in a separate process while the
> repository remains live.  It could also be run over a revision range
> rather than just the whole repository, and running on a single revision
> such as HEAD would be fast.

+1.

> I believe the code will be relatively straightforward; if anything, it is
> the API that is more of a problem.
>
>  - We could add a public svn_fs_rep_cache().  This is backend specific
>    but there is precedent: we have svn_fs_berkeley_logfiles() and
>    svn_fs_pack().
>
>  - We could add a more general svn_fs_optimize().  This would do backend
>    specific optimizations that may change in future versions.  Perhaps
>    passing backend-specific flags?
>
>  - We could add the behaviour to svn_fs_recover() by revving the function
>    with a revision range.  This would "recover" the rep-cache after the
>    existing recovery.  At present recover is fast, so to preserve that
>    the compatibility function would pass a revision range that is just
>    HEAD.
>
>  - We could avoid a public API and call some FSFS function from svnfsfs.
>
> I'll probably go with the last option initially.  Any comments?

I think the interface to this should be explicit, not hidden in a
generic 'optimize' or 'recover' function. The last option sounds good
as a starting point.

Other than that, I have no opinions on the API yet, nor on the
specific range of functionality that it should offer (examples:
revision ranges, validating existing entries, clearing part or all of
the cache).

> I should note that WANdisco has an interest in this code being
> developed.

I suppose many companies and power users have to deal with issues
where this would be useful.

It might also be useful to consider whether and how Subversion could
tell us whether the rep cache is up to date. I haven't thought about
this, but as an initial idea, tracking the last revision number N where
all revs [0 .. N] are known to be cached would be a possible starting
point for such a feature.
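
As a small sketch of that bookkeeping (the function name and the idea of
keeping a set of populated revisions are assumptions made for
illustration, not an existing Subversion mechanism): N is the end of the
unbroken run of cached revisions starting at 0.

```python
def cached_through(populated_revs):
    """Return the largest N such that every revision in 0..N is populated.

    Returns -1 when revision 0 itself is missing.
    """
    seen = set(populated_revs)
    n = -1
    while n + 1 in seen:
        n += 1
    return n


with_gap = cached_through([0, 1, 2, 4])  # r3 is missing, so N stops at 2
no_base = cached_through([1, 2])         # r0 is missing, so nothing counts
```

A gap anywhere below N would invalidate the guarantee, so any tool that
clears or partially rebuilds the cache would have to lower N as well.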

- Julian

Re: Populating the rep-cache

Posted by Johan Corveleyn <jc...@gmail.com>.
On Thu, May 28, 2015 at 6:00 PM, Stefan Fuhrmann
<st...@wandisco.com> wrote:
> On Wed, May 27, 2015 at 8:14 PM, Philip Martin <ph...@wandisco.com>
> wrote:
>>
>> Julian Foad <ju...@gmail.com> writes:
>>
>> > Stefan Fuhrmann wrote:
>> >> * clear the rep-cache.db
>> >
>> > Clearing the cache and continuing operation may make subsequent
>> > commits much larger than they should be, and there is no easy way to
>> > undo that if it happens.
>>
>> I've been thinking of writing some code to populate the rep-cache from
>> existing revisions.  This code would parse the revision, a bit like
>> verify, identify checksums in that revision and add any that are found
>> to the rep-cache.  This would be time consuming if run on the whole
>> repository but would run perfectly well in a separate process while the
>> repository remains live.  It could also be run over a revision range
>> rather than just the whole repository, and running on a single revision
>> such as HEAD would be fast.
>
>
> Makes sense.
>
>>
>> I believe the code will be relatively straightforward; if anything, it is
>> the API that is more of a problem.
>>
>>  - We could add a public svn_fs_rep_cache().  This is backend specific
>>    but there is precedent: we have svn_fs_berkeley_logfiles() and
>>    svn_fs_pack().
>>
>>  - We could add a more general svn_fs_optimize().  This would do backend
>>    specific optimizations that may change in future versions.  Perhaps
>>    passing backend-specific flags?
>
>
> I think svn_fs_optimize(bool online) would make sense
> in the longer term.
>
> In the "offline" case, it could do anything from removing
> duplicate reps as we build the cache to sharding repos
> or repacking shards. Not that I would want to implement
> any of that soon.

I was wondering about that too. I think repopulating the rep-cache
(without the need to take the repos offline) is very interesting, but
I immediately think: functionality to repopulate the rep-cache *and*
(optionally) rewrite rev files to let them use rep sharing (i.e.
effectively deduplicating the repository) ... that would be even
better.

But big +1 on the initial idea already for offering the ability to
rebuild a broken rep-cache (without having to dump/load).

-- 
Johan

Re: Populating the rep-cache

Posted by Stefan Fuhrmann <st...@wandisco.com>.
On Wed, May 27, 2015 at 8:14 PM, Philip Martin <ph...@wandisco.com>
wrote:

> Julian Foad <ju...@gmail.com> writes:
>
> > Stefan Fuhrmann wrote:
> >> * clear the rep-cache.db
> >
> > Clearing the cache and continuing operation may make subsequent
> > commits much larger than they should be, and there is no easy way to
> > undo that if it happens.
>
> I've been thinking of writing some code to populate the rep-cache from
> existing revisions.  This code would parse the revision, a bit like
> verify, identify checksums in that revision and add any that are found
> to the rep-cache.  This would be time consuming if run on the whole
> repository but would run perfectly well in a separate process while the
> repository remains live.  It could also be run over a revision range
> rather than just the whole repository, and running on a single revision
> such as HEAD would be fast.
>

Makes sense.


> I believe the code will be relatively straightforward; if anything, it is
> the API that is more of a problem.
>
>  - We could add a public svn_fs_rep_cache().  This is backend specific
>    but there is precedent: we have svn_fs_berkeley_logfiles() and
>    svn_fs_pack().
>
>  - We could add a more general svn_fs_optimize().  This would do backend
>    specific optimizations that may change in future versions.  Perhaps
>    passing backend-specific flags?
>

I think svn_fs_optimize(bool online) would make sense
in the longer term.

In the "offline" case, it could do anything from removing
duplicate reps as we build the cache to sharding repos
or repacking shards. Not that I would want to implement
any of that soon.

OTOH, a new FS API only makes sense if we can control
it nicely and generically from svnadmin or its ilk. It seems
to me that a generic "make stuff better" optimize run has
its merits (e.g. after an svnadmin upgrade) but most people
probably want to tune only specific aspects. That's because
they are likely to have large repos that they can't take
offline for long.


>  - We could add the behaviour to svn_fs_recover() by revving the function
>    with a revision range.  This would "recover" the rep-cache after the
>    existing recovery.  At present recover is fast, so to preserve that
>    the compatibility function would pass a revision range that is just
>    HEAD.
>

There is nothing inherently wrong or broken with having an
incomplete rep cache. So, making this part of the recovery
procedure feels wrong.

>  - We could avoid a public API and call some FSFS function from svnfsfs.
>

That is probably the best place even longer-term.

-- Stefan^2.