You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@subversion.apache.org by Stefan Fuhrmann <eq...@web.de> on 2011/09/19 10:42:55 UTC

Re: server-side log cache

On 29.08.2011 18:35, Stefan Sperling wrote:

Sorry for the late response. I have been knocked
out for a while.
> On Sun, Aug 28, 2011 at 03:46:03PM +0200, Stefan Fuhrmann wrote:
>>> See http://svn.apache.org/repos/asf/subversion/branches/fs-successor-ids/BRANCH-README
>>> for what this is all about.
>> But the assumptions in that file are actually not valid.
> Which ones are invalid? Can you explain in detail?
It makes at least 5 implicit assumptions:

* Noderevs are the relevant level on which user-level
   questions can be answered.
-> That might not be the case. Noderevs are a purely
    internal construct. In particular, a noderev might get
    copied but the actual delta is empty. Thinking about
    svn obliterate & friends that might actually happen
    as a result of some node removal / replacement.

* User-level queries (where got node copied to?) and
   internal queries (what are the copy dependencies?)
   are looking for the same information.
-> This is linked to the previous. Noderev copy-to info
    can be useful to determine the dependency graph.
    The point here is that we are mixing queries on two
    different layers here. If we need both, they should
    both exist as separate APIs.

* Noderevs changes are relevant for all svn obliterate.
-> That may not always be true. If you only want to
    remove "unused" history, the references in the skip-
    delta tree express the relevant dependencies.

* Noderev changes are the sufficient to answer the
   "where has this been copied to?" question.
-> They are quite unhelpful here. Noderevs might
    help find you changes on the node itself (see first
    assumption). But you need to evaluate the user-
    level copy operations of all parent folders for all
    currently known copies. There are far less of these
    copies (e.g. branch creation) than noderev changes
    (typ. multiple per revision).

* A noderev-based cache alone would reduce the
   amount of data to inspect / process for these queries.
-> The noderev cache is *larger* than e.g. a log cache
    due to each change also modifying all parent nodes.
    To be efficient, you need a node-level copy-to cache
    as well (see previous item).
>
>> * "Where did path P at rev N to M directly or indirectly
>>   copied to, e.g. which releases contain a certain faulty
>>   code segment; optionally list changes to targets?"
>> ->  needs to scan parent paths for copies, too
>>    (reversed "log", revision graph)
> Yes, the successor-id cache only gives us operation roots.
> Information for child nodes needs to be derived -- it is not
> within the scope of the cache itself.
That is unnecessarily complicated / expensive.
Using the proposed log cache, the relevant
information can be found in a *single*, highly
efficient scan.
>
>> It turns out that we can produce even humongous
>> reverse logs (50+ k nodes, thousands of branches
>> and tags) within a second by simply performing
>> a full history scan.
>>
>> A example of how the whole process can be
>> implemented efficiently, can be found here:
>>
>> https://tortoisesvn.googlecode.com/svn/trunk/src/TortoiseProc/RevisionGraph/FullGraphBuilder.cpp
> I'll take a look at that, thanks!
The gist of it is that you keep all currently known
copy targets (e.g. branches, tags) in a tree. For
each revision, you walk that tree to find the affected
sub-tree. The LRU path gets optimized to speed up
the next look-up.

A usually fixed repo layout plus typical change
patterns result in a quasi-constant processing
time for each revision - independent from the
number of copies already found.

Things become much simpler and faster if you
eliminate string operations and use path IDs
instead. TSVN does an "svn log" at over 10M
revs per second.
>>> Storage of successor IDs is essentially a cache of the result of the
>>> inversion of the predecessor relation between all node-revisions.
>>> This inversion is a *very* expensive operation (follow every node
>>> that ever existed in the repository from its most recent revision
>>> back to the revision it was created in).
>> Not true. At least the "*very* expensive" part.
>> Log -v takes about 1min for AFS over ra_local.
>> Building any of the index data described below
>> can be done in<  10s.
> Any of it (if so, which parts?), or all of it?
All of it, of course. The expensive part is "svn log -v"
which may be expensive on slow disks. Everything
else is just writing a few MB of data.
>> I propose a modified version of TSVN's log cache
>> data structure. The adaptations:
>>
>> * split into multiple files to reduce modification overhead
>> * remove rev-prop-dependent information (log comment,
>>    author, timestamp)
>> * add the reverse copy info
>> * simplify the data structures
> This looks very interesting.
>
> What about FSFS-specific requirements?
See assumptions above, this may require a different
data structure. But I think that noderev dependencies
will turn out to be redundant if you have a log cache
and access to the skip-delta forwards dependencies.
> It sounds like you avoid those by storing data in semantics of the repos
> layer (path@revision) instead of the FS layer (node-revision-id)?
Yes.
> In this case separate implementations for FSFS and BDB aren't needed.
> This could be an advantage (e.g. third party FS implementations
> wouldn't need to change to support this).
It also eliminates on of the performance weaknesses
of SVN today: A log on some old / seldom changed
path can take a very long time.
>
> I'll think about this some more, thanks.
>
Welcome ;)

-- Stefan^2.

Re: server-side log cache

Posted by Stefan Fuhrmann <eq...@web.de>.

On 30.09.2011 18:19, Stefan Sperling wrote:
> On Thu, Sep 22, 2011 at 08:43:14PM +0200, Stefan Fuhrmann wrote:
>>>>> This looks very interesting.
>>>>>
>>>>> What about FSFS-specific requirements?
>>>> See assumptions above, this may require a different
>>>> data structure. But I think that noderev dependencies
>>>> will turn out to be redundant if you have a log cache
>>>> and access to the skip-delta forwards dependencies.
>>>>> It sounds like you avoid those by storing data in semantics of the repos
>>>>> layer (path@revision) instead of the FS layer (node-revision-id)?
>>>> Yes.
>>>>> In this case separate implementations for FSFS and BDB aren't needed.
>>>>> This could be an advantage (e.g. third party FS implementations
>>>>> wouldn't need to change to support this).
>>>> It also eliminates on of the performance weaknesses
>>>> of SVN today: A log on some old / seldom changed
>>>> path can take a very long time.
>>>>> I'll think about this some more, thanks.
>>>>>
>>>> Welcome ;)
>
> Reviving this thread.
>
> Your concerns about a node-rev based approach seem to resolve largely
> around performance, not about correctness.
Effectiveness, to be precise.
> I.e. you agree that a
> node-rev-based solution as currently being worked on within the
> fs-successor-ids branch will work correctly, but won't perform
> as well as your proposal for certain queries, right?
Since it does not add any information, the node-rev-based
approach will not *cause* incorrect behavior. In that sense,
I agree.

But I still fail to see how it will be effective except for a
very, very specific use-case. I probably just haven't understood
your use-case. Could you give a short description of the problem
that you are trying to solve and how the node-rev cache will help?
>
> Now, I don't feel comfortable trying to implement your design.
> The reason for this is that you could do a much better job at this yourself.
That's fine with me. I won't have the time to do that this year,
though.
>
> However, I do feel very comfortable continuing the work we've started on
> the fs-successor-ids branch.
Having that implementation available will certainly do no harm.
>
> I also think that the two approaches can complement each other.
> We are not in an either/or situation. We will get correct answers either
> way and the only real difference is performance.
>
> Note also that our plan for putting successor-IDs into the filesystem
> layer we will also solve the problem of creating the successor data
> during an upgrade from SVN 1.7 to 1.8.
> Both approaches need to solve this somehow, and we'd have that part
> sorted out for you.
I don't see a problem here. If necessary, we could extend the FS
layer API with version check methods etc.
>
> So, what about this: We implement successor-IDs in the filesystem
> as planned on the fs-successor-ids branch.
> Once we have that, and when you have time, you adapt your log cache
> proposal to create a runtime cache that sits on top of the new
> successor-ID filesystem data, and caches results for certain log queries
> in memory for quick access. It should even be possible to pre-populate
> this cache when the server start up.
How would I reconstruct copy target path names from node-rev info?
>
> This way, we have some amount of redundancy in the system, but Daniel
> and I can continue trying to deliver a working solution for 1.8 based
> on what we've started. And we can worry about performance issues later,
> because you already have a plan for that.
>
> Frankly, I think the time people are wasting today resolving trivial
> tree-conflicts is a huge waste of their time. No matter how bad the
> performance of an automated solution to this problem will be, it will
> be faster than a human being. Our users will get a huge benefit either
> way because we will be reducing their load of manual labour. Performance
> of the solution doesn't need to be perfect by the time we release 1.8
> and nothing stands in the way of improving performance later.
>
> Do you agree?
This depends entirely on your use-case (see above). My experience
with navigating these change graphs indicates that better-than
O(n^2) performance requires completely different algorithms, data
structures and API than a merely correct path-by-path approach.
BTW, n > 10.000.000 for certain repositories.
>
> If not, I hope that you'll find time to help us implement your pure
> caching solution for 1.8. I would really like to see some solution
> to this problem in the 1.8. release.
I'm currently working on other SVN-related projects.
 From April on, I'm available for hire.

-- Stefan^2.

Re: server-side log cache

Posted by Stefan Sperling <st...@elego.de>.

On Thu, Sep 22, 2011 at 08:43:14PM +0200, Stefan Fuhrmann wrote:
> >>>This looks very interesting.
> >>>
> >>>What about FSFS-specific requirements?
> >>See assumptions above, this may require a different
> >>data structure. But I think that noderev dependencies
> >>will turn out to be redundant if you have a log cache
> >>and access to the skip-delta forwards dependencies.
> >>>It sounds like you avoid those by storing data in semantics of the repos
> >>>layer (path@revision) instead of the FS layer (node-revision-id)?
> >>Yes.
> >>>In this case separate implementations for FSFS and BDB aren't needed.
> >>>This could be an advantage (e.g. third party FS implementations
> >>>wouldn't need to change to support this).
> >>It also eliminates on of the performance weaknesses
> >>of SVN today: A log on some old / seldom changed
> >>path can take a very long time.
> >>>I'll think about this some more, thanks.
> >>>
> >>Welcome ;)

Reviving this thread.

Your concerns about a node-rev based approach seem to resolve largely
around performance, not about correctness. I.e. you agree that a
node-rev-based solution as currently being worked on within the
fs-successor-ids branch will work correctly, but won't perform
as well as your proposal for certain queries, right?

Now, I don't feel comfortable trying to implement your design.
The reason for this is that you could do a much better job at this yourself.
However, I do feel very comfortable continuing the work we've started on
the fs-successor-ids branch.

I also think that the two approaches can complement each other.
We are not in an either/or situation. We will get correct answers either
way and the only real difference is performance.

Note also that our plan for putting successor-IDs into the filesystem
layer we will also solve the problem of creating the successor data
during an upgrade from SVN 1.7 to 1.8.
Both approaches need to solve this somehow, and we'd have that part
sorted out for you.

So, what about this: We implement successor-IDs in the filesystem
as planned on the fs-successor-ids branch.
Once we have that, and when you have time, you adapt your log cache
proposal to create a runtime cache that sits on top of the new
successor-ID filesystem data, and caches results for certain log queries
in memory for quick access. It should even be possible to pre-populate
this cache when the server start up.

This way, we have some amount of redundancy in the system, but Daniel
and I can continue trying to deliver a working solution for 1.8 based
on what we've started. And we can worry about performance issues later,
because you already have a plan for that.

Frankly, I think the time people are wasting today resolving trivial
tree-conflicts is a huge waste of their time. No matter how bad the
performance of an automated solution to this problem will be, it will
be faster than a human being. Our users will get a huge benefit either
way because we will be reducing their load of manual labour. Performance
of the solution doesn't need to be perfect by the time we release 1.8
and nothing stands in the way of improving performance later.

Do you agree?

If not, I hope that you'll find time to help us implement your pure
caching solution for 1.8. I would really like to see some solution
to this problem in the 1.8. release.

Re: server-side log cache

Posted by Stefan Fuhrmann <eq...@web.de>.

On 20.09.2011 23:40, Daniel Shahaf wrote:
> Re-reading this to see how it affects the fs-successor-ids work (in
> particular, whether it supersedes the done/planned work), one question:
>
>
> Why do you refer to skip-deltas?
I haven't thought too deeply about the skip-delta issue
but that is my reasoning: For all questions concerning
the "change flow", the noderev level seems to be the
wrong place to go as I tried to explain.

If you are trying to answer repo-internal questions like
"can I remove this chunk of data?", the dependencies
between the individual representations (skip delta,
shared nodereps) may be more interesting.
>
>
> Consider the following transformation to a repository:
>
>    for noderev in FS:
>      noderev.props['svn:contents'] = noderev.cat()
>      noderev.contents = ""
>
> This transformation does not change the successor relations, but it does
> change the skip-deltas mapping.  Would your algorithms for
> successors/copyto work correctly on a repository that has undergone the
> above transformation?
My algorithms don't  care about noderevs or any layer
below that (including representations). The skip-delta
argument was only to illustrate that the noderevs cache
might be even less helpful.

-- Stefan^2.
>
> Stefan Fuhrmann wrote on Mon, Sep 19, 2011 at 10:42:55 +0200:
>> On 29.08.2011 18:35, Stefan Sperling wrote:
>>
>> Sorry for the late response. I have been knocked
>> out for a while.
>>> On Sun, Aug 28, 2011 at 03:46:03PM +0200, Stefan Fuhrmann wrote:
>>>>> See http://svn.apache.org/repos/asf/subversion/branches/fs-successor-ids/BRANCH-README
>>>>> for what this is all about.
>>>> But the assumptions in that file are actually not valid.
>>> Which ones are invalid? Can you explain in detail?
>> It makes at least 5 implicit assumptions:
>>
>> * Noderevs are the relevant level on which user-level
>>    questions can be answered.
>> ->  That might not be the case. Noderevs are a purely
>>     internal construct. In particular, a noderev might get
>>     copied but the actual delta is empty. Thinking about
>>     svn obliterate&  friends that might actually happen
>>     as a result of some node removal / replacement.
>>
>> * User-level queries (where got node copied to?) and
>>    internal queries (what are the copy dependencies?)
>>    are looking for the same information.
>> ->  This is linked to the previous. Noderev copy-to info
>>     can be useful to determine the dependency graph.
>>     The point here is that we are mixing queries on two
>>     different layers here. If we need both, they should
>>     both exist as separate APIs.
>>
>> * Noderevs changes are relevant for all svn obliterate.
>> ->  That may not always be true. If you only want to
>>     remove "unused" history, the references in the skip-
>>     delta tree express the relevant dependencies.
>>
>> * Noderev changes are the sufficient to answer the
>>    "where has this been copied to?" question.
>> ->  They are quite unhelpful here. Noderevs might
>>     help find you changes on the node itself (see first
>>     assumption). But you need to evaluate the user-
>>     level copy operations of all parent folders for all
>>     currently known copies. There are far less of these
>>     copies (e.g. branch creation) than noderev changes
>>     (typ. multiple per revision).
>>
>> * A noderev-based cache alone would reduce the
>>    amount of data to inspect / process for these queries.
>> ->  The noderev cache is *larger* than e.g. a log cache
>>     due to each change also modifying all parent nodes.
>>     To be efficient, you need a node-level copy-to cache
>>     as well (see previous item).
>>>> * "Where did path P at rev N to M directly or indirectly
>>>>   copied to, e.g. which releases contain a certain faulty
>>>>   code segment; optionally list changes to targets?"
>>>> ->   needs to scan parent paths for copies, too
>>>>    (reversed "log", revision graph)
>>> Yes, the successor-id cache only gives us operation roots.
>>> Information for child nodes needs to be derived -- it is not
>>> within the scope of the cache itself.
>> That is unnecessarily complicated / expensive.
>> Using the proposed log cache, the relevant
>> information can be found in a *single*, highly
>> efficient scan.
>>>> It turns out that we can produce even humongous
>>>> reverse logs (50+ k nodes, thousands of branches
>>>> and tags) within a second by simply performing
>>>> a full history scan.
>>>>
>>>> A example of how the whole process can be
>>>> implemented efficiently, can be found here:
>>>>
>>>> https://tortoisesvn.googlecode.com/svn/trunk/src/TortoiseProc/RevisionGraph/FullGraphBuilder.cpp
>>> I'll take a look at that, thanks!
>> The gist of it is that you keep all currently known
>> copy targets (e.g. branches, tags) in a tree. For
>> each revision, you walk that tree to find the affected
>> sub-tree. The LRU path gets optimized to speed up
>> the next look-up.
>>
>> A usually fixed repo layout plus typical change
>> patterns result in a quasi-constant processing
>> time for each revision - independent from the
>> number of copies already found.
>>
>> Things become much simpler and faster if you
>> eliminate string operations and use path IDs
>> instead. TSVN does an "svn log" at over 10M
>> revs per second.
>>>>> Storage of successor IDs is essentially a cache of the result of the
>>>>> inversion of the predecessor relation between all node-revisions.
>>>>> This inversion is a *very* expensive operation (follow every node
>>>>> that ever existed in the repository from its most recent revision
>>>>> back to the revision it was created in).
>>>> Not true. At least the "*very* expensive" part.
>>>> Log -v takes about 1min for AFS over ra_local.
>>>> Building any of the index data described below
>>>> can be done in<   10s.
>>> Any of it (if so, which parts?), or all of it?
>> All of it, of course. The expensive part is "svn log -v"
>> which may be expensive on slow disks. Everything
>> else is just writing a few MB of data.
>>>> I propose a modified version of TSVN's log cache
>>>> data structure. The adaptations:
>>>>
>>>> * split into multiple files to reduce modification overhead
>>>> * remove rev-prop-dependent information (log comment,
>>>>    author, timestamp)
>>>> * add the reverse copy info
>>>> * simplify the data structures
>>> This looks very interesting.
>>>
>>> What about FSFS-specific requirements?
>> See assumptions above, this may require a different
>> data structure. But I think that noderev dependencies
>> will turn out to be redundant if you have a log cache
>> and access to the skip-delta forwards dependencies.
>>> It sounds like you avoid those by storing data in semantics of the repos
>>> layer (path@revision) instead of the FS layer (node-revision-id)?
>> Yes.
>>> In this case separate implementations for FSFS and BDB aren't needed.
>>> This could be an advantage (e.g. third party FS implementations
>>> wouldn't need to change to support this).
>> It also eliminates on of the performance weaknesses
>> of SVN today: A log on some old / seldom changed
>> path can take a very long time.
>>> I'll think about this some more, thanks.
>>>
>> Welcome ;)
>>
>> -- Stefan^2.

Re: server-side log cache

Posted by Daniel Shahaf <da...@elego.de>.

Re-reading this to see how it affects the fs-successor-ids work (in
particular, whether it supersedes the done/planned work), one question:


Why do you refer to skip-deltas?


Consider the following transformation to a repository:

  for noderev in FS:
    noderev.props['svn:contents'] = noderev.cat()
    noderev.contents = ""

This transformation does not change the successor relations, but it does
change the skip-deltas mapping.  Would your algorithms for
successors/copyto work correctly on a repository that has undergone the
above transformation?



Stefan Fuhrmann wrote on Mon, Sep 19, 2011 at 10:42:55 +0200:
> On 29.08.2011 18:35, Stefan Sperling wrote:
> 
> Sorry for the late response. I have been knocked
> out for a while.
> >On Sun, Aug 28, 2011 at 03:46:03PM +0200, Stefan Fuhrmann wrote:
> >>>See http://svn.apache.org/repos/asf/subversion/branches/fs-successor-ids/BRANCH-README
> >>>for what this is all about.
> >>But the assumptions in that file are actually not valid.
> >Which ones are invalid? Can you explain in detail?
> It makes at least 5 implicit assumptions:
> 
> * Noderevs are the relevant level on which user-level
>   questions can be answered.
> -> That might not be the case. Noderevs are a purely
>    internal construct. In particular, a noderev might get
>    copied but the actual delta is empty. Thinking about
>    svn obliterate & friends that might actually happen
>    as a result of some node removal / replacement.
> 
> * User-level queries (where got node copied to?) and
>   internal queries (what are the copy dependencies?)
>   are looking for the same information.
> -> This is linked to the previous. Noderev copy-to info
>    can be useful to determine the dependency graph.
>    The point here is that we are mixing queries on two
>    different layers here. If we need both, they should
>    both exist as separate APIs.
> 
> * Noderevs changes are relevant for all svn obliterate.
> -> That may not always be true. If you only want to
>    remove "unused" history, the references in the skip-
>    delta tree express the relevant dependencies.
> 
> * Noderev changes are the sufficient to answer the
>   "where has this been copied to?" question.
> -> They are quite unhelpful here. Noderevs might
>    help find you changes on the node itself (see first
>    assumption). But you need to evaluate the user-
>    level copy operations of all parent folders for all
>    currently known copies. There are far less of these
>    copies (e.g. branch creation) than noderev changes
>    (typ. multiple per revision).
> 
> * A noderev-based cache alone would reduce the
>   amount of data to inspect / process for these queries.
> -> The noderev cache is *larger* than e.g. a log cache
>    due to each change also modifying all parent nodes.
>    To be efficient, you need a node-level copy-to cache
>    as well (see previous item).
> >
> >>* "Where did path P at rev N to M directly or indirectly
> >>  copied to, e.g. which releases contain a certain faulty
> >>  code segment; optionally list changes to targets?"
> >>->  needs to scan parent paths for copies, too
> >>   (reversed "log", revision graph)
> >Yes, the successor-id cache only gives us operation roots.
> >Information for child nodes needs to be derived -- it is not
> >within the scope of the cache itself.
> That is unnecessarily complicated / expensive.
> Using the proposed log cache, the relevant
> information can be found in a *single*, highly
> efficient scan.
> >
> >>It turns out that we can produce even humongous
> >>reverse logs (50+ k nodes, thousands of branches
> >>and tags) within a second by simply performing
> >>a full history scan.
> >>
> >>A example of how the whole process can be
> >>implemented efficiently, can be found here:
> >>
> >>https://tortoisesvn.googlecode.com/svn/trunk/src/TortoiseProc/RevisionGraph/FullGraphBuilder.cpp
> >I'll take a look at that, thanks!
> The gist of it is that you keep all currently known
> copy targets (e.g. branches, tags) in a tree. For
> each revision, you walk that tree to find the affected
> sub-tree. The LRU path gets optimized to speed up
> the next look-up.
> 
> A usually fixed repo layout plus typical change
> patterns result in a quasi-constant processing
> time for each revision - independent from the
> number of copies already found.
> 
> Things become much simpler and faster if you
> eliminate string operations and use path IDs
> instead. TSVN does an "svn log" at over 10M
> revs per second.
> >>>Storage of successor IDs is essentially a cache of the result of the
> >>>inversion of the predecessor relation between all node-revisions.
> >>>This inversion is a *very* expensive operation (follow every node
> >>>that ever existed in the repository from its most recent revision
> >>>back to the revision it was created in).
> >>Not true. At least the "*very* expensive" part.
> >>Log -v takes about 1min for AFS over ra_local.
> >>Building any of the index data described below
> >>can be done in<  10s.
> >Any of it (if so, which parts?), or all of it?
> All of it, of course. The expensive part is "svn log -v"
> which may be expensive on slow disks. Everything
> else is just writing a few MB of data.
> >>I propose a modified version of TSVN's log cache
> >>data structure. The adaptations:
> >>
> >>* split into multiple files to reduce modification overhead
> >>* remove rev-prop-dependent information (log comment,
> >>   author, timestamp)
> >>* add the reverse copy info
> >>* simplify the data structures
> >This looks very interesting.
> >
> >What about FSFS-specific requirements?
> See assumptions above, this may require a different
> data structure. But I think that noderev dependencies
> will turn out to be redundant if you have a log cache
> and access to the skip-delta forwards dependencies.
> >It sounds like you avoid those by storing data in semantics of the repos
> >layer (path@revision) instead of the FS layer (node-revision-id)?
> Yes.
> >In this case separate implementations for FSFS and BDB aren't needed.
> >This could be an advantage (e.g. third party FS implementations
> >wouldn't need to change to support this).
> It also eliminates on of the performance weaknesses
> of SVN today: A log on some old / seldom changed
> path can take a very long time.
> >
> >I'll think about this some more, thanks.
> >
> Welcome ;)
> 
> -- Stefan^2.