Posted to dev@subversion.apache.org by Julian Foad <ju...@btopenworld.com> on 2006/01/11 02:12:44 UTC

Re: Fwd: Effects of importing over 18000 items into repository as one commit transaction

Pavel Repin wrote:
> This did not result in any response on the users list.
> Hopefully someone among svn hackers might know about the impacts of a 
> largish commit.

It looks like there was no response from the developers' list either.  Sorry 
about that.  Probably lots of people read it but they all thought "Hmm, I don't 
know.  It's rather vague."  At least that was my reaction.

> ---------- Forwarded message ----------
> From: Pavel Repin <prepin@gmail.com>
> Date: Nov 29, 2005 8:18 AM

> We are rolling out a subversion install at work.
> One of the teams created a FSFS repository and svn-imported a snapshot 
> of their entire source tree in one transaction. The resulting FSFS 
> revision file "1" is 261 MB and contains 18834 items. Do you think that 
> was a bad idea to import the entire tree in one shot?

That's not an unreasonable size of import, so it should be fine.

> I sort of suspect it was a bad idea because I am noticing that "svn log" 
> takes considerable amount of time before it dumps anything to stdout on 
> a repository that had only 35 checkins so far.

How slow is it?  What exact command are you using?  (Just "svn log" in an 
up-to-date working copy?)  What version of Subversion are you using (both 
client and server)?

A whole "svn log" of a large repository does typically take a very long time, 
but that's (I assume) because a large repository typically has a very large 
number of revisions (e.g. tens of thousands).  In old versions of Subversion it 
used to take a long time before starting to print anything, but that was fixed.
It ought to be quick on only 35 revisions regardless of how big each revision is.

Does it work quickly if you avoid the first revision, e.g. with "svn log 
-rHEAD:2" or "svn log --limit=34" ?

Is it slow if you just request the log of the big revision ("svn log -r1")?
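
One way to compare these cases is to time each command against the same working copy. The following is only a sketch: the helper names (log_commands, time_command) are made up for illustration, and it assumes an "svn" client on the PATH and a head revision of 35 as in the report.

```python
import subprocess
import time
from typing import Dict, List


def log_commands(head_rev: int) -> Dict[str, List[str]]:
    """Build the 'svn log' variants to compare (hypothetical helper)."""
    return {
        "all revisions": ["svn", "log"],
        "skip revision 1": ["svn", "log", "-r", "HEAD:2"],
        # Newest (head_rev - 1) revisions, which leaves out revision 1.
        "newest only": ["svn", "log", f"--limit={head_rev - 1}"],
        "revision 1 only": ["svn", "log", "-r", "1"],
    }


def time_command(argv: List[str], cwd: str) -> float:
    """Run one command in the working copy and return wall-clock seconds."""
    start = time.perf_counter()
    # Output is discarded; only the elapsed time is of interest here.
    subprocess.run(argv, cwd=cwd, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start
```

Calling time_command(argv, "/path/to/working-copy") for each entry of log_commands(35) would show whether the time is dominated by revision 1.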

If the large revision is causing a massive slow-down in "svn log", that's 
certainly something we ought to investigate, but it might be a low priority if 
it is only moderate and/or only occurs on revision 1.

> I've seen much better "svn log" performance on a repository with vastly 
> larger number of checked in revisions, but that repository grew 
> naturally (it started from nothing and it grew little by little with 
> each checkin).
> 
> Should we have imported that tree as a set of smaller checkins?

Well, if this problem is a real nuisance then that would probably be a way to 
avoid it.  (Specifically: yes, breaking it into even a small number of pieces 
might well speed it up a lot, if the slowdown is due to quadratic time required 
somewhere in the implementation.)  If you can live with it for the short to 
medium term, the inefficiency may eventually be fixed.  If you or someone you 
know can help fix it, of course, it could happen much sooner!
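
To make the "quadratic time" hypothesis concrete, here is the arithmetic under the assumption (not established anywhere in this thread) that processing a revision with n changed paths costs c * n^2 for some constant c, and that the N items are split evenly into k commits:

```latex
% Assumed cost model: one revision with n changed paths costs c n^2.
T_{\text{one}} = c\,N^{2},
\qquad
T_{\text{split}} = k \cdot c\left(\frac{N}{k}\right)^{2} = \frac{c\,N^{2}}{k}
% Example with N = 18834 and k = 10:
%   N^2 = 354{,}719{,}556, so T_split = T_one / 10 \approx 3.5 \times 10^{7}\, c.
```

Under that model, even a split into ten commits would cut the total cost by a factor of ten.  If the cost is really linear in n, splitting changes nothing.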

Finally, I note from the CHANGES file 
<http://svn.collab.net/repos/svn/trunk/CHANGES> that there were "svn log" 
performance regressions in v1.2.0, fixed in v1.2.1, and there is a further 
improvement in v1.3.0, so try that when you can.  I don't know the detailed 
effects of these.

- Julian

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Fwd: Effects of importing over 18000 items into repository as one commit transaction

Posted by Ben Collins-Sussman <su...@red-bean.com>.
On 1/11/06, Julian Foad <ju...@btopenworld.com> wrote:

> > Our authz_read() callback operates on single paths, and that's it.
>
> OK, I see.  I guess you're saying we can't change that API, so we're stuck.  If
> we could add and use a "check this list of paths" API it could potentially be
> much faster,

Not at all, it's our own invented API!  I agree, perhaps it would be
better to rev this API to take a *list* of paths, rather than just one
path.  Then the implementor of the authorization system
(mod_authz_svn, or svnserve), at least has the opportunity to make use
of inherited ACLs, if that's how it happens to do authz.  That would
be a nice enhancement.

In retrospect, I think it was probably a mistake to make our
authz_read() callback take only one path rather than a list of them. 
A multi-path version would still allow the single-path case and be
backward-compatible, but is much more flexible and opens the door for
optimizations.
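
A rough sketch of what "backward-compatible" could mean here, written in Python rather than against the real C API.  All names below (lift_single_path_check, as_single_path_check) are hypothetical; the real authz_read callback has a different signature.

```python
from typing import Callable, List, Sequence

# Assumed shapes for illustration only, not the actual Subversion types.
SinglePathCheck = Callable[[str], bool]
MultiPathCheck = Callable[[Sequence[str]], List[bool]]


def lift_single_path_check(check_one: SinglePathCheck) -> MultiPathCheck:
    """Wrap an existing single-path callback so it accepts a list."""
    def check_many(paths: Sequence[str]) -> List[bool]:
        # No optimization: one call per path, same answers as before.
        return [check_one(p) for p in paths]
    return check_many


def as_single_path_check(check_many: MultiPathCheck) -> SinglePathCheck:
    """Recover the single-path case by passing a one-element list."""
    def check_one(path: str) -> bool:
        return check_many([path])[0]
    return check_one
```

An authz module that knows about inherited ACLs could supply its own list-based function instead of the lifted one, while existing single-path implementations keep working through the wrapper.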



Re: Fwd: Effects of importing over 18000 items into repository as one commit transaction

Posted by Julian Foad <ju...@btopenworld.com>.
Ben Collins-Sussman wrote:
> On 1/11/06, Julian Foad <ju...@btopenworld.com> wrote:
> 
>>That document mentions the problem of O(N) behaviour for copy operations, in
>>section 3 of each "chapter".  It doesn't mention such a problem for any other
>>situation.  Do you want to patch it to warn of O(N) behaviour in other or all
>>cases?
> 
> I'm talking about the section on revision properties:

Oh, thanks, Ben - I get it now: I thought our document specifically warned of 
an O(N) speed problem in the case of copies, implying that there is no such 
problem in other cases.  But what it means is just that O(N) behaviour is 
surprising for copies because we otherwise expect them to have O(1) behaviour, 
whereas in other cases O(N) behaviour should be expected.


> In order to decide which revprops are okay to display, all of the
> changed-paths in a given revision have to be examined.  This isn't
> something that can be optimized away with clever uses of trees,
> red-black elimination, etc.  What must be determined is whether
> 
>     * all the changed-paths are readable
>     * some of the changed-paths are readable
>     * none of the changed-paths are readable.
> 
> And that's just an O(N) search (where N == number of changed-paths in
> the revision), no matter what you do.

Obviously when we say "the operation is O(N)" that's a simplification: some 
parts of the operation certainly have to iterate over the N paths, but the 
question is whether a significant amount of time is spent doing something that 
could be done more quickly.  For example, if the authz check could skim through 
the list of paths to find the root-most one and then perform its expensive ACL 
check on that, it might then be able to eliminate many other paths from needing 
an expensive check.  Note: "might" - it wouldn't work in all cases or with all 
authz systems.  And note that I'm not saying mod_dav_svn could do this; it 
would have to be done in the applicable authz module.
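
As a sketch of that idea: suppose, purely as an assumption, that the authz module offers a subtree_verdict(path) check returning True (whole subtree readable), False (whole subtree unreadable) or None (mixed or unknown).  Neither that function nor check_paths exists in Subversion; they are named here only for illustration.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def is_under(path: str, root: str) -> bool:
    """True if path equals root or lies below it ('/'-separated paths)."""
    return path == root or path.startswith(root.rstrip("/") + "/")


def check_paths(paths: Iterable[str],
                subtree_verdict: Callable[[str], Optional[bool]],
                authz_read: Callable[[str], bool]) -> Dict[str, bool]:
    """Decide readability per path, reusing verdicts for settled subtrees."""
    results: Dict[str, bool] = {}
    decided: List[Tuple[str, bool]] = []  # (subtree root, verdict)
    # A parent path always sorts before its children.
    for path in sorted(set(paths)):
        verdict: Optional[bool] = None
        for root, root_verdict in decided:
            if is_under(path, root):
                verdict = root_verdict  # covered by an earlier check
                break
        if verdict is None:
            verdict = subtree_verdict(path)  # hypothetical expensive check
            if verdict is not None:
                decided.append((path, verdict))
            else:
                verdict = authz_read(path)  # fall back to single-path check
        results[path] = verdict
    return results
```

With an import rooted at one directory, the expensive check would run once for the root and every other path would be answered from the decided list.  When no subtree can be settled, this degrades to one check per path, as before.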

> Our authz_read() callback operates on single paths, and that's it.

OK, I see.  I guess you're saying we can't change that API, so we're stuck.  If 
we could add and use a "check this list of paths" API it could potentially be 
much faster, but I guess that API is external to Subversion and not easily 
influenced by us.  (Sorry for trying to discuss it without knowing the 
architecture.)

So, the bottom line is (probably) that if one configures one's repository to 
use an authz system that doesn't have very fast performance, it will slow down 
certain Subversion operations that consider a large set of paths in the 
repository, one example being "svn log".

There are of course further ideas that could be explored if this becomes a big 
problem.  We could reconsider the mapping of how many paths are readable to 
which rev-props are readable.  We could consider partly caching the readability 
of rev-props (may be impossible to do correctly, I know).

Anyway, thanks for the insights.  That's the limit of my interest in the 
matter.  I thought to begin with that the slow-down was implied to be with FSFS 
only, and therefore that it might be a simple algorithmic inefficiency under 
our control.  In fact, we still don't know whether this authorization behaviour 
was the cause of the original poster's problem.

- Julian


Re: Fwd: Effects of importing over 18000 items into repository as one commit transaction

Posted by Ben Collins-Sussman <su...@red-bean.com>.
On 1/11/06, Julian Foad <ju...@btopenworld.com> wrote:

> That document mentions the problem of O(N) behaviour for copy operations, in
> section 3 of each "chapter".  It doesn't mention such a problem for any other
> situation.  Do you want to patch it to warn of O(N) behaviour in other or all
> cases?

I'm talking about the section on revision properties:

--------------------------------------------
5. REVISION PROPERTIES

   Users are allowed to attach arbitrary, unversioned properties to
   revisions.  Additionally, most revisions also have "standard"
   revision props (revprops), such as svn:author, svn:date, and
   svn:log.  Access to revprops may be restricted, based on
   readability of changed-paths.

     * If a revision contains nothing but unreadable changed-paths,
       then all revprops are unreadable and unwritable.

     * If a revision has a mixture of readable/unreadable
       changed-paths, then all revprops are unreadable, except for
       svn:author and svn:date.  All revprops are unwritable.
--------------------------------------------

The problem here is that 'svn log' is all about fetching revprops: 
author, date, log message.

In order to decide which revprops are okay to display, all of the
changed-paths in a given revision have to be examined.  This isn't
something that can be optimized away with clever uses of trees,
red-black elimination, etc.  What must be determined is whether

    * all the changed-paths are readable
    * some of the changed-paths are readable
    * none of the changed-paths are readable.

And that's just an O(N) search (where N == number of changed-paths in
the revision), no matter what you do.  There's nothing predictable
about the changed-path list.  If one of the changed-paths just happens
to be a child of another, it's still not useful to us;  we can't make
any assumptions about the ACLs set up by the server process, or
whether there's inheritance or not.  Our authz_read() callback
operates on single paths, and that's it.
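
To illustrate the scan and the revprop rules quoted above, here is a minimal Python sketch.  It is not the server code: classify_revision, visible_revprops and Readability are made-up names, authz_read is modelled as a plain Python callable, and treating a revision with no changed paths as fully readable is an assumption.

```python
from enum import Enum
from typing import Callable, Dict, Iterable


class Readability(Enum):
    ALL = "all"
    SOME = "some"
    NONE = "none"


def classify_revision(changed_paths: Iterable[str],
                      authz_read: Callable[[str], bool]) -> Readability:
    """Call the single-path check on each changed path: O(N) calls."""
    seen_readable = False
    seen_unreadable = False
    for path in changed_paths:
        if authz_read(path):
            seen_readable = True
        else:
            seen_unreadable = True
        if seen_readable and seen_unreadable:
            return Readability.SOME  # mixed; the rest cannot change this
    if seen_unreadable:
        return Readability.NONE
    return Readability.ALL  # also reached for an empty list (assumption)


def visible_revprops(revprops: Dict[str, str],
                     readability: Readability) -> Dict[str, str]:
    """Apply the rules from section 5 of the quoted document."""
    if readability is Readability.ALL:
        return dict(revprops)
    if readability is Readability.SOME:
        allowed = ("svn:author", "svn:date")
        return {k: v for k, v in revprops.items() if k in allowed}
    return {}
```

Note that the early exit only helps for mixed revisions.  In the common case where everything is readable, all N paths are checked, which for an 18834-item import means 18834 callback invocations for one log entry.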

So yeah, this is a case where security trades off for performance. 
The tradeoff can be reversed by disabling path-based authorization
features completely.   I don't see any other options...



Re: Fwd: Effects of importing over 18000 items into repository as one commit transaction

Posted by Julian Foad <ju...@btopenworld.com>.
Ben Collins-Sussman wrote:
> On 1/10/06, Garrett Rooney <ro...@electricjellyfish.net> wrote:
> 
>>On 1/10/06, Julian Foad <ju...@btopenworld.com> wrote:
>>
>>
>>>If the large revision is causing a massive slow-down in "svn log", that's
>>>certainly something we ought to investigate, but it might be a low priority if
>>>it is only moderate and/or only occurs on revision 1.
>>
>>If that is the case, it tends to be because of the need to run authz
>>checks on large numbers of paths, and other than turning off path
>>based authz there isn't a lot that can be done about it.

Not by the user, maybe.  Presumably what can be done about it from our point of 
view is to ensure that the authz-checking code is called efficiently - in such 
a way that it can accept and reject whole sub-trees without having to evaluate 
every path individually.

> Right.  For more explanation, see our notes/authz_policy.txt file.

That document mentions the problem of O(N) behaviour for copy operations, in 
section 3 of each "chapter".  It doesn't mention such a problem for any other 
situation.  Do you want to patch it to warn of O(N) behaviour in other or all 
cases?

This document also mentions that it need not be that slow:

>    Depending on the specific path-based authz module being used,
>    however, there are sometimes solutions that aren't quite so
>    expensive as O(N).

- Julian


Re: Fwd: Effects of importing over 18000 items into repository as one commit transaction

Posted by Ben Collins-Sussman <su...@red-bean.com>.
On 1/10/06, Garrett Rooney <ro...@electricjellyfish.net> wrote:
> On 1/10/06, Julian Foad <ju...@btopenworld.com> wrote:
>
> > If the large revision is causing a massive slow-down in "svn log", that's
> > certainly something we ought to investigate, but it might be a low priority if
> > it is only moderate and/or only occurs on revision 1.
>
> If that is the case, it tends to be because of the need to run authz
> checks on large numbers of paths, and other than turning off path
> based authz there isn't a lot that can be done about it.

Right.  For more explanation, see our notes/authz_policy.txt file.



Re: Fwd: Effects of importing over 18000 items into repository as one commit transaction

Posted by Garrett Rooney <ro...@electricjellyfish.net>.
On 1/10/06, Julian Foad <ju...@btopenworld.com> wrote:

> If the large revision is causing a massive slow-down in "svn log", that's
> certainly something we ought to investigate, but it might be a low priority if
> it is only moderate and/or only occurs on revision 1.

If that is the case, it tends to be because of the need to run authz
checks on large numbers of paths, and other than turning off path
based authz there isn't a lot that can be done about it.

-garrett
