Posted to dev@subversion.apache.org by Philip Martin <ph...@codematters.co.uk> on 2003/07/29 16:46:27 UTC

The small commit problem

Hello

I have been looking at a cvs2svn conversion and wondering why the
Subversion repository is so much larger than the CVS one.  One of the
things that occurred to me is that the creation of a directory node in
the Subversion filesystem might make small commits relatively
expensive, particularly if the directory has a large number of
elements.

I have been experimenting with scripts like the one at the end of this
mail.  It creates a simple repository containing a number of files and
then makes lots of "small" changes to measure how the repository
grows.  I have tried both renaming a file (one rename per commit) and
editing a file (append a few bytes to one file per commit).

Files in       Repository growth
directory      per commit
   10       :  about 10k per edit or move
   50       :  over 15k per edit, about 15k per move
  100       :  over 15k per edit or move
  200       :  about 20k per edit or move
  500       :  about 30k per edit, over 45k per move

I don't think 50, or even 200, is a large number of files to have in a
directory.  Due to the way changes "bubble-up" through the Subversion
filesystem, the effect is amplified if the directory in question is
itself a child of a directory with lots of elements.

Thus a Subversion repository doesn't handle "small" commits
particularly well; there is a sort of threshold on the minimum size
for each commit.  This could explain why we are getting reports that
CVS repositories convert to much larger Subversion repositories.

Does that sound plausible?  If it does I wonder what we could do to
change it: make the nodes less expensive, or use some sort of "diffy"
directory storage, or...



Script follows

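#!/bin/bash

# Locations of the stress script and the Berkeley DB checkpoint utility.
STRESS=~/sw/subversion/svn/tools/dev/stress.pl
CHECK=db4.1_checkpoint

# Create the test repository in ./repostress (see stress.pl for the options).
$STRESS -n0 -c -F200 -N1 -D0
REPO=file://`pwd`/repostress

rm -rf wc
# Needed only for the edit variant below.
#svn co $REPO/trunk wc &> /dev/null

# Checkpoint the database and delete unused log files, so that du
# measures the repository itself rather than accumulated BDB log files.
$CHECK -1 -h repostress/db
rm -f `svnadmin archive repostress`
psize=`du -ks repostress | awk '{print $1}'`

for i in `seq 100` ; do
  for j in `seq 5` ; do
    # Edit variant: append a few bytes to one file and commit.
    #echo $i"x"$j >> wc/foo1 && svn ci -m "" wc &> /dev/null
    #echo $i"x"$j >> wc/foo1 && svn ci -m "" wc &> /dev/null
    # Move variant: rename one file and back again, one commit each.
    svn mv -m "" $REPO/trunk/foo1 $REPO/trunk/xfoo1 &> /dev/null
    svn mv -m "" $REPO/trunk/xfoo1 $REPO/trunk/foo1 &> /dev/null
  done

  $CHECK -1 -h repostress/db
  rm -f `svnadmin archive repostress`
  nsize=`du -ks repostress | awk '{print $1}'`
  # Print previous size, new size and growth, all in KB.
  echo $psize $nsize $(($nsize-$psize))
  psize=$nsize

done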

-- 
Philip Martin

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: The small commit problem

Posted by "Glenn A. Thompson" <gt...@cdr.net>.
Hey,

cmpilato@collab.net wrote:

>Philip Martin <ph...@codematters.co.uk> writes:
>
>  
>
>>Thus a Subversion repository doesn't handle "small" commits
>>particularly well; there is a sort of threshold on the minimum size
>>for each commit.  This could explain why we are getting reports that
>>CVS repositories convert to much larger Subversion repositories.
>>
>>Does that sound plausible?  If it does I wonder what we could do to
>>change it: make the nodes less expensive, or use some sort of "diffy"
>>directory storage, or...
>>    
>>
>
>We explicitly disabled diffy directory storage because of the time
>costs of undeltifying those things.  That said, this cost might not be
>so great these days (with skip-deltas and delta combiners).
>  
>
So all directories are stored as full text?
I couldn't find where the distinction is made in the deltify code.
Please help me out here.

gat (aka clueless)





Re: The small commit problem

Posted by Philip Martin <ph...@codematters.co.uk>.
Philip Martin <ph...@codematters.co.uk> writes:

> When converting a large CVS repository those few KB are an additional
> overhead that Subversion is likely to impose on a large number of the
> commits.  Joey's debhelper repository ended up with about 450 tags.

Oops, sent that a bit too early.  I was going to say that adding
another entry to that directory with 450 tags is going to increase the
repository by about 10k, even if the new entry is simply a cheap copy
of the trunk.

-- 
Philip Martin


Re: The small commit problem

Posted by Philip Martin <ph...@codematters.co.uk>.
cmpilato@collab.net writes:

> Philip Martin <ph...@codematters.co.uk> writes:
>
>> psize=`du -ks repostress | awk '{print $1}'`
>
> It would appear that your script is measuring just the sheer size of
> the database, log files and all (unless stress.pl does log removal).
> This means you're measuring not just the growth of the repository, but
> the growth of all the intermediate loggish steps taken to change that
> repository.  So yes, you can expect that to grow almost linearly based
> on the size of the change.  I mean, as each new edit or move
> comes in, we are replacing the directory entries list.  That's a
> write-ahead log action of (probably) the entire new entries list *for
> each edit*.

The script itself (not stress.pl) does BDB checkpointing and log file
removal.

Looks like I overestimated the values (I didn't properly account for
those steps when the repository shrank) but the problem is real.
Here's what I get moving a file in a directory of 100 files:

1400 2332 932
2332 2172 -160
2172 3020 848
3020 3888 868
3888 3704 -184
3704 4564 860
4564 5404 840
5404 5252 -152
5252 6100 848
6100 6964 864
6964 6796 -168
6796 7656 860
7656 8504 848
8504 9360 856
9360 9212 -148
9212 10088 876
10088 10980 892
10980 10800 -180
10800 11684 884
11684 12552 868
12552 12408 -144
12408 13292 884

Each line is 100 commits, the first two columns are repository size,
the third is the difference.  Over 30 log files were used. That's 2200
moves and the repository has grown from 1400k to 13292k, about 5k per
commit.  Repeating, but using a directory of 50 files instead of 100,
I get an average size per commit of about 3k.
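The "about 5k per commit" figure can be rechecked from the numbers above. The sketch below assumes, as stated, that each output line covers 100 commits; the function name is illustrative and not part of the original script.

```shell
#!/bin/bash
# Average repository growth per commit, in KB, from the script's output.
# Reads "psize nsize diff" lines on stdin; $1 is the number of commits
# that each output line represents.
avg_growth_per_commit() {
  awk -v per_line="$1" '
    NF == 3 { if (n == 0) first = $1; last = $2; n++ }
    END     { if (n > 0) printf "%.1f\n", (last - first) / (n * per_line) }
  '
}

# The 22 lines reported above, 100 commits per line.
avg_growth_per_commit 100 <<'EOF'
1400 2332 932
2332 2172 -160
2172 3020 848
3020 3888 868
3888 3704 -184
3704 4564 860
4564 5404 840
5404 5252 -152
5252 6100 848
6100 6964 864
6964 6796 -168
6796 7656 860
7656 8504 848
8504 9360 856
9360 9212 -148
9212 10088 876
10088 10980 892
10980 10800 -180
10800 11684 884
11684 12552 868
12552 12408 -144
12408 13292 884
EOF
# prints 5.4
```

Taking the first and last sizes rather than summing the third column gives the same result, since the differences telescope: (13292 - 1400) / 2200 is about 5.4k.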

This means that small changes are relatively expensive in a Subversion
repository.  Each large directory that contributes to a commit is
going to add several KB to the repository size.  A one-line change to a
header file and a source file in separate directories could well add
10K to the repository.
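As a rough illustration of that estimate, the sketch below draws a straight line through the two measurements reported in this mail (about 3k per commit at 50 entries, about 5k at 100 entries), which works out to roughly 1k plus 0.04k per directory entry. The linear model is an assumption made here for illustration only; it is not a formula from Subversion, and the function name is hypothetical.

```shell
#!/bin/bash
# Rough per-commit overhead in KB, given the entry count of each
# directory the commit rewrites. Assumes the linear fit described
# above: 1k + 0.04k per entry, computed in units of 1/1000 k.
estimate_commit_overhead_kb() {
  local total=0 n
  for n in "$@"; do
    total=$(( total + 1000 + 40 * n ))
  done
  echo $(( total / 1000 ))
}

# A header and a source file changed in two separate 100-entry directories.
estimate_commit_overhead_kb 100 100
# prints 10
```

Parent directories rewritten by the same commit would be passed as further arguments, which is where the bubble-up effect adds to the total.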

When converting a large CVS repository those few KB are an additional
overhead that Subversion is likely to impose on a large number of the
commits.  Joey's debhelper repository ended up with about 450 tags.

-- 
Philip Martin


Re: The small commit problem

Posted by cm...@collab.net.
Philip Martin <ph...@codematters.co.uk> writes:

> Thus a Subversion repository doesn't handle "small" commits
> particularly well; there is a sort of threshold on the minimum size
> for each commit.  This could explain why we are getting reports that
> CVS repositories convert to much larger Subversion repositories.
> 
> Does that sound plausible?  If it does I wonder what we could do to
> change it: make the nodes less expensive, or use some sort of "diffy"
> directory storage, or...

We explicitly disabled diffy directory storage because of the time
costs of undeltifying those things.  That said, this cost might not be
so great these days (with skip-deltas and delta combiners).

> psize=`du -ks repostress | awk '{print $1}'`

It would appear that your script is measuring just the sheer size of
the database, log files and all (unless stress.pl does log removal).
This means you're measuring not just the growth of the repository, but
the growth of all the intermediate loggish steps taken to change that
repository.  So yes, you can expect that to grow almost linearly based
on the size of the change.  I mean, as each new edit or move
comes in, we are replacing the directory entries list.  That's a
write-ahead log action of (probably) the entire new entries list *for
each edit*.


Re: The small commit problem

Posted by "Glenn A. Thompson" <gt...@cdr.net>.
Hey,

kfogel@collab.net wrote:

>"Glenn A. Thompson" <gt...@cdr.net> writes:
>  
>
>>Now this file contains only the API exposed functions. Which I want to
>>remove from the API by the way:-)
>>I think rev 2093 may be where it all changed into its current
>>form. It's a CMike mega merge.
>>I can dig further but not until late next week (knock on wood).
>>It's on my list of things to fully figure out.  Directory and property
>>storage is a sticky point for round one SQL support.
>>    
>>
>
>Uh, I don't think that's the change that did it, but whatever -- we
>can track it down, surely. 
>
Never doubted it:-)

> The point is, we *did* deliberately turn
>off directory deltification, to improve speed (at the cost of space),
>and we can certainly reconsider that decision, especially now that the
>speed penalty is not as bad as it used to be.
>
Yes I understand. The code you pointed out is no longer in deltify.c.   
I thought it moved in rev 2093.  I must have been mistaken.  Sorry.

The reason for my interest has to do with pushing all skels to the bdb 
level and introducing the ability to store properties, directories, and 
file nodes differently.  Deltification makes this more difficult.  

Sorry, didn't mean to hijack this thread.

gat  





Re: The small commit problem

Posted by kf...@collab.net.
"Glenn A. Thompson" <gt...@cdr.net> writes:
> Now this file contains only the API exposed functions. Which I want to
> remove from the API by the way:-)
> I think rev 2093 may be where it all changed into its current
> form. It's a CMike mega merge.
> I can dig further but not until late next week (knock on wood).
> It's on my list of things to fully figure out.  Directory and property
> storage is a sticky point for round one SQL support.

Uh, I don't think that's the change that did it, but whatever -- we
can track it down, surely.  The point is, we *did* deliberately turn
off directory deltification, to improve speed (at the cost of space),
and we can certainly reconsider that decision, especially now that the
speed penalty is not as bad as it used to be.

-Karl



Re: The small commit problem

Posted by "Glenn A. Thompson" <gt...@cdr.net>.
> 
>
>Yes, I think it might be time to turn on directory deltification.  It
>was turned off here, I think:
>
>
>   ------------------------------------------------------------------------
>   rev 1043:  cmpilato | 2002-01-24 10:03:50 (Thu, 24 Jan 2002) | 12 lines
>   
>   Turning off deltification of directory entries lists (should this be
>   wrapped in #ifdef SVN_FS_DELTIFY_DIR_ENTRIES and made into a
>   compile-time feature?)
>   
>   * subversion/libsvn_fs/deltify.c
>   
>     (deltify): Added 'props_only' argument.
>   
>     (svn_fs__stable_node): Pass 1 for 'props_only' argument to deltify()
>     if node we are stabilizing is a directory.
>
>But undeltification performance is greatly improved since then...
>  
>
Now this file contains only the API exposed functions. Which I want to 
remove from the API by the way:-)
I think rev 2093 may be where it all changed into its current form. 
It's a CMike mega merge.
I can dig further but not until late next week (knock on wood).
It's on my list of things to fully figure out.  Directory and property 
storage is a sticky point for round one SQL support.

gat



Re: The small commit problem

Posted by Branko Čibej <br...@xbc.nu>.
Philip Martin wrote:

>Philip Martin <ph...@codematters.co.uk> writes:
>
>  
>
>>kfogel@collab.net writes:
>>
>>    
>>
>>>Yes, I think it might be time to turn on directory deltification.  It
>>>was turned off here, I think:
>>>
>>>   ------------------------------------------------------------------------
>>>   rev 1043:  cmpilato | 2002-01-24 10:03:50 (Thu, 24 Jan 2002) | 12 lines
>>>   
>>>   Turning off deltification of directory entries lists (should this be
>>>   wrapped in #ifdef SVN_FS_DELTIFY_DIR_ENTRIES and made into a
>>>   compile-time feature?)
>>>      
>>>
>>Hmmm, the code has changed quite a bit since then.  Would this be the
>>way to turn it back on?
>>
>>Index: subversion/libsvn_fs/tree.c
>>===================================================================
>>--- subversion/libsvn_fs/tree.c (revision 6607)
>>+++ subversion/libsvn_fs/tree.c (working copy)
>>@@ -1385,7 +1385,7 @@
>>   /* If this node has a predecesser, deltify it. */
>>   if (noderev->predecessor_id)
>>     SVN_ERR (txn_deltify (node, noderev->predecessor_count, 
>>-                          args->is_dir, trail));
>>+                          0, trail));
>> 
>>   return SVN_NO_ERROR;
>> }
>>    
>>
>
>Good news and bad news.
>
>I have been converting the debhelper CVS archive posted a day or so
>ago.  It's a 3.5M CVS archive and using the Subversion HEAD it
>converts to a 20M Subversion repository with 1087 revisions, of which
>485 are tags.  Using the patch above the Subversion repository size is
>reduced to 13M.  If I 'svnadmin dump' bits of the two repositories
>they appear to be the same.
>
>That's the good news, the bad news is that it is extremely slow.  If I
>attempt a full dump of the repository, it runs a bit slowly for the
>first 22 revisions, but with the machine using 100% CPU.  Around about
>revision 23 the CPU usage drops to near zero and the dump crawls along
>so slowly I have never bothered to let it finish.  Dumping the
>revision range 20:40 takes less than 10 seconds without the patch and
>over 4 minutes with the patch.  The revision range 520:540 takes 10
>seconds without the patch and over a minute with the patch.
>  
>

This is fun. I've had this idea that we're not storing directory info as
efficiently as we could -- in both space and time -- but couldn't find
(and wasn't bothered to look for...) any tangible proof. So it looks
like it's time to rev up the ol' brain again and take a look at the
schema. :-)

-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/



Re: The small commit problem

Posted by Philip Martin <ph...@codematters.co.uk>.
Philip Martin <ph...@codematters.co.uk> writes:

> kfogel@collab.net writes:
>
>> Yes, I think it might be time to turn on directory deltification.  It
>> was turned off here, I think:
>>
>>    ------------------------------------------------------------------------
>>    rev 1043:  cmpilato | 2002-01-24 10:03:50 (Thu, 24 Jan 2002) | 12 lines
>>    
>>    Turning off deltification of directory entries lists (should this be
>>    wrapped in #ifdef SVN_FS_DELTIFY_DIR_ENTRIES and made into a
>>    compile-time feature?)
>
> Hmmm, the code has changed quite a bit since then.  Would this be the
> way to turn it back on?
>
> Index: subversion/libsvn_fs/tree.c
> ===================================================================
> --- subversion/libsvn_fs/tree.c (revision 6607)
> +++ subversion/libsvn_fs/tree.c (working copy)
> @@ -1385,7 +1385,7 @@
>    /* If this node has a predecesser, deltify it. */
>    if (noderev->predecessor_id)
>      SVN_ERR (txn_deltify (node, noderev->predecessor_count, 
> -                          args->is_dir, trail));
> +                          0, trail));
>  
>    return SVN_NO_ERROR;
>  }

Good news and bad news.

I have been converting the debhelper CVS archive posted a day or so
ago.  It's a 3.5M CVS archive and using the Subversion HEAD it
converts to a 20M Subversion repository with 1087 revisions, of which
485 are tags.  Using the patch above the Subversion repository size is
reduced to 13M.  If I 'svnadmin dump' bits of the two repositories
they appear to be the same.

That's the good news, the bad news is that it is extremely slow.  If I
attempt a full dump of the repository, it runs a bit slowly for the
first 22 revisions, but with the machine using 100% CPU.  Around about
revision 23 the CPU usage drops to near zero and the dump crawls along
so slowly I have never bothered to let it finish.  Dumping the
revision range 20:40 takes less than 10 seconds without the patch and
over 4 minutes with the patch.  The revision range 520:540 takes 10
seconds without the patch and over a minute with the patch.
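Putting the trade-off in numbers, using only the figures quoted above: the patch saves about a third of the space but slows the dump by a large factor. The helper names below are illustrative. The slowdown factors are lower bounds, since the patched timings are given as "over" 4 minutes and "over" a minute.

```shell
#!/bin/bash
# Percentage of space saved going from size $1 to size $2 (same units).
percent_saved() {
  echo $(( ($1 - $2) * 100 / $1 ))
}

# Integer slowdown factor: unpatched time $1, patched time $2, in seconds.
min_slowdown() {
  echo $(( $2 / $1 ))
}

percent_saved 20 13    # 20M -> 13M: prints 35
min_slowdown 10 240    # revisions 20:40, 10s -> over 4 minutes: prints 24
min_slowdown 10 60     # revisions 520:540, 10s -> over a minute: prints 6
```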

-- 
Philip Martin


Re: The small commit problem

Posted by Philip Martin <ph...@codematters.co.uk>.
kfogel@collab.net writes:

> Yes, I think it might be time to turn on directory deltification.  It
> was turned off here, I think:
>
>    ------------------------------------------------------------------------
>    rev 1043:  cmpilato | 2002-01-24 10:03:50 (Thu, 24 Jan 2002) | 12 lines
>    
>    Turning off deltification of directory entries lists (should this be
>    wrapped in #ifdef SVN_FS_DELTIFY_DIR_ENTRIES and made into a
>    compile-time feature?)

Hmmm, the code has changed quite a bit since then.  Would this be the
way to turn it back on?

Index: subversion/libsvn_fs/tree.c
===================================================================
--- subversion/libsvn_fs/tree.c (revision 6607)
+++ subversion/libsvn_fs/tree.c (working copy)
@@ -1385,7 +1385,7 @@
   /* If this node has a predecesser, deltify it. */
   if (noderev->predecessor_id)
     SVN_ERR (txn_deltify (node, noderev->predecessor_count, 
-                          args->is_dir, trail));
+                          0, trail));
 
   return SVN_NO_ERROR;
 }


-- 
Philip Martin


Re: The small commit problem

Posted by kf...@collab.net.
Philip Martin <ph...@codematters.co.uk> writes:
> I have been looking at a cvs2svn conversion and wondering why the
> Subversion repository is so much larger than the CVS one.  One of the
> things that occurred to me is that the creation of a directory node in
> the Subversion filesystem might make small commits relatively
> expensive, particularly if the directory has a large number of
> elements.
> 
> I have been experimenting with scripts like the one at the end of this
> mail.  It creates a simple repository containing a number of files and
> then makes lots of "small" changes to measure how the repository
> grows.  I have tried both renaming a file (one rename per commit) and
> editing a file (append a few bytes to one file per commit).

"Thank goodness for Philip."

There, now that that's out of the way:

> Thus a Subversion repository doesn't handle "small" commits
> particularly well; there is a sort of threshold on the minimum size
> for each commit.  This could explain why we are getting reports that
> CVS repositories convert to much larger Subversion repositories.
> 
> Does that sound plausible?  If it does I wonder what we could do to
> change it: make the nodes less expensive, or use some sort of "diffy"
> directory storage, or...

Yes, I think it might be time to turn on directory deltification.  It
was turned off here, I think:

   ------------------------------------------------------------------------
   rev 1043:  cmpilato | 2002-01-24 10:03:50 (Thu, 24 Jan 2002) | 12 lines
   
   Turning off deltification of directory entries lists (should this be
   wrapped in #ifdef SVN_FS_DELTIFY_DIR_ENTRIES and made into a
   compile-time feature?)
   
   * subversion/libsvn_fs/deltify.c
   
     (deltify): Added 'props_only' argument.
   
     (svn_fs__stable_node): Pass 1 for 'props_only' argument to deltify()
     if node we are stabilizing is a directory.

But undeltification performance is greatly improved since then...
