Posted to users@subversion.apache.org by John Aldridge <jp...@jjdash.demon.co.uk> on 2004/04/13 21:54:06 UTC

Large repository

We've progressed a bit in our evaluation of subversion; and have, 
apparently successfully, done a trial import of our old RCS repository, 
using cvs2svn. It took 36 hours to process the 6,500 or so files, and 
generated around 25,000 revisions.

We were surprised that the result is so much bigger than the original 
RCS files: just over 1GB rather than 320MB. Is this expected, or is 
something wrong? We'd hoped that the results would be smaller if 
anything, because of the better handling of binary files. The space is 
nearly all taken by the "db/strings" file.

Perhaps related, the book briefly mentions (in the Repository 
Administration section) the command "svnadmin deltify", although this 
command is not listed in the Subversion Complete Reference section. Does 
the repository store revisions un-delta'd unless this command has been 
run?

We've tried this on both Windows and Mac, each using 1.0.1, with 
essentially similar results.

Any light which anyone can shed will be gratefully received!
-- 
Cheers,
John

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@subversion.tigris.org
For additional commands, e-mail: users-help@subversion.tigris.org

Re: Large repository

Posted by John Aldridge <jp...@jjdash.demon.co.uk>.

In message <10...@localhost.localdomain>, Ben 
Collins-Sussman <su...@collab.net> writes
>On Tue, 2004-04-13 at 16:54, John Aldridge wrote:
>> We've progressed a bit in our evaluation of subversion; and have,
>> apparently successfully, done a trial import of our old RCS repository,
>> using cvs2svn. It took 36 hours to process the 6,500 or so files, and
>> generated around 25,000 revisions.
>>
>> We were surprised that the result is so much bigger than the original
>> RCS files: just over 1GB rather than 320MB. Is this expected, or is
>> something wrong? We'd hoped that the results would be smaller if
>> anything, because of the better handling of binary files. The space is
>> nearly all taken by the "db/strings" file.
>
>I assume that the extra unused logfiles have been removed?  Either by
>you manually, or automatically (if you're using DB 4.2)?

We just installed the binary subversion distribution, but the /bin 
directory contains a file libdb42.dll, so I guess that probably means 
we're running DB 4.2.

>> Any light which anyone can shed will be gratefully received!
>
>cvs2svn is a separate project:  my personal guess is that the very large
>size is coming from inefficiency in cvs2svn's ability to deduce complex
>branches and tags.  It's probably creating many more copies than it
>needs to.

Oh, OK. It didn't occur to me that the cause could be in cvs2svn, rather 
than in subversion itself.

>I would advise two things:
>
>  1. make sure you run your tests using the absolute latest version of
>cvs2svn.  ('svn checkout http://svn.collab.net/repos/cvs2svn/trunk')

We did that, although I haven't checked for changes in the last week.

>  2. discuss these problems on the users@cvs2svn.tigris.org list

I'll head over there now. Thanks!

-- 
Cheers,
John

Re: Large repository

Posted by Max Bowsher <ma...@ukf.net>.

Ben Collins-Sussman wrote:
> On Tue, 2004-04-13 at 16:54, John Aldridge wrote:
>> We've progressed a bit in our evaluation of subversion; and have,
>> apparently successfully, done a trial import of our old RCS repository,
>> using cvs2svn. It took 36 hours to process the 6,500 or so files, and
>> generated around 25,000 revisions.
>>
>> We were surprised that the result is so much bigger than the original
>> RCS files: just over 1GB rather than 320MB. Is this expected, or is
>> something wrong? We'd hoped that the results would be smaller if
>> anything, because of the better handling of binary files. The space is
>> nearly all taken by the "db/strings" file.
>
> I assume that the extra unused logfiles have been removed?  Either by
> you manually, or automatically (if you're using DB 4.2)?
>
>
>> Any light which anyone can shed will be gratefully received!
>
> cvs2svn is a separate project:  my personal guess is that the very large
> size is coming from inefficiency in cvs2svn's ability to deduce complex
> branches and tags.  It's probably creating many more copies than it
> needs to.  Either way, it's a problem with cvs2svn, not with Subversion
> itself.

Ben: This explanation doesn't feel right. This would imply that there was
about 680MB of extra directory skels stored in 'strings' - that seems a bit
much.

Are there any circumstances where a repository load might result in an
inefficiently deltified repository?

Do you think studying the data in the 'representations' table might be
informative?

Max.


Re: Large repository

Posted by Ben Collins-Sussman <su...@collab.net>.

On Tue, 2004-04-13 at 16:54, John Aldridge wrote:
> We've progressed a bit in our evaluation of subversion; and have, 
> apparently successfully, done a trial import of our old RCS repository, 
> using cvs2svn. It took 36 hours to process the 6,500 or so files, and 
> generated around 25,000 revisions.
> 
> We were surprised that the result is so much bigger than the original 
> RCS files: just over 1GB rather than 320MB. Is this expected, or is
> something wrong? We'd hoped that the results would be smaller if
> anything, because of the better handling of binary files. The space is
> nearly all taken by the "db/strings" file.

I assume that the extra unused logfiles have been removed?  Either by
you manually, or automatically (if you're using DB 4.2)?

> Any light which anyone can shed will be gratefully received!

cvs2svn is a separate project:  my personal guess is that the very large
size is coming from inefficiency in cvs2svn's ability to deduce complex
branches and tags.  It's probably creating many more copies than it
needs to.  Either way, it's a problem with cvs2svn, not with Subversion
itself.  Subversion *does* store binary files much more efficiently than
RCS.  If you had been using Subversion from day one, your repository
would very likely look quite different.

I would advise two things:

  1. make sure you run your tests using the absolute latest version of
cvs2svn.  ('svn checkout http://svn.collab.net/repos/cvs2svn/trunk')

  2. discuss these problems on the users@cvs2svn.tigris.org list

Re: 'svnadmin load' doesn't deltify enough.

Posted by Max Bowsher <ma...@ukf.net>.

John Aldridge wrote:
> In message <85...@newton.ch.collab.net>, kfogel@collab.net
> writes
>> Let's start from the beginning:
>
>   (snip very helpful explanation of subversion deltification)
>
>> If someone now commits to /trunk/bar/blah.txt to create r4, *then* the
>> tip of trunk and the tip of branch will both have fulltexts, because
>> starting in r3, the two blah.txt files were no longer sharing storage.
>
>   :
>
>> It doesn't seem likely to me that the extra fulltexts on branch tips
>> could account for the kinds of storage size differences we're seeing
>> here, anyway.  I mean, yeah, if you create a lot of branches, and make
>> commits to many different files on each branch (as opposed to many
>> commits on a few files), then yes, it could affect total storage by
>> these amounts.
>
>   :
>
> And, in message <85...@newton.ch.collab.net>
>> How does being a former RCS repository imply that every file has a
>> commit on every branch?  Shouldn't it only have a commit if the file
>> was modified since being branched?
>
>
> Let me explain the hole we've dug ourselves into here, in the hope that
> someone can suggest something...
>
> Development occurs on the RCS trunk.  When we come to release time (say
> version 6.0) then, for every file in the repository, we drop a label on
> the tip of the trunk...
>
>    rcs -nV60: *
>
> We set up a branch label starting at that point in case we need to issue
> any patches...
>
>    rcs -nV60X:V60.60 *
>
> And we force a revision onto that branch...
>
>    co -rV60 *
>    ci -rV60X -m"V6.0.* development branch" -f *
>
> Before continuing normal development on the mainline.
>
> To be specific, supposing (for a particular file) version 6.0 used
> revision 1.17 of a file, we now have
>
>    The revision label V60 = 1.17
>    The branch label V60X = 1.17.60
>    And an actual revision 1.17.60.1 essentially identical to 1.17
>
> Why do we force a revision onto the branch? Because a checkout of the
> V60X branch label will not succeed unless there's at least one revision
> there (specifically, it does not fall back to check out the branch point
> on the trunk).
>
> I believe that, although CVS uses RCS format files to store data, it has
> some smarts to avoid creating the branch for a file until it is actually
> needed. Using RCS "raw" makes this a difficult strategy to manage.

Entirely correct about CVS. IIRC, CVS terminology for this is "magic
branches", which simply work by recording the symbol using notation x.y.0.z
instead of x.y.z .
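
The notation above can be illustrated with a small Python sketch. It assumes
the usual CVS convention that a magic branch symbol is the RCS branch number
with a 0 inserted before its last component. The function names to_magic and
from_magic are made up for this illustration; they are not part of CVS, RCS,
or cvs2svn.

```python
def to_magic(branch):
    """Turn an RCS branch number (x.y.z) into CVS magic notation (x.y.0.z)."""
    parts = branch.split(".")
    # A real branch number has an odd number of components, at least three.
    if len(parts) < 3 or len(parts) % 2 == 0:
        raise ValueError(f"not a branch number: {branch!r}")
    return ".".join(parts[:-1] + ["0", parts[-1]])


def from_magic(symbol):
    """Turn CVS magic notation (x.y.0.z) back into the branch number (x.y.z)."""
    parts = symbol.split(".")
    # A magic number has an even number of components and a 0 second to last.
    if len(parts) < 4 or len(parts) % 2 != 0 or parts[-2] != "0":
        raise ValueError(f"not a magic branch number: {symbol!r}")
    return ".".join(parts[:-2] + [parts[-1]])


print(to_magic("1.17.60"))
print(from_magic("1.17.0.60"))
```

With John's example, the branch label V60X = 1.17.60 would be recorded by CVS
as 1.17.0.60, and no revision 1.17.60.1 needs to exist until someone actually
commits on the branch.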

> The net result is that pretty much every file in our repository has
> about 5 branches (one for each release), and that these branches /all/
> contain at least one actual revision which is identical to the trunk
> revision at which the branch is rooted. The vast majority of files
> contain just this one revision on each branch.
>
>
> The RCS strategy of storing backwards differences down the trunk, but
> then forwards differences up branches makes this a reasonably efficient
> strategy. Unfortunately, it seems to be a use-case which is not well
> supported by subversion.
>
> As I understand Karl's explanation, though, there seems to be nothing in
> the subversion data structure which "knows" that deltas go backwards
> from the tip. Is there anything I (or the cvs2svn authors, for that
> matter) can do to cause branch deltas to be built forwards from the
> branch point?

Not without editing and recompiling libsvn_fs or libsvn_repos.

> I also still don't understand the purpose of the "svnadmin deltify"
> command. When would I want/need to use this?

AFAIK, it is a useless leftover from a time when deltification was not
automatic.

> I think our fallback strategy is to remove the branches from the RCS
> files before we import them into subversion, and settle for keeping the
> original RCS data around in case we need to do any detailed research
> about anything outside the trunk. I'd rather not do this if it can be
> avoided, though.

Re: 'svnadmin load' doesn't deltify enough.

Posted by John Aldridge <jp...@jjdash.demon.co.uk>.

In message <85...@newton.ch.collab.net>, kfogel@collab.net
writes
>Let's start from the beginning:

  (snip very helpful explanation of subversion deltification)

>If someone now commits to /trunk/bar/blah.txt to create r4, *then* the
>tip of trunk and the tip of branch will both have fulltexts, because
>starting in r3, the two blah.txt files were no longer sharing storage.

  :

>It doesn't seem likely to me that the extra fulltexts on branch tips
>could account for the kinds of storage size differences we're seeing
>here, anyway.  I mean, yeah, if you create a lot of branches, and make
>commits to many different files on each branch (as opposed to many
>commits on a few files), then yes, it could affect total storage by
>these amounts.

  :

And, in message <85...@newton.ch.collab.net>
>How does being a former RCS repository imply that every file has a
>commit on every branch?  Shouldn't it only have a commit if the file
>was modified since being branched?


Let me explain the hole we've dug ourselves into here, in the hope that
someone can suggest something...

Development occurs on the RCS trunk.  When we come to release time (say
version 6.0) then, for every file in the repository, we drop a label on
the tip of the trunk...

   rcs -nV60: *

We set up a branch label starting at that point in case we need to issue
any patches...

   rcs -nV60X:V60.60 *

And we force a revision onto that branch...

   co -rV60 *
   ci -rV60X -m"V6.0.* development branch" -f *

Before continuing normal development on the mainline.

To be specific, supposing (for a particular file) version 6.0 used
revision 1.17 of a file, we now have

   The revision label V60 = 1.17
   The branch label V60X = 1.17.60
   And an actual revision 1.17.60.1 essentially identical to 1.17

Why do we force a revision onto the branch? Because a checkout of the
V60X branch label will not succeed unless there's at least one revision
there (specifically, it does not fall back to check out the branch point
on the trunk).

I believe that, although CVS uses RCS format files to store data, it has
some smarts to avoid creating the branch for a file until it is actually
needed. Using RCS "raw" makes this a difficult strategy to manage.

The net result is that pretty much every file in our repository has
about 5 branches (one for each release), and that these branches /all/
contain at least one actual revision which is identical to the trunk
revision at which the branch is rooted. The vast majority of files
contain just this one revision on each branch.


The RCS strategy of storing backwards differences down the trunk, but
then forwards differences up branches makes this a reasonably efficient
strategy. Unfortunately, it seems to be a use-case which is not well
supported by subversion.
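
As a rough illustration of why the forced branch revisions are cheap in RCS,
here is a minimal Python sketch that lists which stored texts RCS has to visit
to rebuild a given revision. It assumes a single trunk numbered 1.N and at
most one level of branching (1.N.B.M). The function name rcs_delta_chain and
the trunk head 1.20 are invented for the example.

```python
def rcs_delta_chain(trunk_head, revision):
    """List the revisions visited to rebuild `revision`.

    The trunk head (1.<trunk_head>) is stored as a fulltext. Older trunk
    revisions are reverse deltas, and branch revisions are forward deltas
    starting at the branch point.
    """
    parts = [int(p) for p in revision.split(".")]
    if len(parts) not in (2, 4) or parts[0] != 1:
        raise ValueError(f"unsupported revision number: {revision!r}")
    branch_point = parts[1]
    if branch_point > trunk_head:
        raise ValueError("revision is newer than the trunk head")

    # Start from the trunk head and walk reverse deltas down to the branch point.
    chain = [f"1.{n}" for n in range(trunk_head, branch_point - 1, -1)]

    # Then walk forward deltas up the branch, if the revision is on one.
    if len(parts) == 4:
        branch, last = parts[2], parts[3]
        chain += [f"1.{branch_point}.{branch}.{m}" for m in range(1, last + 1)]
    return chain


print(rcs_delta_chain(20, "1.17.60.1"))
```

In this model the only extra cost of the forced revision 1.17.60.1 is one
forward delta against 1.17, and that delta is nearly empty when the two
revisions are essentially identical.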

As I understand Karl's explanation, though, there seems to be nothing in
the subversion data structure which "knows" that deltas go backwards
from the tip. Is there anything I (or the cvs2svn authors, for that
matter) can do to cause branch deltas to be built forwards from the
branch point?

I also still don't understand the purpose of the "svnadmin deltify"
command. When would I want/need to use this?


I think our fallback strategy is to remove the branches from the RCS
files before we import them into subversion, and settle for keeping the
original RCS data around in case we need to do any detailed research
about anything outside the trunk. I'd rather not do this if it can be
avoided, though.

-- 
Cheers,
John

Re: 'svnadmin load' doesn't deltify enough.

Posted by kf...@collab.net.

"Max Bowsher" <ma...@ukf.net> writes:
> This repository is a former RCS repository, so every file has a commit
> on every branch. Instant size explosion. Exceptional case? Maybe not -
> suppose you modify a significant number of the files on a branch (surely
> quite a common scenario, at least for long running branches, being merged
> to). The subversion repository will soon become larger than the equivalent
> CVS repository. Subversion may need to reconsider its deltification scheme.

Maybe I'm missing something...

How does being a former RCS repository imply that every file has a
commit on every branch?  Shouldn't it only have a commit if the file
was modified since being branched?

I presume a more detailed restatement would be "Every file that is on
a branch B also has at least one commit on B", but still don't see why
that would be true...

-K

Re: 'svnadmin load' doesn't deltify enough.

Posted by Max Bowsher <ma...@ukf.net>.

kfogel@collab.net wrote:
> "Erik Huelsmann" <e....@gmx.net> writes:
>> Max Bowsher <ma...@ukf.net> writes
>>> The problem is that each branch is creating another set of fulltexts in
>>> the repository.
>>>
>>> I don't know how deltification is supposed to work with branches,
>>> hopefully someone can explain that?
>>
>> Just guessing here but Subversion does not have any idea about the concepts
>> of trunk and branches. So what you see is that Subversion optimizes accesses
>> by keeping HEAD undeltified be the file in a branch or on trunk.
>>
>> Thinking about why the Subversion community never saw this problem before:
>> We remove dead branches. Therefore the tip of the branch no longer is in
>> HEAD and thus can be stored as delta.
>
> I think this might not be quite right.
>
> Let's start from the beginning:
>
> When a branch is created (by cvs2svn or otherwise), no new file
> fulltexts should be created because the branch is merely a copy or a
> set of copies.  At least one new directory node would be created, of
> course; and more if the branch had to be "patched up" a lot to contain
> the exact set of files.  But file fulltexts?  No, shouldn't be any new
> ones.
>
> And if you commit to a branch, the predecessor node should get
> deltified against the new head of the branch: libsvn_fs has no
> preference to deltify against HEADs in /trunk versus HEADs elsewhere.
>
> Take the following repository:
>
>     r1: /trunk/
>            /foo/
>            /foo/qux.txt
>            /bar/
>            /bar/blah.txt
>         /branches/
>
> Now r2 makes a branch of /trunk.  The arrows show where storage is shared:
>
>     r2: /trunk/
>            /foo/  <-------------------------.
>            /foo/qux.txt  <---------.        |
>            /bar/  <----------------|-----.  |
>            /bar/blah.txt <---------|--.  |  |
>         /branches/                 |  |  |  |
>         /branches/mybranch/        |  |  |  |
>                    /foo/ ----------|--|--|--'
>                    /foo/qux.txt ---'  |  |
>                    /bar/ -------------|--'
>                    /bar/blah.txt -----'
>
> This means that in r2, both 'blah.txt' files are *the same node*, as
> are both 'qux.txt', and both 'foo' and 'bar' directories.  (Sadly, I
> think 'mybranch' is not the same node as trunk, because we had to make
> a new node with a new CopyID.  However, that caveat only applies to
> the top node in a copy.)
>
> r2:/branches is not the same node as r1:/branches, of course.
>
> I'm hoping Mike Pilato or someone will sanity check all my claims
> here, by the way :-).
>
> Okay, as of r2, both .txt nodes are fulltexts (notice the language:
> there are four .txt files, but only two nodes for those four files).
>
> Now we commit a change to blah.txt on the branch, creating r3:
>
>     r3: /trunk/
>            /foo/  <----------------------.
>            /foo/qux.txt  <---------.     |
>            /bar/                   |     |
>            /bar/blah.txt <---------|-----|--- This is now deltified against
>         /branches/                 |     |    /branches/mybranch/bar/blah.txt
>         /branches/mybranch/        |     |
>                    /foo/ ----------|-----'
>                    /foo/qux.txt ---'
>                    /bar/
>                    /bar/blah.txt
>
> Why does /trunk/bar/blah.txt get deltified when someone makes a commit
> to /branches/mybranch/bar/blah.txt?  Because inside the filesystem,
> the original files were the same node: fulltext vs deltatext is merely
> the "representation" of that node in the database.  Since a commit to
> /branches/mybranch/bar/blah.txt will cause the filesystem to deltify
> the predecessor node (just like any commit, for Subversion doesn't
> think "/branches" is special), that means the HEAD of blah.txt in
> trunk will have a deltatext representation.
>
> If someone now commits to /trunk/bar/blah.txt to create r4, *then* the
> tip of trunk and the tip of branch will both have fulltexts, because
> starting in r3, the two blah.txt files were no longer sharing storage.
>
> (One interesting question is: would r3:/trunk/bar/blah.txt be
> redeltified against r4:/trunk/bar/blah.txt, or simply be left as a
> delta against the branch version of blah.txt?  I don't know the answer
> offhand, but it's not really related to the original issue anyway, I'm
> just asking for fun.  My guess is that we do not redeltify if the
> storage is already deltatext -- because there's no reason to believe
> we'd get better space savings.)
>
> So anyway, yes, if both the branch and the trunk are actively
> developed, the total number of fulltexts in the filesystem goes up,
> though perhaps not as quickly as one might expect, because of that
> shared initial storage.
>
> But, deleting a branch doesn't get rid of its fulltexts!  The tip of a
> branch still exists even after the branch has been deleted, and since
> no commits are happening to those files, there's nothing to trigger
> further deltification.  Therefore the fact that we tend to remove dead
> branches in Subversion's own repository shouldn't change the number of
> fulltexts.
>
> It doesn't seem likely to me that the extra fulltexts on branch tips
> could account for the kinds of storage size differences we're seeing
> here, anyway.  I mean, yeah, if you create a lot of branches, and make
> commits to many different files on each branch (as opposed to many
> commits on a few files), then yes, it could affect total storage by
> these amounts.
>
> I haven't done the math for this particular repository, though.  Maybe
> my instincts are off base here?

This repository is a former RCS repository, so every file has a commit
on every branch. Instant size explosion. Exceptional case? Maybe not -
suppose you modify a significant number of the files on a branch (surely
quite a common scenario, at least for long running branches, being merged
to). The subversion repository will soon become larger than the equivalent
CVS repository. Subversion may need to reconsider its deltification scheme.
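
A back-of-the-envelope check of that claim, as a Python sketch using only the
figures reported in this thread (about 6,500 files, about 5 branches per file,
roughly 680MB of extra storage). It assumes that every branch tip is stored as
an uncompressed fulltext and that nothing else contributes to the growth, and
both assumptions are simplifications. The function name is hypothetical.

```python
MIB = 1024 * 1024


def implied_fulltext_bytes(extra_bytes, files, branches_per_file):
    """Average fulltext size at which branch-tip fulltexts alone
    would account for `extra_bytes` of additional storage."""
    return extra_bytes / (files * branches_per_file)


extra = 680 * MIB  # "about 680MB" of growth over the RCS files
average = implied_fulltext_bytes(extra, files=6500, branches_per_file=5)
print(f"{average / 1024:.1f} KiB per branch-tip fulltext")
```

Under these assumptions the extra fulltexts explain the growth if files
average about 21 KiB. Whether that is plausible depends on the actual file
sizes, which the thread does not state.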

Max.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: 'svnadmin load' doesn't deltify enough.

Posted by kf...@collab.net.

"Erik Huelsmann" <e....@gmx.net> writes:
> Max Bowsher <ma...@ukf.net> writes
> > The problem is that each branch is creating another set of fulltexts in
> > the repository.
> > 
> > I don't know how deltification is supposed to work with branches,
> > hopefully someone can explain that?
> 
> Just guessing here but Subversion does not have any idea about the concepts
> of trunk and branches. So what you see is that Subversion optimizes accesses
> by keeping HEAD undeltified be the file in a branch or on trunk.
> 
> Thinking about why the Subversion community never saw this problem before:
> We remove dead branches. Therefore the tip of the branch no longer is in
> HEAD and thus can be stored as delta.

I think this might not be quite right.

Let's start from the beginning:

When a branch is created (by cvs2svn or otherwise), no new file
fulltexts should be created because the branch is merely a copy or a
set of copies.  At least one new directory node would be created, of
course; and more if the branch had to be "patched up" a lot to contain
the exact set of files.  But file fulltexts?  No, shouldn't be any new
ones.

And if you commit to a branch, the predecessor node should get
deltified against the new head of the branch: libsvn_fs has no
preference to deltify against HEADs in /trunk versus HEADs elsewhere.

Take the following repository:

    r1: /trunk/
           /foo/
           /foo/qux.txt
           /bar/
           /bar/blah.txt
        /branches/

Now r2 makes a branch of /trunk.  The arrows show where storage is shared:

    r2: /trunk/
           /foo/  <-------------------------.
           /foo/qux.txt  <---------.        |
           /bar/  <----------------|-----.  |
           /bar/blah.txt <---------|--.  |  |
        /branches/                 |  |  |  |
        /branches/mybranch/        |  |  |  |
                   /foo/ ----------|--|--|--'
                   /foo/qux.txt ---'  |  |
                   /bar/ -------------|--'
                   /bar/blah.txt -----'

This means that in r2, both 'blah.txt' files are *the same node*, as
are both 'qux.txt', and both 'foo' and 'bar' directories.  (Sadly, I
think 'mybranch' is not the same node as trunk, because we had to make
a new node with a new CopyID.  However, that caveat only applies to
the top node in a copy.)

r2:/branches is not the same node as r1:/branches, of course.

I'm hoping Mike Pilato or someone will sanity check all my claims
here, by the way :-).

Okay, as of r2, both .txt nodes are fulltexts (notice the language:
there are four .txt files, but only two nodes for those four files).

Now we commit a change to blah.txt on the branch, creating r3:

    r3: /trunk/
           /foo/  <----------------------.
           /foo/qux.txt  <---------.     |
           /bar/                   |     |
           /bar/blah.txt <---------|-----|--- This is now deltified against
        /branches/                 |     |    /branches/mybranch/bar/blah.txt
        /branches/mybranch/        |     |
                   /foo/ ----------|-----'
                   /foo/qux.txt ---'
                   /bar/                 
                   /bar/blah.txt

Why does /trunk/bar/blah.txt get deltified when someone makes a commit
to /branches/mybranch/bar/blah.txt?  Because inside the filesystem,
the original files were the same node: fulltext vs deltatext is merely
the "representation" of that node in the database.  Since a commit to
/branches/mybranch/bar/blah.txt will cause the filesystem to deltify
the predecessor node (just like any commit, for Subversion doesn't
think "/branches" is special), that means the HEAD of blah.txt in
trunk will have a deltatext representation.

If someone now commits to /trunk/bar/blah.txt to create r4, *then* the
tip of trunk and the tip of branch will both have fulltexts, because
starting in r3, the two blah.txt files were no longer sharing storage.

(One interesting question is: would r3:/trunk/bar/blah.txt be
redeltified against r4:/trunk/bar/blah.txt, or simply be left as a
delta against the branch version of blah.txt?  I don't know the answer
offhand, but it's not really related to the original issue anyway, I'm
just asking for fun.  My guess is that we do not redeltify if the
storage is already deltatext -- because there's no reason to believe
we'd get better space savings.)

So anyway, yes, if both the branch and the trunk are actively
developed, the total number of fulltexts in the filesystem goes up,
though perhaps not as quickly as one might expect, because of that
shared initial storage.

But, deleting a branch doesn't get rid of its fulltexts!  The tip of a
branch still exists even after the branch has been deleted, and since
no commits are happening to those files, there's nothing to trigger
further deltification.  Therefore the fact that we tend to remove dead
branches in Subversion's own repository shouldn't change the number of
fulltexts.
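
To make the walkthrough above concrete, here is a minimal toy model of it
in Python.  It is only a sketch, not how libsvn_fs is implemented: the
`Node` and `ToyFS` names are hypothetical, directories are ignored, and it
assumes (as guessed above, not confirmed) that a predecessor which is
already a deltatext is not redeltified.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(eq=False)
class Node:
    """A file node; stored as fulltext until deltified against a successor."""
    label: str
    delta_base: Optional["Node"] = None

    @property
    def is_fulltext(self) -> bool:
        return self.delta_base is None


class ToyFS:
    """Hypothetical toy filesystem: paths point at nodes, storage never shrinks."""

    def __init__(self) -> None:
        self.head: Dict[str, Node] = {}   # path -> node in the youngest revision
        self.storage: List[Node] = []     # every node ever written

    def add(self, path: str) -> None:
        node = Node(path)
        self.head[path] = node
        self.storage.append(node)

    def copy(self, src: str, dst: str) -> None:
        # A branch is a cheap copy: new paths, same nodes, no new fulltexts.
        for path, node in list(self.head.items()):
            if path.startswith(src + "/"):
                self.head[dst + path[len(src):]] = node

    def commit(self, path: str, label: str) -> None:
        old, new = self.head[path], Node(label)
        # Assumption: only a fulltext predecessor gets deltified; an existing
        # deltatext is left alone.
        if old.is_fulltext:
            old.delta_base = new
        self.head[path] = new
        self.storage.append(new)

    def delete(self, prefix: str) -> None:
        # Deleting removes paths from HEAD only; stored nodes are untouched.
        for path in [p for p in self.head if p.startswith(prefix + "/")]:
            del self.head[path]

    def fulltexts(self) -> int:
        return sum(node.is_fulltext for node in self.storage)


fs = ToyFS()
fs.add("/trunk/foo/qux.txt")                          # r1
fs.add("/trunk/bar/blah.txt")
fs.copy("/trunk", "/branches/mybranch")               # r2: branch shares nodes
print("r2 fulltexts:", fs.fulltexts())
fs.commit("/branches/mybranch/bar/blah.txt", "r3")    # r3: commit on the branch
print("r3 trunk blah.txt is fulltext:",
      fs.head["/trunk/bar/blah.txt"].is_fulltext)
fs.commit("/trunk/bar/blah.txt", "r4")                # r4: commit on trunk
print("r4 fulltexts:", fs.fulltexts())
fs.delete("/branches/mybranch")                       # r5: delete the branch
print("r5 fulltexts:", fs.fulltexts())
```

In this model the fulltext count stays at two through r2 and r3, rises to
three once trunk and branch have both been committed to in r4, and does
not drop when the branch is deleted in r5.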

It doesn't seem likely to me that the extra fulltexts on branch tips
could account for the kinds of storage size differences we're seeing
here, anyway.  I mean, yeah, if you create a lot of branches, and make
commits to many different files on each branch (as opposed to many
commits on a few files), then yes, it could affect total storage by
these amounts.

I haven't done the math for this particular repository, though.  Maybe
my instincts are off base here?

-Karl

Re: 'svnadmin load' doesn't deltify enough.

Posted by Erik Huelsmann <e....@gmx.net>.

> John Aldridge wrote:
> > In message <05...@starfruit>, Max Bowsher
> > <ma...@ukf.net> writes
> >> Would it be possible for you to make your abnormally large repository
> >> available for debugging?
> >
> > I've sent a (cut down) sample of the data to Max -- I'll wait to see
> > what he comes up with before posting the same problem over on the
> > cvs2svn list.
> 
> Thank you, I've analysed the problem.
> 
> The problem is that each branch is creating another set of fulltexts in the
> repository.
> 
> I don't know how deltification is supposed to work with branches, hopefully
> someone can explain that?

Just guessing here, but Subversion does not have any idea about the concepts
of trunk and branches. So what you see is that Subversion optimizes accesses
by keeping HEAD undeltified, whether the file is in a branch or on trunk.

Thinking about why the Subversion community never saw this problem before:
We remove dead branches. Therefore the tip of the branch no longer is in
HEAD and thus can be stored as delta.

bye,

Erik.

Re: 'svnadmin load' doesn't deltify enough.

Posted by John Aldridge <jp...@jjdash.demon.co.uk>.

In message <00...@starfruit>, Max Bowsher 
<ma...@ukf.net> writes
>John Aldridge wrote:
>> In message <05...@starfruit>, Max Bowsher
>> <ma...@ukf.net> writes
>>> Would it be possible for you to make your abnormally large repository
>>> available for debugging?
>>
>> I've sent a (cut down) sample of the data to Max -- I'll wait to see
>> what he comes up with before posting the same problem over on the
>> cvs2svn list.
>
>Thank you, I've analysed the problem.
>
>The problem is that each branch is creating another set of fulltexts in the
>repository.
>
>I don't know how deltification is supposed to work with branches, hopefully
>someone can explain that?

That would certainly account for it.

Most files in our existing RCS repository will have 5-6 branches off the 
trunk (each branch mostly containing just a single revision with only a 
$Id$/$Log$ change). In the RCS files, I believe these are stored as 
deltas from the trunk branch point.

If, in Subversion (at least as built by cvs2svn), each such branch is 
duplicating the whole file, that would give a big expansion.
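
As a rough back-of-envelope check of that expansion (a sketch only, in
Python): it assumes every branch tip stores its own fulltext of each file,
ignores deltas and database overhead entirely, and rounds "just over 1GB"
to 1024 MB.  The function name is hypothetical.

```python
def implied_head_size_mb(total_mb: float, branches_per_file: float) -> float:
    """Size of one set of fulltexts if trunk plus every branch tip keeps a copy."""
    return total_mb / (1 + branches_per_file)


for branches in (5, 6):
    size = implied_head_size_mb(1024, branches)
    print(f"{branches} branches per file -> about {size:.0f} MB per set of fulltexts")
```

So one set of fulltexts of roughly 150-170 MB, duplicated on 5-6 branch
tips per file, would already be enough to reach the 1GB figure.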
-- 
Cheers,
John

'svnadmin load' doesn't deltify enough.

Posted by Max Bowsher <ma...@ukf.net>.

John Aldridge wrote:
> In message <05...@starfruit>, Max Bowsher
> <ma...@ukf.net> writes
>> Would it be possible for you to make your abnormally large repository
>> available for debugging?
>
> I've sent a (cut down) sample of the data to Max -- I'll wait to see
> what he comes up with before posting the same problem over on the
> cvs2svn list.

Thank you, I've analysed the problem.

The problem is that each branch is creating another set of fulltexts in the
repository.

I don't know how deltification is supposed to work with branches, hopefully
someone can explain that?

Max.


Re: Large repository

Posted by John Aldridge <jp...@jjdash.demon.co.uk>.

In message <05...@starfruit>, Max Bowsher 
<ma...@ukf.net> writes
>Would it be possible for you to make your abnormally large repository
>available for debugging?

I've sent a (cut down) sample of the data to Max -- I'll wait to see 
what he comes up with before posting the same problem over on the 
cvs2svn list.
-- 
Cheers,
John

Re: Large repository

Posted by Max Bowsher <ma...@ukf.net>.

John Aldridge wrote:
> We've progressed a bit in our evaluation of subversion; and have,
> apparently successfully, done a trial import of our old RCS repository,
> using cvs2svn. It took 36 hours to process the 6,500 or so files, and
> generated around 25,000 revisions.
>
> We were surprised that the result is so much bigger than the original
> RCS files: just over 1GB rather than 320MB. Is this expected, or is
> something wrong? We'd hoped that the results would be smaller if
> anything, because of the better handling of binary files. The space is
> nearly all taken by "db/strings" file.

Would it be possible for you to make your abnormally large repository
available for debugging?

If so, please run the following:

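cd path/to/your/repos/db
# Force a single checkpoint of the log, then exit
db_checkpoint -1
# Dump each table file whose name ends in "s" (e.g. strings), compressed
for i in *s; do
   db_dump -kp "$i" | bzip2 -c > "$i.bdbdump.bz2"
done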

and place the .bdbdump.bz2 files on a web/ftp server.

Max.

