Posted to dev@subversion.apache.org by John Coiner <jo...@amd.com> on 2009/02/21 17:23:12 UTC

SVN scalability problem as number of tags grows

Hi SVN developers,

I support SVN for a few hundred co-workers. We have been using SVN 
heavily for about two years, generating about 60000 commits, 90000 tags, 
and 3000 branches in one repository.

We have recently discovered a scalability problem. If you follow the 
usual "trunk/tags/branches" structure, the size required to store each 
new tag grows in proportion to the number of tags previously created.

This can be demonstrated in just a few commands, in a brand new repository:

  svnadmin create test_repo
  svn list file:///home/john/testsvn/test_repo
  svn mkdir file:///home/john/testsvn/test_repo/trunk -m ''
  svn mkdir file:///home/john/testsvn/test_repo/tags -m ''
  svn copy file:///home/john/testsvn/test_repo/trunk file:///home/john/testsvn/test_repo/tags/tag1 -m ''
  svn copy file:///home/john/testsvn/test_repo/trunk file:///home/john/testsvn/test_repo/tags/tag2 -m ''
  svn copy file:///home/john/testsvn/test_repo/trunk file:///home/john/testsvn/test_repo/tags/tag3 -m ''
  svn copy file:///home/john/testsvn/test_repo/trunk file:///home/john/testsvn/test_repo/tags/tag4 -m ''
  svn copy file:///home/john/testsvn/test_repo/trunk file:///home/john/testsvn/test_repo/tags/tag5 -m ''
  svn copy file:///home/john/testsvn/test_repo/trunk file:///home/john/testsvn/test_repo/tags/tag6 -m ''
  svn copy file:///home/john/testsvn/test_repo/trunk file:///home/john/testsvn/test_repo/tags/tag7 -m ''
  svn copy file:///home/john/testsvn/test_repo/trunk file:///home/john/testsvn/test_repo/tags/tag8 -m ''
  svn copy file:///home/john/testsvn/test_repo/trunk file:///home/john/testsvn/test_repo/tags/tag9 -m ''

In the FSFS, each new revs/ entry is larger than the previous one. In 
the output of 'ls' below, revs 3 through 11 correspond to the creation 
of the tag1 through tag9 directories:

john@pitfall:~/testsvn/test_repo/db/revs/0$ ls -latr
total 56
-rw-r--r-- 1 john john  115 2009-02-21 11:19 0
drwxr-sr-x 3 john john 4096 2009-02-21 11:19 ..
-rw-r--r-- 1 john john  277 2009-02-21 11:19 1
-rw-r--r-- 1 john john  305 2009-02-21 11:19 2
-rw-r--r-- 1 john john  531 2009-02-21 11:19 3
-rw-r--r-- 1 john john  564 2009-02-21 11:19 4
-rw-r--r-- 1 john john  595 2009-02-21 11:19 5
-rw-r--r-- 1 john john  628 2009-02-21 11:19 6
-rw-r--r-- 1 john john  659 2009-02-21 11:19 7
-rw-r--r-- 1 john john  690 2009-02-21 11:20 8
-rw-r--r-- 1 john john  721 2009-02-21 11:20 9
-rw-r--r-- 1 john john  762 2009-02-21 11:20 10
-rw-r--r-- 1 john john  800 2009-02-21 11:20 11
drwxr-sr-x 2 john john 4096 2009-02-21 11:20 .

After creating 90000 tags, each new tag consumes megabytes of space in 
the repository. Also each new tag takes a few seconds to apply, up from 
milliseconds when we first began. We had the expectation of more 
graceful scaling, based in part on our experience in other situations 
where SVN scales well, for example committing a million additions to the 
same file.
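
As a rough check on the "megabytes" figure, the sizes of revs 3 through
11 in the listing can be extrapolated linearly. This is only a sketch:
it assumes the per-tag growth seen for tag1 through tag9 stays constant,
which probably understates real repositories where tag names are longer
than "tag1". The variable names are illustrative.

```python
# Sizes in bytes of revs 3..11 from the 'ls' output (creation of tag1..tag9).
rev_sizes = [531, 564, 595, 628, 659, 690, 721, 762, 800]

# Growth of the revision file from one tag creation to the next.
deltas = [b - a for a, b in zip(rev_sizes, rev_sizes[1:])]
bytes_per_existing_tag = sum(deltas) / len(deltas)

# Assumption: growth stays linear in the number of tags already present.
existing_tags = 90000
estimated_rev_bytes = bytes_per_existing_tag * existing_tags

print(deltas)
print(bytes_per_existing_tag, "bytes of growth per existing tag")
print(estimated_rev_bytes / 1e6, "MB per new tag (rough estimate)")
```

Under that assumption the estimate comes out at roughly 3 MB per new
revision, which is the same order of magnitude as what we observe.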

Our big installation is running on Linux, SVN 1.4.4, and FSFS. The 
problem also exists in SVN 1.5.1.

Is this a known issue? Are there plans to make this more scalable? I 
searched the issues database and did not find anything that looked like 
a duplicate. Should I file a new issue?

Do you have any recommendations for a workaround?

One workaround that we are evaluating is to shard the branches and tags 
over a large number of directories. So rather than create 
"tags/TAG_NAME", we may begin to create "tags2/1/b/5/e/TAG_NAME". The 
"1/b/5/e" is the first four hex digits of the md5 hash of "TAG_NAME". We 
chose "tags2" as the base directory to avoid colliding with existing 
entries under "tags/" that happen to be named after a hex digit.

This scales better. Applying N sharded tags requires O(N) space and each 
tag takes O(1) time to apply.
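
A minimal sketch of the path computation described above, assuming the
tag name is hashed as UTF-8 text. The function name shard_path and its
parameters are hypothetical, not part of any Subversion tooling.

```python
import hashlib


def shard_path(tag_name, base="tags2", levels=4):
    """Return base/h1/h2/h3/h4/tag_name, where h1..h4 are the first
    hex digits of the MD5 hash of tag_name (UTF-8 encoding assumed)."""
    digest = hashlib.md5(tag_name.encode("utf-8")).hexdigest()
    # Each of the first `levels` hex digits becomes one directory level.
    return "/".join([base, *digest[:levels], tag_name])


print(shard_path("TAG_NAME"))
```

With four hex digits there are 16^4 = 65536 leaf directories, so 90000
tags would average between one and two entries per leaf, and no
directory above the leaves ever holds more than 16 entries.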

One possible resolution of this issue is a documentation-only change. If 
the SVN book described the scalability issue and recommended a sharded 
tags and branches structure, it would help future "enterprise" adopters 
(and other crazy people who create way too many tags :)

Please let me know if you need any more information about this problem. 
Cheers,

John

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=1204071

Re: SVN scalability problem as number of tags grows

Posted by Greg Hudson <gh...@mit.edu>.
On Sat, 2009-02-21 at 12:23 -0500, John Coiner wrote:
> Is this a known issue? Are there plans to make this more scalable? I 
> searched the issues database and did not find anything that looked like 
> a duplicate. Should I file a new issue?

It is a known issue that svn's back end storage of directories with many
entries isn't terribly efficient.  All revisions of all directory lists
are stored in full, so a directory with many entries takes O(n) time to
modify and O(n) space to hold each new revision (O(n^2) space total, if
the number of changes is proportional to the number of entries).
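
To put a number on the O(n^2) estimate, here is a simplified model. It
assumes one entry is added per revision and the full directory listing
is written out each time; it counts entries, not bytes, and the function
name is illustrative only.

```python
def entries_written_flat(n):
    """Total directory entries written when n items are added one at a
    time to a single directory whose full listing is stored on every
    change: 1 + 2 + ... + n = n * (n + 1) / 2."""
    return sum(range(1, n + 1))


n = 90000
print(entries_written_flat(n))  # 4050045000
```

At 90000 entries that is about four billion entries written in total,
against 90000 if each revision only had to record the one new entry.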

Since we use directories to hold tags, this issue applies to large
numbers of tags if they are stored in a single flat directory, as the
usual convention suggests.

I don't know of any plans to make this more scalable.  It would require
a significant rearchitecting of directory storage.  One approach would
be to use a balanced tree with many roots to hold all revisions of a
directory--but to do that, we'd have to store all revisions of a
directory together (not necessarily in the same disk blocks, but in some
fashion designed to avoid excessive seeking).  In FSFS, because of other
design constraints, that's simply not practical.  In BDB it might be more
tractable.

> Do you have any recommendations for a work around?

Organizing the tags in a tree structure is probably the best workaround,
as you have already found.

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=1204282

RE: Re: SVN scalability problem as number of tags grows

Posted by Andy Bolstridge <an...@bolstridge.plus.com>.
> Another possibility is to use rev #s rather than tags. If we were 
> starting over we might do this. (Given our infrastructure already 
> deployed atop SVN, which already uses tags, switching to rev #s is a 
> riskier change than switching to sharded tags.)
> 

Someone once suggested storing a 'label' text that mapped to a revnum, so
you could have human-readable 'tags' without having to create the tag
branches. IIRC he got shot down in flames, but I think the suggestion was
a good one, especially if you create many tag branches and they are not
quite as cheap as described.

Sure, adding new entries means more data, but if you add more data and
branch as well, and then make lots of tags, you're going to see a
significant increase in storage sooner rather than later.

I haven't seen a problem with it yet (and I have 12 GB and 300,000
revisions), but this does act as a warning not to start creating tag
branches, thanks.
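
The label idea could look something like the sketch below. This is not
a Subversion feature: the table, the label names, the revision numbers,
and the resolve function are all hypothetical, and where the table
would be stored is left open. The only Subversion detail relied on is
the URL@REV peg-revision syntax.

```python
# Hypothetical label table: human-readable names mapped to revision numbers.
labels = {"release-1.0": 1200, "release-1.1": 1875}


def resolve(label, repo_url, path="trunk"):
    """Return a peg-revision URL such as REPO/trunk@1200 for a label."""
    return f"{repo_url}/{path}@{labels[label]}"


# The result could be passed to a command such as 'svn checkout' or 'svn export'.
print(resolve("release-1.0", "file:///home/john/testsvn/test_repo"))
```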

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2313467

Re: SVN scalability problem as number of tags grows

Posted by John Coiner <jo...@amd.com>.
Greg Stein wrote:
> It really doesn't have anything to do with tags/ per se, but simply
> that you're creating an ever-larger directory. The size of the
> name:node mapping for the directory contents will continue to grow as
> you add new entries into that directory.

Yes, agreed.

> The sharding is the appropriate solution. You could shard by date,
> initial letters of the tag, or the hash of the tag (as you suggested).
> Just settle on one, and you should be fine.

Thank you, it's nice to have a vote of confidence.

It would be nice if the svn book had a warning about this. It would be 
extra nice if the svn book had a section on the scalability of several 
common operations.

One of my coworkers has tested SVN scalability in a number of 
situations. So the data exists. I'll get in touch with the svn book 
project and see if they would like a contribution.

> Another solution would be to delete obsolete tags... Lots of possibilities.

Agreed. It's difficult for us to know which tags are obsolete, which is 
our own problem.

Another possibility is to use rev #s rather than tags. If we were 
starting over we might do this. (Given our infrastructure already 
deployed atop SVN, which already uses tags, switching to rev #s is a 
riskier change than switching to sharded tags.)

Thanks for your help with this. Cheers,

John

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=1204451

Re: SVN scalability problem as number of tags grows

Posted by Greg Stein <gs...@gmail.com>.
Hi John,

It really doesn't have anything to do with tags/ per se, but simply
that you're creating an ever-larger directory. The size of the
name:node mapping for the directory contents will continue to grow as
you add new entries into that directory.

You'd see the exact same problem if you created 90000 entries in
/trunk/some/path/down/deep/in/the/hierarchy/.

The sharding is the appropriate solution. You could shard by date,
initial letters of the tag, or the hash of the tag (as you suggested).
Just settle on one, and you should be fine.

Another solution would be to delete obsolete tags. Note that they will
always be there in history, just not in HEAD. You could also rotate
tags into an archival tag directory. For example, each month, you
could move tags into /tags/archive/2008-12/ and
/tags/archive/2009-01/. Or even /archived-tags/... for that matter.
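
A sketch of how the monthly rotation might be scripted. It only builds
the 'svn move' command line and does not run it. The function name and
commit message are made up for illustration, and it assumes the
destination directory /tags/archive/YYYY-MM/ already exists.

```python
def archive_command(repo_url, tag_name, year, month):
    """Build an 'svn move' command that relocates a tag under
    tags/archive/YYYY-MM/. The command is returned, not executed."""
    src = f"{repo_url}/tags/{tag_name}"
    dst = f"{repo_url}/tags/archive/{year:04d}-{month:02d}/{tag_name}"
    return ["svn", "move", src, dst, "-m", f"Archive tag {tag_name}"]


print(" ".join(archive_command("file:///home/john/testsvn/test_repo",
                               "tag1", 2008, 12)))
```

The returned list could be handed to subprocess.run once the output
looks right for a few tags.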

Lots of possibilities. I think the right answer is going to depend
upon your workflow, to determine what will work best for you. Main
point: creating directories with 90k entries *is* going to consume
more time and space.

Cheers,
-g

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=1204276