Posted to users@subversion.apache.org by The Grey Wolf <gr...@starwolf.com> on 2012/01/24 22:18:36 UTC

revision files absurdly large at higher revisions

Hello, I'm not quite sure how to properly phrase the subject
as a query term, so if this has been answered, please forgive
the redundancy and quietly point me to where this gets addressed.

We are using svn at work to hold customer 'vault' data [various bits
of information for each customer].  It has been a huge success -- to
the point where we have over 1,000 customers using vaults.  The checkins
are automated, and we have amassed over 100,000 revisions thus far.

User directories are created as /Ab/username [where Ab is a 2-character
hash via a known (balanced) algorithm to make location of username files more
machine-efficient].  So we have about 1,200 of these guys, with some hashes
obviously being re-used, no big deal.

The problem is that, even on minuscule changes, we are finding the
db/revs/<shard>/<revno> files to be disproportionately large; for an
addition or change of a file that is about 1k-4k, the rev files are
at 100K each.  At lower revisions, we noticed that the rev files are
4k but have been increasing in size with each shard that gets added,
usually to the tune of 1k/shard.  With so many revisions being checked
in at a rapid rate, we found ourselves having to take production off
line for a couple of minutes while we migrated the repository in question
to a larger filesystem due to the threat of the filesystem filling
up.
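
One way to quantify the growth described above is to average the rev file sizes per shard. The following is only a sketch: the function name `average_rev_size_per_shard` and the `revs_dir` argument are made up for illustration, and it assumes an unpacked, sharded layout in which each numeric subdirectory holds numerically named rev files.

```python
import os
from collections import defaultdict


def average_rev_size_per_shard(revs_dir):
    """Return {shard_number: average rev file size in bytes}.

    Assumes revs_dir contains numeric shard directories, each holding
    numerically named rev files. Anything else (for example packed
    shards with a non-numeric name) is skipped.
    """
    totals = defaultdict(lambda: [0, 0])  # shard -> [total_bytes, file_count]
    for shard in os.listdir(revs_dir):
        shard_path = os.path.join(revs_dir, shard)
        if not (shard.isdigit() and os.path.isdir(shard_path)):
            continue
        for rev in os.listdir(shard_path):
            rev_path = os.path.join(shard_path, rev)
            if rev.isdigit() and os.path.isfile(rev_path):
                totals[int(shard)][0] += os.path.getsize(rev_path)
                totals[int(shard)][1] += 1
    return {s: total / count for s, (total, count) in sorted(totals.items())}
```

Calling it on the repository's revision directory and printing the result shard by shard should show whether the average really climbs by roughly 1k per shard.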

The upshot of this is:  Why does a minimal delta create such a large
delta file?  100k for a small change?  What's going on and how can we
mitigate this?
-- 
                --*greywolf; 

Re: revision files absurdly large at higher revisions

Posted by Thorsten Schöning <ts...@am-soft.de>.
Hello Greywolf,
on Wednesday, 25 January 2012 at 09:06 you wrote:

> So are you saying that if I add a file /ab/username/file, it's going to copy
> the ENTIRE top level directory in as a delta?

This problem has been discussed on the list several times, and last year
a very good explanation of how Subversion stores its directories was
posted, but I can't find it. If anyone can provide it, it would make your
problem a lot easier to understand. Try searching for other discussions of
large rev files with small changes, directory layout and so on; I didn't
have any luck, though.

Best regards,

Thorsten Schöning


Re: revision files absurdly large at higher revisions

Posted by Johan Corveleyn <jc...@gmail.com>.
On Wed, Jan 25, 2012 at 9:06 AM, Greywolf <gr...@starwolf.com> wrote:
> Interesting, to be sure.  Here's some stats.
>
> top level = 2817 entries
> second level = 1..22 entries [depending on which one]
> Some have a third level, most don't; ranges 1..27 entries.
>
> So are you saying that if I add a file /ab/username/file, it's going to copy
> the ENTIRE top level directory in as a delta?

No, not as a delta: every revision stores the entire directory listing
of each parent directory of the changed path as a full text list.
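
As a rough back-of-the-envelope sketch of what that means for the numbers quoted above: the entry counts are the ones from this thread, but `bytes_per_entry=40` is purely an assumed figure for illustration (the real per-entry cost depends on name lengths and the storage format), and the function name is made up.

```python
def estimated_directory_overhead(entries_per_level, bytes_per_entry):
    """Bytes of directory listings written by one commit, assuming every
    directory on the path to the changed file is stored in full."""
    return sum(entries_per_level) * bytes_per_entry


# Entry counts reported in the thread: the top level, plus the upper end
# of the ranges given for the second and third levels.
# bytes_per_entry=40 is an assumption, not a measured value.
worst_case = estimated_directory_overhead([2817, 22, 27], bytes_per_entry=40)
top_only = estimated_directory_overhead([2817], bytes_per_entry=40)

print(f"all levels: {worst_case} bytes, top level alone: {top_only} bytes")
```

Under that assumption the top-level directory alone accounts for over 110,000 bytes per commit, the same order of magnitude as the 100K rev files reported, while the lower levels barely register.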

See issue #4084 [1] for some recent activity on this problem (it has
always been that way, but more people have recently started looking
into it).

AFAIK, Stefan Fuhrmann has recently implemented "Directory
deltification" on trunk [2], so perhaps this will come in 1.8. But
there is still some discussion and testing about the tradeoffs (it's
basically a CPU vs. storage tradeoff: deltifying directories requires
the server to do more work).


[1] http://subversion.tigris.org/issues/show_bug.cgi?id=4084 (FSFS and
BDB store large directories inefficiently)
[2] http://svn.haxx.se/dev/archive-2011-12/0356.shtml and
http://svn.haxx.se/dev/archive-2012-01/0020.shtml (the thread is
somehow broken in two on haxx.se)
-- 
Johan

Re: revision files absurdly large at higher revisions

Posted by Greywolf <gr...@starwolf.com>.
On 1/24/2012 23:04, Ryan Schmidt wrote:
> It probably has to do with the size of the directory entries, not the
> changes you're making to the files.
>
> If you add a file, that's recorded as a change to the directory. When you
> change a file, Subversion stores only the changes you made, not the
> complete new file, and it stores them compressed. However, when you change
> a directory (e.g. by adding or removing a file or directory), Subversion
> records a complete new copy of the directory, and I don't know if it's
> compressed or not. If the directory has hundreds or thousands of items,
> that will take some space.
>
> I don't remember if modifying a file counts as a change to the directory,
> but adding or deleting a file certainly does.
>
> Based on this I would assume you could mitigate the problem by having fewer
> items in each directory. Create a deeper directory structure from your
> hash: /A/Ab/username, or even /A/Ab/Abc/username. You should try this out
> in a testing environment. Either create some test data, or dump your
> current repository, and then a) load it into a fresh empty repository
> as-is, and b) transform it into a deeper directory structure using a tool
> like svndumptool, then load that into a second fresh empty repository. Then
> see if there is an appreciable size difference.

Interesting, to be sure.  Here's some stats.

top level = 2817 entries
second level = 1..22 entries [depending on which one]
Some have a third level, most don't; ranges 1..27 entries.

So are you saying that if I add a file /ab/username/file, it's going to copy
the ENTIRE top level directory in as a delta?


-- 
				--*greywolf;

Re: revision files absurdly large at higher revisions

Posted by Ryan Schmidt <su...@ryandesign.com>.
On Jan 24, 2012, at 15:18, The Grey Wolf wrote:

> The upshot of this is:  Why does a minimal delta create such a large
> delta file?  100k for a small change?  What's going on and how can we
> mitigate this?

It probably has to do with the size of the directory entries, not the changes you're making to the files.

If you add a file, that's recorded as a change to the directory. When you change a file, Subversion stores only the changes you made, not the complete new file, and it stores them compressed. However, when you change a directory (e.g. by adding or removing a file or directory), Subversion records a complete new copy of the directory, and I don't know if it's compressed or not. If the directory has hundreds or thousands of items, that will take some space.

I don't remember if modifying a file counts as a change to the directory, but adding or deleting a file certainly does.

Based on this I would assume you could mitigate the problem by having fewer items in each directory. Create a deeper directory structure from your hash: /A/Ab/username, or even /A/Ab/Abc/username. You should try this out in a testing environment. Either create some test data, or dump your current repository, and then a) load it into a fresh empty repository as-is, and b) transform it into a deeper directory structure using a tool like svndumptool, then load that into a second fresh empty repository. Then see if there is an appreciable size difference.
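
To make the deeper layout concrete, here is a minimal sketch of the /Ab/username to /A/Ab/username mapping. The helpers `deepen_path` and `top_level_entries` are hypothetical names for illustration and are not part of svndumptool or Subversion; an actual conversion would have to apply such a mapping to the Node-path and Node-copyfrom-path headers of a dump file.

```python
def deepen_path(path):
    """Map '/Ab/username/...' to '/A/Ab/username/...' by adding a parent
    directory named after the first character of the hash directory."""
    parts = path.strip("/").split("/")
    if not parts[0]:
        return path  # the root itself has nothing to re-bucket
    bucket = parts[0]
    return "/" + "/".join([bucket[0], *parts])


def top_level_entries(paths):
    """Return the set of distinct top-level directory names used by paths."""
    return {p.strip("/").split("/")[0] for p in paths if p.strip("/")}


# Example paths are invented for illustration.
old_paths = ["/Ab/alice/file", "/Ac/bob/file", "/Bd/carol/file"]
new_paths = [deepen_path(p) for p in old_paths]

print(new_paths)
print(len(top_level_entries(old_paths)), "->", len(top_level_entries(new_paths)))
```

The point of the extra level is that the top-level directory shrinks to at most one entry per leading character, so the listing that gets rewritten on every commit becomes much smaller. Whether that translates into an appreciable size difference is exactly what the test load should show.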