Posted to users@subversion.apache.org by Peter Valdemar Mørch <sw...@sneakemail.com> on 2004/12/28 20:28:14 UTC

Minimizing repository growth when large files change....

Hi there,

I'm trying to figure out how to store large ASCII (7-bit?) files in 
subversion in the most space-efficient manner.

My test data is the Contents-i386.gz from a Debian distribution
http://ftp2.de.debian.org/debian/dists/sid/Contents-i386.gz
(which lists which files are in which packages).
I want to track this file over time with subversion...

It started out as a question about whether to store the Contents-i386.gz
in svn or unpack it and store Contents-i386 instead. I thought storing
the raw Contents-i386 would be best: the initial file would be big, but
the diffs between versions would be small..

But the repository *explodes* in size when I try this...

I have the file in two versions. Both are about 8.8 MB .gz and 122 MB 
raw. They contain about 1.6 million lines and about 1.5% = 25500 (17000 
+ 8500 for each direction) of these lines change between the two 
versions. A `diff`  between the raw files is 2.1 MB.

But I found that the repository grew a whopping *285.54* MB!!! That's 12k
for each line of diff and 135 times the size of storing the diff output!

Or 1.5% lines changed =>
    repository grew 227% (or 308% if I store the .gzs)
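
The arithmetic behind these figures can be rechecked with a short awk
sketch. The numbers are copied from this message (raw-file column);
growth_stats is a hypothetical helper name used only for illustration,
and 1 MB is assumed to mean 1024 * 1024 bytes.

```shell
#!/bin/bash
# Figures copied from the message (sizes in MB, counts in lines).
growth_mb=285.54        # repository growth on the second commit
first_commit_mb=125.81  # repository size after the first commit
changed_lines=25500     # lines that differ between the two versions
diff_mb=2.1             # size of the plain `diff` output

# growth_stats GROWTH_MB FIRST_MB CHANGED_LINES DIFF_MB
# Prints growth per changed line, growth relative to the diff size,
# and growth relative to the first commit.
growth_stats() {
    awk -v g="$1" -v f="$2" -v n="$3" -v d="$4" 'BEGIN {
        printf "bytes per changed line:  %.2f\n", g * 1024 * 1024 / n
        printf "growth / diff size:      %.2f\n", g / d
        printf "growth / first commit:   %.2f%%\n", 100 * g / f
    }'
}

growth_stats "$growth_mb" "$first_commit_mb" "$changed_lines" "$diff_mb"
```

The per-line figure comes out a little under the rounded 12k quoted
above, which is close enough for the point being made.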

What's up?

Are there any good pointers on how to store these large text files and 
track their changes? Should I store .gz or raw files?

If 25000 lines change every day, and that results in 300MB repository
growth, that quickly becomes unmanageable... (There are several such
files...)

Peter

----------

I tried creating a fresh repository containing first one version, then 
the other for both the .gz and the raw files. Here are the repository 
sizes in MB when this is done:

                    .gzs       raw files
First commit:      11.25      125.81
Second commit:     46.21      411.35
Rep Growth:        34.96      285.54
Rep Growth Ratio:  308%       227%

Repository size vs.
Sum of file sizes
after 2nd commit:  263%       286%

The two test files can be found here:
http://demo.capmon.dk/~pvm/svnLargeTextFiles/

-- 
Peter Valdemar Mørch
http://www.morch.com


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@subversion.tigris.org
For additional commands, e-mail: users-help@subversion.tigris.org

Re: Minimizing repository growth when large files change....

Posted by Justin Erenkrantz <ju...@erenkrantz.com>.
--On Thursday, December 30, 2004 1:32 AM -0600 Ben Collins-Sussman 
<su...@collab.net> wrote:

> This sounds really weird to me.  I mean, we're all aware that fsfs uses
> *some* less space than bdb... like 20% less, I thought, was the rule of
> thumb.
>
> But 90% less space?  Is something really fishy going on here?  If the
> script below really reproduces this, should we investigate?

Well, BDB may not be as efficient.  From their docs:

<http://www.sleepycat.com/docs/ref/am_misc/diskspace.html>

"Space freed by deleting key/data pairs from a Btree or Hash database is 
never returned to the filesystem, although it is reused where possible. 
This means that the Btree and Hash databases are grow-only. If enough keys 
are deleted from a database that shrinking the underlying file is 
desirable, you should create a new database and copy the records from the 
old one into it."

Here's a data point from one repository, loaded from a dump with ~120k revisions:

BDB on a straight load:  6.3GB
FSFS on a straight load: 3.5GB
BDB after a db_dump/db_load cycle: 4.7GB

So, after a BDB dump/load, yes, it's within ~20% of FSFS.  However, I bet
large BDB temporary transactions (such as we do for a commit) cause a
spike in the size that is never really recouped...  (BDB 4.2.52, FWIW.)
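
For reference, the relative sizes implied by the three figures above can
be worked out with a small awk sketch. pct_smaller is a hypothetical
helper name, not part of Subversion or Berkeley DB; the sizes are the GB
values listed in this message.

```shell
#!/bin/bash
# pct_smaller A B: print how much smaller A is than B, in percent.
pct_smaller() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", 100 * (b - a) / b }'
}

bdb_straight=6.3   # BDB on a straight load (GB)
fsfs_straight=3.5  # FSFS on a straight load (GB)
bdb_reloaded=4.7   # BDB after a db_dump/db_load cycle (GB)

echo "FSFS vs. BDB straight load:       $(pct_smaller "$fsfs_straight" "$bdb_straight")% smaller"
echo "FSFS vs. BDB after dump/load:     $(pct_smaller "$fsfs_straight" "$bdb_reloaded")% smaller"
echo "BDB dump/load vs. straight load:  $(pct_smaller "$bdb_reloaded" "$bdb_straight")% smaller"
```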

HTH.  -- justin

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Minimizing repository growth when large files change....

Posted by Ben Collins-Sussman <su...@collab.net>.
On Dec 28, 2004, at 3:27 PM, Peter Valdemar Mørch wrote:
>
> The fsfs repository uses 11% of the space the bdb repository does - for
> the exact same files! Hurray!
>
>                    bdb         fsfs
>                    raw files   raw files
> First commit:      124.56      15.69
> Second commit:     270.44      29.49
> Rep Growth:        145.88      13.79
> Rep Growth Ratio:  117%        88%
>
> Repository size vs.
> Sum of file sizes
> after 2nd commit:  188%        20.4%

This sounds really weird to me.  I mean, we're all aware that fsfs uses 
*some* less space than bdb... like 20% less, I thought, was the rule of 
thumb.

But 90% less space?  Is something really fishy going on here?  If the 
script below really reproduces this, should we investigate?


>
> Peter
>
> -- 
> Peter Valdemar Mørch
> http://www.morch.com
>
>
> --
> Script to reproduce:
>
> #!/bin/bash
>
> file1=f1
> file2=f2
> # file1=F1.gz
> # file2=F2.gz
>
> rm -rf rep dir/
> # svnadmin create --fs-type fsfs rep
> svnadmin create --fs-type bdb rep
> export r=file://`pwd`/rep
> svn mkdir -m "" $r/dir
> svn co $r/dir
>
> cp $file1 dir/file
>
> svn add dir/file
> svn ci -m "" dir
>
> svnadmin list-unused-dblogs rep/ | xargs rm -f
> echo
> echo "Repos size 1"
> calc.pl `du -s --block-size=1 rep | sed s/rep//` / 1024 / 1024
>
> svn ci -m "" dir
>
> cp $file2 dir/file
> svn ci -m "" dir
>
> svnadmin list-unused-dblogs rep/ | xargs rm -f
> echo
> echo "Repos size 2"
> calc.pl `du -s --block-size=1 rep | sed s/rep//` / 1024 / 1024
>




Re: Minimizing repository growth when large files change....

Posted by Peter Valdemar Mørch <pe...@morch.com>.
Peter Valdemar Mørch swp5jhu02-at-sneakemail.com |Lists| wrote:
> Hi there,
> 
> I'm trying to figure out how to store large ASCII (7-bit?) files in 
> subversion in the most space-efficient manner.
...
> 
>                    .gzs       raw files
> First commit:      11.25      125.81
> Second commit:     46.21      411.35
> Rep Growth:        34.96      285.54
> Rep Growth Ratio:  308%       227%
> 
> Repository size vs.
> Sum of file sizes
> after 2nd commit:  263%       286%
> 

OK, well, I discovered it myself... (Why did it have to happen *after* I
hit "Send"? :-D )

First off, I was using --fs-type bdb, so removing the unused log files with
svnadmin list-unused-dblogs $rep | xargs rm
shaved off a lot of the space.

Second, Subversion 1.1.x is now available on Debian, so I tried fsfs. And
*that* helped *a lot*!

The fsfs repository uses 11% of the space the bdb repository does - for
the exact same files! Hurray!

                    bdb         fsfs
                    raw files   raw files
First commit:      124.56      15.69
Second commit:     270.44      29.49
Rep Growth:        145.88      13.79
Rep Growth Ratio:  117%        88%

Repository size vs.
Sum of file sizes
after 2nd commit:  188%        20.4%

Peter

-- 
Peter Valdemar Mørch
http://www.morch.com


--
Script to reproduce:

#!/bin/bash

file1=f1
file2=f2
# file1=F1.gz
# file2=F2.gz

rm -rf rep dir/
# svnadmin create --fs-type fsfs rep
svnadmin create --fs-type bdb rep
export r=file://`pwd`/rep
svn mkdir -m "" $r/dir
svn co $r/dir

cp $file1 dir/file

svn add dir/file
svn ci -m "" dir

svnadmin list-unused-dblogs rep/ | xargs rm -f
echo
echo "Repos size 1"
calc.pl `du -s --block-size=1 rep | sed s/rep//` / 1024 / 1024

svn ci -m "" dir

cp $file2 dir/file
svn ci -m "" dir

svnadmin list-unused-dblogs rep/ | xargs rm -f
echo
echo "Repos size 2"
calc.pl `du -s --block-size=1 rep | sed s/rep//` / 1024 / 1024
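
calc.pl is a local helper that is not included in this message. As a
rough stand-in, the size report could be done with du and awk alone.
This is only a sketch: repo_size_mb is a hypothetical name, GNU du is
assumed (for --block-size), and the scratch directory is there just to
make the example self-contained.

```shell
#!/bin/bash
# repo_size_mb DIR: print the disk usage of DIR in MB, two decimals.
# Assumes GNU du; awk does the division the script hands to calc.pl.
repo_size_mb() {
    du -s --block-size=1 "$1" | awk '{ printf "%.2f\n", $1 / 1024 / 1024 }'
}

# Example run against a scratch directory holding one 3 MB file.
tmp=$(mktemp -d)
dd if=/dev/zero of="$tmp/blob" bs=1024 count=3072 2>/dev/null
echo "Scratch dir size (MB): $(repo_size_mb "$tmp")"
rm -rf "$tmp"
```

In the script above, each calc.pl line would then become
repo_size_mb rep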



