Posted to users@subversion.apache.org by Peter Valdemar Mørch <sw...@sneakemail.com> on 2004/12/28 20:28:14 UTC
Minimizing repository growth when large files change....
Hi there,
I'm trying to figure out how to store large ASCII (7-bit?) files in
subversion in the most space-efficient manner.
My test data is the Contents-i386.gz from a Debian distribution
http://ftp2.de.debian.org/debian/dists/sid/Contents-i386.gz
(that contains a list of what files are in what packages)
I want to track this file over time with subversion...
It started out as a question of whether to store Contents-i386.gz in svn
as-is, or to unpack it and store Contents-i386 instead. I expected the
unpacked file to be the better choice: the initial commit would be big,
but the diffs between versions would be small.
But the repository *explodes* in size when I try this...
I have the file in two versions. Both are about 8.8 MB .gz and 122 MB
raw. They contain about 1.6 million lines and about 1.5% = 25500 (17000
+ 8500 for each direction) of these lines change between the two
versions. A `diff` between the raw files is 2.1 MB.
But I found that the repository grew a whopping *285.54* MB!!! That's 12k
for each line of diff and 135 times the size of storing the diff output!
Or 1.5% lines changed =>
repository grew 227% (or 308% if I store the .gzs)
What's up?
Are there any good pointers on how to store these large text files and
track their changes? Should I store .gz or raw files?
If 25000 lines change every day, and that results in 300MB repository
growth, that quickly becomes unmanageable... (There are multiple such
files...)
Peter
----------
I tried creating a fresh repository containing first one version, then
the other for both the .gz and the raw files. Here are the repository
sizes in MB when this is done:
                      .gzs     raw files
First commit:         11.25    125.81
Second commit:        46.21    411.35
Rep Growth:           34.96    285.54
Rep Growth Ratio:     308%     227%

Repository size vs.
Sum of file sizes
after 2nd commit:     263%     286%
The two test files can be found here:
http://demo.capmon.dk/~pvm/svnLargeTextFiles/
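
The growth figures in the table can be rechecked with a few lines of shell. This is only a sketch: the function names growth_pct and bytes_per_changed_line are made up for illustration, and the inputs are the rounded sizes from the table, so results may differ a little from the posted percentages (with these inputs the .gz growth comes out nearer 311% than 308%).

```shell
#!/bin/bash
# Repository growth between two commits, as a percentage of the size
# after the first commit. Both arguments are sizes in MB.
growth_pct() {
    # $1 = size after first commit, $2 = size after second commit
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", (b - a) / a * 100 }'
}

# Average repository growth in bytes per changed line.
bytes_per_changed_line() {
    # $1 = growth in MB (1 MB = 1024 * 1024 bytes), $2 = changed lines
    awk -v mb="$1" -v n="$2" 'BEGIN { printf "%.0f\n", mb * 1024 * 1024 / n }'
}

growth_pct 125.81 411.35             # raw files
growth_pct 11.25 46.21               # .gz files
bytes_per_changed_line 285.54 25500  # the "12k per line of diff" figure
```

The arithmetic is done in awk because bash itself only handles integers.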
--
Peter Valdemar Mørch
http://www.morch.com
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@subversion.tigris.org
For additional commands, e-mail: users-help@subversion.tigris.org
Re: Minimizing repository growth when large files change....
Posted by Justin Erenkrantz <ju...@erenkrantz.com>.
--On Thursday, December 30, 2004 1:32 AM -0600 Ben Collins-Sussman
<su...@collab.net> wrote:
> This sounds really weird to me. I mean, we're all aware that fsfs uses
> *some* less space than bdb... like 20% less, I thought, was the rule of
> thumb.
>
> But 90% less space? Is something really fishy going on here? If the
> script below really reproduces this, should we investigate?
Well, BDB may not be as efficient. From their docs:
<http://www.sleepycat.com/docs/ref/am_misc/diskspace.html>
"Space freed by deleting key/data pairs from a Btree or Hash database is
never returned to the filesystem, although it is reused where possible.
This means that the Btree and Hash databases are grow-only. If enough keys
are deleted from a database that shrinking the underlying file is
desirable, you should create a new database and copy the records from the
old one into it."
Here's a data point with a certain repository with a dump w/~120k revisions:
BDB on a straight load: 6.3GB
FSFS on a straight load: 3.5GB
BDB after a db_dump/db_load cycle: 4.7GB
So, after a BDB dump/load, yes, it's within ~20% of FSFS. However, I bet
large BDB temporary transactions (such as we do for a commit) cause a
spike in the size, and that is never really recouped... (BDB 4.2.52, FWIW.)
HTH. -- justin
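
To compare the data point above with the "20% less" rule of thumb, the saving of FSFS relative to BDB can be computed directly. A minimal sketch, assuming "N% less space" means (bdb - fsfs) / bdb; the function name pct_less is hypothetical.

```shell
#!/bin/bash
# How much smaller (in percent) the first size is than the second.
pct_less() {
    # $1 = smaller size (FSFS), $2 = larger size (BDB), same unit
    awk -v s="$1" -v b="$2" 'BEGIN { printf "%.0f\n", (b - s) / b * 100 }'
}

pct_less 3.5 6.3   # FSFS vs. BDB on a straight load
pct_less 3.5 4.7   # FSFS vs. BDB after a db_dump/db_load cycle
```

With the sizes quoted above this gives roughly 44% before the dump/load cycle and roughly 26% after it.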
Re: Minimizing repository growth when large files change....
Posted by Ben Collins-Sussman <su...@collab.net>.
On Dec 28, 2004, at 3:27 PM, Peter Valdemar Mørch wrote:
>
> The fsfs repository uses 11% of the space the bdb repository does - for
> the exact same files! Hurray!
>
>                       bdb          fsfs
>                       raw files    raw files
> First commit:         124.56       15.69
> Second commit:        270.44       29.49
> Rep Growth:           145.88       13.79
> Rep Growth Ratio:     117%         88%
>
> Repository size vs.
> Sum of file sizes
> after 2nd commit:     188%         20.4%
This sounds really weird to me. I mean, we're all aware that fsfs uses
*some* less space than bdb... like 20% less, I thought, was the rule of
thumb.
But 90% less space? Is something really fishy going on here? If the
script below really reproduces this, should we investigate?
>
> Peter
>
> --
> Peter Valdemar Mørch
> http://www.morch.com
>
>
> --
> Script to reproduce:
>
> #!/bin/bash
>
> file1=f1
> file2=f2
> # file1=F1.gz
> # file2=F2.gz
>
> rm -rf rep dir/
> # svnadmin create --fs-type fsfs rep
> svnadmin create --fs-type bdb rep
> export r=file://`pwd`/rep
> svn mkdir -m "" $r/dir
> svn co $r/dir
>
> cp $file1 dir/file
>
> svn add dir/file
> svn ci -m "" dir
>
> svnadmin list-unused-dblogs rep/ | xargs rm -f
> echo
> echo "Repos size 1"
> calc.pl `du -s --block-size=1 rep | sed s/rep//` / 1024 / 1024
>
> svn ci -m "" dir
>
> cp $file2 dir/file
> svn ci -m "" dir
>
> svnadmin list-unused-dblogs rep/ | xargs rm -f
> echo
> echo "Repos size 2"
> calc.pl `du -s --block-size=1 rep | sed s/rep//` / 1024 / 1024
>
Re: Minimizing repository growth when large files change....
Posted by Peter Valdemar Mørch <pe...@morch.com>.
Peter Valdemar Mørch swp5jhu02-at-sneakemail.com |Lists| wrote:
> Hi there,
>
> I'm trying to figure out how to store large ASCII (7-bit?) files in
> subversion in the most space-efficient manner.
...
>
>                       .gzs     raw files
> First commit:         11.25    125.81
> Second commit:        46.21    411.35
> Rep Growth:           34.96    285.54
> Rep Growth Ratio:     308%     227%
>
> Repository size vs.
> Sum of file sizes
> after 2nd commit:     263%     286%
>
OK, well, I discovered it myself... (Why did it have to happen *after* I
hit "Send"? :-D )
First off, I was using --fs-type bdb. And so removing the logs with
svnadmin list-unused-dblogs $rep | xargs rm
shaved off a lot of the space.
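
A slightly more defensive form of that pipeline might look like the sketch below. It assumes svnadmin list-unused-dblogs prints one log file path per line; the helper name remove_listed_files is made up for illustration. With GNU xargs, a bare `xargs rm` still runs rm once on empty input, which then complains about a missing operand; the loop below simply does nothing in that case.

```shell
#!/bin/bash
# Delete every existing file named on stdin, one path per line.
# Empty input is fine, and paths containing spaces are handled.
remove_listed_files() {
    local path count=0
    while IFS= read -r path; do
        [ -n "$path" ] || continue          # skip blank lines
        if [ -e "$path" ] && rm -f -- "$path"; then
            count=$((count + 1))
        fi
    done
    echo "removed $count file(s)"
}

# Intended use (only meaningful for a BDB-backed repository):
#   svnadmin list-unused-dblogs rep | remove_listed_files
```

The usage line is left as a comment so the block can be sourced without a repository at hand.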
Second, now that 1.1.X is available on Debian, I tried fsfs. And
*that* helped *a lot*!
The fsfs repository uses 11% of the space the bdb repository does - for
the exact same files! Hurray!
                      bdb          fsfs
                      raw files    raw files
First commit:         124.56       15.69
Second commit:        270.44       29.49
Rep Growth:           145.88       13.79
Rep Growth Ratio:     117%         88%

Repository size vs.
Sum of file sizes
after 2nd commit:     188%         20.4%
Peter
--
Peter Valdemar Mørch
http://www.morch.com
--
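Script to reproduce:

#!/bin/bash

# Input files: two versions of the same large file.
file1=f1
file2=f2
# file1=F1.gz
# file2=F2.gz

# Start from a clean repository and working copy.
rm -rf rep dir/
# svnadmin create --fs-type fsfs rep
svnadmin create --fs-type bdb rep
export r="file://$(pwd)/rep"
svn mkdir -m "" "$r/dir"
svn co "$r/dir"

# First commit: add version 1 of the file.
cp "$file1" dir/file
svn add dir/file
svn ci -m "" dir

# Remove unused BDB log files before measuring.
svnadmin list-unused-dblogs rep/ | xargs rm -f
echo
echo "Repos size 1"
# calc.pl is a separate calculator helper (not shown here); the
# expression converts the byte count from du to MB.
calc.pl $(du -s --block-size=1 rep | sed s/rep//) / 1024 / 1024

svn ci -m "" dir   # nothing has changed yet, so this commits nothing

# Second commit: overwrite the file with version 2.
cp "$file2" dir/file
svn ci -m "" dir

svnadmin list-unused-dblogs rep/ | xargs rm -f
echo
echo "Repos size 2"
calc.pl $(du -s --block-size=1 rep | sed s/rep//) / 1024 / 1024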
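
Since calc.pl is not part of this thread, the size measurement can also be sketched with awk doing the division. Note two assumptions that differ from the script: the function name dir_size_mb is made up, and it uses `du -sk` (1024-byte units) rather than GNU du's `--block-size=1`, so the figure is derived from kilobytes instead of bytes.

```shell
#!/bin/bash
# Print the disk usage of a directory in MB (1 MB = 1024 * 1024 bytes),
# with two decimals. du -sk reports usage in 1024-byte units.
dir_size_mb() {
    du -sk "$1" | awk '{ printf "%.2f\n", $1 / 1024 }'
}

# Intended use in place of the calc.pl lines:
#   echo "Repos size 1"
#   dir_size_mb rep
```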