Posted to dev@subversion.apache.org by Daniel Patterson <da...@adaptiveinternational.com> on 2003/05/13 07:47:13 UTC
Best practise for long term repository size management
Greetings,
I've taken the approach of One Big Repository (OBR), in
order to be able to happily copy code between projects
and retain some history (and perhaps, be able to merge
changes back).
However, this has led me to think that I may have created
a bit of a size beast.
Currently, with around 10-15 people actively using the
repository, it's growing at the rate of around 20M/day.
The reason for this is mostly the kind of files being used
(large Word docs are the main culprit), but I only expect
this rate of growth to increase as more users come online.
Without the ability to "prune" out old data, I fear an
impending point where the database is too large to back up,
and most of the data within it is unchanging historical
data anyway. The database also becomes difficult to manage
(e.g. it takes significant time to copy) if it grows
too large.
I guess the tradeoff is between database size and amount
of history retained, but some guidance as to the point at
which you draw the line would be useful.
My questions are:
1) Should the OBR approach be discouraged, for the reason
that it may grow to unmanageable volumes?
2) Is there any reasonable way to archive and prune
unused historical data? (Perhaps "svn obliterate"
will solve this)....
daniel
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Re: Best practise for long term repository size management
Posted by Branko Čibej <br...@xbc.nu>.
cmpilato@collab.net wrote:
>Daniel Patterson <da...@adaptiveinternational.com> writes:
>
>>On Wed, 2003-05-14 at 00:48, kfogel@collab.net wrote:
>>
>>>I just want to echo Francois' point. Your problem is most likely the
>>>Berkeley log files, not the data itself.
>>>
>>>The BDB program is `db_archive'; read about it in the Sleepycat docs.
>>
>>Yes, I'm already doing that. Log files aren't really a concern (they're
>>being archived and stored offline); it's the "strings" database file
>>that's growing at the 20M/day rate. Given that there is no way to
>>shrink that file, or give it an upper bound, what's the best approach to
>>manage its size?
>
>Switch to RTF instead of native Word docs? At least you have a
>fighting chance of worthwhile deltification. :-)
Nonsense. You won't get much better deltification from RTF than from
.doc, especially since RTF is much larger to begin with. I just ran a
test (vdelta-test -q, to be precise), comparing two versions of a word
file, between which most of the changes were images (the file contains a
_lot_ of bitmaps).
Here are the results:
version 1 size: 1.55 MiB
version 2 size: 1.6 MiB
delta estimate: 440 kiB (17 windows)
Now, the same files saved as RTF:
version 1 size: 36.4 MiB
version 2 size: 40.7 MiB
delta estimate: 3.0 MiB (417 windows)
3 megs vs. 440k is indeed a great improvement, don't you think? "Big is
beautiful", etc.
Now, maybe my example wasn't very good, given that the files contain
many bitmaps. So I tried the same with a file that contains only one
bitmap, and the only changes between the two versions were review
comments. The file is also a lot smaller.
DOC:
version 1 size: 170.5 kiB
version 2 size: 236.0 kiB
delta estimate: 43.9 kiB (3 windows)
RTF:
version 1 size: 430.6 kiB
version 2 size: 492.9 kiB
delta estimate: 16.7 kiB (5 windows)
Indeed, that's a lot better. However, take into account the fact that
the HEAD version is always stored in full, and you get roughly 280k vs.
510k for storing two versions of the file, and the break-even point is
around version 10.
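The break-even arithmetic above can be checked with a short Python
sketch. It assumes, purely for illustration, that every later revision
produces a delta of the same size as the one measured and that the HEAD
size stays at the version 2 figure; the function and variable names are
made up for this example.

```python
# Sizes in kiB taken from the second test above: (HEAD size, delta size).
DOC = (236.0, 43.9)
RTF = (492.9, 16.7)


def stored_kib(head_kib, delta_kib, versions):
    """Space for `versions` revisions: HEAD in full plus one delta per older one.

    Assumption: every delta is as large as the single one measured.
    """
    return head_kib + delta_kib * (versions - 1)


def break_even_version(smaller_head, larger_head, limit=1000):
    """First revision count at which the larger-HEAD format needs no more space."""
    for n in range(1, limit):
        if stored_kib(*larger_head, n) <= stored_kib(*smaller_head, n):
            return n
    return None  # never catches up within `limit` revisions


if __name__ == "__main__":
    print("two versions, DOC:", round(stored_kib(*DOC, 2), 1), "kiB")
    print("two versions, RTF:", round(stored_kib(*RTF, 2), 1), "kiB")
    print("RTF catches up at version", break_even_version(DOC, RTF))
```

Under those assumptions RTF only catches up after ten or so revisions of
the same document, which matches the estimate above.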
--
Brane Čibej <br...@xbc.nu> http://www.xbc.nu/brane/
Re: Best practise for long term repository size management
Posted by Daniel Berlin <db...@dberlin.org>.
On Tuesday, May 13, 2003, at 10:06 PM, Daniel Patterson wrote:
> On Wed, 2003-05-14 at 11:54, cmpilato@collab.net wrote:
>> Switch to RTF instead of native Word docs? At least you have a
>> fighting chance of worthwhile deltification. :-)
>
> *sigh* that's kind of what I figured. Has anyone investigated using
> xdelta/xdelta2 to do binary diffs (although I'm not sure that it'd help
> for most binary formats)....
It won't really help for any formats.
Our encoding is a vdelta algorithm output into basically a VCDIFF
subset.
In fact, xdelta3 uses an xdelta-style algorithm that will do better
(though I can't imagine by more than maybe 10%), but it is slower. It
also outputs a real VCDIFF-based encoding, which will be a bit more
compact.
There is an issue tracking svndiff version 1, which I wrote, and which
made up just about all of the difference by doing VCDIFF-style address
encoding and range-encoding compression of the strings data (it's been
so long that I might not remember the exact details of what we are
range encoding anymore). The remaining VCDIFF encoding pieces aren't
worth the cost for our purposes, and they would complicate our code
considerably.
Algorithmically, you aren't likely to get more than 10% smaller diffs
using the xdelta algorithm. I remember running all kinds of tests
against it and vdelta when working on svndiff 1.
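For readers who haven't looked at this family of delta formats: they all
describe the new file as a mix of copies from the old file and literal
inserted data. The Python sketch below is a toy greedy encoder along
those lines. It is not vdelta, xdelta or svndiff, and the function names
and the 4-byte block size are arbitrary choices for illustration only.

```python
def make_delta(source: bytes, target: bytes, block: int = 4):
    """Toy delta: a list of ("copy", offset, length) and ("insert", data) ops."""
    # Index each block-sized chunk of the source by its first occurrence.
    index = {}
    for i in range(len(source) - block + 1):
        index.setdefault(source[i:i + block], i)

    ops = []
    pending = bytearray()  # literal bytes with no match in the source
    pos = 0
    while pos < len(target):
        start = index.get(target[pos:pos + block])
        if start is None:
            pending.append(target[pos])
            pos += 1
            continue
        # Extend the match for as long as source and target agree.
        length = block
        while (start + length < len(source)
               and pos + length < len(target)
               and source[start + length] == target[pos + length]):
            length += 1
        if pending:
            ops.append(("insert", bytes(pending)))
            pending.clear()
        ops.append(("copy", start, length))
        pos += length
    if pending:
        ops.append(("insert", bytes(pending)))
    return ops


def apply_delta(source: bytes, ops) -> bytes:
    """Rebuild the target from the source and the ops produced above."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, start, length = op
            out += source[start:start + length]
        else:
            out += op[1]
    return bytes(out)
```

Formats like Word's .doc tend to shuffle or re-encode their contents
between saves, so an encoder of this kind finds fewer long copies,
whichever variant of the matching algorithm is used.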
Re: Best practise for long term repository size management
Posted by Daniel Patterson <da...@adaptiveinternational.com>.
On Wed, 2003-05-14 at 11:54, cmpilato@collab.net wrote:
> Switch to RTF instead of native Word docs? At least you have a
> fighting chance of worthwhile deltification. :-)
*sigh* that's kind of what I figured. Has anyone investigated using
xdelta/xdelta2 to do binary diffs (although I'm not sure that it'd help
for most binary formats)....
daniel
Re: Best practise for long term repository size management
Posted by cm...@collab.net.
Daniel Patterson <da...@adaptiveinternational.com> writes:
> On Wed, 2003-05-14 at 00:48, kfogel@collab.net wrote:
> > I just want to echo Francois' point. Your problem is most likely the
> > Berkeley log files, not the data itself.
> >
> The BDB program is `db_archive'; read about it in the Sleepycat docs.
>
> Yes, I'm already doing that. Log files aren't really a concern (they're
> being archived and stored offline); it's the "strings" database file
> that's growing at the 20M/day rate. Given that there is no way to
> shrink that file, or give it an upper bound, what's the best approach to
> manage its size?
Switch to RTF instead of native Word docs? At least you have a
fighting chance of worthwhile deltification. :-)
Re: Best practise for long term repository size management
Posted by Daniel Patterson <da...@adaptiveinternational.com>.
On Wed, 2003-05-14 at 00:48, kfogel@collab.net wrote:
> I just want to echo Francois' point. Your problem is most likely the
> Berkeley log files, not the data itself.
>
> The BDB program is `db_archive'; read about it in the Sleepycat docs.
Yes, I'm already doing that. Log files aren't really a concern (they're
being archived and stored offline); it's the "strings" database file
that's growing at the 20M/day rate. Given that there is no way to
shrink that file, or give it an upper bound, what's the best approach to
manage its size?
daniel
Re: Best practise for long term repository size management
Posted by kf...@collab.net.
"Francois Beausoleil" <fb...@users.sourceforge.net> writes:
> Do you regularly remove the log files? There was a lot of discussion
> about the log files. You can run a BDB program that will tell you which
> files can be removed, after you've backed them up.
I just want to echo Francois' point. Your problem is most likely the
Berkeley log files, not the data itself.
The BDB program is `db_archive'; read about it in the Sleepycat docs.
-K
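As a rough illustration of the db_archive step, here is a small Python
wrapper. It assumes the repository's Berkeley DB environment lives in
the "db" subdirectory of the repository and that db_archive is on the
PATH; the repository path and function names are hypothetical. Run with
no other options, db_archive lists the log files that are no longer in
use; -a asks for absolute paths and -h names the environment directory.

```python
import subprocess
from pathlib import Path


def db_archive_command(repo_path):
    """Build the db_archive command line for a repository.

    Assumption: the BDB environment is the "db" directory inside the repository.
    """
    env_dir = Path(repo_path) / "db"
    return ["db_archive", "-a", "-h", str(env_dir)]


def unused_log_files(repo_path):
    """Return the log files db_archive reports as no longer in use."""
    result = subprocess.run(
        db_archive_command(repo_path),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    # Hypothetical repository location; back these files up before deleting them.
    for log_file in unused_log_files("/srv/svn/repo"):
        print(log_file)
```

The sketch only lists the files. Copying them to backup storage and then
removing them is left as a separate, deliberate step.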
Re: Best practise for long term repository size management
Posted by Francois Beausoleil <fb...@users.sourceforge.net>.
Hello Daniel,
Do you regularly remove the log files? There was a lot of discussion
about the log files. You can run a BDB program that will tell you which
files can be removed, after you've backed them up.
Hope that helps,
Francois
On 13 May 2003 17:47:13 +1000, "Daniel Patterson"
<da...@adaptiveinternational.com> said:
> Greetings,
>
> I've taken the approach of One Big Repository (OBR), in
> order to be able to happily copy code between projects
> and retain some history (and perhaps, be able to merge
> changes back).
>
> However, this has led me to think that I may have created
> a bit of a size beast.
>
> Currently, with around 10-15 people actively using the
> repository, it's growing at the rate of around 20M/day.
> The reason for this is mostly the kind of files being used
> (large Word docs are the main culprit), but I only expect
> this rate of growth to increase as more users come online.
>
> Without the ability to "prune" out old data, I fear an
> impending point where the database is too large to back up,
> and most of the data within it is unchanging historical
> data anyway. The database also becomes difficult to manage
> (e.g. it takes significant time to copy) if it grows
> too large.
>
> I guess the tradeoff is between database size and amount
> of history retained, but some guidance as to the point at
> which you draw the line would be useful.
>
> My questions are:
>
> 1) Should the OBR approach be discouraged, for the reason
> that it may grow to unmanageable volumes?
> 2) Is there any reasonable way to archive and prune
> unused historical data? (Perhaps "svn obliterate"
> will solve this)....
>
> daniel
--
Francois Beausoleil
Developer of Java Gui Builder
http://jgb.sourceforge.net/