Posted to dev@subversion.apache.org by Daniel Patterson <da...@adaptiveinternational.com> on 2003/05/13 07:47:13 UTC

Best practice for long-term repository size management

Greetings,

  I've taken the approach of One Big Repository (OBR), in
  order to be able to happily copy code between projects
  and retain some history (and perhaps, be able to merge
  changes back).

  However, this has led me to think that I may have created
  a bit of a size beast.

  Currently, with around 10-15 people actively using the
  repository, it's growing at the rate of around 20M/day.
  The reason for this is mostly the kind of files being used
  (large Word docs are the main culprit), but I only expect
  this rate of growth to increase as more users come online.

  Without the ability to "prune" out old data, I fear an
  impending point where the database is too large to backup,
  and most of the data within it is unchanging historical
  data anyway.  The database also becomes difficult to manage
  (e.g., it takes significant time to copy) if it grows
  too large.

  I guess the tradeoff is between database size and amount
  of history retained, but some guidance as to the point at
  which you draw the line would be useful.

  My questions are:

    1) Should the OBR approach be discouraged, for the reason
       that it may grow to unmanageable volumes?
    2) Is there any reasonable way to archive and prune
       unused historical data?  (Perhaps "svn obliterate"
       will solve this)....

daniel


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Best practice for long-term repository size management

Posted by Branko Čibej <br...@xbc.nu>.
cmpilato@collab.net wrote:

>Daniel Patterson <da...@adaptiveinternational.com> writes:
>
>>On Wed, 2003-05-14 at 00:48, kfogel@collab.net wrote:
>>
>>>I just want to echo Francois' point.  Your problem is most likely the
>>>Berkeley log files, not the data itself.
>>>
>>>The BDB program is `db_archive', read about it in the Sleepycat docs.
>>
>>Yes, I'm already doing that.  Logfiles aren't really a concern (they're
>>being archived and stored offline); it's the "strings" database file
>>that's growing at the 20M/day rate.  Given that there is no way to
>>shrink that file, or give it an upper bound, what's the best approach to
>>manage its size?
>
>Switch to RTF instead of native Word docs?  At least you have a
>fighting chance of worthwhile deltification.  :-)
Nonsense. You won't get much better deltification from RTF than from
.doc, especially since RTF is much larger to begin with. I just ran a
test (vdelta-test -q, to be precise), comparing two versions of a Word
file between which most of the changes were images (the file contains a
_lot_ of bitmaps).

Here are the results:

    version 1 size: 1.55 MiB
    version 2 size: 1.6 MiB
    delta estimate: 440 kiB (17 windows)

Now, the same files saved as RTF:

    version 1 size: 36.4 MiB
    version 2 size: 40.7 MiB
    delta estimate:  3.0 MiB (417 windows)


3 megs vs. 440k is indeed a great improvement, don't you think? "Big is
beautiful", etc.
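The reason binary-heavy formats delta poorly can be illustrated even without
vdelta. Here is a toy sketch using Python's difflib (not Subversion's actual
delta code, so real numbers will differ):

```python
# Crude delta-size estimate: bytes of the new version that cannot be
# copied from the old one.  This is difflib, NOT vdelta/VCDIFF, so it
# only illustrates the idea, not Subversion's real numbers.
import difflib

def delta_estimate(old: bytes, new: bytes) -> int:
    sm = difflib.SequenceMatcher(None, old, new, autojunk=False)
    copied = sum(size for _, _, size in sm.get_matching_blocks())
    return len(new) - copied

v1 = b"Some paragraph of document text. " * 100   # stand-in for version 1
v2 = v1 + b"One small appended change."           # version 2: a tiny edit
print(delta_estimate(v1, v2))   # only the appended bytes count as new
```

A binary container full of bitmaps tends to shuffle bytes wholesale between
saves, so the "copied" portion collapses and the delta approaches the full
file size.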

Now, maybe my example wasn't very good, given that the files contain
many bitmaps. So I tried the same with a file that contains only one
bitmap, and the only changes between the two versions were review
comments. The file is also a lot smaller.

DOC:

    version 1 size: 170.5 kiB
    version 2 size: 236.0 kiB
    delta estimate:  43.9 kiB (3 windows)

RTF:

    version 1 size: 430.6 kiB
    version 2 size: 492.9 kiB
    delta estimate:  16.7 kiB (5 windows)


Indeed, that's a lot better. However, take into account that the HEAD
version is always stored in full, and you get 279k vs. 508k for
storing two versions of the file, and the break-even point is around
version 10.
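The break-even arithmetic above can be sketched in a toy model: the HEAD is
stored in full, every older version as a delta, and each delta is assumed to
stay the same size as the one measured (a simplification, of course):

```python
# Toy storage model: HEAD stored in full, each older version as a delta.
# Sizes in kiB are the measurements quoted above; assuming every delta
# is the same size is a simplification.
def cumulative_size(head_full, per_delta, n_versions):
    """Total cost of keeping n_versions: one full text plus n-1 deltas."""
    return head_full + per_delta * (n_versions - 1)

for n in (2, 5, 10, 11):
    doc = cumulative_size(236.0, 43.9, n)   # .doc: smaller text, bigger deltas
    rtf = cumulative_size(492.9, 16.7, n)   # .rtf: bigger text, smaller deltas
    print(f"{n:2d} versions: doc {doc:6.1f} kiB, rtf {rtf:6.1f} kiB")
```

With these numbers, RTF only becomes the cheaper option around the
tenth or eleventh stored version.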


-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/



Re: Best practice for long-term repository size management

Posted by Daniel Berlin <db...@dberlin.org>.
On Tuesday, May 13, 2003, at 10:06  PM, Daniel Patterson wrote:

> On Wed, 2003-05-14 at 11:54, cmpilato@collab.net wrote:
>> Switch to RTF instead of native Word docs?  At least you have a
>> fighting chance of worthwhile deltification.  :-)
>
> *sigh* that's kind of what I figured.  Has anyone investigated using
> xdelta/xdelta2 to do binary diffs (although I'm not sure that it'd help
> for most binary formats)....

It won't really help for any format.
Our encoding is the output of a vdelta algorithm into what is basically
a VCDIFF subset.

In fact, xdelta3 uses an xdelta-style algorithm that will do better 
(though I couldn't imagine more than maybe 10% better) but be slower, and 
it outputs a real VCDIFF-based encoding, which will be a bit more 
compact.

There is an issue I wrote to track svndiff version 1, which made 
up just about all of the difference by doing the VCDIFF-style address 
encoding and range-encoding compression of the strings data (it's been 
so long, I might not remember the exact details of what we are range 
encoding anymore).  The remaining VCDIFF encoding pieces aren't worth 
the cost for our purposes, and they would complicate our code considerably.

Algorithmically, you aren't likely to get more than 10% smaller diffs 
using the xdelta algorithm.  I remember running all kinds of tests 
against it and vdelta when working on svndiff 1.



Re: Best practice for long-term repository size management

Posted by Daniel Patterson <da...@adaptiveinternational.com>.
On Wed, 2003-05-14 at 11:54, cmpilato@collab.net wrote:
> Switch to RTF instead of native Word docs?  At least you have a
> fighting chance of worthwhile deltification.  :-)

*sigh* that's kind of what I figured.  Has anyone investigated using
xdelta/xdelta2 to do binary diffs (although I'm not sure that it'd help
for most binary formats)....

daniel



Re: Best practice for long-term repository size management

Posted by cm...@collab.net.
Daniel Patterson <da...@adaptiveinternational.com> writes:

> On Wed, 2003-05-14 at 00:48, kfogel@collab.net wrote:
> > I just want to echo Francois' point.  Your problem is most likely the
> > Berkeley log files, not the data itself.
> > 
> > The BDB program is `db_archive', read about it in the Sleepycat docs.
> 
> Yes, I'm already doing that.  Logfiles aren't really a concern (they're
> being archived and stored offline); it's the "strings" database file
> that's growing at the 20M/day rate.  Given that there is no way to
> shrink that file, or give it an upper bound, what's the best approach to
> manage its size?

Switch to RTF instead of native Word docs?  At least you have a
fighting chance of worthwhile deltification.  :-)


Re: Best practice for long-term repository size management

Posted by Daniel Patterson <da...@adaptiveinternational.com>.
On Wed, 2003-05-14 at 00:48, kfogel@collab.net wrote:
> I just want to echo Francois' point.  Your problem is most likely the
> Berkeley log files, not the data itself.
> 
> The BDB program is `db_archive', read about it in the Sleepycat docs.

Yes, I'm already doing that.  Logfiles aren't really a concern (they're
being archived and stored offline); it's the "strings" database file
that's growing at the 20M/day rate.  Given that there is no way to
shrink that file, or give it an upper bound, what's the best approach to
manage its size?

daniel



Re: Best practice for long-term repository size management

Posted by kf...@collab.net.
"Francois Beausoleil" <fb...@users.sourceforge.net> writes:
> Do you regularly remove the log files ?  There was a lot of discussion
> about the log files.  You can run a BDB program that will tell you which
> files can be removed, after you've backed them up.

I just want to echo Francois' point.  Your problem is most likely the
Berkeley log files, not the data itself.

The BDB program is `db_archive', read about it in the Sleepycat docs.
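In practice that looks something like the following (the repository path is
illustrative, and depending on your Berkeley DB install the binary may carry
a version prefix in its name):

```shell
# Run from the repository's Berkeley DB environment directory.
cd /path/to/repos/db

# Print the log files that are no longer needed by the environment:
db_archive

# Back those files up first, then let db_archive delete them in one step:
db_archive -d
```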

-K


Re: Best practice for long-term repository size management

Posted by Francois Beausoleil <fb...@users.sourceforge.net>.
Hello Daniel,

Do you regularly remove the log files?  There has been a lot of discussion
about the log files here.  You can run a BDB program that will tell you which
files can be removed, after you've backed them up.

Hope that helps,
Francois

On 13 May 2003 17:47:13 +1000, "Daniel Patterson"
<da...@adaptiveinternational.com> said:
> [...]
--
  Francois Beausoleil
  Developer of Java Gui Builder
  http://jgb.sourceforge.net/
