Posted to users@subversion.apache.org by Pete Gonzalez <pg...@bluel.com> on 2004/05/14 05:26:47 UTC

Coping with repository bloat

Our project relies on some data files and source files generated
by various tools with lots of dependencies.  It is important that
team members are able to check out the project and compile it
without a lot of extra setup steps and tool installations.
In order to accomplish this, our Subversion repository has to
store a lot of source code and binary data that changes frequently,
and the result is that our database is accumulating a lot of dead
wood.  (The Berkeley database is already approaching 1GB in size,
whereas the project itself is only around 50MB.)

When we were using SourceSafe, this problem was solved using a
feature called "store only latest version", which allowed the
revision tracking to be disabled for specific files.  Although
it was a little awkward to use, it enabled us to dramatically
reduce the database size by discarding these histories (which
were not needed anyway).

Does Subversion have something like this?  If not, it seems like
it would be pretty trivial to implement, since it basically
involves subtracting existing functionality rather than implementing
something new.  :-)

A related feature would be the ability to delete old revisions.
For example, we have some relatively ancient revision histories
in our repository, and it would be nice to issue a command such
as "Collapse all changes between Jun 1, 2003 and Jan 1, 2004
into a single revision."  This would not only reduce the database
footprint, it could also dramatically reduce the amount of time
you have to wait for operations such as the revision history
window in TortoiseSVN.  (I realize that it might be possible to
accomplish this with a 200-step manual procedure involving
"svnadmin dump," but maybe it could be automated?)

Is anyone else working on large projects?  A lot of the postings
on this list seem to be from people who are just setting up
Subversion or who have relatively small projects.

Thanks,
-Pete


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@subversion.tigris.org
For additional commands, e-mail: users-help@subversion.tigris.org

Re: Coping with repository bloat

Posted by Glenn Maynard <g_...@zewt.org>.

On Fri, May 14, 2004 at 08:20:52AM +0100, Branko Čibej wrote:
> Are you saying that CVS handles changing binaries better than SVN? 
> What's your definition of "better"?

I can remove old revisions, with "cvs admin -o".  It's not perfectly clean
(it can confuse cvs update if deleted revisions are still live somewhere),
but it's fast, doesn't require repository downtime, and solves the most
fundamental problem I have: it frees the disk space held by old data that I no
longer need (and no longer have room) to keep around.  Subversion just can't do
that.

> No, Subversion is exactly just version control. I never understand 
> what's wrong with putting the unversioned data on network filesystem.

Hmm.  I find it a little hard to explain, just because it's very obvious
to me.  I'll give it a shot.

(First, a disclaimer: I don't really want "no versioning".  Ideally, I'd
like to be able to limit versioning.  Being able to back out a couple
revisions, even for large binary files, could be useful.  Fundamentally,
I just don't want the repository being bloated by ancient data--if I want
to keep old stuff around indefinitely, I'll download it and burn it to a
CD, instead of having it take up finite server space forever.)

When a new user is checking out the tree for the first time, he has to jump
through some hoops to get an account created and configure SVN.  Once that's done,
the entire tree can be checked out with one command run from one client;
commits and updates also always happen with the same client.  Users only have
to learn how to use one interface, and only one server has to be maintained.

If a second method of transfer is introduced, everything becomes much more
complicated.  Separate trees have to be checked out with different clients;
data must be arranged or configured correctly to see each other; updates and
commits happen differently depending on what kind of data you're working with.
Separate server programs have to be configured and maintained, and there are
more points of failure.

That's essentially what happens with my case of using both CVS and SVN.  Each
user has to install both a CVS client and an SVN client (e.g. WinCVS and TSVN)
and has to know how to use both of them, and so on.  It's a huge pain.

(Substitute "CVS" with any "network filesystem".)

Pete Gonzalez seems to have a comparable situation; he may have other factors.

-- 
Glenn Maynard


Re: Coping with repository bloat

Posted by Scott Lawrence <sl...@pingtel.com>.

On Fri, 2004-05-14 at 13:59, Pete Gonzalez wrote:

> I think that probably the most general approach is the suggested
> idea for a tool that lets you fold together revision histories
> in the database, i.e. "combine all changes between Jun 1, 2003
> and Jan 1, 2004 into a single revision."  With this feature, you
> could just write a check-in hook script that e.g. automatically
> deletes everything before the most recent 3 revisions for ".bmp"
> files (or files with a particular property setting, etc.).
> 
> However, I'm guessing that this feature might lead to a lot of
> subtle complications for the current update/merging implementation.

I think it's much worse than that: a Subversion revision covers the
entire tree, not a single file.  So 'folding' the revision history of
some files and not others might well introduce subtle problems (in how
the users think, if nothing else).

I think that this whole discussion began with the fact that some parts
of the backend store become very large when large files that are not
efficiently diffed are stored.  It may be that what's needed is an
attribute that tells the backend that this is one of those files - just
store a single copy of each version outside the normal diff mechanism.
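
A rough sketch of that kind of decision, in Python. This is not how the
Subversion backend works; the function names, the fixed 8-byte instruction
cost, and the 0.5 threshold are all assumptions made up for illustration.
It estimates how large a copy/insert delta would be and falls back to
storing the full text when the delta saves too little.

```python
import difflib


def delta_cost(old: bytes, new: bytes) -> int:
    """Estimate the size in bytes of a copy/insert delta from old to new."""
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    cost = 0
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            cost += 8              # assumed fixed cost of one "copy" instruction
        else:
            cost += 8 + (j2 - j1)  # instruction header plus the literal new bytes
    return cost


def choose_storage(old: bytes, new: bytes, threshold: float = 0.5) -> str:
    """Return "delta" if the estimated delta is small enough, else "fulltext"."""
    if not new:
        return "fulltext"
    return "delta" if delta_cost(old, new) < threshold * len(new) else "fulltext"


# A text file with one changed line deltas well.
old_text = b"".join(b"line %d\n" % i for i in range(100))
new_text = old_text.replace(b"line 50\n", b"LINE 50\n")
print(choose_storage(old_text, new_text))

# Data with almost nothing in common with its predecessor does not.
old_blob = bytes(range(256))
new_blob = bytes(reversed(range(256)))
print(choose_storage(old_blob, new_blob))
```

A per-file attribute, as suggested above, would amount to skipping the
estimate and always taking the "fulltext" branch for the marked files.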

-- 
Scott Lawrence
Consulting Engineer
Pingtel Corp.   
sip:slawrence@pingtel.com
+1.781.938.5306 x162


Re: Coping with repository bloat

Posted by Matt Kunze <ku...@datasplice.com>.

Pete Gonzalez wrote:
> At 04:27 AM 5/14/2004, jpsa@jjdash.demon.co.uk wrote:
> 
>> > All right, providing *integrated* exclusive access to unversioned files
>> > is the best argument to date. Of course, Subversion hasn't a chance of
>> > doing this until it supports exclusive locks -- my pet project for 1.1 :-)
>>
>> Some of these problems sound as if they would be neatly solved by 
>> allowing svn:externals to refer to unversioned files?
> 
> 
> That's an interesting idea.  But it would be nice to have some
> version control features such as commits, authorship, deleting
> of deadwood, metadata, etc.

The idea I was kind of thinking of implementing was to have a separate 
'Builds' repository that contained the non-source binary files resulting 
from a particular build. Then svn:externals can fetch the relevant 
dependencies for a project.

Once this was set up, I was planning on periodically deleting this 
repository if it got too large. Since I don't care about the history 
this wouldn't lose any important information. Then the next nightly (or 
whenever) build would start the repository anew, just smaller than before.

It would be nice if this could be a little more automated, but by 
keeping the repositories separate it would at least be easier to keep 
the total storage requirements under control.
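
A minimal sketch of the svn:externals wiring for such a setup, in Python.
It only builds the "svn propset" command line and does not run it. The
working-copy path, the local directory name "builds", and the repository
URL are hypothetical placeholders, not values from this thread.

```python
from typing import List


def externals_propset_command(working_dir: str, local_dir: str, url: str) -> List[str]:
    """Build an `svn propset svn:externals` command (not executed here).

    Uses the "LOCAL_DIR URL" form of an externals definition.
    """
    definition = f"{local_dir} {url}"
    return ["svn", "propset", "svn:externals", definition, working_dir]


# Hypothetical values for illustration only.
command = externals_propset_command(".", "builds", "http://svn.example.com/builds/trunk")
print(command)
```

Once the property change is committed, an update of that directory also
fetches the external, so recreating the 'Builds' repository at the same
URL leaves the definition in the source repository untouched.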

-- 

.o0O0o.__.o0O0o.__.o0O0o.__.o0O0o.__.o0O0o.__.o0O0o.__.o0O0o.
| Matt Kunze                      Sometimes there's a point.|
| Build Master Fooly Fool         This is not one of those  |
| DataSplice Software Developer   times.                    |
=============================================================


Re: Coping with repository bloat

Posted by Pete Gonzalez <pg...@bluel.com>.

At 04:27 AM 5/14/2004, jpsa@jjdash.demon.co.uk wrote:
> > All right, providing *integrated* exclusive access to unversioned files
> > is the best argument to date. Of course, Subversion hasn't a chance of
> > doing this until it supports exclusive locks -- my pet project for 1.1 :-)
>
>Some of these problems sound as if they would be neatly solved by allowing 
>svn:externals to refer to unversioned files?

That's an interesting idea.  But it would be nice to have some
version control features such as commits, authorship, deleting
of deadwood, metadata, etc.

I think that probably the most general approach is the suggested
idea for a tool that lets you fold together revision histories
in the database, i.e. "combine all changes between Jun 1, 2003
and Jan 1, 2004 into a single revision."  With this feature, you
could just write a check-in hook script that e.g. automatically
deletes everything before the most recent 3 revisions for ".bmp"
files (or files with a particular property setting, etc.).
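
To make the "most recent 3 revisions" policy concrete, here is a small
Python sketch. It only computes which revisions such a hook would
consider prunable. Subversion has no command that could act on the
result, and the history dictionary is made-up sample data, not output
from any svn tool.

```python
from fnmatch import fnmatchcase
from typing import Dict, List


def prunable_revisions(
    history: Dict[str, List[int]], pattern: str = "*.bmp", keep: int = 3
) -> Dict[str, List[int]]:
    """Map each matching path to the revisions older than its newest `keep`."""
    result: Dict[str, List[int]] = {}
    for path, revisions in history.items():
        if not fnmatchcase(path, pattern):
            continue  # the policy applies only to matching files
        ordered = sorted(revisions)
        old = ordered[:-keep] if keep > 0 else ordered
        if old:
            result[path] = old
    return result


# Hypothetical per-file histories: path -> revisions in which it changed.
history = {
    "ui/logo.bmp": [2, 5, 9, 14, 20],
    "ui/icon.bmp": [7, 8],
    "src/main.c": [1, 2, 3, 4, 5],
}
print(prunable_revisions(history))
```

Note that the result is per file, while a Subversion revision number
names a state of the whole tree, which is where the complications start.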

However, I'm guessing that this feature might lead to a lot of
subtle complications for the current update/merging implementation.
But as Subversion's job description is to "manage revisions",
I think this would be a very apropos addition.  There is nothing
strange or outlandish about people wanting to manage histories,
and I'm guessing that the reason this has not received more attention
is simply that not many people have gotten to the "big repository"
stage yet.

At the very least, since CVS has this feature ("cvs admin -o"),
Subversion should have it under the principle that Subversion
is a replacement for CVS.  :-D

Cheers,
-Pete


Re: Coping with repository bloat

Posted by jp...@jjdash.demon.co.uk.

> All right, providing *integrated* exclusive access to unversioned files 
> is the best argument to date. Of course, Subversion hasn't a chance of 
> doing this until it supports exclusive locks -- my pet project for 1.1 :-)

Some of these problems sound as if they would be neatly solved by allowing svn:externals to refer to unversioned files?




Re: Coping with repository bloat

Posted by Branko Čibej <br...@xbc.nu>.

Pete Gonzalez wrote:

> At 12:20 AM 5/14/2004, you wrote:
>
>> No, Subversion is exactly just version control. I never understand
>> what's wrong with putting the unversioned data on network filesystem.
>
>
> Heh, you must be one of those "relatively small projects" guys
> I was alluding to.  ;-)

Oh be serious. Is 100 developers and 7Mloc small? Yes I've worked on 
such a project, and we did keep our unversioned bits on a well-known 
network share.

> Seriously though, suppose some of your GUI windows rely on lots
> of bitmap images, and that your math code depends on source files
> containing huge lookup tables generated by various tools that
> are complex to set up.  It's very desirable to have this stuff
> included automatically during a checkout.  Although nobody cares
> about "revision histories" for these files, they do present very
> similar version control problems as ordinary source code.
>
> You will no doubt propose that we simply copy these files to an FTP
> server and instruct the team to periodically download the data files
> to their source code directory, or maybe use rdist or somesuch.  But
> have you ever worked on a project with hundreds of data files being
> edited by different people?  It's a huge mess, because people are
> always overwriting each others' files, forgetting to delete deadwood,
> etc.  Since these files are already in the source code directory tree
> (and many of them are in fact source code), it actually seems quite
> strange to ask that they be managed by a separate filesystem.

All right, providing *integrated* exclusive access to unversioned files 
is the best argument to date. Of course, Subversion hasn't a chance of 
doing this until it supports exclusive locks -- my pet project for 1.1 :-)

>>> There also seems to be a strong (and valid) design concept of "never lose
>>> data", but removing data is not losing data.  My filesystem doesn't lose
>>> data, but that doesn't mean it doesn't support unlink().
>>
>> This comparison is tricky, because there are two kinds of "unlink"
>> in Subversion: "svn remove", which we have, and "svn obliterate",
>> which is on the wishlist. The latter would remove all traces of a file,
>> its data and history, from the repository.
>
> Correct -- this is exactly like implementing "rm -Rf *" and then
> arguing that the ability to delete individual files would be
> "overkill".  :-D

Oh I'm sure "svn obliterate" will accept the -r to restrict the range of 
revisions, if that's what you mean.

> Once again, you need to think from the perspective of BIG projects,
> where a single file's history might have thousands of revisions.
> Think about that... a file history window with a THOUSAND entries.
> This is exactly the case where complicated global dump/filter
> operations are completely infeasible, and simply regenerating a
> new repository with no history would be too extreme.  Subversion's
> job is to manage revisions, and I see nothing outlandish about
> people asking for the ability to manage revisions that occurred
> in the past.

I agree. Like I said, the functionality is on the wishlist, but it 
probably won't happen soon. I'd be pleasantly surprised if it 
happened during the 1.x cycle.

> I want to write a check-in hook that deletes the histories
> of .bmp files except for their most recent 3 revisions.  Is
> that crazy?

It's...interesting. But I see your point.

>   This seems like exactly the kind of problem that
> version control models are supposed to facilitate.

Yup. I'll just note that we've now moved away from the "unversioned" to 
the "versioned in a limited number of instances" world, and this is 
_way_ more complicated.

What I'd like to see is discussion about how unversioned files behave 
wrt branching and merging, for example; the edge cases are interesting.

-- Brane




Re: Coping with repository bloat

Posted by Andreas Mahel <an...@ch.ibm.com>.





Hi,

just to give you a scenario where it might be very convenient (still not
necessary) to have a "keep only the latest version" file within the svn
repository:

Suppose you are developing an application that relies on some license file
to work, and this license file has an expiration date. So after this date,
you'll have to use a "new" license file - even if you should choose to
checkout a revision which is dated before that deadline (because you might
not want to set back your system date just for that).

On the other hand, for developing purposes it might be very convenient to
have this license file in your repository, so you'll get a valid license
whenever you checkout.

I'm aware that this scenario does have its edges (and certainly
workarounds), but I hope that it can illustrate why people might wish for
such a behavior.

Just my 0.02 €

Andreas
_
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

  Andreas Mahel
  IBM Global Services
  Telephone: +41 79 460 11 80
  E-Mail: Andreas.Mahel@ch.ibm.com
  Timezone: Switzerland GMT+1
  Visit Unity Software Deployment at: http://www.unitysite.com
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++




                                                                           
Branko Čibej <br...@xbc.nu> wrote on 14.05.2004 09:20
(To: Glenn Maynard <g_...@zewt.org>, cc: users@subversion.tigris.org,
Subject: Re: Coping with repository bloat):



...

I _still_ don't see a valid reason why those can't be just ordinary
files sitting on the filesystem.

...

-- Brane




Re: Coping with repository bloat

Posted by Branko Čibej <br...@xbc.nu>.

Glenn Maynard wrote:

>On Thu, May 13, 2004 at 10:26:47PM -0700, Pete Gonzalez wrote:
>  
>
>>Does Subversion have something like this?  If not, it seems like
>>it would be pretty trivial to implement, since it basically
>>involves subtracting existing functionality rather than implementing
>>something new.  :-)
>>    
>>
>
>Nope.  I need the same thing, and the only solution I've found is to use
>CVS for the data tree and SVN for the source tree, which is a pain (but
>better than using CVS for everything).
>  
>
Are you saying that CVS handles changing binaries better than SVN? 
What's your definition of "better"?

>The general response to this seems to be "why are you putting data in
>version control if you don't want it versioned"; the answer is that svn
>isn't just version control--it's a repository.
>  
>
No, Subversion is exactly just version control. I never understand 
what's wrong with putting the unversioned data on network filesystem.

>There also seems to be a strong (and valid) design concept of "never lose
>data", but removing data is not losing data.  My filesystem doesn't lose
>data, but that doesn't mean it doesn't support unlink().
>  
>
This comparison is tricky, because there are two kinds of "unlink" in 
Subversion: "svn remove", which we have, and "svn obliterate", which is 
on the wishlist. The latter would remove all traces of a file, its data 
and history, from the repository.

>I don't think it would be trivial, though--nothing useful is ever trivial.
>For example, the protocol is based on sending diffs--if old revisions
>aren't kept, the server can't produce diffs, since it doesn't have the
>client's old revision.
>  
>
How do you suppose Subversion does the initial checkout? :-)
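
The point behind that question, sketched as a toy model in Python: a
server that normally sends a delta against the client's base can still
send the whole file when it has no usable base, which is what an initial
checkout needs anyway. This is not Subversion's actual protocol code;
the in-memory store, the function name, and the prefix-based "delta"
encoding are all assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple


def build_update(
    store: Dict[int, bytes], client_rev: Optional[int], target_rev: int
) -> Tuple[str, bytes]:
    """Return ("delta", payload) if the client's base is stored, else ("fulltext", payload)."""
    target = store[target_rev]
    base = store.get(client_rev) if client_rev is not None else None
    if base is None:
        # Initial checkout, or the client's revision was discarded:
        # there is nothing to diff against, so send everything.
        return ("fulltext", target)
    # Toy delta: length of the common prefix, then the differing tail.
    prefix = 0
    while prefix < min(len(base), len(target)) and base[prefix] == target[prefix]:
        prefix += 1
    return ("delta", prefix.to_bytes(4, "big") + target[prefix:])


# Hypothetical store that kept only revisions 3 and 4 of one file.
store = {3: b"hello world", 4: b"hello there"}
print(build_update(store, 3, 4))     # client base is available
print(build_update(store, None, 4))  # initial checkout
print(build_update(store, 1, 4))     # client base was discarded
```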


>>A related feature would be the ability to delete old revisions.
>>For example, we have some relatively ancient revision histories
>>in our repository, and it would be nice to issue a command such
>>as "Collapse all changes between Jun 1, 2003 and Jan 1, 2004
>>into a single revision."  This would not only reduce the database
>>    
>>
>
>This was my original question; I think being able to say "keep only the
>latest copy, don't version" is more useful and requires less babysitting,
>though.
>  
>
I _still_ don't see a valid reason why those can't be just ordinary 
files sitting on the filesystem.

>>footprint, it could also dramatically reduce the amount of time
>>you have to wait for operations such as the revision history
>>window in TortoiseSVN.  (I realize that it might be possible to
>>accomplish this with a 200-step manual procedure involving
>>"svnadmin dump," but maybe it could be automated?)
>>    
>>
>
>The "dump" procedure has problems: it requires taking the repository
>down, takes a long time, and requires that you have enough space to
>store at least one extra copy of the repository (and if you're doing
>this to free up space, you probably don't).  I think you can only delete
>entire ranges of revisions this way, too, not revisions for specific
>files, at least without doing special dump filtering.
>
Which is why we have svndumpfilter.
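
For reference, a sketch of the dump / filter / load pipeline, written in
Python so that it only assembles the command lines and runs nothing. The
repository paths, the revision range, and the excluded path prefix are
hypothetical. Note that svndumpfilter filters by path, so this drops
whole paths together with their history; it does not fold a range of
revisions into one.

```python
import shlex
from typing import List, Sequence


def dump_filter_load(
    src_repo: str,
    dst_repo: str,
    first_rev: int,
    last_rev: int,
    excluded_prefixes: Sequence[str],
) -> List[List[str]]:
    """Return the commands for: create new repo, then dump | filter | load."""
    create = ["svnadmin", "create", dst_repo]
    dump = ["svnadmin", "dump", src_repo, "-r", f"{first_rev}:{last_rev}"]
    exclude = ["svndumpfilter", "exclude", *excluded_prefixes]
    load = ["svnadmin", "load", dst_repo]
    return [create, dump, exclude, load]


def as_shell(commands: List[List[str]]) -> List[str]:
    """Render the create step and the three-stage pipeline as shell lines."""
    create, *stages = commands
    return [shlex.join(create), " | ".join(shlex.join(stage) for stage in stages)]


# Hypothetical paths and revision range, for illustration only.
for line in as_shell(
    dump_filter_load("/srv/svn/project", "/srv/svn/project-trimmed", 0, 500, ["trunk/art"])
):
    print(line)
```

The new repository still needs its own disk space while the old one
exists, so the space objection raised above applies to this route too.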

>  Also, it has the
>update problem: clients that have revisions that you deleted won't
>be able to update cleanly.
>
>I think there's an issue open for implementing a simpler way to wipe out
>revisions, but if I remember correctly, it's expected to still have most
>of the above problems.
>  
>
Yup. And others.

-- Brane




Re: Coping with repository bloat

Posted by Glenn Maynard <g_...@zewt.org>.

On Thu, May 13, 2004 at 10:26:47PM -0700, Pete Gonzalez wrote:
> Does Subversion have something like this?  If not, it seems like
> it would be pretty trivial to implement, since it basically
> involves subtracting existing functionality rather than implementing
> something new.  :-)

Nope.  I need the same thing, and the only solution I've found is to use
CVS for the data tree and SVN for the source tree, which is a pain (but
better than using CVS for everything).

The general response to this seems to be "why are you putting data in
version control if you don't want it versioned"; the answer is that svn
isn't just version control--it's a repository.

There also seems to be a strong (and valid) design concept of "never lose
data", but removing data is not losing data.  My filesystem doesn't lose
data, but that doesn't mean it doesn't support unlink().

I don't think it would be trivial, though--nothing useful is ever trivial.
For example, the protocol is based on sending diffs--if old revisions
aren't kept, the server can't produce diffs, since it doesn't have the
client's old revision.

> A related feature would be the ability to delete old revisions.
> For example, we have some relatively ancient revision histories
> in our repository, and it would be nice to issue a command such
> as "Collapse all changes between Jun 1, 2003 and Jan 1, 2004
> into a single revision."  This would not only reduce the database

This was my original question; I think being able to say "keep only the
latest copy, don't version" is more useful and requires less babysitting,
though.

> footprint, it could also dramatically reduce the amount of time
> you have to wait for operations such as the revision history
> window in TortoiseSVN.  (I realize that it might be possible to
> accomplish this with a 200-step manual procedure involving
> "svnadmin dump," but maybe it could be automated?)

The "dump" procedure has problems: it requires taking the repository
down, takes a long time, and requires that you have enough space to
store at least one extra copy of the repository (and if you're doing
this to free up space, you probably don't).  I think you can only delete
entire ranges of revisions this way, too, not revisions for specific
files, at least without doing special dump filtering.  Also, it has the
update problem: clients that have revisions that you deleted won't
be able to update cleanly.

I think there's an issue open for implementing a simpler way to wipe out
revisions, but if I remember correctly, it's expected to still have most
of the above problems.

> Is anyone else working on large projects?  A lot of the postings
> on this list seem to be from people who are just setting up
> Subversion or who have relatively small projects.

We deal with about a gig of data right now, and we only have a few gigs
on our server (server space is expensive).

-- 
Glenn Maynard


Re: Coping with repository bloat

Posted by Ulrich Eckhardt <ec...@satorlaser.com>.

Pete Gonzalez wrote:
> Our project relies on some data files and source files generated
> by various tools with lots of dependencies.  It is important that
> team members are able to check out the project and compile it
> without a lot of extra setup steps and tool installations.
> In order to accomplish this, our Subversion repository has to
> store a lot of source code and binary data that changes frequently,
> and the result is that our database is accumulating a lot of dead
> wood. 
>
> When we were using SourceSafe, this problem was solved using a
> feature called "store only latest version", which allowed the
> revision tracking to be disabled for specific files.  Although
> it was a little awkward to use, it enabled us to dramatically
> reduce the database size by discarding these histories (which
> were not needed anyway).

How about a second repository which the primary repository references? You 
could keep the versions of the last two weeks there, just to be sure, and svn 
automatically fetches possibly-to-be-generated files from there. It is 
accomplished using the svn:externals property (IIRC; it is documented).

Uli

