Posted to users@subversion.apache.org by Justin Connell <ju...@propylon.com> on 2010/02/16 16:16:48 UTC

SVN Dump Question

I have a really huge repo that occupies 151 GB of space on the file 
system. Just to give some background: a lot of content gets added to 
and deleted from the repo, and we are now sitting at a revision number 
of over 1,500,000.

My question is: would it be possible to take a dump of just a specified 
path within the repo? For example, if my repo is located at /path/to/repo, 
could I run a dump such as /path/to/repo/specific/location/in/repo?

Thanks
Justin

Re: SVN Dump Question

Posted by B Smith-Mannschott <bs...@gmail.com>.
On Tue, Feb 16, 2010 at 18:39, Justin Connell
<ju...@propylon.com> wrote:

> The reason I'm asking such strange questions is that I have a very abnormal
> situation on my hands here. I previously took a full dump of the repo (for
> the reason you implied), where the original size of the repo on disk was 150
> GB, and the resulting dump file ended up at 46 GB. This was quite unexpected
> (with the smaller repos I have worked on, the dump is usually larger than
> the repository).

Such a large difference would make me suspicious, even if you're using
--deltas when generating the dump. A back-of-the-envelope calculation
says that FSFS is wasting some space to internal fragmentation, but I'd
estimate less than 10 GB, so that's not enough to explain this
discrepancy. (Estimate: about 5.5 GB spent storing revision properties,
assuming 4 KB file system blocks, 1.5M revisions, and circa 140 bytes
of actual revprop content per revision; another 2.8 GB wasted on the
rev files themselves, unless they've been packed, assuming the last
file system block of each file averages half full.)
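
The estimate can be reproduced as a quick shell calculation. All figures below are the assumptions stated in the paragraph above (1.5M revisions, 4 KB blocks, ~140 bytes of revprop content, half-full final blocks), not measured values:

```shell
# Reproduce the slack-space estimate with the assumed figures.
revs=1500000
block=4096
revprop_bytes=140

# Each db/revprops/<N> file wastes the unused tail of its one block.
revprop_waste=$(( revs * (block - revprop_bytes) ))

# Each db/revs/<N> file wastes half a block on average (unless packed).
rev_waste=$(( revs * block / 2 ))

echo "revprop slack: $(( revprop_waste / 1048576 )) MiB"  # 5659 MiB, ~5.5 GiB
echo "rev-file slack: $(( rev_waste / 1048576 )) MiB"     # 2929 MiB, ~2.9 GiB
```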

> Just as a sanity check, this is what I was trying to accomplish:
>
> Scenario - The repo needs to get trimmed down from 150 GB to a more
> maintainable size. We have a collection of users who access the repository
> as a document revision control system. Many files have been added and
> deleted over the past 2 years and all these transactions have caused such an
> astronomical growth in the physical size of the repo. My mission is to solve
> this issue, preferably using Subversion best practices. There are certain
> locations in the repo that do not have to retain version history and others
> that must retain their version history.
>
> Proposed solution -
>
> Take a full dump of the repo
> run the dump through svndumpfilter, including the paths that need
> version history preserved, into a single filtered.dump file
> export the top revision of the files that do not require version history to
> be preserved
> create a new repo and load the filtered dump
> import the content from the svn export to complete the process
>
> Is this a sane approach to solving the problem? And what about the size
> difference between the dump file and the original repo: am I losing
> revisions? (The dump shows all revision numbers being written to the dump
> file, and this looks correct.)

Any files that were created outside of the repository portion being
selected by dumpfilter, but later copied into said portion, will cause
you problems. I ran into that when trying to split up one of my larger
repositories. At the time, I gave up and moved on to more productive
endeavors. Looking back on it: perhaps I could have made a series of
incremental dumps, with the divisions at the points in history where
the problematic renames took place. I then could have filtered each of
the dumps separately and loaded them into a fresh repository. Sounds
fiddly.
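
The incremental-dump idea could be sketched roughly as follows. Revision 800000 is a made-up split point (supposedly just before a problematic copy), and trunk/keep is a placeholder path; both would need to be determined from the real history:

```shell
# Hypothetical: split the history at an assumed trouble spot, filter
# each piece separately, then load both into a fresh repository.
svnadmin dump /path/to/repo -r 0:800000 > part1.dump
svnadmin dump /path/to/repo -r 800001:1500000 --incremental > part2.dump

svndumpfilter include trunk/keep < part1.dump > part1.filtered
svndumpfilter include trunk/keep < part2.dump > part2.filtered

svnadmin create /path/to/new-repo
svnadmin load /path/to/new-repo < part1.filtered
svnadmin load /path/to/new-repo < part2.filtered
```

Note that splitting alone does not resolve copies whose source lies outside the included path; the dump around each boundary would still need manual attention, which is what makes this fiddly.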

> Another aspect could also be that there are unused log files occupying disk
> space (we are not using Berkeley DB though). Is this a valid assumption to
> make when using the FSFS configuration?

FSFS does not write logs.

> Thanks so much to all who have responded to this mail, and to all of you who
> take the time to read these messages
>
> Justin
>

Re: SVN Dump Question

Posted by Andrey Repin <an...@freemail.ru>.
Greetings, Justin Connell!

> The reason I'm asking such strange questions is that I have a very
> abnormal situation on my hands here. I previously took a full dump of 
> the repo (for the reason you implied), where the original size of the 
> repo on disk was 150 GB, and the resulting dump file ended up at 46 GB. 
> This was quite unexpected (with the smaller repos I have worked on, the 
> dump is usually larger than the repository).

Sounds odd.
Have you tried running cleanup procedures? If you're using autoversioning
through some strange and unnatural (to Subversion) tools, this could
potentially leave lots of stale transactions lying under the hood,
wasting space without affecting the repository in any visible way.
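
Stale transactions can be listed and removed with the standard svnadmin recipe (shown against the repository path from the thread):

```shell
# List uncommitted (possibly stale) transactions.
svnadmin lstxns /path/to/repo

# Remove them all. -r is a GNU xargs extension that skips the command
# entirely when the transaction list is empty.
svnadmin lstxns /path/to/repo | xargs -r svnadmin rmtxns /path/to/repo
```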

> Just as a sanity check, this is what I was trying to accomplish:

> Scenario - The repo needs to get trimmed down from 150 GB to a more 
> maintainable size. We have a collection of users who access the 
> repository as a document revision control system. Many files have been 
> added and deleted over the past 2 years and all these transactions have 
> caused such an astronomical growth in the physical size of the repo. My 
> mission is to solve this issue, preferably using Subversion best 
> practices. There are certain locations in the repo that do not have to 
> retain version history and others that must retain their version history.

> Proposed solution -

>    1. Take a full dump of the repo
>    2. Run the dump through svndumpfilter, including the paths that need
>       version history preserved, into a single filtered.dump file
>    3. Export the top revision of the files that do not require version
>       history to be preserved
>    4. Create a new repo and load the filtered dump
>    5. Import the content from the svn export to complete the process

> Is this a sane approach to solving the problem? And what about the size 
> difference between the dump file and the original repo: am I losing 
> revisions? (The dump shows all revision numbers being written to the dump 
> file, and this looks correct.)

> Another aspect could also be that there are unused log files occupying 
> disk space (we are not using Berkeley DB though). Is this a valid 
> assumption to make when using the FSFS configuration?

> Thanks so much to all who have responded to this mail, and to all of 
> you who take the time to read these messages

What you're describing sounds... fair.
But first, I would suggest some steps that involve more, say, native
operations on the repository.
Cleanup, hotcopy... See how they would affect it, and what the results
would be.
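
For what it's worth, those native operations might look like this (svnadmin pack needs Subversion 1.6 or later and an FSFS format that supports packing):

```shell
# Coalesce per-revision files into pack files; this directly attacks
# the per-file block slack, since packed rev files share blocks.
svnadmin pack /path/to/repo

# Make a clean copy of the repository and compare its size with
# the original.
svnadmin hotcopy /path/to/repo /path/to/repo-copy
du -sh /path/to/repo /path/to/repo-copy
```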


--
WBR,
 Andrey Repin (anrdaemon@freemail.ru) 18.02.2010, <2:56>

Sorry for my terrible english...

Re: SVN Dump Question

Posted by Justin Connell <ju...@propylon.com>.
Andrey Repin wrote:
> Greetings, Justin Connell!
>
>   
>>> I have a really huge repo that occupies 151 GB of space on the file system.
>>> Just to give some background, there is a lot of content that gets added and
>>> deleted from the repo, now we are sitting with a rev number of over 1,500,000.
>>>
>>> My question is, would it be possible to take a dump of just a specified
>>> path within the repo, for example if my repo is located at /path/to/repo ,
>>> could I run a dump such as /path/to/repo/specific/location/in/repo ?
>>>       
>
>   
>> No, but you can take a complete dump and pipe it through svndumpfilter to
>> extract just the part you want.
>>     
>
> To clarify, if it's unclear, you could attempt to directly pipe the dump to
> the filter on the fly, saving the disk space needed for intermediate storage.
> Although it's not an entirely failsafe process, I'm afraid.
>
>
> --
> WBR,
>  Andrey Repin (anrdaemon@freemail.ru) 16.02.2010, <20:01>
>
> Sorry for my terrible english...
>
>
>
>   
Thanks Andrey,
The reason I'm asking such strange questions is that I have a very 
abnormal situation on my hands here. I previously took a full dump of 
the repo (for the reason you implied), where the original size of the 
repo on disk was 150 GB, and the resulting dump file ended up at 46 GB. 
This was quite unexpected (with the smaller repos I have worked on, the 
dump is usually larger than the repository).

Just as a sanity check, this is what I was trying to accomplish:

Scenario - The repo needs to get trimmed down from 150 GB to a more 
maintainable size. We have a collection of users who access the 
repository as a document revision control system. Many files have been 
added and deleted over the past 2 years and all these transactions have 
caused such an astronomical growth in the physical size of the repo. My 
mission is to solve this issue, preferably using Subversion best 
practices. There are certain locations in the repo that do not have to 
retain version history and others that must retain their version history.

Proposed solution -

   1. Take a full dump of the repo
   2. Run the dump through svndumpfilter, including the paths that need
      version history preserved, into a single filtered.dump file
   3. Export the top revision of the files that do not require version
      history to be preserved
   4. Create a new repo and load the filtered dump
   5. Import the content from the svn export to complete the process

Is this a sane approach to solving the problem? And what about the size 
difference between the dump file and the original repo: am I losing 
revisions? (The dump shows all revision numbers being written to the dump 
file, and this looks correct.)
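
As a sketch, the five steps above could translate to something like the following. The paths docs/keep and docs/nohist are placeholders for the real locations that do and do not need history:

```shell
# 1. Full dump of the existing repository.
svnadmin dump /path/to/repo > full.dump

# 2. Keep only the paths whose history must survive.
svndumpfilter include docs/keep --drop-empty-revs --renumber-revs \
    < full.dump > filtered.dump

# 3. Export the HEAD revision of the history-free paths.
svn export file:///path/to/repo/docs/nohist /tmp/nohist

# 4. Create a new repository and load the filtered history.
svnadmin create /path/to/new-repo
svnadmin load /path/to/new-repo < filtered.dump

# 5. Import the exported content without its old history.
svn import /tmp/nohist file:///path/to/new-repo/docs/nohist \
    -m "Import current content without history"
```

The file:// URLs assume local access to the repository; over http or svn the URLs would differ.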

Another aspect could also be that there are unused log files occupying 
disk space (we are not using Berkeley DB though). Is this a valid 
assumption to make when using the FSFS configuration?

Thanks so much to all who have responded to this mail, and to all of you 
who take the time to read these messages

Justin

Re: SVN Dump Question

Posted by Andrey Repin <an...@freemail.ru>.
Greetings, Justin Connell!

>> I have a really huge repo that occupies 151 GB of space on the file system.
>> Just to give some background, there is a lot of content that gets added and
>> deleted from the repo, now we are sitting with a rev number of over 1,500,000.
>> 
>> My question is, would it be possible to take a dump of just a specified
>> path within the repo, for example if my repo is located at /path/to/repo ,
>> could I run a dump such as /path/to/repo/specific/location/in/repo ?

> No, but you can take a complete dump and pipe it through svndumpfilter to
> extract just the part you want.

To clarify, if it's unclear, you could attempt to directly pipe the dump to
the filter on the fly, saving the disk space needed for intermediate storage.
Although it's not an entirely failsafe process, I'm afraid.
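
Taken to its conclusion, the pipe could even feed svnadmin load directly, so that neither the full dump nor the filtered dump ever touches disk (a sketch; the included path is the one from the original question):

```shell
# End-to-end: dump, filter, and load in one pipeline with no
# intermediate dump files on disk.
svnadmin create /path/to/new-repo
svnadmin dump /path/to/repo \
    | svndumpfilter include specific/location/in/repo \
    | svnadmin load /path/to/new-repo
```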


--
WBR,
 Andrey Repin (anrdaemon@freemail.ru) 16.02.2010, <20:01>

Sorry for my terrible english...

Re: SVN Dump Question

Posted by Ryan Schmidt <su...@ryandesign.com>.
On Feb 16, 2010, at 10:16, Justin Connell wrote:

> I have a really huge repo that occupies 151 GB of space on the file system. Just to give some background, there is a lot of content that gets added and deleted from the repo, now we are sitting with a rev number of over 1,500,000.
> 
> My question is, would it be possible to take a dump of just a specified path within the repo, for example if my repo is located at /path/to/repo , could I run a dump such as /path/to/repo/specific/location/in/repo ?

No, but you can take a complete dump and pipe it through svndumpfilter to extract just the part you want.
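
A minimal sketch of that, using the path from the original question as a placeholder:

```shell
# Dump everything, but keep only the wanted path in the output stream.
svnadmin dump /path/to/repo \
    | svndumpfilter include specific/location/in/repo \
    > filtered.dump
```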