Posted to dev@subversion.apache.org by kf...@collab.net on 2004/06/18 17:18:45 UTC

Re: [Issue 1585] Deltified dumps for archival and schema conversion

termim@tigris.org writes:
> http://subversion.tigris.org/issues/show_bug.cgi?id=1585
> 
> ------- Additional comments from termim@tigris.org Fri Jun 18 10:21:09 -0700 2004 -------
> I completely agree with Greg. Current dump file format makes dump/load
> useless for large projects. For example I have a test project whose dump file
> is around 43GB. It takes _10 days_ to load this file and makes it almost
> impossible to switch from CVS to SVN for such a big project.

Note that loading this dumpfile probably won't be any faster with
compressed deltas.  (It might even be slightly slower, I don't know.)

Is it 10 days or the 43GB which makes it impossible to convert your
project (with cvs2svn, I presume, though you didn't say)?  The 43GB
shouldn't matter, as you can convert without having a full
intermediate dumpfile at any point.

-Karl

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [Issue 1585] Deltified dumps for archival and schema conversion

Posted by Branko Čibej <br...@xbc.nu>.

Mikhail Terekhov wrote:

> Branko Čibej wrote:
>
>> As for using deltas in the dumps -- "svnadmin load" won't simply 
>> store those back into the repository. It'll recreate the current 
>> fulltext and apply that. The deltas are for space, not time efficiency. 
>
> You lost me here. Do you mean that "svnadmin load" first recreates
> full text from delta and then does exactly the same as in the case
> with full text dump (i.e. calculate delta again)?

You must remember that a) a dumpfile can be loaded into an existing, 
non-empty repository, and b) the delta relationships in the repository 
are quite a bit more complicated than the ones in the dump file. Take a 
look at http://svn.collab.net/repos/svn-xml/trunk/notes/skip-deltas 
which describes how we currently use deltas in the repository. Obviously 
the dump file format must be independent of these details.

-- Brane


Re: [Issue 1585] Deltified dumps for archival and schema conversion

Posted by kf...@collab.net.

Greg Hudson <gh...@MIT.EDU> writes:
> I've thought about making a dump format variation which stores the
> actual deltas present in the database, and which can be loaded without
> recomputing all the deltas, but it didn't seem worth the complexity and
> architectural impurity.  And it wouldn't help with CVS conversion, since
> the deltas stored in CVS are still much too foreign.  (Different delta
> algorithm, different delta bases.)

Also, loading a dumpfile wants to verify checksums.  I don't think it
can do that if it doesn't reconstruct the fulltext at some point.
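
The checksum point can be sketched in a few lines of Python. This is only an illustration: it assumes the checksum recorded for a file is an MD5 of its fulltext, and the delta representation and the names apply_toy_delta and verify_fulltext are invented here, not Subversion APIs.

```python
import hashlib


def apply_toy_delta(base: bytes, delta: list) -> bytes:
    """Rebuild a fulltext from a base text and a toy delta.

    Each op is ("copy", start, end), meaning base[start:end], or
    ("data", payload), meaning literal new bytes. Hypothetical format.
    """
    parts = []
    for op in delta:
        if op[0] == "copy":
            parts.append(base[op[1]:op[2]])
        else:
            parts.append(op[1])
    return b"".join(parts)


def verify_fulltext(fulltext: bytes, expected_md5: str) -> bool:
    """Compare the MD5 of a reconstructed fulltext with the recorded one."""
    return hashlib.md5(fulltext).hexdigest() == expected_md5


base = b"hello world\n"
delta = [("copy", 0, 6), ("data", b"subversion\n")]
expected = hashlib.md5(b"hello subversion\n").hexdigest()

# The checksum describes the fulltext, so the delta must be applied first.
rebuilt = apply_toy_delta(base, delta)
checksum_ok = verify_fulltext(rebuilt, expected)
```

The check only succeeds on the rebuilt text; hashing the base or the delta itself says nothing about the recorded checksum.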


Re: [Issue 1585] Deltified dumps for archival and schema conversion

Posted by Greg Hudson <gh...@MIT.EDU>.

On Wed, 2004-06-23 at 13:26, Martin Tomes wrote:
> > This genericity makes dump/load useless
> > for large svn repositories unfortunately :(.
> 
> Not if you pipe the output from a dump to the input of a load, then you 
> don't need the intermediate storage.

He's worried about time, not space.  svnadmin dump --deltas is perfectly
adequate for space, but the amount of time it takes is still
proportional (at least) to the amount of fulltext represented in the
repository.

We don't anticipate requiring a dump and load cycle prior to 2.0, but we
may have to revisit this issue before 2.0 comes out.  It was one thing
to tell our 0.x users that they needed to perform a dump and load even
when repository size made that prohibitive; it will be another to tell
our 1.x users that.


Re: [Issue 1585] Deltified dumps for archival and schema conversion

Posted by Martin Tomes <li...@tomes.org>.

Mikhail Terekhov wrote:
> Greg Hudson wrote:
> 
>> Actually, it's more complicated than that.  See
>> http://svn.collab.net/repos/svn/trunk/notes/skip-deltas for the gory
>> details.
> This genericity makes dump/load useless
> for large svn repositories unfortunately :(.

Not if you pipe the output from a dump to the input of a load, then you 
don't need the intermediate storage.  For example, if you want to change 
over to the fs backend when 1.1 comes out this would be a good way to 
move the data from a bdb version to an fs version.
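
A minimal Python sketch of that pipe, assuming svnadmin is on the PATH and the destination repository has already been created with the desired backend. The function names and paths are made up for illustration; only the "svnadmin dump", "svnadmin load" and "--deltas" spellings come from this thread.

```python
import subprocess


def dump_cmd(repo: str, deltas: bool = False) -> list:
    """argv for dumping a repository to stdout."""
    cmd = ["svnadmin", "dump", repo]
    if deltas:
        cmd.append("--deltas")  # smaller stream, not a faster load
    return cmd


def load_cmd(repo: str) -> list:
    """argv for loading a dump stream from stdin."""
    return ["svnadmin", "load", repo]


def pipe_dump_to_load(src: str, dst: str) -> tuple:
    """Stream a dump of src straight into dst, with no intermediate file."""
    dump = subprocess.Popen(dump_cmd(src), stdout=subprocess.PIPE)
    load = subprocess.Popen(load_cmd(dst), stdin=dump.stdout)
    dump.stdout.close()  # so dump notices if load exits early
    load_rc = load.wait()
    dump_rc = dump.wait()
    return dump_rc, load_rc
```

Calling pipe_dump_to_load("/path/to/old", "/path/to/new") would return the two exit codes. It saves disk space only; the load side still does the same work.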

Still no help for archiving though.

-- 
Martin


Re: [Issue 1585] Deltified dumps for archival and schema conversion

Posted by Mikhail Terekhov <te...@emc.com>.

Greg Hudson wrote:

> Actually, it's more complicated than that.  See
> http://svn.collab.net/repos/svn/trunk/notes/skip-deltas for the gory
> details.
> 
> I've thought about making a dump format variation which stores the
> actual deltas present in the database, and which can be loaded without
> recomputing all the deltas, but it didn't seem worth the complexity and
> architectural impurity.  And it wouldn't help with CVS conversion, since
> the deltas stored in CVS are still much too foreign.  (Different delta
> algorithm, different delta bases.)

Thanks for the explanation. The real reason then for the dump file format
genericity (and the dump/load process slowness) is that different
backends have different db schema. This genericity makes dump/load useless
for large svn repositories unfortunately :(.

Mikhail



Re: [Issue 1585] Deltified dumps for archival and schema conversion

Posted by Greg Hudson <gh...@MIT.EDU>.

On Tue, 2004-06-22 at 19:22, Benjamin Pflugmann wrote:
> Yes, it will create the fulltext, put that into the repository and the
> repository (library) will calculate a delta again. But it sounds as if
> you think it is the same delta that was in the dumpfile. It isn't.

> Regardless of whether the delta algorithm is the same (I don't know
> without checking, but I don't see a reason for using two different
> ones here), there is a difference of what is diff'ed against what.

Correct so far (and it is the same algorithm), but...

> while the repository (BDB backend) stores deltas like this:
> 
>   (2,1) (3,2) (4,3) and so on (i.e. deltas against the younger revision)

Actually, it's more complicated than that.  See
http://svn.collab.net/repos/svn/trunk/notes/skip-deltas for the gory
details.
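
The notes file is the authority on the real layout. As a rough illustration of why a skip-delta arrangement keeps reconstruction cheap, here is a sketch of one such scheme, in which the delta base for version N of a file is found by clearing the lowest set bit of N. That rule is an assumption picked for the example; the layout a given backend actually uses may differ.

```python
def skip_delta_base(n: int) -> int:
    """Delta base for version n under the assumed rule: clear the lowest set bit."""
    if n <= 0:
        raise ValueError("version 0 is stored as fulltext and has no base")
    return n & (n - 1)


def reconstruction_chain(n: int) -> list:
    """Versions touched while rebuilding version n, ending at fulltext version 0."""
    chain = [n]
    while n > 0:
        n = skip_delta_base(n)
        chain.append(n)
    return chain


chain_54 = reconstruction_chain(54)
```

For version 54 the chain is 54, 52, 48, 32, 0: four deltas to apply instead of 54 with a simple linear chain. In this sketch the number of hops equals the number of 1 bits in N, so it grows logarithmically with the history length.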

I've thought about making a dump format variation which stores the
actual deltas present in the database, and which can be loaded without
recomputing all the deltas, but it didn't seem worth the complexity and
architectural impurity.  And it wouldn't help with CVS conversion, since
the deltas stored in CVS are still much too foreign.  (Different delta
algorithm, different delta bases.)


Re: [Issue 1585] Deltified dumps for archival and schema conversion

Posted by Benjamin Pflugmann <be...@pflugmann.de>.

On Mon 2004-06-21 at 15:07:02 -0400, Mikhail Terekhov wrote:
> >As for using deltas in the dumps -- "svnadmin load" won't simply store 
> >those back into the repository. It'll recreate the current fulltext 
> >and apply that. The deltas are for space, not time efficiency. 
> 
> You lost me here. Do you mean that "svnadmin load" first recreates
> full text from delta and then does exactly the same as in the case
> with full text dump (i.e. calculate delta again)?

Yes, it will create the fulltext, put that into the repository and the
repository (library) will calculate a delta again. But it sounds as if
you think it is the same delta that was in the dumpfile. It isn't.

Regardless of whether the delta algorithm is the same (I don't know
without checking, but I don't see a reason for using two different
ones here), there is a difference of what is diff'ed against what.

Say the notation (a,b) means a delta that transforms a file from
version a into b. Then an incremental dump contains deltas of the form

  (1,2) (2,3) (3,4) and so on (that is, deltas against the older revision)

while the repository (BDB backend) stores deltas like this:

  (2,1) (3,2) (4,3) and so on (i.e. deltas against the younger revision)
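
The difference in direction can be sketched in Python. This is a toy model only: the delta format is built with difflib rather than Subversion's own algorithm, it ignores skip-deltas, and the names make_delta, apply_delta and load_forward_dump are invented for the example.

```python
from difflib import SequenceMatcher


def make_delta(source: str, target: str) -> list:
    """Toy delta (source, target): ops that rebuild target from source."""
    ops = []
    matcher = SequenceMatcher(a=source, b=target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))           # reuse text from source
        elif tag in ("replace", "insert"):
            ops.append(("data", target[j1:j2]))    # literal new text
        # "delete" contributes nothing to the target
    return ops


def apply_delta(source: str, ops: list) -> str:
    """Apply a toy delta to its source text."""
    out = []
    for op in ops:
        out.append(source[op[1]:op[2]] if op[0] == "copy" else op[1])
    return "".join(out)


def load_forward_dump(first_fulltext: str, forward_deltas: list) -> tuple:
    """Consume (1,2) (2,3) ... and produce the head plus (2,1) (3,2) ..."""
    current = first_fulltext
    reverse_deltas = []
    for delta in forward_deltas:
        newer = apply_delta(current, delta)                 # rebuild fulltext
        reverse_deltas.append(make_delta(newer, current))   # recompute delta
        current = newer
    return current, reverse_deltas


revisions = ["a\n", "a\nb\n", "a\nB\nc\n", "a\nc\nd\n"]
forward = [make_delta(revisions[i], revisions[i + 1])
           for i in range(len(revisions) - 1)]
head, reverse = load_forward_dump(revisions[0], forward)
```

Every step of the loader has to hold the fulltext in hand before it can compute the delta in the other direction, which is why deltas in the dump save space but not time.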

Bye,

	Benjamin.


Re: [Issue 1585] Deltified dumps for archival and schema conversion

Posted by Mikhail Terekhov <te...@charter.net>.

Branko Čibej wrote:

> There's a huge misconception here. CVS uses context diffs (diff -e, I 
> believe). Subversion uses a block-copying, compressing binary delta 
> algorithm. The most efficient way to "convert" from one to the other 
> is simply to recreate the fulltext from CVS and calculate the delta in 
> SVN. I suppose we could convert directly, but the results wouldn't be 
> that good.

I'll rely on your experience here.

>
> As for using deltas in the dumps -- "svnadmin load" won't simply store 
> those back into the repository. It'll recreate the current fulltext 
> and apply that. The deltas are for space, not time efficiency. 

You lost me here. Do you mean that "svnadmin load" first recreates full
text from delta and then does exactly the same as in the case with full
text dump (i.e. calculate delta again)?

> Last but not least, you convert your CVS repository to Subversion 
> exactly once. If it takes 10 days, we can probably shave off some of 
> that time, but it's only once, after all. What you can do is start 
> using Subversion immediately, importing a snapshot of the current 
> state from CVS. Then convert the CVS repository, however long that 
> takes. Then do an incremental dump of the SVN repository (from just 
> after the import) and load that on top of the converted CVS 
> repository. In that way you don't have to stop work during repository 
> conversion, you just don't have access to history in SVN while the 
> conversion is going on (but _do_ have access in CVS, of course). 
> That's not too bad, is it?

That is a nice workaround! Thanks.

>
> -- Brane 


Mikhail



Re: [Issue 1585] Deltified dumps for archival and schema conversion

Posted by Branko Čibej <br...@xbc.nu>.

Mikhail Terekhov wrote:

> I thought that converting one kind of delta to another is much less
> expensive than calculating it anew, plus huge savings in space.

There's a huge misconception here. CVS uses context diffs (diff -e, I 
believe). Subversion uses a block-copying, compressing binary delta 
algorithm. The most efficient way to "convert" from one to the other is 
simply to recreate the fulltext from CVS and calculate the delta in SVN. 
I suppose we could convert directly, but the results wouldn't be that good.

As for using deltas in the dumps -- "svnadmin load" won't simply store 
those back into the repository. It'll recreate the current fulltext and 
apply that. The deltas are for space, not time efficiency.

Last but not least, you convert your CVS repository to Subversion 
exactly once. If it takes 10 days, we can probably shave off some of 
that time, but it's only once, after all. What you can do is start using 
Subversion immediately, importing a snapshot of the current state from 
CVS. Then convert the CVS repository, however long that takes. Then do 
an incremental dump of the SVN repository (from just after the import) 
and load that on top of the converted CVS repository. In that way you 
don't have to stop work during repository conversion, you just don't 
have access to history in SVN while the conversion is going on (but _do_ 
have access in CVS, of course). That's not too bad, is it?
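
The last two steps of this workaround might look like the following Python sketch. It only builds the command lines; the repository paths and the revision number of the snapshot import are placeholders, and the function names are invented for the example.

```python
def incremental_dump_cmd(interim_repo: str, import_rev: int) -> list:
    """argv that dumps everything committed after the snapshot import."""
    first = import_rev + 1  # "from just after the import"
    return ["svnadmin", "dump", interim_repo,
            "-r", f"{first}:HEAD", "--incremental"]


def load_cmd(converted_repo: str) -> list:
    """argv that loads a dump stream on top of the converted history."""
    return ["svnadmin", "load", converted_repo]


def graft_plan(interim_repo: str, converted_repo: str, import_rev: int) -> list:
    """Commands to run in order, the first piped into the second."""
    return [incremental_dump_cmd(interim_repo, import_rev),
            load_cmd(converted_repo)]


plan = graft_plan("/path/to/interim", "/path/to/converted", import_rev=1)
```

The loaded commits land in a different repository and get new revision numbers there, so working copies checked out from the interim repository would presumably have to be checked out again.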

-- Brane


Re: [Issue 1585] Deltified dumps for archival and schema conversion

Posted by Branko Čibej <br...@xbc.nu>.

kfogel@collab.net wrote:

>Branko Čibej <br...@xbc.nu> writes:
>>kfogel@collab.net wrote:
>>>  1. Convert CVS repository at time T to a dumpfile.
>>>  2. Start loading the dumpfile into an SVN repository.
>>>  3. While it's loading, continue working in CVS.
>>>  4. When the load is done, take all commits in CVS since time T, and
>>>     replay them into the new SVN repository.
>>>  5. From now on, all work happens in Subversion.
>>>
>>Your recipe is a lot more complicated than mine, _and_ it requires
>>some manual conversion from CVS. Shame on you. :-)
>
>Yes, but mine doesn't involve resetting working copies later :-).

Urgh. Yikes. Shame on me, then.

-- Brane



Re: [Issue 1585] Deltified dumps for archival and schema conversion

Posted by kf...@collab.net.

Branko Čibej <br...@xbc.nu> writes:
> kfogel@collab.net wrote:
> >   1. Convert CVS repository at time T to a dumpfile.
> >   2. Start loading the dumpfile into an SVN repository.
> >   3. While it's loading, continue working in CVS.
> >   4. When the load is done, take all commits in CVS since time T, and
> >      replay them into the new SVN repository.
> >   5. From now on, all work happens in Subversion.
> >
> Your recipe is a lot more complicated than mine, _and_ it requires
> some manual conversion from CVS. Shame on you. :-)

Yes, but mine doesn't involve resetting working copies later :-).


Re: [Issue 1585] Deltified dumps for archival and schema conversion

Posted by Branko Čibej <br...@xbc.nu>.

kfogel@collab.net wrote:

>   1. Convert CVS repository at time T to a dumpfile.
>   2. Start loading the dumpfile into an SVN repository.
>   3. While it's loading, continue working in CVS.
>   4. When the load is done, take all commits in CVS since time T, and
>      replay them into the new SVN repository.
>   5. From now on, all work happens in Subversion.

Your recipe is a lot more complicated than mine, _and_ it requires some 
manual conversion from CVS. Shame on you. :-)

-- Brane



Re: [Issue 1585] Deltified dumps for archival and schema conversion

Posted by Mikhail Terekhov <te...@charter.net>.

kfogel@collab.net wrote:

>that's not prohibitive.  It effectively just means you delay your
>group's switchover by that many days:
>
>   1. Convert CVS repository at time T to a dumpfile.
>   2. Start loading the dumpfile into an SVN repository.
>   3. While it's loading, continue working in CVS.
>   4. When the load is done, take all commits in CVS since time T, and
>      replay them into the new SVN repository.
>   5. From now on, all work happens in Subversion.

Thanks for the recipe.

>Obviously, there's some scripting to do here, and it would be nice if
>cvs2svn.py made this easier (by taking a date-range specifier, for

That would be very helpful indeed.

Mikhail



Re: [Issue 1585] Deltified dumps for archival and schema conversion

Posted by kf...@collab.net.

Mikhail Terekhov <te...@charter.net> writes:
> It is Dell Precision 650 - 2 cpu (2.66GHz), 1G memory, local IDE WD
> 250 GB drive, OS - SuSE-9.1 Linux.

I'm surprised, but unfortunately, I don't have enough hardware to try
this myself :-(.  Sorry.  (Speed of the drive might be relevant, but
given the rest of the system, I doubt you bought a slow drive! :-) )

However, even if the conversion takes 10 days (or 20, or whatever),
that's not prohibitive.  It effectively just means you delay your
group's switchover by that many days:

   1. Convert CVS repository at time T to a dumpfile.
   2. Start loading the dumpfile into an SVN repository.
   3. While it's loading, continue working in CVS.
   4. When the load is done, take all commits in CVS since time T, and
      replay them into the new SVN repository.
   5. From now on, all work happens in Subversion.

Obviously, there's some scripting to do here, and it would be nice if
cvs2svn.py made this easier (by taking a date-range specifier, for
example).  However, the main thing is that even long conversion times
needn't prevent switching to Subversion.
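
As a sketch of the scripting step 4 would need, here is one way to pick out the CVS commits made after time T and order them for replay. The Commit record and the sample data are hypothetical; nothing here reflects an existing cvs2svn.py option.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Commit(NamedTuple):
    """Hypothetical record of one CVS commit, gathered by other means."""
    when: datetime
    author: str
    message: str


def commits_to_replay(commits: list, cutoff: datetime) -> list:
    """Commits made strictly after the cutoff T, oldest first."""
    later = [c for c in commits if c.when > cutoff]
    return sorted(later, key=lambda c: c.when)


T = datetime(2004, 6, 18, tzinfo=timezone.utc)
log = [
    Commit(datetime(2004, 6, 25, tzinfo=timezone.utc), "alice", "fix build"),
    Commit(datetime(2004, 6, 10, tzinfo=timezone.utc), "bob", "old change"),
    Commit(datetime(2004, 6, 20, tzinfo=timezone.utc), "carol", "add test"),
]
pending = commits_to_replay(log, T)
```

Each entry in pending would then be committed to the new repository in order, however that replay is implemented.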

-Karl


Re: [Issue 1585] Deltified dumps for archival and schema conversion

Posted by Mikhail Terekhov <te...@charter.net>.

kfogel@collab.net wrote:

>10 days to load a dumpfile that only took 3 days to create?
>
>That's quite extraordinary.  What kind of machine, drive (network
>drive?), etc is this on?

It is Dell Precision 650 - 2 cpu (2.66GHz), 1G memory, local IDE WD
250 GB drive, OS - SuSE-9.1 Linux.



Re: [Issue 1585] Deltified dumps for archival and schema conversion

Posted by kf...@collab.net.

Mikhail Terekhov <te...@charter.net> writes:
> > But is that 10 days to load the dumpfile, or 10 days to convert the
> > project?  If the latter, I highly doubt 33% of total conversion time
> > is spent in Subversion's vdelta code -- much more likely it is spent
> > in cvs2svn.py.
> 
> It takes 3+ days to create the dumpfile and 10+ days to load it.
> CVS repository ~ 770M, 8+K files.

10 days to load a dumpfile that only took 3 days to create?

That's quite extraordinary.  What kind of machine, drive (network
drive?), etc is this on?


Re: [Issue 1585] Deltified dumps for archival and schema conversion

Posted by Mikhail Terekhov <te...@charter.net>.


kfogel@collab.net wrote:
> Mikhail Terekhov <te...@charter.net> writes:
> 
>>That would be very surprising!  The closer the dump format reflects
>>the internal DB structure, the faster dump/load operations should be,
>>IMHO. Storing file deltas instead of full file content should
>>eliminate applying deltas on dump and calculating them on load.
> 
> 
> I don't believe it works that way, unfortunately.

It solves another problem then, unfortunately.

> 
> 
>>>Is it 10 days or the 43GB which makes it impossible to convert your
>>>project (with cvs2svn, I presume, though you didn't say)?  The 43GB
>>>shouldn't matter, as you can convert without having a full
>>>intermediate dumpfile at any point.
>>>
>>
>>10 days of course! Profiling with oprofile shows that more than 33% of
>>this time is spent in libsvn_delta (vdelta), around 9% in reiserfs,
>>about 8.5% in libaprutil-0 (MD5Transform), etc. From this I suppose
>>that most of the time in cvs2svn conversion is spent on calculating
>>deltas which we already have from cvs.
> 
> 
> We don't really have them from CVS.  CVS has one kind of delta, we
> have another.

I thought that converting one kind of delta to another is much less
expensive than calculating it anew, plus huge savings in space.

> 
> But is that 10 days to load the dumpfile, or 10 days to convert the
> project?  If the latter, I highly doubt 33% of total conversion time
> is spent in Subversion's vdelta code -- much more likely it is spent
> in cvs2svn.py.

It takes 3+ days to create the dumpfile and 10+ days to load it.
CVS repository ~ 770M, 8+K files.



Re: [Issue 1585] Deltified dumps for archival and schema conversion

Posted by kf...@collab.net.

Mikhail Terekhov <te...@charter.net> writes:
> That would be very surprising!  The closer the dump format reflects
> the internal DB structure, the faster dump/load operations should be,
> IMHO. Storing file deltas instead of full file content should
> eliminate applying deltas on dump and calculating them on load.

I don't believe it works that way, unfortunately.

> >Is it 10 days or the 43GB which makes it impossible to convert your
> >project (with cvs2svn, I presume, though you didn't say)?  The 43GB
> >shouldn't matter, as you can convert without having a full
> >intermediate dumpfile at any point.
> >
> 10 days of course! Profiling with oprofile shows that more than 33% of
> this time is spent in libsvn_delta (vdelta), around 9% in reiserfs,
> about 8.5% in libaprutil-0 (MD5Transform), etc. From this I suppose
> that most of the time in cvs2svn conversion is spent on calculating
> deltas which we already have from cvs.

We don't really have them from CVS.  CVS has one kind of delta, we
have another.

But is that 10 days to load the dumpfile, or 10 days to convert the
project?  If the latter, I highly doubt 33% of total conversion time
is spent in Subversion's vdelta code -- much more likely it is spent
in cvs2svn.py.




Re: [Issue 1585] Deltified dumps for archival and schema conversion

Posted by Mikhail Terekhov <te...@charter.net>.


kfogel@collab.net wrote:

>termim@tigris.org writes:
>>http://subversion.tigris.org/issues/show_bug.cgi?id=1585
>>
>>------- Additional comments from termim@tigris.org Fri Jun 18 10:21:09 -0700 2004 -------
>>I completely agree with Greg. Current dump file format makes dump/load
>>useless for large projects. For example I have a test project whose dump file
>>is around 43GB. It takes _10 days_ to load this file and makes it almost
>>impossible to switch from CVS to SVN for such a big project.
>
>Note that loading this dumpfile probably won't be any faster with
>compressed deltas.  (It might even be slightly slower, I don't know.)
>
That would be very surprising!  The closer the dump format reflects the
internal DB structure, the faster dump/load operations should be, IMHO.
Storing file deltas instead of full file content should eliminate
applying deltas on dump and calculating them on load.

>Is it 10 days or the 43GB which makes it impossible to convert your
>project (with cvs2svn, I presume, though you didn't say)?  The 43GB
>shouldn't matter, as you can convert without having a full
>intermediate dumpfile at any point.
>
10 days of course! Profiling with oprofile shows that more than 33% of
this time is spent in libsvn_delta (vdelta), around 9% in reiserfs,
about 8.5% in libaprutil-0 (MD5Transform), etc. From this I suppose that
most of the time in cvs2svn conversion is spent on calculating deltas
which we already have from cvs.
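
Taking these figures at face value, the arithmetic can be sketched as below. This is a rough estimate only: it treats "more than 33%" as exactly 33% and "10 days" as exactly 10, and it assumes the profile shares apply uniformly over the whole load.

```python
# Figures reported above, treated as rough point estimates.
LOAD_DAYS = 10.0
PROFILE_SHARE = {
    "libsvn_delta (vdelta)": 0.33,
    "reiserfs": 0.09,
    "libaprutil-0 (MD5Transform)": 0.085,
}


def days_spent(total_days: float, shares: dict) -> dict:
    """Wall-clock days attributed to each profiled component."""
    return {name: total_days * share for name, share in shares.items()}


def days_if_removed(total_days: float, share: float) -> float:
    """Best-case total if one component's cost dropped to zero."""
    return total_days * (1.0 - share)


by_component = days_spent(LOAD_DAYS, PROFILE_SHARE)
best_case = days_if_removed(LOAD_DAYS, PROFILE_SHARE["libsvn_delta (vdelta)"])
```

On these numbers vdelta accounts for about 3.3 days, so removing that cost entirely would still leave a load of roughly 6.7 days.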

Mikhail