Posted to dev@subversion.apache.org by Greg Hudson <gh...@MIT.EDU> on 2005/02/10 18:08:52 UTC

Delta directions (was Re: delta combiner sloooooooowwwwww)

On Thu, 2005-02-10 at 10:12, Daniel Berlin wrote:
> The worst part is that, having stared at the delta combiner for quite a
> while myself, I'm actually not sure you can do anything about this
> either.  I believe that handling source ops from the second window (or
> target ops from the first) is just going to be O(n^2) because of what it
> *has* to do to make it work right.

Perhaps the long-term answer is to stop using random-access deltas, and
instead use a generic compression algorithm on the output of our delta
routine.

I'll explain that in greater detail for those who aren't Dan Berlin and
Branko.  Right now our deltas are allowed to use instructions like:

  Copy N bytes from location <x> in source
  Copy N bytes from location <x> in already-reconstructed target
  Copy N bytes from new data, and advance new data pointer

(There's no offset for a "new data" copy, because "copy from
already-reconstructed target" handles the case of wanting to copy
earlier bytes from the new data area.)  Compare this to gnu diff, which
is allowed to use instructions like:

  Copy N lines from source, and advance source pointer
  Skip N lines from source (advance source pointer w/o copying)
  Copy N lines from new data, and advance new data pointer

The difference is not just that gnu diff uses lines and we use bytes;
it's that we're allowed to use random access to the source and target,
while gnu diff is not.  Random access allows us to produce more
efficient deltas--for instance, it allows us to get some
self-compression if the delta source is the empty stream.  So if we were
to sacrifice random access, we'd want to compress the output of the
delta routine in order to get savings in the "new data" segment.  The
end result might be more or less space efficient than what we have now.
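
To make the contrast concrete, here is a toy sketch (not Subversion's
actual svndiff code; the op encoding is invented for illustration) of
applying the three random-access instructions described above:

```python
# Toy sketch of applying a random-access delta with the three instruction
# kinds described above.  Op encoding is hypothetical, not svndiff's.

def apply_random_access_delta(source: bytes, ops, new_data: bytes) -> bytes:
    """ops is a list of ("source"|"target"|"new", offset, length) tuples;
    "new" ops ignore the offset and consume new_data sequentially."""
    target = bytearray()
    new_pos = 0
    for kind, offset, length in ops:
        if kind == "source":
            # Random access into the source at an arbitrary offset.
            target += source[offset:offset + length]
        elif kind == "target":
            # Copy from the already-reconstructed target; copying byte by
            # byte lets an op overlap its own output (run-length expansion).
            for i in range(length):
                target.append(target[offset + i])
        else:  # "new"
            target += new_data[new_pos:new_pos + length]
            new_pos += length
    return bytes(target)

# Self-compression against the empty source: one new-data op plus an
# overlapping target copy reconstructs "abcabcabc" from just "abc".
ops = [("new", 0, 3), ("target", 0, 6)]
print(apply_random_access_delta(b"", ops, b"abc"))  # b'abcabcabc'
```

Note how the target-copy op gives self-compression even with an empty
source, which is exactly what a sequential format gives up.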

By sacrificing random access, we could get the following benefits:

  * We could get rid of delta windowing, which is a source of
complexity.  (Obviously not before 2.0, but I'm not sure if we can
realize any of these benefits before 2.0.)  Eliminating windowing also
means we could get much better results for some changes, like "insert
128K of data at the beginning of the file".

  * We could start using deltas for the blame algorithm, like CVS does,
although it would be a little trickier.  Skip-deltas might totally
negate this benefit, unfortunately.

  * Our delta combiner would have an easier time working efficiently.

Of course, if we eliminate random access, then we need a completely
different binary delta algorithm.  But it sounds like xdelta may be such
an algorithm and it can be implemented in a small amount of code, so
that may not be a problem.
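
For comparison, a sequential-only instruction set like gnu diff's can be
applied in a single streaming pass.  A toy sketch (op encoding again
invented for illustration):

```python
# Toy sketch of the sequential, gnu-diff-style instruction set: each op
# only advances a pointer, so no random access to source or target is
# ever needed.

def apply_sequential_delta(source_lines, ops, new_lines):
    """ops: list of ("copy"|"skip"|"new", n) tuples over lines."""
    out = []
    src = 0
    new = 0
    for kind, n in ops:
        if kind == "copy":    # copy n lines from source, advance
            out.extend(source_lines[src:src + n])
            src += n
        elif kind == "skip":  # advance source pointer without copying
            src += n
        else:                 # "new": emit n lines of new data, advance
            out.extend(new_lines[new:new + n])
            new += n
    return out

src = ["a", "b", "c", "d"]
ops = [("copy", 1), ("skip", 2), ("new", 1), ("copy", 1)]
print(apply_sequential_delta(src, ops, ["X"]))  # ['a', 'X', 'd']
```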


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Delta directions (was Re: delta combiner sloooooooowwwwww)

Posted by Daniel Berlin <db...@dberlin.org>.
On Thu, 2005-02-10 at 13:22 -0500, Greg Hudson wrote:
> I hadn't read part of the "make blame even faster" thread when I sent
> this.  Having read it, I have one addendum:
> 
> On Thu, 2005-02-10 at 13:08, Greg Hudson wrote:
> > (Obviously not before 2.0, but I'm not sure if we can
> > realize any of these benefits before 2.0.)
> 
> As people pointed out in that other thread, we can realize the "faster
> delta combiner" benefit before 2.0, and it may be worth changing to
> xdelta in 1.2 in order to do so.  My biggest concern is that FSFS
> repositories with very shallow file histories may become significantly
> larger, since we will lose all self-compression.  Perhaps we can
> continue using vdelta for self-compression in FSFS? 

You'd probably also want to do it for sending over the wire for
checkout, so it doesn't end up sending fulltexts over the wire.

>  I'm not sure how much of the speed win (if any) we'd be giving up by
> doing that.

> 
> At any rate, the more desirable end state is to ditch svndiff and switch
> to a format which doesn't even allow random access, and to have a
> compression layer on the output of the delta routine.  That would let us
> recapture most or all of the space efficiency and would let us use
> deltas in other ways.  But getting to that end state is much more
> difficult than simply changing our delta algorithm from vdelta to
> xdelta.
> 

Yes.
I was, of course, looking for some immediate gratification so I don't
have to wait a year or more to possibly move gcc over to Subversion
(assuming people will move :P).
I.e., make it fast enough for now, make it even better later.




Re: Delta directions (was Re: delta combiner sloooooooowwwwww)

Posted by Greg Hudson <gh...@MIT.EDU>.
On Sat, 2005-02-12 at 14:18, Branko Čibej wrote:
> Like I said on IRC, it's because I was wrong... the vdelta would be the 
> first in the combination chain, and it's full of target copies, so it 
> triggers the O(n^2) in the combiner from the start.

Well, we can get around this.  With Daniel's patch, an FSFS file history
looks like:

  xdelta --> xdelta --> xdelta --> vdelta --> empty

If we apply all the xdeltas to the vdelta, we lose.  But if we instead
apply the vdelta to the empty stream and combine the rest of the deltas,
now the history looks like:

  xdelta --> xdelta --> xdelta --> <constructed text>

and we win.

FSFS already has code to deal with the "delta --> delta --> delta -->
plain" case, although it (a) doesn't get used in practice, and (b) isn't
quite generic enough for this purpose.  But it should be pretty simple
to fix it up.  Then we can get the full speed benefit of the xdelta
changeover without sacrificing plaintext compression, and without a
schema change.
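
The shape of the idea can be sketched as follows (a toy model, not
FSFS's actual representation code; the delta encoding is invented):

```python
# Toy sketch: expand the trailing self-compressed delta into plain text
# first, then apply the remaining deltas onto that text, so the
# target-copy-heavy "vdelta" never enters the delta combiner.

def apply_delta(source: bytes, delta) -> bytes:
    """Toy deltas: ("source", offset, length) and ("new", data) ops."""
    out = bytearray()
    for op in delta:
        if op[0] == "source":
            _, offset, length = op
            out += source[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)

def reconstruct(chain):
    """chain lists deltas newest-first; the last is against the empty
    stream (the self-compressed one)."""
    # Expand the trailing delta against the empty stream first...
    text = apply_delta(b"", chain[-1])
    # ...then apply the rest, oldest to newest.
    for delta in reversed(chain[:-1]):
        text = apply_delta(text, delta)
    return text

chain = [
    [("source", 0, 5), ("new", b" v2")],  # newest "xdelta"
    [("source", 0, 5)],                   # middle "xdelta" (no change)
    [("new", b"hello")],                  # "vdelta" against empty stream
]
print(reconstruct(chain))  # b'hello v2'
```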

(I already presented this idea on IRC, but since I'm not going to
implement it myself in the near future, I thought I should note it on
the list.)




Re: Delta directions (was Re: delta combiner sloooooooowwwwww)

Posted by Branko Čibej <br...@xbc.nu>.
Daniel Berlin wrote:

>On Sat, 2005-02-12 at 03:45 +0100, Branko Čibej wrote:
>
>>Greg Hudson wrote:
>>
>>>I hadn't read part of the "make blame even faster" thread when I sent
>>>this.  Having read it, I have one addendum:
>>>
>>>On Thu, 2005-02-10 at 13:08, Greg Hudson wrote:
>>>
>>>>(Obviously not before 2.0, but I'm not sure if we can
>>>>realize any of these benefits before 2.0.)
>>>
>>>As people pointed out in that other thread, we can realize the "faster
>>>delta combiner" benefit before 2.0, and it may be worth changing to
>>>xdelta in 1.2 in order to do so.  My biggest concern is that FSFS
>>>repositories with very shallow file histories may become significantly
>>>larger, since we will lose all self-compression.  Perhaps we can
>>>continue using vdelta for self-compression in FSFS?  I'm not sure how
>>>much of the speed win (if any) we'd be giving up by doing that.
>>>
>>If we use vdelta strictly for self-compression of the first revision of 
>>a file in FSFS, it shouldn't be a problem. That delta would always be 
>>the last in any chain of combinations, but because the self-compressed 
>>vdelta doesn't contain source copy ops, the combiner should behave nicely.
>>
>
>I actually tried this.
>We still get a very large amount of copy_source_ops time according to
>oprofile (though a factor of 5 or 10 less than you do with vdelta every
>revision).
>
>This was with vdelta only used when source_len == 0.
>
>I tested this with two repos.  One where vdelta was enabled if source_len
>== 0, and one where vdelta was not enabled for storing to the repo, but
>was for compressing fulltexts that went over the wire (through a small
>hack to add an argument to compose_window, and disable vdelta for the
>tpush_* call to compose_window, which is only used by fsfs when storing
>to the repo).
>
>I verified that vdelta was happening only during over-the-wire transmits
>by aborting if it triggered vdelta, doing a repository load to make sure
>it didn't happen during that, and then making sure it *did* abort when I
>did a checkout or something that was going to send fulltexts.
>
>No clue why the combiner still behaves badly, but I can provide you with
>repos and patches to reproduce.
>
Like I said on IRC, it's because I was wrong... the vdelta would be the 
first in the combination chain, and it's full of target copies, so it 
triggers the O(n^2) in the combiner from the start.
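
A toy illustration of where the quadratic behavior comes from (not
Subversion's actual combiner; the op encoding is invented): resolving a
target copy means re-walking the earlier ops that produced the copied
range, so a window full of n target copies costs O(n^2) resolution work.

```python
# Toy sketch: resolving one target-copy op against the earlier ops in
# the same window.  Each resolution scans from op 0 again, so n target
# copies resolved this way cost O(n^2) overall.

def resolve_target_copy(ops, offset, length):
    """ops: list of (kind, output_length) pairs in window order.
    Returns (op_index, bytes_taken) pieces covering target bytes
    [offset, offset + length)."""
    pieces, pos = [], 0
    for i, (kind, n) in enumerate(ops):
        end = pos + n
        if end <= offset:            # output entirely before the range
            pos = end
            continue
        take = min(end, offset + length) - max(pos, offset)
        pieces.append((i, take))
        if end >= offset + length:   # range fully resolved
            break
        pos = end
    return pieces

# Three ops producing 3, 2, and 4 bytes; resolve target bytes 2..5:
print(resolve_target_copy([("new", 3), ("new", 2), ("new", 4)], 2, 4))
# [(0, 1), (1, 2), (2, 1)]
```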

-- Brane



Re: Delta directions (was Re: delta combiner sloooooooowwwwww)

Posted by Daniel Berlin <db...@dberlin.org>.
On Sat, 2005-02-12 at 03:45 +0100, Branko Čibej wrote:
> Greg Hudson wrote:
> 
> >I hadn't read part of the "make blame even faster" thread when I sent
> >this.  Having read it, I have one addendum:
> >
> >On Thu, 2005-02-10 at 13:08, Greg Hudson wrote:
> >
> >>(Obviously not before 2.0, but I'm not sure if we can
> >>realize any of these benefits before 2.0.)
> >
> >As people pointed out in that other thread, we can realize the "faster
> >delta combiner" benefit before 2.0, and it may be worth changing to
> >xdelta in 1.2 in order to do so.  My biggest concern is that FSFS
> >repositories with very shallow file histories may become significantly
> >larger, since we will lose all self-compression.  Perhaps we can
> >continue using vdelta for self-compression in FSFS?  I'm not sure how
> >much of the speed win (if any) we'd be giving up by doing that.
> >
> If we use vdelta strictly for self-compression of the first revision of 
> a file in FSFS, it shouldn't be a problem. That delta would always be 
> the last in any chain of combinations, but because the self-compressed 
> vdelta doesn't contain source copy ops, the combiner should behave nicely.

I actually tried this.
We still get a very large amount of copy_source_ops time according to
oprofile (though a factor of 5 or 10 less than you do with vdelta every
revision).

This was with vdelta only used when source_len == 0.

I tested this with two repos.  One where vdelta was enabled if source_len
== 0, and one where vdelta was not enabled for storing to the repo, but
was for compressing fulltexts that went over the wire (through a small
hack to add an argument to compose_window, and disable vdelta for the
tpush_* call to compose_window, which is only used by fsfs when storing
to the repo).

I verified that vdelta was happening only during over-the-wire transmits
by aborting if it triggered vdelta, doing a repository load to make sure
it didn't happen during that, and then making sure it *did* abort when I
did a checkout or something that was going to send fulltexts.

No clue why the combiner still behaves badly, but I can provide you with
repos and patches to reproduce.

--Dan





Re: Delta directions (was Re: delta combiner sloooooooowwwwww)

Posted by Branko Čibej <br...@xbc.nu>.
Greg Hudson wrote:

>I hadn't read part of the "make blame even faster" thread when I sent
>this.  Having read it, I have one addendum:
>
>On Thu, 2005-02-10 at 13:08, Greg Hudson wrote:
>
>>(Obviously not before 2.0, but I'm not sure if we can
>>realize any of these benefits before 2.0.)
>
>
>As people pointed out in that other thread, we can realize the "faster
>delta combiner" benefit before 2.0, and it may be worth changing to
>xdelta in 1.2 in order to do so.  My biggest concern is that FSFS
>repositories with very shallow file histories may become significantly
>larger, since we will lose all self-compression.  Perhaps we can
>continue using vdelta for self-compression in FSFS?  I'm not sure how
>much of the speed win (if any) we'd be giving up by doing that.
>
If we use vdelta strictly for self-compression of the first revision of 
a file in FSFS, it shouldn't be a problem. That delta would always be 
the last in any chain of combinations, but because the self-compressed 
vdelta doesn't contain source copy ops, the combiner should behave nicely.

-- Brane



Re: Delta directions (was Re: delta combiner sloooooooowwwwww)

Posted by Greg Hudson <gh...@MIT.EDU>.
I hadn't read part of the "make blame even faster" thread when I sent
this.  Having read it, I have one addendum:

On Thu, 2005-02-10 at 13:08, Greg Hudson wrote:
> (Obviously not before 2.0, but I'm not sure if we can
> realize any of these benefits before 2.0.)

As people pointed out in that other thread, we can realize the "faster
delta combiner" benefit before 2.0, and it may be worth changing to
xdelta in 1.2 in order to do so.  My biggest concern is that FSFS
repositories with very shallow file histories may become significantly
larger, since we will lose all self-compression.  Perhaps we can
continue using vdelta for self-compression in FSFS?  I'm not sure how
much of the speed win (if any) we'd be giving up by doing that.

At any rate, the more desirable end state is to ditch svndiff and switch
to a format which doesn't even allow random access, and to have a
compression layer on the output of the delta routine.  That would let us
recapture most or all of the space efficiency and would let us use
deltas in other ways.  But getting to that end state is much more
difficult than simply changing our delta algorithm from vdelta to
xdelta.



Re: Delta directions (was Re: delta combiner sloooooooowwwwww)

Posted by Branko Čibej <br...@xbc.nu>.
Greg Hudson wrote:

>By sacrificing random access, we could get the following benefits:
>
>  * We could get rid of delta windowing, which is a source of
>complexity.  (Obviously not before 2.0, but I'm not sure if we can
>realize any of these benefits before 2.0.)  Eliminating windowing also
>means we could get much better results for some changes, like "insert
>128K of data at the beginning of the file".
>
I'm not sure that's the case. Xdelta and diff both use windows because 
they'd otherwise have to index all the source and target data at once, 
and streaminess is still a requirement. The restriction might possibly 
make it easier to implement sliding windows, though.

-- Brane

