Posted to dev@subversion.apache.org by Sander Striker <st...@apache.org> on 2002/05/17 16:33:28 UTC

Diff code overhaul, WAS: RE: Problem invoking diff3.

> From: Philip Martin [mailto:pm@localhost] On Behalf Of Philip Martin
> Sent: 16 May 2002 14:24

> Philip Martin <ph...@codematters.co.uk> writes:
> 
> Eek!  I don't know what went wrong, but the body of my last mail
> appears to have gone walkies :-(  Here is what I meant to say...

*grin*
 
> "Sander Striker" <st...@apache.org> writes:
> 
>> I will have time somewhere next week to continue work on it.  One
>> of the things I want to try is to implement the algorithm described
>> in "An O(NP) Sequence Comparison Algorithm" by Wu, Manber, Myers and
>> Miller.
> 
> I have some old C++ code that implements O(NP). I was porting this to
> Subversion when Sander's code appeared. He and I discussed our
> approaches and I decided to stop work on my code.  Sander contacted me
> yesterday, asking about the performance of O(NP). This prompted me to
> look out my old code and test it against GNU diff.  At first my code
> used a vtable-style, function-based approach to tokens, a bit like the
> one in svn_diff.h.  However, that proved to be a performance killer: the
> algorithm has a tight loop comparing tokens, and the function call
> overhead is significant.  Yesterday I stripped out the token
> interface, moving to an array interface (which is also more like GNU
> diff), and added some simple line hashing.  It now runs almost as fast
> as GNU diff --minimal; it's about 5% slower.
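
For anyone following along who hasn't read the paper, the core of the
O(NP) scheme looks roughly like the sketch below.  This is my own
rendering of the pseudocode in the paper, not Philip's code; it only
computes the edit distance (delta + 2P), and producing an actual edit
script needs path recording on top of it.

/* Sketch of the Wu/Manber/Myers/Miller O(NP) edit distance.
 * 'a' holds m tokens and 'b' holds n tokens, with n >= m (swap the
 * inputs if necessary).  Tokens are plain ints here; in practice they
 * would be line hashes.  Error handling is omitted. */
#include <stdlib.h>

#define MAX(x, y) ((x) > (y) ? (x) : (y))

static int
snake(int k, int y, const int *a, int m, const int *b, int n)
{
  int x = y - k;
  while (x < m && y < n && a[x] == b[y])
    {
      ++x;
      ++y;
    }
  return y;
}

int
onp_distance(const int *a, int m, const int *b, int n)
{
  int delta = n - m;
  int *fp_buf = malloc((m + n + 3) * sizeof(*fp_buf));
  int *fp = fp_buf + m + 1;        /* allow diagonals -(m+1) .. n+1 */
  int k, p = -1;

  for (k = -(m + 1); k <= n + 1; ++k)
    fp[k] = -1;

  do
    {
      ++p;
      for (k = -p; k <= delta - 1; ++k)
        fp[k] = snake(k, MAX(fp[k - 1] + 1, fp[k + 1]), a, m, b, n);
      for (k = delta + p; k >= delta + 1; --k)
        fp[k] = snake(k, MAX(fp[k - 1] + 1, fp[k + 1]), a, m, b, n);
      fp[delta] = snake(delta, MAX(fp[delta - 1] + 1, fp[delta + 1]),
                        a, m, b, n);
    }
  while (fp[delta] != n);

  free(fp_buf);
  return delta + 2 * p;            /* number of inserts + deletes */
}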

I want to see if my BV-HS implementation will benefit from some optimizations
I have in mind.
 
> While the token/vtable interface in svn_diff.h provides a generic diff
> capability, I think it may be a performance limitation. My original
> C++ code was template-based; the "function" interface was non-virtual
> and could be inlined. It will be interesting to see if the algorithm
> used by Sander's code is similarly affected by this interface.

It could benefit from a speedup by inlining, although I don't know how
much.  That's something I will try.  We might need an API change there,
although I most certainly hope we can keep it generic.
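
To make the cost concrete, this is roughly the difference (the types
below are only illustrative, not the actual svn_diff.h declarations):

/* Illustrative only -- not the real svn_diff.h interface. */

/* Generic, vtable style: one indirect call per token fetch and one
 * per comparison, which the compiler cannot inline. */
typedef struct diff_token_fns_t
{
  void *(*get_token)(void *baton, long idx);
  int (*token_compare)(void *baton, const void *t1, const void *t2);
} diff_token_fns_t;

/* Array-of-hashes style: the inner loop compares plain integers that
 * can stay in registers.  Equal hashes may still need a real string
 * compare to rule out collisions, but unequal hashes settle it
 * immediately. */
typedef struct diff_tokens_t
{
  unsigned long *hashes;        /* one hash per line */
  long count;
} diff_tokens_t;

#define TOKENS_EQUAL(a, i, b, j) ((a)->hashes[(i)] == (b)->hashes[(j)])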

> It may be that Subversion needs different algorithms depending on the
> size of the file. A minimal match algorithm for small/medium files to
> reduce the number of conflicts, and a non-minimal algorithm for large
> files to get better speed/memory performance.

Yes, that may be the case.  I received a book I ordered this week:
"Algorithms on strings, trees, and sequences" by Dan Gusfield
(ISBN 0-521-58519-8).

This is really enlightening me about what (should) work and what doesn't.  Both
our selected algorithms are suboptimal in speed, judging from what I
read.
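
If we do end up with two algorithms, the selection itself is trivial;
something like this would do (the cut-off and names are made up, not
measured or taken from our code):

/* Illustrative only; the threshold is a placeholder. */
#include <stddef.h>

typedef enum { DIFF_MINIMAL, DIFF_FAST } diff_strategy_t;

static diff_strategy_t
choose_diff_strategy(size_t bytes_a, size_t bytes_b)
{
  /* Small/medium files: pay for a minimal diff to reduce conflicts.
   * Large files: favour speed and memory over minimality. */
  const size_t large = 4 * 1024 * 1024;
  return (bytes_a > large || bytes_b > large) ? DIFF_FAST : DIFF_MINIMAL;
}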

I've spent a reasonable amount of time researching what we need for the
diff lib and would like to continue doing some more research for a little while.
Like I said, I have some time allocated for this next week.

In the meantime I'll commit a fix for a bug in the current codebase that
I stumbled over.  That way we still have something that works, albeit on
the slow side.

Sander

PS.  For the impatient and interested, these papers deserve some attention:
     
     "A sublinear algorithm for approximate keyword searching." by E.W. Meyers
     Algorithmica, 12:345-74, 1994

     "Algorithmic advances for searching biosequence databases" by E. Meyers
     Computational Methods in Genome Research, 121-35, 1994.


RE: Diff code overhaul, WAS: RE: Problem invoking diff3.

Posted by Sander Striker <st...@apache.org>.
> From: Philip Martin [mailto:pm@localhost] On Behalf Of Philip Martin
> Sent: 17 May 2002 20:59

> "Sander Striker" <st...@apache.org> writes:
>>> While the token/vtable interface in svn_diff.h provides a generic diff
>>> capability, I think it may be a performance limitation. My original
>>> C++ code was template-based; the "function" interface was non-virtual
>>> and could be inlined. It will be interesting to see if the algorithm
>>> used by Sander's code is similarly affected by this interface.
>> 
>> It could benefit from a speedup by inlining, although I don't know how
>> much.  That's something I will try.  We might need an API change there,
>> although I most certainly hope we can keep it generic.
> 
> The speedup obviously depends on how many token lookups and
> comparisons the algorithm requires.  It's important to remember just
> how many that can be for 'difficult' files: it's lots and lots.  The
> current interface requires a function call to get each token, and a
> function call to compare them.  My O(NP) implementation did something
> similar; removing the function calls more than doubled the speed.  The
> algorithm can't really cache the tokens; that more or less defeats the
> purpose of providing the function interface in the first place.

The BV-HS function is single pass, top-down, so no token is required more
than once.  With a small trick we can even do merging without rereading.
However, other algorithms (like O(NP)) need random access to both files.
It all comes down to which algorithm we end up using.
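
The difference in access pattern is roughly the following (a sketch with
made-up names, not our actual interfaces): a single-pass algorithm can
consume tokens off a stream and throw them away, while an O(NP)-style
algorithm has to keep every token, or at least its hash, addressable by
index.

/* Illustrative only; these are not our actual interfaces. */

/* Single pass: the algorithm pulls each token exactly once and never
 * asks for an earlier one again, so nothing needs to stay in memory. */
typedef int (*next_token_fn)(void *baton, unsigned long *hash, int *eof);

/* Random access: every token must first be read into an indexable
 * array, the way GNU diff and an O(NP) implementation work. */
typedef struct token_index_t
{
  unsigned long *hashes;   /* hashes[0 .. count-1], filled up front */
  long count;
} token_index_t;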
 
> I worry that we have an overly generic interface that will only be
> used for one specific problem.  Producing conventional diff output
> pretty much requires a line-based diff.  It is unlikely that the tokens
> will ever be anything other than a variable-length string of
> characters.  I guess if one is comparing digital images, say, then the
> tokens might be different.  However, that would use a totally
> different algorithm, so the ability to support such tokens isn't
> terribly useful.

In Subversion that is pretty much true.  For a visual diff/merge tool
I guess it holds as well.  If we gain much performance by moving to a
less generic API, we'll move to that.

>>> It may be that Subversion needs different algorithms depending on the
>>> size of the file. A minimal match algorithm for small/medium files to
>>> reduce the number of conflicts, and a non-minimal algorithm for large
>>> files to get better speed/memory performance.
>> 
>> Yes, that may be the case.  I received a book I ordered this week:
>> "Algorithms on strings, trees, and sequences" by Dan Gusfield
>> (ISBN 0-521-58519-8).
>> 
>> This is really enlightening me about what (should) work and what doesn't.  Both
>> our selected algorithms are suboptimal in speed, judging from what I
>> read.
>> 
>> I've spent a reasonable amount of time researching what we need for the
>> diff lib and would like to continue doing some more research for a little while.
>> Like I said, I have some time allocated for this next week.
> 
> Fine, I'm not trying to tread on your toes :)

Don't worry, you didn't :)

> The only reason I looked at my code again was because you asked me about
> performance.  I then only posted it when I realised that it was a) so close
> to GNU diff in performance

For the --minimal case.  I want to see if we can get their performance all
the time.  Not a showstopper though; if it works at a reasonable speed I'll
be satisfied for alpha.

> and b) only a few hundred lines to provide diff and merge.
>
> It's quite possible there are better algorithms; I'm happy that you
> are working on it and I hope you come up with a solution.

Me too! :)
 
> Philip

Sander


Re: Diff code overhaul, WAS: RE: Problem invoking diff3.

Posted by Philip Martin <ph...@codematters.co.uk>.
"Sander Striker" <st...@apache.org> writes:

> > While the token/vtable interface in svn_diff.h provides a generic diff
> > capability, I think it may be a performance limitation. My original
> > C++ code was template-based; the "function" interface was non-virtual
> > and could be inlined. It will be interesting to see if the algorithm
> > used by Sander's code is similarly affected by this interface.
> 
> It could benefit from a speedup by inlining, although I don't know how
> much.  That's something I will try.  We might need an API change there,
> although I most certainly hope we can keep it generic.

The speedup obviously depends on how many token lookups and
comparisons the algorithm requires.  It's important to remember just
how many that can be for 'difficult' files: it's lots and lots.  The
current interface requires a function call to get each token, and a
function call to compare them.  My O(NP) implementation did something
similar; removing the function calls more than doubled the speed.  The
algorithm can't really cache the tokens; that more or less defeats the
purpose of providing the function interface in the first place.

I worry that we have an overly generic interface that will only be
used for one specific problem.  Producing conventional diff output
pretty much requires a line-based diff.  It is unlikely that the tokens
will ever be anything other than a variable-length string of
characters.  I guess if one is comparing digital images, say, then the
tokens might be different.  However, that would use a totally
different algorithm, so the ability to support such tokens isn't
terribly useful.

> 
> > It may be that Subversion needs different algorithms depending on the
> > size of the file. A minimal match algorithm for small/medium files to
> > reduce the number of conflicts, and a non-minimal algorithm for large
> > files to get better speed/memory performance.
> 
> Yes, that may be the case.  I received a book I ordered this week:
> "Algorithms on strings, trees, and sequences" by Dan Gusfield
> (ISBN 0-521-58519-8).
> 
> This is really enlightening me about what (should) work and what doesn't.  Both
> our selected algorithms are suboptimal in speed, judging from what I
> read.
> 
> I've spent a reasonable amount of time researching what we need for the
> diff lib and would like to continue doing some more research for a little while.
> Like I said, I have some time allocated for this next week.

Fine, I'm not trying to tread on your toes :)  The only reason I looked
at my code again was because you asked me about performance.  I then
only posted it when I realised that it was a) so close to GNU diff in
performance and b) only a few hundred lines to provide diff and merge.
It's quite possible there are better algorithms; I'm happy that you
are working on it and I hope you come up with a solution.

-- 
Philip
