You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@subversion.apache.org by Johan Corveleyn <jc...@gmail.com> on 2010/10/07 23:44:28 UTC

[RFC] Diff (blame) optimization: how to go forward

Hi all,

This is a follow-up to the WIP-patches I posted last week [1], which
are about improving performance of svn_diff (and therefor also blame
on the client-side), especially for large files.

To summarize: the idea was (is) to strip off the identical prefix and
suffix, and then letting the existing diff algorithm work on the
remainder. As stated in [2], I ran into some difficulties to get the
exact same result of the existing diff algorithm (because of too eager
suffix stripping). I've spent some time searching for a "perfect"
solution, but I can't find one. So now I'm stuck and would like to
have some feedback on what to do with it.

First: the thing works, and it can reduce the time needed for svn_diff
by 20% up to 80% for large files (depending on the distribution of the
changes). An extreme example is a reduction from ~150ms to ~30ms for a
60000-lines file with a small number of lines changed (say up to 100)
located close together (so lots of identical prefix/suffix).

Blame gets similar benefits, *if the server is fast enough*. I used
svnserve built from Stefan^2's performance branch to test this. A
normal svnserve with FSFS isn't really fast enough (unless maybe with
an SSD, but I don't have one here) to serve up the 2500 revisions of
the large file quickly enough. But the performance-enhanced-svnserve,
with a hot cache filled with the necessary full-texts, is definitely
fast enough to make the time of blame completely dependent on the
client-side performance. Conclusion: the diff optimization we're
talking about is much more useful for blame together with a server
with the full-text membuffer caching of the performance branch.

Now, the reason why I'm stuck: suffix scanning sometimes strips a
little bit too much (see [2] for full explanation). In this case the
resulting diff is still correct, but it's not as nice for the user. As
I noted in [2], GNU diff also has this problem, but the user can help
a little bit by specifying a large enough number of
prefix/suffix-lines to keep (--horizon-lines).

So, what to do?
- I have not found a better solution than to use some form of
"horizon-lines" to get the "nice" result, but still have a similar
performance improvement (unless if I leave out suffix stripping
entirely, only strip prefix, but that halves the power of the
optimization).

- I've tested with keeping up to 50 suffix-lines. It didn't have the
slightest impact on the performance improvement (we're talking about
stripping away 1000's of lines). This fixed all real-life examples I
could find/test (diff/blame output is precisely identical to original
svn). If I kept only 20 lines, I still ran into some differences (30
lines was enough for my example file, but I took it up to 50 to be
sure).

- I would like to avoid leaving the decision to the user to come up
with an adequate number of suffix-lines-to-keep. So I'd like to just
hardcode some value, say 50, or even 100. I think this would give good
(=identical to original svn) diff/blame results in all but the most
extreme cases. With suffix-lines-to-keep=50, you'd need to insert a
block of text that has its last 50 lines identical to the 50 lines
preceding the insertion point, to mess up the diff result.

- If really necessary, we could say default=50, but it can be
overridden by the user with another -x option.

So the main question is: is such a change in behavior (unlikely to
have an impact in most realistic cases) acceptable for this kind of
performance improvement?

Other options:
- Come up with a clever solution to fix the optimization properly (but
I don't know if that's possible -> suggestions welcome :-)).
- Throw this optimization away, and take a look at something like
"Patience diff" (supposed to be fast as well, and give nice results).
However, I don't know Patience diff well enough to code up something,
or even to judge whether it would give good results in all (or most)
cases.
- Just throw it away and don't look back :-)
- Something else?

Cheers,
-- 
Johan

[1] http://svn.haxx.se/dev/archive-2010-10/0032.shtml
[2] http://svn.haxx.se/dev/archive-2010-10/0050.shtml

Re: [RFC] Diff (blame) optimization: how to go forward

Posted by Johan Corveleyn <jc...@gmail.com>.
On Fri, Oct 8, 2010 at 10:35 PM, Hyrum K. Wright
<hy...@mail.utexas.edu> wrote:
> On Fri, Oct 8, 2010 at 3:24 PM, Johan Corveleyn <jc...@gmail.com> wrote:
>> ...
>>
>> Just one more thing: as I mentioned in my rather long mail, blame
>> would benefit the most from my optimization if the server were fast
>> enough. So if anyone could spare some review cycles (ok, I know they
>> are scarce these days), reviewing Stefan^2's integrate-cache-membuffer
>> branch would really help my cause as well ...
>
> Yeah, just a question of tuits.  If you've reviewed the branch,
> posting your conclusions here would be a great contribution.  Speaking
> for myself, I feel you've got more experience in some areas of that
> branch than I do. :)
>
> (Apologies if you've already done so, and I've missed it.)

No, sorry, I haven't either. I have compiled and run it (well actually
the performance branch itself, not the integrate-cache-membuffer
branch). But I haven't looked at the source code.

Mmmm, tuits, I wish I had some spare ones lying around :-). Who knows,
maybe I'll find some ...

Cheers,
-- 
Johan

Re: [RFC] Diff (blame) optimization: how to go forward

Posted by "Hyrum K. Wright" <hy...@mail.utexas.edu>.
On Fri, Oct 8, 2010 at 3:24 PM, Johan Corveleyn <jc...@gmail.com> wrote:
> ...
>
> Just one more thing: as I mentioned in my rather long mail, blame
> would benefit the most from my optimization if the server were fast
> enough. So if anyone could spare some review cycles (ok, I know they
> are scarce these days), reviewing Stefan^2's integrate-cache-membuffer
> branch would really help my cause as well ...

Yeah, just a question of tuits.  If you've reviewed the branch,
posting your conclusions here would be a great contribution.  Speaking
for myself, I feel you've got more experience in some areas of that
branch than I do. :)

(Apologies if you've already done so, and I've missed it.)

-Hyrum

Re: [RFC] Diff (blame) optimization: how to go forward

Posted by Johan Corveleyn <jc...@gmail.com>.
On Fri, Oct 8, 2010 at 3:08 PM, Daniel Shahaf <d....@daniel.shahaf.name> wrote:
> Johan Corveleyn wrote on Fri, 8 Oct 2010 at 01:44 -0000:
>> With suffix-lines-to-keep=50, you'd need to insert a
>> block of text that has its last 50 lines identical to the 50 lines
>> preceding the insertion point, to mess up the diff result.
>>
>> - If really necessary, we could say default=50, but it can be
>> overridden by the user with another -x option.
>>
>> So the main question is: is such a change in behavior (unlikely to
>> have an impact in most realistic cases) acceptable for this kind of
>> performance improvement?
>
> Just to clarify: are you asking whether it's acceptable to have a "not
> nice" (but still semantically correct) diff output in case the user adds
> a block whose last 50 lines match the 50 lines prior to the first
> difference?
>
> And 'not nice' just means "the preceding, rather than trailing, instance
> of the repeated block would be grabbed by the /^+/ lines in the diff",
> right?

Yep, that's what I meant.

> In this case, I think I'm going to answer your long mail with just...
>
>    Yes.

Ok, great.

> or actually...
>
>    Yes, and let's move on to deeper problems (if any).

What deeper problems? :-)

Seriously: I saw this as the biggest problem in the scope of this
optimization. If the result is not good the entire optimization would
be worthless, so any other problem with my implementation would be
irrelevant.

Good to hear that it's not that big of a problem.

(I realize this optimization is just a small thing compared to more
important/fundamental stuff going on right now in subversion. But it's
important to me personally. And it's the most I can do with my
abilities/knowledge/time right now, so ...)

> (I'm not /particularly/ worried about a "different-than-expected, but
> still semantically correct, diff" --- though, yes, it's nice to output
> the "expected" diff if that's possible without too much effort and
> without breaking other use cases.)
>
> :-),
>
> Daniel
> (p.s. I don't have as much time to look into this as I'd like to...)

No problem. Your feedback is very much appreciated.

On Fri, Oct 8, 2010 at 8:46 PM, Hyrum K. Wright
<hy...@mail.utexas.edu> wrote:
> My not-having-looked-deeply-at-the-problem reaction is this: if we can
> introduce significant speed-ups in the common case, while still
> maintaining functionality, let's do it.  There may be corner cases
> where somebody gets an ugly diff, but in those corner cases there are
> likely to be *other* issues as play as well.  Let's not penalize
> everybody for the sake of a couple of corner cases.

Ok, perfect, we seem to have a consensus then :-). I'll cook up a
version of the patch which keeps 50 suffix lines.

Thanks for the feedback, Daniel and Hyrum.

Just one more thing: as I mentioned in my rather long mail, blame
would benefit the most from my optimization if the server were fast
enough. So if anyone could spare some review cycles (ok, I know they
are scarce these days), reviewing Stefan^2's integrate-cache-membuffer
branch would really help my cause as well ...

Cheers,
-- 
Johan

Re: [RFC] Diff (blame) optimization: how to go forward

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Johan Corveleyn wrote on Fri, 8 Oct 2010 at 01:44 -0000:
> With suffix-lines-to-keep=50, you'd need to insert a
> block of text that has its last 50 lines identical to the 50 lines
> preceding the insertion point, to mess up the diff result.
> 
> - If really necessary, we could say default=50, but it can be
> overridden by the user with another -x option.
> 
> So the main question is: is such a change in behavior (unlikely to
> have an impact in most realistic cases) acceptable for this kind of
> performance improvement?

Just to clarify: are you asking whether it's acceptable to have a "not
nice" (but still semantically correct) diff output in case the user adds
a block whose last 50 lines match the 50 lines prior to the first
difference?

And 'not nice' just means "the preceding, rather than trailing, instance
of the repeated block would be grabbed by the /^+/ lines in the diff",
right?

In this case, I think I'm going to answer your long mail with just...

    Yes.

or actually...

    Yes, and let's move on to deeper problems (if any).

(I'm not /particularly/ worried about a "different-than-expected, but
still semantically correct, diff" --- though, yes, it's nice to output
the "expected" diff if that's possible without too much effort and
without breaking other use cases.)

:-),

Daniel
(p.s. I don't have as much time to look into this as I'd like to...)

Re: [RFC] Diff (blame) optimization: how to go forward

Posted by "Hyrum K. Wright" <hy...@mail.utexas.edu>.
On Thu, Oct 7, 2010 at 6:44 PM, Johan Corveleyn <jc...@gmail.com> wrote:
> Hi all,
>
> This is a follow-up to the WIP-patches I posted last week [1], which
> are about improving performance of svn_diff (and therefor also blame
> on the client-side), especially for large files.
>
> To summarize: the idea was (is) to strip off the identical prefix and
> suffix, and then letting the existing diff algorithm work on the
> remainder. As stated in [2], I ran into some difficulties to get the
> exact same result of the existing diff algorithm (because of too eager
> suffix stripping). I've spent some time searching for a "perfect"
> solution, but I can't find one. So now I'm stuck and would like to
> have some feedback on what to do with it.
>
> First: the thing works, and it can reduce the time needed for svn_diff
> by 20% up to 80% for large files (depending on the distribution of the
> changes). An extreme example is a reduction from ~150ms to ~30ms for a
> 60000-lines file with a small number of lines changed (say up to 100)
> located close together (so lots of identical prefix/suffix).
>
> Blame gets similar benefits, *if the server is fast enough*. I used
> svnserve built from Stefan^2's performance branch to test this. A
> normal svnserve with FSFS isn't really fast enough (unless maybe with
> an SSD, but I don't have one here) to serve up the 2500 revisions of
> the large file quickly enough. But the performance-enhanced-svnserve,
> with a hot cache filled with the necessary full-texts, is definitely
> fast enough to make the time of blame completely dependent on the
> client-side performance. Conclusion: the diff optimization we're
> talking about is much more useful for blame together with a server
> with the full-text membuffer caching of the performance branch.
>
> Now, the reason why I'm stuck: suffix scanning sometimes strips a
> little bit too much (see [2] for full explanation). In this case the
> resulting diff is still correct, but it's not as nice for the user. As
> I noted in [2], GNU diff also has this problem, but the user can help
> a little bit by specifying a large enough number of
> prefix/suffix-lines to keep (--horizon-lines).
>
> So, what to do?
> - I have not found a better solution than to use some form of
> "horizon-lines" to get the "nice" result, but still have a similar
> performance improvement (unless if I leave out suffix stripping
> entirely, only strip prefix, but that halves the power of the
> optimization).
>
> - I've tested with keeping up to 50 suffix-lines. It didn't have the
> slightest impact on the performance improvement (we're talking about
> stripping away 1000's of lines). This fixed all real-life examples I
> could find/test (diff/blame output is precisely identical to original
> svn). If I kept only 20 lines, I still ran into some differences (30
> lines was enough for my example file, but I took it up to 50 to be
> sure).
>
> - I would like to avoid leaving the decision to the user to come up
> with an adequate number of suffix-lines-to-keep. So I'd like to just
> hardcode some value, say 50, or even 100. I think this would give good
> (=identical to original svn) diff/blame results in all but the most
> extreme cases. With suffix-lines-to-keep=50, you'd need to insert a
> block of text that has its last 50 lines identical to the 50 lines
> preceding the insertion point, to mess up the diff result.
>
> - If really necessary, we could say default=50, but it can be
> overridden by the user with another -x option.
>
> So the main question is: is such a change in behavior (unlikely to
> have an impact in most realistic cases) acceptable for this kind of
> performance improvement?
>
> Other options:
> - Come up with a clever solution to fix the optimization properly (but
> I don't know if that's possible -> suggestions welcome :-)).
> - Throw this optimization away, and take a look at something like
> "Patience diff" (supposed to be fast as well, and give nice results).
> However, I don't know Patience diff well enough to code up something,
> or even to judge whether it would give good results in all (or most)
> cases.
> - Just throw it away and don't look back :-)
> - Something else?

My not-having-looked-deeply-at-the-problem reaction is this: if we can
introduce significant speed-ups in the common case, while still
maintaining functionality, let's do it.  There may be corner cases
where somebody gets an ugly diff, but in those corner cases there are
likely to be *other* issues as play as well.  Let's not penalize
everybody for the sake of a couple of corner cases.

-Hyrum