Posted to dev@subversion.apache.org by Stefan Sperling <st...@elego.de> on 2021/06/02 12:20:12 UTC

[PATCH] limit diff effort to fix performance issue

I've heard of several instances where users complain that 'svn update'
takes an extraordinary amount of time to run (on the order of hours).

At elego we have received files which reproduce such behaviour.
These files are XML files that are almost 100MB in size. While versioning
such data is not the primary use case for Subversion, this should not get
in the way of people getting work done. So I would like to get this fixed.

The root of the performance problem is located in Subversion's diff code.

The files contain on the order of 1,200,000 lines of XML text. I cannot
share the full files, unfortunately. Here's one diff chunk from a diff
between two such files, just to give you an idea about the type of content
which the diff algorithm is trying to split into chunks. Simply imagine that
such content goes on for more than a million lines and you have a mental
image of what the file generally looks like:

                   <MC-DATA-INSTANCE>                                                                                                 
                     <SHORT-NAME>CounterDbCounter</SHORT-NAME>     
                     <CATEGORY>ARRAY</CATEGORY>                    
-                    <ARRAY-SIZE>163</ARRAY-SIZE>                  
+                    <ARRAY-SIZE>193</ARRAY-SIZE>^M                
                     <SUB-ELEMENTS>                                
                       <MC-DATA-INSTANCE>                          
                         <SHORT-NAME>RamEccDoubleBitError_CounterDbCounter</SHORT-NAME>                                               

Generating a diff between two versions of such files with 'svn diff' takes
more than 40 minutes on my machine; I stopped it at that point. This is
obviously far too long, and exactly how long it would take to run to
completion hardly matters.

As a reference, Git manages to produce a diff in about 13 seconds:

$ time git diff 01.arxml 02.arxml > /dev/null
    0m13.39s real     0m13.04s user     0m00.36s system

Without my patch, the svn update/diff processes are spinning in the
following loop:

[[[
  do
    {
      /* For k < 0, insertions are free */
      for (k = (d < 0 ? d : 0) - p; k < 0; k++)
        {
          svn_diff__snake(fp + k, token_counts, &lcs_freelist, pool);
        }
      /* for k > 0, deletions are free */
      for (k = (d > 0 ? d : 0) + p; k >= 0; k--)
        {
          svn_diff__snake(fp + k, token_counts, &lcs_freelist, pool);
        }

      p++;
    }
  while (fp[0].position[1] != &sentinel_position[1]);
]]]

What this loop is trying to do is to find an optimal "diff snake",
a partial traversal of the Myers graph which builds up step-by-step
towards some optimal traversal of the entire graph.

In the Myers diff algorithm, the problem of finding an optimal graph
traversal doesn't have a single answer. There can be several valid
traversals, resulting in differences in the generated diff.
An extreme example is a traversal which produces a patch that first
removes all lines from file A, and then adds all lines from file B.
This isn't a useful diff chunk for humans to read, but it is a valid
diff and will work when applied with a patch program.
Finding this particular traversal is also trivial and very fast :-)
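
To illustrate with a tiny made-up example (not taken from the real files):
say file A contains the lines a, b, c and file B contains a, x, c. The
trivial traversal corresponds to

[[[
-a
-b
-c
+a
+x
+c
]]]

while a minimal traversal corresponds to the diff a human would expect:

[[[
 a
-b
+x
 c
]]]

Both are valid patches; only the second one is pleasant to read.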

The search for an optimal traversal which will result in a human-readable
diff can take time, and this is exactly where Subversion is spending too
much effort when diffing large XML files.

Other diff implementations limit the effort spent by aborting after a
certain number of search iterations, and then simply use the valid
traversal which was most recently discovered. The libxdiff code used by Git
does this, as does the diff implementation which Neels Hofmeyr wrote for
Game of Trees (see https://git.gameoftrees.org/gitweb/?p=diff.git;a=summary).

The patch below implements an effort limit for Subversion's LCS diff.
Diffing the offending XML files in a Subversion working copy becomes
fast enough to be usable:

$ time svn diff > /dev/null
    0m06.82s real     0m05.40s user     0m01.41s system

The resulting diff doesn't look as pretty as Git's diff output, but I don't
think this will matter in practice.

My only concern is that some corner-case diffs that are currently readable
could become less readable. If such cases are found, we could easily address
the problem by bumping the limit. Note: Any such diffs would still be *valid*.

The test suite is passing, which implies that trivial diffs aren't affected
by this change. I expect that most, if not all, diffs which people actually
want to read will remain unaffected. But I cannot tell with 100% certainty.

Before anyone asks, without evidence that it is really needed I don't think
adding a config option to set this limit is warranted. Other implementations
are also using hard-coded limits. We could add an environment variable to
allow for experimentation, perhaps. In any case, I would prefer to make
such tweaks in follow-up patches and keep this initial fix simple so it
can be backported easily.
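
If we did go the environment variable route, something along these lines
would probably suffice (the variable name and helper below are made up
purely for illustration; this is not part of the patch):

[[[
#include <stdlib.h>

/* Hypothetical helper: allow overriding the hard-coded effort limit
 * through an environment variable, for experimentation only. */
static int
get_max_effort(void)
{
  const char *val = getenv("SVN_DIFF_MAX_EFFORT"); /* made-up name */
  int limit = 1024; /* same default as in the patch below */

  if (val != NULL && atoi(val) > 0)
    limit = atoi(val);

  return limit;
}
]]]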

[[[
* subversion/libsvn_diff/lcs.c
  (svn_diff__lcs): Limit the effort spent on finding an optimal diff to
   avoid spinning on the CPU for a long time (30 minutes or more) for
   some input files. Large XML documents were known to trigger this.
]]]

Index: subversion/libsvn_diff/lcs.c
===================================================================
--- subversion/libsvn_diff/lcs.c	(revision 1890384)
+++ subversion/libsvn_diff/lcs.c	(working copy)
@@ -247,6 +247,7 @@ svn_diff__lcs(svn_diff__position_t *position_list1
   apr_off_t k;
   apr_off_t p = 0;
   svn_diff__lcs_t *lcs, *lcs_freelist = NULL;
+  int maxeffort = 1024;
 
   svn_diff__position_t sentinel_position[2];
 
@@ -353,7 +354,7 @@ svn_diff__lcs(svn_diff__position_t *position_list1
 
       p++;
     }
-  while (fp[0].position[1] != &sentinel_position[1]);
+  while (fp[0].position[1] != &sentinel_position[1] && --maxeffort > 0);
 
   if (suffix_lines)
     lcs->next = prepend_lcs(fp[0].lcs, suffix_lines,

Re: [PATCH] limit diff effort to fix performance issue

Posted by Johan Corveleyn <jc...@gmail.com>.
On Wed, Jun 2, 2021 at 4:05 PM Daniel Shahaf <d....@daniel.shahaf.name> wrote:
>
> Stefan Sperling wrote on Wed, 02 Jun 2021 12:20 +00:00:
> > The test suite is passing, which implies that trivial diffs aren't affected
> > by this change. I expect that most, if not all, diffs which people actually
> > want to read will remain unaffected. But I cannot tell with 100% certainty.
>
> Not with 100% certainty, but you could run «svn diff -c $i ^/» on the
> last few thousand revisions of /repos/asf, diff the resulting diffs, and
> review the non-empty interdiffs.

Another thing I could try is rerunning my old "blame a huge XML file with
lots of revisions" test, which I used when I worked on diff optimisations
back in 2011 (prefix/suffix scanning and some other things -- other people
contributed optimisations back then as well, which I double-checked the
same way). I'm guessing my huge XML file is not huge enough (or our diffs
not pathological enough) for this limit to have much impact, because after
the 2011 optimisations I can't remember running into longer times than,
say, 5 seconds per diff. Otherwise, I can at least give some feedback
about a good "effort limit" with a real-world case where we still consider
the actual diff output somewhat important (for assigning blame :-)).

I'll try to give it a go in a couple of days (but I'll have to dust
off my local build first).

BTW: for whoever is interested, some discussion about this on IRC just
now: https://colabti.org/irclogger/irclogger_log/svn-dev?date=2021-06-02
Feel free to join on #svn-dev on irc.libera.chat ...

-- 
Johan

Re: [PATCH] limit diff effort to fix performance issue

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Stefan Sperling wrote on Wed, 02 Jun 2021 12:20 +00:00:
> The test suite is passing, which implies that trivial diffs aren't affected
> by this change. I expect that most, if not all, diffs which people actually
> want to read will remain unaffected. But I cannot tell with 100% certainty.

Not with 100% certainty, but you could run «svn diff -c $i ^/» on the
last few thousand revisions of /repos/asf, diff the resulting diffs, and
review the non-empty interdiffs.

Re: [PATCH] limit diff effort to fix performance issue

Posted by Stefan Sperling <st...@elego.de>.
On Fri, Apr 01, 2022 at 06:08:14PM +0200, Johan Corveleyn wrote:
> I suppose the "modified for better performance" refers to some
> optimisations done by Morten Kloster, who then later submitted the
> patch adding this comment in r1128862. His optimisations were more
> related to the LCS algorithm (further reducing the problem space, and
> providing early exits in certain cases or some such).
> 
> The core LCS algorithm in libsvn_diff/lcs.c was from way before my
> time, and according to 'svn blame' was mainly written by Sander
> Striker. I don't understand it fully either :-), but I always assumed
> it was basically some clever implementation of Myers' algorithm.

Yes, that is my understanding as well.

I believe this performance problem has always existed.
I have already tried reverting optimizations made over time, and it did
not make any difference with the problematic test data I have.

Re: [PATCH] limit diff effort to fix performance issue

Posted by Nathan Hartman <ha...@gmail.com>.
On Fri, Apr 1, 2022 at 10:37 AM Stefan Sperling <st...@elego.de> wrote:
>
> On Fri, Apr 01, 2022 at 04:17:45PM +0200, Johan Corveleyn wrote:
> > - Perhaps the fundamental diff algorithm in SVN is fine, but it has a
> > performance bug / low-level inefficiency? I think that should be
> > explored first, because it might fix most of this problem without
> > requiring a discussion about the precise output, and without huge
> > efforts of rewrites.
>
> I myself find the current SVN diff code very difficult to make sense of.
> Maybe someone smarter than me (like Sander Striker?) could figure this out.

Just a reminder that there's a program in tools/dev/gen-test-data that
can generate files that trigger the slow diff phenomenon.

Cheers,
Nathan

Re: [PATCH] limit diff effort to fix performance issue

Posted by Johan Corveleyn <jc...@gmail.com>.
On Fri, Apr 1, 2022 at 5:19 PM Stefan Sperling <st...@elego.de> wrote:
>
> On Fri, Apr 01, 2022 at 05:04:49PM +0200, Johan Corveleyn wrote:
> > Yes, I suppose this is the case: Patience feeds different (smaller)
> > things to LCS. Because, as far as I understand, Myers' way of
> > calculating the LCS is fundamentally "somewhat" quadratic (according
> > to the Myers paper from 1986 [1], titled "An O(ND) Difference
> > Algorithm and Its Variations").
> >
> > That is: O(ND) where N = the sum of the lengths of both sides of the
> > diff and D = the size of the minimal diff. That's why eliminating the
> > common prefix and suffix (added as an optimization years ago) is
> > useful, it makes N much smaller.
> >
> > Now, in cases where both texts are huge (even after prefix-suffix
> > stripping), but the minimal diff is small, the algorithm should in
> > theory be fast (because D is small). Just not sure what is going wrong
> > then in our variation of this algorithm. Or have we implemented
> > another algorithm than the one described by Myers?
>
> SVN diff is allegedly "based on the approach described by ... Meyers (sic),
> [...]  but has been modified for better performance."
> I guess the modifications refer to prefix/suffix scanning,
> because this description dates from r1128862.

Hm, not the prefix/suffix scanning, as I implemented those (it was my
first big patch / work in SVN), and that's outside of the LCS
algorithm (it happens in a phase before assembling "tokens (=lines)"
and before handing it off to the LCS algorithm). It was integrated
into trunk in r1067800 (and some further fixes followed afterwards).
Basically, that work is done in
libsvn_diff/diff_file.c#find_identical_{prefix,suffix}.

I suppose the "modified for better performance" refers to some
optimisations done by Morten Kloster, who then later submitted the
patch adding this comment in r1128862. His optimisations were more
related to the LCS algorithm (further reducing the problem space, and
providing early exits in certain cases or some such).

The core LCS algorithm in libsvn_diff/lcs.c was from way before my
time, and according to 'svn blame' was mainly written by Sander
Striker. I don't understand it fully either :-), but I always assumed
it was basically some clever implementation of Myers' algorithm.


--
Johan

Re: [PATCH] limit diff effort to fix performance issue

Posted by Stefan Sperling <st...@elego.de>.
On Fri, Apr 01, 2022 at 05:04:49PM +0200, Johan Corveleyn wrote:
> Yes, I suppose this is the case: Patience feeds different (smaller)
> things to LCS. Because, as far as I understand, Myers' way of
> calculating the LCS is fundamentally "somewhat" quadratic (according
> to the Myers paper from 1986 [1], titled "An O(ND) Difference
> Algorithm and Its Variations").
> 
> That is: O(ND) where N = the sum of the lengths of both sides of the
> diff and D = the size of the minimal diff. That's why eliminating the
> common prefix and suffix (added as an optimization years ago) is
> useful, it makes N much smaller.
> 
> Now, in cases where both texts are huge (even after prefix-suffix
> stripping), but the minimal diff is small, the algorithm should in
> theory be fast (because D is small). Just not sure what is going wrong
> then in our variation of this algorithm. Or have we implemented
> another algorithm than the one described by Myers?

SVN diff is allegedly "based on the approach described by ... Meyers (sic),
[...]  but has been modified for better performance."
I guess the modifications refer to prefix/suffix scanning,
because this description dates from r1128862.

For gameoftrees Neels re-implemented Myers from scratch, using the original
paper and some articles found on the web as a reference. This code looks
very different to SVN diff. But those differences could be due to
implementation style. I cannot judge that. Neels code makes more sense to
me and is exhaustively commented. The SVN diff code is not nearly as
readable from my point of view.

Neels diff code was developed stand-alone in a separate Git repository,
with the intention that it could be re-used outside of gameoftrees:
https://git.gameoftrees.org/gitweb/?p=diff.git;a=summary
If someone eventually ends up looking at rewriting SVN diff's code from
scratch, then this code would be a good starting point (and the licence
is compatible). Neels is an inactive SVN committer, but can be reached
and might be able to provide assistance if required.

Re: [PATCH] limit diff effort to fix performance issue

Posted by Johan Corveleyn <jc...@gmail.com>.
On Fri, Apr 1, 2022 at 4:37 PM Stefan Sperling <st...@elego.de> wrote:
> Yes, roughly, Patience diff involves two algorithms, the grouping of
> lines along similar-line boundaries performed by Patience, and an
> LCS for parts of the files which Patience cannot deal with by itself.
>
> But LCS needs to complete its work before Patience can do its thing,
> and vice-versa. For a given input, each algorithm might run multiple
> times, switching whenever the current algorithm cannot make any process.
> So if LCS still has a fundamental performance issue for some inputs,
> adding Patience diff probably won't help. It could help in cases where
> LCS is now fed with different inputs as determined by Patience, such
> that bad inputs for LCS are avoided. But I don't know if that would
> always be the case.

Yes, I suppose this is the case: Patience feeds different (smaller)
things to LCS. Because, as far as I understand, Myers' way of
calculating the LCS is fundamentally "somewhat" quadratic (according
to the Myers paper from 1986 [1], titled "An O(ND) Difference
Algorithm and Its Variations").

That is: O(ND) where N = the sum of the lengths of both sides of the
diff and D = the size of the minimal diff. That's why eliminating the
common prefix and suffix (added as an optimization years ago) is
useful, it makes N much smaller.

Now, in cases where both texts are huge (even after prefix-suffix
stripping), but the minimal diff is small, the algorithm should in
theory be fast (because D is small). Just not sure what is going wrong
then in our variation of this algorithm. Or have we implemented
another algorithm than the one described by Myers?

Anyway, if Patience diff is able to split up the work for LCS in
smaller bits (with their small O(ND) complexity), then I can totally
understand that this can be faster.
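
As a back-of-the-envelope illustration (numbers purely for the sake of
argument): a single O(ND) run over the whole input costs on the order of
N*D. If Patience manages to split the input into k independent regions,
and both N and D happen to split roughly evenly across them (which they
of course don't have to), the summed cost becomes about

  k * (N/k) * (D/k) = N*D/k

so even a modest number of anchor points can cut the work dramatically.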

[1] http://www.xmailserver.org/diff2.pdf

-- 
Johan

Re: [PATCH] limit diff effort to fix performance issue

Posted by Stefan Sperling <st...@elego.de>.
On Fri, Apr 01, 2022 at 04:17:45PM +0200, Johan Corveleyn wrote:
> - Perhaps the fundamental diff algorithm in SVN is fine, but it has a
> performance bug / low-level inefficiency? I think that should be
> explored first, because it might fix most of this problem without
> requiring a discussion about the precise output, and without huge
> efforts of rewrites.

I myself find the current SVN diff code very difficult to make sense of.
Maybe someone smarter than me (like Sander Striker?) could figure this out.

> - I think a lot of other diff tools out there have an "effort-limit",
> like the one you proposed (like --speed-large-files, aka
> "use-heuristic" for GNU diff [1], [2], which might be the default
> behaviour, dunno). I'm guessing 'git diff' has one too. So agreed,
> perhaps we need one too. Just not entirely sure how high we should put
> the limit, and what we have to communicate around "diff output might
> have changed / might not be a minimal diff" (and whether we still need
> an option to fallback to "go over the limit").

The only way I found to terminate SVN diff code early leaves a
giant ugly diff chunk at the bottom of the diff.
Other diff implementations do not have this issue. And I don't know
how this issue could be fixed in the current code.

> - Patience diff might also be an option (though it might have its own
> set of problems and peculiar edge cases, I don't know). It might even
> be possible to reuse part of the current code for this (Not sure this
> requires a complete rewrite. Does it still require a LCS algorithm at
> its core, after picking apart the problem space? I haven't studied it
> enough to know how it works internally). In this case too there is the
> back-compat vs behavioral-change discussion that will take time to
> settle (Do we change the default algorithm / output? Do we make it
> optional, or do we already have too manu knobs? ...).

Yes, roughly, Patience diff involves two algorithms, the grouping of
lines along similar-line boundaries performed by Patience, and an
LCS for parts of the files which Patience cannot deal with by itself.

But LCS needs to complete its work before Patience can do its thing,
and vice-versa. For a given input, each algorithm might run multiple
times, switching whenever the current algorithm cannot make any progress.
So if LCS still has a fundamental performance issue for some inputs,
adding Patience diff probably won't help. It could help in cases where
LCS is now fed with different inputs as determined by Patience, such
that bad inputs for LCS are avoided. But I don't know if that would
always be the case.
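
To make the division of labour a bit more concrete, here is a tiny made-up
example of the Patience idea (illustrating the general approach, not any
particular implementation):

[[[
  A: x a b c y          B: x c a b z

  Lines occurring exactly once in both files: x, a, b, c.
  Their positions in B, taken in A's order, are 1 3 4 2; the longest
  increasing subsequence of that is 1 3 4, i.e. x, a, b.
  Those three lines become anchors. The remaining stretches between and
  after the anchors are then handled by further passes, falling back to
  an LCS-style diff where no unique common lines are left.
]]]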

Re: [PATCH] limit diff effort to fix performance issue

Posted by Johan Corveleyn <jc...@gmail.com>.
On Fri, Apr 1, 2022 at 1:12 PM Stefan Sperling <st...@apache.org> wrote:
>
> On Fri, Apr 01, 2022 at 12:44:24PM +0200, Johan Corveleyn wrote:
> > On Tue, Jun 8, 2021 at 5:58 PM Johan Corveleyn <jc...@gmail.com> wrote:
> > > On Tue, Jun 8, 2021 at 3:24 PM Stefan Sperling <st...@stsp.name> wrote:
> > > > On Tue, Jun 08, 2021 at 02:57:34PM +0200, Johan Corveleyn wrote:
> > > > > Okay, I focused on the first revision causing the annotate to differ,
> > > > > and with some debug logging:
> > > > > - p went up to 139
> > > > > - length[0]=1942, length[1]=1817
> > > > >
> > > > > Now, 1942 lines on the left and 1817 on the right doesn't seem all
> > > > > that many (that's what remains after prefix-suffix stripping). I see
> > > > > no reason 'svn diff' shouldn't be able to calculate a minimal diff for
> > > > > those sizes in a reasonable amount of time. So if p can go up to such
> > > > > a "cost" for such "reasonable" lenghts, I imagine we should put the
> > > > > actual limit much higher. I suppose we can just as well set "min_cost"
> > > > > to 1024 or even higher. 64 is way too low.
> > > >
> > > > Thanks!
> > > > I have set the value back to at least 1024 with this new version of patch.
> > >
> > > Hmmm, apparently I'm still running into the limit. Even with 8192 I'm
> > > hitting it at least once. The actual diff there is not really pretty,
> > > but it's an actual manual change made by some developer years ago, and
> > > the "minimal diff" is more or less readable / understandable. The log
> > > message is something like "Group sections related to multimedia
> > > module", and it shuffles around a lot of xml sections to group them
> > > together.
> > >
> > > In this case, length[0]==length[1]==46780. Pretty big for the LCS, but
> > > not humongous. The value of 'p' goes up to 8279.
> > >
> > > With the limit set to 16384, I'm not hitting it, and the annotate
> > > comes out as with the original.
> > >
> > > Now, I'm a bit puzzled why limit==8192 already takes 18s in your
> > > tests. In my situation, with the above diff, I get:
> > > limit==8192: 3.3s
> > > limit==16384: 3.5s
> > >
> > > (that's including downloading both versions from the server over https)
> > >
> > > So I'm wondering why it takes so long in your case. One thing to keep
> > > in mind here is that, since this is cpu intensive, optimizations might
> > > matter. I remember back in 2011 that I did most of my measurements
> > > with a "Release" build for example. But the above tests I did were
> > > with a debug build, so ... dunno.
> >
> > [ snip part about another optimization by Morten Kloster in r1140247,
> > which seemed not to work for Stefan. ]
> >
> > Coming back to this thread from last year, just wondering if you want
> > to pick this up again, Stefan (or anyone else), since a 1.15 release
> > might be happening in the near future.
> >
> > I think we already concluded that changing this behaviour in a patch
> > release would be hard to justify. But for 1.15 it might be acceptable
> > in some form.
> >
> > Just wanted to bring this back up, I have no vested interest here. I'm
> > still skeptical about changing the output of diff & blame in cases
> > where we hit "this takes too much work", but I won't stand in the way
> > if that what others consense on :-). If we want to pick a limit of the
> > maximum effort, then I'd suggest 16384 (or above), but that's a purely
> > personal / local suggestion, because the use case I had above would
> > then not be impacted.
> >
> > BTW: it's a pity that we can's say: "limit the diff effort in case of
> > 'update' or 'merge', but not in case of 'diff' or 'blame'". Because in
> > the latter case, the user might care more about the actual output, and
> > in the former they might not (until they hit a text conflict of
> > course, then they will care again ...)
> >
> > It might be interesting if someone could take a more "profiling" look
> > at why Stefan's example takes much more time, even with limit==8192
> > (18 seconds), compared to my example (3.3 seconds). Is that purely
> > because of the total size of the files / number of lines? I find that
> > difference quite strange. There might be something that can be
> > optimized without changing behaviour. But I'm afraid I don't have time
> > to dig deeply into this.
>
> My assumption by now is that the algorithm needs to be changed.
> This would essentially amount to a complete rewrite of the diff code.
>
> There must be a fundamental difference to how 'git diff' works,
> for example, which succeeds on the large files I have tested
> in a relatively short amount of time.
>
> Even 'got diff' (from my side-project gameoftrees.org), which does not
> focus on performance as a first-class feature, completes a diff of the
> problematic test case discussed in this thread in about 3 minutes.
> It uses a diff implemention written by Neels, with the Patience algorithm.
>
> I have shelved this topic for now. I have no desire to rewrite Subversion's
> diff code. And the original use case is an edge case where people are
> versioning large auto-generated XML files for reasons I don't understand.
> They worked around it by configuring 'git diff' as a diff tool for
> Subversion to use, and have not complained since.
> I suspect they could also change their process to work around this issue,
> but such changes are not always easy to get pushed through in a large
> organization. But, in the end, these are home-made problems which could
> be solved by using Subversion as it is intended to be used.

I'm shelving this too, but just one more circle-back here ...

- Perhaps the fundamental diff algorithm in SVN is fine, but it has a
performance bug / low-level inefficiency? I think that should be
explored first, because it might fix most of this problem without
requiring a discussion about the precise output, and without huge
efforts of rewrites.

- I think a lot of other diff tools out there have an "effort-limit",
like the one you proposed (like --speed-large-files, aka
"use-heuristic" for GNU diff [1], [2], which might be the default
behaviour, dunno). I'm guessing 'git diff' has one too. So agreed,
perhaps we need one too. Just not entirely sure how high we should put
the limit, and what we have to communicate around "diff output might
have changed / might not be a minimal diff" (and whether we still need
an option to fallback to "go over the limit").

- Patience diff might also be an option (though it might have its own
set of problems and peculiar edge cases, I don't know). It might even
be possible to reuse part of the current code for this (Not sure this
requires a complete rewrite. Does it still require a LCS algorithm at
its core, after picking apart the problem space? I haven't studied it
enough to know how it works internally). In this case too there is the
back-compat vs behavioral-change discussion that will take time to
settle (Do we change the default algorithm / output? Do we make it
optional, or do we already have too many knobs? ...).

[1] https://github.com/Distrotech/diffutils/blob/9e70e1ce7aaeff0f9c428d1abc9821589ea054f1/NEWS#L238
[2] https://github.com/Distrotech/diffutils/blob/9e70e1ce7aaeff0f9c428d1abc9821589ea054f1/lib/diffseq.h#L255

-- 
Johan

Re: [PATCH] limit diff effort to fix performance issue

Posted by Stefan Sperling <st...@apache.org>.
On Fri, Apr 01, 2022 at 12:44:24PM +0200, Johan Corveleyn wrote:
> On Tue, Jun 8, 2021 at 5:58 PM Johan Corveleyn <jc...@gmail.com> wrote:
> > On Tue, Jun 8, 2021 at 3:24 PM Stefan Sperling <st...@stsp.name> wrote:
> > > On Tue, Jun 08, 2021 at 02:57:34PM +0200, Johan Corveleyn wrote:
> > > > Okay, I focused on the first revision causing the annotate to differ,
> > > > and with some debug logging:
> > > > - p went up to 139
> > > > - length[0]=1942, length[1]=1817
> > > >
> > > > Now, 1942 lines on the left and 1817 on the right doesn't seem all
> > > > that many (that's what remains after prefix-suffix stripping). I see
> > > > no reason 'svn diff' shouldn't be able to calculate a minimal diff for
> > > > those sizes in a reasonable amount of time. So if p can go up to such
> > > > a "cost" for such "reasonable" lenghts, I imagine we should put the
> > > > actual limit much higher. I suppose we can just as well set "min_cost"
> > > > to 1024 or even higher. 64 is way too low.
> > >
> > > Thanks!
> > > I have set the value back to at least 1024 with this new version of patch.
> >
> > Hmmm, apparently I'm still running into the limit. Even with 8192 I'm
> > hitting it at least once. The actual diff there is not really pretty,
> > but it's an actual manual change made by some developer years ago, and
> > the "minimal diff" is more or less readable / understandable. The log
> > message is something like "Group sections related to multimedia
> > module", and it shuffles around a lot of xml sections to group them
> > together.
> >
> > In this case, length[0]==length[1]==46780. Pretty big for the LCS, but
> > not humongous. The value of 'p' goes up to 8279.
> >
> > With the limit set to 16384, I'm not hitting it, and the annotate
> > comes out as with the original.
> >
> > Now, I'm a bit puzzled why limit==8192 already takes 18s in your
> > tests. In my situation, with the above diff, I get:
> > limit==8192: 3.3s
> > limit==16384: 3.5s
> >
> > (that's including downloading both versions from the server over https)
> >
> > So I'm wondering why it takes so long in your case. One thing to keep
> > in mind here is that, since this is cpu intensive, optimizations might
> > matter. I remember back in 2011 that I did most of my measurements
> > with a "Release" build for example. But the above tests I did were
> > with a debug build, so ... dunno.
> 
> [ snip part about another optimization by Morten Kloster in r1140247,
> which seemed not to work for Stefan. ]
> 
> Coming back to this thread from last year, just wondering if you want
> to pick this up again, Stefan (or anyone else), since a 1.15 release
> might be happening in the near future.
> 
> I think we already concluded that changing this behaviour in a patch
> release would be hard to justify. But for 1.15 it might be acceptable
> in some form.
> 
> Just wanted to bring this back up, I have no vested interest here. I'm
> still skeptical about changing the output of diff & blame in cases
> where we hit "this takes too much work", but I won't stand in the way
> if that what others consense on :-). If we want to pick a limit of the
> maximum effort, then I'd suggest 16384 (or above), but that's a purely
> personal / local suggestion, because the use case I had above would
> then not be impacted.
> 
> BTW: it's a pity that we can's say: "limit the diff effort in case of
> 'update' or 'merge', but not in case of 'diff' or 'blame'". Because in
> the latter case, the user might care more about the actual output, and
> in the former they might not (until they hit a text conflict of
> course, then they will care again ...)
> 
> It might be interesting if someone could take a more "profiling" look
> at why Stefan's example takes much more time, even with limit==8192
> (18 seconds), compared to my example (3.3 seconds). Is that purely
> because of the total size of the files / number of lines? I find that
> difference quite strange. There might be something that can be
> optimized without changing behaviour. But I'm afraid I don't have time
> to dig deeply into this.

My assumption by now is that the algorithm needs to be changed.
This would essentially amount to a complete rewrite of the diff code.

There must be a fundamental difference to how 'git diff' works,
for example, which succeeds on the large files I have tested
in a relatively short amount of time.

Even 'got diff' (from my side-project gameoftrees.org), which does not
focus on performance as a first-class feature, completes a diff of the
problematic test case discussed in this thread in about 3 minutes.
It uses a diff implementation written by Neels, with the Patience algorithm.

I have shelved this topic for now. I have no desire to rewrite Subversion's
diff code. And the original use case is an edge case where people are
versioning large auto-generated XML files for reasons I don't understand.
They worked around it by configuring 'git diff' as a diff tool for
Subversion to use, and have not complained since.
I suspect they could also change their process to work around this issue,
but such changes are not always easy to get pushed through in a large
organization. But, in the end, these are home-made problems which could
be solved by using Subversion as it is intended to be used.

Cheers,
Stefan

Re: [PATCH] limit diff effort to fix performance issue

Posted by Johan Corveleyn <jc...@gmail.com>.
On Tue, Jun 8, 2021 at 5:58 PM Johan Corveleyn <jc...@gmail.com> wrote:
> On Tue, Jun 8, 2021 at 3:24 PM Stefan Sperling <st...@stsp.name> wrote:
> > On Tue, Jun 08, 2021 at 02:57:34PM +0200, Johan Corveleyn wrote:
> > > Okay, I focused on the first revision causing the annotate to differ,
> > > and with some debug logging:
> > > - p went up to 139
> > > - length[0]=1942, length[1]=1817
> > >
> > > Now, 1942 lines on the left and 1817 on the right doesn't seem all
> > > that many (that's what remains after prefix-suffix stripping). I see
> > > no reason 'svn diff' shouldn't be able to calculate a minimal diff for
> > > those sizes in a reasonable amount of time. So if p can go up to such
> > > a "cost" for such "reasonable" lenghts, I imagine we should put the
> > > actual limit much higher. I suppose we can just as well set "min_cost"
> > > to 1024 or even higher. 64 is way too low.
> >
> > Thanks!
> > I have set the value back to at least 1024 with this new version of patch.
>
> Hmmm, apparently I'm still running into the limit. Even with 8192 I'm
> hitting it at least once. The actual diff there is not really pretty,
> but it's an actual manual change made by some developer years ago, and
> the "minimal diff" is more or less readable / understandable. The log
> message is something like "Group sections related to multimedia
> module", and it shuffles around a lot of xml sections to group them
> together.
>
> In this case, length[0]==length[1]==46780. Pretty big for the LCS, but
> not humongous. The value of 'p' goes up to 8279.
>
> With the limit set to 16384, I'm not hitting it, and the annotate
> comes out as with the original.
>
> Now, I'm a bit puzzled why limit==8192 already takes 18s in your
> tests. In my situation, with the above diff, I get:
> limit==8192: 3.3s
> limit==16384: 3.5s
>
> (that's including downloading both versions from the server over https)
>
> So I'm wondering why it takes so long in your case. One thing to keep
> in mind here is that, since this is cpu intensive, optimizations might
> matter. I remember back in 2011 that I did most of my measurements
> with a "Release" build for example. But the above tests I did were
> with a debug build, so ... dunno.

[ snip part about another optimization by Morten Kloster in r1140247,
which seemed not to work for Stefan. ]

Coming back to this thread from last year, just wondering if you want
to pick this up again, Stefan (or anyone else), since a 1.15 release
might be happening in the near future.

I think we already concluded that changing this behaviour in a patch
release would be hard to justify. But for 1.15 it might be acceptable
in some form.

Just wanted to bring this back up, I have no vested interest here. I'm
still skeptical about changing the output of diff & blame in cases
where we hit "this takes too much work", but I won't stand in the way
if that's what others reach consensus on :-). If we want to pick a limit of the
maximum effort, then I'd suggest 16384 (or above), but that's a purely
personal / local suggestion, because the use case I had above would
then not be impacted.

BTW: it's a pity that we can't say: "limit the diff effort in case of
'update' or 'merge', but not in case of 'diff' or 'blame'". Because in
the latter case, the user might care more about the actual output, and
in the former they might not (until they hit a text conflict of
course, then they will care again ...)

It might be interesting if someone could take a more "profiling" look
at why Stefan's example takes much more time, even with limit==8192
(18 seconds), compared to my example (3.3 seconds). Is that purely
because of the total size of the files / number of lines? I find that
difference quite strange. There might be something that can be
optimized without changing behaviour. But I'm afraid I don't have time
to dig deeply into this.

-- 
Johan

Re: [PATCH] limit diff effort to fix performance issue

Posted by Stefan Sperling <st...@elego.de>.
On Tue, Jun 08, 2021 at 05:58:14PM +0200, Johan Corveleyn wrote:
> Also: one of the last things that Morten Kloster committed in 2011 was
> another performance improvement, namely "restarting the LCS" in some
> specific cases. At the time, I didn't have enough time to review this
> properly, and had some reservations (but mainly lack of time), so I
> asked him to commit it on a branch (diff-improvements). This feature
> branch never made it to trunk. Just as another "shot", perhaps you
> could try r1140247 and see if it helps?

Unfortunately, r1140247 doesn't help in my case.
I see the new logic which was added trigger once very early on.
But it soon gets stuck in the loop forever just like the code on trunk does.

The code added in r1140247 seems unfinished and untested.
It has a bug in clear_fp() which clang warns about (see below).
Fixing that alone doesn't help my test case though.

subversion/libsvn_diff/lcs.c:264:19: warning: incompatible pointer types assigning to 'svn_diff__lcs_t *'
      (aka 'struct svn_diff__lcs_t *') from 'svn_diff__lcs_t **' (aka 'struct svn_diff__lcs_t **'); dereference with *
      [-Wincompatible-pointer-types]
        lcs->next = lcs_freelist;
                  ^ ~~~~~~~~~~~~
                    *
subversion/libsvn_diff/lcs.c:265:22: warning: incompatible pointer types assigning to 'svn_diff__lcs_t **'
      (aka 'struct svn_diff__lcs_t **') from 'svn_diff__lcs_t *' (aka 'struct svn_diff__lcs_t *'); take the address with &
      [-Wincompatible-pointer-types]
        lcs_freelist = lcs;
                     ^ ~~~
                       &
2 warnings generated.
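
For reference, the change those warnings are hinting at presumably amounts
to the following (a sketch based only on the warnings above; lcs_freelist
is a pointer-to-pointer in that scope):

[[[
  /* in clear_fp() on the diff-improvements branch */
  lcs->next = *lcs_freelist;
  *lcs_freelist = lcs;
]]]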

Re: [PATCH] limit diff effort to fix performance issue

Posted by Johan Corveleyn <jc...@gmail.com>.
On Tue, Jun 8, 2021 at 3:24 PM Stefan Sperling <st...@stsp.name> wrote:
>
> On Tue, Jun 08, 2021 at 02:57:34PM +0200, Johan Corveleyn wrote:
> > Okay, I focused on the first revision causing the annotate to differ,
> > and with some debug logging:
> > - p went up to 139
> > - length[0]=1942, length[1]=1817
> >
> > Now, 1942 lines on the left and 1817 on the right doesn't seem all
> > that many (that's what remains after prefix-suffix stripping). I see
> > no reason 'svn diff' shouldn't be able to calculate a minimal diff for
> > those sizes in a reasonable amount of time. So if p can go up to such
> > a "cost" for such "reasonable" lenghts, I imagine we should put the
> > actual limit much higher. I suppose we can just as well set "min_cost"
> > to 1024 or even higher. 64 is way too low.
>
> Thanks!
> I have set the value back to at least 1024 with this new version of patch.

Hmmm, apparently I'm still running into the limit. Even with 8192 I'm
hitting it at least once. The actual diff there is not really pretty,
but it's an actual manual change made by some developer years ago, and
the "minimal diff" is more or less readable / understandable. The log
message is something like "Group sections related to multimedia
module", and it shuffles around a lot of xml sections to group them
together.

In this case, length[0]==length[1]==46780. Pretty big for the LCS, but
not humongous. The value of 'p' goes up to 8279.

With the limit set to 16384, I'm not hitting it, and the annotate
comes out as with the original.

Now, I'm a bit puzzled why limit==8192 already takes 18s in your
tests. In my situation, with the above diff, I get:
limit==8192: 3.3s
limit==16384: 3.5s

(that's including downloading both versions from the server over https)

So I'm wondering why it takes so long in your case. One thing to keep
in mind here is that, since this is cpu intensive, optimizations might
matter. I remember back in 2011 that I did most of my measurements
with a "Release" build for example. But the above tests I did were
with a debug build, so ... dunno.

Also: one of the last things that Morten Kloster committed in 2011 was
another performance improvement, namely "restarting the LCS" in some
specific cases. At the time, I didn't have enough time to review this
properly, and had some reservations (but mainly lack of time), so I
asked him to commit it on a branch (diff-improvements). This feature
branch never made it to trunk. Just as another "shot", perhaps you
could try r1140247 and see if it helps?

-- 
Johan

Re: [PATCH] limit diff effort to fix performance issue

Posted by Stefan Sperling <st...@stsp.name>.
On Tue, Jun 08, 2021 at 02:57:34PM +0200, Johan Corveleyn wrote:
> Okay, I focused on the first revision causing the annotate to differ,
> and with some debug logging:
> - p went up to 139
> - length[0]=1942, length[1]=1817
> 
> Now, 1942 lines on the left and 1817 on the right doesn't seem all
> that many (that's what remains after prefix-suffix stripping). I see
> no reason 'svn diff' shouldn't be able to calculate a minimal diff for
> those sizes in a reasonable amount of time. So if p can go up to such
> a "cost" for such "reasonable" lenghts, I imagine we should put the
> actual limit much higher. I suppose we can just as well set "min_cost"
> to 1024 or even higher. 64 is way too low.

Thanks!
I have set the value back to at least 1024 with this new version of patch.

> BTW, the actual revision (and proper diff) here was not really
> unreadable. It had 86 hunks. The log message was something like
> "refactor contact parameters", and the change was removing 4 lines and
> replacing them with 1 line, throughout an entire section. If the
> lcs-limit gets hit, the entire "remaining part" (which normally would
> have been some more small hunks) becomes one giant remove+add of 1000
> lines or so (i.e. quite useless for a diff that a human might want to
> review).

Indeed, we want to avoid such changes if possible, and only punish
the cases which have problematic run-time behaviour.

> Finally, a couple of small remarks related to the patch itself:
> [[[
> +  int max_cost;
> +  /* The minimum cost 'p' (# moves away from k=0) we are willing to invest. */
> +  const int min_cost = 64;
> ]]]
> 
> I get confused by the naming of the variables here (unless we use
> min_max_cost or something). Perhaps something like 'cost_limit' and
> 'min_cost_limit' or something like that?

Sure, done.

> [[[
> +  max_cost = shift_sqrt(length[0] + length[1] / 2);
> ]]]
> 
> Is that correct WRT operator precedence? It looks like you want to
> take the sqrt of the average length, which would be (length[0] +
> length[1]) / 2. Or maybe I misunderstand. I'm not really sure why this
> is used as an estimate for the "acceptable cost" / max_cost. But then
> again, it's been a while since I read the details of Myers algorithm.

Good catch. In my test case it doesn't make a difference. The limit ends
up being 2048 in either case. But you're right, the code was incorrect.

[[[
* subversion/libsvn_diff/lcs.c
  (shift_sqrt): New helper function. Borrowed from Game of Trees' diff.
  (svn_diff__lcs): Limit the effort spent on finding an optimal diff to
   avoid spinning on the CPU for a long time (30 minutes or more) for
   some input files. Large XML documents were known to trigger this.
]]]

Index: subversion/libsvn_diff/lcs.c
===================================================================
--- subversion/libsvn_diff/lcs.c	(revision 1890390)
+++ subversion/libsvn_diff/lcs.c	(working copy)
@@ -227,7 +227,18 @@ prepend_lcs(svn_diff__lcs_t *lcs, apr_off_t lines,
   return new_lcs;
 }
 
+/* Integer square root approximation */
+static int
+shift_sqrt(apr_off_t val)
+{
+  int i;
 
+  for (i = 1; val > 0; val >>= 2)
+    i <<= 1;
+
+  return i;
+}
+
 svn_diff__lcs_t *
 svn_diff__lcs(svn_diff__position_t *position_list1, /* pointer to tail (ring) */
               svn_diff__position_t *position_list2, /* pointer to tail (ring) */
@@ -245,7 +256,10 @@ svn_diff__lcs(svn_diff__position_t *position_list1
   svn_diff__snake_t *fp;
   apr_off_t d;
   apr_off_t k;
-  apr_off_t p = 0;
+  apr_off_t p = 0; /* # moves away from k=0 */
+  /* The minimum cost 'p' (# moves away from k=0) we are willing to invest. */
+  const int min_cost_limit = 1024;
+  int max_cost_limit;
   svn_diff__lcs_t *lcs, *lcs_freelist = NULL;
 
   svn_diff__position_t sentinel_position[2];
@@ -337,6 +351,24 @@ svn_diff__lcs(svn_diff__position_t *position_list1
   fp[d - 1].position[0] = sentinel_position[0].next;
   fp[d - 1].position[1] = &sentinel_position[1];
 
+  /* Limit the effort spent looking for a diff snake. If files have
+   * very few lines in common, the effort spent to find nice diff
+   * snakes is just not worth it, the diff result will still be
+   * essentially minus everything on the left, plus everything on
+   * the right, with a few useless matches here and there.
+   *
+   * Very large files which have a lot of common lines interleaved with
+   * changed lines (such as humongous auto-generated XML documents) could
+   * easily keep us busy here for a very long time, in the order of hours.
+   * In such cases we abort the algorithm before it is finished and use
+   * the most recently computed intermediate result. The diff will not be
+   * pretty but a quick suboptimal result is better than a perfect result
+   * which takes hours to compute.
+   */
+  max_cost_limit = shift_sqrt((length[0] + length[1]) / 2);
+  if (max_cost_limit < min_cost_limit)
+    max_cost_limit = min_cost_limit;
+
   p = 0;
   do
     {
@@ -353,7 +385,7 @@ svn_diff__lcs(svn_diff__position_t *position_list1
 
       p++;
     }
-  while (fp[0].position[1] != &sentinel_position[1]);
+  while (fp[0].position[1] != &sentinel_position[1] && p < max_cost_limit);
 
   if (suffix_lines)
     lcs->next = prepend_lcs(fp[0].lcs, suffix_lines,

Re: [PATCH] limit diff effort to fix performance issue

Posted by Johan Corveleyn <jc...@gmail.com>.
On Sun, Jun 6, 2021 at 4:57 PM Stefan Sperling <st...@elego.de> wrote:
>
> On Sat, Jun 05, 2021 at 08:56:24PM +0200, Johan Corveleyn wrote:
> > Hmm, I tried this patch with my "annotate large XML file with deep
> > history test", but the result isn't the same as with 1.14. I'd have to
> > investigate a bit more to find out where it took another turn, and
> > what that diff looks like (and whether or not the "other diff" is a
> > problem for me).
>
> It would be good to know how many iterations your file needs to produce
> a diff without a limit. You can use something like SVN_DBG(("p=%d\n", p))
> to print this number.

Okay, I focused on the first revision causing the annotate to differ,
and with some debug logging:
- p went up to 139
- length[0]=1942, length[1]=1817

Now, 1942 lines on the left and 1817 on the right doesn't seem all
that many (that's what remains after prefix-suffix stripping). I see
no reason 'svn diff' shouldn't be able to calculate a minimal diff for
those sizes in a reasonable amount of time. So if p can go up to such
a "cost" for such "reasonable" lenghts, I imagine we should put the
actual limit much higher. I suppose we can just as well set "min_cost"
to 1024 or even higher. 64 is way too low.

BTW, the actual revision (and proper diff) here was not really
unreadable. It had 86 hunks. The log message was something like
"refactor contact parameters", and the change was removing 4 lines and
replacing them with 1 line, throughout an entire section. If the
lcs-limit gets hit, the entire "remaining part" (which normally would
have been some more small hunks) becomes one giant remove+add of 1000
lines or so (i.e. quite useless for a diff that a human might want to
review).

To give you some more numbers about this large XML file:
- it has 17701 revisions
- it has almost 140,000 lines in the HEAD revision
- it's a little less than 4 MB

Annotating the full history (i.e. performing 17700 diffs) takes around
10 minutes, with or without your lcs-limit patch. So it's not like any
of those 17700 diffs are taking a long time. I just tried running it
with min_cost set to 160, and I still get differences. Next I'll try
with 1024.

Finally, a couple of small remarks related to the patch itself:
[[[
+  int max_cost;
+  /* The minimum cost 'p' (# moves away from k=0) we are willing to invest. */
+  const int min_cost = 64;
]]]

I get confused by the naming of the variables here (unless we use
min_max_cost or something). Perhaps something like 'cost_limit' and
'min_cost_limit' or something like that?

[[[
+  max_cost = shift_sqrt(length[0] + length[1] / 2);
]]]

Is that correct WRT operator precedence? It looks like you want to
take the sqrt of the average length, which would be (length[0] +
length[1]) / 2. Or maybe I misunderstand. I'm not really sure why this
is used as an estimate for the "acceptable cost" / max_cost. But then
again, it's been a while since I read the details of Myers algorithm.

-- 
Johan

Re: [PATCH] limit diff effort to fix performance issue

Posted by Stefan Sperling <st...@elego.de>.
On Sat, Jun 05, 2021 at 08:56:24PM +0200, Johan Corveleyn wrote:
> Hmm, I tried this patch with my "annotate large XML file with deep
> history test", but the result isn't the same as with 1.14. I'd have to
> investigate a bit more to find out where it took another turn, and
> what that diff looks like (and whether or not the "other diff" is a
> problem for me).

It would be good to know how many iterations your file needs to produce
a diff without a limit. You can use something like SVN_DBG(("p=%d\n", p))
to print this number.

Could it help to use the new heuristics only in cases where our existing
performance tricks fail? I.e. enforce a limit only when the amount of
common tokens between the files is low, and the amount of common prefix/suffix
lines relative to the total number of lines is low?
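
Roughly along these lines, perhaps (everything below is made up to
illustrate the idea, including the names and thresholds; it is not a
concrete proposal):

[[[
/* Hypothetical gate: only cap the LCS effort when cheap heuristics
 * suggest the files have little in common anyway. */
static svn_boolean_t
should_limit_effort(apr_off_t common_tokens, apr_off_t total_tokens,
                    apr_off_t prefix_suffix_lines, apr_off_t total_lines)
{
  return (common_tokens * 4 < total_tokens            /* few common tokens */
          && prefix_suffix_lines * 4 < total_lines);  /* short common prefix/suffix */
}
]]]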

Re: [PATCH] limit diff effort to fix performance issue

Posted by Johan Corveleyn <jc...@gmail.com>.
On Thu, Jun 3, 2021 at 12:57 PM Stefan Sperling <st...@elego.de> wrote:
>
> On Wed, Jun 02, 2021 at 02:20:12PM +0200, Stefan Sperling wrote:
> > The patch below implements an effort limit for Subversion's LCS diff.
>
> Here is an updated version of this patch. I took a closer look at
> other diff implementations again, and realized that they don't use
> a fixed limit but one which is scaled to the size of the Myers graph.
> With a square root applied to keep the effort within certain bounds
> as the files being compared become larger.
>
> Compared to the previous patch, I have added comments, and renamed
> 'max_effort' to 'max_cost' to better align with existing documentation
> of the algorithm in existing comments.
>
> We already maintain a variable 'p' which represents the cost of
> the algorithm. So instead of maintaining a separate counter I am
> now setting an upper bound on 'p'.
>
> With this new patch the limit ends up being 2048 for the files I have.
> For regular files in a working copy of SVN trunk, the loop typically
> needs less than 10 iterations.
>
> For reference, run-times of 'svn diff' on my test files with various
> limits, including no limit:
>
> limit = 1024
>     0m06.82s real     0m05.40s user     0m01.41s system
> limit = 2048
>     0m08.15s real     0m06.55s user     0m01.56s system
> limit = 4096
>     0m11.10s real     0m09.59s user     0m01.52s system
> limit = 8192
>     0m18.40s real     0m16.57s user     0m01.57s system
> limit = 65536
>     7m55.40s real     7m53.81s user     0m01.58s system
> Full run without limit:
>   163m31.49s real   163m16.77s user     0m02.37s system
>
> Git/libxdiff somehow manages to produce a smaller collection of larger
> chunks on the problematic files, rather than many small chunks with
> one giant chunk at the end which we end up produing when we bail out
> of the loop.
> I have made some attempts to produce a more readable diff result in
> case the limit is reached but I haven't been successful. Ideas welcome.
> Still, I don't think it makes sense to invest a lot of time into
> optimizing output in this case, under the assumption that nobody is
> going to be carefully reading affected diffs anyway.
>
> [[[
> * subversion/libsvn_diff/lcs.c
>   (shift_sqrt): New helper function. Borrowed from Game of Trees' diff.
>   (svn_diff__lcs): Limit the effort spent on finding an optimal diff to
>    avoid spinning on the CPU for a long time (30 minutes or more) for
>    some input files. Large XML documents were known to trigger this.
> ]]
>
> Index: subversion/libsvn_diff/lcs.c
> ===================================================================
> --- subversion/libsvn_diff/lcs.c        (revision 1890390)
> +++ subversion/libsvn_diff/lcs.c        (working copy)
> @@ -227,7 +227,18 @@ prepend_lcs(svn_diff__lcs_t *lcs, apr_off_t lines,
>    return new_lcs;
>  }
>
> +/* Integer square root approximation */
> +static int
> +shift_sqrt(apr_off_t val)
> +{
> +  int i;
>
> +  for (i = 1; val > 0; val >>= 2)
> +    i <<= 1;
> +
> +  return i;
> +}
> +
>  svn_diff__lcs_t *
>  svn_diff__lcs(svn_diff__position_t *position_list1, /* pointer to tail (ring) */
>                svn_diff__position_t *position_list2, /* pointer to tail (ring) */
> @@ -247,6 +258,9 @@ svn_diff__lcs(svn_diff__position_t *position_list1
>    apr_off_t k;
>    apr_off_t p = 0;
>    svn_diff__lcs_t *lcs, *lcs_freelist = NULL;
> +  int max_cost;
> +  /* The minimum cost 'p' (# moves away from k=0) we are willing to invest. */
> +  const int min_cost = 64;
>
>    svn_diff__position_t sentinel_position[2];
>
> @@ -337,6 +351,24 @@ svn_diff__lcs(svn_diff__position_t *position_list1
>    fp[d - 1].position[0] = sentinel_position[0].next;
>    fp[d - 1].position[1] = &sentinel_position[1];
>
> +  /* Limit the effort spent looking for a diff snake. If files have
> +   * very few lines in common, the effort spent to find nice diff
> +   * snakes is just not worth it, the diff result will still be
> +   * essentially minus everything on the left, plus everything on
> +   * the right, with a few useless matches here and there.
> +   *
> +   * Very large files which have a lot of common lines interleaved with
> +   * changed lines (such as humongous auto-generated XML documents) could
> +   * easily keep us busy here for a very long time, in the order of hours.
> +   * In such cases we abort the algorithm before it is finished and use
> +   * the most recently computed intermediate result. The diff will not be
> +   * pretty but a quick suboptimal result is better than a perfect result
> +   * which takes hours to compute.
> +   */
> +  max_cost = shift_sqrt(length[0] + length[1] / 2);
> +  if (max_cost < min_cost)
> +    max_cost = min_cost;
> +
>    p = 0;
>    do
>      {
> @@ -353,7 +385,7 @@ svn_diff__lcs(svn_diff__position_t *position_list1
>
>        p++;
>      }
> -  while (fp[0].position[1] != &sentinel_position[1]);
> +  while (fp[0].position[1] != &sentinel_position[1] && p < max_cost);
>
>    if (suffix_lines)
>      lcs->next = prepend_lcs(fp[0].lcs, suffix_lines,
>

Hmm, I tried this patch with my "annotate large XML file with deep
history test", but the result isn't the same as with 1.14. I'd have to
investigate a bit more to find out where it took another turn, and
what that diff looks like (and whether or not the "other diff" is a
problem for me).

-- 
Johan

Re: [PATCH] limit diff effort to fix performance issue

Posted by Stefan Sperling <st...@elego.de>.
On Wed, Jun 02, 2021 at 02:20:12PM +0200, Stefan Sperling wrote:
> The patch below implements an effort limit for Subversion's LCS diff.

Here is an updated version of this patch. I took a closer look at
other diff implementations again, and realized that they don't use
a fixed limit but one which is scaled to the size of the Myers graph.
With a square root applied to keep the effort within certain bounds
as the files being compared become larger.

Compared to the previous patch, I have added comments, and renamed
'max_effort' to 'max_cost' to better align with existing documentation
of the algorithm in existing comments.

We already maintain a variable 'p' which represents the cost of
the algorithm. So instead of maintaining a separate counter I am
now setting an upper bound on 'p'.

With this new patch the limit ends up being 2048 for the files I have.
For regular files in a working copy of SVN trunk, the loop typically
needs less than 10 iterations.
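
To spell out where the 2048 comes from (rough arithmetic, assuming the
roughly 1.2 million lines per file mentioned at the start of the thread
mostly survive to the LCS stage):

[[[
  shift_sqrt((length[0] + length[1]) / 2) with an argument around 1200000:
  1200000 needs 21 bits, so the loop runs ceil(21 / 2) = 11 times
  and the result is 2^11 = 2048.
  (The true square root is about 1095, so the limit is a power-of-two
  upper bound of roughly twice that.)
]]]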

For reference, run-times of 'svn diff' on my test files with various
limits, including no limit:

limit = 1024
    0m06.82s real     0m05.40s user     0m01.41s system
limit = 2048
    0m08.15s real     0m06.55s user     0m01.56s system
limit = 4096
    0m11.10s real     0m09.59s user     0m01.52s system
limit = 8192
    0m18.40s real     0m16.57s user     0m01.57s system
limit = 65536
    7m55.40s real     7m53.81s user     0m01.58s system
Full run without limit:
  163m31.49s real   163m16.77s user     0m02.37s system

Git/libxdiff somehow manages to produce a smaller collection of larger
chunks on the problematic files, rather than many small chunks with
one giant chunk at the end, which we end up producing when we bail out
of the loop.
I have made some attempts to produce a more readable diff result in
case the limit is reached but I haven't been successful. Ideas welcome.
Still, I don't think it makes sense to invest a lot of time into
optimizing output in this case, under the assumption that nobody is
going to be carefully reading affected diffs anyway.

[[[
* subversion/libsvn_diff/lcs.c
  (shift_sqrt): New helper function. Borrowed from Game of Trees' diff.
  (svn_diff__lcs): Limit the effort spent on finding an optimal diff to
   avoid spinning on the CPU for a long time (30 minutes or more) for
   some input files. Large XML documents were known to trigger this.
]]]

Index: subversion/libsvn_diff/lcs.c
===================================================================
--- subversion/libsvn_diff/lcs.c	(revision 1890390)
+++ subversion/libsvn_diff/lcs.c	(working copy)
@@ -227,7 +227,18 @@ prepend_lcs(svn_diff__lcs_t *lcs, apr_off_t lines,
   return new_lcs;
 }
 
+/* Integer square root approximation */
+static int
+shift_sqrt(apr_off_t val)
+{
+  int i;
 
+  for (i = 1; val > 0; val >>= 2)
+    i <<= 1;
+
+  return i;
+}
+
 svn_diff__lcs_t *
 svn_diff__lcs(svn_diff__position_t *position_list1, /* pointer to tail (ring) */
               svn_diff__position_t *position_list2, /* pointer to tail (ring) */
@@ -247,6 +258,9 @@ svn_diff__lcs(svn_diff__position_t *position_list1
   apr_off_t k;
   apr_off_t p = 0;
   svn_diff__lcs_t *lcs, *lcs_freelist = NULL;
+  int max_cost;
+  /* The minimum cost 'p' (# moves away from k=0) we are willing to invest. */
+  const int min_cost = 64;
 
   svn_diff__position_t sentinel_position[2];
 
@@ -337,6 +351,24 @@ svn_diff__lcs(svn_diff__position_t *position_list1
   fp[d - 1].position[0] = sentinel_position[0].next;
   fp[d - 1].position[1] = &sentinel_position[1];
 
+  /* Limit the effort spent looking for a diff snake. If files have
+   * very few lines in common, the effort spent to find nice diff
+   * snakes is just not worth it, the diff result will still be
+   * essentially minus everything on the left, plus everything on
+   * the right, with a few useless matches here and there.
+   *
+   * Very large files which have a lot of common lines interleaved with
+   * changed lines (such as humongous auto-generated XML documents) could
+   * easily keep us busy here for a very long time, in the order of hours.
+   * In such cases we abort the algorithm before it is finished and use
+   * the most recently computed intermediate result. The diff will not be
+   * pretty but a quick suboptimal result is better than a perfect result
+   * which takes hours to compute.
+   */
+  max_cost = shift_sqrt(length[0] + length[1] / 2);
+  if (max_cost < min_cost)
+    max_cost = min_cost;
+
   p = 0;
   do
     {
@@ -353,7 +385,7 @@ svn_diff__lcs(svn_diff__position_t *position_list1
 
       p++;
     }
-  while (fp[0].position[1] != &sentinel_position[1]);
+  while (fp[0].position[1] != &sentinel_position[1] && p < max_cost);
 
   if (suffix_lines)
     lcs->next = prepend_lcs(fp[0].lcs, suffix_lines,


Re: [PATCH] limit diff effort to fix performance issue

Posted by Daniel Sahlberg <da...@gmail.com>.
On Wed, Jun 9, 2021 at 5:18 PM Johan Corveleyn <jc...@gmail.com> wrote:

> As for test data, I just remembered something: in 2015 Bert developed
> a tool called "AnonymizedFileDumper", after some discussions we had
> during a hackathon and on IRC (related to blame and diff performance).
> With this tool one can create a dump file from (part of) a repository,
> with all text lines replaced by their CRC32 checksum (so identical
> lines remain identical, but other than that the actual information is
> mostly gone). If you combine this with svndumpfilter for stripping out
> the log messages and the author names, I think it's pretty much
> stripped of all sensitive information (I remember I asked Bert to add
> 'eliding author names and log messages' as extra features of his dumper
> tool, but don't remember whether he eventually got around to that). It
> is the perfect tool for creating test data out of real repositories
> out there, without leaking company data, so other devs can take a
> look.
>
> I don't have the tool handy anymore, but I just searched the IRC logs
> and found some mentions of it here:
>
> https://colabti.org/irclogger/irclogger_log_search/svn-dev?search=nonymized&action=search&timespan=20150101-20151231&text=checked
>
> The binary download link still works, but it's a Windows binary. I
> don't know if the sources are still available from the sharpsvn
> repository.
>

The sources are still there and checked out fine. I couldn't build it,
but I assume that is just a missing NuGet package for SharpSvn. No time
to dig around further.

Kind regards,
Daniel

Re: [PATCH] limit diff effort to fix performance issue

Posted by Johan Corveleyn <jc...@gmail.com>.
On Tue, Jun 8, 2021 at 3:29 PM Nathan Hartman <ha...@gmail.com> wrote:
>
> On Tue, Jun 8, 2021 at 5:55 AM Stefan Sperling <st...@elego.de> wrote:
> >
> > On Tue, Jun 08, 2021 at 01:45:00AM -0400, Nathan Hartman wrote:
> > > In order to do some testing, I needed some test data that reproduces
> > > the issue; since stsp can't share the customer's 100MB XML file, and
> > > we'd probably want other inputs or sizes anyway, I wrote a program
> > > that attempts to generate such a thing. I'm attaching that program...
> > >
> > > To build, rename to .c extension and, e.g.,
> > > $ gcc gen_diff_test_data.c -o gen_diff_test_data
> > >
> > > To run it, provide two parameters:
> > >
> > > The first is a 'seed' value like you'd provide to a pseudo random
> > > number generator at init time.
> > >
> > > The second is a 'length' parameter that says how long (approximately)
> > > you want the output data to be. (The program nearly always overshoots
> > > this by a small amount.)
> > >
> > > Rather than using the system's pseudo random number generator, this
> > > program includes its own implementation to ensure that users on any
> > > system can get the same results when using the same parameters. So if
> > > different people want to test with the same sets of input, you only
> > > have to share 2 numbers, rather than send each other files >100MB of
> > > useless junk.
> > >
> > > Example: Generate two files of approx 100 MB, containing lots of
> > > differences and diff them:
> > >
> > > $ gen_diff_test_data 98 100m > one.txt
> > > $ gen_diff_test_data 99 100m > two.txt
> > > $ time diff one.txt two.txt > /dev/null
> > >
> > > With the above parameters, it takes my system's diff about 50 seconds
> > > to come up with something that looks reasonable at a glance; svn's
> > > diff has been crunching away for a while now...
> >
> > Thank you Nathan, this is incredibly useful!
> >
> > Would you consider committing this tool to our repository, e.g. somewhere
> > within the tools/dev/ subtree?
>
>
> Sure, done in r1890601.
>
> It's in tools/dev/gen-test-data/gen_diff_test_data.c.
>
> I added the gen-test-data directory in case we want to add other
> sample data generators in the future.

As for test data, I just remembered something: in 2015 Bert developed
a tool called "AnonymizedFileDumper", after some discussions we had
during a hackathon and on IRC (related to blame and diff performance).
With this tool one can create a dump file from (part of) a repository,
with all text lines replaced by their CRC32 checksum (so identical
lines remain identical, but other than that the actual information is
mostly gone). If you combine this with svndumpfilter for stripping out
the log messages and the author names, I think it's pretty much
stripped of all sensitive information (I remember I asked Bert to add
'eliding author names and log messages' as extra features of his dumper
tool, but don't remember whether he eventually got around to that). It
is the perfect tool for creating test data out of real repositories
out there, without leaking company data, so other devs can take a
look.
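
A minimal sketch of the line-anonymizing idea (assuming zlib's crc32();
this is not the actual AnonymizedFileDumper code). Build with e.g.
'gcc anon_lines.c -lz':

[[[
#include <stdio.h>
#include <string.h>
#include <zlib.h>

/* Read text on stdin and emit one CRC32 per input line, so identical
 * lines stay identical while the original content is discarded.
 * Simplification: lines longer than the buffer are hashed in chunks. */
int
main(void)
{
  char line[8192];

  while (fgets(line, sizeof(line), stdin))
    {
      size_t len = strlen(line);

      if (len > 0 && line[len - 1] == '\n')
        line[--len] = '\0';

      printf("%08lx\n", crc32(0L, (const Bytef *)line, (uInt)len));
    }
  return 0;
}
]]]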

I don't have the tool handy anymore, but I just searched the IRC logs
and found some mentions of it here:
https://colabti.org/irclogger/irclogger_log_search/svn-dev?search=nonymized&action=search&timespan=20150101-20151231&text=checked

The binary download link still works, but it's a Windows binary. I
don't know if the sources are still available from the sharpsvn
repository.

@Stefan: maybe you can use this to create an anonymized test case out
of the 100 MB XML file of the elego client? You'd need a Windows
machine if you want to use that binary from 2015.

@Bert: WDYT? Are the sources still out there? Maybe this can be ported
to "plain old svn test tools" (perhaps with the help of someone here
that can spend some time on it)?
(and hello by the way :-))

--
Johan

Re: [PATCH] limit diff effort to fix performance issue

Posted by Nathan Hartman <ha...@gmail.com>.
On Tue, Jun 8, 2021 at 5:55 AM Stefan Sperling <st...@elego.de> wrote:
>
> On Tue, Jun 08, 2021 at 01:45:00AM -0400, Nathan Hartman wrote:
> > In order to do some testing, I needed some test data that reproduces
> > the issue; since stsp can't share the customer's 100MB XML file, and
> > we'd probably want other inputs or sizes anyway, I wrote a program
> > that attempts to generate such a thing. I'm attaching that program...
> >
> > To build, rename to .c extension and, e.g.,
> > $ gcc gen_diff_test_data.c -o gen_diff_test_data
> >
> > To run it, provide two parameters:
> >
> > The first is a 'seed' value like you'd provide to a pseudo random
> > number generator at init time.
> >
> > The second is a 'length' parameter that says how long (approximately)
> > you want the output data to be. (The program nearly always overshoots
> > this by a small amount.)
> >
> > Rather than using the system's pseudo random number generator, this
> > program includes its own implementation to ensure that users on any
> > system can get the same results when using the same parameters. So if
> > different people want to test with the same sets of input, you only
> > have to share 2 numbers, rather than send each other files >100MB of
> > useless junk.
> >
> > Example: Generate two files of approx 100 MB, containing lots of
> > differences and diff them:
> >
> > $ gen_diff_test_data 98 100m > one.txt
> > $ gen_diff_test_data 99 100m > two.txt
> > $ time diff one.txt two.txt > /dev/null
> >
> > With the above parameters, it takes my system's diff about 50 seconds
> > to come up with something that looks reasonable at a glance; svn's
> > diff has been crunching away for a while now...
>
> Thank you Nathan, this is incredibly useful!
>
> Would you consider committing this tool to our repository, e.g. somewhere
> within the tools/dev/ subtree?


Sure, done in r1890601.

It's in tools/dev/gen-test-data/gen_diff_test_data.c.

I added the gen-test-data directory in case we want to add other
sample data generators in the future.

Cheers,
Nathan

Re: [PATCH] limit diff effort to fix performance issue

Posted by Stefan Sperling <st...@elego.de>.
On Tue, Jun 08, 2021 at 01:45:00AM -0400, Nathan Hartman wrote:
> In order to do some testing, I needed some test data that reproduces
> the issue; since stsp can't share the customer's 100MB XML file, and
> we'd probably want other inputs or sizes anyway, I wrote a program
> that attempts to generate such a thing. I'm attaching that program...
> 
> To build, rename to .c extension and, e.g.,
> $ gcc gen_diff_test_data.c -o gen_diff_test_data
> 
> To run it, provide two parameters:
> 
> The first is a 'seed' value like you'd provide to a pseudo random
> number generator at init time.
> 
> The second is a 'length' parameter that says how long (approximately)
> you want the output data to be. (The program nearly always overshoots
> this by a small amount.)
> 
> Rather than using the system's pseudo random number generator, this
> program includes its own implementation to ensure that users on any
> system can get the same results when using the same parameters. So if
> different people want to test with the same sets of input, you only
> have to share 2 numbers, rather than send each other files >100MB of
> useless junk.
> 
> Example: Generate two files of approx 100 MB, containing lots of
> differences and diff them:
> 
> $ gen_diff_test_data 98 100m > one.txt
> $ gen_diff_test_data 99 100m > two.txt
> $ time diff one.txt two.txt > /dev/null
> 
> With the above parameters, it takes my system's diff about 50 seconds
> to come up with something that looks reasonable at a glance; svn's
> diff has been crunching away for a while now...

Thank you Nathan, this is incredibly useful!

Would you consider committing this tool to our repository, e.g. somewhere
within the tools/dev/ subtree?

Thanks,
Stefan

Re: [PATCH] limit diff effort to fix performance issue

Posted by Nathan Hartman <ha...@gmail.com>.
On Wed, Jun 2, 2021 at 8:20 AM Stefan Sperling <st...@elego.de> wrote:
>
> I've heard of several instances where users complain that 'svn update'
> takes an extraordinary amount of time to run (in terms of hours).
>
> At elego we have received files which reproduce such behaviour.
> These files are XML files that are almost 100MB in size. While versioning
> such data is not the primary use case for Subversion, this should not get
> in the way of people getting work done. So I would like to get this fixed.
>
> The root of the performance problem is located in Subversion's diff code.
>
> The files contain in the order of 1.200.000 lines of XML text. I cannot
> share the full files, unfortunately.


In order to do some testing, I needed some test data that reproduces
the issue; since stsp can't share the customer's 100MB XML file, and
we'd probably want other inputs or sizes anyway, I wrote a program
that attempts to generate such a thing. I'm attaching that program...

To build, rename to .c extension and, e.g.,
$ gcc gen_diff_test_data.c -o gen_diff_test_data

To run it, provide two parameters:

The first is a 'seed' value like you'd provide to a pseudo random
number generator at init time.

The second is a 'length' parameter that says how long (approximately)
you want the output data to be. (The program nearly always overshoots
this by a small amount.)

Rather than using the system's pseudo random number generator, this
program includes its own implementation to ensure that users on any
system can get the same results when using the same parameters. So if
different people want to test with the same sets of input, you only
have to share 2 numbers, rather than send each other files >100MB of
useless junk.
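
A minimal sketch of such a self-contained deterministic generator, for
illustration only (the generator actually used in gen_diff_test_data.c
may differ); here a 64-bit xorshift:

[[[
#include <stdint.h>
#include <stdio.h>

static uint64_t prng_state;

static void
prng_seed(uint64_t seed)
{
  prng_state = seed ? seed : 0x9e3779b97f4a7c15ULL; /* avoid the all-zero state */
}

static uint64_t
prng_next(void)
{
  uint64_t x = prng_state;

  x ^= x << 13;
  x ^= x >> 7;
  x ^= x << 17;
  return prng_state = x;
}

int
main(void)
{
  int i;

  prng_seed(98);  /* same seed => same stream on any platform */
  for (i = 0; i < 4; i++)
    printf("%016llx\n", (unsigned long long)prng_next());
  return 0;
}
]]]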

Example: Generate two files of approx 100 MB, containing lots of
differences and diff them:

$ gen_diff_test_data 98 100m > one.txt
$ gen_diff_test_data 99 100m > two.txt
$ time diff one.txt two.txt > /dev/null

With the above parameters, it takes my system's diff about 50 seconds
to come up with something that looks reasonable at a glance; svn's
diff has been crunching away for a while now...

Cheers,
Nathan

Re: [PATCH] limit diff effort to fix performance issue

Posted by Nathan Hartman <ha...@gmail.com>.
On Wed, Jun 2, 2021 at 8:20 AM Stefan Sperling <st...@elego.de> wrote:

> I've heard of several instances where users complain that 'svn update'
> takes an extraordinary amount of time to run (in terms of hours).


Ouch!

P vs NP situation? In general I think it is reasonable to limit effort
in order to generate acceptable results (in this case diffs that work)
in a reasonable time (seconds).

Thoughts below:

(snip)

> As a reference, Git manages to produce a diff in about 13 seconds:
>
> $ time git diff 01.arxml 02.arxml > /dev/null
>     0m13.39s real     0m13.04s user     0m00.36s system


(snip)

> The patch below implements an effort limit for Subversion's LCS diff.
> Diffing the offending XML files in a Subversion working copy becomes
> fast enough to be usable:
>
> $ time svn diff > /dev/null
>     0m06.82s real     0m05.40s user     0m01.41s system
>
> The resulting diff doesn't look as pretty as Git's diff output, but I don't
> think this will matter in practice.


One observation is that Git appears to try for twice as long in this
case, so perhaps doubling our effort limit from 1024 to 2048 will give
(some) parity between the two VCSs' prettiness and performance here?
(Of course, this case is just one data point...)

Another thought is to limit based on time rather than iterations, or
based on (time x file size). But for files that hit the limit, that
would cause non-deterministic output (inconsistent from one run to
another), which is kind of A Bad Thing. Perhaps (iterations x file
size) would be better, as that would be deterministic.

> My only concern is that some corner-case diffs that are currently readable
> could become less readable. If such cases are found, we could easily
> address
> the problem by bumping the limit. Note: Any such diffs would still be
> *valid*.


(snip)

> Before anyone asks, without evidence that it is really needed I don't think
> adding a config option to set this limit is warranted. Other
> implementations
> are also using hard-coded limits. We could add an environment variable to
> allow for experimentation, perhaps. In any case, I would prefer to make
> such tweaks in follow-up patches and keep this initial fix simple so it
> can be backported easily.


Agreed that it doesn't justify a new config.

I think a config doesn't make sense anyway: If the user wants to
control diff quality, a command line argument or API parameter would
allow us to keep performance snappy by default but provide for a
deep(er) think when the user specifically asks for it. (How would it
be specified? A time limit in seconds?) But, agreed that anything
like that, if done, should be a totally separate patch.

Thanks for the detailed write-up and patch.

I'll run some experiments...

Cheers,
Nathan

Re: [PATCH] limit diff effort to fix performance issue

Posted by "C. Michael Pilato" <cm...@red-bean.com>.
On Wed, Jun 2, 2021 at 8:20 AM Stefan Sperling <st...@elego.de> wrote:

> Other diff implementations limit the effort spent by aborting after a
> certain number of search iterations, and then simply use the valid
> traversal which was most recently discovered. The libxdiff code used by Git
> does this, as does the diff implementation which Neels Hofmyer wrote for
> Game of Trees (see
> https://git.gameoftrees.org/gitweb/?p=diff.git;a=summary)


[Emerging without years of context, he says...]

Even GNU diff offers options along these lines, right?  By default, it
shoots for a valid but relatively snappy response at the cost of
conciseness.  But you can add --minimal if you really want it to work
harder at producing tighter output.
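
For example (with placeholder file names):

[[[
$ diff old.txt new.txt            # default: fast, possibly less concise
$ diff --minimal old.txt new.txt  # spend more effort on a smaller diff
]]]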

I say that not to propose that you make this behavior optional, but merely
to point out another (practically ancient) case of a diff
algorithm's default behavior favoring performance over conciseness.

Oh, and "hey!"

-- Mike