You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@subversion.apache.org by jc...@apache.org on 2016/10/16 20:58:16 UTC
svn commit: r1765190 - /subversion/trunk/notes/diff-optimizations.txt

Author: jcorvel
Date: Sun Oct 16 20:58:16 2016
New Revision: 1765190

URL: http://svn.apache.org/viewvc?rev=1765190&view=rev
Log:
* notes/diff-optimizations.txt: Remove suggestion I.5. (Discarding
  non-matching lines before running the LCS algorithm), since this was
  already done a long time ago. It was implemented by Morten Kloster (moklo)
  (in a slightly different way than the suggestion in the notes, i.e. by
  counting tokens and skipping them later in the LCS algorithm, but it
  does the trick nicely) and committed in r1132811.
  While here, eliminate stale footnotes and renumber them.

Modified:
    subversion/trunk/notes/diff-optimizations.txt

Modified: subversion/trunk/notes/diff-optimizations.txt
URL: http://svn.apache.org/viewvc/subversion/trunk/notes/diff-optimizations.txt?rev=1765190&r1=1765189&r2=1765190&view=diff
==============================================================================
--- subversion/trunk/notes/diff-optimizations.txt (original)
+++ subversion/trunk/notes/diff-optimizations.txt Sun Oct 16 20:58:16 2016
@@ -42,62 +42,20 @@ I. Speeding up "minimal" diff (no heuris
   - This approach probably conflicts with the "Merge hash calculation with 
     EOL scanning" suggestion.
 
-5) Discarding non-matching lines before running the LCS (Longest Common
-   Subsequence) algorithm.
-
-  - Can make a *huge* impact on files which have a lot of different/unique
-    lines.
-    (for instance, an example script from the users list [1], which
-    generates two files with random lines, 3/4th of which are
-    non-matching: diff goes from several hours to a couple of seconds;
-    another example is a big file of which indentation style was changed
-    from tabs to spaces, and diffing without an ignore-whitespace option).
-
-  - GNU diff does this as well, but strangely, it does it only as part
-    of a heuristic (which also discards other "confusing lines"). If run
-    with --minimal, these lines are not discarded (though, AFAICS, there
-    should be no problem discarding the non-matching lines even when going
-    for a minimal diff).
-
-  - From mailthread [2]:
-    The core algorithm for calculating a (minimal) diff is to first find
-    the Longest Common Subsequence (the longest sequence of lines which
-    appear both in A and in B (not necessarily as contiguous lines, there
-    can be gaps)). The actual diff can be derived from this very quickly.
-
-    But lines which only appear in A, or only in B, can never be part of
-    the LCS, because they are not common lines to begin with.
-
-    So we can just as well calculate the LCS of A' and B' (A and B
-    without their "definitely non-matching" lines). This will also be an
-    LCS of A and B, because there are no common lines added which can make
-    it even longer.
-
-    The only thing I'm not 100% sure of is if it would yield the same LCS
-    (there can be many LCS's, all equally long, hence the different
-    possible diff-representations which are sometimes not nice for human
-    readers). However, gut feeling tells me it will be the same. I can't
-    prove this though, so feel free to come up with a counter example :-).
-    (although having a different LCS would not be a disaster: it would
-    still be a minimal diff, but a different one).
-
-    The practical difficulty is to map line numbers from lines in A and B
-    to their corresponding lines in A' and B', and back again.
-
 
 II. Going for a non-minimal diff (i.e. heuristics)
 --------------------------------------------------
 
 In some cases, heuristics can make a big difference (while not guaranteeing
 that you'll get a minimal diff).
-See also issue #1966 (libsvn_diff needs 'non-minimal-diff' mode) [3].
+See also issue #1966 (libsvn_diff needs 'non-minimal-diff' mode) [1].
 
 1) Make prefix/suffix scanning able to skip 1 or a couple of 
    non-matching lines, if it is able to find strictly more matching lines
    after that, to keep the prefix/suffix scanning going.
 
    This will usually work well, but can sometimes lead to missynchronization
-   (see [4]):
+   (see [2]):
 
      bxxaxxbxx
       || ||
@@ -111,7 +69,7 @@ See also issue #1966 (libsvn_diff needs
 
 2) Add some heuristics-based shortcuts in the LCS algorithm.
       
-3) Implement another diff algorithm, such as "Patience Diff" [5], which is
+3) Implement another diff algorithm, such as "Patience Diff" [3], which is
    already implemented in several other (D)VCS's. It has the potential to
    be much faster (reducing the problem to calculating several, much
    smaller LCS's), and has the added advantage of often producing "nicer"
@@ -122,12 +80,8 @@ See also issue #1966 (libsvn_diff needs
 References
 ----------
 
-[1] http://svn.haxx.se/users/archive-2011-01/0319.shtml
-[2] http://svn.haxx.se/dev/archive-2011-02/0772.shtml
-[3] http://subversion.tigris.org/issues/show_bug.cgi?id=1966 (libsvn_diff
+[1] http://subversion.tigris.org/issues/show_bug.cgi?id=1966 (libsvn_diff
 needs 'non-minimal-diff' mode)
-[4] Miller, W., and Myers, E.W. "A File Comparison Program.", Software -
+[2] Miller, W., and Myers, E.W. "A File Comparison Program.", Software -
     Practice & Experience 15 (1985), pp. 1025-1040.
-[5] http://bramcohen.livejournal.com/73318.html
-[6] Mailthread with an earlier version of these notes, and some additional
-    discussion: http://svn.haxx.se/dev/archive-2011-02/0490.shtml
\ No newline at end of file
+[3] http://bramcohen.livejournal.com/73318.html