Posted to dev@subversion.apache.org by Johan Corveleyn <jc...@gmail.com> on 2011/02/15 01:42:42 UTC

Further diff optimization ideas

Hi,

Here are some more ideas for optimizing "svn diff". Maybe I should
start to write up a notes file, but for now I'm settling for a list in
a mail.

[ note: these ideas are not original, other people have observed some
of these as well (and discussed on IRC and all), and some inspiration
also comes from studying GNU diff's code, and reading some research
papers. I'm just putting them out here. ]

I. Speeding up "minimal" diff (no heuristics)
---------------------------------------------

1) Further low-level optimization of prefix/suffix scanning. [ Quick win ]

- Can speed up identical prefix/suffix processing by another 50% (rough guess).


2) Faster hash function for hashing lines. [ Quick win ]

- Currently, we use adler32 (after the line has been normalized).

- GNU diff uses some simple bit shifting scheme, which seems to be a
lot faster (maybe it has more hash-collisions, but that doesn't seem
to matter that much).

- A quick test with a similar bit-shifting hash, in an extra for-loop
running over all characters of a line, provided a big speedup (> 50%)
of the get_tokens step (which is a big factor after prefix/suffix has
been stripped).

- Maybe the fast hash algorithm FNV-1a, recently mentioned by Blair
in [1], could also be a good option (though it may even be too complex
for our purposes; simple bit-shifting might be faster still since, as
I said, hash collisions aren't too bad).
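
For illustration, a 32-bit FNV-1a over an (already normalized) line
would look roughly like this (sketch only; the constants are the
standard FNV ones, the function name is made up):

    #include <stddef.h>
    #include <stdint.h>

    static uint32_t
    fnv1a_line(const char *line, size_t len)
    {
      uint32_t h = 2166136261u;            /* FNV offset basis */
      size_t i;

      for (i = 0; i < len; i++)
        {
          h ^= (unsigned char)line[i];     /* xor in the next byte ... */
          h *= 16777619u;                  /* ... then multiply by the FNV prime */
        }

      return h;
    }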


3) Choose size of token.c#svn_diff__tree_t roots array (roots for the
multiple-rooted token tree) dynamically, relative to the number of
lines in the file. Or just choose a fixed size bigger than 127. [
Quick win if just picking a larger fixed size ]

- GNU diff chooses a size dynamically: a prime number between 1/3rd
and 2/3rds of the number of lines (a rough sketch follows at the end
of this item).

- OTOH, svn diff doesn't suffer that much from the small size, because
it uses a binary search tree rooted at every element in the array (GNU
diff uses a simple linked list, so it needs to compare linearly with
the lines in the same "equivalence class"). Still, simply increasing
the size of the array can have a significant impact for big files,
without sacrificing too much memory IMHO.
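
A rough sketch of the dynamic sizing (all names are made up; the
prime search is the naive one, which is cheap enough since it runs
once per diff):

    static int
    is_prime(unsigned int n)
    {
      unsigned int d;

      if (n < 2)
        return 0;
      for (d = 2; d * d <= n; d++)
        if (n % d == 0)
          return 0;
      return 1;
    }

    static unsigned int
    choose_num_roots(unsigned int num_lines)
    {
      unsigned int size = num_lines / 3;

      if (size < 127)
        size = 127;              /* keep the current size as a floor */
      while (!is_prime(size))
        size++;                  /* lands between 1/3rd and 2/3rds */
      return size;
    }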


4) Choosing a bigger CHUNK_SIZE in diff_file.c. [ Quick win ]

- Currently, we read data from the files in chunks of 128k (i.e. "1 <<
17"). I think we can safely increase that a couple of "shifts", to
512k or even 1 or 2 MB (what's 1 MB of RAM these days, considering
this is purely a client-side operation). For comparison: GNU diff
simply reads both files into memory in full (but I don't think we
should go that route, the chunked operation of svn diff is a really
nice feature IMO).
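
The change itself would be trivial; something along these lines in
diff_file.c (the macro names here are just from memory, so take them
as an assumption):

    #define CHUNK_SHIFT 20                  /* was 17, i.e. 128 KB */
    #define CHUNK_SIZE (1 << CHUNK_SHIFT)   /* now 1 MB per chunk  */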


5) Discarding non-matching lines before running the LCS (Longest
Common Subsequence) algorithm. [ Lot of work ]

- With "non-matching lines" I mean lines which don't have a single
match in the other file. These lines can be immediately identified as
added or deleted lines. They can be "discarded" (noting the
adds/deletes), and the LCS algorithm can be run on the remainder. This
can make a *huge* impact on files which have a lot of different/unique
lines (a rough sketch of this discarding step follows at the end of
this item).

(for instance, an example script from the users list [2], which
generates two files with random lines, 3/4th of which are
non-matching: diff goes from several hours to a couple of seconds;
another example is a big file whose indentation style was changed from
tabs to spaces, diffed without an ignore-whitespace option).

- GNU diff does this as well, but strangely, it does it only as part
of a heuristic (which also discards other "confusing lines"). If run
with --minimal, these lines are not discarded (though, AFAICS, there
should be no problem discarding the non-matching lines even when going
for a minimal diff).
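
A rough sketch of the discarding step mentioned above (all names are
made up): assume every line has already been mapped to a token id, and
that we have per-token occurrence counts for the other side. Lines
whose token never occurs on the other side are recorded as plain
adds/deletes; the rest form A'/B', on which the LCS is computed:

    /* Returns the number of lines kept; kept_lines[] receives the
       original line numbers that survive into A' (or B'). */
    static int
    keep_common_lines(const int *token_of_line, int num_lines,
                      const int *count_on_other_side,
                      int *kept_lines)
    {
      int i, kept = 0;

      for (i = 0; i < num_lines; i++)
        {
          if (count_on_other_side[token_of_line[i]] > 0)
            kept_lines[kept++] = i;  /* goes into A' (or B') */
          /* else: line i is immediately an add or a delete */
        }

      return kept;
    }

The kept_lines[] table doubles as the line-number mapping that is
needed to translate LCS results on A'/B' back to A/B.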


II. Going for a non-minimal diff (i.e. heuristics)
--------------------------------------------------

I think prefix/suffix-scanning, together with the suggestions from I.,
will already make "svn diff" quite fast in most real-world cases. It
will normally be just as fast as, if not faster than, GNU
"diff --minimal" (barring low-level differences, and GNU diff's
advantage of doing everything in memory).

If I.5. is implemented, "svn diff" will usually be faster than GNU
"diff --minimal", because GNU diff only discards non-matching lines
when running with heuristics (so I think in a lot of cases, "svn diff"
will match the performance of GNU diff running with heuristics, which
is its default).

Still, in some cases, heuristics can make a big difference (while not
guaranteeing that you'll get a minimal diff).

- Existing issue:
http://subversion.tigris.org/issues/show_bug.cgi?id=1966 (libsvn_diff
needs 'non-minimal-diff' mode.).

- Apart from implementing heuristics as part of the current algorithm
(which can quickly become quite complex), we could also take a look at
an alternative approach, such as "Patience Diff" [3], which is already
implemented in several other (D)VCS's. It has the potential to be much
faster (reducing the problem to calculating several, much smaller,
LCS's), and has the added advantage of often producing "nicer" diff
output.
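
To sketch the core of that approach (illustrative code, not any
existing implementation): take the lines that occur exactly once in
both files, match them up as (pos_in_a, pos_in_b) pairs sorted by
pos_in_a, and keep the longest subset that is also increasing in
pos_in_b. Those pairs become anchors, and an ordinary (much smaller)
LCS is run between consecutive anchors:

    typedef struct unique_match_t { int pos_in_a; int pos_in_b; } unique_match_t;

    /* Longest chain of matches increasing on both sides ("patience
       sorting"); tails[] needs room for n entries and holds, per chain
       length, the smallest pos_in_b such a chain can end with.
       (Recovering the chain itself needs back-pointers, omitted here.) */
    static int
    longest_anchor_chain(const unique_match_t *matches, int n, int *tails)
    {
      int len = 0, i;

      for (i = 0; i < n; i++)
        {
          int lo = 0, hi = len;

          /* first chain length whose tail is >= this pos_in_b */
          while (lo < hi)
            {
              int mid = (lo + hi) / 2;
              if (tails[mid] < matches[i].pos_in_b)
                lo = mid + 1;
              else
                hi = mid;
            }
          tails[lo] = matches[i].pos_in_b;
          if (lo == len)
            len++;
        }

      return len;
    }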


Cheers,
-- 
Johan

[1] http://svn.haxx.se/dev/archive-2011-01/0263.shtml
[2] http://svn.haxx.se/users/archive-2011-01/0319.shtml
[3] http://bramcohen.livejournal.com/73318.html

Re: Further diff optimization ideas

Posted by Stefan Fuhrmann <eq...@web.de>.
On 13.05.2011 10:44, Julian Foad wrote:
> Johan Corveleyn wrote:
>> Ok, to wrap this up for now: r1102471 finally put these thoughts into
>> notes/diff-optimizations.txt, with some of Stefan2's feedback/ideas
>> integrated into it.
> Johan, thank you for writing down the ideas you've thought through so
> much, where we can find them later, in addition to all the improvements
> you have already implemented.
/me is looking forward to discussing "everything diff-y" in Berlin ;)

-- Stefan^2.


Re: Further diff optimization ideas

Posted by Julian Foad <ju...@wandisco.com>.
Johan Corveleyn wrote:
> Ok, to wrap this up for now: r1102471 finally put these thoughts into
> notes/diff-optimizations.txt, with some of Stefan2's feedback/ideas
> integrated into it.

Johan, thank you for writing down the ideas you've thought through so
much, where we can find them later, in addition to all the improvements
you have already implemented.

- Julian



Re: Further diff optimization ideas

Posted by Johan Corveleyn <jc...@gmail.com>.
Ok, to wrap this up for now: r1102471 finally put these thoughts into
notes/diff-optimizations.txt, with some of Stefan2's feedback/ideas
integrated into it.

I also added another, previously mentioned idea into the notes file,
which I forgot to mention in this mail thread:

--- 8< ---
Avoid some hashing by exploiting the fact that matching lines often come
   in series.

  - If the previous line had a match with the other file, first try to
    directly compare (memcmp) the next line with the successor of the
    matched line. Only if it doesn't match, calculate the hash to insert
    it into the container.
  - This approach probably conflicts with the "Merge hash calculation with
    EOL scanning" suggestion.
--- 8< ---

(not sure if this is a worthwhile idea, but just thought I'd mention it).
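
A sketch of how that could look in the line-reading loop (names are
made up; prev_match is the line in the other file that the previous
line of this file was matched against, or -1):

    #include <string.h>

    typedef struct line_t { const char *data; size_t len; } line_t;

    /* Returns the matched line index in the other file, or -1 when we
       have to fall back to hashing + container lookup as before. */
    static int
    try_successor_match(const line_t *this_file, int i,
                        const line_t *other_file, int other_num_lines,
                        int prev_match)
    {
      if (prev_match >= 0 && prev_match + 1 < other_num_lines)
        {
          const line_t *a = &this_file[i];
          const line_t *b = &other_file[prev_match + 1];

          if (a->len == b->len && memcmp(a->data, b->data, a->len) == 0)
            return prev_match + 1;       /* matched without hashing */
        }
      return -1;
    }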

Cheers,
-- 
Johan

Re: Further diff optimization ideas

Posted by Johan Corveleyn <jc...@gmail.com>.
On Mon, Feb 21, 2011 at 8:55 AM, Stefan Fuhrmann <eq...@web.de> wrote:
> On 15.02.2011 01:42, Johan Corveleyn wrote:
>>
>> Hi,
>>
>> Here are some more ideas for optimizing "svn diff". Maybe I should
>> start to write up a notes file, but for now I'm settling for a list in
>> a mail.
>>
>> [ note: these ideas are not original, other people have observed some
>> of these as well (and discussed on IRC and all), and some inspiration
>> also comes from studying GNU diff's code, and reading some research
>> papers. I'm just putting them out here. ]
>>
>> I. Speeding up "minimal" diff (no heuristics)
>> ---------------------------------------------
>>
>> 1) Further low-level optimization of prefix/suffix scanning. [ Quick win ]
>>
>> - Can speed up identical prefix/suffix processing by another 50% (rough
>> guess).
>
> Go for it. It's a quick win. Maybe defer the merger until 1.8.
>>
>> 2) Faster hash function for hashing lines. [ Quick win ]
>>
>> - Currently, we use adler32 (after the line has been normalized).
>>
>> - GNU diff uses some simple bit shifting scheme, which seems to be a
>> lot faster (maybe it has more hash-collisions, but that doesn't seem
>> to matter that much).
>>
>> - A quick test with a similar bit-shifting hash, in an extra for-loop
>> running over all characters of a line, provided a big speedup (>  50%)
>> of the get_tokens step (which is a big factor after prefix/suffix has
>> been stripped).
>>
>> -  Maybe the fast hash algorithm FNV-1a, recently mentioned by Blair
>> in [1], could also be a good option (though it may be even too complex
>> for our purposes, simple bit-shifting might be even faster (as I said,
>> hash collisions aren't too bad)).
>
> Without having done any measurement on this myself,
> my suspicion is that the actual hash function is not the
> key factor. I guess the following are the issues:
>
> * finding the end of the line is relatively expensive
> * function calling overhead, including loop setup & finalization
>  (typ. line length is 30 chars - measurement taken on kde.org)
> * inserting the result in a hash container
>
> The actual hash functions are not an issue as long as
> they contain no DIV/MOD statement - not even at the
> end of the process. To me, the keys to higher speed
> would be:
>
> * merge hash calculation with EOL scanning
> * use a low collision rate / low overhead hash container
> * call static functions only
>
> I came up with some ideas that should increase the
> throughput to 50+ mio lines / sec / core (assuming
> an average of 30 chars / line). But it's not for the faint
> of heart and needs an extra post ;)

After more discussion about this with stefan2 on irc, I have done some
more measurements of the different hash variants (adler32 vs.
apr_hashfunc_default vs. the GNU diff HASH). It seems I was mistaken,
and you are correct, Stefan. The hash function itself is not a big
factor, in fact the currently used adler32 was slightly faster than
the other two.

My initial assumption was based on old measurements I had done once,
with an artificial file full of lines with large random numbers. Not
exactly a representative example of source code or anything.
Apparently, it made a big difference on that example.

Now I retested with a file with random lines with random characters
(lines of a random line length, between 0 and 80 characters), and
there was hardly any difference between the hash functions.

So I'll just stop looking at this. Though I would certainly like to
see a solution implemented one day that integrates the EOL scanning
with the ignore-XXX normalization and the hash calculation, all in one
pass. I agree that would be a much better way of processing the lines.
Processing 50 mio lines / sec definitely sounds nice :-).

>>
>> 3) Choose size of token.c#svn_diff__tree_t roots array (roots for the
>> multiple-rooted token tree) dynamically, relative to the number of
>> lines in the file. Or just choose a fixed size bigger than 127. [
>> Quick win if just picking a larger fixed size ]
>
> Maybe 1013 would be a more suitable size (assuming
> each entry consists of exactly one pointer). Larger
> values push the hash out of the L1 cache.

Ok, thanks. Will try that.

> If the hash function is reasonably close to an even,
> independent distribution, an array size of 2^N can
> be faster: The number of collisions should be the
> same (good hash function) but the index is much
> cheaper to calculate.
>
> Plain Adler32 is not a good hash in that sense but
> can easily be improved to
>
> index = (a32 + (a32 >> 16)) & (2^N-1)
>>
>> - GNU diff chooses a size dynamically, a prime number between 1/3rd
>> and 2/3rds the number of lines.
>
> I.e., there will be ~2 entries per bucket on average.
>>
>> - OTOH, svn diff suffers not that much from the small size, because it
>> uses a binary search tree rooted at every element in the array (GNU
>> diff uses a simple linked list, so needs to compare "linearly" with
>> lines part of the same "equivalence class"). Still, simply increasing
>> the size of the array can have significant impact for big files,
>> without sacrificing memory too much IMHO.
>
> If we get >4 entries on average per bucket (i.e. >512
> lines total). Therefore, our lookup is likely more expensive.
> Also, a single-linked list uses less memory.

I don't understand. Are you saying that, from 512 lines onward, the
tree will be less expensive, or more expensive for the lookup?

In any case, I'm mainly interested in large files here, because for
small files the differences will almost always be negligible. So it's
pretty likely that we'll have more than 512 lines in the interesting
cases.

> I wonder what the actual distribution looks like because
> certain lines are very frequent (e.g. empty lines) and
> will end up in the same bucket.

Yes, but identical lines will just reuse the same node (and the
position_list associates line numbers with pointers to nodes in the
tree), so they will not increase the size of the tree/list of that
bucket.

>>
>> 4) Choosing a bigger CHUNK_SIZE in diff_file.c. [ Quick win ]
>>
>> - Currently, we read data from the files in chunks of 128k (i.e. "1<<
>> 17"). I think we can safely increase that a couple of "shifts", to
>> 512k or even 1 or 2 MB (what's 1 MB of RAM these days, considering
>> this is purely a client-side operation). For comparison: GNU diff
>> simply reads both files into memory in full (but I don't think we
>> should go that route, the chunked operation of svn diff is a really
>> nice feature IMO).
>
> I think ~1MB will be safe. We can keep 2 of these chunks
> in L3 cache on virtually any modern CPU. That may be
> relevant when comparing the actual line content.

Ok, will try that.

>>
>> 5) Discarding non-matching lines before running the LCS (Longest
>> Common Subsequence) algorithm. [ Lot of work ]
>>
>> - With "non-matching lines", I mean lines which don't have a single
>> match in the other file. These lines can be immediately identified as
>> added or deleted lines. They can be "discarded" (noting the
>> adds/deletes), and the LCS algorithm can be run on the remainder. This
>> can make a *huge* impact on files which have a lot of different/unique
>> lines.
>
> I can see how "definitely non-matching" lines are relevant
> but could you provide a sketch how they are used to sub-
> divide (or whatever) the file with respect to the LCS code?

I haven't really worked this out in detail, but I think it will have
no impact on the result. The reasoning is as follows:

- The core algorithm for calculating a (minimal) diff is to first find
the Longest Common Subsequence (the longest sequence of lines which
appear both in A and in B (not necessarily as contiguous lines, there
can be gaps)). The actual diff can be derived from this very quickly.

- But lines which only appear in A, or only in B, can never be part of
the LCS, because they are not common lines to begin with.

- So we can just as well calculate the LCS of A' and B' (A and B
without their "definitely non-matching" lines). This will also be an
LCS of A and B, because there are no common lines added which can make
it even longer.

The only thing I'm not 100% sure of is whether it would yield the same
LCS (there can be many LCSs, all equally long, hence the different
possible diff representations, which are sometimes not nice for human
readers). However, gut feeling tells me it will be the same. I can't
prove it though, so feel free to come up with a counterexample :-).
(although having a different LCS would not be a disaster: it would
still be a minimal diff, just a different one).

The practical difficulty is to map line numbers from lines in A and B
to their corresponding lines in A' and B', and back again.
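
With the kept->original tables from the discarding step, the "back
again" direction is just an array lookup; a sketch (illustrative
names):

    #include <stdio.h>

    typedef struct lcs_pair_t { int a_prime; int b_prime; } lcs_pair_t;

    /* Translate an LCS computed over A'/B' back to original line numbers. */
    static void
    print_lcs_in_original_coords(const lcs_pair_t *lcs, int lcs_len,
                                 const int *kept_to_orig_a,
                                 const int *kept_to_orig_b)
    {
      int i;

      for (i = 0; i < lcs_len; i++)
        printf("common line: A:%d <-> B:%d\n",
               kept_to_orig_a[lcs[i].a_prime],
               kept_to_orig_b[lcs[i].b_prime]);
    }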

> With mainly source code in mind, one would expect the
> actual difference to contain a number of unique lines. Diff
> could become an almost O(N) operation in that case.
>>
>> (for instance, an example script from the users list [2], which
>> generates two files with random lines, 3/4th of which are
>> non-matching: diff goes from several hours to a couple of seconds;
>> another example is a big file of which indentation style was changed
>> from tabs to spaces, and diffing without an ignore-whitespace option).
>>
>> - GNU diff does this as well, but strangely, it does it only as part
>> of a heuristic (which also discards other "confusing lines"). If run
>> with --minimal, these lines are not discarded (though, AFAICS, there
>> should be no problem discarding the non-matching lines even when going
>> for a minimal diff).
>
> It is obvious how unique lines split the contents on both sides
> individually. However, it may be hard to prove that  a particular
> alignment of the chunks in between on one side with those on
> the respective other yields a minimal result.
>>
>> II. Going for a non-minimal diff (i.e. heuristics)
>> --------------------------------------------------
>>
>> I think prefix/suffix-scanning, together with the suggestions from I.,
>> will already make "svn diff" quite fast in most real-world cases. It
>> will normally be just as fast, if not faster than GNU "diff --minimal"
>> (barring low-level differences, and the advantage of GNU diff of doing
>> everything in memory).
>>
>> If I.5. is implemented, "svn diff" will usually be faster than GNU
>> "diff --minimal", because GNU diff doesn't do that unless it's running
>> with heuristics (so I think in a lot of cases, "svn diff" will match
>> performance of GNU diff running with heuristics (which is its
>> default)).
>>
>> Still, in some cases, heuristics can make a big difference (while not
>> guaranteeing that you'll get a minimal diff).
>>
>> - Existing issue:
>> http://subversion.tigris.org/issues/show_bug.cgi?id=1966 (libsvn_diff
>> needs 'non-minimal-diff' mode.).
>>
>> - Apart from implementing heuristics as part of the current algorithm
>> (which can quickly become quite complex), we could also take a look at
>> an alternative approach, such as "Patience Diff" [3], which is already
>> implemented in several other (D)VCS's. It has the potential to be much
>> faster (reducing the problem to calculating several, much smaller,
>> LCS's), and has the added advantage of often producing "nicer" diff
>> output.
>
> Patience Diff seems to be ideally suited for source code
> diffs because function headers etc. provide many of the
> "only one matching" lines evenly spread across the whole
> file.

Yes, that could be quite interesting.

Though the choice between normal diff and patience diff (which is
really a heuristic: in some cases it will give a non-minimal diff, but
those are quite rare I think) should be up to the user (AFAIK that's
also the case with the DVCSs that implement this: it's an option). So
optimizing the normal diff as far as we can is still very much
worthwhile...

Cheers,
-- 
Johan

Re: Further diff optimization ideas

Posted by Stefan Fuhrmann <eq...@web.de>.
On 15.02.2011 01:42, Johan Corveleyn wrote:
> Hi,
>
> Here are some more ideas for optimizing "svn diff". Maybe I should
> start to write up a notes file, but for now I'm settling for a list in
> a mail.
>
> [ note: these ideas are not original, other people have observed some
> of these as well (and discussed on IRC and all), and some inspiration
> also comes from studying GNU diff's code, and reading some research
> papers. I'm just putting them out here. ]
>
> I. Speeding up "minimal" diff (no heuristics)
> ---------------------------------------------
>
> 1) Further low-level optimization of prefix/suffix scanning. [ Quick win ]
>
> - Can speed up identical prefix/suffix processing by another 50% (rough guess).
Go for it. It's a quick win. Maybe defer the merger until 1.8.
> 2) Faster hash function for hashing lines. [ Quick win ]
>
> - Currently, we use adler32 (after the line has been normalized).
>
> - GNU diff uses some simple bit shifting scheme, which seems to be a
> lot faster (maybe it has more hash-collisions, but that doesn't seem
> to matter that much).
>
> - A quick test with a similar bit-shifting hash, in an extra for-loop
> running over all characters of a line, provided a big speedup (>  50%)
> of the get_tokens step (which is a big factor after prefix/suffix has
> been stripped).
>
> -  Maybe the fast hash algorithm FNV-1a, recently mentioned by Blair
> in [1], could also be a good option (though it may be even too complex
> for our purposes, simple bit-shifting might be even faster (as I said,
> hash collisions aren't too bad)).
Without having done any measurement on this myself,
my suspicion is that the actual hash function is not the
key factor. I guess the following are the issues:

* finding the end of the line is relatively expensive
* function calling overhead, including loop setup & finalization
   (typ. line length is 30 chars - measurement taken on kde.org)
* inserting the result in a hash container

The actual hash functions are not an issue as long as
they contain no DIV/MOD statement - not even at the
end of the process. To me, the keys to higher speed
would be:

* merge hash calculation with EOL scanning
* use a low collision rate / low overhead hash container
* call static functions only

I came up with some ideas that should increase the
throughput to 50+ mio lines / sec / core (assuming
an average of 30 chars / line). But it's not for the faint
of heart and needs an extra post ;)
> 3) Choose size of token.c#svn_diff__tree_t roots array (roots for the
> multiple-rooted token tree) dynamically, relative to the number of
> lines in the file. Or just choose a fixed size bigger than 127. [
> Quick win if just picking a larger fixed size ]
Maybe 1013 would be a more suitable size (assuming
each entry consists of exactly one pointer). Larger
values push the hash out of the L1 cache.

If the hash function is reasonably close to an even,
independent distribution, an array size of 2^N can
be faster: The number of collisions should be the
same (good hash function) but the index is much
cheaper to calculate.

Plain Adler32 is not a good hash in that sense but
can easily be improved to

index = (a32 + (a32 >> 16)) & (2^N-1)
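
In C terms (reading "2^N-1" as a power-of-two mask, not an XOR), and
with an illustrative N and macro name, that would be:

    #define NUM_ROOTS_SHIFT 10            /* N = 10 -> 1024 roots */
    #define ROOT_INDEX(a32) \
      (((a32) + ((a32) >> 16)) & ((1u << NUM_ROOTS_SHIFT) - 1))
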
> - GNU diff chooses a size dynamically, a prime number between 1/3rd
> and 2/3rds the number of lines.
I.e., there will be ~2 entries per bucket on average.
> - OTOH, svn diff suffers not that much from the small size, because it
> uses a binary search tree rooted at every element in the array (GNU
> diff uses a simple linked list, so needs to compare "linearly" with
> lines part of the same "equivalence class"). Still, simply increasing
> the size of the array can have significant impact for big files,
> without sacrificing memory too much IMHO.
If we get >4 entries on average per bucket (i.e. >512
lines total). Therefore, our lookup is likely more expensive.
Also, a single-linked list uses less memory.

I wonder what the actual distribution looks like because
certain lines are very frequent (e.g. empty lines) and
will end up in the same bucket.
> 4) Choosing a bigger CHUNK_SIZE in diff_file.c. [ Quick win ]
>
> - Currently, we read data from the files in chunks of 128k (i.e. "1<<
> 17"). I think we can safely increase that a couple of "shifts", to
> 512k or even 1 or 2 MB (what's 1 MB of RAM these days, considering
> this is purely a client-side operation). For comparison: GNU diff
> simply reads both files into memory in full (but I don't think we
> should go that route, the chunked operation of svn diff is a really
> nice feature IMO).
I think ~1MB will be safe. We can keep 2 of these chunks
in L3 cache on virtually any modern CPU. That may be
relevant when comparing the actual line content.
> 5) Discarding non-matching lines before running the LCS (Longest
> Common Subsequence) algorithm. [ Lot of work ]
>
> - With "non-matching lines", I mean lines which don't have a single
> match in the other file. These lines can be immediately identified as
> added or deleted lines. They can be "discarded" (noting the
> adds/deletes), and the LCS algorithm can be run on the remainder. This
> can make a *huge* impact on files which have a lot of different/unique
> lines.
I can see how "definitely non-matching" lines are relevant
but could you provide a sketch how they are used to sub-
divide (or whatever) the file with respect to the LCS code?

With mainly source code in mind, one would expect the
actual difference to contain a number of unique lines. Diff
could become an almost O(N) operation in that case.
> (for instance, an example script from the users list [2], which
> generates two files with random lines, 3/4th of which are
> non-matching: diff goes from several hours to a couple of seconds;
> another example is a big file of which indentation style was changed
> from tabs to spaces, and diffing without an ignore-whitespace option).
>
> - GNU diff does this as well, but strangely, it does it only as part
> of a heuristic (which also discards other "confusing lines"). If run
> with --minimal, these lines are not discarded (though, AFAICS, there
> should be no problem discarding the non-matching lines even when going
> for a minimal diff).
It is obvious how unique lines split the contents on both sides
individually. However, it may be hard to prove that  a particular
alignment of the chunks in between on one side with those on
the respective other yields a minimal result.
> II. Going for a non-minimal diff (i.e. heuristics)
> --------------------------------------------------
>
> I think prefix/suffix-scanning, together with the suggestions from I.,
> will already make "svn diff" quite fast in most real-world cases. It
> will normally be just as fast, if not faster than GNU "diff --minimal"
> (barring low-level differences, and the advantage of GNU diff of doing
> everything in memory).
>
> If I.5. is implemented, "svn diff" will usually be faster than GNU
> "diff --minimal", because GNU diff doesn't do that unless it's running
> with heuristics (so I think in a lot of cases, "svn diff" will match
> performance of GNU diff running with heuristics (which is its
> default)).
>
> Still, in some cases, heuristics can make a big difference (while not
> guaranteeing that you'll get a minimal diff).
>
> - Existing issue:
> http://subversion.tigris.org/issues/show_bug.cgi?id=1966 (libsvn_diff
> needs 'non-minimal-diff' mode.).
>
> - Apart from implementing heuristics as part of the current algorithm
> (which can quickly become quite complex), we could also take a look at
> an alternative approach, such as "Patience Diff" [3], which is already
> implemented in several other (D)VCS's. It has the potential to be much
> faster (reducing the problem to calculating several, much smaller,
> LCS's), and has the added advantage of often producing "nicer" diff
> output.
Patience Diff seems to be ideally suited for source code
diffs because function headers etc. provide many of the
"only one matching" lines evenly spread across the whole
file.

-- Stefan^2.


Re: Further diff optimization ideas

Posted by Branko Čibej <br...@e-reka.si>.
On 15.02.2011 13:04, Johan Corveleyn wrote:
>> In other news, why do we need our own hash function anyway? Or our own
>> table or tree or whatever? NIH syndrome?
> No idea, really. I'm probably not the guy you're expecting an answer
> from :-), but:
> - apr_hashfunc_default could be very useful here, I think.
> - Concerning the tree (see token.c#tree_insert_token), maybe the
> reason is that some special things are done while inserting nodes into
> the tree? (like first comparing with the hash, and if that matches,
> compare the lines themselves; and letting all matching nodes point to
> the same "token" (line), when they're being inserted).

That's what apr_hash_t does ... more or less. Your token can just be the
address of the string in the hash.

-- Brane

Re: Further diff optimization ideas

Posted by Johan Corveleyn <jc...@gmail.com>.
2011/2/15 Branko Čibej <br...@e-reka.si>:
> On 15.02.2011 01:42, Johan Corveleyn wrote:
>> 2) Faster hash function for hashing lines. [ Quick win ]
>>
>> - Currently, we use adler32 (after the line has been normalized).
>>
>> - GNU diff uses some simple bit shifting scheme, which seems to be a
>> lot faster (maybe it has more hash-collisions, but that doesn't seem
>> to matter that much).
>
> Is it (x << 5) + x? If so, that's the (in)famous "times 33" hash
> function, also see the essay in apr_hashfunc_default in
> http://svn.apache.org/viewvc/apr/apr/trunk/tables/apr_hash.c?view=markup

No, it's actually the following (see
http://git.savannah.gnu.org/cgit/diffutils.git/tree/src/io.c):

    /* Rotate an unsigned value to the left.  */
    #define ROL(v, n) ((v) << (n) | (v) >> (sizeof (v) * CHAR_BIT - (n)))

    /* Given a hash value and a new character, return a new hash value.  */
    #define HASH(h, c) ((c) + ROL (h, 7))

(which is then called for every character in a for-loop (while
"normalizing" the characters for ignore-xxx options)).

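Put together, the per-line loop looks roughly like this (a sketch, not
GNU diff's actual code beyond the two macros above; the ignore-case
branch stands in for the "normalizing" step):

    #include <ctype.h>
    #include <limits.h>
    #include <stddef.h>

    #define ROL(v, n) ((v) << (n) | (v) >> (sizeof (v) * CHAR_BIT - (n)))
    #define HASH(h, c) ((c) + ROL (h, 7))

    static unsigned int
    hash_line(const char *line, size_t len, int ignore_case)
    {
      unsigned int h = 0;
      size_t i;

      for (i = 0; i < len; i++)
        {
          unsigned int c = (unsigned char)line[i];

          if (ignore_case)
            c = (unsigned int)tolower((int)c);  /* normalize, then hash */
          h = HASH(h, c);
        }

      return h;
    }
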
I think this code pre-dates the technique used in apr_hash.c (the
"essay" you refer to in apr_hashfunc_default mentions a 1992 article,
referencing a 1990 Usenet message; the code in GNU diff has, I think,
been there since 1986 or so). So it's possible that it's a little less
optimal than the one from apr_hash.c (I don't know, really).

> I prefer APR's implementation, because a reasonably good compiler these
> days /should/ be able to optimize (x * 33) to ((x << 5) + x), depending
> on the characteristics of the target architecture.

Yes, I think I'll give it a try for hashing the lines for diff. Good
idea (I don't know my way very well around the libraries yet, so
thanks a lot for the pointer).

> In other news, why do we need our own hash function anyway? Or our own
> table or tree or whatever? NIH syndrome?

No idea, really. I'm probably not the guy you're expecting an answer
from :-), but:
- apr_hashfunc_default could be very useful here, I think.
- Concerning the tree (see token.c#tree_insert_token), maybe the
reason is that some special things are done while inserting nodes into
the tree? (like comparing with the hash first and, if that matches,
comparing the lines themselves; and letting all matching nodes point
to the same "token" (line) as they're being inserted).

Cheers,
-- 
Johan

Re: Further diff optimization ideas

Posted by Branko Čibej <br...@e-reka.si>.
On 15.02.2011 01:42, Johan Corveleyn wrote:
> 2) Faster hash function for hashing lines. [ Quick win ]
>
> - Currently, we use adler32 (after the line has been normalized).
>
> - GNU diff uses some simple bit shifting scheme, which seems to be a
> lot faster (maybe it has more hash-collisions, but that doesn't seem
> to matter that much).

Is it (x << 5) + x? If so, that's the (in)famous "times 33" hash
function, also see the essay in apr_hashfunc_default in
http://svn.apache.org/viewvc/apr/apr/trunk/tables/apr_hash.c?view=markup

I prefer APR's implementation, because a reasonably good compiler these
days /should/ be able to optimize (x * 33) to ((x << 5) + x), depending
on the characteristics of the target architecture.
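
For reference, the "times 33" scheme in both of its equivalent forms
(sketch only; apr_hashfunc_default's real signature differs):

    #include <stddef.h>

    static unsigned int
    times33_hash(const char *line, size_t len)
    {
      unsigned int h = 0;
      size_t i;

      for (i = 0; i < len; i++)
        {
          /* equivalently: h = ((h << 5) + h) + (unsigned char)line[i]; */
          h = h * 33 + (unsigned char)line[i];
        }

      return h;
    }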

In other news, why do we need our own hash function anyway? Or our own
table or tree or whatever? NIH syndrome?

-- Brane