You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@subversion.apache.org by Johan Corveleyn <jc...@gmail.com> on 2010/10/03 01:46:08 UTC

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Hi,

Here is a second iteration of the patch. It now passes make check.

Differences from the previous version are:
- Support for \r eol-style (\n and \r\n was already ok).
- The number of prefix_lines is now passed to svn_diff__lcs, so it can
use that value to set the position offset of the "EOF" marker
correctly, in case one of both files has become empty after skipping
the prefix. This fixes the crashes in blame_tests.py 2 and 7.

The patch is pretty big, so please let me know if I should split it up
to make it more reviewable (I could easily split it up between the
prefix-finding and the suffix-finding, at the cost of having overview
over the entire algorithm).

Still to do:
- Think about why results are sometimes different (because of
eliminated suffix, the LCS can sometimes be slightly different), and
what can be done about it.
- Generalize for more than 2 datasources (for diff3 and diff4).
- revv svn_diff_fns_t and maybe other stuff I've changed in public API.
- Add support for -x-b, -x-w, and -x--ignore-eol-style options.

But I'd like to do those things in follow-up patches, after this one
has been reviewed and digested a little bit. So at this point: review,
feedback, ... very welcome :-).

Log message:
[[[
Make svn_diff_diff skip identical prefix and suffix to make diff and blame
faster.

* subversion/include/svn_diff.h
  (svn_diff_fns_t): Added new function types datasources_open and
   get_prefix_lines to the vtable.

* subversion/libsvn_diff/diff_memory.c
  (datasources_open): New function (does nothing).
  (get_prefix_lines): New function (does nothing).
  (svn_diff__mem_vtable): Added new functions datasources_open and
   get_prefix_lines.

* subversion/libsvn_diff/diff_file.c
  (svn_diff__file_baton_t): Added members prefix_lines, suffix_start_chunk[4]
   and suffix_offset_in_chunk.
  (increment_pointer_or_chunk, decrement_pointer_or_chunk): New functions.
  (find_identical_prefix, find_identical_suffix): New functions.
  (datasources_open): New function, to open both datasources and find their
   identical prefix and suffix.
  (get_prefix_lines): New function.
  (datasource_get_next_token): Stop at start of identical suffix.
  (svn_diff__file_vtable): Added new functions datasources_open and
   get_prefix_lines.

* subversion/libsvn_diff/diff.h
  (svn_diff__get_tokens): Added argument "datasource_opened", to indicate that
   the datasource was already opened.

* subversion/libsvn_diff/token.c
  (svn_diff__get_tokens): Added argument "datasource_opened". Only open the
   datasource if datasource_opened is FALSE. Set the starting offset of the
   position list to the number of prefix lines.

* subversion/libsvn_diff/lcs.c
  (svn_diff__lcs): Added argument "prefix_lines". Use this to correctly set
   the offset of the sentinel position for EOF, even if one of the files
   became empty after eliminating the identical prefix.

* subversion/libsvn_diff/diff.c
  (svn_diff__diff): Add a chunk of "common" diff for identical prefix.
  (svn_diff_diff): Use new function datasources_open, to open original and
   modified at once, and find their identical prefix and suffix. Pass
   prefix_lines to svn_diff__lcs and to svn_diff__diff.

* subversion/libsvn_diff/diff3.c
  (svn_diff_diff3): Pass datasource_opened = FALSE to svn_diff__get_tokens.
   Pass prefix_lines = 0 to svn_diff__lcs.

* subversion/libsvn_diff/diff4.c
 (svn_diff_diff4): Pass datasource_opened = FALSE to svn_diff__get_tokens.
   Pass prefix_lines = 0 to svn_diff__lcs.
]]]

Cheers,
-- 
Johan

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Johan Corveleyn <jc...@gmail.com>.

On Sat, Oct 9, 2010 at 5:19 PM, Daniel Shahaf <d....@daniel.shahaf.name> wrote:
> Johan Corveleyn wrote on Sat, Oct 09, 2010 at 14:21:09 +0200:
>> (side-note: I considered first doing suffix scanning, then prefix
>> scanning, so I could reuse the buffers/pointers from diff_baton all
>> the time, and still have everything pointing correctly after
>> eliminating prefix/suffix. But that could give vastly different
>> results in some cases, for instance when original file is entirely
>> identical to both the prefix and the suffix of the modified file. So I
>> decided it's best to stick with first prefix, then suffix).
>
> What Hyrum said.  How common /is/ this case?  And, anyway, in that case
> both "everything was appended" and "everything was prepended" are
> equally legitimate diffs.

Hm, I'm not sure about this one. I just wanted to try the maximum
reasonably possible to keep the results identical to what they were.
Using another buffer for suffix scanning didn't seem that big of a
deal (only slight increase in memory use (2 chunks of 128K in current
implementation)). I made that decision pretty early, before I knew of
the other problem of suffix scanning, and the keep-50-suffix-lines
compromise we decided upon.

There may be more subtle cases than the one I described, I don't know.
OTOH, now that we have the keep-50-suffix-lines, that may help also in
this case. I'll have to think about that. Maybe I can give it a go,
first suffix then prefix, and see if I can find real-life problems ...

-- 
Johan

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Johan Corveleyn wrote on Sat, Oct 09, 2010 at 14:21:09 +0200:
> (side-note: I considered first doing suffix scanning, then prefix
> scanning, so I could reuse the buffers/pointers from diff_baton all
> the time, and still have everything pointing correctly after
> eliminating prefix/suffix. But that could give vastly different
> results in some cases, for instance when original file is entirely
> identical to both the prefix and the suffix of the modified file. So I
> decided it's best to stick with first prefix, then suffix).

What Hyrum said.  How common /is/ this case?  And, anyway, in that case
both "everything was appended" and "everything was prepended" are
equally legitimate diffs.

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Johan Corveleyn <jc...@gmail.com>.

On Tue, Oct 12, 2010 at 12:10 PM, Julian Foad <ju...@wandisco.com> wrote:
> On Tue, 2010-10-12 at 00:31 +0200, Johan Corveleyn wrote:
>> On Mon, Oct 11, 2010 at 11:53 AM, Julian Foad <ju...@wandisco.com> wrote:
>> > On Sat, 2010-10-09, Johan Corveleyn wrote:
>> >> On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad <ju...@wandisco.com> wrote:
>> >> > So I wrote a patch - attached - that refactors this into an array of 4
>> >> > sub-structures, and simplifies all the code that uses them.
>> > [...]
>> >> Yes, great idea! That would indeed vastly simplify a lot of the code.
>> >> So please go ahead and commit the refactoring.
>> >
>> > OK, committed in r1021282.
>>
>> Thanks, looks much more manageable now.
>
> I'd like to see a simplified version of your last patch, taking
> advantage of that, before you go exploring other options.

Ok, here's a new version of the patch, taking advantage of your
file_info refactoring. This vastly simplifies the code, so that it
might actually be understandable now :-).

Other things I've done in this version:
1) Generalized everything to handle an array of datasources/files,
instead of just two. This makes it slightly more complex here and
there (using for loops everywhere), but I think it's ok, and it's also
more consistent/generic. If anyone has better ideas to do those for
loops, suggestions welcome.

This makes the algorithm usable by diff3 as well (and diff4 if needed
(?)). I have not yet enabled it for diff3, because I haven't yet
understood how it handles the generation of its diff output (needs to
take into account the prefix_lines. I tried some quick hacks, but lots
of tests were failing, so I'll have to look more into it -> that's for
a follow up patch). When I can enable it for diff3 (and diff4), I can
remove datasource_open (with one datasource).

2) Removed get_prefix_lines from svn_diff_fns_t (and its
implementations in diff_file.c and diff_memory.c). Instead I pass
prefix_lines directly to token.c#svn_diff__get_tokens.

3) If prefix scanning ended in the last chunk, the suffix scanning now
reuses that buffer which already contains the last chunk. As a special
case, this also avoids reading the file twice if it's smaller than 128
Kb.

4) Added doc strings everywhere. Feel free to edit those, I'm new at
documenting things in svn.

Still TODO:
- revv svn_diff_fns_t and maybe other stuff I've changed in public API.
- See if implementing the critical parts of increment_pointers and
decrement_pointers in a macro improves performance.
- Add support for -x-b, -x-w, and -x--ignore-eol-style options. For
this (and for other reasons), I'd still like to investigate pushing
this optimization into the token parsing/handling layer, to extract
entire tokens etc., even if this means the current patch has to be
thrown away. I'll take this up in a separate thread.

Log message:
[[[
Make svn_diff skip identical prefix and suffix to make diff and blame faster.

* subversion/include/svn_diff.h
  (svn_diff_fns_t): Added new function type datasources_open to the vtable.

* subversion/libsvn_diff/diff_memory.c
  (datasources_open): New function (does nothing).
  (svn_diff__mem_vtable): Added new function datasources_open.

* subversion/libsvn_diff/diff_file.c
  (svn_diff__file_baton_t): Added member prefix_lines, and inside the
   struct file_info the members suffix_start_chunk and suffix_offset_in_chunk.
  (increment_pointers, decrement_pointers): New functions.
  (is_one_at_bof, is_one_at_eof): New functions.
  (find_identical_prefix, find_identical_suffix): New functions.
  (datasources_open): New function, to open multiple datasources and find
   their identical prefix and suffix, so these can be excluded from the rest
   of the diff algorithm, as a performance optimization. From the identical
   suffix, 50 lines are kept to help the diff algorithm find the nicest
   possible diff representation in case of ambiguity.
  (datasource_get_next_token): Stop at start of identical suffix.
  (svn_diff__file_vtable): Added new function datasources_open.

* subversion/libsvn_diff/diff.h
  (svn_diff__get_tokens): Added argument "datasource_opened", to indicate that
   the datasource was already opened, and argument "prefix_lines", the number
   of identical prefix lines.and use
   this as the starting offset for the token we're getting.

* subversion/libsvn_diff/token.c
  (svn_diff__get_tokens): Added arguments "datasource_opened" and
   "prefix_lines". Only open the datasource if datasource_opened is FALSE.
   Set the starting offset of the position list to the number of prefix_lines.

* subversion/libsvn_diff/lcs.c
  (svn_diff__lcs): Added argument "prefix_lines". Use this to correctly set
   the offset of the sentinel position for EOF, even if one of the files
   became empty after eliminating the identical prefix.

* subversion/libsvn_diff/diff.c
  (svn_diff__diff): Add a chunk of "common" diff for identical prefix.
  (svn_diff_diff): Use new function datasources_open to open original and
   modified at once and find their identical prefix and suffix. Pass
   prefix_lines to svn_diff__get_tokens, svn_diff__lcs and to svn_diff__diff.

* subversion/libsvn_diff/diff3.c
  (svn_diff_diff3): Pass datasource_opened = FALSE and prefix_lines = 0 to
   svn_diff__get_tokens. Pass prefix_lines = 0 to svn_diff__lcs.

* subversion/libsvn_diff/diff4.c
  (svn_diff_diff4): Pass datasource_opened = FALSE and prefix_lines = 0 to
   svn_diff__get_tokens. Pass prefix_lines = 0 to svn_diff__lcs.
]]]

Cheers,
-- 
Johan

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Johan Corveleyn <jc...@gmail.com>.

On Tue, Oct 12, 2010 at 12:10 PM, Julian Foad <ju...@wandisco.com> wrote:
> On Tue, 2010-10-12 at 00:31 +0200, Johan Corveleyn wrote:
>> On Mon, Oct 11, 2010 at 11:53 AM, Julian Foad <ju...@wandisco.com> wrote:
>> > On Sat, 2010-10-09, Johan Corveleyn wrote:
>> >> On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad <ju...@wandisco.com> wrote:
>> >> > So I wrote a patch - attached - that refactors this into an array of 4
>> >> > sub-structures, and simplifies all the code that uses them.
>> > [...]
>> >> Yes, great idea! That would indeed vastly simplify a lot of the code.
>> >> So please go ahead and commit the refactoring.
>> >
>> > OK, committed in r1021282.
>>
>> Thanks, looks much more manageable now.
>
> I'd like to see a simplified version of your last patch, taking
> advantage of that, before you go exploring other options.

Ok, here's a new version of the patch, taking advantage of your
file_info refactoring. This vastly simplifies the code, so that it
might actually be understandable now :-).

Other things I've done in this version:
1) Generalized everything to handle an array of datasources/files,
instead of just two. This makes it slightly more complex here and
there (using for loops everywhere), but I think it's ok, and it's also
more consistent/generic. If anyone has better ideas to do those for
loops, suggestions welcome.

This makes the algorithm usable by diff3 as well (and diff4 if needed
(?)). I have not yet enabled it for diff3, because I haven't yet
understood how it handles the generation of its diff output (needs to
take into account the prefix_lines. I tried some quick hacks, but lots
of tests were failing, so I'll have to look more into it -> that's for
a follow up patch). When I can enable it for diff3 (and diff4), I can
remove datasource_open (with one datasource).

2) Removed get_prefix_lines from svn_diff_fns_t (and its
implementations in diff_file.c and diff_memory.c). Instead I pass
prefix_lines directly to token.c#svn_diff__get_tokens.

3) If prefix scanning ended in the last chunk, the suffix scanning now
reuses that buffer which already contains the last chunk. As a special
case, this also avoids reading the file twice if it's smaller than 128
Kb.

4) Added doc strings everywhere. Feel free to edit those, I'm new at
documenting things in svn.

Still TODO:
- revv svn_diff_fns_t and maybe other stuff I've changed in public API.
- See if implementing the critical parts of increment_pointers and
decrement_pointers in a macro improves performance.
- Add support for -x-b, -x-w, and -x--ignore-eol-style options. For
this (and for other reasons), I'd still like to investigate pushing
this optimization into the token parsing/handling layer, to extract
entire tokens etc., even if this means the current patch has to be
thrown away. I'll take this up in a separate thread.

Log message:
[[[
Make svn_diff skip identical prefix and suffix to make diff and blame faster.

* subversion/include/svn_diff.h
  (svn_diff_fns_t): Added new function type datasources_open to the vtable.

* subversion/libsvn_diff/diff_memory.c
  (datasources_open): New function (does nothing).
  (svn_diff__mem_vtable): Added new function datasources_open.

* subversion/libsvn_diff/diff_file.c
  (svn_diff__file_baton_t): Added member prefix_lines, and inside the
   struct file_info the members suffix_start_chunk and suffix_offset_in_chunk.
  (increment_pointers, decrement_pointers): New functions.
  (is_one_at_bof, is_one_at_eof): New functions.
  (find_identical_prefix, find_identical_suffix): New functions.
  (datasources_open): New function, to open multiple datasources and find
   their identical prefix and suffix, so these can be excluded from the rest
   of the diff algorithm, as a performance optimization. From the identical
   suffix, 50 lines are kept to help the diff algorithm find the nicest
   possible diff representation in case of ambiguity.
  (datasource_get_next_token): Stop at start of identical suffix.
  (svn_diff__file_vtable): Added new function datasources_open.

* subversion/libsvn_diff/diff.h
  (svn_diff__get_tokens): Added argument "datasource_opened", to indicate that
   the datasource was already opened, and argument "prefix_lines", the number
   of identical prefix lines.and use
   this as the starting offset for the token we're getting.

* subversion/libsvn_diff/token.c
  (svn_diff__get_tokens): Added arguments "datasource_opened" and
   "prefix_lines". Only open the datasource if datasource_opened is FALSE.
   Set the starting offset of the position list to the number of prefix_lines.

* subversion/libsvn_diff/lcs.c
  (svn_diff__lcs): Added argument "prefix_lines". Use this to correctly set
   the offset of the sentinel position for EOF, even if one of the files
   became empty after eliminating the identical prefix.

* subversion/libsvn_diff/diff.c
  (svn_diff__diff): Add a chunk of "common" diff for identical prefix.
  (svn_diff_diff): Use new function datasources_open to open original and
   modified at once and find their identical prefix and suffix. Pass
   prefix_lines to svn_diff__get_tokens, svn_diff__lcs and to svn_diff__diff.

* subversion/libsvn_diff/diff3.c
  (svn_diff_diff3): Pass datasource_opened = FALSE and prefix_lines = 0 to
   svn_diff__get_tokens. Pass prefix_lines = 0 to svn_diff__lcs.

* subversion/libsvn_diff/diff4.c
  (svn_diff_diff4): Pass datasource_opened = FALSE and prefix_lines = 0 to
   svn_diff__get_tokens. Pass prefix_lines = 0 to svn_diff__lcs.
]]]

Cheers,
-- 
Johan

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Julian Foad <ju...@wandisco.com>.

On Tue, 2010-10-12, Johan Corveleyn wrote:
> On Tue, Oct 12, 2010 at 12:10 PM, Julian Foad <ju...@wandisco.com> wrote:
> > On Tue, 2010-10-12 at 00:31 +0200, Johan Corveleyn wrote:
> >> On Mon, Oct 11, 2010 at 11:53 AM, Julian Foad <ju...@wandisco.com> wrote:
> >> > On Sat, 2010-10-09, Johan Corveleyn wrote:
> >> >> On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad <ju...@wandisco.com> wrote:
> >> >> > So I wrote a patch - attached - that refactors this into an array of 4
> >> >> > sub-structures, and simplifies all the code that uses them.
> >> > [...]
> >> >> Yes, great idea! That would indeed vastly simplify a lot of the code.
> >> >> So please go ahead and commit the refactoring.
> >> >
> >> > OK, committed in r1021282.
> >>
> >> Thanks, looks much more manageable now.
> >
> > I'd like to see a simplified version of your last patch, taking
> > advantage of that, before you go exploring other options.
> 
> Yes, I'll certainly do that in the coming days.
> 
> >> >> Also, maybe the last_chunk number could be included in the file_info
> >> >> struct? Now it's calculated in several places: "last_chunk0 =
> >> >> offset_to_chunk(file_baton->size[idx0])", or I have to pass it on
> >> >> every time as an extra argument. Seems like the sort of info that
> >> >> could be part of file_info.
> >> >
> >> > It looks to me like you only need to calculate it in exactly one place:
> >> > in increment_pointer_or_chunk().  I wouldn't worry about pre-calculating
> >> > for efficiency, as it's a trivial calculation.
> >>
> >> Hm, I don't know. That's recalculating it for every byte in the
> >> prefix. In the files I'm testing there could be 1 Mb or more of
> >
> > No, no, not every byte - only 1 in 131072 of the calls need to calculate
> > it - only when it reaches the end of a block.
> 
> Oh right. I guess it's ok then.
> 
> > May I ask whether you've tested the code that crosses from one chunk to
> > another?  I noticed the following, just now:
> >
> >> +      *at_least_one_end_reached =
> >> +        file_baton->curp[idx0] == file_baton->endp[idx0]
> >> +        || file_baton->curp[idx1] == file_baton->endp[idx1];
> >> +    }
> >> +
> >> +  /* If both files reached their end (i.e. are fully identical),
> >> we're done */
> >> +  if (file_baton->curp[idx0] == file_baton->endp[idx0]
> >> +        && file_baton->curp[idx1] == file_baton->endp[idx1])
> >
> > Those tests seem to be saying "if we're at the end of the current chunk
> > then we're at the end of the file".  That looks wrong.  What do you
> > think?
> 
> Well, it does work :-). The code is correct, but it's kind of
> difficult to see, so that's my fault.
> 
> The reason is that increment_pointer_or_chunk takes care that curp
> never equals endp (if curp==endp-1 it reads the next chunk), unless it
> reaches end of file. See this snippet of increment_pointer_or_chunk:
> +      if (*chunk_number == last_chunk_number)
> +        (*curp)++; /* *curp == *endp with last chunk signals end of file */
> 
> If you think this is too obscure to handle this, I agree, but I
> couldn't think of a better/easier way at that time. If you have any
> suggestions, feel free :-). But maybe I should refactor the patch
> first, so it's easier to see how this would work out best.
> 
> The same "obscurity" also applies to reaching "beginning of file"
> (which can happen in decrement_pointer_or_chunk, either during prefix
> scanning (rolling back the first line) or during suffix scanning (if
> there was no prefix, but still a difference in the first line)). There
> it's done by setting the chunk number to -1, which is checked for in
> find_identical_prefix and find_identical_suffix:
> +      if (*chunk_number == 0)
> +        (*chunk_number)--; /* *chunk_number == -1 signals beginning of file */
> 
> Not ideal, but it works ...

Oh OK, thanks for explaining.  No further comment, m'lord :-)

- Julian

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Johan Corveleyn <jc...@gmail.com>.

On Tue, Oct 12, 2010 at 12:10 PM, Julian Foad <ju...@wandisco.com> wrote:
> On Tue, 2010-10-12 at 00:31 +0200, Johan Corveleyn wrote:
>> On Mon, Oct 11, 2010 at 11:53 AM, Julian Foad <ju...@wandisco.com> wrote:
>> > On Sat, 2010-10-09, Johan Corveleyn wrote:
>> >> On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad <ju...@wandisco.com> wrote:
>> >> > So I wrote a patch - attached - that refactors this into an array of 4
>> >> > sub-structures, and simplifies all the code that uses them.
>> > [...]
>> >> Yes, great idea! That would indeed vastly simplify a lot of the code.
>> >> So please go ahead and commit the refactoring.
>> >
>> > OK, committed in r1021282.
>>
>> Thanks, looks much more manageable now.
>
> I'd like to see a simplified version of your last patch, taking
> advantage of that, before you go exploring other options.

Yes, I'll certainly do that in the coming days.

>> >> Also, maybe the last_chunk number could be included in the file_info
>> >> struct? Now it's calculated in several places: "last_chunk0 =
>> >> offset_to_chunk(file_baton->size[idx0])", or I have to pass it on
>> >> every time as an extra argument. Seems like the sort of info that
>> >> could be part of file_info.
>> >
>> > It looks to me like you only need to calculate it in exactly one place:
>> > in increment_pointer_or_chunk().  I wouldn't worry about pre-calculating
>> > for efficiency, as it's a trivial calculation.
>>
>> Hm, I don't know. That's recalculating it for every byte in the
>> prefix. In the files I'm testing there could be 1 Mb or more of
>
> No, no, not every byte - only 1 in 131072 of the calls need to calculate
> it - only when it reaches the end of a block.

Oh right. I guess it's ok then.

> May I ask whether you've tested the code that crosses from one chunk to
> another?  I noticed the following, just now:
>
>> +      *at_least_one_end_reached =
>> +        file_baton->curp[idx0] == file_baton->endp[idx0]
>> +        || file_baton->curp[idx1] == file_baton->endp[idx1];
>> +    }
>> +
>> +  /* If both files reached their end (i.e. are fully identical),
>> we're done */
>> +  if (file_baton->curp[idx0] == file_baton->endp[idx0]
>> +        && file_baton->curp[idx1] == file_baton->endp[idx1])
>
> Those tests seem to be saying "if we're at the end of the current chunk
> then we're at the end of the file".  That looks wrong.  What do you
> think?

Well, it does work :-). The code is correct, but it's kind of
difficult to see, so that's my fault.

The reason is that increment_pointer_or_chunk takes care that curp
never equals endp (if curp==endp-1 it reads the next chunk), unless it
reaches end of file. See this snippet of increment_pointer_or_chunk:
+      if (*chunk_number == last_chunk_number)
+        (*curp)++; /* *curp == *endp with last chunk signals end of file */

If you think this is too obscure to handle this, I agree, but I
couldn't think of a better/easier way at that time. If you have any
suggestions, feel free :-). But maybe I should refactor the patch
first, so it's easier to see how this would work out best.

The same "obscurity" also applies to reaching "beginning of file"
(which can happen in decrement_pointer_or_chunk, either during prefix
scanning (rolling back the first line) or during suffix scanning (if
there was no prefix, but still a difference in the first line)). There
it's done by setting the chunk number to -1, which is checked for in
find_identical_prefix and find_identical_suffix:
+      if (*chunk_number == 0)
+        (*chunk_number)--; /* *chunk_number == -1 signals beginning of file */

Not ideal, but it works ...

Cheers,
-- 
Johan

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Julian Foad <ju...@wandisco.com>.

On Tue, 2010-10-12 at 00:31 +0200, Johan Corveleyn wrote:
> On Mon, Oct 11, 2010 at 11:53 AM, Julian Foad <ju...@wandisco.com> wrote:
> > On Sat, 2010-10-09, Johan Corveleyn wrote:
> >> On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad <ju...@wandisco.com> wrote:
> >> > So I wrote a patch - attached - that refactors this into an array of 4
> >> > sub-structures, and simplifies all the code that uses them.
> > [...]
> >> Yes, great idea! That would indeed vastly simplify a lot of the code.
> >> So please go ahead and commit the refactoring.
> >
> > OK, committed in r1021282.
> 
> Thanks, looks much more manageable now.

I'd like to see a simplified version of your last patch, taking
advantage of that, before you go exploring other options.

> >> Also, maybe the last_chunk number could be included in the file_info
> >> struct? Now it's calculated in several places: "last_chunk0 =
> >> offset_to_chunk(file_baton->size[idx0])", or I have to pass it on
> >> every time as an extra argument. Seems like the sort of info that
> >> could be part of file_info.
> >
> > It looks to me like you only need to calculate it in exactly one place:
> > in increment_pointer_or_chunk().  I wouldn't worry about pre-calculating
> > for efficiency, as it's a trivial calculation.
> 
> Hm, I don't know. That's recalculating it for every byte in the
> prefix. In the files I'm testing there could be 1 Mb or more of

No, no, not every byte - only 1 in 131072 of the calls need to calculate
it - only when it reaches the end of a block.


May I ask whether you've tested the code that crosses from one chunk to
another?  I noticed the following, just now:

> +      *at_least_one_end_reached = 
> +        file_baton->curp[idx0] == file_baton->endp[idx0] 
> +        || file_baton->curp[idx1] == file_baton->endp[idx1];
> +    }
> +
> +  /* If both files reached their end (i.e. are fully identical),
> we're done */
> +  if (file_baton->curp[idx0] == file_baton->endp[idx0] 
> +        && file_baton->curp[idx1] == file_baton->endp[idx1])

Those tests seem to be saying "if we're at the end of the current chunk
then we're at the end of the file".  That looks wrong.  What do you
think?

In order to test this more easily than creating huge files, I wonder if
you can change CHUNK_SIZE and CHUNK_SHIFT to something much smaller such
as 8 bytes.  I'm trying this now - not with your patch, first, but with
just a plain trunk build, to see if it works.

- Julian



> > And in a later email you wrote:
> >> [...] Maybe I can give it a go,
> >> first suffix then prefix, and see if I can find real-life problems ...
> >
> > There's no need to re-visit that.  I think it's fine as is it, using a
> > separate pair of buffers.
> 
> I've been thinking some more about that. For relatively small files,
> smaller than 1 chunk (128 Kb), this seems wasteful. If the file is 127
> Kb, it all fits in one chunk, yet I still read it twice, once for
> suffix and once for prefix (and further token processing). Ok, caching
> by the OS might make this insignificant, but still.
> 
> Since most common source files in repositories are probably smaller
> than 128 Kb I might be hurting the common case, in cases where it's
> not compensated by performance gain from identical prefix/suffix.
> 
> Then again, maybe this can be solved with a specific optimization,
> without changing the current code too much: if file size is smaller
> than chunk size, I point the suffix buffer to the prefix buffer and
> I'm done.

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Johan Corveleyn <jc...@gmail.com>.

On Mon, Oct 11, 2010 at 11:53 AM, Julian Foad <ju...@wandisco.com> wrote:
> On Sat, 2010-10-09, Johan Corveleyn wrote:
>> On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad <ju...@wandisco.com> wrote:
>> > So I wrote a patch - attached - that refactors this into an array of 4
>> > sub-structures, and simplifies all the code that uses them.
> [...]
>> Yes, great idea! That would indeed vastly simplify a lot of the code.
>> So please go ahead and commit the refactoring.
>
> OK, committed in r1021282.

Thanks, looks much more manageable now.

>> Also, maybe the last_chunk number could be included in the file_info
>> struct? Now it's calculated in several places: "last_chunk0 =
>> offset_to_chunk(file_baton->size[idx0])", or I have to pass it on
>> every time as an extra argument. Seems like the sort of info that
>> could be part of file_info.
>
> It looks to me like you only need to calculate it in exactly one place:
> in increment_pointer_or_chunk().  I wouldn't worry about pre-calculating
> for efficiency, as it's a trivial calculation.

Hm, I don't know. That's recalculating it for every byte in the
prefix. In the files I'm testing there could be 1 Mb or more of
prefix, so 1 million times this calculation (and that for every diff
in say 2000 diffs for a blame operation). Unless the compiler can
optimize that to a single calculation, because it never changes (I
don't know enough about compilers to know if this is possible), it
might have an impact, even if it's a trivial calculation.

OTOH, I might be in premature-optimization-land here, the impact might
be negligible. If I find some time I'll test it.

> And in a later email you wrote:
>> [...] Maybe I can give it a go,
>> first suffix then prefix, and see if I can find real-life problems ...
>
> There's no need to re-visit that.  I think it's fine as is it, using a
> separate pair of buffers.

I've been thinking some more about that. For relatively small files,
smaller than 1 chunk (128 Kb), this seems wasteful. If the file is 127
Kb, it all fits in one chunk, yet I still read it twice, once for
suffix and once for prefix (and further token processing). Ok, caching
by the OS might make this insignificant, but still.

Since most common source files in repositories are probably smaller
than 128 Kb I might be hurting the common case, in cases where it's
not compensated by performance gain from identical prefix/suffix.

Then again, maybe this can be solved with a specific optimization,
without changing the current code too much: if file size is smaller
than chunk size, I point the suffix buffer to the prefix buffer and
I'm done.

Cheers,
-- 
Johan

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Julian Foad <ju...@wandisco.com>.

On Sat, 2010-10-09, Johan Corveleyn wrote:
> On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad <ju...@wandisco.com> wrote:
> > On Sat, 2010-10-09, Johan Corveleyn wrote:
> >> Ok, third iteration of the patch in attachment. It passes make check.
> >>
> >> As discussed in [1], this version keeps 50 lines of the identical
> >> suffix around, to give the algorithm a good chance to generate a diff
> >> output of good quality (in all but the most extreme cases, this will
> >> be the same as with the original svn_diff algorithm).
> >>
> >> That's about the only difference with the previous iteration. So for
> >> now, I'm submitting this for review. Any feedback is very welcome :-).
> >
> > Hi Johan.
> 
> Hi Julian,
> 
> Thanks for taking a look.
> 
> > I haven't reviewed it, but after seeing today's discussion I had just
> > scrolled quickly through the previous version of this patch.  I noticed
> > that the two main functions - find_identical_suffix and
> > find_identical_suffix - are both quite similar (but not quite similar
> > enough to make them easily share implementation) and both quite long,
> > and I noticed you wrote in an earlier email that you were finding it
> > hard to make the code readable.  I have a suggestion that may help.
[...]
> > So I wrote a patch - attached - that refactors this into an array of 4
> > sub-structures, and simplifies all the code that uses them.
[...]
> Yes, great idea! That would indeed vastly simplify a lot of the code.
> So please go ahead and commit the refactoring.

OK, committed in r1021282.


> Also, maybe the last_chunk number could be included in the file_info
> struct? Now it's calculated in several places: "last_chunk0 =
> offset_to_chunk(file_baton->size[idx0])", or I have to pass it on
> every time as an extra argument. Seems like the sort of info that
> could be part of file_info.

It looks to me like you only need to calculate it in exactly one place:
in increment_pointer_or_chunk().  I wouldn't worry about pre-calculating
for efficiency, as it's a trivial calculation.


> One more thing: you might have noticed that for find_identical_suffix
> I use other buffers, chunks, curp's, endp's, ... than for the prefix.
> For prefix scanning I can just use the stuff from the diff_baton,
> because after prefix scanning has finished, everything is buffered and
> pointing correctly for the normal algorithm to continue (i.e.
> everything points at the first byte of the first non-identical line).
> For suffix scanning I need to use other structures (newly alloc'd
> buffer etc), so as to not mess with those pointers/buffers from the
> diff_baton.

> Maybe I can give it a go,
> first suffix then prefix, and see if I can find real-life problems ...

> So: I think I'll need the file_info struct to be available out of the
> diff_baton_t struct as well, so I can use this in suffix scanning
> also.

Yes; I gave the struct type a name - "struct file_info" - so you can use
it in other places.

> (side-note: I considered first doing suffix scanning, then prefix
> scanning, so I could reuse the buffers/pointers from diff_baton all
> the time, and still have everything pointing correctly after
> eliminating prefix/suffix. But that could give vastly different
> results in some cases, for instance when original file is entirely
> identical to both the prefix and the suffix of the modified file. So I
> decided it's best to stick with first prefix, then suffix).

And in a later email you wrote:
> [...] Maybe I can give it a go,
> first suffix then prefix, and see if I can find real-life problems ...

There's no need to re-visit that.  I think it's fine as is it, using a
separate pair of buffers.


> > Responding to some of the other points you mentioned in a much earlier
> > mail:
> >
> >> 3) It's only implemented for 2 files. I'd like to generalize this for
> >> an array of datasources, so it can also be used for diff3 and diff4.
> >>
> >> 4) As a small hack, I had to add a flag "datasource_opened" to
> >> token.c#svn_diff__get_tokens, so I could have different behavior for
> >> regular diff vs. diff3 and diff4. If 3) gets implemented, this hack is
> >> no longer needed.
> >
> > Yes, I'd like to see 3), and so hack 4) will go away.
> 
> I'm wondering though how I should represent the datasources to pass
> into datasources_open. An array combined with a length parameter?
> Something like:
> 
>     static svn_error_t *
>     datasources_open(void *baton, apr_off_t *prefix_lines,
>                      svn_diff_datasource_e[] datasources, int datasources_len)
> 
> ? And then use for loops everywhere I now do things twice for the two
> datasources?

I haven't looked at how datasources_open() API is used or what form of
API would be best.

For the internal functions, you can now pass around an array of
file_info pointers rather than indexes.


> >> 5) I've struggled with making the code readable/understandable. It's
> >> likely that there is still a lot of room for improvement. I also
> >> probably need to document it some more.
> >
> > You need to write a full doc string for datasources_open(), at least.
> > It needs especially to say how it relates to datasource_open() - why
> > should the caller call this function rather than that function, and are
> > they both necessary - and how the caller is supposed to use the
> > 'prefix_lines' parameter.  And doc strings for
> > increment/decrement_...().
> >
> > But this makes me think, it looks to me like this whole
> > prefix-suffix-skipping functionality would fit better inside the
> > lower-level diff algorithm rather than inside the "datasources_open"
> > function.  Do you agree?
> >
> > It works as it is, but you were talking about wanting it to obey the
> > standard token-parsing rules such as "ignore white space", and so on.
> > It seems that things like this would be much better one level down.
> 
> Yes, I've been struggling with this. But I can't easily see it fit in
> the lower levels right now. Problem is that everything in those lower
> levels always acts on 1 datasource at a time (getting all the tokens,
> inserting them into the token tree, ... then the same for the next
> datasource). The datasource_open seemed to me to be the easiest place
> to combine datasources to do things for both of them "concurrently"
> (with least impact on the rest of the code).

I expect you're right.  You've studied this; I haven't.

> Maybe those lower-level functions could also be made
> "multi-datasource", but I have to think a little bit more about that.

My gut feeling is that's probably not the way to go.

I'll respond separately to your follow-up email about this.


> >> 6) I've not paid too much attention to low level optimizations, so
> >> here also there's probably room for improvement, which may be
> >> significant for the critical loops.
> >
> > Maybe. Not important at this stage.
> >
> >> 7) There are two warnings left "conversion from 'apr_off_t' to 'int'",
> >> in diff_file.c#find_identical_suffix. To completely eliminate this, I
> >> should probably make all "chunks" of type apr_off_t instead of int
> >> (but I have not done that yet, because the original implementation
> >> also used int for the chunk in the svn_diff__file_baton_t struct).
> >> Should I do this (also in the baton struct)? Or should I use an
> >> explicit cast somewhere?
> >
> > I recommend using an explicit cast where needed, in this patch.
> > Changing the 'chunk' data type everywhere could be done in a separate
> > patch, either before or after this patch.
> 
> Ok, will do.
> 
> One more thing I was wondering about: maybe increment_... and
> decrement_... should (eventually) be implemented as macro's? In fact,
> they are mostly just replacements for curp++ and curp-- but taking
> into account that we have chunks. Just a thought ...

If profiling shows that the function-call overghead is too high, then I
suggest that you make just the simple part of it a macro, like this: "if
((f)->curp < (f)->endp - 1) (f)->curp++; else function_call(...);".

> Thanks again for your input.

My pleasure.

- Julian

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Julian Foad <ju...@wandisco.com>.

On Sun, 2010-10-10 at 23:43 +0200, Johan Corveleyn wrote:
> On Sat, Oct 9, 2010 at 2:21 PM, Johan Corveleyn <jc...@gmail.com> wrote:
> > On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad <ju...@wandisco.com> wrote:
> >> But this makes me think, it looks to me like this whole
> >> prefix-suffix-skipping functionality would fit better inside the
> >> lower-level diff algorithm rather than inside the "datasources_open"
> >> function.  Do you agree?
> >>
> >> It works as it is, but you were talking about wanting it to obey the
> >> standard token-parsing rules such as "ignore white space", and so on.
> >> It seems that things like this would be much better one level down.
> >
> > Yes, I've been struggling with this. But I can't easily see it fit in
> > the lower levels right now. Problem is that everything in those lower
> > levels always acts on 1 datasource at a time (getting all the tokens,
> > inserting them into the token tree, ... then the same for the next
> > datasource). The datasource_open seemed to me to be the easiest place
> > to combine datasources to do things for both of them "concurrently"
> > (with least impact on the rest of the code).
> >
> > Maybe those lower-level functions could also be made
> > "multi-datasource", but I have to think a little bit more about that.
> 
> I've thought a little bit more about this, going through the code, and
> yeah, it might be a good idea to push the prefix/suffix scanning into
> the lower-level parts of the diff-algorithm (the token parsing /
> building the token tree).
> 
> Something like:
> - Make token.c#svn_diff__get_tokens take multiple datasources.
> - In diff.c, diff3.c and diff4.c replace the multiple calls to this
> function to one call passing multiple datasources.
> - token.c#svn_diff__get_tokens (with multiple datasources) will take
> care of identical prefix and suffix scanning based on "tokens" (so can
> take advantage of the standard token-parsing rules with ignore-*
> options, by simply calling svn_diff_fns_t.datasource_get_next_token).

One of the improvements we're looking for is to make use of the already
existing diff options - ignore-white-space etc.

We can get that improvement with a much smaller change: simply by
calling the 'get_next_token' routine, or rather a part of it, from
within your current 'find_identical_prefix' function, without touching
any of the generic diffN.c/token.c code and APIs.

> Some things needed to support this:
> - svn_diff_fns_t.datasource_get_next_token: calculation of the hash
> should be made conditional (I don't want to waste time for the adler32
> hash for lines that are not going to be put in the token tree).

Yes.  If you take this "smaller change" approach, then the way to do
this would be to factor out the non-adler part of
'datasource_get_next_token' into a separate private function, and call
that.

> - For suffix scanning, I'll need another variation of
> datasource_get_next_token to get tokens from the end going backwards
> (say "datasource_get_previous_token"). No hash needed.

Yes.

> Does that make sense? Or am I missing it completely?

I can't really comment on the question of moving this right down into
the diffN.c/token.c code, as I don't have time to study that right now.
The possible benefits I can see are:

* Sharing this feature across all kinds of diff implementations
(currently: file and in-memory-string).  Good from a practical POV if
and only if really long strings are being compared in memory.  Good from
a code-structural POV.

* I'm curious about whether it would be possible to integrate prefix
skipping into the main algorithm in such a way that the algorithm would
see the skipped prefix as a possible match, just like any other chunk
(including its adler32), but in a much quicker way than the algorithm
currently finds such a prefix.  If so, it would be able to handle better
the cases where one file has added a prefix that is a duplicate of
existing text.  Same for the suffix.

I'm really not sure whether it's worth pursuing that line of
investigation.  If you do, and it turns out well, that would be great.
If you want to just stick to the implementation in diff_file.c, that
seems like a more practical place to stop, now that you have basically
solved the problem.

- Julian

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Johan Corveleyn <jc...@gmail.com>.

On Sat, Oct 9, 2010 at 2:21 PM, Johan Corveleyn <jc...@gmail.com> wrote:
> On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad <ju...@wandisco.com> wrote:
>> But this makes me think, it looks to me like this whole
>> prefix-suffix-skipping functionality would fit better inside the
>> lower-level diff algorithm rather than inside the "datasources_open"
>> function.  Do you agree?
>>
>> It works as it is, but you were talking about wanting it to obey the
>> standard token-parsing rules such as "ignore white space", and so on.
>> It seems that things like this would be much better one level down.
>
> Yes, I've been struggling with this. But I can't easily see it fit in
> the lower levels right now. Problem is that everything in those lower
> levels always acts on 1 datasource at a time (getting all the tokens,
> inserting them into the token tree, ... then the same for the next
> datasource). The datasource_open seemed to me to be the easiest place
> to combine datasources to do things for both of them "concurrently"
> (with least impact on the rest of the code).
>
> Maybe those lower-level functions could also be made
> "multi-datasource", but I have to think a little bit more about that.

I've thought a little bit more about this, going through the code, and
yeah, it might be a good idea to push the prefix/suffix scanning into
the lower-level parts of the diff-algorithm (the token parsing /
building the token tree).

Something like:
- Make token.c#svn_diff__get_tokens take multiple datasources.
- In diff.c, diff3.c and diff4.c replace the multiple calls to this
function to one call passing multiple datasources.
- token.c#svn_diff__get_tokens (with multiple datasources) will take
care of identical prefix and suffix scanning based on "tokens" (so can
take advantage of the standard token-parsing rules with ignore-*
options, by simply calling svn_diff_fns_t.datasource_get_next_token).

Some things needed to support this:
- svn_diff_fns_t.datasource_get_next_token: calculation of the hash
should be made conditional (I don't want to waste time for the adler32
hash for lines that are not going to be put in the token tree).
- For suffix scanning, I'll need another variation of
datasource_get_next_token to get tokens from the end going backwards
(say "datasource_get_previous_token"). No hash needed.

Does that make sense? Or am I missing it completely?

It would mean I'd have to throw away the current patch and rewrite it
in the token layer (but I don't really mind, as long as a good
solution takes its place :-)). And your file_info refactoring would
also not be used (although it's certainly a worthwhile refactoring by
itself, making the current code clearer).

Cheers,
-- 
Johan

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Johan Corveleyn <jc...@gmail.com>.

On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad <ju...@wandisco.com> wrote:
> On Sat, 2010-10-09, Johan Corveleyn wrote:
>> Ok, third iteration of the patch in attachment. It passes make check.
>>
>> As discussed in [1], this version keeps 50 lines of the identical
>> suffix around, to give the algorithm a good chance to generate a diff
>> output of good quality (in all but the most extreme cases, this will
>> be the same as with the original svn_diff algorithm).
>>
>> That's about the only difference with the previous iteration. So for
>> now, I'm submitting this for review. Any feedback is very welcome :-).
>
> Hi Johan.

Hi Julian,

Thanks for taking a look.

> I haven't reviewed it, but after seeing today's discussion I had just
> scrolled quickly through the previous version of this patch.  I noticed
> that the two main functions - find_identical_suffix and
> find_identical_suffix - are both quite similar (but not quite similar
> enough to make them easily share implementation) and both quite long,
> and I noticed you wrote in an earlier email that you were finding it
> hard to make the code readable.  I have a suggestion that may help.
>
> I think the existing structure of the svn_diff__file_baton_t is
> unhelpful:
> {
>  const svn_diff_file_options_t *options;
>  const char *path[4];
>
>  apr_file_t *file[4];
>  apr_off_t size[4];
>
>  int chunk[4];
>  char *buffer[4];
>  char *curp[4];
>  char *endp[4];
>
>  /* List of free tokens that may be reused. */
>  svn_diff__file_token_t *tokens;
>
>  svn_diff__normalize_state_t normalize_state[4];
>
>  apr_pool_t *pool;
> } svn_diff__file_baton_t;
>
> All those array[4] fields are logically related, but this layout forces
> the programmer to address them individually.
>
> So I wrote a patch - attached - that refactors this into an array of 4
> sub-structures, and simplifies all the code that uses them.
>
> I think this will help you to get better code clarity because then your
> increment_pointer_or_chunk() for example will be able to take a single
> pointer to a file_info structure instead of a lot of pointers to
> individual members of the same.
>
> Would you take a look and let me know if you agree.  If so, I can commit
> the refactoring straight away.

Yes, great idea! That would indeed vastly simplify a lot of the code.
So please go ahead and commit the refactoring.

Also, maybe the last_chunk number could be included in the file_info
struct? Now it's calculated in several places: "last_chunk0 =
offset_to_chunk(file_baton->size[idx0])", or I have to pass it on
every time as an extra argument. Seems like the sort of info that
could be part of file_info.

One more thing: you might have noticed that for find_identical_suffix
I use other buffers, chunks, curp's, endp's, ... than for the prefix.
For prefix scanning I can just use the stuff from the diff_baton,
because after prefix scanning has finished, everything is buffered and
pointing correctly for the normal algorithm to continue (i.e.
everything points at the first byte of the first non-identical line).
For suffix scanning I need to use other structures (newly alloc'd
buffer etc), so as to not mess with those pointers/buffers from the
diff_baton.

So: I think I'll need the file_info struct to be available out of the
diff_baton_t struct as well, so I can use this in suffix scanning
also.

(side-note: I considered first doing suffix scanning, then prefix
scanning, so I could reuse the buffers/pointers from diff_baton all
the time, and still have everything pointing correctly after
eliminating prefix/suffix. But that could give vastly different
results in some cases, for instance when original file is entirely
identical to both the prefix and the suffix of the modified file. So I
decided it's best to stick with first prefix, then suffix).

> Responding to some of the other points you mentioned in a much earlier
> mail:
>
>> 3) It's only implemented for 2 files. I'd like to generalize this for
>> an array of datasources, so it can also be used for diff3 and diff4.
>>
>> 4) As a small hack, I had to add a flag "datasource_opened" to
>> token.c#svn_diff__get_tokens, so I could have different behavior for
>> regular diff vs. diff3 and diff4. If 3) gets implemented, this hack is
>> no longer needed.
>
> Yes, I'd like to see 3), and so hack 4) will go away.

I'm wondering though how I should represent the datasources to pass
into datasources_open. An array combined with a length parameter?
Something like:

    static svn_error_t *
    datasources_open(void *baton, apr_off_t *prefix_lines,
                     svn_diff_datasource_e[] datasources, int datasources_len)

? And then use for loops everywhere I now do things twice for the two
datasources?

>> 5) I've struggled with making the code readable/understandable. It's
>> likely that there is still a lot of room for improvement. I also
>> probably need to document it some more.
>
> You need to write a full doc string for datasources_open(), at least.
> It needs especially to say how it relates to datasource_open() - why
> should the caller call this function rather than that function, and are
> they both necessary - and how the caller is supposed to use the
> 'prefix_lines' parameter.  And doc strings for
> increment/decrement_...().
>
> But this makes me think, it looks to me like this whole
> prefix-suffix-skipping functionality would fit better inside the
> lower-level diff algorithm rather than inside the "datasources_open"
> function.  Do you agree?
>
> It works as it is, but you were talking about wanting it to obey the
> standard token-parsing rules such as "ignore white space", and so on.
> It seems that things like this would be much better one level down.

Yes, I've been struggling with this. But I can't easily see it fit in
the lower levels right now. Problem is that everything in those lower
levels always acts on 1 datasource at a time (getting all the tokens,
inserting them into the token tree, ... then the same for the next
datasource). The datasource_open seemed to me to be the easiest place
to combine datasources to do things for both of them "concurrently"
(with least impact on the rest of the code).

Maybe those lower-level functions could also be made
"multi-datasource", but I have to think a little bit more about that.

>> 6) I've not paid too much attention to low level optimizations, so
>> here also there's probably room for improvement, which may be
>> significant for the critical loops.
>
> Maybe. Not important at this stage.
>
>> 7) There are two warnings left "conversion from 'apr_off_t' to 'int'",
>> in diff_file.c#find_identical_suffix. To completely eliminate this, I
>> should probably make all "chunks" of type apr_off_t instead of int
>> (but I have not done that yet, because the original implementation
>> also used int for the chunk in the svn_diff__file_baton_t struct).
>> Should I do this (also in the baton struct)? Or should I use an
>> explicit cast somewhere?
>
> I recommend using an explicit cast where needed, in this patch.
> Changing the 'chunk' data type everywhere could be done in a separate
> patch, either before or after this patch.

Ok, will do.

One more thing I was wondering about: maybe increment_... and
decrement_... should (eventually) be implemented as macro's? In fact,
they are mostly just replacements for curp++ and curp-- but taking
into account that we have chunks. Just a thought ...

Thanks again for your input.

Cheers,
-- 
Johan

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Julian Foad <ju...@wandisco.com>.

On Sat, 2010-10-09, Johan Corveleyn wrote:
> Ok, third iteration of the patch in attachment. It passes make check.
> 
> As discussed in [1], this version keeps 50 lines of the identical
> suffix around, to give the algorithm a good chance to generate a diff
> output of good quality (in all but the most extreme cases, this will
> be the same as with the original svn_diff algorithm).
> 
> That's about the only difference with the previous iteration. So for
> now, I'm submitting this for review. Any feedback is very welcome :-).

Hi Johan.

I haven't reviewed it, but after seeing today's discussion I had just
scrolled quickly through the previous version of this patch.  I noticed
that the two main functions - find_identical_suffix and
find_identical_suffix - are both quite similar (but not quite similar
enough to make them easily share implementation) and both quite long,
and I noticed you wrote in an earlier email that you were finding it
hard to make the code readable.  I have a suggestion that may help.

I think the existing structure of the svn_diff__file_baton_t is
unhelpful:
{
  const svn_diff_file_options_t *options;
  const char *path[4];

  apr_file_t *file[4];
  apr_off_t size[4];

  int chunk[4];
  char *buffer[4];
  char *curp[4];
  char *endp[4];

  /* List of free tokens that may be reused. */
  svn_diff__file_token_t *tokens;

  svn_diff__normalize_state_t normalize_state[4];

  apr_pool_t *pool;
} svn_diff__file_baton_t;

All those array[4] fields are logically related, but this layout forces
the programmer to address them individually.

So I wrote a patch - attached - that refactors this into an array of 4
sub-structures, and simplifies all the code that uses them.

I think this will help you to get better code clarity because then your
increment_pointer_or_chunk() for example will be able to take a single
pointer to a file_info structure instead of a lot of pointers to
individual members of the same.

Would you take a look and let me know if you agree.  If so, I can commit
the refactoring straight away.

Responding to some of the other points you mentioned in a much earlier
mail:

> 3) It's only implemented for 2 files. I'd like to generalize this for
> an array of datasources, so it can also be used for diff3 and diff4.
> 
> 4) As a small hack, I had to add a flag "datasource_opened" to
> token.c#svn_diff__get_tokens, so I could have different behavior for
> regular diff vs. diff3 and diff4. If 3) gets implemented, this hack is
> no longer needed.

Yes, I'd like to see 3), and so hack 4) will go away.

> 5) I've struggled with making the code readable/understandable. It's
> likely that there is still a lot of room for improvement. I also
> probably need to document it some more.

You need to write a full doc string for datasources_open(), at least.
It needs especially to say how it relates to datasource_open() - why
should the caller call this function rather than that function, and are
they both necessary - and how the caller is supposed to use the
'prefix_lines' parameter.  And doc strings for
increment/decrement_...().

But this makes me think, it looks to me like this whole
prefix-suffix-skipping functionality would fit better inside the
lower-level diff algorithm rather than inside the "datasources_open"
function.  Do you agree?

It works as it is, but you were talking about wanting it to obey the
standard token-parsing rules such as "ignore white space", and so on.
It seems that things like this would be much better one level down.

> 6) I've not paid too much attention to low level optimizations, so
> here also there's probably room for improvement, which may be
> significant for the critical loops.

Maybe. Not important at this stage.

> 7) There are two warnings left "conversion from 'apr_off_t' to 'int'",
> in diff_file.c#find_identical_suffix. To completely eliminate this, I
> should probably make all "chunks" of type apr_off_t instead of int
> (but I have not done that yet, because the original implementation
> also used int for the chunk in the svn_diff__file_baton_t struct).
> Should I do this (also in the baton struct)? Or should I use an
> explicit cast somewhere?

I recommend using an explicit cast where needed, in this patch.
Changing the 'chunk' data type everywhere could be done in a separate
patch, either before or after this patch.

- Julian

> I still consider this a WIP, because of the following remaining todo's
> (which may have a lot of impact on the current implementation):
> - Generalize for more than 2 datasources (for diff3 and diff4).
> - revv svn_diff_fns_t and maybe other stuff I've changed in public API.
> - Add support for -x-b, -x-w, and -x--ignore-eol-style options. Maybe
> switch the implementation to read out entire lines before comparing
> (like datasources_get_next_token does).
> 
> Log message:
> [[[
> Make svn_diff_diff skip identical prefix and suffix to make diff and blame
> faster.
> 
> * subversion/include/svn_diff.h
>   (svn_diff_fns_t): Added new function types datasources_open and
>    get_prefix_lines to the vtable.
> 
> * subversion/libsvn_diff/diff_memory.c
>   (datasources_open): New function (does nothing).
>   (get_prefix_lines): New function (does nothing).
>   (svn_diff__mem_vtable): Added new functions datasources_open and
>    get_prefix_lines.
> 
> * subversion/libsvn_diff/diff_file.c
>   (svn_diff__file_baton_t): Added members prefix_lines, suffix_start_chunk[4]
>    and suffix_offset_in_chunk.
>   (increment_pointer_or_chunk, decrement_pointer_or_chunk): New functions.
>   (find_identical_prefix, find_identical_suffix): New functions.
>   (datasources_open): New function, to open both datasources and find their
>    identical prefix and suffix. From the identical suffix, 50 lines are kept to
>    help the diff algorithm find the nicest possible diff representation
>    in case of ambiguity.
>   (get_prefix_lines): New function.
>   (datasource_get_next_token): Stop at start of identical suffix.
>   (svn_diff__file_vtable): Added new functions datasources_open and
>    get_prefix_lines.
> 
> * subversion/libsvn_diff/diff.h
>   (svn_diff__get_tokens): Added argument "datasource_opened", to indicate that
>    the datasource was already opened.
> 
> * subversion/libsvn_diff/token.c
>   (svn_diff__get_tokens): Added argument "datasource_opened". Only open the
>    datasource if datasource_opened is FALSE. Set the starting offset of the
>    position list to the number of prefix lines.
> 
> * subversion/libsvn_diff/lcs.c
>   (svn_diff__lcs): Added argument "prefix_lines". Use this to correctly set
>    the offset of the sentinel position for EOF, even if one of the files
>    became empty after eliminating the identical prefix.
> 
> * subversion/libsvn_diff/diff.c
>   (svn_diff__diff): Add a chunk of "common" diff for identical prefix.
>   (svn_diff_diff): Use new function datasources_open, to open original and
>    modified at once, and find their identical prefix and suffix. Pass
>    prefix_lines to svn_diff__lcs and to svn_diff__diff.
> 
> * subversion/libsvn_diff/diff3.c
>   (svn_diff_diff3): Pass datasource_opened = FALSE to svn_diff__get_tokens.
>    Pass prefix_lines = 0 to svn_diff__lcs.
> 
> * subversion/libsvn_diff/diff4.c
>   (svn_diff_diff4): Pass datasource_opened = FALSE to svn_diff__get_tokens.
>    Pass prefix_lines = 0 to svn_diff__lcs.
> ]]]
> 
> 
> Cheers,

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Johan Corveleyn <jc...@gmail.com>.

Ok, third iteration of the patch in attachment. It passes make check.

As discussed in [1], this version keeps 50 lines of the identical
suffix around, to give the algorithm a good chance to generate a diff
output of good quality (in all but the most extreme cases, this will
be the same as with the original svn_diff algorithm).

That's about the only difference with the previous iteration. So for
now, I'm submitting this for review. Any feedback is very welcome :-).

I still consider this a WIP, because of the following remaining todo's
(which may have a lot of impact on the current implementation):
- Generalize for more than 2 datasources (for diff3 and diff4).
- revv svn_diff_fns_t and maybe other stuff I've changed in public API.
- Add support for -x-b, -x-w, and -x--ignore-eol-style options. Maybe
switch the implementation to read out entire lines before comparing
(like datasources_get_next_token does).

Log message:
[[[
Make svn_diff_diff skip identical prefix and suffix to make diff and blame
faster.

* subversion/include/svn_diff.h
  (svn_diff_fns_t): Added new function types datasources_open and
   get_prefix_lines to the vtable.

* subversion/libsvn_diff/diff_memory.c
  (datasources_open): New function (does nothing).
  (get_prefix_lines): New function (does nothing).
  (svn_diff__mem_vtable): Added new functions datasources_open and
   get_prefix_lines.

* subversion/libsvn_diff/diff_file.c
  (svn_diff__file_baton_t): Added members prefix_lines, suffix_start_chunk[4]
   and suffix_offset_in_chunk.
  (increment_pointer_or_chunk, decrement_pointer_or_chunk): New functions.
  (find_identical_prefix, find_identical_suffix): New functions.
  (datasources_open): New function, to open both datasources and find their
   identical prefix and suffix. From the identical suffix, 50 lines are kept to
   help the diff algorithm find the nicest possible diff representation
   in case of ambiguity.
  (get_prefix_lines): New function.
  (datasource_get_next_token): Stop at start of identical suffix.
  (svn_diff__file_vtable): Added new functions datasources_open and
   get_prefix_lines.

* subversion/libsvn_diff/diff.h
  (svn_diff__get_tokens): Added argument "datasource_opened", to indicate that
   the datasource was already opened.

* subversion/libsvn_diff/token.c
  (svn_diff__get_tokens): Added argument "datasource_opened". Only open the
   datasource if datasource_opened is FALSE. Set the starting offset of the
   position list to the number of prefix lines.

* subversion/libsvn_diff/lcs.c
  (svn_diff__lcs): Added argument "prefix_lines". Use this to correctly set
   the offset of the sentinel position for EOF, even if one of the files
   became empty after eliminating the identical prefix.

* subversion/libsvn_diff/diff.c
  (svn_diff__diff): Add a chunk of "common" diff for identical prefix.
  (svn_diff_diff): Use new function datasources_open, to open original and
   modified at once, and find their identical prefix and suffix. Pass
   prefix_lines to svn_diff__lcs and to svn_diff__diff.

* subversion/libsvn_diff/diff3.c
  (svn_diff_diff3): Pass datasource_opened = FALSE to svn_diff__get_tokens.
   Pass prefix_lines = 0 to svn_diff__lcs.

* subversion/libsvn_diff/diff4.c
  (svn_diff_diff4): Pass datasource_opened = FALSE to svn_diff__get_tokens.
   Pass prefix_lines = 0 to svn_diff__lcs.
]]]


Cheers,
-- 
Johan

[1] http://svn.haxx.se/dev/archive-2010-10/0141.shtml

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Johan Corveleyn <jc...@gmail.com>.

Ok, third iteration of the patch in attachment. It passes make check.

As discussed in [1], this version keeps 50 lines of the identical
suffix around, to give the algorithm a good chance to generate a diff
output of good quality (in all but the most extreme cases, this will
be the same as with the original svn_diff algorithm).

That's about the only difference with the previous iteration. So for
now, I'm submitting this for review. Any feedback is very welcome :-).

I still consider this a WIP, because of the following remaining todo's
(which may have a lot of impact on the current implementation):
- Generalize for more than 2 datasources (for diff3 and diff4).
- revv svn_diff_fns_t and maybe other stuff I've changed in public API.
- Add support for -x-b, -x-w, and -x--ignore-eol-style options. Maybe
switch the implementation to read out entire lines before comparing
(like datasources_get_next_token does).

Log message:
[[[
Make svn_diff_diff skip identical prefix and suffix to make diff and blame
faster.

* subversion/include/svn_diff.h
  (svn_diff_fns_t): Added new function types datasources_open and
   get_prefix_lines to the vtable.

* subversion/libsvn_diff/diff_memory.c
  (datasources_open): New function (does nothing).
  (get_prefix_lines): New function (does nothing).
  (svn_diff__mem_vtable): Added new functions datasources_open and
   get_prefix_lines.

* subversion/libsvn_diff/diff_file.c
  (svn_diff__file_baton_t): Added members prefix_lines, suffix_start_chunk[4]
   and suffix_offset_in_chunk.
  (increment_pointer_or_chunk, decrement_pointer_or_chunk): New functions.
  (find_identical_prefix, find_identical_suffix): New functions.
  (datasources_open): New function, to open both datasources and find their
   identical prefix and suffix. From the identical suffix, 50 lines are kept to
   help the diff algorithm find the nicest possible diff representation
   in case of ambiguity.
  (get_prefix_lines): New function.
  (datasource_get_next_token): Stop at start of identical suffix.
  (svn_diff__file_vtable): Added new functions datasources_open and
   get_prefix_lines.

* subversion/libsvn_diff/diff.h
  (svn_diff__get_tokens): Added argument "datasource_opened", to indicate that
   the datasource was already opened.

* subversion/libsvn_diff/token.c
  (svn_diff__get_tokens): Added argument "datasource_opened". Only open the
   datasource if datasource_opened is FALSE. Set the starting offset of the
   position list to the number of prefix lines.

* subversion/libsvn_diff/lcs.c
  (svn_diff__lcs): Added argument "prefix_lines". Use this to correctly set
   the offset of the sentinel position for EOF, even if one of the files
   became empty after eliminating the identical prefix.

* subversion/libsvn_diff/diff.c
  (svn_diff__diff): Add a chunk of "common" diff for identical prefix.
  (svn_diff_diff): Use new function datasources_open, to open original and
   modified at once, and find their identical prefix and suffix. Pass
   prefix_lines to svn_diff__lcs and to svn_diff__diff.

* subversion/libsvn_diff/diff3.c
  (svn_diff_diff3): Pass datasource_opened = FALSE to svn_diff__get_tokens.
   Pass prefix_lines = 0 to svn_diff__lcs.

* subversion/libsvn_diff/diff4.c
  (svn_diff_diff4): Pass datasource_opened = FALSE to svn_diff__get_tokens.
   Pass prefix_lines = 0 to svn_diff__lcs.
]]]


Cheers,
-- 
Johan

[1] http://svn.haxx.se/dev/archive-2010-10/0141.shtml

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Posted by Johan Corveleyn <jc...@gmail.com>.

On Sun, Oct 3, 2010 at 1:46 AM, Johan Corveleyn <jc...@gmail.com> wrote:
> Hi,
>
> Here is a second iteration of the patch. It now passes make check.
>
> Differences from the previous version are:
> - Support for \r eol-style (\n and \r\n was already ok).
> - The number of prefix_lines is now passed to svn_diff__lcs, so it can
> use that value to set the position offset of the "EOF" marker
> correctly, in case one of both files has become empty after skipping
> the prefix. This fixes the crashes in blame_tests.py 2 and 7.
>
> The patch is pretty big, so please let me know if I should split it up
> to make it more reviewable (I could easily split it up between the
> prefix-finding and the suffix-finding, at the cost of having overview
> over the entire algorithm).
>
> Still to do:
> - Think about why results are sometimes different (because of
> eliminated suffix, the LCS can sometimes be slightly different), and
> what can be done about it.

Hm, this problem has made me reconsider whether my patch is a good
solution (more details below). I'm thinking of other ways of
implementing this kind of optimization, so for now I'm pulling this
patch. Please don't review it yet, as I might go for a radically
different approach (sorry for any time spent that may be lost).

The details:
The problem happens if there are two (or more) non-matching parts with
an identical part in between (so the prefix scanning stops early,
doesn't meet the suffix in one of the files), and the suffix scanning
goes too far because the end of the last change is identical to the
original at that point.

For example (best viewed with fixed-width font):
[[[
original                     | modified
-----------------------------------------------------------
This is line 1               | This is line 1
This is line 2               | This line is *changed*
This is line 3               | This is line 3
... (more identical lines)   | ...
existing_function()          | existing_function()
{                            | {
  // do stuff                |   // do stuff
  return SVN_NO_ERROR;       |   return SVN_NO_ERROR;
}                            | }
                             |
This is the end of the file  | new_function()
-----------------------------| {
                             |   // do other stuff
                             |   return SVN_NO_ERROR;
                             | }
                             |
                             | This is the end of the file
                             |-----------------------------
]]]

The following identical suffix will be eliminated:
[[[
  return SVN_NO_ERROR;
}

This is the end of the file
]]]

which means the LCS algorithm cannot do better than to say that the
following lines are new:
[[[
+  return SVN_NO_ERROR;
+}
+
+new_function()
+{
+  // do other stuff
]]]

Not quite what we want/expect. If the suffix is not eliminated, the
LCS algorithm gives the correct/expected result. Also if the change in
the beginning of the file isn't there, the result is correct (because
the prefix scanning goes first, and eliminates the identical stuff
from the start (I'll leave it as an exercise to the reader to see that
the result is better)).

Interestingly, GNU diff also has this problem. It mitigates it a bit
by keeping a number of identical prefix/suffix lines (at least the
number of context lines, or a higher number if supplied by
--horizon-lines=NUM). This cannot completely solve the problem though:
for any given number of "horizon-lines" one can always come up with an
example that doesn't give the correct result.

So I need to find a way to not "eliminate" the identical suffix, but
to mark those lines as identical and then include them in the
position_list that's going to be used by the LCS algorithm. I'd like
to do the same for the prefix. This means I need to approach this
problem on a different level.

To be continued ...

Cheers,
-- 
Johan