Posted to dev@subversion.apache.org by Daniel Shahaf <d....@daniel.shahaf.name> on 2010/12/01 02:38:30 UTC

Re: diff-optimizations-tokens branch: I think I'm going to abandon it

Johan Corveleyn wrote on Wed, Dec 01, 2010 at 00:25:27 +0100:
> I am now considering abandoning the tokens-approach, for the following reasons:
...
> So, unless someone can convince me otherwise, I'm probably going to
> stop with the token approach. Because of 2), I don't think it's worth
> it to spend the effort needed for 1), especially because the
> byte-based approach already works.
> 

In other words, you're saying that the token-based approach: (b) won't be
as fast as the bytes-based approach can be, and (a) requires much effort
to be spent on implementing the reverse reading of tokens?  (i.e.,
a backwards gets())

> Any thoughts?
> 

-tokens/BRANCH-README mentions one of the advantages of the tokens
approach being that the comparison is done only after whitespace and
newlines have been canonicalized (if -x-w or -x--ignore-eol-style is in
effect).  IIRC, the -bytes approach doesn't currently take advantage of
these -x flags?

What is the practical effect of the fact that the -bytes approach doesn't
take advantage of these flags?  If a file (with a moderately long
history) has had all its newlines changed in rN, then I assume that your
-bytes optimizations will speed up all the diffs that 'blame' performs
on that file, except for the single diff between rN and its predecessor?

Re: diff-optimizations-tokens branch: I think I'm going to abandon it

Posted by "C. Michael Pilato" <cm...@collab.net>.
On 12/02/2010 12:18 PM, Bill Tutt wrote:
[...]

. o O ( Who the heck is this Bill Tutt guy? )

Nice to read you again, Bill!

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand


Re: diff-optimizations-tokens branch: I think I'm going to abandon it

Posted by Johan Corveleyn <jc...@gmail.com>.
On Thu, Dec 2, 2010 at 9:59 PM, Daniel Shahaf <d....@daniel.shahaf.name> wrote:
> Johan Corveleyn wrote on Thu, Dec 02, 2010 at 21:21:20 +0100:
>> On Thu, Dec 2, 2010 at 6:18 PM, Bill Tutt <bi...@tutts.org> wrote:
>> >    If tokens include keyword expansion operations then stop once you
>> > hit one. The possible source of bugs outweighs the perf gain in my mind
>> > here.
>>
>> Haven't thought about keyword expansion yet. But as you suggest: I'm
>> not going to bother doing special stuff for (expanded) keywords. If we
>> find a mismatch, we'll stop with the optimized scanning, and fall back
>> to the default algorithm.
>
> Only for the suffix scanning?  Or also for the prefix scanning?
>
> (I think you meant the former.)  For prefix scanning, I think it's
> common to have $Id$ or similar near the top of the file, so being able
> to collect matching lines past it too would be beneficial.

I left this mail flagged for a long time, because I wanted to get back
to it (but didn't have the time back then to think about it further).

On the topic of diff and keyword expansion: as I said, I don't do
anything special with keywords during prefix/suffix scanning. I just
work with the same data as the normal diff algorithm. But, luckily,
diff works with the "unexpanded-keywords" (de-translated?) version of
the file. I.e. diff doesn't notice any difference caused by expanded
keywords (so neither does the prefix/suffix scanning).

You can nicely see this if you perform an "svn diff" of a file with
keywords, with a line edited just above or below the expanded keyword
(or even on the same line). The diff output shows the unexpanded
keywords, though the keywords in the actual file are expanded.

Cheers,
-- 
Johan

Re: diff-optimizations-tokens branch: I think I'm going to abandon it

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Johan Corveleyn wrote on Thu, Dec 02, 2010 at 21:21:20 +0100:
> On Thu, Dec 2, 2010 at 6:18 PM, Bill Tutt <bi...@tutts.org> wrote:
> >    If tokens include keyword expansion operations then stop once you
> > hit one. The possible source of bugs outweighs the perf gain in my mind
> > here.
> 
> Haven't thought about keyword expansion yet. But as you suggest: I'm
> not going to bother doing special stuff for (expanded) keywords. If we
> find a mismatch, we'll stop with the optimized scanning, and fall back
> to the default algorithm.

Only for the suffix scanning?  Or also for the prefix scanning?

(I think you meant the former.)  For prefix scanning, I think it's
common to have $Id$ or similar near the top of the file, so being able
to collect matching lines past it too would be beneficial.

Re: diff-optimizations-tokens branch: I think I'm going to abandon it

Posted by Branko Čibej <br...@xbc.nu>.
On 02.12.2010 21:21, Johan Corveleyn wrote:
> On Thu, Dec 2, 2010 at 6:18 PM, Bill Tutt <bi...@tutts.org> wrote:
>> Additional ignore whitespace related comment:
>> * IIRC, Perforce had an interesting twist on ignoring whitespace. You
>> could ignore just line leading/ending whitespace instead of all
>> whitespace differences but pay attention to any whitespace change
>> after the "trim" operation had completed.
>>
>> e.g.:
>> * "    aaa bbb   " vs "aaa bbb" would compare as equal
>> * "    aaa  bbb  " vs "aaa  bbb" would compare as equal
>> * "    aaa  bbb  " vs "aaa bbb" would compare as non-equal due to the
>> white space change in the middle of the line
> Cool (svn doesn't have that option). But I'm not sure what that would
> be useful for (as a user, I can't immediately imagine an important use
> case). Anyway, could still be a nice option...

It filters out the noise when code is only reindented, but catches
whitespace differences in, say, literal strings within the reindented code.

-- Brane

Re: diff-optimizations-tokens branch: I think I'm going to abandon it

Posted by Johan Corveleyn <jc...@gmail.com>.
On Thu, Dec 2, 2010 at 6:18 PM, Bill Tutt <bi...@tutts.org> wrote:
> Note: this email relates only tangentially to svn diff; it's more
> about reverse token scanning in general:
>
> As someone who has implemented suffix reverse token scanning before:

Thanks for the input. It's nice to see other people have also
struggled with this :-).

> * It simply isn't possible in DBCS code pages. Stick to bytes only here.
>    SBCS and UTF-16 make reverse token stuff relatively
> straightforward. UTF-8 is a little trickier but still tractable.
>   At least UTF-8 is tractable in a way that DBCS isn't. You always
> know which part of a Unicode code point you are in. (i.e. byte 4 vs.
> byte 3 vs. etc...)

Ok, this further supports the decision to focus on the byte-based
approach. We'll only consider stuff identical if all bytes are
identical. That's the simplest route, and since it's only an
optimization anyway ...
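
For what it's worth, the property that makes UTF-8 tractable backwards
(and DBCS not) fits in a few lines. A standalone sketch in plain C (not
Subversion code): continuation bytes always match 10xxxxxx, so stepping
back over them lands you on the first byte of the code point:

    #include <stddef.h>

    /* Step backwards from p to the first byte of the UTF-8 code point
     * containing it.  Continuation bytes are 0x80..0xBF, so the scan
     * is purely local; a DBCS encoding has no such self-synchronizing
     * property, so lead and trail bytes can't be told apart. */
    static const unsigned char *
    utf8_codepoint_start(const unsigned char *p, const unsigned char *begin)
    {
      while (p > begin && (*p & 0xC0) == 0x80)
        p--;
      return p;
    }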

> * I would recommend only supporting a subset of the diff options for
> reverse token scanning. i.e. ignore whitespace/ignore eol but skip
> ignore case (if svn has that, I forget...)

svn diff doesn't have an ignore-case option, so that's ok :-).

>    If tokens include keyword expansion operations then stop once you
> hit one. The possible source of bugs outweighs the perf gain in my mind
> here.

Haven't thought about keyword expansion yet. But as you suggest: I'm
not going to bother doing special stuff for (expanded) keywords. If we
find a mismatch, we'll stop with the optimized scanning, and fall back
to the default algorithm.

> * Suffix scanning really does require a seekable stream; if it isn't
> seekable, then don't perform the reverse scanning.  It is only an
> optimization after all.

Hm, yes, we'll need to be careful about that. I'll start another mail
thread asking for known implementors of the svn_diff_fns_t functions,
to find out whether seeking around like that for suffix would be
supported.

> Additional ignore whitespace related comment:
> * IIRC, Perforce had an interesting twist on ignoring whitespace. You
> could ignore just line leading/ending whitespace instead of all
> whitespace differences but pay attention to any whitespace change
> after the "trim" operation had completed.
>
> e.g.:
> * "    aaa bbb   " vs "aaa bbb" would compare as equal
> * "    aaa  bbb  " vs "aaa  bbb" would compare as equal
> * "    aaa  bbb  " vs "aaa bbb" would compare as non-equal due to the
> white space change in the middle of the line

Cool (svn doesn't have that option). But I'm not sure what that would
be useful for (as a user, I can't immediately imagine an important use
case). Anyway, could still be a nice option...

Cheers,
-- 
Johan

Re: diff-optimizations-tokens branch: I think I'm going to abandon it

Posted by Bill Tutt <bi...@tutts.org>.
Note: this email relates only tangentially to svn diff; it's more
about reverse token scanning in general:

As someone who has implemented suffix reverse token scanning before:

* It simply isn't possible in DBCS code pages. Stick to bytes only here.
   SBCS and UTF-16 make reverse token stuff relatively
straightforward. UTF-8 is a little trickier but still tractable.
   At least UTF-8 is tractable in a way that DBCS isn't. You always
know which part of a Unicode code point you are in. (i.e. byte 4 vs.
byte 3 vs. etc...)

* I would recommend only supporting a subset of the diff options for
reverse token scanning. i.e. ignore whitespace/ignore eol but skip
ignore case (if svn has that, I forget...)
   If tokens include keyword expansion operations then stop once you
hit one. The possible source of bugs outweighs the perf gain in my mind
here.
* Suffix scanning really does require a seekable stream; if it isn't
seekable, then don't perform the reverse scanning.  It is only an
optimization after all.

Additional ignore whitespace related comment:
* IIRC, Perforce had an interesting twist on ignoring whitespace. You
could ignore just line leading/ending whitespace instead of all
whitespace differences but pay attention to any whitespace change
after the "trim" operation had completed.

e.g.:
* "    aaa bbb   " vs "aaa bbb" would compare as equal
* "    aaa  bbb  " vs "aaa  bbb" would compare as equal
* "    aaa  bbb  " vs "aaa bbb" would compare as non-equal due to the
white space change in the middle of the line
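
For illustration, a minimal sketch of that comparison in plain C
(invented helper; nothing Perforce-specific): trim both ends, then
require an exact byte-for-byte match, so whitespace changes in the
middle of the line still count:

    #include <ctype.h>
    #include <string.h>

    static int
    lines_equal_after_trim(const char *a, size_t alen,
                           const char *b, size_t blen)
    {
      /* Trim leading and trailing whitespace on both sides. */
      while (alen > 0 && isspace((unsigned char)a[0]))       { a++; alen--; }
      while (alen > 0 && isspace((unsigned char)a[alen - 1]))  alen--;
      while (blen > 0 && isspace((unsigned char)b[0]))       { b++; blen--; }
      while (blen > 0 && isspace((unsigned char)b[blen - 1]))  blen--;

      /* The trimmed remainders must match exactly. */
      return alen == blen && memcmp(a, b, alen) == 0;
    }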

Fyi,
Bill

Re: diff-optimizations-tokens branch: I think I'm going to abandon it

Posted by Johan Corveleyn <jc...@gmail.com>.
On Thu, Dec 2, 2010 at 4:21 PM, Julian Foad <ju...@wandisco.com> wrote:
> Hi Johan.
>
> I've just read the whole of this thread.
>
> I didn't quite understand your original point (2) that "token-based
> suffix scanning will not be as fast as byte-based suffix scanning".
> Sure it won't, but is there any reason you mentioned suffix scanning
> there specifically?  The same is true of prefix scanning, of course.

I mentioned suffix specifically, because for prefix scanning the bytes
approach doesn't have that same advantage. It has to count the number
of lines, to have a correct line offset for the rest of the "normal"
algorithm. So the byte-based prefix scanning needs to check for
eol-characters just like the token-based version. That means that
theoretically (implementation details aside) both approaches should be
equally fast for prefix scanning.

But for suffix scanning, it seems I don't need to know the number of
lines, so the bytes approach can just ignore line structure for the
suffix.
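
To make the asymmetry concrete, here's a minimal sketch in plain C over
in-memory buffers (not the actual diff_file.c code, which reads in
chunks and handles \r\n and the different eol styles):

    #include <stddef.h>

    /* Prefix: only whole identical lines may be skipped, and the
     * skipped lines must be counted for the main algorithm's line
     * offsets, so every byte is also checked against the eol
     * character.  'len' is the smaller of the two buffer lengths. */
    static size_t
    identical_prefix(const char *a, const char *b, size_t len,
                     size_t *lines_skipped)
    {
      size_t i = 0, end_of_last_eol = 0;

      *lines_skipped = 0;
      while (i < len && a[i] == b[i])
        {
          if (a[i] == '\n')
            {
              (*lines_skipped)++;
              end_of_last_eol = i + 1;
            }
          i++;
        }
      return end_of_last_eol;
    }

    /* Suffix: no line offset is needed afterwards, so the scan can
     * ignore line structure entirely and just compare raw bytes. */
    static size_t
    identical_suffix(const char *a, size_t alen,
                     const char *b, size_t blen)
    {
      size_t n = 0;

      while (n < alen && n < blen && a[alen - 1 - n] == b[blen - 1 - n])
        n++;
      return n;
    }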

> And both of them could be fast enough, I assume, if you disable the hash
> calculation in the "get token" callbacks like you were talking about.

Granted, the difference could be minimal (I haven't actually tested).

However, an additional two comparisons (against \r and \n) for every
byte in a suffix of, say, 500Kb, over, say, 2000 diffs of a blame
operation, adds up to around 2 * 10^9 extra comparisons; that might
cost another couple of seconds :-).

> But I don't think that necessarily affects the main point.  It looks
> like you've thoroughly investigated using a token based approach.  Thank
> you for doing so.  My initial feeling that it was worth investigating
> was in the hope that you might find some fairly straightforward and
> self-contained modification to the existing token-handling layer.  I
> think the result of this investigation, in which you needed to add
> token-fetch-backwards callbacks and so on, shows that this approach is
> too complex.  I don't want to see a complex implementation.  Therefore I
> support your inclination to abandon that approach and use the byte-wise
> approach instead.

Yep, it just seems too complex (as also confirmed by Bill Tutt in the
other mail).

The "pushback" stuff (which is inevitable, because we will always read
one token too far to discover the non-match, and that line needs to be
reread to calculate the hash) is also rather dirty/annoying (adds
additional svn_diff_fns_t-ness).

It was definitely worth investigating, if only to give me some peace
of mind that we have considered the options (and it was a great
learning experience).

Cheers,
-- 
Johan

Re: diff-optimizations-tokens branch: I think I'm going to abandon it

Posted by Julian Foad <ju...@wandisco.com>.
Hi Johan.

I've just read the whole of this thread.

I didn't quite understand your original point (2) that "token-based
suffix scanning will not be as fast as byte-based suffix scanning".
Sure it won't, but is there any reason you mentioned suffix scanning
there specifically?  The same is true of prefix scanning, of course.
And both of them could be fast enough, I assume, if you disable the hash
calculation in the "get token" callbacks like you were talking about.

But I don't think that necessarily affects the main point.  It looks
like you've thoroughly investigated using a token based approach.  Thank
you for doing so.  My initial feeling that it was worth investigating
was in the hope that you might find some fairly straightforward and
self-contained modification to the existing token-handling layer.  I
think the result of this investigation, in which you needed to add
token-fetch-backwards callbacks and so on, shows that this approach is
too complex.  I don't want to see a complex implementation.  Therefore I
support your inclination to abandon that approach and use the byte-wise
approach instead.

- Julian


Re: diff-optimizations-tokens branch: I think I'm going to abandon it

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Johan Corveleyn wrote on Thu, Dec 02, 2010 at 14:59:23 +0100:
> On Thu, Dec 2, 2010 at 6:43 AM, Daniel Shahaf <d....@daniel.shahaf.name> wrote:
> > Johan Corveleyn wrote on Wed, Dec 01, 2010 at 13:34:48 +0100:
> >> On Wed, Dec 1, 2010 at 11:44 AM, Daniel Shahaf <d....@daniel.shahaf.name> wrote:
> >> > Johan Corveleyn wrote on Wed, Dec 01, 2010 at 10:05:29 +0100:
> >> >> On Wed, Dec 1, 2010 at 3:38 AM, Daniel Shahaf <d....@daniel.shahaf.name> wrote:
> >> >> > Johan Corveleyn wrote on Wed, Dec 01, 2010 at 00:25:27 +0100:
> >> You're right, that would impose additional constraints on other
> >> implementors. I don't know if being non-streamy (or less streamy
> >> anyway) would be a problem ...
> >>
> >
> > We should have asked this before, but:
> >
> > Do we know who are the other implementors / typical use-cases of
> > svn_diff_fns_t?
> 
> Yeah, was wondering about this too. Are there in fact other
> implementors? Maybe plugins for IDE's or something? How could we find
> out? How "public" is this API in fact?
> 

[ you might want to ask this on a new thread / subject line to get more visibility ]

> For blame, I know all revisions are first converted to full-texts,
> which are written out to temp files. Then diff_file.c works on those
> two files.
> 

OK.

> >> Sidestep: I just now realized that I probably don't need to have the
> >> "reverse normalization algorithm" for implementing get_previous_token.
> >> The call to the normalization function in get_next_token is (I think)
> >> only needed to be able to calculate the hash. But since
> >> get_previous_token doesn't need to calculate hashes, I may be able to
> >> get by without normalization there. I'd only need to normalize inside
> >> token_compare, and I *think* I can just do that "forwardly", instead
> >> of backwards. Just thinking out loud here ...
> >>
> >
> > Is "normalization function" the thing that collapses newlines and
> > whitespaces if -x-w or -x--ignore-eol-style is in effect?  If so,
> > another purpose of calling that might be to make suffixes that are
> > not bytewise-identical, but only modulo-whitespace-identical, also
> > be lumped into the "identical suffix".
> >
> > (Haven't dived into the code yet; the above is based on my understanding
> > of your high-level descriptions)
> 
> Yes, it's svn_diff__normalize_buffer in libsvn_diff/util.c. I don't
> fully understand your suggestion above. The normalize function
> actually transforms a given buffer by normalizing all characters
> according to the ignore options (so if -x-w is given, all whitespace
> is simply skipped, or with -x-b all whitespace is transformed into a
> single space etc.).

Yes, that's what I assumed.

> Are you suggesting that I don't have to transform
> anything, but merely have to compare them "modulo-ignore-stuff", just
> to know if they can be skipped, and further not touch them?
> 

Well, yes, that's one option: if you write a 'smart' comparison routine
that takes -x--foo into account (so if -x-b, it ensures that both sides
have a non-empty string of whitespace whenever either has a whitespace,
etc.), then you can avoid the call to the normalization function.
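
For the -x-b case, such a routine might look like this standalone
sketch in plain C (invented name, not svn_diff code; eol handling
omitted): both sides must have a whitespace run at the same point, and
each run is collapsed on the fly instead of rewriting either buffer:

    #include <ctype.h>
    #include <stddef.h>

    static int
    equal_modulo_blank_runs(const char *a, size_t alen,
                            const char *b, size_t blen)
    {
      size_t i = 0, j = 0;

      while (i < alen && j < blen)
        {
          if (isspace((unsigned char)a[i]) && isspace((unsigned char)b[j]))
            {
              /* Both sides have a run of whitespace here: treat each
               * run as a single blank. */
              do { i++; } while (i < alen && isspace((unsigned char)a[i]));
              do { j++; } while (j < blen && isspace((unsigned char)b[j]));
            }
          else if (a[i] == b[j])
            {
              i++;
              j++;
            }
          else
            return 0;  /* differing chars, or blank on one side only */
        }
      return i == alen && j == blen;
    }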

> In diff_file.c the transformation actually uses the same source buffer
> also as target buffer (I *think* that's for optimization reasons: no
> need to memcpy too much; but I'm not sure, haven't studied that in
> too much detail). That actually caused me some headaches with the
> -tokens approach, because get_next_token actually changes the buffer.
> That's why I had to write a somewhat more sophisticated
> token_pushback function (I couldn't simply roll back the pointers,
> because the next time get_next_token would be called, it would not
> calculate the raw_length correctly, and possibly stop reading too
> early; so I wrote it to remember the pushed-back token in a linked
> list, so its information could be used for the next get_next_token).
> 
> In diff_memory.c, the transformation is done to another buffer (the
> size of which is first calculated to be big enough to hold the largest
> token of the list).
> 
> >> So: that makes the token-approach again a little bit more possible.
> >> But do we want it? It requires a lot more from implementors of
> >> svn_diff_fns_t. OTOH, it does offer a generic prefix/suffix
> >> optimization to all implementors of svn_diff_fns_t ...
> >>
> >
> > Another option is to add the "read backwards" callbacks to the API, but
> > designate them as optional.  Then you only do the suffix half of the
> > optimization if both/all sources support the backwards callbacks.
> >
> > (Is it a good idea?  Don't know.  But it's one option.)
> 
> Hm, maybe.
> 
> There is another problem though: prefix scanning uses the existing
> callback "datasource_get_next_token". But it doesn't need the hash
> that's calculated in there (I think the biggest speedup is coming from
> not having to calculate the hash when we're comparing prefix/suffix
> lines). So I've simply changed the implementation (both in diff_file.c
> and diff_memory.c) to handle NULL as argument for **hash by not
> calculating the hash. Existing code will probably not handle that, and
> probably crash (or otherwise will calculate the hash and drop it on
> the floor, but not providing any performance benefit).
> 
> Maybe it's better to have an explicit boolean flag "calculate_hash",
> instead of passing NULL, but either way it will not be backwards
> compatible with old implementors.
> 

I wouldn't worry: you have to revv the svn_diff_fns_t API anyway (right?),
so you can add a note saying "This output parameter is (now) optional"
in svn_diff_fns2_t. 

> >> So the choice still boils down to:
> >> 1) A specific implementation inside diff_file.c (byte-based).
> >>
> >
> > Not a bad option, I think.  The improvement isn't as useful in
> > diff_memory.c, since that API isn't likely to be used for large
> > amounts of data anyway (e.g., trunk uses it only for property diffs).
> 
> Indeed. I don't know if property diffs will ever be large enough (and
> in large enough quantities for a single operation) to have a
> measurable difference. Maybe with large svn:mergeinfo properties,
> being diffed 100's or 1000's of times during a single merge operation?
> I have no idea.

We assume throughout the code that properties are not too large.  Even
with that merge operation, ask yourself whether the file data won't be
a few orders of magnitude larger than the properties data.

Re: diff-optimizations-tokens branch: I think I'm going to abandon it

Posted by Johan Corveleyn <jc...@gmail.com>.
On Thu, Dec 2, 2010 at 6:43 AM, Daniel Shahaf <d....@daniel.shahaf.name> wrote:
> [ finally getting back to this mail; having slept on it, etc. ]
>
> Johan Corveleyn wrote on Wed, Dec 01, 2010 at 13:34:48 +0100:
>> On Wed, Dec 1, 2010 at 11:44 AM, Daniel Shahaf <d....@daniel.shahaf.name> wrote:
>> > Johan Corveleyn wrote on Wed, Dec 01, 2010 at 10:05:29 +0100:
>> >> On Wed, Dec 1, 2010 at 3:38 AM, Daniel Shahaf <d....@daniel.shahaf.name> wrote:
>> >> > Johan Corveleyn wrote on Wed, Dec 01, 2010 at 00:25:27 +0100:
>> >> >> I am now considering abandoning the tokens-approach, for the following reasons:
>> >> > ...
>> >> >> So, unless someone can convince me otherwise, I'm probably going to
>> >> >> stop with the token approach. Because of 2), I don't think it's worth
>> >> >> it to spend the effort needed for 1), especially because the
>> >> >> byte-based approach already works.
>> >> >>
>> >> >
>> >> > In other words, you're saying that the token-based approach: (b) won't be
>> >> > as fast as the bytes-based approach can be, and (a) requires much effort
>> >> > to be spent on implementing the reverse reading of tokens?  (i.e.,
>> >> > a backwards gets())
>> >>
>> >> Yes.
>> >>
>> >> The reverse reading is quite hard (in the diff_file.c case) because of
>> >> the chunked reading of files. A line may be split by a "chunk
>> >> boundary" (it may even be split in the middle of an eol sequence
>> >> (between \r and \n), and it all still needs to be
>> >> canonicalized/normalized correctly (that's why we'll also need a
>> >> reverse normalization function). The current forward get_next_token
>> >> does this very well, and the reverse should be analogous, but I just
>> >> find it hard to reason about, and to get it implemented correctly. It
>> >> will take me a lot of time, and with a high probability of introducing
>> >> subtle bugs for special edge cases.
>> >>
>> >
>> > OK, so a reverse get_next_token() could be tricky to implement.  But,
>> > worse, won't having it mean that implementors of svn_diff_fns_t can't
>> > have streamy access to their source?  Since they would be required to
>> > provide sometimes a token from the start and sometimes a token from the
>> > end...
>> >
>> > Well, it's not streamy in our usual sense of the word, but it's
>> > "double-streamy" (no one requests the middle byte until either all
>> > bytes before or all bytes after it were transmitted already)
>>
>> Oooh, I hadn't considered other implementors besides the internal
>> diff_memory.c and diff_file.c.
>
> Just to clarify: diff_file.c and diff_memory.c are the only implementors
> of svn_diff_fns_t in the core code, correct?

Yes.

>> You're right, that would impose
>> additional constraints on other implementors. I don't know if being
>> non-streamy (or less streamy anyway) would be a problem ...
>>
>
> We should have asked this before, but:
>
> Do we know who are the other implementors / typical use-cases of
> svn_diff_fns_t?

Yeah, was wondering about this too. Are there in fact other
implementors? Maybe plugins for IDE's or something? How could we find
out? How "public" is this API in fact?

For blame, I know all revisions are first converted to full-texts,
which are written out to temp files. Then diff_file.c works on those
two files.

>> In my last commit on the -tokens branch, I added a flag "open_at_end"
>> to the datasource_open function (function of svn_diff_fns_t), so the
>> datasource could be opened at the end. Also, other supporting
>> functions were needed: token_pushback_prefix, token_pushback_suffix
>> (to "push back" tokens that were read too far when scanning for
>> prefix/suffix) and the dreaded datasource_get_previous_token. Anyway,
>> the high-level idea was:
>>
>> 1) Open datasources at the end.
>>
>> 2) Scan for identical suffix (getting tokens in reverse).
>>
>> 3) At first non-match: pushback suffix token, and note where suffix starts.
>>
>> 4) Open datasources at the beginning.
>>
>> 5) Scan for identical prefix (getting tokens normally, but without hash).
>>
>> 6) At first non-match: pushback prefix token.
>>
>> 7) Run the "old algorithm": getting tokens forward, but with hash. The
>> get_next_token function should stop (return NULL for token) when it
>> arrives at the suffix.
>>
>>
>> Sidestep: I just now realized that I probably don't need to have the
>> "reverse normalization algorithm" for implementing get_previous_token.
>> The call to the normalization function in get_next_token is (I think)
>> only needed to be able to calculate the hash. But since
>> get_previous_token doesn't need to calculate hashes, I may be able to
>> get by without normalization there. I'd only need to normalize inside
>> token_compare, and I *think* I can just do that "forwardly", instead
>> of backwards. Just thinking out loud here ...
>>
>
> Is "normalization function" the thing that collapses newlines and
> whitespaces if -x-w or -x--ignore-eol-style is in effect?  If so,
> another purpose of calling that might be to make suffixes that are
> not bytewise-identical, but only modulo-whitespace-identical, also
> be lumped into the "identical suffix".
>
> (Haven't dived into the code yet; the above is based on my understanding
> of your high-level descriptions)

Yes, it's svn_diff__normalize_buffer in libsvn_diff/util.c. I don't
fully understand your suggestion above. The normalize function
actually transforms a given buffer by normalizing all characters
according to the ignore options (so if -x-w is given, all whitespace
is simply skipped, or with -x-b all whitespace is transformed into a
single space etc.). Are you suggesting that I don't have to transform
anything, but merely have to compare them "modulo-ignore-stuff", just
to know if they can be skipped, and further not touch them?

In diff_file.c the transformation actually uses the same source buffer
also as target buffer (I *think* that's for optimization reasons: no
need to memcpy too much; but I'm not sure, haven't studied that in
too much detail). That actually caused me some headaches with the
-tokens approach, because get_next_token actually changes the buffer.
That's why I had to write a somewhat more sophisticated
token_pushback function (I couldn't simply roll back the pointers,
because the next time get_next_token would be called, it would not
calculate the raw_length correctly, and possibly stop reading too
early; so I wrote it to remember the pushed-back token in a linked
list, so its information could be used for the next get_next_token).
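
Roughly, the pushback scheme looks like this (a sketch with invented
types, not the branch's actual code): the token that was read one step
too far goes on a linked list, and the next call consumes the list
before reading further, so the reader's internal bookkeeping
(raw_length etc.) is never disturbed:

    #include <stddef.h>

    typedef struct pushed_token_t
    {
      const char *text;
      size_t len;
      struct pushed_token_t *next;
    } pushed_token_t;

    typedef struct token_source_t
    {
      pushed_token_t *pushed;  /* pushed-back tokens, most recent first */
      /* ... the real baton also holds chunk buffers, offsets, ... */
    } token_source_t;

    static void
    push_back_token(token_source_t *src, pushed_token_t *tok)
    {
      tok->next = src->pushed;
      src->pushed = tok;
    }

    /* Returns 1 and a token if one was pushed back; otherwise the
     * caller falls through to the normal chunked reader. */
    static int
    pop_pushed_token(token_source_t *src, const char **text, size_t *len)
    {
      if (src->pushed == NULL)
        return 0;
      *text = src->pushed->text;
      *len = src->pushed->len;
      src->pushed = src->pushed->next;
      return 1;
    }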

In diff_memory.c, the transformation is done to another buffer (the
size of which is first calculated to be big enough to hold the largest
token of the list).

>> So: that makes the token-approach again a little bit more possible.
>> But do we want it? It requires a lot more from implementors of
>> svn_diff_fns_t. OTOH, it does offer a generic prefix/suffix
>> optimization to all implementors of svn_diff_fns_t ...
>>
>
> Another option is to add the "read backwards" callbacks to the API, but
> designate them as optional.  Then you only do the suffix half of the
optimization if both/all sources support the backwards callbacks.
>
> (Is it a good idea?  Don't know.  But it's one option.)

Hm, maybe.

There is another problem though: prefix scanning uses the existing
callback "datasource_get_next_token". But it doesn't need the hash
that's calculated in there (I think the biggest speedup is coming from
not having to calculate the hash when we're comparing prefix/suffix
lines). So I've simply changed the implementation (both in diff_file.c
and diff_memory.c) to handle NULL as argument for **hash by not
calculating the hash. Existing code will probably not handle that, and
probably crash (or otherwise will calculate the hash and drop it on
the floor, but not providing any performance benefit).

Maybe it's better to have an explicit boolean flag "calculate_hash",
instead of passing NULL, but either way it will not be backwards
compatible with old implementors.
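
To illustrate the convention (a self-contained sketch in plain C with
invented names, not the actual diff_file.c code): a NULL hash pointer
means "give me the token text only, and skip the hash work":

    #include <stddef.h>
    #include <string.h>

    typedef struct line_source_t
    {
      const char *buf;   /* whole datasource in memory, for simplicity */
      size_t len;
      size_t pos;
    } line_source_t;

    static int
    get_next_token(unsigned long *hash, const char **token,
                   size_t *token_len, line_source_t *src)
    {
      const char *start = src->buf + src->pos;
      const char *eol;

      if (src->pos >= src->len)
        return 0;  /* no more tokens */

      eol = memchr(start, '\n', src->len - src->pos);
      *token = start;
      *token_len = eol ? (size_t)(eol - start) + 1 : src->len - src->pos;
      src->pos += *token_len;

      if (hash != NULL)  /* prefix/suffix scanning passes NULL */
        {
          unsigned long h = 5381;  /* djb2, a stand-in for the real hash */
          size_t i;

          for (i = 0; i < *token_len; i++)
            h = h * 33 + (unsigned char)(*token)[i];
          *hash = h;
        }
      return 1;
    }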

>> >> >> Any thoughts?
>> >> >>
>> >> >
>> >> > -tokens/BRANCH-README mentions one of the advantages of the tokens
>> >> > approach being that the comparison is done only after whitespace and
>> >> > newlines have been canonicalized (if -x-w or -x--ignore-eol-style is in
>> >> > effect).  IIRC, the -bytes approach doesn't currently take advantage of
>> >> > these -x flags?
>> >> >
> >> > What is the practical effect of the fact that the -bytes approach doesn't
>> >> > take advantage of these flags?  If a file (with a moderately long
>> >> > history) has had all its newlines changed in rN, then I assume that your
>> >> > -bytes optimizations will speed up all the diffs that 'blame' performs
>> >> > on that file, except for the single diff between rN and its predecessor?
>> >>
>> >> Yes, I thought that would be an important advantage of the tokens
>> >> approach, but as you suggest: the practical effect will most likely be
>> >> limited. Indeed, only this single diff between rN and its predecessor
> >> (which may be 1 in 1000 diffs) will not benefit from
>> >> prefix/suffix-optimization. Even if the rate of such changes is like
>> >> 1/100th (let's say you change leading tabs to spaces, and vice versa,
>> >> every 100 revisions), it will hardly be noticeable.
>> >>
>> >> The perfectionist in me still thinks: hey, you can also implement this
>> >> normalization with the byte-based approach (in a streamy way). But
>> >> that will probably be quite hard, because I'd have to take into
>> >
>> > We can add that additional optimization after the basic -bytes support
>> > is working?  No need to implement all possible optimizations in one
>> > shot (let alone those of them that might have little relative benefit).
>>
>> Ok, that minor optimization can definitely wait.
>>
>>
>> So the choice still boils down to:
>> 1) A specific implementation inside diff_file.c (byte-based).
>>
>
> Not a bad option, I think.  The improvement isn't as useful in
> diff_memory.c, since that API isn't likely to be used for large
> amounts of data anyway (e.g., trunk uses it only for property diffs).

Indeed. I don't know if property diffs will ever be large enough (and
in large enough quantities for a single operation) to have a
measurable difference. Maybe with large svn:mergeinfo properties,
being diffed 100's or 1000's of times during a single merge operation?
I have no idea.

>> 2) A generic implementation for all implementors of svn_diff_fns_t,
>> which still requires some work (get_previous_token), and imposes more
>> constraints on svn_diff_fns_t implementors ...
>>
>
> Might also be a not bad option... :-)

Cheers,
-- 
Johan

Re: diff-optimizations-tokens branch: I think I'm going to abandon it

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
[ finally getting back to this mail; having slept on it, etc. ]

Johan Corveleyn wrote on Wed, Dec 01, 2010 at 13:34:48 +0100:
> On Wed, Dec 1, 2010 at 11:44 AM, Daniel Shahaf <d....@daniel.shahaf.name> wrote:
> > Johan Corveleyn wrote on Wed, Dec 01, 2010 at 10:05:29 +0100:
> >> On Wed, Dec 1, 2010 at 3:38 AM, Daniel Shahaf <d....@daniel.shahaf.name> wrote:
> >> > Johan Corveleyn wrote on Wed, Dec 01, 2010 at 00:25:27 +0100:
> >> >> I am now considering abandoning the tokens-approach, for the following reasons:
> >> > ...
> >> >> So, unless someone can convince me otherwise, I'm probably going to
> >> >> stop with the token approach. Because of 2), I don't think it's worth
> >> >> it to spend the effort needed for 1), especially because the
> >> >> byte-based approach already works.
> >> >>
> >> >
> >> > In other words, you're saying that the token-based approach: (b) won't be
> >> > as fast as the bytes-based approach can be, and (a) requires much effort
> >> > to be spent on implementing the reverse reading of tokens?  (i.e.,
> >> > a backwards gets())
> >>
> >> Yes.
> >>
> >> The reverse reading is quite hard (in the diff_file.c case) because of
> >> the chunked reading of files. A line may be split by a "chunk
> >> boundary" (it may even be split in the middle of an eol sequence
> >> (between \r and \n), and it all still needs to be
> >> canonicalized/normalized correctly (that's why we'll also need a
> >> reverse normalization function). The current forward get_next_token
> >> does this very well, and the reverse should be analogous, but I just
> >> find it hard to reason about, and to get it implemented correctly. It
> >> will take me a lot of time, and with a high probability of introducing
> >> subtle bugs for special edge cases.
> >>
> >
> > OK, so a reverse get_next_token() could be tricky to implement.  But,
> > worse, won't having it mean that implementors of svn_diff_fns_t can't
> > have streamy access to their source?  Since they would be required to
> > provide sometimes a token from the start and sometimes a token from the
> > end...
> >
> > Well, it's not streamy in our usual sense of the word, but it's
> > "double-streamy" (no one requests the middle byte until either all
> > bytes before or all bytes after it were transmitted already)
> 
> Oooh, I hadn't considered other implementors besides the internal
> diff_memory.c and diff_file.c.

Just to clarify: diff_file.c and diff_memory.c are the only implementors
of svn_diff_fns_t in the core code, correct?

> You're right, that would impose
> additional constraints on other implementors. I don't know if being
> non-streamy (or less streamy anyway) would be a problem ...
> 

We should have asked this before, but:

Do we know who are the other implementors / typical use-cases of
svn_diff_fns_t?

> In my last commit on the -tokens branch, I added a flag "open_at_end"
> to the datasource_open function (function of svn_diff_fns_t), so the
> datasource could be opened at the end. Also, other supporting
> functions were needed: token_pushback_prefix, token_pushback_suffix
> (to "push back" tokens that were read too far when scanning for
> prefix/suffix) and the dreaded datasource_get_previous_token. Anyway,
> the high-level idea was:
> 
> 1) Open datasources at the end.
> 
> 2) Scan for identical suffix (getting tokens in reverse).
> 
> 3) At first non-match: pushback suffix token, and note where suffix starts.
> 
> 4) Open datasources at the beginning.
> 
> 5) Scan for identical prefix (getting tokens normally, but without hash).
> 
> 6) At first non-match: pushback prefix token.
> 
> 7) Run the "old algorithm": getting tokens forward, but with hash. The
> get_next_token function should stop (return NULL for token) when it
> arrives at the suffix.
> 
> 
> Sidestep: I just now realized that I probably don't need to have the
> "reverse normalization algorithm" for implementing get_previous_token.
> The call to the normalization function in get_next_token is (I think)
> only needed to be able to calculate the hash. But since
> get_previous_token doesn't need to calculate hashes, I may be able to
> get by without normalization there. I'd only need to normalize inside
> token_compare, and I *think* I can just do that "forwardly", instead
> of backwards. Just thinking out loud here ...
> 

Is "normalization function" the thing that collapses newlines and
whitespaces if -x-w or -x--ignore-eol-style is in effect?  If so,
another purpose of calling that might be to make suffixes that are
not bytewise-identical, but only modulo-whitespace-identical, also
be lumped into the "identical suffix".

(Haven't dived into the code yet; the above is based on my understanding
of your high-level descriptions)

> So: that makes the token-approach again a little bit more possible.
> But do we want it? It requires a lot more from implementors of
> svn_diff_fns_t. OTOH, it does offer a generic prefix/suffix
> optimization to all implementors of svn_diff_fns_t ...
> 

Another option is to add the "read backwards" callbacks to the API, but
designate them as optional.  Then you only do the suffix half of the
optimization if both/all sources support the backwards callbacks.

(Is it a good idea?  Don't know.  But it's one option.)
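
Concretely, I'm picturing something like this (a hypothetical struct,
not the real svn_diff_fns_t): NULL members mean "not supported", and
the suffix half runs only when every source fills them in:

    #include <stddef.h>

    typedef struct diff_source_vtable_t
    {
      int (*get_next_token)(void *baton, const char **text, size_t *len);

      /* Optional; may be NULL.  If NULL on either source, skip the
       * suffix half of the prefix/suffix optimization. */
      int (*get_previous_token)(void *baton, const char **text, size_t *len);
    } diff_source_vtable_t;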

> >> >> Any thoughts?
> >> >>
> >> >
> >> > -tokens/BRANCH-README mentions one of the advantages of the tokens
> >> > approach being that the comparison is done only after whitespace and
> >> > newlines have been canonicalized (if -x-w or -x--ignore-eol-style is in
> >> > effect).  IIRC, the -bytes approach doesn't currently take advantage of
> >> > these -x flags?
> >> >
> >> > What is the practical effect of the fact that the -bytes approach doesn't
> >> > take advantage of these flags?  If a file (with a moderately long
> >> > history) has had all its newlines changed in rN, then I assume that your
> >> > -bytes optimizations will speed up all the diffs that 'blame' performs
> >> > on that file, except for the single diff between rN and its predecessor?
> >>
> >> Yes, I thought that would be an important advantage of the tokens
> >> approach, but as you suggest: the practical effect will most likely be
> >> limited. Indeed, only this single diff between rN and its predecessor
> >> (which may be 1 in 1000 diffs) will not benefit from
> >> prefix/suffix-optimization. Even if the rate of such changes is like
> >> 1/100th (let's say you change leading tabs to spaces, and vice versa,
> >> every 100 revisions), it will hardly be noticeable.
> >>
> >> The perfectionist in me still thinks: hey, you can also implement this
> >> normalization with the byte-based approach (in a streamy way). But
> >> that will probably be quite hard, because I'd have to take into
> >
> > We can add that additional optimization after the basic -bytes support
> > is working?  No need to implement all possible optimizations in one
> > shot (let alone those of them that might have little relative benefit).
> 
> Ok, that minor optimization can definitely wait.
> 
> 
> So the choice still boils down to:
> 1) A specific implementation inside diff_file.c (byte-based).
> 

Not a bad option, I think.  The improvement isn't as useful in
diff_memory.c, since that API isn't likely to be used for large
amounts of data anyway (e.g., trunk uses it only for property diffs).

> 2) A generic implementation for all implementors of svn_diff_fns_t,
> which still requires some work (get_previous_token), and imposes more
> constraints on svn_diff_fns_t implementors ...
> 

Might also be a not bad option... :-)

> -- 
> Johan

Re: diff-optimizations-tokens branch: I think I'm going to abandon it

Posted by Johan Corveleyn <jc...@gmail.com>.
On Wed, Dec 1, 2010 at 11:44 AM, Daniel Shahaf <d....@daniel.shahaf.name> wrote:
> Johan Corveleyn wrote on Wed, Dec 01, 2010 at 10:05:29 +0100:
>> On Wed, Dec 1, 2010 at 3:38 AM, Daniel Shahaf <d....@daniel.shahaf.name> wrote:
>> > Johan Corveleyn wrote on Wed, Dec 01, 2010 at 00:25:27 +0100:
>> >> I am now considering abandoning the tokens-approach, for the following reasons:
>> > ...
>> >> So, unless someone can convince me otherwise, I'm probably going to
>> >> stop with the token approach. Because of 2), I don't think it's worth
>> >> it to spend the effort needed for 1), especially because the
>> >> byte-based approach already works.
>> >>
>> >
>> > In other words, you're saying that the token-based approach: (b) won't be
>> > as fast as the bytes-based approach can be, and (a) requires much effort
>> > to be spent on implementing the reverse reading of tokens?  (i.e.,
>> > a backwards gets())
>>
>> Yes.
>>
>> The reverse reading is quite hard (in the diff_file.c case) because of
>> the chunked reading of files. A line may be split by a "chunk
>> boundary" (it may even be split in the middle of an eol sequence
>> (between \r and \n), and it all still needs to be
>> canonicalized/normalized correctly (that's why we'll also need a
>> reverse normalization function). The current forward get_next_token
>> does this very well, and the reverse should be analogous, but I just
>> find it hard to reason about, and to get it implemented correctly. It
>> will take me a lot of time, and with a high probability of introducing
>> subtle bugs for special edge cases.
>>
>
> OK, so a reverse get_next_token() could be tricky to implement.  But,
> worse, won't having it mean that implementors of svn_diff_fns_t can't
> have streamy access to their source?  Since they would be required to
> provide sometimes a token from the start and sometimes a token from the
> end...
>
> Well, it's not streamy in our usual sense of the word, but it's
> "double-streamy" (no one requests the middle byte until either all
> bytes before or all bytes after it were transmitted already)

Oooh, I hadn't considered other implementors besides the internal
diff_memory.c and diff_file.c. You're right, that would impose
additional constraints on other implementors. I don't know if being
non-streamy (or less streamy anyway) would be a problem ...

In my last commit on the -tokens branch, I added a flag "open_at_end"
to the datasource_open function (a function of svn_diff_fns_t), so the
datasource could be opened at the end. Also, other supporting
functions were needed: token_pushback_prefix, token_pushback_suffix
(to "push back" tokens that were read too far when scanning for
prefix/suffix) and the dreaded datasource_get_previous_token. Anyway,
the high-level idea was (rough sketch in code after the list):

1) Open datasources at the end.

2) Scan for identical suffix (getting tokens in reverse).

3) At first non-match: pushback suffix token, and note where suffix starts.

4) Open datasources at the beginning.

5) Scan for identical prefix (getting tokens normally, but without hash).

6) At first non-match: pushback prefix token.

7) Run the "old algorithm": getting tokens forward, but with hash. The
get_next_token function should stop (return NULL for token) when it
arrives at the suffix.
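
To make the steps concrete, here's a rough sketch of the suffix pass
(steps 1-3) in C. None of these names are the real svn_diff_fns_t API;
the struct, tokens_match() and the batons are made up purely for
illustration:

    /* Sketch only: mimics, but is not, the branch's actual vtable.
       Needs "svn_error.h" (svn_error_t, SVN_ERR, SVN_NO_ERROR) and
       "svn_types.h" (svn_boolean_t, TRUE). */
    typedef struct diff_fns_sketch_t
    {
      svn_error_t *(*datasource_open)(void *baton,
                                      svn_boolean_t open_at_end);
      svn_error_t *(*datasource_get_previous_token)(void **token,
                                                    void *baton);
      svn_error_t *(*token_pushback_suffix)(void *token, void *baton);
    } diff_fns_sketch_t;

    /* Hypothetical: TRUE iff the two (raw) tokens are equal. */
    static svn_boolean_t tokens_match(const void *a, const void *b);

    static svn_error_t *
    scan_identical_suffix(const diff_fns_sketch_t *fns,
                          void *baton_a,
                          void *baton_b)
    {
      void *token_a, *token_b;

      /* Steps 1 + 2: open both datasources at the end and walk
         backwards until the first pair of tokens that differs. */
      SVN_ERR(fns->datasource_open(baton_a, TRUE /* open_at_end */));
      SVN_ERR(fns->datasource_open(baton_b, TRUE /* open_at_end */));
      while (TRUE)
        {
          SVN_ERR(fns->datasource_get_previous_token(&token_a, baton_a));
          SVN_ERR(fns->datasource_get_previous_token(&token_b, baton_b));
          if (token_a == NULL || token_b == NULL
              || !tokens_match(token_a, token_b))
            break;
        }

      /* Step 3: the non-matching pair still belongs to the "middle",
         so hand it back to the datasources. */
      if (token_a)
        SVN_ERR(fns->token_pushback_suffix(token_a, baton_a));
      if (token_b)
        SVN_ERR(fns->token_pushback_suffix(token_b, baton_b));

      return SVN_NO_ERROR;
    }

Steps 4-7 would be the mirror image run forwards: datasource_open with
open_at_end == FALSE, get_next_token without hashing plus
token_pushback_prefix at the first mismatch, and then the normal
hashing pass that stops at the recorded suffix start.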


Sidestep: I just now realized that I probably don't need to have the
"reverse normalization algorithm" for implementing get_previous_token.
The call to the normalization function in get_next_token is (I think)
only needed to be able to calculate the hash. But since
get_previous_token doesn't need to calculate hashes, I may be able to
get by without normalization there. I'd only need to normalize inside
token_compare, and I *think* I can just do that "forwardly", instead
of backwards. Just thinking out loud here ...
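
Thinking out loud in code, the sidestep would look something like this
(a sketch: token_t and normalize_token() are made up, and error
handling is elided):

    #include <string.h>              /* memcmp */

    typedef struct token_t token_t;  /* opaque; made up for the sketch */

    /* Hypothetical: canonicalize TOKEN's raw bytes with the usual
       *forward* normalization pass (whitespace/eol, per the -x flags). */
    static svn_string_t *normalize_token(const token_t *token,
                                         apr_pool_t *pool);

    /* Compare by normalizing both raw tokens forwards into scratch
       memory; no reverse normalization is needed anywhere. */
    static svn_boolean_t
    token_compare_normalized(const token_t *a,
                             const token_t *b,
                             apr_pool_t *scratch_pool)
    {
      svn_string_t *norm_a = normalize_token(a, scratch_pool);
      svn_string_t *norm_b = normalize_token(b, scratch_pool);

      return norm_a->len == norm_b->len
             && memcmp(norm_a->data, norm_b->data, norm_a->len) == 0;
    }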

So: that makes the token approach a little more feasible again.
But do we want it? It requires a lot more from implementors of
svn_diff_fns_t. OTOH, it does offer a generic prefix/suffix
optimization to all implementors of svn_diff_fns_t ...

>> >> Any thoughts?
>> >>
>> >
>> > -tokens/BRANCH-README mentions one of the advantages of the tokens
>> > approach being that the comparison is done only after whitespace and
>> > newlines have been canonicalized (if -x-w or -x--ignore-eol-style is in
>> > effect).  IIRC, the -bytes approach doesn't currently take advantage of
>> > these -x flags?
>> >
>> > What is the practical effect of the fact the -bytes approach doesn't
>> > take advantage of these flags?  If a file (with a moderately long
>> > history) has had all its newlines changed in rN, then I assume that your
>> > -bytes optimizations will speed up all the diffs that 'blame' performs
>> > on that file, except for the single diff between rN and its predecessor?
>>
>> Yes, I thought that would be an important advantage of the tokens
>> approach, but as you suggest: the practical effect will most likely be
>> limited. Indeed, only this single diff between rN and its predecessor
>> (which may be 1 in 1000 diffs) will not benefit from
>> prefix/suffix-optimization. Even if the rate of such changes is like
>> 1/100th (let's say you change leading tabs to spaces, and vice versa,
>> every 100 revisions), it will hardly be noticeable.
>>
>> The perfectionist in me still thinks: hey, you can also implement this
>> normalization with the byte-based approach (in a streamy way). But
>> that will probably be quite hard, because I'd have to take into
>
> We can add that additional optimization after the basic -bytes support
> is working?  No need to implement all possible optimizations in one
> shot (let alone those of them that might have little relative benefit).

Ok, that minor optimization can definitely wait.


So the choice still boils down to:
1) A specific implementation inside diff_file.c (byte-based).

2) A generic implementation for all implementors of svn_diff_fns_t,
which still requires some work (get_previous_token), and imposes more
constraints on svn_diff_fns_t implementors ...
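
For perspective on option 1, its heart is only a few lines. A grossly
simplified sketch (it pretends both files fit in memory, which the
real chunked diff_file.c code cannot assume):

    /* Length of the identical prefix, byte by byte.  The suffix scan
       is the mirror image, stopping before it overlaps the prefix. */
    static apr_off_t
    identical_prefix_length(const char *buf_a, apr_off_t len_a,
                            const char *buf_b, apr_off_t len_b)
    {
      apr_off_t i = 0;
      apr_off_t min_len = (len_a < len_b) ? len_a : len_b;

      while (i < min_len && buf_a[i] == buf_b[i])
        i++;
      return i;
    }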

-- 
Johan

Re: diff-optimizations-tokens branch: I think I'm going to abandon it

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Johan Corveleyn wrote on Wed, Dec 01, 2010 at 10:05:29 +0100:
> On Wed, Dec 1, 2010 at 3:38 AM, Daniel Shahaf <d....@daniel.shahaf.name> wrote:
> > Johan Corveleyn wrote on Wed, Dec 01, 2010 at 00:25:27 +0100:
> >> I am now considering to abandon the tokens-approach, for the following reasons:
> > ...
> >> So, unless someone can convince me otherwise, I'm probably going to
> >> stop with the token approach. Because of 2), I don't think it's worth
> >> it to spend the effort needed for 1), especially because the
> >> byte-based approach already works.
> >>
> >
> > In other words, you're saying that the token-based approach: (b) won't be
> > as fast as the bytes-based approach can be, and (a) requires much effort
> > to be spent on implementing the reverse reading of tokens?  (i.e.,
> > a backwards gets())
> 
> Yes.
> 
> The reverse reading is quite hard (in the diff_file.c case) because of
> the chunked reading of files. A line may be split by a "chunk
> boundary" (it may even be split in the middle of an eol sequence
> (between \r and \n)), and it all still needs to be
> canonicalized/normalized correctly (that's why we'll also need a
> reverse normalization function). The current forward get_next_token
> does this very well, and the reverse should be analogous, but I just
> find it hard to reason about, and to get it implemented correctly. It
> will take me a lot of time, with a high probability of introducing
> subtle bugs for special edge cases.
> 

OK, so a reverse get_next_token() could be tricky to implement.  But,
worse, won't having it mean that implementors of svn_diff_fns_t can't
have streamy access to their source?  Since they would be required to
provide sometimes a token from the start and sometimes a token from the
end...

Well, it's not streamy in our usual sense of the word, but it's
"double-streamy" (no one requests the middle byte until either all
bytes before or all bytes after it were transmitted already)

> >> Any thoughts?
> >>
> >
> > -tokens/BRANCH-README mentions one of the advantages of the tokens
> > approach being that the comparison is done only after whitespace and
> > newlines have been canonicalized (if -x-w or -x--ignore-eol-style is in
> > effect).  IIRC, the -bytes approach doesn't currently take advantage of
> > these -x flags?
> >
> > What is the practical effect of the fact the -bytes approach doesn't
> > take advantage of these flags?  If a file (with a moderately long
> > history) has had all its newlines changed in rN, then I assume that your
> > -bytes optimizations will speed up all the diffs that 'blame' performs
> > on that file, except for the single diff between rN and its predecessor?
> 
> Yes, I thought that would be an important advantage of the tokens
> approach, but as you suggest: the practical effect will most likely be
> limited. Indeed, only this single diff between rN and its predecessor
> (which may be 1 in 1000 diffs) will not benefit from
> prefix/suffix-optimization. Even if the rate of such changes is like
> 1/100th (let's say you change leading tabs to spaces, and vice versa,
> every 100 revisions), it will hardly be noticeable.
> 
> The perfectionist in me still thinks: hey, you can also implement this
> normalization with the byte-based approach (in a streamy way). But
> that will probably be quite hard, because I'd have to take into

We can add that additional optimization after the basic -bytes support
is working?  No need to implement all possible optimizations in one
shot (let alone those of them that might have little relative benefit).

> account that I might have to read more bytes from one file than from
> the other file (right now I can always simply read a single byte from
> both files). And other than that, it will at least be "algorithm
> duplication": it will be a (streamy) re-implementation of the
> normalization algorithm that's already used in the token layer. So all
> in all it's probably not worth it ...
> 
> Cheers,
> -- 
> Johan

Re: diff-optimizations-tokens branch: I think I'm going to abandon it

Posted by Johan Corveleyn <jc...@gmail.com>.
On Wed, Dec 1, 2010 at 3:38 AM, Daniel Shahaf <d....@daniel.shahaf.name> wrote:
> Johan Corveleyn wrote on Wed, Dec 01, 2010 at 00:25:27 +0100:
>> I am now considering to abandon the tokens-approach, for the following reasons:
> ...
>> So, unless someone can convince me otherwise, I'm probably going to
>> stop with the token approach. Because of 2), I don't think it's worth
>> it to spend the effort needed for 1), especially because the
>> byte-based approach already works.
>>
>
> In other words, you're saying that the token-based approach: (b) won't be
> as fast as the bytes-based approach can be, and (a) requires much effort
> to be spent on implementing the reverse reading of tokens?  (i.e.,
> a backwards gets())

Yes.

The reverse reading is quite hard (in the diff_file.c case) because of
the chunked reading of files. A line may be split by a "chunk
boundary" (it may even be split in the middle of an eol sequence
(between \r and \n)), and it all still needs to be
canonicalized/normalized correctly (that's why we'll also need a
reverse normalization function). The current forward get_next_token
does this very well, and the reverse should be analogous, but I just
find it hard to reason about, and to get it implemented correctly. It
will take me a lot of time, with a high probability of introducing
subtle bugs for special edge cases.
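
To illustrate what "hard" means here: even at the byte level, stepping
backwards needs a refill dance at every chunk boundary. A toy sketch
(all names made up, not the actual diff_file.c internals):

    /* Made-up chunked-file state, just for this sketch. */
    typedef struct chunked_file_t
    {
      char *buffer;           /* the current chunk's data */
      char *curp;             /* current position inside the chunk */
      apr_off_t chunk_no;     /* which chunk is loaded */
      apr_size_t chunk_len;   /* its length */
    } chunked_file_t;

    /* Hypothetical: load chunk CHUNK_NO into FILE->buffer. */
    static svn_error_t *read_chunk(chunked_file_t *file,
                                   apr_off_t chunk_no,
                                   apr_pool_t *pool);

    /* Read one byte going backwards, refilling on chunk boundaries.
       The hard part for a reverse get_next_token sits on top of this:
       a line, or even a single \r\n pair, can straddle the refill and
       must still be normalized correctly. */
    static svn_error_t *
    read_byte_backwards(char *c,
                        chunked_file_t *file,
                        apr_pool_t *pool)
    {
      if (file->curp == file->buffer)     /* stepped off this chunk */
        {
          if (file->chunk_no == 0)
            {
              *c = '\0';                  /* beginning of file */
              return SVN_NO_ERROR;
            }
          file->chunk_no--;
          SVN_ERR(read_chunk(file, file->chunk_no, pool));
          file->curp = file->buffer + file->chunk_len;
        }

      *c = *--file->curp;
      return SVN_NO_ERROR;
    }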

>> Any thoughts?
>>
>
> -tokens/BRANCH-README mentions one of the advantages of the tokens
> approach being that the comparison is done only after whitespace and
> newlines have been canonicalized (if -x-w or -x--ignore-eol-style is in
> effect).  IIRC, the -bytes approach doesn't currently take advantage of
> these -x flags?
>
> What is the practical effect of the fact the -bytes approach doesn't
> take advantage of these flags?  If a file (with a moderately long
> history) has had all its newlines changed in rN, then I assume that your
> -bytes optimizations will speed up all the diffs that 'blame' performs
> on that file, except for the single diff between rN and its predecessor?

Yes, I thought that would be an important advantage of the tokens
approach, but as you suggest: the practical effect will most likely be
limited. Indeed, only this single diff between rN and its predecessor
(which may be 1 in 1000 diffs) will not benefit from
prefix/suffix-optimization. Even if the rate of such changes is like
1/100th (let's say you change leading tabs to spaces, and vice versa,
every 100 revisions), it will hardly be noticeable.

The perfectionist in me still thinks: hey, you can also implement this
normalization with the byte-based approach (in a streamy way). But
that will probably be quite hard, because I'd have to take into
account that I might have to read more bytes from one file than from
the other file (right now I can always simply read a single byte from
both files). And other than that, it will at least be "algorithm
duplication": it will be a (streamy) re-implementation of the
normalization algorithm that's already used in the token layer. So all
in all it's probably not worth it ...
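
(For the record, the asymmetry would look something like this -- a
sketch of an -x-w-style streamy comparison, not actual code. Each side
skips its own whitespace runs independently, so the two positions
advance at different rates:)

    /* Needs "apr_lib.h" for apr_isspace(). */
    static svn_boolean_t
    match_ignoring_whitespace(const char *buf_a, apr_size_t len_a,
                              const char *buf_b, apr_size_t len_b)
    {
      apr_size_t pos_a = 0, pos_b = 0;

      while (pos_a < len_a && pos_b < len_b)
        {
          if (apr_isspace(buf_a[pos_a])) { pos_a++; continue; }
          if (apr_isspace(buf_b[pos_b])) { pos_b++; continue; }
          if (buf_a[pos_a] != buf_b[pos_b])
            return FALSE;               /* a genuine difference */
          pos_a++;
          pos_b++;
        }
      /* leftover bytes on either side must all be whitespace */
      while (pos_a < len_a && apr_isspace(buf_a[pos_a])) pos_a++;
      while (pos_b < len_b && apr_isspace(buf_b[pos_b])) pos_b++;
      return pos_a == len_a && pos_b == len_b;
    }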

Cheers,
-- 
Johan