Posted to dev@subversion.apache.org by Morten Kloster <mo...@gmail.com> on 2011/05/18 01:56:29 UTC

[PATCH] Speed-up of libsvn_diff using token counts

Log message:

Speed-up of libsvn_diff using token counts
By using indices, not node pointers, to refer to tokens, and counting
the number of each token, the longest common subsequence (lcs)
algorithm at the heart of libsvn_diff becomes much faster in many
situations by skipping tokens that are unique to one source or the other.

* subversion/libsvn_diff/token.c
  (svn_diff__node_t): Added 'index' member.
  (svn_diff__tree_t): Added 'node_count' member.
  (svn_diff__get_token_count): New function.
  (tree_insert_token): Assigns indices to nodes.
  (svn_diff__get_tokens): Uses index, not pointer,
   to identify node in position struct.

* subversion/libsvn_diff/lcs.c
  (svn_diff__snake): Takes token counts as additional argument.
   Skips tokens that are unique to one or the other file.
  (svn_diff__lcs): Takes token counts and number of different tokens
   as additional arguments. Calculates number of tokens unique to
   each source and uses this to compute effective length difference.

* subversion/libsvn_diff/diff.h
  (svn_diff__position_t): Changed node member from being
   a pointer to the node to an integer index for the node.
 Updated function declarations.

* subversion/libsvn_diff/diff.c
* subversion/libsvn_diff/diff3.c
* subversion/libsvn_diff/diff4.c
  (svn_diff_diff_2, svn_diff_diff_3, svn_diff_diff_4): Count the
   number of times each token appears in each source.
  (svn_diff__resolve_conflict): Takes number of different tokens
   as additional argument. Counts the number of times
   each token appears in each (partial) source.






Comments:
This patch can reduce the time required for a diff by a large factor
when comparing long files with large differences. As an extreme case,
splitting sqlite.c into two roughly equal parts and diffing them against
each other took about a minute on my laptop with the original code,
and 7-8 seconds with my patch. The speed-up is gained only if a
large fraction of the changed tokens are unique to one source or the
other. I cannot see how the change would cause a significant
performance degradation in any reasonable case.

Counting the number of times each token appears is also useful for
potential additional features in the future, such as identifying moved
blocks outside of the longest common subsequence.

I have tested only svn_diff_diff_2 (verifying that it gives results
identical to the unpatched code), but the changes to svn_diff_diff_3
and svn_diff_diff_4 are completely analogous. Tested on a Windows 7
64-bit machine, using the TortoiseSVN build script.


Morten Kloster (first patch)

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Johan Corveleyn <jc...@gmail.com>.
On Thu, May 26, 2011 at 12:00 AM, Morten Kloster <mo...@gmail.com> wrote:
[ snip ]
> I have, however, done some significant testing today: I downloaded
> the first 100 revisions of merge.c in libsvn_client (merge.c is the
> largest code file in the current HEAD revision, with almost 800
> revisions), and ran diff of every one of those revisions against every
> other (as well as against the empty file, just in case). While
> the results weren't 100% identical for the original library and the
> patched library, the only differences were slightly different lcs's of
> equal quality (number of matches). Such differences are to be
> expected, since the patched version orders the files based on
> non-unique line length, whereas the original orders based on all
> lines - unique or not. Keep in mind that the official version also can
> give different results for file1 vs. file2 and file2 vs. file1, if the
> files have the same number of lines.

Oh yes, indeed. That could account for some differences (as opposed to
the original version, the roles of the files in the LCS algorithm
might now be reversed, because another one becomes the shortest
(modulo unique lines)). I think that's okay: it's still an LCS, so a
minimal diff, just a different one. It does make testing more
difficult, though, which is a bit of a complication ...

> The tests also highlighted the performance improvement
> (total running time for all ~10000 diffs):
>
> Original debug: Almost 30 minutes
> Original release: 6 minutes 13 secs
>
> Patched debug: 89 secs
> Patched release: 54 secs

Nice!

This test might be a bit extreme though, since you compare every
version against every other version (so even versions which are vastly
different, like the 1st against the 800th). In practice I think most
diffs will be between more similar revisions. But ok, it gives a good
general idea of what to expect.

I would be interested in similar timings based on diffs between every
consecutive revision, to see if it still has a big impact (that would
be similar to what an "svn blame" needs to do).

[ snip ]

>> Also, when looking at your patch I was first puzzled why you used a
>> signed integer for node->index, and unsigned integers for the
>> token/node counts. Especially since they are assigned to each other in
>> token.c:
>>
>> [[[
>> @@ -122,6 +134,7 @@ tree_insert_token(svn_diff__node_t **node, svn_dif
>>  new_node->right = NULL;
>>  new_node->hash = hash;
>>  new_node->token = token;
>> +  new_node->index = tree->node_count++;
>> ]]]
>>
>> Hm, tricky.
>>
>> Then I understood why, when I read your change to lcs.c#svn_diff__lcs,
>> where you make use of the signed-ness to come up with (fake) node
>> indices for the sentinel positions:
>>
>> [[[
>> -  /* These are never dereferenced, only compared by value, so we
>> -   * can safely fake these up and the void* cast is OK.
>> +  /* Negative indices will not be used elsewhere
>>   */
>> -  sentinel_position[0].node = (void*)&sentinel_position[0];
>> -  sentinel_position[1].node = (void*)&sentinel_position[1];
>> +  sentinel_position[0].node = -1;
>> +  sentinel_position[1].node = -2;
>> ]]]
>>
>> I'm not sure if that's really safe (because of your unsigned -> signed
>> assignment in token.c). Maybe there could be a better way to fake
>> these nodes for the sentinels? Can't think of any right now.
>>
>
> I'm sure that is safe, as long as the count doesn't exceed the max
> positive value for signed (which can never happen if we change the
> types to 64 bit for 64 bit machines, since the algorithm requires
> far more than 2 bytes of memory per count).

Okay. I'm thinking maybe it would be best to use signed ints all the
way then, since you can only ever use values up to the signed maximum
anyway. Not sure though; other people with more C experience might
have a different opinion on this ...

>>>> I have a couple of other changes ready:
>>>>
>>>> One is a minor simplification, by getting rid of the idx parameter in
>>>> lcs.c - there is a simpler way to accomplish what it's for.
>>
>> Simplicity is always good :-).
>>
>
> One effect of this would be that the diff of file1 vs. file2 could be
> different (different lcs of same quality) than for file2 vs. file1 even
> if the files have different numbers of lines. I hope this is not a
> problem?

Hm, I'm not sure about that. I guess it wouldn't really hurt, because
it would always still be an LCS, just a different one.

But isn't the LCS algorithm supposed to use the shortest file as file
A, and the longest as file B (according to the article, at least)? I
suppose that's not really required, but it might
change the performance characteristics if you don't, IIUC. Because the
worst-case time complexity is O(NP), with N the number of lines of the
longest, and P *the number of deletes going from A to B*.

If A is the shortest one, that gives us the best chance that P will be
small. If A is even a subsequence of B, the problem is solved in
linear time. But if you take the largest file for A, that might mean
that P becomes a lot larger as well.

So I'd try to keep the choice of the indexes the same as it is now:
based upon the lengths of both files, you always get the shortest one
in the role of A, and the longest in the role of B.

Or am I misunderstanding what you're trying to do here?

>>>> The other change is a partial implementation of the other possible
>>>> enhancement I mentioned earlier, which takes advantage of unique
>>>> matches to "restart" the lcs algorithm whenever possible, thus
>>>> reducing the effect of the square computational complexity. The
>>>> "partial" part is that this version is only really effective if the two
>>>> files have roughly the same number of (non-unique) lines; it will not
>>>> reduce the (P * length difference) part of the complexity. For good
>>>> cases, though, it can reduce the time required by the lcs algorithm
>>>> by large factors (roughly the number of well-separated sections of
>>>> difference, if they are all about the same size).
>>
>> Hm, I'm not sure how effective this would be in real-world cases. I
>> imagine that often such unique matches are interspersed with
>> non-unique matches (e.g. braces, empty lines, typical headers/footers
>> in blocks of otherwise unique structured text, ...), breaking the
>> "power" of those unique matches.
>>
>> What do you think?
>>
>
> There is no need for the unique matches to be consecutive: As long
> as there are no differences between the files in between the unique
> matches, they can be spread out over an arbitrary number of lines.
> Thus this approach should work fine as long as there are overall
> more unique matches than non-unique differences between the files.

Oh, okay then :-).

> When I applied my new test program to that version, however, I
> found a number of problems I had missed before. I've fixed most of
> them (it now gets through about half the diffs with very few errors),
> but there's at least one bug left. Stay tuned. :-)

Okay.

One thing I'm wary of at this time, is introducing heuristics in the
standard algorithm. I think heuristics to speed things up would be a
good thing at some point (at the cost of guaranteeing a minimal diff),
but I think it needs to be optional (or it can be the default, but the
user should still have the option to go for a "guaranteed minimal
diff").

I'm not sure which way you're going with this latest idea, but I just
mention this because you talked about introducing weights into the LCS
algorithm (for the unique matches). Maybe you can do this without
losing the guarantee of finding a *longest* common subsequence (but
then please explain to me why it's still guaranteed). If there is even
the slightest risk of mis-synchronizing and finding a sub-longest
common subsequence, I wouldn't go for it at this time (maybe
optionally as part of a --heuristics or --non-minimal option or something).

Cheers,
-- 
Johan

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Morten Kloster <mo...@gmail.com>.
On Wed, May 25, 2011 at 1:44 AM, Johan Corveleyn <jc...@gmail.com> wrote:
> On Tue, May 24, 2011 at 10:46 PM, Stefan Sperling <st...@elego.de> wrote:
>> On Tue, May 24, 2011 at 10:22:58PM +0200, Morten Kloster wrote:
>>> Johan / Stefan - any progress on the reviewing part?
>>
>> I haven't had time to review this, sorry.
>> It got a bit lost within the recent flurry of activity during and
>> after the Berlin hackathon.
>
> Yes, I too have been a bit wrapped up in other issues, not to mention
> my $dayjob, so sorry for the wait. I have gone through your patch and
> spotted some minor things, but couldn't yet completely verify the core
> of your change (the stuff in lcs.c).
>
>> I would also need some time to look at this in detail since my
>> working knowledge of the diff code is... suboptimal.
>
> Having worked on it a while ago, I know the diff code pretty well. But
> I'm not the most experienced svn developer :-). Hmmmm, seems
> complementary :-).
>
>> Note also that your change might be deemed a bit too experimental at
>> this point as the main focus right now is on stabilisation and bug
>> fixing for the upcoming 1.7 release.
>
> Yes. I'm also unsure on this point right now. Maybe when I've had the
> opportunity to analyze and test your patch I might have enough
> confidence ... but right now I'm not sure.
>
> It would help if you did some additional tests/checks yourself (or
> tell us what tests you have already done -- you mentioned having
> tested svn_diff_diff_2, but you didn't say how or how much).
>
> - Have you run svn's testsuite? (you mentioned TortoiseSVN build
> scripts, not sure if you can easily run them (look for win-tests.py))
>
> - A test I did regularly when I worked on the diff code: run "svn
> blame" on a large file with a long history (I used a 60,000 line file
> which had 2200 changes in my repository (unfortunately I can't share
> it)). This executes diff2 for every change. Compare the result of the
> original svn with your patched svn. It should be exactly the same.
> Also try different diff-options for these blame runs
> (ignore-whitespace, ignore-eol-style, ...).
>

No, it's all I can do to compile the libraries on my home computer;
the svn.exe won't run for some reason. So all my testing is through
my own C++ program using the library.

I have, however, done some significant testing today: I downloaded
the first 100 revisions of merge.c in libsvn_client (merge.c is the
largest code file in the current HEAD revision, with almost 800
revisions), and ran diff of every one of those revisions against every
other (as well as against the empty file, just in case). While
the results weren't 100% identical for the original library and the
patched library, the only differences were slightly different lcs's of
equal quality (number of matches). Such differences are to be
expected, since the patched version orders the files based on
non-unique line length, whereas the original orders based on all
lines - unique or not. Keep in mind that the official version also can
give different results for file1 vs. file2 and file2 vs. file1, if the
files have the same number of lines.

The tests also highlighted the performance improvement
(total running time for all ~10000 diffs):

Original debug: Almost 30 minutes
Original release: 6 minutes 13 secs

Patched debug: 89 secs
Patched release: 54 secs

The minimal difference between the patched debug and patched
release builds suggests that they spend most of their time outside the
lcs algorithm anyway.

> To test diff3 you could run an "svn update" on a locally modified
> file. Just "update -r<older revision>", make some changes in the file,
> and update again to let it merge the changes. Try this for some
> different cases relevant to your patch (e.g. file with lots of unique
> lines between the updates, with all unique lines, with no unique
> lines, large/small file, ...).

Difficult when svn.exe doesn't work... :-)
I might make tests for diff3 later.

>
> The diff4 code is not currently used in svn core, but we still like to
> keep it around and keep it working. There is one regression test for
> it in the testsuite. If that test passes, it should be OK, I think.
>
>>> Only possible error I'm aware of is that I made the indices/counts
>>> 32 bit signed/unsigned integers, which means it'll run into trouble
>>> if there are more than ~2 billion different tokens (lines) or more than
>>> ~4 billion occurrences of the same token. If the library should be
>>> able to handle such files, they should be made 64 bit (for the
>>> indices, it should be safe to keep them 32 bit on 32 bit machines,
>>> as those could never have the memory to run such diffs anyway).
>>
>> Extending this for 64bit platforms sounds ok to me.
>
> +1.
>
> Also, when looking at your patch I was first puzzled why you used a
> signed integer for node->index, and unsigned integers for the
> token/node counts. Especially since they are assigned to each other in
> token.c:
>
> [[[
> @@ -122,6 +134,7 @@ tree_insert_token(svn_diff__node_t **node, svn_dif
>  new_node->right = NULL;
>  new_node->hash = hash;
>  new_node->token = token;
> +  new_node->index = tree->node_count++;
> ]]]
>
> Hm, tricky.
>
> Then I understood why, when I read your change to lcs.c#svn_diff__lcs,
> where you make use of the signed-ness to come up with (fake) node
> indices for the sentinel positions:
>
> [[[
> -  /* These are never dereferenced, only compared by value, so we
> -   * can safely fake these up and the void* cast is OK.
> +  /* Negative indices will not be used elsewhere
>   */
> -  sentinel_position[0].node = (void*)&sentinel_position[0];
> -  sentinel_position[1].node = (void*)&sentinel_position[1];
> +  sentinel_position[0].node = -1;
> +  sentinel_position[1].node = -2;
> ]]]
>
> I'm not sure if that's really safe (because of your unsigned -> signed
> assignment in token.c). Maybe there could be a better way to fake
> these nodes for the sentinels? Can't think of any right now.
>

I'm sure that is safe, as long as the count doesn't exceed the max
positive value for signed (which can never happen if we change the
types to 64 bit for 64 bit machines, since the algorithm requires
far more than 2 bytes of memory per count).

>> We have similar blue sky limits elsewhere, e.g. svn patch uses a
>> 64bit line counter for files it is patching (on all platforms -- but
>> it never reads more than one line into memory).
>>
>>> I have a couple of other changes ready:
>>>
>>> One is a minor simplification, by getting rid of the idx parameter in
>>> lcs.c - there is a simpler way to accomplish what it's for.
>
> Simplicity is always good :-).
>

One effect of this would be that the diff of file1 vs. file2 could be
different (different lcs of same quality) than for file2 vs. file1 even
if the files have different numbers of lines. I hope this is not a
problem?

>>> The other change is a partial implementation of the other possible
>>> enhancement I mentioned earlier, which takes advantage of unique
>>> matches to "restart" the lcs algorithm whenever possible, thus
>>> reducing the effect of the square computational complexity. The
>>> "partial" part is that this version is only really effective if the two
>>> files have roughly the same number of (non-unique) lines; it will not
>>> reduce the (P * length difference) part of the complexity. For good
>>> cases, though, it can reduce the time required by the lcs algorithm
>>> by large factors (roughly the number of well-separated sections of
>>> difference, if they are all about the same size).
>
> Hm, I'm not sure how effective this would be in real-world cases. I
> imagine that often such unique matches are interspersed with
> non-unique matches (e.g. braces, empty lines, typical headers/footers
> in blocks of otherwise unique structured text, ...), breaking the
> "power" of those unique matches.
>
> What do you think?
>

There is no need for the unique matches to be consecutive: As long
as there are no differences between the files in between the unique
matches, they can be spread out over an arbitrary number of lines.
Thus this approach should work fine as long as there are overall
more unique matches than non-unique differences between the files.

When I applied my new test program to that version, however, I
found a number of problems I had missed before. I've fixed most of
them (it now gets through about half the diffs with very few errors),
but there's at least one bug left. Stay tuned. :-)

>>> The question is how I should post these changes. The idx change
>>> is relatively independent of the others; should I post a patch from
>>> current HEAD revision for that change alone, should I post it in this
>>> thread or a new one, and should I specify how that patch interacts
>>> with the other patches (and if so, where)? The other patch is based
>>> on my original patch; should I post a patch of the additional
>>> changes or a cumulative patch from current HEAD revision?
>>
>> I'd say try to break down your changes into a series of self-contained
>> patches that each solve a single problem, with one thread of each patch.
>>
>> It's ok for patches you submit to depend on one another. Just note the
>> dependency in the submission email.
>>
>> It's also fine for patches to overlap or even conflict if that helps to
>> keep them in smaller reviewable chunks. As some of them get committed
>> you can rebase your remaining work against HEAD and repost.
>
> +1.
>
> Cheers,
> --
> Johan
>

Thanks, both, for your comments.

Morten

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Johan Corveleyn <jc...@gmail.com>.
On Tue, May 24, 2011 at 10:46 PM, Stefan Sperling <st...@elego.de> wrote:
> On Tue, May 24, 2011 at 10:22:58PM +0200, Morten Kloster wrote:
>> Johan / Stefan - any progress on the reviewing part?
>
> I haven't had time to review this, sorry.
> It got a bit lost within the recent flurry of activity during and
> after the Berlin hackathon.

Yes, I too have been a bit wrapped up in other issues, not to mention
my $dayjob, so sorry for the wait. I have gone through your patch and
spotted some minor things, but couldn't yet completely verify the core
of your change (the stuff in lcs.c).

> I would also need some time to look at this in detail since my
> working knowledge of the diff code is... suboptimal.

Having worked on it a while ago, I know the diff code pretty well. But
I'm not the most experienced svn developer :-). Hmmmm, seems
complementary :-).

> Note also that your change might be deemed a bit too experimental at
> this point as the main focus right now is on stabilisation and bug
> fixing for the upcoming 1.7 release.

Yes. I'm also unsure on this point right now. Maybe when I've had the
opportunity to analyze and test your patch I might have enough
confidence ... but right now I'm not sure.

It would help if you did some additional tests/checks yourself (or
tell us what tests you have already done -- you mentioned having
tested svn_diff_diff_2, but you didn't say how or how much).

- Have you run svn's testsuite? (you mentioned TortoiseSVN build
scripts, not sure if you can easily run them (look for win-tests.py))

- A test I did regularly when I worked on the diff code: run "svn
blame" on a large file with a long history (I used a 60,000 line file
which had 2200 changes in my repository (unfortunately I can't share
it)). This executes diff2 for every change. Compare the result of the
original svn with your patched svn. It should be exactly the same.
Also try different diff-options for these blame runs
(ignore-whitespace, ignore-eol-style, ...).

To test diff3 you could run an "svn update" on a locally modified
file. Just "update -r<older revision>", make some changes in the file,
and update again to let it merge the changes. Try this for some
different cases relevant to your patch (e.g. file with lots of unique
lines between the updates, with all unique lines, with no unique
lines, large/small file, ...).

The diff4 code is not currently used in svn core, but we still like to
keep it around and keep it working. There is one regression test for
it in the testsuite. If that test passes, it should be OK, I think.

>> Only possible error I'm aware of is that I made the indices/counts
>> 32 bit signed/unsigned integers, which means it'll run into trouble
>> if there are more than ~2 billion different tokens (lines) or more than
>> ~4 billion occurrences of the same token. If the library should be
>> able to handle such files, they should be made 64 bit (for the
>> indices, it should be safe to keep them 32 bit on 32 bit machines,
>> as those could never have the memory to run such diffs anyway).
>
> Extending this for 64bit platforms sounds ok to me.

+1.

Also, when looking at your patch I was first puzzled why you used a
signed integer for node->index, and unsigned integers for the
token/node counts. Especially since they are assigned to each other in
token.c:

[[[
@@ -122,6 +134,7 @@ tree_insert_token(svn_diff__node_t **node, svn_dif
  new_node->right = NULL;
  new_node->hash = hash;
  new_node->token = token;
+  new_node->index = tree->node_count++;
]]]

Hm, tricky.

Then I understood why, when I read your change to lcs.c#svn_diff__lcs,
where you make use of the signed-ness to come up with (fake) node
indices for the sentinel positions:

[[[
-  /* These are never dereferenced, only compared by value, so we
-   * can safely fake these up and the void* cast is OK.
+  /* Negative indices will not be used elsewhere
   */
-  sentinel_position[0].node = (void*)&sentinel_position[0];
-  sentinel_position[1].node = (void*)&sentinel_position[1];
+  sentinel_position[0].node = -1;
+  sentinel_position[1].node = -2;
]]]

I'm not sure if that's really safe (because of your unsigned -> signed
assignment in token.c). Maybe there could be a better way to fake
these nodes for the sentinels? Can't think of any right now.

> We have similar blue sky limits elsewhere, e.g. svn patch uses a
> 64bit line counter for files it is patching (on all platforms -- but
> it never reads more than one line into memory).
>
>> I have a couple of other changes ready:
>>
>> One is a minor simplification, by getting rid of the idx parameter in
>> lcs.c - there is a simpler way to accomplish what it's for.

Simplicity is always good :-).

>> The other change is a partial implementation of the other possible
>> enhancement I mentioned earlier, which takes advantage of unique
>> matches to "restart" the lcs algorithm whenever possible, thus
>> reducing the effect of the square computational complexity. The
>> "partial" part is that this version is only really effective if the two
>> files have roughly the same number of (non-unique) lines; it will not
>> reduce the (P * length difference) part of the complexity. For good
>> cases, though, it can reduce the time required by the lcs algorithm
>> by large factors (roughly the number of well-separated sections of
>> difference, if they are all about the same size).

Hm, I'm not sure how effective this would be in real-world cases. I
imagine that often such unique matches are interspersed with
non-unique matches (e.g. braces, empty lines, typical headers/footers
in blocks of otherwise unique structured text, ...), breaking the
"power" of those unique matches.

What do you think?

>> The question is how I should post these changes. The idx change
>> is relatively independent of the others; should I post a patch from
>> current HEAD revision for that change alone, should I post it in this
>> thread or a new one, and should I specify how that patch interacts
>> with the other patches (and if so, where)? The other patch is based
>> on my original patch; should I post a patch of the additional
>> changes or a cumulative patch from current HEAD revision?
>
> I'd say try to break down your changes into a series of self-contained
> patches that each solve a single problem, with one thread of each patch.
>
> It's ok for patches you submit to depend on one another. Just note the
> dependency in the submission email.
>
> It's also fine for patches to overlap or even conflict if that helps to
> keep them in smaller reviewable chunks. As some of them get committed
> you can rebase your remaining work against HEAD and repost.

+1.

Cheers,
-- 
Johan

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Stefan Sperling <st...@elego.de>.
On Tue, May 24, 2011 at 10:22:58PM +0200, Morten Kloster wrote:
> Johan / Stefan - any progress on the reviewing part?

I haven't had time to review this, sorry.
It got a bit lost within the recent flurry of activity during and
after the Berlin hackathon.

I would also need some time to look at this in detail since my
working knowledge of the diff code is... suboptimal.

Note also that your change might be deemed a bit too experimental at
this point as the main focus right now is on stabilisation and bug
fixing for the upcoming 1.7 release.

> Only possible error I'm aware of is that I made the indices/counts
> 32 bit signed/unsigned integers, which means it'll run into trouble
> if there are more than ~2 billion different tokens (lines) or more than
> ~4 billion occurrences of the same token. If the library should be
> able to handle such files, they should be made 64 bit (for the
> indices, it should be safe to keep them 32 bit on 32 bit machines,
> as those could never have the memory to run such diffs anyway).

Extending this for 64bit platforms sounds ok to me.

We have similar blue sky limits elsewhere, e.g. svn patch uses a
64bit line counter for files it is patching (on all platforms -- but
it never reads more than one line into memory).

> I have a couple of other changes ready:
> 
> One is a minor simplification, by getting rid of the idx parameter in
> lcs.c - there is a simpler way to accomplish what it's for.
> 
> The other change is a partial implementation of the other possible
> enhancement I mentioned earlier, which takes advantage of unique
> matches to "restart" the lcs algorithm whenever possible, thus
> reducing the effect of the square computational complexity. The
> "partial" part is that this version is only really effective if the two
> files have roughly the same number of (non-unique) lines; it will not
> reduce the (P * length difference) part of the complexity. For good
> cases, though, it can reduce the time required by the lcs algorithm
> by large factors (roughly the number of well-separated sections of
> difference, if they are all about the same size).
> 
> The question is how I should post these changes. The idx change
> is relatively independent of the others; should I post a patch from
> current HEAD revision for that change alone, should I post it in this
> thread or a new one, and should I specify how that patch interacts
> with the other patches (and if so, where)? The other patch is based
> on my original patch; should I post a patch of the additional
> changes or a cumulative patch from current HEAD revision?

I'd say try to break down your changes into a series of self-contained
patches that each solve a single problem, with one thread of each patch.

It's ok for patches you submit to depend on one another. Just note the
dependency in the submission email.

It's also fine for patches to overlap or even conflict if that helps to
keep them in smaller reviewable chunks. As some of them get committed
you can rebase your remaining work against HEAD and repost.

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Morten Kloster <mo...@gmail.com>.
Johan / Stefan - any progress on the reviewing part?

Only possible error I'm aware of is that I made the indices/counts
32 bit signed/unsigned integers, which means it'll run into trouble
if there are more than ~2 billion different tokens (lines) or more than
~4 billion occurrences of the same token. If the library should be
able to handle such files, they should be made 64 bit (for the
indices, it should be safe to keep them 32 bit on 32 bit machines,
as those could never have the memory to run such diffs anyway).


I have a couple of other changes ready:

One is a minor simplification, by getting rid of the idx parameter in
lcs.c - there is a simpler way to accomplish what it's for.

The other change is a partial implementation of the other possible
enhancement I mentioned earlier, which takes advantage of unique
matches to "restart" the lcs algorithm whenever possible, thus
reducing the effect of the square computational complexity. The
"partial" part is that this version is only really effective if the two
files have roughly the same number of (non-unique) lines; it will not
reduce the (P * length difference) part of the complexity. For good
cases, though, it can reduce the time required by the lcs algorithm
by large factors (roughly the number of well-separated sections of
difference, if they are all about the same size).

The question is how I should post these changes. The idx change
is relatively independent of the others; should I post a patch from
current HEAD revision for that change alone, should I post it in this
thread or a new one, and should I specify how that patch interacts
with the other patches (and if so, where)? The other patch is based
on my original patch; should I post a patch of the additional
changes or a cumulative patch from current HEAD revision?

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Stefan Sperling <st...@elego.de>.
On Thu, May 19, 2011 at 01:44:56PM +0200, Morten Kloster wrote:
> Here is an attached copy of the patch.

This one looks good. Thank you!

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Morten Kloster <mo...@gmail.com>.
For some reason, a lot of the indentation gets messed up when posted
in the archive - it looks fine both in my sent email and in the copy I
received myself.

Here is an attached copy of the patch.


Morten

> On Thu, May 19, 2011 at 11:38 AM, Stefan Sperling <st...@elego.de> wrote:
>> On Wed, May 18, 2011 at 07:31:54PM +0200, Morten Kloster wrote:
>>> >
>>> > I'm attaching a version of the patch re-generated with -x-pw.
>>> >
>>> > [[[
>>> > Index: subversion/libsvn_diff/diff.c
>>> > ===================================================================
>>> > --- subversion/libsvn_diff/diff.c       (revision 1104340)
>>> > +++ subversion/libsvn_diff/diff.c       (working copy)
>>> > @@ -105,6 +105,10 @@ svn_diff_diff_2(svn_diff_t **diff,
>>> >  {
>>> >   svn_diff__tree_t *tree;
>>> >   svn_diff__position_t *position_list[2];
>>> > +  apr_uint32_t num_tokens;
>>> > +  apr_uint32_t token_index;
>>> > +  svn_diff__position_t *current;
>>> > +  apr_uint32_t *token_counts[2];
>>
>> The indentation above looks wrong and in some other places, too.
>>
>> Morten, your patch is very interesting. But I don't want to review it
>> as posted. Can you please provide a fresh patch that includes only
>> your functional changes and uses correct code formatting as per our
>> coding guidelines? See the community guide at
>> http://subversion.apache.org/docs/community-guide/conventions.html
>>
>> This would speed up processing your patch a lot. Any overhead spent
>> separating functional from non-functional changes and checking the
>> coding style is time better spent by the patch submitter than the
>> patch reviewers. It's a question of one person investing the time
>> vs. N people investing the time. If only one person spends this time
>> the community as a whole wins time. Thank you :)
>>
>

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Morten Kloster <mo...@gmail.com>.
Never mind, it looks like it's just my browser messing up things.

On Thu, May 19, 2011 at 2:28 PM, Daniel Shahaf <d....@daniel.shahaf.name> wrote:
> Did they?  I have not changed mutt/vim's configuration recently and I've
> sent many patches inline before.
>
> Morten Kloster wrote on Thu, May 19, 2011 at 14:14:03 +0200:
>> Yeah, it was the same problem that I had when I included the patch
>> directly in the email; the indentations got all messed up on the
>> archive.
>>
>> On Thu, May 19, 2011 at 2:07 PM, Daniel Shahaf <d....@daniel.shahaf.name> wrote:
>> > Morten Kloster wrote on Thu, May 19, 2011 at 13:38:56 +0200:
>> >> Here is a version without the whitespace changes in diff3.c. I have also kept
>> >> the original indentation level of the loop in lcs.c that I have wrapped in
>> >> another loop.
>> >>
>> >> The indentations in Daniel's version had little or nothing to do with
>> >> my patch. :)
>> >
>> > I generated my patch by applying yours and running 'svn di -x-pw'.
>> >
>

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Did they?  I have not changed mutt/vim's configuration recently and I've
sent many patches inline before.

Morten Kloster wrote on Thu, May 19, 2011 at 14:14:03 +0200:
> Yeah, it was the same problem that I had when I included the patch
> directly in the email; the indentations got all messed up on the
> archive.
> 
> On Thu, May 19, 2011 at 2:07 PM, Daniel Shahaf <d....@daniel.shahaf.name> wrote:
> > Morten Kloster wrote on Thu, May 19, 2011 at 13:38:56 +0200:
> >> Here is a version without the whitespace changes in diff3.c. I have also kept
> >> the original indentation level of the loop in lcs.c that I have wrapped in
> >> another loop.
> >>
> >> The indentations in Daniel's version had little or nothing to do with
> >> my patch. :)
> >
> > I generated my patch by applying yours and running 'svn di -x-pw'.
> >

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Morten Kloster <mo...@gmail.com>.
Yeah, it was the same problem that I had when I included the patch
directly in the email; the indentations got all messed up on the
archive.

On Thu, May 19, 2011 at 2:07 PM, Daniel Shahaf <d....@daniel.shahaf.name> wrote:
> Morten Kloster wrote on Thu, May 19, 2011 at 13:38:56 +0200:
>> Here is a version without the whitespace changes in diff3.c. I have also kept
>> the original indentation level of the loop in lcs.c that I have wrapped in
>> another loop.
>>
>> The indentations in Daniel's version had little or nothing to do with
>> my patch. :)
>
> I generated my patch by applying yours and running 'svn di -x-pw'.
>

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Morten Kloster wrote on Thu, May 19, 2011 at 13:38:56 +0200:
> Here is a version without the whitespace changes in diff3.c. I have also kept
> the original indentation level of the loop in lcs.c that I have wrapped in
> another loop.
> 
> The indentations in Daniel's version had little or nothing to do with
> my patch. :)

I generated my patch by applying yours and running 'svn di -x-pw'.

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Morten Kloster <mo...@gmail.com>.
Here is a version without the whitespace changes in diff3.c. I have also kept
the original indentation level of the loop in lcs.c that I have wrapped in
another loop.

The indentations in Daniel's version had little or nothing to do with
my patch. :)

[[[
Index: subversion/libsvn_diff/diff.c
===================================================================
--- subversion/libsvn_diff/diff.c	(revision 1104713)
+++ subversion/libsvn_diff/diff.c	(working copy)
@@ -105,6 +105,10 @@
 {
   svn_diff__tree_t *tree;
   svn_diff__position_t *position_list[2];
+  apr_uint32_t num_tokens;
+  apr_uint32_t token_index;
+  svn_diff__position_t *current;
+  apr_uint32_t *token_counts[2];
   svn_diff_datasource_e datasource[] = {svn_diff_datasource_original,
                                         svn_diff_datasource_modified};
   svn_diff__lcs_t *lcs;
@@ -138,6 +142,8 @@
                                prefix_lines,
                                subpool));

+  num_tokens = svn_diff__get_node_count(tree);
+
   /* The cool part is that we don't need the tokens anymore.
    * Allow the app to clean them up if it wants to.
    */
@@ -147,8 +153,35 @@
   /* We don't need the nodes in the tree either anymore, nor the tree itself */
   svn_pool_destroy(treepool);

+  token_counts[0] = apr_palloc(subpool, num_tokens * sizeof(**token_counts));
+  token_counts[1] = apr_palloc(subpool, num_tokens * sizeof(**token_counts));
+  for (token_index = 0; token_index < num_tokens; token_index++)
+    token_counts[0][token_index] = token_counts[1][token_index] = 0;
+
+  current = position_list[0];
+  if (current != NULL)
+    {
+      do
+        {
+          token_counts[0][current->node]++;
+          current = current->next;
+        }
+      while (current != position_list[0]);
+    }
+  current = position_list[1];
+  if (current != NULL)
+    {
+      do
+        {
+          token_counts[1][current->node]++;
+          current = current->next;
+        }
+      while (current != position_list[1]);
+    }
+
   /* Get the lcs */
-  lcs = svn_diff__lcs(position_list[0], position_list[1], prefix_lines,
+  lcs = svn_diff__lcs(position_list[0], position_list[1], token_counts[0],
+	                  token_counts[1], num_tokens, prefix_lines,
                       suffix_lines, subpool);

   /* Produce the diff */
Index: subversion/libsvn_diff/diff.h
===================================================================
--- subversion/libsvn_diff/diff.h	(revision 1104713)
+++ subversion/libsvn_diff/diff.h	(working copy)
@@ -62,7 +62,7 @@
 struct svn_diff__position_t
 {
   svn_diff__position_t *next;
-  svn_diff__node_t     *node;
+  apr_int32_t          node;
   apr_off_t             offset;
 };

@@ -91,12 +91,21 @@
 svn_diff__lcs_t *
 svn_diff__lcs(svn_diff__position_t *position_list1, /* pointer to tail (ring) */
               svn_diff__position_t *position_list2, /* pointer to tail (ring) */
+              apr_uint32_t *token_counts_list1, /* array of counts */
+              apr_uint32_t *token_counts_list2, /* array of counts */
+              apr_uint32_t num_tokens,
               apr_off_t prefix_lines,
               apr_off_t suffix_lines,
               apr_pool_t *pool);


 /*
+ * Returns number of tokens in a tree
+ */
+apr_uint32_t
+svn_diff__get_node_count(svn_diff__tree_t *tree);
+
+/*
  * Support functions to build a tree of token positions
  */
 void
@@ -128,6 +137,7 @@
 svn_diff__resolve_conflict(svn_diff_t *hunk,
                            svn_diff__position_t **position_list1,
                            svn_diff__position_t **position_list2,
+                           apr_uint32_t num_tokens,
                            apr_pool_t *pool);


Index: subversion/libsvn_diff/diff3.c
===================================================================
--- subversion/libsvn_diff/diff3.c	(revision 1104713)
+++ subversion/libsvn_diff/diff3.c	(working copy)
@@ -38,6 +38,7 @@
 svn_diff__resolve_conflict(svn_diff_t *hunk,
                            svn_diff__position_t **position_list1,
                            svn_diff__position_t **position_list2,
+                           apr_uint32_t num_tokens,
                            apr_pool_t *pool)
 {
     apr_off_t modified_start = hunk->modified_start + 1;
@@ -47,6 +48,9 @@
     apr_off_t latest_length = hunk->latest_length;
     svn_diff__position_t *start_position[2];
     svn_diff__position_t *position[2];
+    apr_uint32_t token_index;
+    svn_diff__position_t *current;
+    apr_uint32_t *token_counts[2];
     svn_diff__lcs_t *lcs = NULL;
     svn_diff__lcs_t **lcs_ref = &lcs;
     svn_diff_t **diff_ref = &hunk->resolved_diff;
@@ -173,9 +177,29 @@
         position[1]->next = start_position[1];
       }

-    *lcs_ref = svn_diff__lcs(position[0], position[1], 0, 0,
-                             subpool);
+    token_counts[0] = apr_palloc(subpool, num_tokens * sizeof(**token_counts));
+    token_counts[1] = apr_palloc(subpool, num_tokens * sizeof(**token_counts));
+    for (token_index = 0; token_index < num_tokens; token_index++)
+      token_counts[0][token_index] = token_counts[1][token_index] = 0;

+    current = position[0];
+    do
+      {
+        token_counts[0][current->node]++;
+        current = current->next;
+      }
+    while (current != position[0]);
+    current = position[1];
+    do
+      {
+        token_counts[1][current->node]++;
+        current = current->next;
+      }
+    while (current != position[1]);
+
+    *lcs_ref = svn_diff__lcs(position[0], position[1], token_counts[0],
+                             token_counts[1], num_tokens, 0, 0, subpool);
+
     /* Fix up the EOF lcs element in case one of
      * the two sequences was NULL.
      */
@@ -251,6 +275,10 @@
 {
   svn_diff__tree_t *tree;
   svn_diff__position_t *position_list[3];
+  apr_uint32_t num_tokens;
+  apr_uint32_t token_index;
+  svn_diff__position_t *current;
+  apr_uint32_t *token_counts[3];
   svn_diff_datasource_e datasource[] = {svn_diff_datasource_original,
                                         svn_diff_datasource_modified,
                                         svn_diff_datasource_latest};
@@ -292,6 +320,8 @@
                                prefix_lines,
                                subpool));

+  num_tokens = svn_diff__get_node_count(tree);
+
   /* Get rid of the tokens, we don't need them to calc the diff */
   if (vtable->token_discard_all != NULL)
     vtable->token_discard_all(diff_baton);
@@ -299,10 +329,52 @@
   /* We don't need the nodes in the tree either anymore, nor the tree itself */
   svn_pool_destroy(treepool);

+  token_counts[0] = apr_palloc(subpool, num_tokens * sizeof(**token_counts));
+  token_counts[1] = apr_palloc(subpool, num_tokens * sizeof(**token_counts));
+  token_counts[2] = apr_palloc(subpool, num_tokens * sizeof(**token_counts));
+  for (token_index = 0; token_index < num_tokens; token_index++)
+    {
+      token_counts[0][token_index] = token_counts[1][token_index] = 0;
+      token_counts[2][token_index] = 0;
+    }
+
+  current = position_list[0];
+  if (current != NULL)
+    {
+      do
+        {
+          token_counts[0][current->node]++;
+          current = current->next;
+        }
+      while (current != position_list[0]);
+    }
+  current = position_list[1];
+  if (current != NULL)
+    {
+      do
+        {
+          token_counts[1][current->node]++;
+          current = current->next;
+        }
+      while (current != position_list[1]);
+    }
+  current = position_list[2];
+  if (current != NULL)
+    {
+      do
+        {
+          token_counts[2][current->node]++;
+          current = current->next;
+        }
+      while (current != position_list[2]);
+    }
+
   /* Get the lcs for original-modified and original-latest */
-  lcs_om = svn_diff__lcs(position_list[0], position_list[1], prefix_lines,
+  lcs_om = svn_diff__lcs(position_list[0], position_list[1], token_counts[0],
+                         token_counts[1], num_tokens, prefix_lines,
                          suffix_lines, subpool);
-  lcs_ol = svn_diff__lcs(position_list[0], position_list[2], prefix_lines,
+  lcs_ol = svn_diff__lcs(position_list[0], position_list[2], token_counts[0],
+                         token_counts[2], num_tokens, prefix_lines,
                          suffix_lines, subpool);

   /* Produce a merged diff */
@@ -435,6 +507,7 @@
                 svn_diff__resolve_conflict(*diff_ref,
                                            &position_list[1],
                                            &position_list[2],
+                                           num_tokens,
                                            pool);
               }
             else if (is_modified)
Index: subversion/libsvn_diff/diff4.c
===================================================================
--- subversion/libsvn_diff/diff4.c	(revision 1104713)
+++ subversion/libsvn_diff/diff4.c	(working copy)
@@ -173,6 +173,10 @@
 {
   svn_diff__tree_t *tree;
   svn_diff__position_t *position_list[4];
+  apr_uint32_t num_tokens;
+  apr_uint32_t token_index;
+  svn_diff__position_t *current;
+  apr_uint32_t *token_counts[4];
   svn_diff_datasource_e datasource[] = {svn_diff_datasource_original,
                                         svn_diff_datasource_modified,
                                         svn_diff_datasource_latest,
@@ -227,6 +231,8 @@
                                prefix_lines,
                                subpool2));

+  num_tokens = svn_diff__get_node_count(tree);
+
   /* Get rid of the tokens, we don't need them to calc the diff */
   if (vtable->token_discard_all != NULL)
     vtable->token_discard_all(diff_baton);
@@ -234,8 +240,61 @@
   /* We don't need the nodes in the tree either anymore, nor the tree itself */
   svn_pool_clear(subpool3);

+  token_counts[0] = apr_palloc(subpool, num_tokens * sizeof(**token_counts));
+  token_counts[1] = apr_palloc(subpool, num_tokens * sizeof(**token_counts));
+  token_counts[2] = apr_palloc(subpool, num_tokens * sizeof(**token_counts));
+  token_counts[3] = apr_palloc(subpool, num_tokens * sizeof(**token_counts));
+  for (token_index = 0; token_index < num_tokens; token_index++)
+    {
+      token_counts[0][token_index] = token_counts[1][token_index] = 0;
+      token_counts[2][token_index] = token_counts[3][token_index] = 0;
+    }
+
+  current = position_list[0];
+  if (current != NULL)
+    {
+      do
+        {
+          token_counts[0][current->node]++;
+          current = current->next;
+        }
+      while (current != position_list[0]);
+    }
+  current = position_list[1];
+  if (current != NULL)
+    {
+      do
+        {
+          token_counts[1][current->node]++;
+          current = current->next;
+        }
+      while (current != position_list[1]);
+    }
+  current = position_list[2];
+  if (current != NULL)
+    {
+      do
+        {
+          token_counts[2][current->node]++;
+          current = current->next;
+        }
+      while (current != position_list[2]);
+    }
+  current = position_list[3];
+  if (current != NULL)
+    {
+      do
+        {
+          token_counts[3][current->node]++;
+          current = current->next;
+        }
+      while (current != position_list[3]);
+    }
+
   /* Get the lcs for original - latest */
-  lcs_ol = svn_diff__lcs(position_list[0], position_list[2], prefix_lines,
+  lcs_ol = svn_diff__lcs(position_list[0], position_list[2],
+                         token_counts[0], token_counts[2],
+                         num_tokens, prefix_lines,
                          suffix_lines, subpool3);
   diff_ol = svn_diff__diff(lcs_ol, 1, 1, TRUE, pool);

@@ -257,7 +316,9 @@
   /* Get the lcs for common ancestor - original
    * Do reverse adjustements
    */
-  lcs_adjust = svn_diff__lcs(position_list[3], position_list[2], prefix_lines,
+  lcs_adjust = svn_diff__lcs(position_list[3], position_list[2],
+                             token_counts[3], token_counts[2],
+                             num_tokens, prefix_lines,
                              suffix_lines, subpool3);
   diff_adjust = svn_diff__diff(lcs_adjust, 1, 1, FALSE, subpool3);
   adjust_diff(diff_ol, diff_adjust);
@@ -267,7 +328,9 @@
   /* Get the lcs for modified - common ancestor
    * Do forward adjustments
    */
-  lcs_adjust = svn_diff__lcs(position_list[1], position_list[3], prefix_lines,
+  lcs_adjust = svn_diff__lcs(position_list[1], position_list[3],
+                             token_counts[1], token_counts[3],
+                             num_tokens, prefix_lines,
                              suffix_lines, subpool3);
   diff_adjust = svn_diff__diff(lcs_adjust, 1, 1, FALSE, subpool3);
   adjust_diff(diff_ol, diff_adjust);
@@ -283,7 +346,7 @@
       if (hunk->type == svn_diff__type_conflict)
         {
           svn_diff__resolve_conflict(hunk, &position_list[1],
-                                     &position_list[2], pool);
+                                     &position_list[2], num_tokens, pool);
         }
     }

Index: subversion/libsvn_diff/lcs.c
===================================================================
--- subversion/libsvn_diff/lcs.c	(revision 1104713)
+++ subversion/libsvn_diff/lcs.c	(working copy)
@@ -51,6 +51,7 @@
 svn_diff__snake(apr_off_t k,
                 svn_diff__snake_t *fp,
                 int idx,
+                apr_uint32_t *token_counts[2],
                 svn_diff__lcs_t **freelist,
                 apr_pool_t *pool)
 {
@@ -92,6 +93,11 @@
     }


+  if (previous_lcs)
+    {
+      previous_lcs->refcount++;
+    }
+
   /* ### Optimization, skip all positions that don't have matchpoints
    * ### anyway. Beware of the sentinel, don't skip it!
    */
@@ -99,6 +105,8 @@
   position[0] = start_position[0];
   position[1] = start_position[1];

+  while (1)
+    {
   while (position[0]->node == position[1]->node)
     {
       position[0] = position[0]->next;
@@ -122,18 +130,19 @@
       lcs->length = position[1]->offset - start_position[1]->offset;
       lcs->next = previous_lcs;
       lcs->refcount = 1;
-      fp[k].lcs = lcs;
+      previous_lcs = lcs;
+      start_position[0] = position[0];
+      start_position[1] = position[1];
     }
-  else
-    {
-      fp[k].lcs = previous_lcs;
+      if (position[0]->node >= 0 && token_counts[1][position[0]->node] == 0)
+        start_position[0] = position[0] = position[0]->next;
+      else if (position[1]->node >= 0 && token_counts[0][position[1]->node] == 0)
+        start_position[1] = position[1] = position[1]->next;
+      else
+        break;
     }

-  if (previous_lcs)
-    {
-      previous_lcs->refcount++;
-    }
-
+  fp[k].lcs = previous_lcs;
   fp[k].position[0] = position[0];
   fp[k].position[1] = position[1];

@@ -188,12 +197,18 @@
 svn_diff__lcs_t *
 svn_diff__lcs(svn_diff__position_t *position_list1, /* pointer to tail (ring) */
               svn_diff__position_t *position_list2, /* pointer to tail (ring) */
+              apr_uint32_t *token_counts_list1, /* array of counts */
+              apr_uint32_t *token_counts_list2, /* array of counts */
+              apr_uint32_t num_tokens,
               apr_off_t prefix_lines,
               apr_off_t suffix_lines,
               apr_pool_t *pool)
 {
   int idx;
   apr_off_t length[2];
+  apr_uint32_t *token_counts[2];
+  apr_uint32_t unique_count[2];
+  apr_uint32_t token_index;
   svn_diff__snake_t *fp;
   apr_off_t d;
   apr_off_t k;
@@ -231,10 +246,19 @@
       return lcs;
     }

+  unique_count[1] = unique_count[0] = 0;
+  for (token_index = 0; token_index < num_tokens; token_index++)
+    {
+      if (token_counts_list1[token_index] == 0)
+        unique_count[1] += token_counts_list2[token_index];
+      if (token_counts_list2[token_index] == 0)
+        unique_count[0] += token_counts_list1[token_index];
+    }
+
   /* Calculate length of both sequences to be compared */
   length[0] = position_list1->offset - position_list1->next->offset + 1;
   length[1] = position_list2->offset - position_list2->next->offset + 1;
-  idx = length[0] > length[1] ? 1 : 0;
+  idx = length[0] - unique_count[0] > length[1] - unique_count[1] ? 1 : 0;

   /* strikerXXX: here we allocate the furthest point array, which is
    * strikerXXX: sized M + N + 3 (!)
@@ -246,18 +270,20 @@
   sentinel_position[idx].next = position_list1->next;
   position_list1->next = &sentinel_position[idx];
   sentinel_position[idx].offset = position_list1->offset + 1;
+  token_counts[idx] = token_counts_list1;

   sentinel_position[abs(1 - idx)].next = position_list2->next;
   position_list2->next = &sentinel_position[abs(1 - idx)];
   sentinel_position[abs(1 - idx)].offset = position_list2->offset + 1;
+  token_counts[abs(1 - idx)] = token_counts_list2;

-  /* These are never dereferenced, only compared by value, so we
-   * can safely fake these up and the void* cast is OK.
+  /* Negative indices will not be used elsewhere
    */
-  sentinel_position[0].node = (void*)&sentinel_position[0];
-  sentinel_position[1].node = (void*)&sentinel_position[1];
+  sentinel_position[0].node = -1;
+  sentinel_position[1].node = -2;

-  d = length[abs(1 - idx)] - length[idx];
+  d = length[abs(1 - idx)] - unique_count[abs(1 - idx)]
+      - (length[idx] - unique_count[idx]);

   /* k = -1 will be the first to be used to get previous
    * position information from, make sure it holds sane
@@ -272,12 +298,12 @@
       /* Forward */
       for (k = -p; k < d; k++)
         {
-          svn_diff__snake(k, fp, idx, &lcs_freelist, pool);
+          svn_diff__snake(k, fp, idx, token_counts, &lcs_freelist, pool);
         }

       for (k = d + p; k >= d; k--)
         {
-          svn_diff__snake(k, fp, idx, &lcs_freelist, pool);
+          svn_diff__snake(k, fp, idx, token_counts, &lcs_freelist, pool);
         }

       p++;
Index: subversion/libsvn_diff/token.c
===================================================================
--- subversion/libsvn_diff/token.c	(revision 1104713)
+++ subversion/libsvn_diff/token.c	(working copy)
@@ -46,6 +46,7 @@
   svn_diff__node_t     *right;

   apr_uint32_t          hash;
+  apr_int32_t           index;
   void                 *token;
 };

@@ -53,10 +54,20 @@
 {
   svn_diff__node_t     *root[SVN_DIFF__HASH_SIZE];
   apr_pool_t           *pool;
+  apr_uint32_t          node_count;
 };


 /*
+ * Returns number of tokens in a tree
+ */
+apr_uint32_t
+svn_diff__get_node_count(svn_diff__tree_t *tree)
+{
+  return tree->node_count;
+}
+
+/*
  * Support functions to build a tree of token positions
  */

@@ -65,6 +76,7 @@
 {
   *tree = apr_pcalloc(pool, sizeof(**tree));
   (*tree)->pool = pool;
+  (*tree)->node_count = 0;
 }


@@ -122,6 +134,7 @@
   new_node->right = NULL;
   new_node->hash = hash;
   new_node->token = token;
+  new_node->index = tree->node_count++;

   *node = *node_ref = new_node;

@@ -168,7 +181,7 @@
       /* Create a new position */
       position = apr_palloc(pool, sizeof(*position));
       position->next = NULL;
-      position->node = node;
+      position->node = node->index;
       position->offset = offset;

       *position_ref = position;
]]]

On Thu, May 19, 2011 at 11:38 AM, Stefan Sperling <st...@elego.de> wrote:
> On Wed, May 18, 2011 at 07:31:54PM +0200, Morten Kloster wrote:
>> >
>> > I'm attaching a version of the patch re-generated with -x-pw.
>> >
>> > [[[
>> > Index: subversion/libsvn_diff/diff.c
>> > ===================================================================
>> > --- subversion/libsvn_diff/diff.c       (revision 1104340)
>> > +++ subversion/libsvn_diff/diff.c       (working copy)
>> > @@ -105,6 +105,10 @@ svn_diff_diff_2(svn_diff_t **diff,
>> >  {
>> >   svn_diff__tree_t *tree;
>> >   svn_diff__position_t *position_list[2];
>> > +  apr_uint32_t num_tokens;
>> > +  apr_uint32_t token_index;
>> > +  svn_diff__position_t *current;
>> > +  apr_uint32_t *token_counts[2];
>
> The indentation above looks wrong and in some other places, too.
>
> Morten, your patch is very interesting. But I don't want to review it
> as posted. Can you please provide a fresh patch that includes only
> your functional changes and uses correct code formatting as per our
> coding guidelines? See the community guide at
> http://subversion.apache.org/docs/community-guide/conventions.html
>
> This would speed up processing your patch a lot. Any overhead spent
> separating functional from non-functional changes and checking the
> coding style is time better spent by the patch submitter than the
> patch reviewers. It's a question of one person investing the time
> vs. N people investing the time. If only one person spends this time
> the community as a whole wins time. Thank you :)
>

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Stefan Sperling <st...@elego.de>.
On Wed, May 18, 2011 at 07:31:54PM +0200, Morten Kloster wrote:
> >
> > I'm attaching a version of the patch re-generated with -x-pw.
> >
> > [[[
> > Index: subversion/libsvn_diff/diff.c
> > ===================================================================
> > --- subversion/libsvn_diff/diff.c       (revision 1104340)
> > +++ subversion/libsvn_diff/diff.c       (working copy)
> > @@ -105,6 +105,10 @@ svn_diff_diff_2(svn_diff_t **diff,
> >  {
> >   svn_diff__tree_t *tree;
> >   svn_diff__position_t *position_list[2];
> > +  apr_uint32_t num_tokens;
> > +  apr_uint32_t token_index;
> > +  svn_diff__position_t *current;
> > +  apr_uint32_t *token_counts[2];

The indentation above looks wrong and in some other places, too.

Morten, your patch is very interesting. But I don't want to review it
as posted. Can you please provide a fresh patch that includes only
your functional changes and uses correct code formatting as per our
coding guidelines? See the community guide at 
http://subversion.apache.org/docs/community-guide/conventions.html

This would speed up processing your patch a lot. Any overhead spent
separating functional from non-functional changes and checking the
coding style is time better spent by the patch submitter than the
patch reviewers. It's a question of one person investing the time
vs. N people investing the time. If only one person spends this time
the community as a whole wins time. Thank you :)

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Morten Kloster <mo...@gmail.com>.
On Wed, May 18, 2011 at 9:25 PM, Johan Corveleyn <jc...@gmail.com> wrote:
> On Wed, May 18, 2011 at 9:23 PM, Johan Corveleyn <jc...@gmail.com> wrote:
>> On Wed, May 18, 2011 at 7:31 PM, Morten Kloster <mo...@gmail.com> wrote:
>>> Thanks, Daniel.
>>>
>>> Johan:
>>> I've found your notes trunk/notes/diff-optimizations.txt. As you may
>>> have realized, my patch is a slightly different approach to the
>>> problem you discuss in section I.5, "Discarding non-matching
>>> lines before running the LCS (Longest Common Subsequence)
>>> algorithm." - rather than discarding the lines before calling LCS, I
>>> create count information that lets LCS skip the lines without
>>> counting them as (avoidable) changes. I believe my approach
>>> is easier to implement (not least now, when it's already done ;)
>>> and has less overhead, although the potential speed-up is
>>> somewhat less:
>>
>> Yes, I noticed the similarity :-). Your approach is a nice one, and
>> indeed easier to implement. With the discarding-approach, we'd have to
>> give each line an alternative line number (a "virtual line number", of
>> the file without its non-matching lines), then run the LCS on those
>> lines, and then when producing the diff output go back to using the
>> original line numbers. Or something like that.
>>
>> But I like your idea too, it sounds more light-weight, and might
>> provide most of the benefits.
>>
>>> LCS has running time of O( N + P D ), where N is #tokens in
>>> longest source, P is number of mismatches in smallest source
>>> and D is total number of mismatches in both sources.
>>> Discarding the non-matching tokens would replace both P and D
>>> by their discarded versions P' and D', giving a potential speed-up
>>> of the SQUARE of the inverse of the fraction of changed tokens
>>> that are not unique to one source, whereas my approach
>>> effectively only replaces P by P', not D, so you only get one
>>> power, not two (at least I think so; I haven't checked carefully).
>>
>> You seem to know your stuff :-).
>>
>> For me it's easier to understand by explaining D as "the length of the
>> shortest edit script (additions+deletes) from largest to smallest
>> source", and P as "the number of deleted tokens in that shortest edit
>> script". That's the terminology used in the article referred to by
>> lcs.c [1]. Or instead of "shortest edit script", I usually use the
>> term "minimal diff".
>>
>> I'm not sure about your calculation though. I should probably study
>> your patch first, because I haven't quite grokked it yet. But could
>> you explain why your approach doesn't reduce D (the deleted lines)?
>>
At least in some cases, you basically get the same behavior as for the
worst-case scenario below (where it's still N, not N'). That might not be
the case for a "typical" scenario, though - I'm not sure.

>> Also: the discarding-approach reduces N as well, just by making both
>> files smaller. In the "average case", where expected running time is
>> O(N + PD), that's probably not a big deal. But according to the
>> article the worst-case running time is O(NP). So reducing N also has a
>> big impact on the worst-case running time. Anyway, I speaking purely
>> theoretical here, I have no idea when this worst-case running time can
>> occur in practice.

Worst case-type scenario:

Source1: ababababababababababababababacccccccccccccccb
Source2: bbbababababababababababababababac

The best match is clearly to skip the first 3 b's in Source2. However, you will
be able to match almost all of Source2 against various parts of Source1
even if you skip none, or only 1 or 2, of those b's, which means the LCS will
take P*N time (actually (P+1)*N, but the complexity is the same).

Now insert a series of MANY d's after the first 10 tokens of Source1, and as
many e's after the first 10 tokens of Source2. With my code, you will have
to skip past each of those tokens for each of the 4 scans, so it's still P*N,
not P*N'.
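The skipping that still costs P*N here is the per-position test the patch adds to svn_diff__snake: a position whose token has a zero count in the other source can never match, so it is stepped over. A hedged stand-alone sketch of that test (plain C over flat arrays, not the patch's actual position-ring data structures; `skip_unique` is a hypothetical name):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Step over leading positions whose token never occurs in the other
 * source (count of zero), mirroring the skip added to svn_diff__snake.
 * Advances *i in place and returns how many positions were skipped. */
static size_t skip_unique(const int32_t *self, size_t len, size_t *i,
                          const uint32_t *counts_other)
{
  size_t skipped = 0;

  while (*i < len && counts_other[self[*i]] == 0)
    {
      (*i)++;
      skipped++;
    }
  return skipped;
}
```

In the worst case above, the long runs of d's and e's are unique to their own source, so every scan still has to walk past each of them one by one, even though none can match.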


>>
>>> There is another possible enhancement that is somewhat
>>> complementary to the one I've implemented: If the number of
>>> changed tokens is smaller than the number of uniquely
>>> matching tokens, and the changes are grouped in well-
>>> separated sections, then whenever you have found more
>>> unique matches than the number of changes you have "used",
>>> you can "restart" the problem from that point: you are
>>> guaranteed that there is no better match for the earlier
>>> part of either source.
>>
>> I'm not sure about that. I have to chew on it a little bit :-).
>>
>> Can you describe this idea in some more detail? Maybe give an
>> example or something?
>>

Source1: aa1234abab56789ababababababa
Source2: bb1234aabb56789aababbbaaabbb

Here, the numbers are unique matches (only occur once in each file).
After scanning the first 5 tokens of each file, you know that even though
you might be able to match the a's / b's with later parts of the other file,
that would cost you at least three unique matches, so that's just not an
option - you know that the LCS will match 123 with 123. Similarly, when
you get to 5678, that is more unique matches than you could possibly
gain by matching the files otherwise.

The easiest way to implement this might be to NOT use strict LCS, but
rather let unique matches get double weight - you count unmatched
unique matches in BOTH files. This should also give nicer diffs (similar to
Patience Diff, but less extreme). It should be possible to stick to strict
LCS, but that would make the code more complicated - you would have
to check if you have already accounted for that token in the other file.
(To get this enhancement, you have to count some cost for a mismatched
unique match the first time you skip the token, regardless of whether it's
in the longer or shorter source file.)

>>> This reduces the running time from
>>> O( N + P' D ) to something like O( N + sum(P'n Dn)), I think,
>>> where the sum is over the different changed sections. I
>>> haven't worked out the details on how to best implement
>>> this approach, but it should work well with my initial patch.
>
> Sorry, forgot the footnote I intended to add:
> [1] "An O(NP) Sequence Comparison Algorithm", Sun Wu, Udi Manber, Gene Myers
>
> --
> Johan
>

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Johan Corveleyn <jc...@gmail.com>.
On Wed, May 18, 2011 at 9:23 PM, Johan Corveleyn <jc...@gmail.com> wrote:
> On Wed, May 18, 2011 at 7:31 PM, Morten Kloster <mo...@gmail.com> wrote:
>> Thanks, Daniel.
>>
>> Johan:
>> I've found your notes trunk/notes/diff-optimizations.txt. As you may
>> have realized, my patch is a slightly different approach to the
>> problem you discuss in section I.5, "Discarding non-matching
>> lines before running the LCS (Longest Common Subsequence)
>> algorithm." - rather than discarding the lines before calling LCS, I
>> create count information that lets LCS skip the lines without
>> counting them as (avoidable) changes. I believe my approach
>> is easier to implement (not least now, when it's already done ;)
>> and has less overhead, although the potential speed-up is
>> somewhat less:
>
> Yes, I noticed the similarity :-). Your approach is a nice one, and
> indeed easier to implement. With the discarding-approach, we'd have to
> give each line an alternative line number (a "virtual line number", of
> the file without its non-matching lines), then run the LCS on those
> lines, and then when producing the diff output go back to using the
> original line numbers. Or something like that.
>
> But I like your idea too, it sounds more light-weight, and might
> provide most of the benefits.
>
>> LCS has running time of O( N + P D ), where N is #tokens in
>> longest source, P is number of mismatches in smallest source
>> and D is total number of mismatches in both sources.
>> Discarding the non-matching tokens would replace both P and D
>> by their discarded versions P' and D', giving a potential speed-up
>> of the SQUARE of the inverse of the fraction of changed tokens
>> that are not unique to one source, whereas my approach
>> effectively only replaces P by P', not D, so you only get one
>> power, not two (at least I think so; I haven't checked carefully).
>
> You seem to know your stuff :-).
>
> For me it's easier to understand by explaining D as "the length of the
> shortest edit script (additions+deletes) from largest to smallest
> source", and P as "the number of deleted tokens in that shortest edit
> script". That's the terminology used in the article referred to by
> lcs.c [1]. Or instead of "shortest edit script", I usually use the
> term "minimal diff".
>
> I'm not sure about your calculation though. I should probably study
> your patch first, because I haven't quite grokked it yet. But could
> you explain why your approach doesn't reduce D (the deleted lines)?
>
> Also: the discarding-approach reduces N as well, just by making both
> files smaller. In the "average case", where expected running time is
> O(N + PD), that's probably not a big deal. But according to the
> article the worst-case running time is O(NP). So reducing N also has a
> big impact on the worst-case running time. Anyway, I'm speaking purely
> theoretically here, I have no idea when this worst-case running time can
> occur in practice.
>
>> There is another possible enhancement that is somewhat
>> complementary to the one I've implemented: If the number of
>> changed tokens is smaller than the number of uniquely
>> matching tokens, and the changes are grouped in well-
>> separated sections, then whenever you have found more
>> unique matches than the number of changes you have "used",
>> you can "restart" the problem from that point: you are
>> guaranteed that there is no better match for the earlier
>> part of either source.
>
> I'm not sure about that. I have to chew on it a little bit :-).
>
> Can you describe this idea in some more detail? Maybe give an
> example or something?
>
>> This reduces the running time from
>> O( N + P' D ) to something like O( N + sum(P'n Dn)), I think,
>> where the sum is over the different changed sections. I
>> haven't worked out the details on how to best implement
>> this approach, but it should work well with my initial patch.

Sorry, forgot the footnote I intended to add:
[1] "An O(NP) Sequence Comparison Algorithm", Sun Wu, Udi Manber, Gene Myers

-- 
Johan

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Johan Corveleyn <jc...@gmail.com>.
On Wed, May 18, 2011 at 7:31 PM, Morten Kloster <mo...@gmail.com> wrote:
> Thanks, Daniel.
>
> Johan:
> I've found your notes trunk/notes/diff-optimizations.txt. As you may
> have realized, my patch is a slightly different approach to the
> problem you discuss in section I.5, "Discarding non-matching
> lines before running the LCS (Longest Common Subsequence)
> algorithm." - rather than discarding the lines before calling LCS, I
> create count information that lets LCS skip the lines without
> counting them as (avoidable) changes. I believe my approach
> is easier to implement (not least now, when it's already done ;)
> and has less overhead, although the potential speed-up is
> somewhat less:

Yes, I noticed the similarity :-). Your approach is a nice one, and
indeed easier to implement. With the discarding-approach, we'd have to
give each line an alternative line number (a "virtual line number", of
the file without its non-matching lines), then run the LCS on those
lines, and then when producing the diff output go back to using the
original line numbers. Or something like that.

But I like your idea too, it sounds more light-weight, and might
provide most of the benefits.

> LCS has running time of O( N + P D ), where N is #tokens in
> longest source, P is number of mismatches in smallest source
> and D is total number of mismatches in both sources.
> Discarding the non-matching tokens would replace both P and D
> by their discarded versions P' and D', giving a potential speed-up
> of the SQUARE of the inverse of the fraction of changed tokens
> that are not unique to one source, whereas my approach
> effectively only replaces P by P', not D, so you only get one
> power, not two (at least I think so; I haven't checked carefully).

You seem to know your stuff :-).

For me it's easier to understand by explaining D as "the length of the
shortest edit script (additions+deletes) from largest to smallest
source", and P as "the number of deleted tokens in that shortest edit
script". That's the terminology used in the article referred to by
lcs.c [1]. Or instead of "shortest edit script", I usually use the
term "minimal diff".

I'm not sure about your calculation though. I should probably study
your patch first, because I haven't quite grokked it yet. But could
you explain why your approach doesn't reduce D (the deleted lines)?

Also: the discarding-approach reduces N as well, just by making both
files smaller. In the "average case", where expected running time is
O(N + PD), that's probably not a big deal. But according to the
article the worst-case running time is O(NP). So reducing N also has a
big impact on the worst-case running time. Anyway, I'm speaking purely
theoretically here, I have no idea when this worst-case running time can
occur in practice.

> There is another possible enhancement that is somewhat
> complementary to the one I've implemented: If the number of
> changed tokens is smaller than the number of uniquely
> matching tokens, and the changes are grouped in well-
> separated sections, then whenever you have found more
> unique matches than the number of changes you have "used",
> you can "restart" the problem from that point: you are
> guaranteed that there is no better match for the earlier
> part of either source.

I'm not sure about that. I have to chew on it a little bit :-).

Can you describe this idea in some more detail? Maybe give an
example or something?

> This reduces the running time from
> O( N + P' D ) to something like O( N + sum(P'n Dn)), I think,
> where the sum is over the different changed sections. I
> haven't worked out the details on how to best implement
> this approach, but it should work well with my initial patch.

-- 
Johan

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Morten Kloster <mo...@gmail.com>.
Thanks, Daniel.

Johan:
I've found your notes trunk/notes/diff-optimizations.txt. As you may
have realized, my patch is a slightly different approach to the
problem you discuss in section I.5, "Discarding non-matching
lines before running the LCS (Longest Common Subsequence)
algorithm." - rather than discarding the lines before calling LCS, I
create count information that lets LCS skip the lines without
counting them as (avoidable) changes. I believe my approach
is easier to implement (not least now, when it's already done ;)
and has less overhead, although the potential speed-up is
somewhat less:

LCS has running time of O( N + P D ), where N is #tokens in
longest source, P is number of mismatches in smallest source
and D is total number of mismatches in both sources.
Discarding the non-matching tokens would replace both P and D
by their discarded versions P' and D', giving a potential speed-up
of the SQUARE of the inverse of the fraction of changed tokens
that are not unique to one source, whereas my approach
effectively only replaces P by P', not D, so you only get one
power, not two (at least I think so; I haven't checked carefully).

There is another possible enhancement that is somewhat
complementary to the one I've implemented: If the number of
changed tokens is smaller than the number of uniquely
matching tokens, and the changes are grouped in well-
separated sections, then whenever you have found more
unique matches than the number of changes you have "used",
you can "restart" the problem from that point: you are
guaranteed that there is no better match for the earlier
part of either source. This reduces the running time from
O( N + P' D ) to something like O( N + sum(P'n Dn)), I think,
where the sum is over the different changed sections. I
haven't worked out the details on how to best implement
this approach, but it should work well with my initial patch.


Morten

On Wed, May 18, 2011 at 11:01 AM, Daniel Shahaf <d....@daniel.shahaf.name> wrote:
> Please don't mix whitespace changes with functional changes.
>
> I'm attaching a version of the patch re-generated with -x-pw.
>
> [[[
> Index: subversion/libsvn_diff/diff.c
> ===================================================================
> --- subversion/libsvn_diff/diff.c       (revision 1104340)
> +++ subversion/libsvn_diff/diff.c       (working copy)
> @@ -105,6 +105,10 @@ svn_diff_diff_2(svn_diff_t **diff,
>  {
>   svn_diff__tree_t *tree;
>   svn_diff__position_t *position_list[2];
> +  apr_uint32_t num_tokens;
> +  apr_uint32_t token_index;
> +  svn_diff__position_t *current;
> +  apr_uint32_t *token_counts[2];
>   svn_diff_datasource_e datasource[] = {svn_diff_datasource_original,
>                                         svn_diff_datasource_modified};
>   svn_diff__lcs_t *lcs;
> @@ -138,6 +142,8 @@ svn_diff_diff_2(svn_diff_t **diff,
>                                prefix_lines,
>                                subpool));
>
> +  num_tokens = svn_diff__get_node_count(tree);
> +
>   /* The cool part is that we don't need the tokens anymore.
>    * Allow the app to clean them up if it wants to.
>    */
> @@ -147,8 +153,35 @@ svn_diff_diff_2(svn_diff_t **diff,
>   /* We don't need the nodes in the tree either anymore, nor the tree itself */
>   svn_pool_destroy(treepool);
>
> +  token_counts[0] = apr_palloc(subpool, num_tokens * sizeof(**token_counts));
> +  token_counts[1] = apr_palloc(subpool, num_tokens * sizeof(**token_counts));
> +  for (token_index = 0; token_index < num_tokens; token_index++)
> +    token_counts[0][token_index] = token_counts[1][token_index] = 0;
> +
> +  current = position_list[0];
> +  if (current != NULL)
> +    {
> +      do
> +        {
> +          token_counts[0][current->node]++;
> +          current = current->next;
> +        }
> +      while (current != position_list[0]);
> +    }
> +  current = position_list[1];
> +  if (current != NULL)
> +    {
> +      do
> +        {
> +          token_counts[1][current->node]++;
> +          current = current->next;
> +        }
> +      while (current != position_list[1]);
> +    }
> +
>   /* Get the lcs */
> -  lcs = svn_diff__lcs(position_list[0], position_list[1], prefix_lines,
> +  lcs = svn_diff__lcs(position_list[0], position_list[1], token_counts[0],
> +                         token_counts[1], num_tokens, prefix_lines,
>                       suffix_lines, subpool);
>
>   /* Produce the diff */
> Index: subversion/libsvn_diff/diff.h
> ===================================================================
> --- subversion/libsvn_diff/diff.h       (revision 1104340)
> +++ subversion/libsvn_diff/diff.h       (working copy)
> @@ -62,7 +62,7 @@ struct svn_diff_t {
>  struct svn_diff__position_t
>  {
>   svn_diff__position_t *next;
> -  svn_diff__node_t     *node;
> +  apr_int32_t          node;
>   apr_off_t             offset;
>  };
>
> @@ -91,12 +91,21 @@ typedef enum svn_diff__normalize_state_t
>  svn_diff__lcs_t *
>  svn_diff__lcs(svn_diff__position_t *position_list1, /* pointer to tail (ring) */
>               svn_diff__position_t *position_list2, /* pointer to tail (ring) */
> +              apr_uint32_t *token_counts_list1, /* array of counts */
> +              apr_uint32_t *token_counts_list2, /* array of counts */
> +              apr_uint32_t num_tokens,
>               apr_off_t prefix_lines,
>               apr_off_t suffix_lines,
>               apr_pool_t *pool);
>
>
>  /*
> + * Returns number of tokens in a tree
> + */
> +apr_uint32_t
> +svn_diff__get_node_count(svn_diff__tree_t *tree);
> +
> +/*
>  * Support functions to build a tree of token positions
>  */
>  void
> @@ -128,6 +137,7 @@ void
>  svn_diff__resolve_conflict(svn_diff_t *hunk,
>                            svn_diff__position_t **position_list1,
>                            svn_diff__position_t **position_list2,
> +                           apr_uint32_t num_tokens,
>                            apr_pool_t *pool);
>
>
> Index: subversion/libsvn_diff/token.c
> ===================================================================
> --- subversion/libsvn_diff/token.c      (revision 1104340)
> +++ subversion/libsvn_diff/token.c      (working copy)
> @@ -46,6 +46,7 @@ struct svn_diff__node_t
>   svn_diff__node_t     *right;
>
>   apr_uint32_t          hash;
> +  apr_int32_t           index;
>   void                 *token;
>  };
>
> @@ -53,10 +54,20 @@ struct svn_diff__tree_t
>  {
>   svn_diff__node_t     *root[SVN_DIFF__HASH_SIZE];
>   apr_pool_t           *pool;
> +  apr_uint32_t          node_count;
>  };
>
>
>  /*
> + * Returns number of tokens in a tree
> + */
> +apr_uint32_t
> +svn_diff__get_node_count(svn_diff__tree_t *tree)
> +{
> +  return tree->node_count;
> +}
> +
> +/*
>  * Support functions to build a tree of token positions
>  */
>
> @@ -65,6 +76,7 @@ svn_diff__tree_create(svn_diff__tree_t **tree, apr
>  {
>   *tree = apr_pcalloc(pool, sizeof(**tree));
>   (*tree)->pool = pool;
> +  (*tree)->node_count = 0;
>  }
>
>
> @@ -122,6 +134,7 @@ tree_insert_token(svn_diff__node_t **node, svn_dif
>   new_node->right = NULL;
>   new_node->hash = hash;
>   new_node->token = token;
> +  new_node->index = tree->node_count++;
>
>   *node = *node_ref = new_node;
>
> @@ -168,7 +181,7 @@ svn_diff__get_tokens(svn_diff__position_t **positi
>       /* Create a new position */
>       position = apr_palloc(pool, sizeof(*position));
>       position->next = NULL;
> -      position->node = node;
> +      position->node = node->index;
>       position->offset = offset;
>
>       *position_ref = position;
> Index: subversion/libsvn_diff/lcs.c
> ===================================================================
> --- subversion/libsvn_diff/lcs.c        (revision 1104340)
> +++ subversion/libsvn_diff/lcs.c        (working copy)
> @@ -51,6 +51,7 @@ static APR_INLINE void
>  svn_diff__snake(apr_off_t k,
>                 svn_diff__snake_t *fp,
>                 int idx,
> +                               apr_uint32_t *token_counts[2],
>                 svn_diff__lcs_t **freelist,
>                 apr_pool_t *pool)
>  {
> @@ -92,6 +93,11 @@ svn_diff__snake(apr_off_t k,
>     }
>
>
> +  if (previous_lcs)
> +    {
> +      previous_lcs->refcount++;
> +    }
> +
>   /* ### Optimization, skip all positions that don't have matchpoints
>    * ### anyway. Beware of the sentinel, don't skip it!
>    */
> @@ -99,6 +105,8 @@ svn_diff__snake(apr_off_t k,
>   position[0] = start_position[0];
>   position[1] = start_position[1];
>
> +  while (1)
> +    {
>   while (position[0]->node == position[1]->node)
>     {
>       position[0] = position[0]->next;
> @@ -122,18 +130,19 @@ svn_diff__snake(apr_off_t k,
>       lcs->length = position[1]->offset - start_position[1]->offset;
>       lcs->next = previous_lcs;
>       lcs->refcount = 1;
> -      fp[k].lcs = lcs;
> +          previous_lcs = lcs;
> +             start_position[0] = position[0];
> +             start_position[1] = position[1];
>     }
> +      if (position[0]->node >= 0 && token_counts[1][position[0]->node] == 0)
> +        start_position[0] = position[0] = position[0]->next;
> +      else if (position[1]->node >= 0 && token_counts[0][position[1]->node] == 0)
> +        start_position[1] = position[1] = position[1]->next;
>   else
> -    {
> -      fp[k].lcs = previous_lcs;
> +        break;
>     }
>
> -  if (previous_lcs)
> -    {
> -      previous_lcs->refcount++;
> -    }
> -
> +  fp[k].lcs = previous_lcs;
>   fp[k].position[0] = position[0];
>   fp[k].position[1] = position[1];
>
> @@ -188,12 +197,18 @@ prepend_lcs(svn_diff__lcs_t *lcs, apr_off_t lines,
>  svn_diff__lcs_t *
>  svn_diff__lcs(svn_diff__position_t *position_list1, /* pointer to tail (ring) */
>               svn_diff__position_t *position_list2, /* pointer to tail (ring) */
> +              apr_uint32_t *token_counts_list1, /* array of counts */
> +              apr_uint32_t *token_counts_list2, /* array of counts */
> +              apr_uint32_t num_tokens,
>               apr_off_t prefix_lines,
>               apr_off_t suffix_lines,
>               apr_pool_t *pool)
>  {
>   int idx;
>   apr_off_t length[2];
> +  apr_uint32_t *token_counts[2];
> +  apr_uint32_t unique_count[2];
> +  apr_uint32_t token_index;
>   svn_diff__snake_t *fp;
>   apr_off_t d;
>   apr_off_t k;
> @@ -231,10 +246,19 @@ svn_diff__lcs(svn_diff__position_t *position_list1
>       return lcs;
>     }
>
> +  unique_count[1] = unique_count[0] = 0;
> +  for (token_index = 0; token_index < num_tokens; token_index++)
> +    {
> +      if (token_counts_list1[token_index] == 0)
> +        unique_count[1] += token_counts_list2[token_index];
> +      if (token_counts_list2[token_index] == 0)
> +        unique_count[0] += token_counts_list1[token_index];
> +    }
> +
>   /* Calculate length of both sequences to be compared */
>   length[0] = position_list1->offset - position_list1->next->offset + 1;
>   length[1] = position_list2->offset - position_list2->next->offset + 1;
> -  idx = length[0] > length[1] ? 1 : 0;
> +  idx = length[0] - unique_count[0] > length[1] - unique_count[1] ? 1 : 0;
>
>   /* strikerXXX: here we allocate the furthest point array, which is
>    * strikerXXX: sized M + N + 3 (!)
> @@ -246,18 +270,20 @@ svn_diff__lcs(svn_diff__position_t *position_list1
>   sentinel_position[idx].next = position_list1->next;
>   position_list1->next = &sentinel_position[idx];
>   sentinel_position[idx].offset = position_list1->offset + 1;
> +  token_counts[idx] = token_counts_list1;
>
>   sentinel_position[abs(1 - idx)].next = position_list2->next;
>   position_list2->next = &sentinel_position[abs(1 - idx)];
>   sentinel_position[abs(1 - idx)].offset = position_list2->offset + 1;
> +  token_counts[abs(1 - idx)] = token_counts_list2;
>
> -  /* These are never dereferenced, only compared by value, so we
> -   * can safely fake these up and the void* cast is OK.
> +  /* Negative indices will not be used elsewhere
>    */
> -  sentinel_position[0].node = (void*)&sentinel_position[0];
> -  sentinel_position[1].node = (void*)&sentinel_position[1];
> +  sentinel_position[0].node = -1;
> +  sentinel_position[1].node = -2;
>
> -  d = length[abs(1 - idx)] - length[idx];
> +  d = length[abs(1 - idx)] - unique_count[abs(1 - idx)]
> +      - (length[idx] - unique_count[idx]);
>
>   /* k = -1 will be the first to be used to get previous
>    * position information from, make sure it holds sane
> @@ -272,12 +298,12 @@ svn_diff__lcs(svn_diff__position_t *position_list1
>       /* Forward */
>       for (k = -p; k < d; k++)
>         {
> -          svn_diff__snake(k, fp, idx, &lcs_freelist, pool);
> +          svn_diff__snake(k, fp, idx, token_counts, &lcs_freelist, pool);
>         }
>
>       for (k = d + p; k >= d; k--)
>         {
> -          svn_diff__snake(k, fp, idx, &lcs_freelist, pool);
> +          svn_diff__snake(k, fp, idx, token_counts, &lcs_freelist, pool);
>         }
>
>       p++;
> Index: subversion/libsvn_diff/diff3.c
> ===================================================================
> --- subversion/libsvn_diff/diff3.c      (revision 1104340)
> +++ subversion/libsvn_diff/diff3.c      (working copy)
> @@ -38,6 +38,7 @@ void
>  svn_diff__resolve_conflict(svn_diff_t *hunk,
>                            svn_diff__position_t **position_list1,
>                            svn_diff__position_t **position_list2,
> +                           apr_uint32_t num_tokens,
>                            apr_pool_t *pool)
>  {
>     apr_off_t modified_start = hunk->modified_start + 1;
> @@ -47,6 +48,9 @@ svn_diff__resolve_conflict(svn_diff_t *hunk,
>     apr_off_t latest_length = hunk->latest_length;
>     svn_diff__position_t *start_position[2];
>     svn_diff__position_t *position[2];
> +  apr_uint32_t token_index;
> +  svn_diff__position_t *current;
> +  apr_uint32_t *token_counts[2];
>     svn_diff__lcs_t *lcs = NULL;
>     svn_diff__lcs_t **lcs_ref = &lcs;
>     svn_diff_t **diff_ref = &hunk->resolved_diff;
> @@ -173,9 +177,29 @@ svn_diff__resolve_conflict(svn_diff_t *hunk,
>         position[1]->next = start_position[1];
>       }
>
> -    *lcs_ref = svn_diff__lcs(position[0], position[1], 0, 0,
> -                             subpool);
> +  token_counts[0] = apr_palloc(subpool, num_tokens * sizeof(**token_counts));
> +  token_counts[1] = apr_palloc(subpool, num_tokens * sizeof(**token_counts));
> +  for (token_index = 0; token_index < num_tokens; token_index++)
> +    token_counts[0][token_index] = token_counts[1][token_index] = 0;
>
> +  current = position[0];
> +  do
> +    {
> +      token_counts[0][current->node]++;
> +      current = current->next;
> +    }
> +  while (current != position[0]);
> +  current = position[1];
> +  do
> +    {
> +      token_counts[1][current->node]++;
> +      current = current->next;
> +    }
> +  while (current != position[1]);
> +
> +  *lcs_ref = svn_diff__lcs(position[0], position[1], token_counts[0],
> +                           token_counts[1], num_tokens, 0, 0, subpool);
> +
>     /* Fix up the EOF lcs element in case one of
>      * the two sequences was NULL.
>      */
> @@ -251,6 +275,10 @@ svn_diff_diff3_2(svn_diff_t **diff,
>  {
>   svn_diff__tree_t *tree;
>   svn_diff__position_t *position_list[3];
> +  apr_uint32_t num_tokens;
> +  apr_uint32_t token_index;
> +  svn_diff__position_t *current;
> +  apr_uint32_t *token_counts[3];
>   svn_diff_datasource_e datasource[] = {svn_diff_datasource_original,
>                                         svn_diff_datasource_modified,
>                                         svn_diff_datasource_latest};
> @@ -292,6 +320,8 @@ svn_diff_diff3_2(svn_diff_t **diff,
>                                prefix_lines,
>                                subpool));
>
> +  num_tokens = svn_diff__get_node_count(tree);
> +
>   /* Get rid of the tokens, we don't need them to calc the diff */
>   if (vtable->token_discard_all != NULL)
>     vtable->token_discard_all(diff_baton);
> @@ -299,10 +329,52 @@ svn_diff_diff3_2(svn_diff_t **diff,
>   /* We don't need the nodes in the tree either anymore, nor the tree itself */
>   svn_pool_destroy(treepool);
>
> +  token_counts[0] = apr_palloc(subpool, num_tokens * sizeof(**token_counts));
> +  token_counts[1] = apr_palloc(subpool, num_tokens * sizeof(**token_counts));
> +  token_counts[2] = apr_palloc(subpool, num_tokens * sizeof(**token_counts));
> +  for (token_index = 0; token_index < num_tokens; token_index++)
> +    {
> +      token_counts[0][token_index] = token_counts[1][token_index] = 0;
> +      token_counts[2][token_index] = 0;
> +    }
> +
> +  current = position_list[0];
> +  if (current != NULL)
> +    {
> +      do
> +        {
> +          token_counts[0][current->node]++;
> +          current = current->next;
> +        }
> +      while (current != position_list[0]);
> +    }
> +  current = position_list[1];
> +  if (current != NULL)
> +    {
> +      do
> +        {
> +          token_counts[1][current->node]++;
> +          current = current->next;
> +        }
> +      while (current != position_list[1]);
> +    }
> +  current = position_list[2];
> +  if (current != NULL)
> +    {
> +      do
> +        {
> +          token_counts[2][current->node]++;
> +          current = current->next;
> +        }
> +      while (current != position_list[2]);
> +    }
> +
>   /* Get the lcs for original-modified and original-latest */
> -  lcs_om = svn_diff__lcs(position_list[0], position_list[1], prefix_lines,
> +  lcs_om = svn_diff__lcs(position_list[0], position_list[1], token_counts[0],
> +                         token_counts[1], num_tokens, prefix_lines,
>                          suffix_lines, subpool);
> -  lcs_ol = svn_diff__lcs(position_list[0], position_list[2], prefix_lines,
> +  lcs_ol = svn_diff__lcs(position_list[0], position_list[2], token_counts[0],
> +                         token_counts[2], num_tokens, prefix_lines,
>                          suffix_lines, subpool);
>
>   /* Produce a merged diff */
> @@ -435,6 +507,7 @@ svn_diff_diff3_2(svn_diff_t **diff,
>                 svn_diff__resolve_conflict(*diff_ref,
>                                            &position_list[1],
>                                            &position_list[2],
> +                                           num_tokens,
>                                            pool);
>               }
>             else if (is_modified)
> Index: subversion/libsvn_diff/diff4.c
> ===================================================================
> --- subversion/libsvn_diff/diff4.c      (revision 1104340)
> +++ subversion/libsvn_diff/diff4.c      (working copy)
> @@ -173,6 +173,10 @@ svn_diff_diff4_2(svn_diff_t **diff,
>  {
>   svn_diff__tree_t *tree;
>   svn_diff__position_t *position_list[4];
> +  apr_uint32_t num_tokens;
> +  apr_uint32_t token_index;
> +  svn_diff__position_t *current;
> +  apr_uint32_t *token_counts[4];
>   svn_diff_datasource_e datasource[] = {svn_diff_datasource_original,
>                                         svn_diff_datasource_modified,
>                                         svn_diff_datasource_latest,
> @@ -227,6 +231,8 @@ svn_diff_diff4_2(svn_diff_t **diff,
>                                prefix_lines,
>                                subpool2));
>
> +  num_tokens = svn_diff__get_node_count(tree);
> +
>   /* Get rid of the tokens, we don't need them to calc the diff */
>   if (vtable->token_discard_all != NULL)
>     vtable->token_discard_all(diff_baton);
> @@ -234,8 +240,61 @@ svn_diff_diff4_2(svn_diff_t **diff,
>   /* We don't need the nodes in the tree either anymore, nor the tree itself */
>   svn_pool_clear(subpool3);
>
> +  token_counts[0] = apr_palloc(subpool, num_tokens * sizeof(**token_counts));
> +  token_counts[1] = apr_palloc(subpool, num_tokens * sizeof(**token_counts));
> +  token_counts[2] = apr_palloc(subpool, num_tokens * sizeof(**token_counts));
> +  token_counts[3] = apr_palloc(subpool, num_tokens * sizeof(**token_counts));
> +  for (token_index = 0; token_index < num_tokens; token_index++)
> +    {
> +      token_counts[0][token_index] = token_counts[1][token_index] = 0;
> +      token_counts[2][token_index] = token_counts[3][token_index] = 0;
> +    }
> +
> +  current = position_list[0];
> +  if (current != NULL)
> +    {
> +      do
> +        {
> +          token_counts[0][current->node]++;
> +          current = current->next;
> +        }
> +      while (current != position_list[0]);
> +    }
> +  current = position_list[1];
> +  if (current != NULL)
> +    {
> +      do
> +        {
> +          token_counts[1][current->node]++;
> +          current = current->next;
> +        }
> +      while (current != position_list[1]);
> +    }
> +  current = position_list[2];
> +  if (current != NULL)
> +    {
> +      do
> +        {
> +          token_counts[2][current->node]++;
> +          current = current->next;
> +        }
> +      while (current != position_list[2]);
> +    }
> +  current = position_list[3];
> +  if (current != NULL)
> +    {
> +      do
> +        {
> +          token_counts[3][current->node]++;
> +          current = current->next;
> +        }
> +      while (current != position_list[3]);
> +    }
> +
>   /* Get the lcs for original - latest */
> -  lcs_ol = svn_diff__lcs(position_list[0], position_list[2], prefix_lines,
> +  lcs_ol = svn_diff__lcs(position_list[0], position_list[2],
> +                         token_counts[0], token_counts[2],
> +                         num_tokens, prefix_lines,
>                          suffix_lines, subpool3);
>   diff_ol = svn_diff__diff(lcs_ol, 1, 1, TRUE, pool);
>
> @@ -257,7 +316,9 @@ svn_diff_diff4_2(svn_diff_t **diff,
>   /* Get the lcs for common ancestor - original
>    * Do reverse adjustements
>    */
> -  lcs_adjust = svn_diff__lcs(position_list[3], position_list[2], prefix_lines,
> +  lcs_adjust = svn_diff__lcs(position_list[3], position_list[2],
> +                             token_counts[3], token_counts[2],
> +                             num_tokens, prefix_lines,
>                              suffix_lines, subpool3);
>   diff_adjust = svn_diff__diff(lcs_adjust, 1, 1, FALSE, subpool3);
>   adjust_diff(diff_ol, diff_adjust);
> @@ -267,7 +328,9 @@ svn_diff_diff4_2(svn_diff_t **diff,
>   /* Get the lcs for modified - common ancestor
>    * Do forward adjustments
>    */
> -  lcs_adjust = svn_diff__lcs(position_list[1], position_list[3], prefix_lines,
> +  lcs_adjust = svn_diff__lcs(position_list[1], position_list[3],
> +                             token_counts[1], token_counts[3],
> +                             num_tokens, prefix_lines,
>                              suffix_lines, subpool3);
>   diff_adjust = svn_diff__diff(lcs_adjust, 1, 1, FALSE, subpool3);
>   adjust_diff(diff_ol, diff_adjust);
> @@ -283,7 +346,7 @@ svn_diff_diff4_2(svn_diff_t **diff,
>       if (hunk->type == svn_diff__type_conflict)
>         {
>           svn_diff__resolve_conflict(hunk, &position_list[1],
> -                                     &position_list[2], pool);
> +                                     &position_list[2], num_tokens, pool);
>         }
>     }
>
> ]]]
>

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Please don't mix whitespace changes with functional changes.

I'm attaching a version of the patch re-generated with -x-pw.

[[[
Index: subversion/libsvn_diff/diff.c
===================================================================
--- subversion/libsvn_diff/diff.c	(revision 1104340)
+++ subversion/libsvn_diff/diff.c	(working copy)
@@ -105,6 +105,10 @@ svn_diff_diff_2(svn_diff_t **diff,
 {
   svn_diff__tree_t *tree;
   svn_diff__position_t *position_list[2];
+  apr_uint32_t num_tokens;
+  apr_uint32_t token_index;
+  svn_diff__position_t *current;
+  apr_uint32_t *token_counts[2];
   svn_diff_datasource_e datasource[] = {svn_diff_datasource_original,
                                         svn_diff_datasource_modified};
   svn_diff__lcs_t *lcs;
@@ -138,6 +142,8 @@ svn_diff_diff_2(svn_diff_t **diff,
                                prefix_lines,
                                subpool));
 
+  num_tokens = svn_diff__get_node_count(tree);
+
   /* The cool part is that we don't need the tokens anymore.
    * Allow the app to clean them up if it wants to.
    */
@@ -147,8 +153,35 @@ svn_diff_diff_2(svn_diff_t **diff,
   /* We don't need the nodes in the tree either anymore, nor the tree itself */
   svn_pool_destroy(treepool);
 
+  token_counts[0] = apr_palloc(subpool, num_tokens * sizeof(**token_counts));
+  token_counts[1] = apr_palloc(subpool, num_tokens * sizeof(**token_counts));
+  for (token_index = 0; token_index < num_tokens; token_index++)
+    token_counts[0][token_index] = token_counts[1][token_index] = 0;
+
+  current = position_list[0];
+  if (current != NULL)
+    {
+      do
+        {
+          token_counts[0][current->node]++;
+          current = current->next;
+        }
+      while (current != position_list[0]);
+    }
+  current = position_list[1];
+  if (current != NULL)
+    {
+      do
+        {
+          token_counts[1][current->node]++;
+          current = current->next;
+        }
+      while (current != position_list[1]);
+    }
+
   /* Get the lcs */
-  lcs = svn_diff__lcs(position_list[0], position_list[1], prefix_lines,
+  lcs = svn_diff__lcs(position_list[0], position_list[1], token_counts[0],
+                      token_counts[1], num_tokens, prefix_lines,
                       suffix_lines, subpool);
 
   /* Produce the diff */
Index: subversion/libsvn_diff/diff.h
===================================================================
--- subversion/libsvn_diff/diff.h	(revision 1104340)
+++ subversion/libsvn_diff/diff.h	(working copy)
@@ -62,7 +62,7 @@ struct svn_diff_t {
 struct svn_diff__position_t
 {
   svn_diff__position_t *next;
-  svn_diff__node_t     *node;
+  apr_int32_t          node;
   apr_off_t             offset;
 };
 
@@ -91,12 +91,21 @@ typedef enum svn_diff__normalize_state_t
 svn_diff__lcs_t *
 svn_diff__lcs(svn_diff__position_t *position_list1, /* pointer to tail (ring) */
               svn_diff__position_t *position_list2, /* pointer to tail (ring) */
+              apr_uint32_t *token_counts_list1, /* array of counts */
+              apr_uint32_t *token_counts_list2, /* array of counts */
+              apr_uint32_t num_tokens,
               apr_off_t prefix_lines,
               apr_off_t suffix_lines,
               apr_pool_t *pool);
 
 
 /*
+ * Returns number of tokens in a tree
+ */
+apr_uint32_t
+svn_diff__get_node_count(svn_diff__tree_t *tree);
+
+/*
  * Support functions to build a tree of token positions
  */
 void
@@ -128,6 +137,7 @@ void
 svn_diff__resolve_conflict(svn_diff_t *hunk,
                            svn_diff__position_t **position_list1,
                            svn_diff__position_t **position_list2,
+                           apr_uint32_t num_tokens,
                            apr_pool_t *pool);
 
 
Index: subversion/libsvn_diff/token.c
===================================================================
--- subversion/libsvn_diff/token.c	(revision 1104340)
+++ subversion/libsvn_diff/token.c	(working copy)
@@ -46,6 +46,7 @@ struct svn_diff__node_t
   svn_diff__node_t     *right;
 
   apr_uint32_t          hash;
+  apr_int32_t           index;
   void                 *token;
 };
 
@@ -53,10 +54,20 @@ struct svn_diff__tree_t
 {
   svn_diff__node_t     *root[SVN_DIFF__HASH_SIZE];
   apr_pool_t           *pool;
+  apr_uint32_t          node_count;
 };
 
 
 /*
+ * Returns number of tokens in a tree
+ */
+apr_uint32_t
+svn_diff__get_node_count(svn_diff__tree_t *tree)
+{
+  return tree->node_count;
+}
+
+/*
  * Support functions to build a tree of token positions
  */
 
@@ -65,6 +76,7 @@ svn_diff__tree_create(svn_diff__tree_t **tree, apr
 {
   *tree = apr_pcalloc(pool, sizeof(**tree));
   (*tree)->pool = pool;
+  (*tree)->node_count = 0;
 }
 
 
@@ -122,6 +134,7 @@ tree_insert_token(svn_diff__node_t **node, svn_dif
   new_node->right = NULL;
   new_node->hash = hash;
   new_node->token = token;
+  new_node->index = tree->node_count++;
 
   *node = *node_ref = new_node;
 
@@ -168,7 +181,7 @@ svn_diff__get_tokens(svn_diff__position_t **positi
       /* Create a new position */
       position = apr_palloc(pool, sizeof(*position));
       position->next = NULL;
-      position->node = node;
+      position->node = node->index;
       position->offset = offset;
 
       *position_ref = position;
Index: subversion/libsvn_diff/lcs.c
===================================================================
--- subversion/libsvn_diff/lcs.c	(revision 1104340)
+++ subversion/libsvn_diff/lcs.c	(working copy)
@@ -51,6 +51,7 @@ static APR_INLINE void
 svn_diff__snake(apr_off_t k,
                 svn_diff__snake_t *fp,
                 int idx,
+                apr_uint32_t *token_counts[2],
                 svn_diff__lcs_t **freelist,
                 apr_pool_t *pool)
 {
@@ -92,6 +93,11 @@ svn_diff__snake(apr_off_t k,
     }
 
 
+  if (previous_lcs)
+    {
+      previous_lcs->refcount++;
+    }
+
   /* ### Optimization, skip all positions that don't have matchpoints
    * ### anyway. Beware of the sentinel, don't skip it!
    */
@@ -99,6 +105,8 @@ svn_diff__snake(apr_off_t k,
   position[0] = start_position[0];
   position[1] = start_position[1];
 
+  while (1)
+    {
   while (position[0]->node == position[1]->node)
     {
       position[0] = position[0]->next;
@@ -122,18 +130,19 @@ svn_diff__snake(apr_off_t k,
       lcs->length = position[1]->offset - start_position[1]->offset;
       lcs->next = previous_lcs;
       lcs->refcount = 1;
-      fp[k].lcs = lcs;
+          previous_lcs = lcs;
+          start_position[0] = position[0];
+          start_position[1] = position[1];
     }
+      if (position[0]->node >= 0 && token_counts[1][position[0]->node] == 0)
+        start_position[0] = position[0] = position[0]->next;
+      else if (position[1]->node >= 0 && token_counts[0][position[1]->node] == 0)
+        start_position[1] = position[1] = position[1]->next;
   else
-    {
-      fp[k].lcs = previous_lcs;
+        break;
     }
 
-  if (previous_lcs)
-    {
-      previous_lcs->refcount++;
-    }
-
+  fp[k].lcs = previous_lcs;
   fp[k].position[0] = position[0];
   fp[k].position[1] = position[1];
 
@@ -188,12 +197,18 @@ prepend_lcs(svn_diff__lcs_t *lcs, apr_off_t lines,
 svn_diff__lcs_t *
 svn_diff__lcs(svn_diff__position_t *position_list1, /* pointer to tail (ring) */
               svn_diff__position_t *position_list2, /* pointer to tail (ring) */
+              apr_uint32_t *token_counts_list1, /* array of counts */
+              apr_uint32_t *token_counts_list2, /* array of counts */
+              apr_uint32_t num_tokens,
               apr_off_t prefix_lines,
               apr_off_t suffix_lines,
               apr_pool_t *pool)
 {
   int idx;
   apr_off_t length[2];
+  apr_uint32_t *token_counts[2];
+  apr_uint32_t unique_count[2];
+  apr_uint32_t token_index;
   svn_diff__snake_t *fp;
   apr_off_t d;
   apr_off_t k;
@@ -231,10 +246,19 @@ svn_diff__lcs(svn_diff__position_t *position_list1
       return lcs;
     }
 
+  unique_count[1] = unique_count[0] = 0;
+  for (token_index = 0; token_index < num_tokens; token_index++)
+    {
+      if (token_counts_list1[token_index] == 0)
+        unique_count[1] += token_counts_list2[token_index];
+      if (token_counts_list2[token_index] == 0)
+        unique_count[0] += token_counts_list1[token_index];
+    }
+
   /* Calculate length of both sequences to be compared */
   length[0] = position_list1->offset - position_list1->next->offset + 1;
   length[1] = position_list2->offset - position_list2->next->offset + 1;
-  idx = length[0] > length[1] ? 1 : 0;
+  idx = length[0] - unique_count[0] > length[1] - unique_count[1] ? 1 : 0;
 
   /* strikerXXX: here we allocate the furthest point array, which is
    * strikerXXX: sized M + N + 3 (!)
@@ -246,18 +270,20 @@ svn_diff__lcs(svn_diff__position_t *position_list1
   sentinel_position[idx].next = position_list1->next;
   position_list1->next = &sentinel_position[idx];
   sentinel_position[idx].offset = position_list1->offset + 1;
+  token_counts[idx] = token_counts_list1;
 
   sentinel_position[abs(1 - idx)].next = position_list2->next;
   position_list2->next = &sentinel_position[abs(1 - idx)];
   sentinel_position[abs(1 - idx)].offset = position_list2->offset + 1;
+  token_counts[abs(1 - idx)] = token_counts_list2;
 
-  /* These are never dereferenced, only compared by value, so we
-   * can safely fake these up and the void* cast is OK.
+  /* Negative indices will not be used elsewhere
    */
-  sentinel_position[0].node = (void*)&sentinel_position[0];
-  sentinel_position[1].node = (void*)&sentinel_position[1];
+  sentinel_position[0].node = -1;
+  sentinel_position[1].node = -2;
 
-  d = length[abs(1 - idx)] - length[idx];
+  d = length[abs(1 - idx)] - unique_count[abs(1 - idx)]
+      - (length[idx] - unique_count[idx]);
 
   /* k = -1 will be the first to be used to get previous
    * position information from, make sure it holds sane
@@ -272,12 +298,12 @@ svn_diff__lcs(svn_diff__position_t *position_list1
       /* Forward */
       for (k = -p; k < d; k++)
         {
-          svn_diff__snake(k, fp, idx, &lcs_freelist, pool);
+          svn_diff__snake(k, fp, idx, token_counts, &lcs_freelist, pool);
         }
 
       for (k = d + p; k >= d; k--)
         {
-          svn_diff__snake(k, fp, idx, &lcs_freelist, pool);
+          svn_diff__snake(k, fp, idx, token_counts, &lcs_freelist, pool);
         }
 
       p++;
Index: subversion/libsvn_diff/diff3.c
===================================================================
--- subversion/libsvn_diff/diff3.c	(revision 1104340)
+++ subversion/libsvn_diff/diff3.c	(working copy)
@@ -38,6 +38,7 @@ void
 svn_diff__resolve_conflict(svn_diff_t *hunk,
                            svn_diff__position_t **position_list1,
                            svn_diff__position_t **position_list2,
+                           apr_uint32_t num_tokens,
                            apr_pool_t *pool)
 {
     apr_off_t modified_start = hunk->modified_start + 1;
@@ -47,6 +48,9 @@ svn_diff__resolve_conflict(svn_diff_t *hunk,
     apr_off_t latest_length = hunk->latest_length;
     svn_diff__position_t *start_position[2];
     svn_diff__position_t *position[2];
+  apr_uint32_t token_index;
+  svn_diff__position_t *current;
+  apr_uint32_t *token_counts[2];
     svn_diff__lcs_t *lcs = NULL;
     svn_diff__lcs_t **lcs_ref = &lcs;
     svn_diff_t **diff_ref = &hunk->resolved_diff;
@@ -173,9 +177,29 @@ svn_diff__resolve_conflict(svn_diff_t *hunk,
         position[1]->next = start_position[1];
       }
 
-    *lcs_ref = svn_diff__lcs(position[0], position[1], 0, 0,
-                             subpool);
+  token_counts[0] = apr_palloc(subpool, num_tokens * sizeof(**token_counts));
+  token_counts[1] = apr_palloc(subpool, num_tokens * sizeof(**token_counts));
+  for (token_index = 0; token_index < num_tokens; token_index++)
+    token_counts[0][token_index] = token_counts[1][token_index] = 0;
 
+  current = position[0];
+  do
+    {
+      token_counts[0][current->node]++;
+      current = current->next;
+    }
+  while (current != position[0]);
+  current = position[1];
+  do
+    {
+      token_counts[1][current->node]++;
+      current = current->next;
+    }
+  while (current != position[1]);
+
+  *lcs_ref = svn_diff__lcs(position[0], position[1], token_counts[0],
+                           token_counts[1], num_tokens, 0, 0, subpool);
+
     /* Fix up the EOF lcs element in case one of
      * the two sequences was NULL.
      */
@@ -251,6 +275,10 @@ svn_diff_diff3_2(svn_diff_t **diff,
 {
   svn_diff__tree_t *tree;
   svn_diff__position_t *position_list[3];
+  apr_uint32_t num_tokens;
+  apr_uint32_t token_index;
+  svn_diff__position_t *current;
+  apr_uint32_t *token_counts[3];
   svn_diff_datasource_e datasource[] = {svn_diff_datasource_original,
                                         svn_diff_datasource_modified,
                                         svn_diff_datasource_latest};
@@ -292,6 +320,8 @@ svn_diff_diff3_2(svn_diff_t **diff,
                                prefix_lines,
                                subpool));
 
+  num_tokens = svn_diff__get_node_count(tree);
+
   /* Get rid of the tokens, we don't need them to calc the diff */
   if (vtable->token_discard_all != NULL)
     vtable->token_discard_all(diff_baton);
@@ -299,10 +329,52 @@ svn_diff_diff3_2(svn_diff_t **diff,
   /* We don't need the nodes in the tree either anymore, nor the tree itself */
   svn_pool_destroy(treepool);
 
+  token_counts[0] = apr_palloc(subpool, num_tokens * sizeof(**token_counts));
+  token_counts[1] = apr_palloc(subpool, num_tokens * sizeof(**token_counts));
+  token_counts[2] = apr_palloc(subpool, num_tokens * sizeof(**token_counts));
+  for (token_index = 0; token_index < num_tokens; token_index++)
+    {
+      token_counts[0][token_index] = token_counts[1][token_index] = 0;
+      token_counts[2][token_index] = 0;
+    }
+
+  current = position_list[0];
+  if (current != NULL)
+    {
+      do
+        {
+          token_counts[0][current->node]++;
+          current = current->next;
+        }
+      while (current != position_list[0]);
+    }
+  current = position_list[1];
+  if (current != NULL)
+    {
+      do
+        {
+          token_counts[1][current->node]++;
+          current = current->next;
+        }
+      while (current != position_list[1]);
+    }
+  current = position_list[2];
+  if (current != NULL)
+    {
+      do
+        {
+          token_counts[2][current->node]++;
+          current = current->next;
+        }
+      while (current != position_list[2]);
+    }
+
   /* Get the lcs for original-modified and original-latest */
-  lcs_om = svn_diff__lcs(position_list[0], position_list[1], prefix_lines,
+  lcs_om = svn_diff__lcs(position_list[0], position_list[1], token_counts[0],
+                         token_counts[1], num_tokens, prefix_lines,
                          suffix_lines, subpool);
-  lcs_ol = svn_diff__lcs(position_list[0], position_list[2], prefix_lines,
+  lcs_ol = svn_diff__lcs(position_list[0], position_list[2], token_counts[0],
+                         token_counts[2], num_tokens, prefix_lines,
                          suffix_lines, subpool);
 
   /* Produce a merged diff */
@@ -435,6 +507,7 @@ svn_diff_diff3_2(svn_diff_t **diff,
                 svn_diff__resolve_conflict(*diff_ref,
                                            &position_list[1],
                                            &position_list[2],
+                                           num_tokens,
                                            pool);
               }
             else if (is_modified)
Index: subversion/libsvn_diff/diff4.c
===================================================================
--- subversion/libsvn_diff/diff4.c	(revision 1104340)
+++ subversion/libsvn_diff/diff4.c	(working copy)
@@ -173,6 +173,10 @@ svn_diff_diff4_2(svn_diff_t **diff,
 {
   svn_diff__tree_t *tree;
   svn_diff__position_t *position_list[4];
+  apr_uint32_t num_tokens;
+  apr_uint32_t token_index;
+  svn_diff__position_t *current;
+  apr_uint32_t *token_counts[4];
   svn_diff_datasource_e datasource[] = {svn_diff_datasource_original,
                                         svn_diff_datasource_modified,
                                         svn_diff_datasource_latest,
@@ -227,6 +231,8 @@ svn_diff_diff4_2(svn_diff_t **diff,
                                prefix_lines,
                                subpool2));
 
+  num_tokens = svn_diff__get_node_count(tree);
+
   /* Get rid of the tokens, we don't need them to calc the diff */
   if (vtable->token_discard_all != NULL)
     vtable->token_discard_all(diff_baton);
@@ -234,8 +240,61 @@ svn_diff_diff4_2(svn_diff_t **diff,
   /* We don't need the nodes in the tree either anymore, nor the tree itself */
   svn_pool_clear(subpool3);
 
+  token_counts[0] = apr_palloc(subpool, num_tokens * sizeof(**token_counts));
+  token_counts[1] = apr_palloc(subpool, num_tokens * sizeof(**token_counts));
+  token_counts[2] = apr_palloc(subpool, num_tokens * sizeof(**token_counts));
+  token_counts[3] = apr_palloc(subpool, num_tokens * sizeof(**token_counts));
+  for (token_index = 0; token_index < num_tokens; token_index++)
+    {
+      token_counts[0][token_index] = token_counts[1][token_index] = 0;
+      token_counts[2][token_index] = token_counts[3][token_index] = 0;
+    }
+
+  current = position_list[0];
+  if (current != NULL)
+    {
+      do
+        {
+          token_counts[0][current->node]++;
+          current = current->next;
+        }
+      while (current != position_list[0]);
+    }
+  current = position_list[1];
+  if (current != NULL)
+    {
+      do
+        {
+          token_counts[1][current->node]++;
+          current = current->next;
+        }
+      while (current != position_list[1]);
+    }
+  current = position_list[2];
+  if (current != NULL)
+    {
+      do
+        {
+          token_counts[2][current->node]++;
+          current = current->next;
+        }
+      while (current != position_list[2]);
+    }
+  current = position_list[3];
+  if (current != NULL)
+    {
+      do
+        {
+          token_counts[3][current->node]++;
+          current = current->next;
+        }
+      while (current != position_list[3]);
+    }
+
   /* Get the lcs for original - latest */
-  lcs_ol = svn_diff__lcs(position_list[0], position_list[2], prefix_lines,
+  lcs_ol = svn_diff__lcs(position_list[0], position_list[2],
+                         token_counts[0], token_counts[2],
+                         num_tokens, prefix_lines,
                          suffix_lines, subpool3);
   diff_ol = svn_diff__diff(lcs_ol, 1, 1, TRUE, pool);
 
@@ -257,7 +316,9 @@ svn_diff_diff4_2(svn_diff_t **diff,
   /* Get the lcs for common ancestor - original
    * Do reverse adjustements
    */
-  lcs_adjust = svn_diff__lcs(position_list[3], position_list[2], prefix_lines,
+  lcs_adjust = svn_diff__lcs(position_list[3], position_list[2],
+                             token_counts[3], token_counts[2],
+                             num_tokens, prefix_lines,
                              suffix_lines, subpool3);
   diff_adjust = svn_diff__diff(lcs_adjust, 1, 1, FALSE, subpool3);
   adjust_diff(diff_ol, diff_adjust);
@@ -267,7 +328,9 @@ svn_diff_diff4_2(svn_diff_t **diff,
   /* Get the lcs for modified - common ancestor
    * Do forward adjustments
    */
-  lcs_adjust = svn_diff__lcs(position_list[1], position_list[3], prefix_lines,
+  lcs_adjust = svn_diff__lcs(position_list[1], position_list[3],
+                             token_counts[1], token_counts[3],
+                             num_tokens, prefix_lines,
                              suffix_lines, subpool3);
   diff_adjust = svn_diff__diff(lcs_adjust, 1, 1, FALSE, subpool3);
   adjust_diff(diff_ol, diff_adjust);
@@ -283,7 +346,7 @@ svn_diff_diff4_2(svn_diff_t **diff,
       if (hunk->type == svn_diff__type_conflict)
         {
           svn_diff__resolve_conflict(hunk, &position_list[1],
-                                     &position_list[2], pool);
+                                     &position_list[2], num_tokens, pool);
         }
     }
 
]]]



Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Morten Kloster <mo...@gmail.com>.

Sorry, here's the patch with .txt extension. I found a bug in the
first version anyway, so it was just as well. :)

And of course I meant sqlite3.c for the stress test.

Morten

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Johan Corveleyn <jc...@gmail.com>.
On Wed, May 18, 2011 at 1:56 AM, Morten Kloster <mo...@gmail.com> wrote:
> Log message:
>
> Speed-up of libsvn_diff using token counts
> By using indices, not node pointers, to refer to tokens, and counting
> the number of each token, the longest common subsequence (lcs)
> algorithm at the heart of libsvn_diff becomes much faster in many
> situations by skipping tokens that are unique to one source or the other.
>
> * subversion/libsvn_diff/token.c
>  (svn_diff__node_t): Added 'index' member.
>  (svn_diff__tree_t): Added 'node_count' member.
>  (svn_diff__get_token_count): New method.
>  (tree_insert_token): Assigns indices to nodes.
>  (svn_diff__get_tokens): Uses index, not pointer,
>   to identify node in position struct.
>
> * subversion/libsvn_diff/lcs.c
>  (svn_diff__snake): Takes token counts as additional argument.
>   Skips tokens that are unique to one or the other file.
>  (svn_diff__lcs): Takes token counts and number of different tokens
>   as additional arguments. Calculates number of tokens unique to
>   each source and uses this to compute effective length difference.
>
> * subversion/libsvn_diff/diff.h
>  (svn_diff__position_t): Changed node member from being
>   a pointer to the node to an integer index for the node.
>  Updated method declarations.
>
> * subversion/libsvn_diff/diff.c
> * subversion/libsvn_diff/diff3.c
> * subversion/libsvn_diff/diff4.c
>  (svn_diff_diff_2, svn_diff_diff_3, svn_diff_diff_4): Count the
>   number of times each token appears in each source.
>  (svn_diff__resolve_conflict): Takes number of different tokens
>   as additional argument. Counts the number of times
>   each token appears in each (partial) source.
>
>
>
>
>
>
> Comments:
> This patch can reduce the time required for a diff by a large factor
> when comparing long files with large differences. As an extreme case,
> splitting sqlite.c into two roughly equal parts and diffing them against
> each other took about a minute on my laptop with the original code,
> and 7-8 seconds with my patch. The speed-up is gained only if a
> large fraction of the changed tokens are unique to one source or the
> other. I cannot see that the change would cause a significant
> performance degradation in any reasonable case.
>
> Counting the number of times each token appears is also useful for
> potential additional features in the future, such as identifying moved
> blocks outside of the longest common subsequence.
>
> I have tested only svn_diff_diff_2 (that it gives identical results as
> before), but the changes to svn_diff_diff3 and svn_diff_diff4 are
> completely analogous. Tested on a Windows 7 64-bit machine,
> using the TortoiseSVN build script.
>
>
> Morten Kloster (first patch)

Interesting! The patch seems to be missing though. The list software
tends to strip off attachments if they don't have a text/plain
mime-type. Can you try sending it again with a .txt extension?

-- 
Johan

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Johan Corveleyn <jc...@gmail.com>.
On Mon, Jun 6, 2011 at 7:38 PM, Morten Kloster <mo...@gmail.com> wrote:
> On Mon, Jun 6, 2011 at 3:17 AM, Johan Corveleyn <jc...@gmail.com> wrote:
>> On Wed, Jun 1, 2011 at 5:56 PM, Morten Kloster <mo...@gmail.com> wrote:
> []
>> Hi Morten,
>>
>> Sorry, it took me a little while longer than expected, but I finally
>> got around to it.
>>
>> I did some more tests, and upon further investigation, I too can't
>> really find a significant performance decrease because of the token
>> counting. So that looks good.
>>
>> I've reviewed it all, and it looks fine. I just have one small
>> question about the changes in lcs.c:
>>
>> [[[
>> @@ -218,11 +228,17 @@ prepend_lcs(svn_diff__lcs_t *lcs, apr_off_t lines,
>>  svn_diff__lcs_t *
>>  svn_diff__lcs(svn_diff__position_t *position_list1, /* pointer to
>> tail (ring) */
>>               svn_diff__position_t *position_list2, /* pointer to
>> tail (ring) */
>> +              svn_diff__token_index_t *token_counts_list1, /* array
>> of counts */
>> +              svn_diff__token_index_t *token_counts_list2, /* array
>> of counts */
>> +              svn_diff__token_index_t num_tokens,
>>               apr_off_t prefix_lines,
>>               apr_off_t suffix_lines,
>>               apr_pool_t *pool)
>>  {
>>   apr_off_t length[2];
>> +  svn_diff__token_index_t *token_counts[2];
>> +  svn_diff__token_index_t unique_count[2];
>> +  svn_diff__token_index_t token_index;
>>
>> ...
>>
>> @@ -281,16 +310,17 @@ svn_diff__lcs(svn_diff__position_t *position_list1
>>   sentinel_position[0].next = position_list1->next;
>>   position_list1->next = &sentinel_position[0];
>>   sentinel_position[0].offset = position_list1->offset + 1;
>> +  token_counts[0] = token_counts_list1;
>>
>>   sentinel_position[1].next = position_list2->next;
>>   position_list2->next = &sentinel_position[1];
>>   sentinel_position[1].offset = position_list2->offset + 1;
>> +  token_counts[1] = token_counts_list2;
>>
>> ...
>>
>> @@ -310,12 +340,12 @@ svn_diff__lcs(svn_diff__position_t *position_list1
>>       /* For k < 0, insertions are free */
>>       for (k = (d < 0 ? d : 0) - p; k < 0; k++)
>>         {
>> -          svn_diff__snake(fp + k, &lcs_freelist, pool);
>> +          svn_diff__snake(fp + k, token_counts, &lcs_freelist, pool);
>>         }
>>          /* for k > 0, deletions are free */
>>       for (k = (d > 0 ? d : 0) + p; k >= 0; k--)
>>         {
>> -          svn_diff__snake(fp + k, &lcs_freelist, pool);
>> +          svn_diff__snake(fp + k, token_counts, &lcs_freelist, pool);
>>         }
>>
>>       p++;
>> ]]]
>>
>> Why do you copy the token_counts_list parameters into the array
>> token_counts[]? Is that to simplify the passing of data to __snake
>> (keeping number of arguments to a minimum)?
>>
>> Can't you just pass the token_counts_list parameters 'as is' to
>> __snake as two extra arguments, and lose the token_counts[] variable?
>>
>> But maybe you have a reason for this, because of compiler optimization
>> or inlining or something like that?
>>
>> Apart from that I have no more remarks, so I think I'll be able to
>> commit this soon.
>>
>> Thanks for your work.
>> --
>> Johan
>>
>
> I made the token_counts array when idx was still in, and then its
> values depended on the value of idx. Now it is no longer required as
> such, although I find it somewhat tidier to use arrays for "everything"
> in svn_diff__snake. I get no easily detectable difference in speed on
> my system from removing it, but if you want to remove it, feel free.
>
> Thanks for the review. :-)

I committed your patch in r1132811.

I decided to leave the token_counts array in there in lcs.c, because
it seems indeed somewhat tidier. We can always change it later if
someone thinks that's a good idea.

I did some small tweaks, nothing functional:
- Moved the docstring of svn_diff__get_token_counts to the declaration
in diff.h.
- Also in diff.h, in the docstring of svn_diff__lcs, I added some
mention of the token_counts_list arguments.
- And changed your usage of 'method' to 'function' in the log message :-)

Cheers,
-- 
Johan

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Morten Kloster <mo...@gmail.com>.
On Mon, Jun 6, 2011 at 3:17 AM, Johan Corveleyn <jc...@gmail.com> wrote:
> On Wed, Jun 1, 2011 at 5:56 PM, Morten Kloster <mo...@gmail.com> wrote:
[]
> Hi Morten,
>
> Sorry, it took me a little while longer than expected, but I finally
> got around to it.
>
> I did some more tests, and upon further investigation, I too can't
> really find a significant performance decrease because of the token
> counting. So that looks good.
>
> I've reviewed it all, and it looks fine. I just have one small
> question about the changes in lcs.c:
>
> [[[
> @@ -218,11 +228,17 @@ prepend_lcs(svn_diff__lcs_t *lcs, apr_off_t lines,
>  svn_diff__lcs_t *
>  svn_diff__lcs(svn_diff__position_t *position_list1, /* pointer to
> tail (ring) */
>               svn_diff__position_t *position_list2, /* pointer to
> tail (ring) */
> +              svn_diff__token_index_t *token_counts_list1, /* array
> of counts */
> +              svn_diff__token_index_t *token_counts_list2, /* array
> of counts */
> +              svn_diff__token_index_t num_tokens,
>               apr_off_t prefix_lines,
>               apr_off_t suffix_lines,
>               apr_pool_t *pool)
>  {
>   apr_off_t length[2];
> +  svn_diff__token_index_t *token_counts[2];
> +  svn_diff__token_index_t unique_count[2];
> +  svn_diff__token_index_t token_index;
>
> ...
>
> @@ -281,16 +310,17 @@ svn_diff__lcs(svn_diff__position_t *position_list1
>   sentinel_position[0].next = position_list1->next;
>   position_list1->next = &sentinel_position[0];
>   sentinel_position[0].offset = position_list1->offset + 1;
> +  token_counts[0] = token_counts_list1;
>
>   sentinel_position[1].next = position_list2->next;
>   position_list2->next = &sentinel_position[1];
>   sentinel_position[1].offset = position_list2->offset + 1;
> +  token_counts[1] = token_counts_list2;
>
> ...
>
> @@ -310,12 +340,12 @@ svn_diff__lcs(svn_diff__position_t *position_list1
>       /* For k < 0, insertions are free */
>       for (k = (d < 0 ? d : 0) - p; k < 0; k++)
>         {
> -          svn_diff__snake(fp + k, &lcs_freelist, pool);
> +          svn_diff__snake(fp + k, token_counts, &lcs_freelist, pool);
>         }
>          /* for k > 0, deletions are free */
>       for (k = (d > 0 ? d : 0) + p; k >= 0; k--)
>         {
> -          svn_diff__snake(fp + k, &lcs_freelist, pool);
> +          svn_diff__snake(fp + k, token_counts, &lcs_freelist, pool);
>         }
>
>       p++;
> ]]]
>
> Why do you copy the token_counts_list parameters into the array
> token_counts[]? Is that to simplify the passing of data to __snake
> (keeping number of arguments to a minimum)?
>
> Can't you just pass the token_counts_list parameters 'as is' to
> __snake as two extra arguments, and lose the token_counts[] variable?
>
> But maybe you have a reason for this, because of compiler optimization
> or inlining or something like that?
>
> Apart from that I have no more remarks, so I think I'll be able to
> commit this soon.
>
> Thanks for your work.
> --
> Johan
>

I made the token_counts array when idx was still in, and then its
values depended on the value of idx. Now it is no longer required as
such, although I find it somewhat tidier to use arrays for "everything"
in svn_diff__snake. I get no easily detectable difference in speed on
my system from removing it, but if you want to remove it, feel free.

Thanks for the review. :-)

Morten

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Johan Corveleyn <jc...@gmail.com>.
On Wed, Jun 1, 2011 at 5:56 PM, Morten Kloster <mo...@gmail.com> wrote:
> On Wed, Jun 1, 2011 at 1:35 AM, Johan Corveleyn <jc...@gmail.com> wrote:
>> On Tue, May 31, 2011 at 12:44 PM, Johan Corveleyn <jc...@gmail.com> wrote:
>> ...
> []
>> I'll get into some more testing and reviewing tomorrow or the day
>> after (unless someone else beats me to it :-)).
>>
>> Cheers,
>> --
>> Johan
>>
>
> I had trouble getting any reliable time differences between the
> unpatched and patched versions. By putting the token counting
> in a x300 loop, I got an increase in running time of ~10% for my
> test file (a 40MB file - the output from all my merge.c diffs -
> against a copy where I have changed the first and last line, as
> you suggested)... that was with ignore eol, but not whitespace.
> So here it looks like the patch barely increases running time at
> all; almost all the time is spent reading the files and assigning
> tokens.

Hi Morten,

Sorry, it took me a little while longer than expected, but I finally
got around to it.

I did some more tests, and upon further investigation, I too can't
really find a significant performance decrease because of the token
counting. So that looks good.

I've reviewed it all, and it looks fine. I just have one small
question about the changes in lcs.c:

[[[
@@ -218,11 +228,17 @@ prepend_lcs(svn_diff__lcs_t *lcs, apr_off_t lines,
 svn_diff__lcs_t *
 svn_diff__lcs(svn_diff__position_t *position_list1, /* pointer to
tail (ring) */
               svn_diff__position_t *position_list2, /* pointer to
tail (ring) */
+              svn_diff__token_index_t *token_counts_list1, /* array
of counts */
+              svn_diff__token_index_t *token_counts_list2, /* array
of counts */
+              svn_diff__token_index_t num_tokens,
               apr_off_t prefix_lines,
               apr_off_t suffix_lines,
               apr_pool_t *pool)
 {
   apr_off_t length[2];
+  svn_diff__token_index_t *token_counts[2];
+  svn_diff__token_index_t unique_count[2];
+  svn_diff__token_index_t token_index;

...

@@ -281,16 +310,17 @@ svn_diff__lcs(svn_diff__position_t *position_list1
   sentinel_position[0].next = position_list1->next;
   position_list1->next = &sentinel_position[0];
   sentinel_position[0].offset = position_list1->offset + 1;
+  token_counts[0] = token_counts_list1;

   sentinel_position[1].next = position_list2->next;
   position_list2->next = &sentinel_position[1];
   sentinel_position[1].offset = position_list2->offset + 1;
+  token_counts[1] = token_counts_list2;

...

@@ -310,12 +340,12 @@ svn_diff__lcs(svn_diff__position_t *position_list1
       /* For k < 0, insertions are free */
       for (k = (d < 0 ? d : 0) - p; k < 0; k++)
         {
-          svn_diff__snake(fp + k, &lcs_freelist, pool);
+          svn_diff__snake(fp + k, token_counts, &lcs_freelist, pool);
         }
          /* for k > 0, deletions are free */
       for (k = (d > 0 ? d : 0) + p; k >= 0; k--)
         {
-          svn_diff__snake(fp + k, &lcs_freelist, pool);
+          svn_diff__snake(fp + k, token_counts, &lcs_freelist, pool);
         }

       p++;
]]]

Why do you copy the token_counts_list parameters into the array
token_counts[]? Is that to simplify the passing of data to __snake
(keeping number of arguments to a minimum)?

Can't you just pass the token_counts_list parameters 'as is' to
__snake as two extra arguments, and lose the token_counts[] variable?

But maybe you have a reason for this, because of compiler optimization
or inlining or something like that?

Apart from that I have no more remarks, so I think I'll be able to
commit this soon.

Thanks for your work.
-- 
Johan

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Morten Kloster <mo...@gmail.com>.
On Wed, Jun 1, 2011 at 1:35 AM, Johan Corveleyn <jc...@gmail.com> wrote:
> On Tue, May 31, 2011 at 12:44 PM, Johan Corveleyn <jc...@gmail.com> wrote:
> ...
[]
> I'll get into some more testing and reviewing tomorrow or the day
> after (unless someone else beats me to it :-)).
>
> Cheers,
> --
> Johan
>

I had trouble getting any reliable time differences between the
unpatched and patched versions. By putting the token counting
in a x300 loop, I got an increase in running time of ~10% for my
test file (a 40MB file - the output from all my merge.c diffs -
against a copy where I have changed the first and last line, as
you suggested)... that was with ignore eol, but not whitespace.
So here it looks like the patch barely increases running time at
all; almost all the time is spent reading the files and assigning
tokens.


Morten

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Johan Corveleyn <jc...@gmail.com>.
On Tue, May 31, 2011 at 12:44 PM, Johan Corveleyn <jc...@gmail.com> wrote:
...
> Maybe a new 'knob' should be added for this? To make it easier to
> (stress) test the LCS ... sorry don't have time to do it myself now.

Added the new 'knob' in r1129957 (with followup in r1129965). To
disable the prefix/suffix scanning (thus letting all lines be handled
by the LCS algorithm), just define SVN_DISABLE_PREFIX_SUFFIX_SCANNING.

I'll get into some more testing and reviewing tomorrow or the day
after (unless someone else beats me to it :-)).

Cheers,
-- 
Johan

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Johan Corveleyn <jc...@gmail.com>.
On Tue, May 31, 2011 at 12:03 PM, Daniel Shahaf <d....@daniel.shahaf.name> wrote:
> Johan Corveleyn wrote on Tue, May 31, 2011 at 02:53:47 +0200:
>> - Take a closer look at measuring the overhead of the token counting.
>> Maybe you can also provide some numbers here? I think a good test for
>> measuring this in practice is:
>>   1. take a very large file
>>   2. change a line in the beginning and at the end
>>      (eliminates prefix/suffix scanning, making sure everything goes to LCS)
>
> That sounds roundabout.  -DSUFFIX_LINES_TO_KEEP=0 ?

No, that only impacts suffix scanning (not prefix). Plus it actually
makes the suffix scanning even more thorough than usual :-). It makes
sure every line of suffix is eliminated, and not a single line is
"kept".

The SUFFIX_LINES_TO_KEEP=50 (the default) just helps to give the LCS
scanning some "wiggle room", so as to find (in most practical cases)
the same LCS as before. Without this setting, you can get different
LCS'es than before.

To programmatically disable the prefix/suffix scanning, I think it's
easiest to just comment out the following lines in
libsvn_diff/diff_file.c#datasources_open:

[[[
  SVN_ERR(find_identical_prefix(&reached_one_eof, prefix_lines,
                                files, datasources_len, file_baton->pool));

  if (!reached_one_eof)
    /* No file consisted totally of identical prefix,
     * so there may be some identical suffix.  */
    SVN_ERR(find_identical_suffix(suffix_lines, files, datasources_len,
                                  file_baton->pool));
]]]

That should work correctly, because the prefix_lines and suffix_lines
are already initialized to 0 in the beginning of this function (which
is the correct value if there is no prefix/suffix scanning).

(sorry, I don't have a working copy at hand here, and in the middle of
something else, so can't make a proper patch or anything).

Maybe a new 'knob' should be added for this? To make it easier to
(stress) test the LCS ... sorry don't have time to do it myself now.

-- 
Johan

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Johan Corveleyn wrote on Tue, May 31, 2011 at 02:53:47 +0200:
> - Take a closer look at measuring the overhead of the token counting.
> Maybe you can also provide some numbers here? I think a good test for
> measuring this in practice is:
>   1. take a very large file
>   2. change a line in the beginning and at the end
>      (eliminates prefix/suffix scanning, making sure everything goes to LCS)

That sounds roundabout.  -DSUFFIX_LINES_TO_KEEP=0 ?

>   3. diff those two
> 
> - Take a closer look at the code. I've skimmed through it, and it
> looked good. But I need to go for a second pass, but right now (almost
> 3 am) I really need to get some sleep first :-).
> 
> -- 
> Johan

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Johan Corveleyn <jc...@gmail.com>.
On Mon, May 30, 2011 at 8:50 PM, Morten Kloster <mo...@gmail.com> wrote:
> Johan, any progress on reviewing the code on your part? Things are
> a bit simpler now with the idx patch in: Since that patch settled the
> file (re)order issue, this patch now produces fully identical results to
> HEAD.

Thanks for the updated patch, it applies cleanly to HEAD of trunk indeed.

I've done some tests, and they look good. The result of my 2200-rev
blame is identical before or after this patch (the idx patch caused
some minor changes, but at first sight nothing drastic -- seemingly
unimportant lines, which were switched between revs they were
attributed to).

With my usual blame testcase, I use the -x-b option (ignore changes in
amount of whitespace). This helps the blame speed tremendously in this
case, because some of the revisions in the history have been
spaces->tabs and vice-versa. With this test, I could see the slight
overhead of the token counting of your patch (2%-3%), because there
are not that much "one-sided" lines. So it's slightly slower than
before.

However, when running the 2200-rev blame without ignore-options, I can
see a *huge* improvement thanks to your patch. This test used to take
over an hour. Now it's finished in 1m30s, as fast as the test with
-x-b. For the revs that change spaces->tabs all lines are "one-sided".

For real work, I prefer to use the -x-b option (I'm not interested in
who changed whitespace), but it's interesting to see the power of this
patch.

Anyway, I'm currently running the entire test-suite on your patch. So
far no problem. But before committing it I'd still want to do two
things:

- Take a closer look at measuring the overhead of the token counting.
Maybe you can also provide some numbers here? I think a good test for
measuring this in practice is:
  1. take a very large file
  2. change a line in the beginning and at the end
     (eliminates prefix/suffix scanning, making sure everything goes to LCS)
  3. diff those two

- Take a closer look at the code. I've skimmed through it, and it
looked good. But I need to go for a second pass, but right now (almost
3 am) I really need to get some sleep first :-).

-- 
Johan

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Morten Kloster <mo...@gmail.com>.
On Mon, May 30, 2011 at 12:26 PM, Stefan Sperling <st...@elego.de> wrote:
> On Mon, May 30, 2011 at 08:25:38AM +0200, Markus Schaber wrote:
>> Hi, Morten,
>>
>> > Von: Morten Kloster [mailto:morklo@gmail.com]
>>
>> > I haven't changed the index/count types yet. What's the right type to use
>> > to get signed 32 bit on 32-bit machines and signed 64 bit on 64-bit
>> > machines?
>>
>> You should use intptr_t from stdint.h, I'd guess.
>
> Not possible, unfortunately.
> We don't use any C99 constructs in Subversion because some compilers
> (especially on Windows) don't support them.
>

Well, it's just a typedef now, so it's no biggie if and when we decide
to change it.

Johan, any progress on reviewing the code on your part? Things are
a bit simpler now with the idx patch in: Since that patch settled the
file (re)order issue, this patch now produces fully identical results to
HEAD.


Morten

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Stefan Sperling <st...@elego.de>.
On Mon, May 30, 2011 at 08:25:38AM +0200, Markus Schaber wrote:
> Hi, Morten,
> 
> > Von: Morten Kloster [mailto:morklo@gmail.com]
> 
> > I haven't changed the index/count types yet. What's the right type to use
> > to get signed 32 bit on 32-bit machines and signed 64 bit on 64-bit
> > machines?
> 
> You should use intptr_t from stdint.h, I'd guess.

Not possible, unfortunately.
We don't use any C99 constructs in Subversion because some compilers
(especially on Windows) don't support them.

AW: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Markus Schaber <m....@3s-software.com>.
Hi, Morten,

> Von: Morten Kloster [mailto:morklo@gmail.com]

> I haven't changed the index/count types yet. What's the right type to use
> to get signed 32 bit on 32-bit machines and signed 64 bit on 64-bit
> machines?

You should use intptr_t from stdint.h, I'd guess.

Regards,

Markus Schaber

-- 
___________________________
We software Automation.

3S-Smart Software Solutions GmbH
Markus Schaber | Entwicklung
Memminger Str. 151 | 87439 Kempten | Tel. +49-831-54031-0 | Fax +49-831-54031-50

Email: m.schaber@3s-software.com | Web: http://www.3s-software.com 
CoDeSys Internet-Forum: http://forum.3s-software.com

Geschäftsführer: Dipl.Inf. Dieter Hess, Dipl.Inf. Manfred Werner | Handelsregister: Kempten HRB 6186 | USt-IDNr.: DE 167014915


Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Branko Čibej <br...@e-reka.si>.
On 29.05.2011 20:02, Morten Kloster wrote:
> Bert informed me that intptr_t might not work on all systems, so I've
> made a new typedef svn_diff__token_index_t and set it to long int for now.

You could try apr_uintptr_t.

-- Brane


Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Morten Kloster <mo...@gmail.com>.
Here's a version that resolves conflicts from r1128921.

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Morten Kloster <mo...@gmail.com>.
On Sun, May 29, 2011 at 6:17 PM, Morten Kloster <mo...@gmail.com> wrote:
> On Sun, May 29, 2011 at 6:00 PM, Bert Huijben <be...@qqmail.nl> wrote:
>>
>>
>>> -----Original Message-----
>>> From: Morten Kloster [mailto:morklo@gmail.com]
>>> Sent: zondag 29 mei 2011 17:35
>>> To: Julian Foad
>>> Cc: Mark Phippard; dev@subversion.apache.org
>>> Subject: Re: [PATCH] Speed-up of libsvn_diff using token counts
>>>
>>> On Fri, May 27, 2011 at 7:57 PM, Julian Foad <ju...@wandisco.com>
>>> wrote:
>>> > Morten Kloster wrote:
>>> >> On Fri, May 27, 2011 at 4:55 PM, Julian Foad <ju...@wandisco.com>
>>> wrote:
>>> >> > Morten Kloster wrote:
>>> >> >> I haven't changed the index/count types yet. What's the right type
>>> >> >> to use to get signed 32 bit on 32-bit machines and signed 64 bit
>>> >> >> on 64-bit machines?
>>> >> >
>>> >> > "int"?
>>> >>
>>> >> Is int guaranteed to correspond to (or be larger than) single-process
>>> >> addressable space?
>>> >
>>> > No.  Some 64-bit platforms use 32-bit ints by default and 64-bit
>>> > pointers.
>>> >
>>> > But do you really need to guarantee that svn's text-diff will cope with
>>> > more than 2 billion lines on such a system?  Personally I don't think
>>> > so.  If you think that is important, you can use "long int".
>>
>> long int isn't guaranteed to be 64 bit either. (E.g. on Windows 64 long int
>> is just 32 bit)
>>
>> You need something like intptr_t if you want an integer type of pointer
>> size.
>>
>> But treating more than 4 billion lines of text like any other text file
>> doesn't seem necessary to me.
>>
>>        Bert
>>
>>
>
> And I forgot the attachment, as well... Ok, trying again, using intptr_t this
> time.
>
>
> Morten
>

Bert informed me that intptr_t might not work on all systems, so I've
made a new typedef svn_diff__token_index_t and set it to long int for now.


[[[
Speed-up of libsvn_diff using token counts
By using indices, not node pointers, to refer to tokens, and counting
the number of each token, the longest common subsequence (lcs)
algorithm at the heart of libsvn_diff becomes much faster in many
situations by skipping tokens that are unique to one source or the other.

* subversion/libsvn_diff/token.c
  (svn_diff__node_t): Added 'index' member.
  (svn_diff__tree_t): Added 'node_count' member.
  (svn_diff__get_node_count): New method.
  (tree_insert_token): Assigns indices to nodes.
  (svn_diff__get_tokens): Uses token_index (int), not node pointer,
   to identify node in position struct.

* subversion/libsvn_diff/lcs.c
  (svn_diff__snake): Takes token counts as additional argument.
   Skips tokens that are unique to one or the other file.
  (svn_diff__lcs): Takes token counts and number of different tokens
   as additional arguments. Calculates number of tokens unique to
   each source and uses this to compute effective length difference.

* subversion/libsvn_diff/diff.h
  (svn_diff__token_index_t): New type for token indices/counts.
  (svn_diff__position_t): Replaced node pointer member
   by an integer token_index.
  Updated and added new method declarations.

* subversion/libsvn_diff/diff.c
* subversion/libsvn_diff/diff3.c
* subversion/libsvn_diff/diff4.c
  (svn_diff_diff_2, svn_diff_diff_3, svn_diff_diff_4): Count the
   number of times each token appears in each source.
  (svn_diff__resolve_conflict): Takes number of different tokens
   as additional argument. Counts the number of times
   each token appears in each (partial) source.
  (svn_diff__get_token_counts): New method.
]]]

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Morten Kloster <mo...@gmail.com>.
On Sun, May 29, 2011 at 6:00 PM, Bert Huijben <be...@qqmail.nl> wrote:
>
>
>> -----Original Message-----
>> From: Morten Kloster [mailto:morklo@gmail.com]
>> Sent: zondag 29 mei 2011 17:35
>> To: Julian Foad
>> Cc: Mark Phippard; dev@subversion.apache.org
>> Subject: Re: [PATCH] Speed-up of libsvn_diff using token counts
>>
>> On Fri, May 27, 2011 at 7:57 PM, Julian Foad <ju...@wandisco.com>
>> wrote:
>> > Morten Kloster wrote:
>> >> On Fri, May 27, 2011 at 4:55 PM, Julian Foad <ju...@wandisco.com>
>> wrote:
>> >> > Morten Kloster wrote:
>> >> >> I haven't changed the index/count types yet. What's the right type
>> >> >> to use to get signed 32 bit on 32-bit machines and signed 64 bit
>> >> >> on 64-bit machines?
>> >> >
>> >> > "int"?
>> >>
>> >> Is int guaranteed to correspond to (or be larger than) single-process
>> >> addressable space?
>> >
>> > No.  Some 64-bit platforms use 32-bit ints by default and 64-bit
>> > pointers.
>> >
>> > But do you really need to guarantee that svn's text-diff will cope with
>> > more than 2 billion lines on such a system?  Personally I don't think
>> > so.  If you think that is important, you can use "long int".
>
> long int isn't guaranteed to be 64 bit either. (E.g. on Windows 64 long int
> is just 32 bit)
>
> You need something like intptr_t if you want an integer type of pointer
> size.
>
> But treating more than 4 billion lines of text like any other text file
> doesn't seem necessary to me.
>
>        Bert
>
>

And I forgot the attachment, as well... Ok, trying again, using intptr_t this
time.


Morten

RE: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Bert Huijben <be...@qqmail.nl>.

> -----Original Message-----
> From: Morten Kloster [mailto:morklo@gmail.com]
> Sent: zondag 29 mei 2011 17:35
> To: Julian Foad
> Cc: Mark Phippard; dev@subversion.apache.org
> Subject: Re: [PATCH] Speed-up of libsvn_diff using token counts
> 
> On Fri, May 27, 2011 at 7:57 PM, Julian Foad <ju...@wandisco.com>
> wrote:
> > Morten Kloster wrote:
> >> On Fri, May 27, 2011 at 4:55 PM, Julian Foad <ju...@wandisco.com>
> wrote:
> >> > Morten Kloster wrote:
> >> >> I haven't changed the index/count types yet. What's the right type
> >> >> to use to get signed 32 bit on 32-bit machines and signed 64 bit
> >> >> on 64-bit machines?
> >> >
> >> > "int"?
> >>
> >> Is int guaranteed to correspond to (or be larger than) single-process
> >> addressable space?
> >
> > No.  Some 64-bit platforms use 32-bit ints by default and 64-bit
> > pointers.
> >
> > But do you really need to guarantee that svn's text-diff will cope with
> > more than 2 billion lines on such a system?  Personally I don't think
> > so.  If you think that is important, you can use "long int".

long int isn't guaranteed to be 64 bit either. (E.g. on Windows 64 long int
is just 32 bit)

You need something like intptr_t if you want an integer type of pointer
size.

But treating more than 4 billion lines of text like any other text file
doesn't seem necessary to me.

	Bert


Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Morten Kloster <mo...@gmail.com>.
On Fri, May 27, 2011 at 7:57 PM, Julian Foad <ju...@wandisco.com> wrote:
> Morten Kloster wrote:
>> On Fri, May 27, 2011 at 4:55 PM, Julian Foad <ju...@wandisco.com> wrote:
>> > Morten Kloster wrote:
>> >> I haven't changed the index/count types yet. What's the right type
>> >> to use to get signed 32 bit on 32-bit machines and signed 64 bit
>> >> on 64-bit machines?
>> >
>> > "int"?
>>
>> Is int guaranteed to correspond to (or be larger than) single-process
>> addressable space?
>
> No.  Some 64-bit platforms use 32-bit ints by default and 64-bit
> pointers.
>
> But do you really need to guarantee that svn's text-diff will cope with
> more than 2 billion lines on such a system?  Personally I don't think
> so.  If you think that is important, you can use "long int".
>
> - Julian
>
>
>

Here's an updated version which uses int. I have also moved the
token counting to a separate function (as it was repeated in
many places), resolved conflicts from revision 1128857, and
renamed node to token_index in svn_diff__position_t.

I've checked that this new version passes all libsvn_diff tests
and gives identical results to current HEAD for all diffs
between any of the first 100 revisions of merge.c in libsvn_client.

To my knowledge, this version should be ready for commit. It will
need minor conflict resolution to fit with the fp argument patch.

Updated log message:
[[[
Speed-up of libsvn_diff using token counts
By using indices, not node pointers, to refer to tokens, and counting
the number of each token, the longest common subsequence (lcs)
algorithm at the heart of libsvn_diff becomes much faster in many
situations by skipping tokens that are unique to one source or the other.

* subversion/libsvn_diff/token.c
  (svn_diff__node_t): Added 'index' member.
  (svn_diff__tree_t): Added 'node_count' member.
  (svn_diff__get_node_count): New method.
  (tree_insert_token): Assigns indices to nodes.
  (svn_diff__get_tokens): Uses token_index (int), not node pointer,
   to identify node in position struct.

* subversion/libsvn_diff/lcs.c
  (svn_diff__snake): Takes token counts as additional argument.
   Skips tokens that are unique to one or the other file.
  (svn_diff__lcs): Takes token counts and number of different tokens
   as additional arguments. Calculates number of tokens unique to
   each source and uses this to compute effective length difference.

* subversion/libsvn_diff/diff.h
  (svn_diff__position_t): Replaced node pointer member
   by an integer token_index.
  Updated and added new method declarations.

* subversion/libsvn_diff/diff.c
* subversion/libsvn_diff/diff3.c
* subversion/libsvn_diff/diff4.c
  (svn_diff_diff_2, svn_diff_diff_3, svn_diff_diff_4): Count the
   number of times each token appears in each source.
  (svn_diff__resolve_conflict): Takes number of different tokens
   as additional argument. Counts the number of times
   each token appears in each (partial) source.
  (svn_diff__get_token_counts): New method.
]]]


Morten

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Julian Foad <ju...@wandisco.com>.
Morten Kloster wrote:
> On Fri, May 27, 2011 at 4:55 PM, Julian Foad <ju...@wandisco.com> wrote:
> > Morten Kloster wrote:
> >> I haven't changed the index/count types yet. What's the right type
> >> to use to get signed 32 bit on 32-bit machines and signed 64 bit
> >> on 64-bit machines?
> >
> > "int"?
> 
> Is int guaranteed to correspond to (or be larger than) single-process
> addressable space?

No.  Some 64-bit platforms use 32-bit ints by default and 64-bit
pointers.

But do you really need to guarantee that svn's text-diff will cope with
more than 2 billion lines on such a system?  Personally I don't think
so.  If you think that is important, you can use "long int".

- Julian



Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Morten Kloster <mo...@gmail.com>.
On Fri, May 27, 2011 at 4:55 PM, Julian Foad <ju...@wandisco.com> wrote:
> Morten Kloster wrote:
>> I haven't changed the index/count types yet. What's the right type
>> to use to get signed 32 bit on 32-bit machines and signed 64 bit
>> on 64-bit machines?
>
> "int"?
>
> - Julian
>
>
>

Is int guaranteed to correspond to (or be larger than) single-process
addressable space?

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Julian Foad <ju...@wandisco.com>.
Morten Kloster wrote:
> I haven't changed the index/count types yet. What's the right type
> to use to get signed 32 bit on 32-bit machines and signed 64 bit
> on 64-bit machines?

"int"?

- Julian



Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Mark Phippard <ma...@gmail.com>.
On Fri, May 27, 2011 at 10:43 AM, Morten Kloster <mo...@gmail.com> wrote:
> My bad; I had forgotten to wrap two of the counting loops in NULL
> checks. This version fixes it. Thanks again for catching that bug,
> Mark.
>
> I've also figured out how to run the test C programs. The new
> version passes all libsvn_diff tests.

I applied this patch to latest trunk and all tests pass.  This is on OSX.

-- 
Thanks

Mark Phippard
http://markphip.blogspot.com/

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Morten Kloster <mo...@gmail.com>.
My bad; I had forgotten to wrap two of the counting loops in NULL
checks. This version fixes it. Thanks again for catching that bug,
Mark.

I've also figured out how to run the test C programs. The new
version passes all libsvn_diff tests.

I haven't changed the index/count types yet. What's the right type
to use to get signed 32 bit on 32-bit machines and signed 64 bit
on 64-bit machines?

Johan, there are no heuristics involved, but I'll discuss that in detail
in a new thread for that patch.

When diffing each of the first 200 versions of merge.c against the
previous version, the total running time is 14 secs for the original
version and 12 secs for the patched version in a release build, and
29 secs versus 19 secs in a debug build. In this case most of the time
is spent reading the tokens from file anyway, so it doesn't really test
the speed of the LCS algorithm itself.


Morten

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Morten Kloster <mo...@gmail.com>.
Diff3 seemed to work fine here in preliminary testing. I'll run
more tests, but just in case, which version of the patch did
you use? If you somehow got the .patch file that I included
with my original post, which didn't make it to the archive,
that version had a bug (as noted in my 2nd post). It's
best to use libsvn_diff_new2.patch(.txt), from my 6th post.
There is no functional difference between that one and
libsvn_diff_new.patch(.txt) from my 2nd post, but it has
fewer whitespace changes.

Morten

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Morten Kloster <mo...@gmail.com>.
On Thu, May 26, 2011 at 6:32 PM, Mark Phippard <ma...@gmail.com> wrote:
> On Thu, May 26, 2011 at 11:13 AM, Mark Phippard <ma...@gmail.com> wrote:
>> I applied your patch on OSX and figured I would at least run the tests
>> for you.  Builds go OK, but one of the tests fails and two other
>> tests crash with your patch applied.  The crashes look the same in both
>> cases.
>>
>> Note, I have not run tests recently on this box, so I am not 100%
>> certain the tests crash from your patch, but given that it is in diff
>> code, it seems likely.
>
> Just to followup, I reverted your patch and re-ran the tests and did
> not get any failures.
>
> --
> Thanks
>
> Mark Phippard
> http://markphip.blogspot.com/
>

I guess there's a bug in diff3, then (and probably in diff4 as well);
I've tested basic diff quite extensively by now. I'll look into it.

Thanks for doing the tests! :-)

Morten

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Mark Phippard <ma...@gmail.com>.
On Thu, May 26, 2011 at 11:13 AM, Mark Phippard <ma...@gmail.com> wrote:
> I applied your patch on OSX and figured I would at least run the tests
> for you.  Builds go OK, but one of the tests fails and two other
> tests crash with your patch applied.  The crashes look the same in both
> cases.
>
> Note, I have not run tests recently on this box, so I am not 100%
> certain the tests crash from your patch, but given that it is in diff
> code, it seems likely.

Just to followup, I reverted your patch and re-ran the tests and did
not get any failures.

-- 
Thanks

Mark Phippard
http://markphip.blogspot.com/

Re: [PATCH] Speed-up of libsvn_diff using token counts

Posted by Mark Phippard <ma...@gmail.com>.
Hi,

I applied your patch on OSX and figured I would at least run the tests
for you.  Builds go OK, but one of the tests fails and two other
tests crash with your patch applied.  The crashes look the same in both
cases.

Note, I have not run tests recently on this box, so I am not 100%
certain the tests crash from your patch, but given that it is in diff
code, it seems likely.

Here is the test that fails and the output in tests.log:

subversion/tests/svn_test_main.c:285: (apr_err=200006)
svn_tests: E200006: Test crashed (run in debugger with '--allow-segfaults')
FAIL:  diff-diff3-test 6: 3-way merge, conflicting overlapping changes

Here is the first crash.  It is in merge_tests.py:

EXCEPTION: SVNProcessTerminatedBySignal
Traceback (most recent call last):
  File "/Users/mphippard/work/src-trunk/subversion/tests/cmdline/svntest/main.py",
line 1288, in run
    rc = self.pred.run(sandbox)
  File "/Users/mphippard/work/src-trunk/subversion/tests/cmdline/svntest/testcase.py",
line 254, in run
    return self._delegate.run(sandbox)
  File "/Users/mphippard/work/src-trunk/subversion/tests/cmdline/svntest/testcase.py",
line 176, in run
    return self.func(sandbox)
  File "/Users/mphippard/work/src-trunk/subversion/tests/cmdline/merge_tests.py",
line 3216, in cherry_pick_text_conflict
    0) # not a dry_run
  File "/Users/mphippard/work/src-trunk/subversion/tests/cmdline/svntest/actions.py",
line 1014, in run_and_verify_merge
    exit_code, out, err = main.run_svn(error_re_string, *merge_command)
  File "/Users/mphippard/work/src-trunk/subversion/tests/cmdline/svntest/main.py",
line 591, in run_svn
    *(_with_auth(_with_config_dir(varargs))))
  File "/Users/mphippard/work/src-trunk/subversion/tests/cmdline/svntest/main.py",
line 347, in run_command
    None, *varargs)
  File "/Users/mphippard/work/src-trunk/subversion/tests/cmdline/svntest/main.py",
line 509, in run_command_stdin
    *varargs)
  File "/Users/mphippard/work/src-trunk/subversion/tests/cmdline/svntest/main.py",
line 479, in spawn_process
    stdout_lines, stderr_lines, exit_code = wait_on_pipe(kid, binary_mode)
  File "/Users/mphippard/work/src-trunk/subversion/tests/cmdline/svntest/main.py",
line 445, in wait_on_pipe
    raise SVNProcessTerminatedBySignal
SVNProcessTerminatedBySignal
FAIL:  merge_tests.py 27: cherry-pick a dependent change, get conflict

And here is the crash output:

0   libsvn_diff-1.0.dylib         	0x0000000100191956
svn_diff__resolve_conflict + 891
1   libsvn_diff-1.0.dylib         	0x000000010019265c svn_diff_diff3_2 + 2508
2   libsvn_diff-1.0.dylib         	0x0000000100196968
svn_diff_file_diff3_2 + 185
3   libsvn_wc-1.0.dylib           	0x00000001000f2356 do_text_merge + 247
4   libsvn_wc-1.0.dylib           	0x00000001000f39c8 merge_text_file + 565
5   libsvn_wc-1.0.dylib           	0x00000001000f4cc2
svn_wc__internal_merge + 1300
6   libsvn_wc-1.0.dylib           	0x00000001000f51a9 svn_wc_merge4 + 1032
7   libsvn_client-1.0.dylib       	0x0000000100075944 merge_file_changed + 1946
8   libsvn_client-1.0.dylib       	0x000000010009cf30 close_file + 1242
9   libsvn_delta-1.0.dylib        	0x0000000100430b85 close_file + 136
10  libsvn_delta-1.0.dylib        	0x0000000100430b85 close_file + 136
11  libsvn_repos-1.0.dylib        	0x00000001001cf49c update_entry + 3070
12  libsvn_repos-1.0.dylib        	0x00000001001cfe1a delta_dirs + 2406
13  libsvn_repos-1.0.dylib        	0x00000001001d0368 drive + 1233
14  libsvn_repos-1.0.dylib        	0x00000001001d0838 finish_report + 909
15  libsvn_repos-1.0.dylib        	0x00000001001d0c9a
svn_repos_finish_report + 37
16  libsvn_ra_local-1.0.dylib     	0x00000001001a9b33
reporter_finish_report + 41
17  libsvn_client-1.0.dylib       	0x000000010007c81c
drive_merge_report_editor + 2251
18  libsvn_client-1.0.dylib       	0x00000001000838af do_directory_merge + 2146
19  libsvn_client-1.0.dylib       	0x00000001000845b5 do_merge + 2182
20  libsvn_client-1.0.dylib       	0x0000000100089976 merge_peg_locked + 1720
21  libsvn_client-1.0.dylib       	0x0000000100089a99 merge_peg_cb + 187
22  libsvn_wc-1.0.dylib           	0x00000001000f1a78
svn_wc__call_with_write_lock + 145
23  libsvn_client-1.0.dylib       	0x0000000100089c22
svn_client_merge_peg4 + 318
24  svn                           	0x0000000100011827 svn_cl__merge + 2744
25  svn                           	0x0000000100010bbb main + 11470
26  svn                           	0x00000001000016f8 start + 52


The other crash is in stat_tests.py:

EXCEPTION: SVNProcessTerminatedBySignal
Traceback (most recent call last):
  File "/Users/mphippard/work/src-trunk/subversion/tests/cmdline/svntest/main.py",
line 1288, in run
    rc = self.pred.run(sandbox)
  File "/Users/mphippard/work/src-trunk/subversion/tests/cmdline/svntest/testcase.py",
line 176, in run
    return self.func(sandbox)
  File "/Users/mphippard/work/src-trunk/subversion/tests/cmdline/stat_tests.py",
line 1095, in status_add_plus_conflict
    branch_url, '-r', '4:5', trunk_dir)
  File "/Users/mphippard/work/src-trunk/subversion/tests/cmdline/svntest/actions.py",
line 268, in run_and_verify_svn
    expected_exit, *varargs)
  File "/Users/mphippard/work/src-trunk/subversion/tests/cmdline/svntest/actions.py",
line 306, in run_and_verify_svn2
    exit_code, out, err = main.run_svn(want_err, *varargs)
  File "/Users/mphippard/work/src-trunk/subversion/tests/cmdline/svntest/main.py",
line 591, in run_svn
    *(_with_auth(_with_config_dir(varargs))))
  File "/Users/mphippard/work/src-trunk/subversion/tests/cmdline/svntest/main.py",
line 347, in run_command
    None, *varargs)
  File "/Users/mphippard/work/src-trunk/subversion/tests/cmdline/svntest/main.py",
line 509, in run_command_stdin
    *varargs)
  File "/Users/mphippard/work/src-trunk/subversion/tests/cmdline/svntest/main.py",
line 479, in spawn_process
    stdout_lines, stderr_lines, exit_code = wait_on_pipe(kid, binary_mode)
  File "/Users/mphippard/work/src-trunk/subversion/tests/cmdline/svntest/main.py",
line 445, in wait_on_pipe
    raise SVNProcessTerminatedBySignal
SVNProcessTerminatedBySignal
FAIL:  stat_tests.py 22: status on conflicted added file

And here is the crash output
Exception Type:  EXC_BAD_ACCESS (SIGSEGV)
Exception Codes: KERN_INVALID_ADDRESS at 0x0000000000000008
Crashed Thread:  0  Dispatch queue: com.apple.main-thread

Thread 0 Crashed:  Dispatch queue: com.apple.main-thread
0   libsvn_diff-1.0.dylib         	0x0000000100191990
svn_diff__resolve_conflict + 949
1   libsvn_diff-1.0.dylib         	0x000000010019265c svn_diff_diff3_2 + 2508
2   libsvn_diff-1.0.dylib         	0x0000000100196968
svn_diff_file_diff3_2 + 185
3   libsvn_wc-1.0.dylib           	0x00000001000f2356 do_text_merge + 247
4   libsvn_wc-1.0.dylib           	0x00000001000f39c8 merge_text_file + 565
5   libsvn_wc-1.0.dylib           	0x00000001000f4cc2
svn_wc__internal_merge + 1300
6   libsvn_wc-1.0.dylib           	0x00000001000f51a9 svn_wc_merge4 + 1032
7   libsvn_client-1.0.dylib       	0x0000000100075944 merge_file_changed + 1946
8   libsvn_client-1.0.dylib       	0x000000010009cf30 close_file + 1242
9   libsvn_delta-1.0.dylib        	0x0000000100430b85 close_file + 136
10  libsvn_delta-1.0.dylib        	0x0000000100430b85 close_file + 136
11  libsvn_repos-1.0.dylib        	0x00000001001cf49c update_entry + 3070
12  libsvn_repos-1.0.dylib        	0x00000001001cfe1a delta_dirs + 2406
13  libsvn_repos-1.0.dylib        	0x00000001001d0368 drive + 1233
14  libsvn_repos-1.0.dylib        	0x00000001001d0838 finish_report + 909
15  libsvn_repos-1.0.dylib        	0x00000001001d0c9a
svn_repos_finish_report + 37
16  libsvn_ra_local-1.0.dylib     	0x00000001001a9b33
reporter_finish_report + 41
17  libsvn_client-1.0.dylib       	0x000000010007c81c
drive_merge_report_editor + 2251
18  libsvn_client-1.0.dylib       	0x00000001000838af do_directory_merge + 2146
19  libsvn_client-1.0.dylib       	0x00000001000845b5 do_merge + 2182
20  libsvn_client-1.0.dylib       	0x0000000100089976 merge_peg_locked + 1720
21  libsvn_client-1.0.dylib       	0x0000000100089a99 merge_peg_cb + 187
22  libsvn_wc-1.0.dylib           	0x00000001000f1a78
svn_wc__call_with_write_lock + 145
23  libsvn_client-1.0.dylib       	0x0000000100089c22
svn_client_merge_peg4 + 318
24  svn                           	0x0000000100011827 svn_cl__merge + 2744
25  svn                           	0x0000000100010bbb main + 11470
26  svn                           	0x00000001000016f8 start + 52

-- 
Thanks

Mark Phippard
http://markphip.blogspot.com/