You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@subversion.apache.org by Julian Foad <ju...@btopenworld.com> on 2012/11/22 17:01:35 UTC

Re: svn commit: r1412425 - in /subversion/trunk/subversion: include/private/svn_string_private.h libsvn_subr/string.c tests/libsvn_subr/string-test.c

> Author: brane

> Date: Thu Nov 22 05:43:56 2012
> New Revision: 1412425
> 
> URL: http://svn.apache.org/viewvc?rev=1412425&view=rev
> Log:
> Groundwork for issue #4261. Utility for calculating string similarity.
> 
> * subversion/include/private/svn_string_private.h
>   (svn_cstring__similarity): New prototype.
> * subversion/libsvn_subr/string.c
>   (svn_cstring__similarity): Implement.
> 
> * subversion/tests/libsvn_subr/string-test.c
>   (test_string_similarity): New test case.
>   (test_funcs): Register test_string_similarity.

> Modified: subversion/trunk/subversion/include/private/svn_string_private.h
> +/**
> + * Computes the similarity score STRA and STRB, given as the ratio
> + * between the length of their longest common subsequence and the
> + * length of the strings, normalized to the range [0..1000].

"longest common non-contiguous subsequence"

"average length of the two strings"

Just use 'float' and [0..1]?  I'm thinking we may want to use this for comparing diff hunks, and there may be more than 500 characters in each hunk and it matters whether just one character has been changed.

> + * The result is equivalent to Python's
> + *
> + *   difflib.SequenceMatcher.ratio
> + *
> + * Optionally sets *RLCS to the length of the longest common
> + * subsequence of STRA and STRB. Using BUFFER for temporary storage,
> + * requires memory proportional to the length of the shorter string.
[...]

> Modified: subversion/trunk/subversion/tests/libsvn_subr/string-test.c
> +static svn_error_t *
> +test_string_similarity(apr_pool_t *pool)
> +{
> +  const struct sim_score_test_t
> +  {
> +    const char *stra;
> +    const char *strb;
> +    apr_size_t lcs;
> +    int score;
> +  } tests[] =
> +      {
> +#define SCORE(lcs, len) ((2000 * (lcs) + (len)/2) / (len))
> +
> +        /* Equality */
> +        {"",       "",          0, 1000},
> +        {"quoth",  "quoth",     5, SCORE(5, 5+5)},
> +
> +        /* Deletion at start */
> +        {"quoth",  "uoth",      4, SCORE(4, 5+4)},
> +        {"uoth",   "quoth",     4, SCORE(4, 4+5)},
> +
> +        /* Deletion at end */
> +        {"quoth",  "quot",      4, SCORE(4, 5+4)},
> +        {"quot",   "quoth",     4, SCORE(4, 4+5)},
> +
> +        /* Insertion at start */
> +        {"quoth",  "Xquoth",    5, SCORE(5, 5+6)},
> +        {"Xquoth", "quoth",     5, SCORE(5, 6+5)},
> +
> +        /* Insertion at end */
> +        {"quoth",  "quothX",    5, SCORE(5, 5+6)},
> +        {"quothX", "quoth",     5, SCORE(5, 6+5)},

Half of the examples in each of these four pairs are redundant.  
(The word "deletion" implies an ordering of string-A -> string-B, so
 by that terminology "uoth" -> "quoth" is an Insertion, same as 
"quoth" -> "Xquoth".)

> +        /* Insertion in middle */

(That is, "Insertion/deletion in middle".)

> +        {"quoth",  "quoXth",    5, SCORE(5, 5+6)},
> +        {"quoXth", "quoth",     5, SCORE(5, 6+5)},
> +
> +        /* Transposition at start */
> +        {"quoth",  "uqoth",     4, SCORE(4, 5+5)},
> +        {"uqoth",  "quoth",     4, SCORE(4, 5+5)},
> +
> +        /* Transposition at end */
> +        {"quoth",  "quoht",     4, SCORE(4, 5+5)},
> +        {"quoht",  "quoth",     4, SCORE(4, 5+5)},
> +
> +        /* Transposition in middle */
> +        {"quoth",  "qutoh",     4, SCORE(4, 5+5)},
> +        {"qutoh",  "quoth",     4, SCORE(4, 5+5)},

Same redundance in the three "transposition" pairs.

> +        /* Difference */
> +        {"quoth",  "raven",     0, SCORE(0, 5+5)},
> +        {"raven",  "quoth",     0, SCORE(0, 5+5)},

And this pair.

> +        {"x",      "",          0, SCORE(0, 1+0)},
> +        {"",       "x",         0, SCORE(0, 0+1)},
> +        {"",       "quoth",     0, SCORE(0, 0+5)},
> +        {"quoth",  "",          0, SCORE(0, 5+0)},
> +        {"quoth",  "the raven", 2, SCORE(2, 5+9)},
> +        {"the raven",  "quoth", 2, SCORE(2, 5+9)},

And this one.

> +        {NULL, NULL}
> +      };


- Julian

Re: svn commit: r1412425 - in /subversion/trunk/subversion: include/private/svn_string_private.h libsvn_subr/string.c tests/libsvn_subr/string-test.c

Posted by Branko Čibej <br...@wandisco.com>.

On 22.11.2012 18:29, Branko Čibej wrote:
> On 22.11.2012 17:01, Julian Foad wrote:
>> Just use 'float' and [0..1]?  I'm thinking we may want to use this for comparing diff hunks, and there may be more than 500 characters in each hunk and it matters whether just one character has been changed.
> Could do that, too. Harder to test though. :) It's trivial to extend the
> range to [0..10**6] or even 10**9 if that's a problem. Frankly it feels
> nice to keep the whole thing integral and not leave floating-point
> messes around.

Actually I'm being silly. The function already provides the actual LCS
length, so if 3 significant digits aren't enough, you can always compare
that to the length of the hunk.

-- Brane


-- 
Branko Čibej
Director of Subversion | WANdisco | www.wandisco.com

Re: svn commit: r1412425 - in /subversion/trunk/subversion: include/private/svn_string_private.h libsvn_subr/string.c tests/libsvn_subr/string-test.c

Posted by Branko Čibej <br...@wandisco.com>.

On 22.11.2012 17:01, Julian Foad wrote:
>> Author: brane
>> Modified: subversion/trunk/subversion/include/private/svn_string_private.h
>> +/**
>> + * Computes the similarity score STRA and STRB, given as the ratio
>> + * between the length of their longest common subsequence and the
>> + * length of the strings, normalized to the range [0..1000].
> "longest common non-contiguous subsequence"

Yes, we know it is, but that's not what the name of the algorithm is.

> "average length of the two strings"

This is true; however, saying "average" always sounds, to me, as if
we're dealing with a whole bunch of values, rather than just two.

> Just use 'float' and [0..1]?  I'm thinking we may want to use this for comparing diff hunks, and there may be more than 500 characters in each hunk and it matters whether just one character has been changed.

Could do that, too. Harder to test though. :) It's trivial to extend the
range to [0..10**6] or even 10**9 if that's a problem. Frankly it feels
nice to keep the whole thing integral and not leave floating-point
messes around.

> [...]
>
>> Modified: subversion/trunk/subversion/tests/libsvn_subr/string-test.c
>> +static svn_error_t *
>> +test_string_similarity(apr_pool_t *pool)
>> +{
>> +  const struct sim_score_test_t
>> +  {
>> +    const char *stra;
>> +    const char *strb;
>> +    apr_size_t lcs;
>> +    int score;
>> +  } tests[] =
>> +      {
>> +#define SCORE(lcs, len) ((2000 * (lcs) + (len)/2) / (len))
>> +
>> +        /* Equality */
>> +        {"",       "",          0, 1000},
>> +        {"quoth",  "quoth",     5, SCORE(5, 5+5)},
>> +
>> +        /* Deletion at start */
>> +        {"quoth",  "uoth",      4, SCORE(4, 5+4)},
>> +        {"uoth",   "quoth",     4, SCORE(4, 4+5)},
>> +
>> +        /* Deletion at end */
>> +        {"quoth",  "quot",      4, SCORE(4, 5+4)},
>> +        {"quot",   "quoth",     4, SCORE(4, 4+5)},
>> +
>> +        /* Insertion at start */
>> +        {"quoth",  "Xquoth",    5, SCORE(5, 5+6)},
>> +        {"Xquoth", "quoth",     5, SCORE(5, 6+5)},
>> +
>> +        /* Insertion at end */
>> +        {"quoth",  "quothX",    5, SCORE(5, 5+6)},
>> +        {"quothX", "quoth",     5, SCORE(5, 6+5)},
> Half of the examples in each of these four pairs are redundant.  
> (The word "deletion" implies an ordering of string-A -> string-B, so
>  by that terminology "uoth" -> "quoth" is an Insertion, same as 
> "quoth" -> "Xquoth".)

Theoretically and academically speaking, yes, they're redundant.
Practically, however, this helped me find a bug that only popped up when
exactly one of these examples was removed. I love symmetric algorithms
but my bugs are always asymmetric and some of them break causality, too. :)

So in order to avoid falling into the incompleteness trap, I just
mirrored all the test cases. It's less of a problem to have duplicate
tests than it is to miss one.

-- Brane

-- 
Branko Čibej
Director of Subversion | WANdisco | www.wandisco.com