You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@commons.apache.org by Henri Yandell <ba...@generationjava.com> on 2003/10/22 22:53:53 UTC

Re: [lang] [PATCH] StringUtils.getLevenshteinDistance() fix for OutOfMemoryError when used with LONG strings

Cool. Sounds good and I'll have a go at applying your patch.

Any chance of you submitting your other unit tests for the class?

Hen

On Wed, 22 Oct 2003, Chas Emerick wrote:

> I've never submitted a patch for an open-source project before (never
> got around to it lo these many years I guess), so I apologize for any
> errors in form or convention that I commit.
>
> I was planning on using the
> StringUtils.getLevenshteinDistance(String,String) method in a
> particular circumstance where I needed to get the edit distance between
> two large strings (anywhere between 10K - 500K characters each).
> However, I found that calling that method (as of v2.0 of commons-lang)
> would result in an OutOfMemoryError when strings of such lengths were
> provided.
>
> A quick look at the source revealed that the current implementation
> (which uses the sample impl. at http://www.merriampark.com/ld.htm)
> creates a matrix with dimensions corresponding to the lengths of the
> two strings provided.  Clearly, a 100K x 100K int[] is problematic.
>
> Therefore, I've (mostly) rewritten the method using a two int arrays of
> the size of the first string's length, and the method now works
> properly with larger strings (beastly slow, but that's quadratic
> algorithms for you) .  I've tested the new implementation against the
> ten or so testcases mentioned in the javadocs, as well as another
> half-dozen of my own, and everything looks good.
>
> If my mail client botches the patch diff, you can get it at
> http://www.snowtide.com/commons-lang-LDpatch.txt
>
> Chas Emerick   |   cemerick@snowtide.com
>
> http://www.snowtide.com
> Snowtide Informatics Systems, Inc.
>
> ========================================================================
> ==
> --- StringUtils.java.old        Wed Oct 22 12:58:04 2003
> +++ StringUtils.java    Wed Oct 22 13:51:36 2003
> @@ -4255,8 +4255,8 @@
>        * another, where each change is a single character modification
> (deletion,
>        * insertion or substitution).</p>
>        *
> -     * <p>This implementation of the Levenshtein distance algorithm
> -     * is from <a
> href="http://www.merriampark.com/ld.htm">http://www.merriampark.com/ld.
> htm</a></p>
> +     * <p>This implementation of the Levenshtein distance algorithm
> was originally based
> upon the one
> +     * presented at <a
> href="http://www.merriampark.com/ld.htm">http://www.merriampark.co
> m/ld.htm</a></p>
>        *
>        * <pre>
>        * StringUtils.getLevenshteinDistance(null, *)             =
> IllegalArgumentException
> @@ -4281,76 +4281,64 @@
>           if (s == null || t == null) {
>               throw new IllegalArgumentException("Strings must not be
> null");
>           }
> -        int d[][]; // matrix
> -        int n; // length of s
> -        int m; // length of t
> -        int i; // iterates through s
> -        int j; // iterates through t
> -        char s_i; // ith character of s
> -        char t_j; // jth character of t
> -        int cost; // cost
> -
> -        // Step 1
> -        n = s.length();
> -        m = t.length();
> +
> +               /*
> +               The difference between this impl. and the previous is
> that, rather than cr
> eating and retaining a matrix of size
> +               s.length()+1 by t.length()+1, we maintain two
> single-dimensional arrays of
>   length s.length()+1.  The first, d,
> +               is the 'current working' distance array that maintains
> the newest distance
>   cost counts as we iterate through
> +               the characters of String s.  Each time we increment the
> index of String t
> we are comparing, d is copied to p,
> +               the second int[].  Doing so allows us to retain the
> previous cost counts a
> s required by the algorithm
> +               (taking the minimum of the cost count to the left, up
> one, and diagonally
> up and to the left of the current
> +               cost count being calculated).  (Note that the arrays
> aren't really copied
> anymore, just switched...this is clearly much
> +               better than cloning an array or doing a
> System.arraycopy() each time throu
> gh the outer loop.)
> +
> +               Effectively, the difference between the two
> implementations is this one do
> es not cause an out of memory condition
> +               when calculating the LD over two very large strings.
> +               */
> +
> +        int n = s.length(); // length of s
> +        int m = t.length(); // length of t
> +
>           if (n == 0) {
>               return m;
> -        }
> -        if (m == 0) {
> +        } else if (m == 0) {
>               return n;
>           }
> -        d = new int[n + 1][m + 1];
>
> -        // Step 2
> -        for (i = 0; i <= n; i++) {
> -            d[i][0] = i;
> -        }
> +        int p[] = new int[n+1]; //'previous' cost array, horizontally
> +        int d[] = new int[n+1]; // cost array, horizontally
> +               int _d[]; //placeholder to assist in swapping p and d
>
> -        for (j = 0; j <= m; j++) {
> -            d[0][j] = j;
> -        }
> -
> -        // Step 3
> -        for (i = 1; i <= n; i++) {
> -            s_i = s.charAt(i - 1);
> -
> -            // Step 4
> -            for (j = 1; j <= m; j++) {
> -                t_j = t.charAt(j - 1);
> -
> -                // Step 5
> -                if (s_i == t_j) {
> -                    cost = 0;
> -                } else {
> -                    cost = 1;
> -                }
> +        //indexes into strings s and t
> +        int i; // iterates through s
> +        int j; // iterates through t
>
> -                // Step 6
> -                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i -
> 1][j - 1] + cost);
> -            }
> -        }
> +        char t_j; // jth character of t
>
> -        // Step 7
> -        return d[n][m];
> -    }
> +        int cost; // cost
>
> -    /**
> -     * <p>Gets the minimum of three <code>int</code> values.</p>
> -     *
> -     * @param a  value 1
> -     * @param b  value 2
> -     * @param c  value 3
> -     * @return  the smallest of the values
> -     */
> -    private static int min(int a, int b, int c) {
> -        // Method copied from NumberUtils to avoid dependency on
> subpackage
> -        if (b < a) {
> -            a = b;
> -        }
> -        if (c < a) {
> -            a = c;
> +        for (i = 0; i<=n; i++) {
> +            p[i] = i;
>           }
> -        return a;
> +
> +        for (j = 1; j<=m; j++) {
> +            t_j = t.charAt(j-1);
> +            d[0] = j;
> +
> +            for (i=1; i<=n; i++) {
> +                cost = s.charAt(i-1)==t_j ? 0 : 1;
> +
> +                d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1),
> p[i-1]+cost);  //minimum of c
> ell to the left+1, to the top+1, diagonally left and up +cost
> +            }
> +
> +            //copy current distance counts to 'previous row' distance
> counts
> +            _d = p;
> +            p = d;
> +            d = _d;
> +        }
> +
> +        //our last action in the above loop was to switch d and p, so
> p now actual
> ly has the most recent cost counts
> +        return p[n];
>       }
>
>   }
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: commons-dev-help@jakarta.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org


Re: [lang] [PATCH] StringUtils.getLevenshteinDistance() fix for OutOfMemoryError when used with LONG strings

Posted by Chas Emerick <ce...@snowtide.com>.
Sure...although there's not much I can add with regard to the simpler  
test cases, I could use RandomStringUtils to generate some long strings  
to test behaviour that the patch addresses.  The downside is that the  
algorithm is fundamentally quadratic, so doing such a test would  
significantly extend the time required to complete the java-lang tests  
(i.e. ~7 minutes to calculate the character-level edit distance of two  
strings of 50K each).  If you think that's OK, I'll submit a patch to  
the test class.

Chas Emerick   |   cemerick@snowtide.com

http://www.snowtide.com
Snowtide Informatics Systems, Inc.


On Wednesday, October 22, 2003, at 04:53 PM, Henri Yandell wrote:

>
> Cool. Sounds good and I'll have a go at applying your patch.
>
> Any chance of you submitting your other unit tests for the class?
>
> Hen
>
> On Wed, 22 Oct 2003, Chas Emerick wrote:
>
>> I've never submitted a patch for an open-source project before (never
>> got around to it lo these many years I guess), so I apologize for any
>> errors in form or convention that I commit.
>>
>> I was planning on using the
>> StringUtils.getLevenshteinDistance(String,String) method in a
>> particular circumstance where I needed to get the edit distance  
>> between
>> two large strings (anywhere between 10K - 500K characters each).
>> However, I found that calling that method (as of v2.0 of commons-lang)
>> would result in an OutOfMemoryError when strings of such lengths were
>> provided.
>>
>> A quick look at the source revealed that the current implementation
>> (which uses the sample impl. at http://www.merriampark.com/ld.htm)
>> creates a matrix with dimensions corresponding to the lengths of the
>> two strings provided.  Clearly, a 100K x 100K int[] is problematic.
>>
>> Therefore, I've (mostly) rewritten the method using a two int arrays  
>> of
>> the size of the first string's length, and the method now works
>> properly with larger strings (beastly slow, but that's quadratic
>> algorithms for you) .  I've tested the new implementation against the
>> ten or so testcases mentioned in the javadocs, as well as another
>> half-dozen of my own, and everything looks good.
>>
>> If my mail client botches the patch diff, you can get it at
>> http://www.snowtide.com/commons-lang-LDpatch.txt
>>
>> Chas Emerick   |   cemerick@snowtide.com
>>
>> http://www.snowtide.com
>> Snowtide Informatics Systems, Inc.
>>
>> ====================================================================== 
>> ==
>> ==
>> --- StringUtils.java.old        Wed Oct 22 12:58:04 2003
>> +++ StringUtils.java    Wed Oct 22 13:51:36 2003
>> @@ -4255,8 +4255,8 @@
>>        * another, where each change is a single character modification
>> (deletion,
>>        * insertion or substitution).</p>
>>        *
>> -     * <p>This implementation of the Levenshtein distance algorithm
>> -     * is from <a
>> href="http://www.merriampark.com/ld.htm">http://www.merriampark.com/ 
>> ld.
>> htm</a></p>
>> +     * <p>This implementation of the Levenshtein distance algorithm
>> was originally based
>> upon the one
>> +     * presented at <a
>> href="http://www.merriampark.com/ld.htm">http://www.merriampark.co
>> m/ld.htm</a></p>
>>        *
>>        * <pre>
>>        * StringUtils.getLevenshteinDistance(null, *)             =
>> IllegalArgumentException
>> @@ -4281,76 +4281,64 @@
>>           if (s == null || t == null) {
>>               throw new IllegalArgumentException("Strings must not be
>> null");
>>           }
>> -        int d[][]; // matrix
>> -        int n; // length of s
>> -        int m; // length of t
>> -        int i; // iterates through s
>> -        int j; // iterates through t
>> -        char s_i; // ith character of s
>> -        char t_j; // jth character of t
>> -        int cost; // cost
>> -
>> -        // Step 1
>> -        n = s.length();
>> -        m = t.length();
>> +
>> +               /*
>> +               The difference between this impl. and the previous is
>> that, rather than cr
>> eating and retaining a matrix of size
>> +               s.length()+1 by t.length()+1, we maintain two
>> single-dimensional arrays of
>>   length s.length()+1.  The first, d,
>> +               is the 'current working' distance array that maintains
>> the newest distance
>>   cost counts as we iterate through
>> +               the characters of String s.  Each time we increment  
>> the
>> index of String t
>> we are comparing, d is copied to p,
>> +               the second int[].  Doing so allows us to retain the
>> previous cost counts a
>> s required by the algorithm
>> +               (taking the minimum of the cost count to the left, up
>> one, and diagonally
>> up and to the left of the current
>> +               cost count being calculated).  (Note that the arrays
>> aren't really copied
>> anymore, just switched...this is clearly much
>> +               better than cloning an array or doing a
>> System.arraycopy() each time throu
>> gh the outer loop.)
>> +
>> +               Effectively, the difference between the two
>> implementations is this one do
>> es not cause an out of memory condition
>> +               when calculating the LD over two very large strings.
>> +               */
>> +
>> +        int n = s.length(); // length of s
>> +        int m = t.length(); // length of t
>> +
>>           if (n == 0) {
>>               return m;
>> -        }
>> -        if (m == 0) {
>> +        } else if (m == 0) {
>>               return n;
>>           }
>> -        d = new int[n + 1][m + 1];
>>
>> -        // Step 2
>> -        for (i = 0; i <= n; i++) {
>> -            d[i][0] = i;
>> -        }
>> +        int p[] = new int[n+1]; //'previous' cost array, horizontally
>> +        int d[] = new int[n+1]; // cost array, horizontally
>> +               int _d[]; //placeholder to assist in swapping p and d
>>
>> -        for (j = 0; j <= m; j++) {
>> -            d[0][j] = j;
>> -        }
>> -
>> -        // Step 3
>> -        for (i = 1; i <= n; i++) {
>> -            s_i = s.charAt(i - 1);
>> -
>> -            // Step 4
>> -            for (j = 1; j <= m; j++) {
>> -                t_j = t.charAt(j - 1);
>> -
>> -                // Step 5
>> -                if (s_i == t_j) {
>> -                    cost = 0;
>> -                } else {
>> -                    cost = 1;
>> -                }
>> +        //indexes into strings s and t
>> +        int i; // iterates through s
>> +        int j; // iterates through t
>>
>> -                // Step 6
>> -                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i -
>> 1][j - 1] + cost);
>> -            }
>> -        }
>> +        char t_j; // jth character of t
>>
>> -        // Step 7
>> -        return d[n][m];
>> -    }
>> +        int cost; // cost
>>
>> -    /**
>> -     * <p>Gets the minimum of three <code>int</code> values.</p>
>> -     *
>> -     * @param a  value 1
>> -     * @param b  value 2
>> -     * @param c  value 3
>> -     * @return  the smallest of the values
>> -     */
>> -    private static int min(int a, int b, int c) {
>> -        // Method copied from NumberUtils to avoid dependency on
>> subpackage
>> -        if (b < a) {
>> -            a = b;
>> -        }
>> -        if (c < a) {
>> -            a = c;
>> +        for (i = 0; i<=n; i++) {
>> +            p[i] = i;
>>           }
>> -        return a;
>> +
>> +        for (j = 1; j<=m; j++) {
>> +            t_j = t.charAt(j-1);
>> +            d[0] = j;
>> +
>> +            for (i=1; i<=n; i++) {
>> +                cost = s.charAt(i-1)==t_j ? 0 : 1;
>> +
>> +                d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1),
>> p[i-1]+cost);  //minimum of c
>> ell to the left+1, to the top+1, diagonally left and up +cost
>> +            }
>> +
>> +            //copy current distance counts to 'previous row' distance
>> counts
>> +            _d = p;
>> +            p = d;
>> +            d = _d;
>> +        }
>> +
>> +        //our last action in the above loop was to switch d and p, so
>> p now actual
>> ly has the most recent cost counts
>> +        return p[n];
>>       }
>>
>>   }
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
>> For additional commands, e-mail: commons-dev-help@jakarta.apache.org
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: commons-dev-help@jakarta.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org