You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@commons.apache.org by luciano-medallia <gi...@git.apache.org> on 2018/07/15 20:16:24 UTC

[GitHub] commons-text pull request #85: Add optimization to limited levenshtein dista...

GitHub user luciano-medallia opened a pull request:

    https://github.com/apache/commons-text/pull/85

    Add optimization to limited levenshtein distance

    When the length difference exceeds the threshold, we can return -1 immediately skipping the algorithm. The reason is that the length difference is a lower bound of the unit-cost Levenshtein distance.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/luciano-medallia/commons-text master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/commons-text/pull/85.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #85
    
----
commit 6b85ebeb0bb999d3dc158c4afb0f8ae4c6eeacac
Author: Luciano Quintabani <lq...@...>
Date:   2018-07-15T20:10:51Z

    Add optimization to limited levenshtein distance

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


[GitHub] commons-text pull request #85: Add optimization to limited levenshtein dista...

Posted by luciano-medallia <gi...@git.apache.org>.
Github user luciano-medallia commented on a diff in the pull request:

    https://github.com/apache/commons-text/pull/85#discussion_r203757309
  
    --- Diff: src/main/java/org/apache/commons/text/similarity/LevenshteinDistance.java ---
    @@ -270,12 +270,6 @@ private static int limitedCompare(CharSequence left, CharSequence right, final i
                 final int max = j > Integer.MAX_VALUE - threshold ? n : Math.min(
                         n, j + threshold);
     
    -            // the stripe may lead off of the table if s and t are of different
    -            // sizes
    -            if (min > max) {
    --- End diff --
    
    Hi @chtompki,
    
    Did you get a chance to give this a look? The change will probably require a little extra clean up to remove comments about the "stripe running off the matrix" issue.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


[GitHub] commons-text pull request #85: Add optimization to limited levenshtein dista...

Posted by chtompki <gi...@git.apache.org>.
Github user chtompki commented on a diff in the pull request:

    https://github.com/apache/commons-text/pull/85#discussion_r202562835
  
    --- Diff: src/main/java/org/apache/commons/text/similarity/LevenshteinDistance.java ---
    @@ -270,12 +270,6 @@ private static int limitedCompare(CharSequence left, CharSequence right, final i
                 final int max = j > Integer.MAX_VALUE - threshold ? n : Math.min(
                         n, j + threshold);
     
    -            // the stripe may lead off of the table if s and t are of different
    -            // sizes
    -            if (min > max) {
    --- End diff --
    
    I'll try to give this a look in the morning (UTC-4), and many thanks for the contribution.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


[GitHub] commons-text issue #85: Add optimization to limited levenshtein distance

Posted by coveralls <gi...@git.apache.org>.
Github user coveralls commented on the issue:

    https://github.com/apache/commons-text/pull/85
  
    
    [![Coverage Status](https://coveralls.io/builds/17994356/badge)](https://coveralls.io/builds/17994356)
    
    Coverage decreased (-0.02%) to 97.813% when pulling **6b85ebeb0bb999d3dc158c4afb0f8ae4c6eeacac on luciano-medallia:master** into **6ad577115e3b7b225af0ee0f3914ee12ec398e84 on apache:master**.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


[GitHub] commons-text pull request #85: Add optimization to limited levenshtein dista...

Posted by luciano-medallia <gi...@git.apache.org>.
Github user luciano-medallia commented on a diff in the pull request:

    https://github.com/apache/commons-text/pull/85#discussion_r202554676
  
    --- Diff: src/main/java/org/apache/commons/text/similarity/LevenshteinDistance.java ---
    @@ -270,12 +270,6 @@ private static int limitedCompare(CharSequence left, CharSequence right, final i
                 final int max = j > Integer.MAX_VALUE - threshold ? n : Math.min(
                         n, j + threshold);
     
    -            // the stripe may lead off of the table if s and t are of different
    -            // sizes
    -            if (min > max) {
    --- End diff --
    
    Now we know that m - n <= threshold. Therefore,
    
    j - threshold > n cannot be true, because j's maximum value is m, and in that case the inequality is equivalent to:
    
    m - m > threshold


---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


[GitHub] commons-text pull request #85: Add optimization to limited levenshtein dista...

Posted by kinow <gi...@git.apache.org>.
Github user kinow commented on a diff in the pull request:

    https://github.com/apache/commons-text/pull/85#discussion_r204208775
  
    --- Diff: src/main/java/org/apache/commons/text/similarity/LevenshteinDistance.java ---
    @@ -270,12 +270,6 @@ private static int limitedCompare(CharSequence left, CharSequence right, final i
                 final int max = j > Integer.MAX_VALUE - threshold ? n : Math.min(
                         n, j + threshold);
     
    -            // the stripe may lead off of the table if s and t are of different
    -            // sizes
    -            if (min > max) {
    --- End diff --
    
    A good catch I think @luciano-medallia . @chtompki here's what I understood from reading the code and debugging a bit.
    
    If you have the strings s1="papa" and s2="papakura", and you use the threshold 3, the algorithm won't initiate, and we will fail earlier.
    
    That's because the we are saying that any edit distance under the threshold (3) must be ignored, and we would now be checking that the difference between the length of both (4) is greater than the threshold.
    
    So as the difference between both is 4, and it is greater than our threshold of 3, there's no need to enter the main algorithm.
    
    If we increase the threshold to 4, then we get the distance of 4 as it's ok according to the threshold in our code.
    
    The part of the code removed I assumed was a constraint removed by this new check. Also debugged with different values and couldn't find a combination to trigger that if. So I think the change is OK.
    
    All tests passed locally too.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


[GitHub] commons-text pull request #85: Add optimization to limited levenshtein dista...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/commons-text/pull/85


---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org