You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@commons.apache.org by "Eli Lindsey (JIRA)" <ji...@apache.org> on 2011/03/07 21:05:59 UTC

[jira] Created: (LANG-684) Levenshtein Distance Within a Given Threshold

Levenshtein Distance Within a Given Threshold
---------------------------------------------

                 Key: LANG-684
                 URL: https://issues.apache.org/jira/browse/LANG-684
             Project: Commons Lang
          Issue Type: New Feature
          Components: lang.*
    Affects Versions: 3.1
            Reporter: Eli Lindsey
            Priority: Minor


It'd be nice to have a function that calculates the Levenshtein distance only if it's within some integer threshold.  

Oftentimes you care less about the actual LD and more about it being within a certain range.  This common, limited computation can be performed much faster than the normal unbounded LD method; instead of O(nm), you can do it in O(km) (n and m are string lengths, k is the threshold).

Also, providing a function like this makes it easier for library users to rewrite the unbounded Levenshtein function to run in O(dm) time (d is the edit distance) if necessary.

I'm attaching a patch that implements this function and adds appropriate test cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Updated: (LANG-684) Levenshtein Distance Within a Given Threshold

Posted by "Eli Lindsey (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LANG-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Lindsey updated LANG-684:
-----------------------------

    Attachment: LevenshteinDistanceWithThreshold.patch

> Levenshtein Distance Within a Given Threshold
> ---------------------------------------------
>
>                 Key: LANG-684
>                 URL: https://issues.apache.org/jira/browse/LANG-684
>             Project: Commons Lang
>          Issue Type: New Feature
>          Components: lang.*
>    Affects Versions: 3.1
>            Reporter: Eli Lindsey
>            Priority: Minor
>         Attachments: LevenshteinDistanceWithThreshold.patch
>
>
> It'd be nice to have a function that calculates the Levenshtein distance only if it's within some integer threshold.  
> Oftentimes you care less about the actual LD and more about it being within a certain range.  This common, limited computation can be performed much faster than the normal unbounded LD method; instead of O(nm), you can do it in O(km) (n and m are string lengths, k is the threshold).
> Also, providing a function like this makes it easier for library users to rewrite the unbounded Levenshtein function to run in O(dm) time (d is the edit distance) if necessary.
> I'm attaching a patch that implements this function and adds appropriate test cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (LANG-684) Levenshtein Distance Within a Given Threshold

Posted by "Henri Yandell (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LANG-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Henri Yandell updated LANG-684:
-------------------------------

    Fix Version/s:     (was: 3.x)
                   3.0


I'm not against moving LD to Codec. It used to live with Soundex etc way back in the early days of the codebases, but LD didn't really fit Codec and stayed with Lang. We need to decide before 3.0; and ideally we would release a version in Codec fairly soon.

> Levenshtein Distance Within a Given Threshold
> ---------------------------------------------
>
>                 Key: LANG-684
>                 URL: https://issues.apache.org/jira/browse/LANG-684
>             Project: Commons Lang
>          Issue Type: New Feature
>          Components: lang.*
>            Reporter: Eli Lindsey
>            Priority: Minor
>             Fix For: 3.0
>
>         Attachments: LevenshteinDistanceWithThreshold.patch
>
>
> It'd be nice to have a function that calculates the Levenshtein distance only if it's within some integer threshold.  
> Oftentimes you care less about the actual LD and more about it being within a certain range.  This common, limited computation can be performed much faster than the normal unbounded LD method; instead of O(nm), you can do it in O(km) (n and m are string lengths, k is the threshold).
> Also, providing a function like this makes it easier for library users to rewrite the unbounded Levenshtein function to run in O(dm) time (d is the edit distance) if necessary.
> I'm attaching a patch that implements this function and adds appropriate test cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (LANG-684) Levenshtein Distance Within a Given Threshold

Posted by "Gary D. Gregory (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LANG-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13039183#comment-13039183 ] 

Gary D. Gregory commented on LANG-684:
--------------------------------------

I am just scratching my head here because when I look at SU, the LD method sticks out because it operates at a completely different (higher and more complex) level of abstraction. Other SU methods perform at a much more basic level. Just scratchin' the noggin... otherwise, if it does not tickle anyone else and there are no IP issues, we should be OK to add it.

> Levenshtein Distance Within a Given Threshold
> ---------------------------------------------
>
>                 Key: LANG-684
>                 URL: https://issues.apache.org/jira/browse/LANG-684
>             Project: Commons Lang
>          Issue Type: New Feature
>          Components: lang.*
>            Reporter: Eli Lindsey
>            Priority: Minor
>             Fix For: 3.x
>
>         Attachments: LevenshteinDistanceWithThreshold.patch
>
>
> It'd be nice to have a function that calculates the Levenshtein distance only if it's within some integer threshold.  
> Oftentimes you care less about the actual LD and more about it being within a certain range.  This common, limited computation can be performed much faster than the normal unbounded LD method; instead of O(nm), you can do it in O(km) (n and m are string lengths, k is the threshold).
> Also, providing a function like this makes it easier for library users to rewrite the unbounded Levenshtein function to run in O(dm) time (d is the edit distance) if necessary.
> I'm attaching a patch that implements this function and adds appropriate test cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Updated: (LANG-684) Levenshtein Distance Within a Given Threshold

Posted by "Henri Yandell (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LANG-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Henri Yandell updated LANG-684:
-------------------------------

    Affects Version/s:     (was: 3.1)
        Fix Version/s: 3.1

> Levenshtein Distance Within a Given Threshold
> ---------------------------------------------
>
>                 Key: LANG-684
>                 URL: https://issues.apache.org/jira/browse/LANG-684
>             Project: Commons Lang
>          Issue Type: New Feature
>          Components: lang.*
>            Reporter: Eli Lindsey
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LevenshteinDistanceWithThreshold.patch
>
>
> It'd be nice to have a function that calculates the Levenshtein distance only if it's within some integer threshold.  
> Oftentimes you care less about the actual LD and more about it being within a certain range.  This common, limited computation can be performed much faster than the normal unbounded LD method; instead of O(nm), you can do it in O(km) (n and m are string lengths, k is the threshold).
> Also, providing a function like this makes it easier for library users to rewrite the unbounded Levenshtein function to run in O(dm) time (d is the edit distance) if necessary.
> I'm attaching a patch that implements this function and adds appropriate test cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (LANG-684) Levenshtein Distance Within a Given Threshold

Posted by "Eli Lindsey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LANG-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038983#comment-13038983 ] 

Eli Lindsey commented on LANG-684:
----------------------------------

Any feedback on the patch?

> Levenshtein Distance Within a Given Threshold
> ---------------------------------------------
>
>                 Key: LANG-684
>                 URL: https://issues.apache.org/jira/browse/LANG-684
>             Project: Commons Lang
>          Issue Type: New Feature
>          Components: lang.*
>            Reporter: Eli Lindsey
>            Priority: Minor
>             Fix For: 3.x
>
>         Attachments: LevenshteinDistanceWithThreshold.patch
>
>
> It'd be nice to have a function that calculates the Levenshtein distance only if it's within some integer threshold.  
> Oftentimes you care less about the actual LD and more about it being within a certain range.  This common, limited computation can be performed much faster than the normal unbounded LD method; instead of O(nm), you can do it in O(km) (n and m are string lengths, k is the threshold).
> Also, providing a function like this makes it easier for library users to rewrite the unbounded Levenshtein function to run in O(dm) time (d is the edit distance) if necessary.
> I'm attaching a patch that implements this function and adds appropriate test cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (LANG-684) Levenshtein Distance Within a Given Threshold

Posted by "Stephen Colebourne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LANG-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13039170#comment-13039170 ] 

Stephen Colebourne commented on LANG-684:
-----------------------------------------

Well, we should either move all Levenshtein distance to [codec] or add this. I've never used LD, but constraining the time itoperates in seems like a Good Idea to me.

> Levenshtein Distance Within a Given Threshold
> ---------------------------------------------
>
>                 Key: LANG-684
>                 URL: https://issues.apache.org/jira/browse/LANG-684
>             Project: Commons Lang
>          Issue Type: New Feature
>          Components: lang.*
>            Reporter: Eli Lindsey
>            Priority: Minor
>             Fix For: 3.x
>
>         Attachments: LevenshteinDistanceWithThreshold.patch
>
>
> It'd be nice to have a function that calculates the Levenshtein distance only if it's within some integer threshold.  
> Oftentimes you care less about the actual LD and more about it being within a certain range.  This common, limited computation can be performed much faster than the normal unbounded LD method; instead of O(nm), you can do it in O(km) (n and m are string lengths, k is the threshold).
> Also, providing a function like this makes it easier for library users to rewrite the unbounded Levenshtein function to run in O(dm) time (d is the edit distance) if necessary.
> I'm attaching a patch that implements this function and adds appropriate test cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (LANG-684) Levenshtein Distance Within a Given Threshold

Posted by "Matt Benson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LANG-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050063#comment-13050063 ] 

Matt Benson commented on LANG-684:
----------------------------------

I am no expert, and I see how LD might be used in proximity to soundex codecs, but at the end of the day I agree with Hen that it's not exactly a fit.  We should just add this improvement to [lang] and have done with it.

> Levenshtein Distance Within a Given Threshold
> ---------------------------------------------
>
>                 Key: LANG-684
>                 URL: https://issues.apache.org/jira/browse/LANG-684
>             Project: Commons Lang
>          Issue Type: New Feature
>          Components: lang.*
>            Reporter: Eli Lindsey
>            Priority: Minor
>             Fix For: 3.0
>
>         Attachments: LevenshteinDistanceWithThreshold.patch
>
>
> It'd be nice to have a function that calculates the Levenshtein distance only if it's within some integer threshold.  
> Oftentimes you care less about the actual LD and more about it being within a certain range.  This common, limited computation can be performed much faster than the normal unbounded LD method; instead of O(nm), you can do it in O(km) (n and m are string lengths, k is the threshold).
> Also, providing a function like this makes it easier for library users to rewrite the unbounded Levenshtein function to run in O(dm) time (d is the edit distance) if necessary.
> I'm attaching a patch that implements this function and adds appropriate test cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (LANG-684) Levenshtein Distance Within a Given Threshold

Posted by "Gary D. Gregory (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LANG-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13039165#comment-13039165 ] 

Gary D. Gregory commented on LANG-684:
--------------------------------------

This functionality feels pretty high level for StringUtils. I wonder if this type of code would not a better match for the [codec] project? Thoughts from anyone?

> Levenshtein Distance Within a Given Threshold
> ---------------------------------------------
>
>                 Key: LANG-684
>                 URL: https://issues.apache.org/jira/browse/LANG-684
>             Project: Commons Lang
>          Issue Type: New Feature
>          Components: lang.*
>            Reporter: Eli Lindsey
>            Priority: Minor
>             Fix For: 3.x
>
>         Attachments: LevenshteinDistanceWithThreshold.patch
>
>
> It'd be nice to have a function that calculates the Levenshtein distance only if it's within some integer threshold.  
> Oftentimes you care less about the actual LD and more about it being within a certain range.  This common, limited computation can be performed much faster than the normal unbounded LD method; instead of O(nm), you can do it in O(km) (n and m are string lengths, k is the threshold).
> Also, providing a function like this makes it easier for library users to rewrite the unbounded Levenshtein function to run in O(dm) time (d is the edit distance) if necessary.
> I'm attaching a patch that implements this function and adds appropriate test cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (LANG-684) Levenshtein Distance Within a Given Threshold

Posted by "Gary D. Gregory (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LANG-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050074#comment-13050074 ] 

Gary D. Gregory commented on LANG-684:
--------------------------------------

I'm OK with that.

> Levenshtein Distance Within a Given Threshold
> ---------------------------------------------
>
>                 Key: LANG-684
>                 URL: https://issues.apache.org/jira/browse/LANG-684
>             Project: Commons Lang
>          Issue Type: New Feature
>          Components: lang.*
>            Reporter: Eli Lindsey
>            Priority: Minor
>             Fix For: 3.0
>
>         Attachments: LevenshteinDistanceWithThreshold.patch
>
>
> It'd be nice to have a function that calculates the Levenshtein distance only if it's within some integer threshold.  
> Oftentimes you care less about the actual LD and more about it being within a certain range.  This common, limited computation can be performed much faster than the normal unbounded LD method; instead of O(nm), you can do it in O(km) (n and m are string lengths, k is the threshold).
> Also, providing a function like this makes it easier for library users to rewrite the unbounded Levenshtein function to run in O(dm) time (d is the edit distance) if necessary.
> I'm attaching a patch that implements this function and adds appropriate test cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Resolved] (LANG-684) Levenshtein Distance Within a Given Threshold

Posted by "Matt Benson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LANG-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Benson resolved LANG-684.
------------------------------

    Resolution: Fixed

Committed revision 1136496.

> Levenshtein Distance Within a Given Threshold
> ---------------------------------------------
>
>                 Key: LANG-684
>                 URL: https://issues.apache.org/jira/browse/LANG-684
>             Project: Commons Lang
>          Issue Type: New Feature
>          Components: lang.*
>            Reporter: Eli Lindsey
>            Priority: Minor
>             Fix For: 3.0
>
>         Attachments: LevenshteinDistanceWithThreshold.patch
>
>
> It'd be nice to have a function that calculates the Levenshtein distance only if it's within some integer threshold.  
> Oftentimes you care less about the actual LD and more about it being within a certain range.  This common, limited computation can be performed much faster than the normal unbounded LD method; instead of O(nm), you can do it in O(km) (n and m are string lengths, k is the threshold).
> Also, providing a function like this makes it easier for library users to rewrite the unbounded Levenshtein function to run in O(dm) time (d is the edit distance) if necessary.
> I'm attaching a patch that implements this function and adds appropriate test cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira