You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@commons.apache.org by "Benson Margulies (JIRA)" <ji...@apache.org> on 2010/02/08 02:51:28 UTC

[jira] Created: (LANG-591) A more complex Levenshtein distance would be useful

A more complex Levenshtein distance would be useful
---------------------------------------------------

                 Key: LANG-591
                 URL: https://issues.apache.org/jira/browse/LANG-591
             Project: Commons Lang
          Issue Type: New Feature
          Components: lang.*
    Affects Versions: 3.0
            Reporter: Benson Margulies


For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (LANG-591) A more complex Levenshtein distance would be useful

Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LANG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836585#action_12836585 ] 

Benson Margulies commented on LANG-591:
---------------------------------------

Henri,

I've spent too much time fixing things that allocate objects in a tight loop, and I think it's gone to my brain. I can't really claim that, in this case, anyone is all that likely to be running this 1,000,000,000 times in a loop.

So if you think that choice 1 is the natural alternative, I'll do it. 

Would you like the 'diff' algorithm while I'm at it (longest common subsequence)? I ended up coding one of them, too.

--benson


> A more complex Levenshtein distance would be useful
> ---------------------------------------------------
>
>                 Key: LANG-591
>                 URL: https://issues.apache.org/jira/browse/LANG-591
>             Project: Commons Lang
>          Issue Type: New Feature
>          Components: lang.*
>    Affects Versions: 3.0
>            Reporter: Benson Margulies
>         Attachments: LANG-591.diff
>
>
> For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (LANG-591) A more complex Levenshtein distance would be useful

Posted by "Henri Yandell (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LANG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Henri Yandell updated LANG-591:
-------------------------------

    Fix Version/s: 3.1

I think #1 is best *nod*.

Putting in 3.1.

Diff algorithm sounds great :)

> A more complex Levenshtein distance would be useful
> ---------------------------------------------------
>
>                 Key: LANG-591
>                 URL: https://issues.apache.org/jira/browse/LANG-591
>             Project: Commons Lang
>          Issue Type: New Feature
>          Components: lang.*
>    Affects Versions: 3.0
>            Reporter: Benson Margulies
>             Fix For: 3.1
>
>         Attachments: LANG-591.diff
>
>
> For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (LANG-591) A more complex Levenshtein distance would be useful

Posted by "Henri Yandell (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LANG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830816#action_12830816 ] 

Henri Yandell commented on LANG-591:
------------------------------------

Not a fan of the API - i.e. LevenshteinResults should be an immutable return, not a mutable passed in. setXxx methods should be removed.

I'd also be tempted to retcon the old API on top of the new one - slows it down a bit, though a private method with a boolean 'collateResults' parameter could be used to keep things largely optimized. 

> A more complex Levenshtein distance would be useful
> ---------------------------------------------------
>
>                 Key: LANG-591
>                 URL: https://issues.apache.org/jira/browse/LANG-591
>             Project: Commons Lang
>          Issue Type: New Feature
>          Components: lang.*
>    Affects Versions: 3.0
>            Reporter: Benson Margulies
>         Attachments: LANG-591.diff
>
>
> For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (LANG-591) A more complex Levenshtein distance would be useful

Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LANG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benson Margulies updated LANG-591:
----------------------------------

    Attachment:     (was: LANG-591.diff)

> A more complex Levenshtein distance would be useful
> ---------------------------------------------------
>
>                 Key: LANG-591
>                 URL: https://issues.apache.org/jira/browse/LANG-591
>             Project: Commons Lang
>          Issue Type: New Feature
>          Components: lang.*
>    Affects Versions: 3.0
>            Reporter: Benson Margulies
>             Fix For: 3.1
>
>         Attachments: LANG-591.patch
>
>
> For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (LANG-591) A more complex Levenshtein distance would be useful

Posted by "Henri Yandell (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LANG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836537#action_12836537 ] 

Henri Yandell commented on LANG-591:
------------------------------------


Why is #1 a performance issue?


> A more complex Levenshtein distance would be useful
> ---------------------------------------------------
>
>                 Key: LANG-591
>                 URL: https://issues.apache.org/jira/browse/LANG-591
>             Project: Commons Lang
>          Issue Type: New Feature
>          Components: lang.*
>    Affects Versions: 3.0
>            Reporter: Benson Margulies
>         Attachments: LANG-591.diff
>
>
> For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (LANG-591) A more complex Levenshtein distance would be useful

Posted by "Henri Yandell (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LANG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890367#action_12890367 ] 

Henri Yandell commented on LANG-591:
------------------------------------

API looks good. I'd still remove the setXyz rather than just make them package-private. Wait for someone to have a use case to need them.

Should it go inside StringUtils?

Is it the same algorithm as the current StringUtils code? Looks different?

> A more complex Levenshtein distance would be useful
> ---------------------------------------------------
>
>                 Key: LANG-591
>                 URL: https://issues.apache.org/jira/browse/LANG-591
>             Project: Commons Lang
>          Issue Type: New Feature
>          Components: lang.*
>    Affects Versions: 3.0
>            Reporter: Benson Margulies
>             Fix For: 3.1
>
>         Attachments: LANG-591.diff
>
>
> For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (LANG-591) A more complex Levenshtein distance would be useful

Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LANG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benson Margulies updated LANG-591:
----------------------------------

    Attachment: LANG-591.patch

Flushed setters.

> A more complex Levenshtein distance would be useful
> ---------------------------------------------------
>
>                 Key: LANG-591
>                 URL: https://issues.apache.org/jira/browse/LANG-591
>             Project: Commons Lang
>          Issue Type: New Feature
>          Components: lang.*
>    Affects Versions: 3.0
>            Reporter: Benson Margulies
>             Fix For: 3.1
>
>         Attachments: LANG-591.patch
>
>
> For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (LANG-591) A more complex Levenshtein distance would be useful

Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LANG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882851#action_12882851 ] 

Benson Margulies commented on LANG-591:
---------------------------------------

Hey, any reaction to revision?

> A more complex Levenshtein distance would be useful
> ---------------------------------------------------
>
>                 Key: LANG-591
>                 URL: https://issues.apache.org/jira/browse/LANG-591
>             Project: Commons Lang
>          Issue Type: New Feature
>          Components: lang.*
>    Affects Versions: 3.0
>            Reporter: Benson Margulies
>             Fix For: 3.1
>
>         Attachments: LANG-591.diff
>
>
> For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (LANG-591) A more complex Levenshtein distance would be useful

Posted by "Henri Yandell (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LANG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889588#action_12889588 ] 

Henri Yandell commented on LANG-591:
------------------------------------

Sorry for not picking up on the revised patch.

Any reason why the API isn't:

    public static LevenshteinResults getLevenshteinDistance(CharSequence s, CharSequence t)?

I would also delete the setXyz methods on the Results object. They're not used and I don't see any value in being able to edit the results object.

Otherwise it looks good to me :)

> A more complex Levenshtein distance would be useful
> ---------------------------------------------------
>
>                 Key: LANG-591
>                 URL: https://issues.apache.org/jira/browse/LANG-591
>             Project: Commons Lang
>          Issue Type: New Feature
>          Components: lang.*
>    Affects Versions: 3.0
>            Reporter: Benson Margulies
>             Fix For: 3.1
>
>         Attachments: LANG-591.diff
>
>
> For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (LANG-591) A more complex Levenshtein distance would be useful

Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LANG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830824#action_12830824 ] 

Benson Margulies commented on LANG-591:
---------------------------------------

Fair enough. How abouts I refine it outside until you like it, and then we can discuss retcon? One issue is my preference for array+offset+length for speed in ironic counterpoint to the slower algorithm.


> A more complex Levenshtein distance would be useful
> ---------------------------------------------------
>
>                 Key: LANG-591
>                 URL: https://issues.apache.org/jira/browse/LANG-591
>             Project: Commons Lang
>          Issue Type: New Feature
>          Components: lang.*
>    Affects Versions: 3.0
>            Reporter: Benson Margulies
>         Attachments: LANG-591.diff
>
>
> For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (LANG-591) A more complex Levenshtein distance would be useful

Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LANG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benson Margulies updated LANG-591:
----------------------------------

    Attachment:     (was: LANG-591.diff)

> A more complex Levenshtein distance would be useful
> ---------------------------------------------------
>
>                 Key: LANG-591
>                 URL: https://issues.apache.org/jira/browse/LANG-591
>             Project: Commons Lang
>          Issue Type: New Feature
>          Components: lang.*
>    Affects Versions: 3.0
>            Reporter: Benson Margulies
>             Fix For: 3.1
>
>         Attachments: LANG-591.diff
>
>
> For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (LANG-591) A more complex Levenshtein distance would be useful

Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LANG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benson Margulies updated LANG-591:
----------------------------------

    Attachment: LANG-591.diff

Updated patch.

> A more complex Levenshtein distance would be useful
> ---------------------------------------------------
>
>                 Key: LANG-591
>                 URL: https://issues.apache.org/jira/browse/LANG-591
>             Project: Commons Lang
>          Issue Type: New Feature
>          Components: lang.*
>    Affects Versions: 3.0
>            Reporter: Benson Margulies
>             Fix For: 3.1
>
>         Attachments: LANG-591.diff
>
>
> For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (LANG-591) A more complex Levenshtein distance would be useful

Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LANG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benson Margulies updated LANG-591:
----------------------------------

    Attachment:     (was: LANG-591.diff)

> A more complex Levenshtein distance would be useful
> ---------------------------------------------------
>
>                 Key: LANG-591
>                 URL: https://issues.apache.org/jira/browse/LANG-591
>             Project: Commons Lang
>          Issue Type: New Feature
>          Components: lang.*
>    Affects Versions: 3.0
>            Reporter: Benson Margulies
>             Fix For: 3.1
>
>         Attachments: LANG-591.diff
>
>
> For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (LANG-591) A more complex Levenshtein distance would be useful

Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LANG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benson Margulies updated LANG-591:
----------------------------------

    Attachment: LANG-591.diff

Patch.

> A more complex Levenshtein distance would be useful
> ---------------------------------------------------
>
>                 Key: LANG-591
>                 URL: https://issues.apache.org/jira/browse/LANG-591
>             Project: Commons Lang
>          Issue Type: New Feature
>          Components: lang.*
>    Affects Versions: 3.0
>            Reporter: Benson Margulies
>         Attachments: LANG-591.diff
>
>
> For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Issue Comment Edited: (LANG-591) A more complex Levenshtein distance would be useful

Posted by "Henri Yandell (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LANG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838170#action_12838170 ] 

Henri Yandell edited comment on LANG-591 at 2/25/10 3:06 AM:
-------------------------------------------------------------

I think #1 is best.

Putting in 3.1.

Diff algorithm sounds great :)

      was (Author: bayard):
    I think #1 is best *nod*.

Putting in 3.1.

Diff algorithm sounds great :)
  
> A more complex Levenshtein distance would be useful
> ---------------------------------------------------
>
>                 Key: LANG-591
>                 URL: https://issues.apache.org/jira/browse/LANG-591
>             Project: Commons Lang
>          Issue Type: New Feature
>          Components: lang.*
>    Affects Versions: 3.0
>            Reporter: Benson Margulies
>             Fix For: 3.1
>
>         Attachments: LANG-591.diff
>
>
> For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (LANG-591) A more complex Levenshtein distance would be useful

Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LANG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benson Margulies updated LANG-591:
----------------------------------

    Attachment: LANG-591.diff

Update to latest comments.

> A more complex Levenshtein distance would be useful
> ---------------------------------------------------
>
>                 Key: LANG-591
>                 URL: https://issues.apache.org/jira/browse/LANG-591
>             Project: Commons Lang
>          Issue Type: New Feature
>          Components: lang.*
>    Affects Versions: 3.0
>            Reporter: Benson Margulies
>             Fix For: 3.1
>
>         Attachments: LANG-591.diff
>
>
> For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (LANG-591) A more complex Levenshtein distance would be useful

Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LANG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890447#action_12890447 ] 

Benson Margulies commented on LANG-591:
---------------------------------------

Sorry about the setters, I lost focus.

At the end of the day, there's only one LD algorithm, however different the particular version. However, I do have a field in this class to avoid new'ing a new grid each time. Wouldn't that be a bit ugly in StringUtils?

> A more complex Levenshtein distance would be useful
> ---------------------------------------------------
>
>                 Key: LANG-591
>                 URL: https://issues.apache.org/jira/browse/LANG-591
>             Project: Commons Lang
>          Issue Type: New Feature
>          Components: lang.*
>    Affects Versions: 3.0
>            Reporter: Benson Margulies
>             Fix For: 3.1
>
>         Attachments: LANG-591.diff
>
>
> For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (LANG-591) A more complex Levenshtein distance would be useful

Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LANG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830927#action_12830927 ] 

Benson Margulies commented on LANG-591:
---------------------------------------

Hen,

There's a thread-safety issue here. The choices are:

1) an API that takes the strings in and returns a new results object. I object to this on performance grounds.
2) a separate Levenshtein object that takes two strings and then returns all the various results via get accessors (no result object).
3) an API like the one I put in the patch.

I think that #2 is the best choice, but I'd like to be on the same page as you before I diddle further.

--benson


> A more complex Levenshtein distance would be useful
> ---------------------------------------------------
>
>                 Key: LANG-591
>                 URL: https://issues.apache.org/jira/browse/LANG-591
>             Project: Commons Lang
>          Issue Type: New Feature
>          Components: lang.*
>    Affects Versions: 3.0
>            Reporter: Benson Margulies
>         Attachments: LANG-591.diff
>
>
> For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.