You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@commons.apache.org by "Benson Margulies (JIRA)" <ji...@apache.org> on 2010/02/08 02:51:28 UTC
[jira] Created: (LANG-591) A more complex Levenshtein distance
would be useful
A more complex Levenshtein distance would be useful
---------------------------------------------------
Key: LANG-591
URL: https://issues.apache.org/jira/browse/LANG-591
Project: Commons Lang
Issue Type: New Feature
Components: lang.*
Affects Versions: 3.0
Reporter: Benson Margulies
For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (LANG-591) A more complex Levenshtein distance
would be useful
Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LANG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836585#action_12836585 ]
Benson Margulies commented on LANG-591:
---------------------------------------
Henri,
I've spent too much time fixing things that allocate objects in a tight loop, and I think it's gone to my brain. I can't really claim that, in this case, anyone is all that likely to be running this 1,000,000,000 times in a loop.
So if you think that choice 1 is the natural alternative, I'll do it.
Would you like the 'diff' algorithm while I'm at it (longest common subsequence)? I ended up coding one of them, too.
--benson
> A more complex Levenshtein distance would be useful
> ---------------------------------------------------
>
> Key: LANG-591
> URL: https://issues.apache.org/jira/browse/LANG-591
> Project: Commons Lang
> Issue Type: New Feature
> Components: lang.*
> Affects Versions: 3.0
> Reporter: Benson Margulies
> Attachments: LANG-591.diff
>
>
> For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (LANG-591) A more complex Levenshtein distance
would be useful
Posted by "Henri Yandell (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LANG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Henri Yandell updated LANG-591:
-------------------------------
Fix Version/s: 3.1
I think #1 is best *nod*.
Putting in 3.1.
Diff algorithm sounds great :)
> A more complex Levenshtein distance would be useful
> ---------------------------------------------------
>
> Key: LANG-591
> URL: https://issues.apache.org/jira/browse/LANG-591
> Project: Commons Lang
> Issue Type: New Feature
> Components: lang.*
> Affects Versions: 3.0
> Reporter: Benson Margulies
> Fix For: 3.1
>
> Attachments: LANG-591.diff
>
>
> For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (LANG-591) A more complex Levenshtein distance
would be useful
Posted by "Henri Yandell (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LANG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830816#action_12830816 ]
Henri Yandell commented on LANG-591:
------------------------------------
Not a fan of the API - i.e. LevenshteinResults should be an immutable return, not a mutable passed in. setXxx methods should be removed.
I'd also be tempted to retcon the old API on top of the new one - slows it down a bit, though a private method with a boolean 'collateResults' parameter could be used to keep things largely optimized.
> A more complex Levenshtein distance would be useful
> ---------------------------------------------------
>
> Key: LANG-591
> URL: https://issues.apache.org/jira/browse/LANG-591
> Project: Commons Lang
> Issue Type: New Feature
> Components: lang.*
> Affects Versions: 3.0
> Reporter: Benson Margulies
> Attachments: LANG-591.diff
>
>
> For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (LANG-591) A more complex Levenshtein distance
would be useful
Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LANG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benson Margulies updated LANG-591:
----------------------------------
Attachment: (was: LANG-591.diff)
> A more complex Levenshtein distance would be useful
> ---------------------------------------------------
>
> Key: LANG-591
> URL: https://issues.apache.org/jira/browse/LANG-591
> Project: Commons Lang
> Issue Type: New Feature
> Components: lang.*
> Affects Versions: 3.0
> Reporter: Benson Margulies
> Fix For: 3.1
>
> Attachments: LANG-591.patch
>
>
> For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (LANG-591) A more complex Levenshtein distance
would be useful
Posted by "Henri Yandell (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LANG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836537#action_12836537 ]
Henri Yandell commented on LANG-591:
------------------------------------
Why is #1 a performance issue?
> A more complex Levenshtein distance would be useful
> ---------------------------------------------------
>
> Key: LANG-591
> URL: https://issues.apache.org/jira/browse/LANG-591
> Project: Commons Lang
> Issue Type: New Feature
> Components: lang.*
> Affects Versions: 3.0
> Reporter: Benson Margulies
> Attachments: LANG-591.diff
>
>
> For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (LANG-591) A more complex Levenshtein distance
would be useful
Posted by "Henri Yandell (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LANG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890367#action_12890367 ]
Henri Yandell commented on LANG-591:
------------------------------------
API looks good. I'd still remove the setXyz rather than just make them package-private. Wait for someone to have a use case to need them.
Should it go inside StringUtils?
Is it the same algorithm as the current StringUtils code? Looks different?
> A more complex Levenshtein distance would be useful
> ---------------------------------------------------
>
> Key: LANG-591
> URL: https://issues.apache.org/jira/browse/LANG-591
> Project: Commons Lang
> Issue Type: New Feature
> Components: lang.*
> Affects Versions: 3.0
> Reporter: Benson Margulies
> Fix For: 3.1
>
> Attachments: LANG-591.diff
>
>
> For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (LANG-591) A more complex Levenshtein distance
would be useful
Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LANG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benson Margulies updated LANG-591:
----------------------------------
Attachment: LANG-591.patch
Flushed setters.
> A more complex Levenshtein distance would be useful
> ---------------------------------------------------
>
> Key: LANG-591
> URL: https://issues.apache.org/jira/browse/LANG-591
> Project: Commons Lang
> Issue Type: New Feature
> Components: lang.*
> Affects Versions: 3.0
> Reporter: Benson Margulies
> Fix For: 3.1
>
> Attachments: LANG-591.patch
>
>
> For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (LANG-591) A more complex Levenshtein distance
would be useful
Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LANG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882851#action_12882851 ]
Benson Margulies commented on LANG-591:
---------------------------------------
Hey, any reaction to revision?
> A more complex Levenshtein distance would be useful
> ---------------------------------------------------
>
> Key: LANG-591
> URL: https://issues.apache.org/jira/browse/LANG-591
> Project: Commons Lang
> Issue Type: New Feature
> Components: lang.*
> Affects Versions: 3.0
> Reporter: Benson Margulies
> Fix For: 3.1
>
> Attachments: LANG-591.diff
>
>
> For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (LANG-591) A more complex Levenshtein distance
would be useful
Posted by "Henri Yandell (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LANG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889588#action_12889588 ]
Henri Yandell commented on LANG-591:
------------------------------------
Sorry for not picking up on the revised patch.
Any reason why the API isn't:
public static LevenshteinResults getLevenshteinDistance(CharSequence s, CharSequence t)?
I would also delete the setXyz methods on the Results object. They're not used and I don't see any value in being able to edit the results object.
Otherwise it looks good to me :)
> A more complex Levenshtein distance would be useful
> ---------------------------------------------------
>
> Key: LANG-591
> URL: https://issues.apache.org/jira/browse/LANG-591
> Project: Commons Lang
> Issue Type: New Feature
> Components: lang.*
> Affects Versions: 3.0
> Reporter: Benson Margulies
> Fix For: 3.1
>
> Attachments: LANG-591.diff
>
>
> For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (LANG-591) A more complex Levenshtein distance
would be useful
Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LANG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830824#action_12830824 ]
Benson Margulies commented on LANG-591:
---------------------------------------
Fair enough. How abouts I refine it outside until you like it, and then we can discuss retcon? One issue is my preference for array+offset+length for speed in ironic counterpoint to the slower algorithm.
> A more complex Levenshtein distance would be useful
> ---------------------------------------------------
>
> Key: LANG-591
> URL: https://issues.apache.org/jira/browse/LANG-591
> Project: Commons Lang
> Issue Type: New Feature
> Components: lang.*
> Affects Versions: 3.0
> Reporter: Benson Margulies
> Attachments: LANG-591.diff
>
>
> For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (LANG-591) A more complex Levenshtein distance
would be useful
Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LANG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benson Margulies updated LANG-591:
----------------------------------
Attachment: (was: LANG-591.diff)
> A more complex Levenshtein distance would be useful
> ---------------------------------------------------
>
> Key: LANG-591
> URL: https://issues.apache.org/jira/browse/LANG-591
> Project: Commons Lang
> Issue Type: New Feature
> Components: lang.*
> Affects Versions: 3.0
> Reporter: Benson Margulies
> Fix For: 3.1
>
> Attachments: LANG-591.diff
>
>
> For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (LANG-591) A more complex Levenshtein distance
would be useful
Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LANG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benson Margulies updated LANG-591:
----------------------------------
Attachment: LANG-591.diff
Updated patch.
> A more complex Levenshtein distance would be useful
> ---------------------------------------------------
>
> Key: LANG-591
> URL: https://issues.apache.org/jira/browse/LANG-591
> Project: Commons Lang
> Issue Type: New Feature
> Components: lang.*
> Affects Versions: 3.0
> Reporter: Benson Margulies
> Fix For: 3.1
>
> Attachments: LANG-591.diff
>
>
> For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (LANG-591) A more complex Levenshtein distance
would be useful
Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LANG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benson Margulies updated LANG-591:
----------------------------------
Attachment: (was: LANG-591.diff)
> A more complex Levenshtein distance would be useful
> ---------------------------------------------------
>
> Key: LANG-591
> URL: https://issues.apache.org/jira/browse/LANG-591
> Project: Commons Lang
> Issue Type: New Feature
> Components: lang.*
> Affects Versions: 3.0
> Reporter: Benson Margulies
> Fix For: 3.1
>
> Attachments: LANG-591.diff
>
>
> For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (LANG-591) A more complex Levenshtein distance
would be useful
Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LANG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benson Margulies updated LANG-591:
----------------------------------
Attachment: LANG-591.diff
Patch.
> A more complex Levenshtein distance would be useful
> ---------------------------------------------------
>
> Key: LANG-591
> URL: https://issues.apache.org/jira/browse/LANG-591
> Project: Commons Lang
> Issue Type: New Feature
> Components: lang.*
> Affects Versions: 3.0
> Reporter: Benson Margulies
> Attachments: LANG-591.diff
>
>
> For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Issue Comment Edited: (LANG-591) A more complex Levenshtein
distance would be useful
Posted by "Henri Yandell (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LANG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838170#action_12838170 ]
Henri Yandell edited comment on LANG-591 at 2/25/10 3:06 AM:
-------------------------------------------------------------
I think #1 is best.
Putting in 3.1.
Diff algorithm sounds great :)
was (Author: bayard):
I think #1 is best *nod*.
Putting in 3.1.
Diff algorithm sounds great :)
> A more complex Levenshtein distance would be useful
> ---------------------------------------------------
>
> Key: LANG-591
> URL: https://issues.apache.org/jira/browse/LANG-591
> Project: Commons Lang
> Issue Type: New Feature
> Components: lang.*
> Affects Versions: 3.0
> Reporter: Benson Margulies
> Fix For: 3.1
>
> Attachments: LANG-591.diff
>
>
> For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (LANG-591) A more complex Levenshtein distance
would be useful
Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LANG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benson Margulies updated LANG-591:
----------------------------------
Attachment: LANG-591.diff
Update to latest comments.
> A more complex Levenshtein distance would be useful
> ---------------------------------------------------
>
> Key: LANG-591
> URL: https://issues.apache.org/jira/browse/LANG-591
> Project: Commons Lang
> Issue Type: New Feature
> Components: lang.*
> Affects Versions: 3.0
> Reporter: Benson Margulies
> Fix For: 3.1
>
> Attachments: LANG-591.diff
>
>
> For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (LANG-591) A more complex Levenshtein distance
would be useful
Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LANG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890447#action_12890447 ]
Benson Margulies commented on LANG-591:
---------------------------------------
Sorry about the setters, I lost focus.
At the end of the day, there's only one LD algorithm, however different the particular version. However, I do have a field in this class to avoid new'ing a new grid each time. Wouldn't that be a bit ugly in StringUtils?
> A more complex Levenshtein distance would be useful
> ---------------------------------------------------
>
> Key: LANG-591
> URL: https://issues.apache.org/jira/browse/LANG-591
> Project: Commons Lang
> Issue Type: New Feature
> Components: lang.*
> Affects Versions: 3.0
> Reporter: Benson Margulies
> Fix For: 3.1
>
> Attachments: LANG-591.diff
>
>
> For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (LANG-591) A more complex Levenshtein distance
would be useful
Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LANG-591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830927#action_12830927 ]
Benson Margulies commented on LANG-591:
---------------------------------------
Hen,
There's a thread-safety issue here. The choices are:
1) an API that takes the strings in and returns a new results object. I object to this on performance grounds.
2) a separate Levenshtein object that takes two strings and then returns all the various results via get accessors (no result object).
3) an API like the one I put in the patch.
I think that #2 is the best choice, but I'd like to be on the same page as you before I diddle further.
--benson
> A more complex Levenshtein distance would be useful
> ---------------------------------------------------
>
> Key: LANG-591
> URL: https://issues.apache.org/jira/browse/LANG-591
> Project: Commons Lang
> Issue Type: New Feature
> Components: lang.*
> Affects Versions: 3.0
> Reporter: Benson Margulies
> Attachments: LANG-591.diff
>
>
> For some applications, it is necessary to get insert/delete/substitution counts from the distance algorithm. I am attaching a patch that provides this.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.