Posted to dev@commons.apache.org by Rob Tompkins <ch...@gmail.com> on 2016/12/19 02:47:12 UTC
[text][TEXT-32] Regarding more edit distances.
Hello,
Since we want more "edit distances"/"similarity scores" in the codebase for the potential 1.0 release of TEXT, I’ve opened an associated Jira (TEXT-32). Does anyone have input or further ideas?
The first idea that I stumbled upon was an edit distance based upon the longest common substring. It feels a tad coarse, but that doesn’t necessarily mean that it’s not worth including.
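To make the idea concrete, here is a minimal sketch of one way such a distance could be defined. It is not taken from TEXT-32 or from the codebase: the class name, the method names, and the formula (combined length of both inputs minus twice the length of the longest common substring) are all assumptions for illustration.

```java
// Hypothetical sketch; none of these names exist in Commons Text.
final class LongestCommonSubstringDistanceSketch {

    private LongestCommonSubstringDistanceSketch() {
    }

    // Length of the longest contiguous run of characters shared by both inputs.
    // Classic dynamic programming: curr[j] holds the length of the common
    // suffix of left[0..i) and right[0..j).
    static int longestCommonSubstringLength(CharSequence left, CharSequence right) {
        int[] prev = new int[right.length() + 1];
        int best = 0;
        for (int i = 1; i <= left.length(); i++) {
            int[] curr = new int[right.length() + 1];
            for (int j = 1; j <= right.length(); j++) {
                if (left.charAt(i - 1) == right.charAt(j - 1)) {
                    curr[j] = prev[j - 1] + 1;
                    best = Math.max(best, curr[j]);
                }
            }
            prev = curr;
        }
        return best;
    }

    // Assumed definition: the number of characters in both inputs that fall
    // outside the shared substring. Identical inputs yield 0.
    static int distance(CharSequence left, CharSequence right) {
        if (left == null || right == null) {
            throw new IllegalArgumentException("Inputs must not be null");
        }
        return left.length() + right.length()
                - 2 * longestCommonSubstringLength(left, right);
    }
}
```

With this definition, "abcdef" and "zbcdf" share the substring "bcd", so the distance would be 6 + 5 - 2 * 3 = 5. Only a single shared run counts, so the shared "f" does not lower the result, which is probably why it feels coarse.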
Other thoughts and ideas?
Cheers,
-Rob
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org
Re: [text][TEXT-32] Regarding more edit distances.
Posted by "Bruno P. Kinoshita" <br...@yahoo.com.br.INVALID>.
Hi Rob,
LCS can still be useful for bioinformatics/genetics, so I'd say it's worth including. In Java, if I ever needed it, I would probably look for it in BioJava (I just did, and couldn't easily find it there).
As for the other string distances, I always look at this GitHub project:
https://github.com/tdebatty/java-string-similarity
There is also Talend (I think Data Quality has some string distances). However, I think having the API design and some string distances implemented could be enough for a 1.0. Then we can add more and release more versions.
Cheers
Bruno