Posted to dev@commons.apache.org by Rob Tompkins <ch...@gmail.com> on 2016/12/19 02:47:12 UTC

[text][TEXT-32] Regarding more edit distances.

Hello,

Since we want more “edit distances”/“similarity scores” in the codebase for the potential 1.0 release of TEXT, I’ve opened an associated Jira issue (TEXT-32). Does anyone have input or further ideas?

The first idea that I stumbled upon was an edit distance based upon the longest common substring. It feels a tad coarse, but that doesn’t necessarily mean that it’s not worth including.
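As a rough illustration of that idea, here is a minimal sketch. The thread does not define the distance, so the formula used here (characters of both inputs that fall outside the longest shared contiguous block) is an assumption, and the class and method names are hypothetical rather than anything in the TEXT codebase.

```java
// Hypothetical sketch; not part of Commons Text.
final class LongestCommonSubstringDistance {

    private LongestCommonSubstringDistance() {
    }

    /** Length of the longest contiguous run of characters shared by both inputs. */
    static int longestCommonSubstringLength(CharSequence left, CharSequence right) {
        int best = 0;
        // prev[j] holds the length of the common suffix of the prefixes
        // left[0, i-1) and right[0, j) from the previous row.
        int[] prev = new int[right.length() + 1];
        for (int i = 1; i <= left.length(); i++) {
            int[] curr = new int[right.length() + 1];
            for (int j = 1; j <= right.length(); j++) {
                if (left.charAt(i - 1) == right.charAt(j - 1)) {
                    curr[j] = prev[j - 1] + 1;
                    best = Math.max(best, curr[j]);
                }
                // On a mismatch curr[j] stays 0: the common suffix is broken.
            }
            prev = curr;
        }
        return best;
    }

    /**
     * Assumed definition: number of characters in both inputs that lie
     * outside the longest common substring.
     */
    static int distance(CharSequence left, CharSequence right) {
        return left.length() + right.length()
                - 2 * longestCommonSubstringLength(left, right);
    }
}
```

For "abcdef" and "zcdemf" the longest shared block is "cde", so this definition gives 6 + 6 - 2 * 3 = 6. Only a single contiguous block counts, so the trailing "f" that both strings share contributes nothing, which fits the "coarse" impression.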

Other thoughts and ideas?

Cheers,
-Rob
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [text][TEXT-32] Regarding more edit distances.

Posted by "Bruno P. Kinoshita" <br...@yahoo.com.br.INVALID>.
Hi Rob,

LCS can still be useful for bioinformatics/genetics, so I'd say it's worth including. In Java, if I ever needed it, I would probably look for it in BioJava (I just did, and couldn't easily find it there).
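Note that "LCS" is also commonly read as longest common subsequence, where the matched characters need not be contiguous. The thread does not say which reading is meant, so the following is only a sketch under that alternative reading. The class and method names are hypothetical, and the distance is the usual insert/delete-only count, |left| + |right| - 2 * LCS.

```java
// Hypothetical sketch; not part of Commons Text.
final class LongestCommonSubsequenceDistance {

    private LongestCommonSubsequenceDistance() {
    }

    /** Length of the longest (not necessarily contiguous) common subsequence. */
    static int lcsLength(CharSequence left, CharSequence right) {
        // table[i][j] = LCS length of the first i chars of left
        // and the first j chars of right.
        int[][] table = new int[left.length() + 1][right.length() + 1];
        for (int i = 1; i <= left.length(); i++) {
            for (int j = 1; j <= right.length(); j++) {
                if (left.charAt(i - 1) == right.charAt(j - 1)) {
                    table[i][j] = table[i - 1][j - 1] + 1;
                } else {
                    table[i][j] = Math.max(table[i - 1][j], table[i][j - 1]);
                }
            }
        }
        return table[left.length()][right.length()];
    }

    /** Minimum number of single-character insertions and deletions. */
    static int distance(CharSequence left, CharSequence right) {
        return left.length() + right.length() - 2 * lcsLength(left, right);
    }
}
```

For "kitten" and "sitting" the longest common subsequence is "ittn" (length 4), so the distance is 6 + 7 - 2 * 4 = 5.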


As for the other string distances, I always look at this GitHub project:

https://github.com/tdebatty/java-string-similarity

And also Talend (I think Data Quality has some string distances). However, I think having the API design and some string distances implemented could be enough for a 1.0. Then we can add more and release more versions.


Cheers
Bruno


