Posted to dev@commons.apache.org by Benedikt Ritter <br...@apache.org> on 2014/11/12 13:34:26 UTC

[text] Incorporating Bruno Kinoshita's work

Hi,

the git repo for [text] is ready and I've done the initial bootstrapping
already. I've also created a new component in the SANDBOX jira project. The
first issue is to extract algorithms from [lang] [1]. I remember people
saying that there is code in codec too. Please feel free to create
tickets for this.

Bruno already has some code that may fit into [text] [2]. I've given it a
brief review and here are a few things which caught my eye:

- Inclusion of Talend code into [text] is not possible (the code is
licensed by www.talend.com)
- spellchecker package: nice idea, which I haven't thought about before.
Furthermore, I could imagine a hyphenation package. Both should be locale
dependent.
- Looking at EditDistance [3] I'm not sure we need T extends Number, if the
only possible values for T are Integer and Double. Maybe we only need an
IntegerEditDistance and a DoubleEditDistance.

Regarding the last point: I'm currently not fond of having a common
interface for EditDistance algorithms. For example Levenshtein has the
optional threshold parameter, which Jaro-Winkler has not (at least judging
from the implementation in [lang]). Fuzzy Distance needs a locale for
uncapitalizing. I think finding an interface that fits them all will be
difficult to accomplish... But we'll see :-)
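
To make the trade-off concrete, here is a minimal sketch of one way such a
shared contract could look. It is an illustration under stated assumptions,
not code from [lang] or from Bruno's repository: the names
ThresholdLevenshtein and LocaleAwareDistance are hypothetical, and returning
-1 when the threshold is exceeded is simply a convention chosen for the
sketch. The idea shown is that algorithm-specific settings (threshold,
locale) become constructor state, so the shared method only takes the two
inputs.

```java
import java.util.Locale;

/** Hypothetical common contract, following the "T extends Number" idea. */
interface EditDistance<T extends Number> {
    T apply(CharSequence left, CharSequence right);
}

/** Illustrative Levenshtein distance; the threshold is constructor state. */
final class ThresholdLevenshtein implements EditDistance<Integer> {
    private final Integer threshold; // null means "no threshold"

    ThresholdLevenshtein(Integer threshold) {
        if (threshold != null && threshold < 0) {
            throw new IllegalArgumentException("threshold must not be negative");
        }
        this.threshold = threshold;
    }

    @Override
    public Integer apply(CharSequence left, CharSequence right) {
        int[] previous = new int[right.length() + 1];
        int[] current = new int[right.length() + 1];
        for (int j = 0; j <= right.length(); j++) {
            previous[j] = j; // distance from the empty prefix of left
        }
        for (int i = 1; i <= left.length(); i++) {
            current[0] = i;
            for (int j = 1; j <= right.length(); j++) {
                int cost = left.charAt(i - 1) == right.charAt(j - 1) ? 0 : 1;
                current[j] = Math.min(
                        Math.min(current[j - 1] + 1, previous[j] + 1),
                        previous[j - 1] + cost);
            }
            int[] swap = previous; // reuse the two rows instead of reallocating
            previous = current;
            current = swap;
        }
        int distance = previous[right.length()];
        // Assumed convention for this sketch: -1 signals "above the threshold".
        return (threshold != null && distance > threshold) ? -1 : distance;
    }
}

/** Illustrative locale-dependent metric: lowercases both inputs first. */
final class LocaleAwareDistance implements EditDistance<Integer> {
    private final Locale locale;
    private final EditDistance<Integer> delegate = new ThresholdLevenshtein(null);

    LocaleAwareDistance(Locale locale) {
        this.locale = locale;
    }

    @Override
    public Integer apply(CharSequence left, CharSequence right) {
        return delegate.apply(left.toString().toLowerCase(locale),
                right.toString().toLowerCase(locale));
    }
}
```

With this shape, new ThresholdLevenshtein(2).apply("kitten", "sitting")
yields -1, because the actual distance is 3. Whether hiding the parameters
in constructors is acceptable for every algorithm is exactly the open
question.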

Regards,
Benedikt

[1] https://issues.apache.org/jira/browse/SANDBOX-483
[2]
https://github.com/kinow/text/tree/master/src/main/java/text/string_metric
[3]
https://github.com/kinow/text/blob/master/src/main/java/text/string_metric/EditDistance.java

-- 
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter

Re: [text] Incorporating Bruno Kinoshita's work

Posted by "Bruno P. Kinoshita" <br...@yahoo.com.br>.

Hi ebourg!

> I think a PROPOSAL.html file defining the scope and goals of the component would be a good idea.

+1

> > - Inclusion of Talend code into [text] is not possible (the code is
> > licensed by www.talend.com)
>
> What is this code about?

Talend Open Studio [1]. More specifically, the part of the solution that provides a mechanism for combining multiple algorithms and uses probabilities, weights and thresholds for comparing attributes.

I started out using Talend Open Studio, but have since switched to Duke [2], which is under the Apache License, has some algorithms already implemented [3], and includes the probabilistic methods as well.
It's probably a better idea to use Duke's code, in case we decide to include the probabilistic method.

[1] https://www.talend.com/products/talend-open-studio
[2] https://github.com/larsga/Duke/
[3] https://github.com/larsga/Duke/tree/master/src/main/java/no/priv/garshol/duke/comparators


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: [text] Incorporating Bruno Kinoshita's work

Posted by Benedikt Ritter <br...@apache.org>.

Hello Emmanuel

2014-11-12 14:34 GMT+01:00 Emmanuel Bourg <eb...@apache.org>:

> Le 12/11/2014 13:34, Benedikt Ritter a écrit :
>
> > the git repo for [text] is ready and I've done the initial bootstrapping
> > already. I've also created a new component in the SANDBOX jira project.
> > The first issue is to extract algorithms from [lang] [1]. I remember people
> > saying that there is code in codec too. Please feel free to create
> > tickets for this.
>
> I think a PROPOSAL.html file defining the scope and goals of the
> component would be a good idea.
>

Yes, I'll work on that as soon as I have the time.


>
> > - Inclusion of Talend code into [text] is not possible (the code is
> > licensed by www.talend.com)
>
> What is this code about?
>

I don't know. It's some code that Bruno used in his project. I don't think
we can use it.


>
> > - spellchecker package: nice idea, which I haven't thought about before.
> > Furthermore, I could imagine a hyphenation package. Both should be locale
> > dependent.
>
> I may be wrong, but I'm under the impression a full spellchecking API is
> probably too big for a small utility component.
>

Yes, you're right. I think it would be best to start with the stuff we
already have. I already have a local branch (yeah, git!) for incorporating
the three algorithms from [lang]. We'll see where we go from there.

Benedikt



Re: [text] Incorporating Bruno Kinoshita's work

Posted by Emmanuel Bourg <eb...@apache.org>.

Le 12/11/2014 13:34, Benedikt Ritter a écrit :

> the git repo for [text] is ready and I've done the initial bootstrapping
> already. I've also created a new component in the SANDBOX jira project. The
> first issue is to extract algorithms from [lang] [1]. I remember people
> saying that there is code in codec too. Please feel free to create
> tickets for this.

I think a PROPOSAL.html file defining the scope and goals of the
component would be a good idea.

> - Inclusion of Talend code into [text] is not possible (the code is
> licensed by www.talend.com)

What is this code about?

> - spellchecker package: nice idea, which I haven't thought about before.
> Furthermore, I could imagine a hyphenation package. Both should be locale
> dependent.

I may be wrong, but I'm under the impression a full spellchecking API is
probably too big for a small utility component.

Emmanuel Bourg



Re: [text] Incorporating Bruno Kinoshita's work

Posted by "J.Pietschmann" <j3...@yahoo.de>.

Hi all,

> Great idea this hyphenation package, +1.

There is a hyphenation package in Apache FOP. Whoever goes to work
on a similar package in [text] should have a look there.

Regards
J.Pietschmann


Re: [text] Incorporating Bruno Kinoshita's work

Posted by "Bruno P. Kinoshita" <br...@yahoo.com.br>.

Hi Benedikt!

Thanks for bootstrapping the project :)

> - spellchecker package: nice idea, which I haven't thought about before.
> Furthermore, I could imagine a hyphenation package. Both should be locale
> dependent.

Great idea this hyphenation package, +1. I already have a use case in a production tool that parses some crazy PDFs and uses OpenNLP. We have cases where we need to check hyphenation before tagging the words, and at the moment what we have is a not-so-elegant solution.

If we manage to create a common API for spellcheckers we could, perhaps, create implementations that call hunspell and jazzy, which have artefacts in Maven Central and are fairly easy to use. Not sure if that fits in [text]; maybe only the common interface does.

> - Looking at EditDistance [3] I'm not sure we need T extends Number, if the
> only possible values for T are Integer and Double. Maybe we only need an
> IntegerEditDistance and a DoubleEditDistance.

Could be. As the code was quickly written as a proof of concept for a customer, there are definitely parts that need further thinking. I'd be fine with either T extends Number or IntegerEditDistance and DoubleEditDistance.

> Regarding the last point: I'm currently not fond of having a common
> interface for EditDistance algorithms. For example Levenshtein has the
> optional threshold parameter, which Jaro-Winkler has not (at least judging
> from the implementation in [lang]). Fuzzy Distance needs a locale for
> uncapitalizing. I think finding an interface that fits them all will be
> difficult to accomplish... But we'll see :-)

I had thought about just a marker interface. That way I could write some code to scan the classpath looking for implementations of this interface and let the user decide which one to use for their data quality job (regardless of the different parameters used in each algorithm).
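
As a rough illustration of the marker-interface idea, the sketch below filters a list of candidate classes down to those that implement the marker. It is only a sketch under assumptions: the classes ExampleMetricA, ExampleMetricB and MetricDiscovery are hypothetical placeholders, and the hard-coded candidate list stands in for whatever a real classpath scan would return (the JDK has no built-in classpath scanner; java.util.ServiceLoader is the closest standard mechanism, but it relies on provider registration files rather than scanning).

```java
import java.util.ArrayList;
import java.util.List;

/** Marker interface: declares no methods, only tags a class as a string metric. */
interface StringMetric {
}

/** Placeholder implementations; real ones would carry their own parameters. */
final class ExampleMetricA implements StringMetric {
}

final class ExampleMetricB implements StringMetric {
}

/** An unrelated class that a scan might also encounter. */
final class NotAMetric {
}

final class MetricDiscovery {
    private MetricDiscovery() {
    }

    /** Keeps only the concrete classes that implement the marker interface. */
    static List<Class<? extends StringMetric>> implementationsOf(List<Class<?>> candidates) {
        List<Class<? extends StringMetric>> found = new ArrayList<>();
        for (Class<?> candidate : candidates) {
            boolean isMetric = StringMetric.class.isAssignableFrom(candidate);
            if (isMetric && !candidate.isInterface()) {
                found.add(candidate.asSubclass(StringMetric.class));
            }
        }
        return found;
    }
}
```

A tool could then present the discovered class names to the user. Since the marker says nothing about how to invoke a metric, the tool would still need some other way (reflection, configuration) to supply each algorithm's parameters.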

I shamelessly stole the name StringMetric from this Wikipedia article [1], but maybe we could find a better name for it?

Thanks again Benedikt!
Bruno

[1] http://en.wikipedia.org/wiki/String_metric
