Posted to dev@opennlp.apache.org by Jörn Kottmann <ko...@gmail.com> on 2011/08/18 12:19:54 UTC

Stemmer

Hi all,

the contribution from Boris contains a Porter stemmer.

Up to now we do not have support for stemming in OpenNLP.
Should we add a component to OpenNLP which is dedicated to stemming?

I believe that could be useful for many users, and it could also serve as part
of our feature generation. To start with we might only have the Porter stemmer,
but that could easily be extended to more languages over time.
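
As a rough sketch of how that could fit into our feature generation (the
Stemmer interface and StemFeatureGenerator below are hypothetical stand-ins
for whatever API the contributed stemmer ends up with; only
AdaptiveFeatureGenerator is an existing OpenNLP extension point):

import java.util.List;

import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator;

// Hypothetical sketch: emits the stem of the current token as a feature.
public class StemFeatureGenerator implements AdaptiveFeatureGenerator {

    /** Stand-in for the contributed stemmer's API. */
    public interface Stemmer {
        String stem(String word);
    }

    private final Stemmer stemmer;

    public StemFeatureGenerator(Stemmer stemmer) {
        this.stemmer = stemmer;
    }

    public void createFeatures(List<String> features, String[] tokens,
            int index, String[] previousOutcomes) {
        // Add the stem of the current token as an additional feature.
        features.add("stem=" + stemmer.stem(tokens[index]));
    }

    public void updateAdaptiveData(String[] tokens, String[] outcomes) {
        // Stemming is stateless, so there is no adaptive data to update.
    }

    public void clearAdaptiveData() {
        // Nothing to clear.
    }
}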

Any opinions?

Jörn

Re: Stemmer

Posted by Jörn Kottmann <ko...@gmail.com>.
On 8/18/11 5:49 PM, Jason Baldridge wrote:
> Adding stemmers would be nice, and it could be a fairly easy path to
> bringing in new developers since it is pretty much independent of other
> components and easy to test.
>
> However, I would also note that it would be great to get real morphological
> analysis in there. There has been a lot of recent interest in the NLP
> research community in learning morphological analyzers, and perhaps that
> work can eventually make its way into OpenNLP.

+1 for both. To me it looks like proper lemmatization could also
be useful for Boris' contribution.

Jörn



Re: Stemmer

Posted by Jason Baldridge <ja...@gmail.com>.
Adding stemmers would be nice, and it could be a fairly easy path to
bringing in new developers since it is pretty much independent of other
components and easy to test.

However, I would also note that it would be great to get real morphological
analysis in there. There has been a lot of recent interest in the NLP research
community in learning morphological analyzers, and perhaps that work can
eventually make its way into OpenNLP.

Jason

On Thu, Aug 18, 2011 at 5:52 AM, Jörn Kottmann <ko...@gmail.com> wrote:

> On 8/18/11 12:38 PM, Olivier Grisel wrote:
>
>> True, but working on a generic API adapter would make it possible to
>> benefit from the huge set of existing tokenizers / analyzers from the
>> Lucene community. I am aware, though, that most of the time Lucene
>> analyzers drop the punctuation information, which is mostly useless for
>> information retrieval but often critical for NLP.
>>
>
> As far as I know, Lucene redistributes the Snowball stemmers;
> that could also be an option for us, since then we would directly
> have stemmers for all the languages we currently support.
>
> I do not really see a benefit in adapting Lucene analyzers:
> if someone wants to use a Lucene tokenizer instead of an OpenNLP
> one, they can simply do that and then provide the
> tokenized text to OpenNLP. That is already supported.
>
> Jörn
>



-- 
Jason Baldridge
Assistant Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge

Re: Stemmer

Posted by Jörn Kottmann <ko...@gmail.com>.
On 8/18/11 12:38 PM, Olivier Grisel wrote:
> True, but working on a generic API adapter would make it possible to
> benefit from the huge set of existing tokenizers / analyzers from the
> Lucene community. I am aware, though, that most of the time Lucene
> analyzers drop the punctuation information, which is mostly useless for
> information retrieval but often critical for NLP.

As far as I know, Lucene redistributes the Snowball stemmers;
that could also be an option for us, since then we would directly
have stemmers for all the languages we currently support.
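
As a rough sketch of what that would give us (assuming the Lucene jar that
contains the org.tartarus.snowball classes is on the classpath), the Snowball
stemmers can be called directly, outside of any Lucene analysis chain:

import org.tartarus.snowball.ext.EnglishStemmer;

// Sketch: use a Snowball stemmer as distributed with Lucene directly,
// independent of the Lucene analysis chain.
public class SnowballStemmingExample {

    public static void main(String[] args) {
        EnglishStemmer stemmer = new EnglishStemmer();
        for (String word : new String[] { "running", "cities", "easily" }) {
            stemmer.setCurrent(word);
            stemmer.stem();
            System.out.println(word + " -> " + stemmer.getCurrent());
        }
    }
}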

I do not really see a benefit in adapting Lucene analyzers:
if someone wants to use a Lucene tokenizer instead of an OpenNLP
one, they can simply do that and then provide the
tokenized text to OpenNLP. That is already supported.
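
For example (a sketch only; "en-pos-maxent.bin" is just a placeholder for a
locally available POS model), tokens produced by any external tokenizer, a
Lucene one included, can be handed straight to a component that works on a
String[] of tokens:

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;

// Sketch: feed externally produced tokens to an OpenNLP POS tagger.
public class ExternalTokensExample {

    public static void main(String[] args) throws Exception {
        String[] tokens = { "OpenNLP", "accepts", "pre-tokenized", "sentences", "." };

        try (InputStream modelIn = new FileInputStream("en-pos-maxent.bin")) {
            POSTaggerME tagger = new POSTaggerME(new POSModel(modelIn));
            String[] tags = tagger.tag(tokens);

            for (int i = 0; i < tokens.length; i++) {
                System.out.println(tokens[i] + "/" + tags[i]);
            }
        }
    }
}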

Jörn

Re: Stemmer

Posted by Olivier Grisel <ol...@ensta.org>.
2011/8/18 Jörn Kottmann <ko...@gmail.com>:
> On 8/18/11 12:24 PM, Olivier Grisel wrote:
>>
>> Is this better, or does it cover more languages, than what's already
>> provided by Apache Lucene? Maybe it would be better to contribute it to
>> the Lucene project and make it easy to use the generic, battle-tested
>> Lucene analyzer / tokenizer infrastructure to generate features in OpenNLP.
>
> The OpenNLP APIs are not designed to work on token streams; instead
> a user usually has to provide an entire sentence at once, so that is not
> a nice fit.

One could treat each sentence as an individual token stream to make a
generic Lucene adapter.
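
Roughly like this (a sketch only, assuming a recent Lucene version;
EnglishAnalyzer is just an example and the helper class is hypothetical): run
a full Lucene analyzer, tokenizer plus filters such as stemming, over one
sentence at a time and collect the resulting terms, e.g. to use them as
features in OpenNLP.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical adapter sketch: one Lucene TokenStream per sentence.
public class SentenceAnalyzerAdapter {

    public static List<String> analyzeSentence(Analyzer analyzer, String sentence)
            throws IOException {
        List<String> terms = new ArrayList<String>();
        TokenStream stream = analyzer.tokenStream("sentence", sentence);
        try {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                terms.add(term.toString());
            }
            stream.end();
        } finally {
            stream.close();
        }
        return terms;
    }

    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new EnglishAnalyzer();
        // Note: the analyzer drops punctuation, as mentioned above.
        System.out.println(analyzeSentence(analyzer, "The runners were running quickly."));
        analyzer.close();
    }
}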

> And since we are an NLP library, I believe it is absolutely fine to
> implement our own stemming here.

True, but working on a generic API adapter would make it possible to
benefit from the huge set of existing tokenizers / analyzers from the
Lucene community. I am aware, though, that most of the time Lucene
analyzers drop the punctuation information, which is mostly useless for
information retrieval but often critical for NLP.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: Stemmer

Posted by Jörn Kottmann <ko...@gmail.com>.
On 8/18/11 12:24 PM, Olivier Grisel wrote:
> Is this better, or does it cover more languages, than what's already
> provided by Apache Lucene? Maybe it would be better to contribute it to
> the Lucene project and make it easy to use the generic, battle-tested
> Lucene analyzer / tokenizer infrastructure to generate features in OpenNLP.

The OpenNLP APIs are not designed to work on token streams; instead
a user usually has to provide an entire sentence at once, so that is not
a nice fit.

And since we are an NLP library, I believe it is absolutely fine to
implement our own stemming here.

Jörn

Re: Stemmer

Posted by Olivier Grisel <ol...@ensta.org>.
2011/8/18 Jörn Kottmann <ko...@gmail.com>:
> Hi all,
>
> the contribution from Boris contains a Porter stemmer.
>
> Up to now we do not have support for stemming in OpenNLP.
> Should we add a component to OpenNLP which is dedicated to stemming?
>
> I believe that could be useful for many users, and it could also serve as part
> of our feature generation. To start with we might only have the Porter stemmer,
> but that could easily be extended to more languages over time.

Is this better, or does it cover more languages, than what's already
provided by Apache Lucene? Maybe it would be better to contribute it to
the Lucene project and make it easy to use the generic, battle-tested
Lucene analyzer / tokenizer infrastructure to generate features in OpenNLP.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel