Posted to users@opennlp.apache.org by Martin Wunderlich <ma...@gmx.net> on 2014/02/23 14:35:08 UTC

Stemming, Stoplists and Language Models?

Hi all, 

I recently started working with OpenNLP for a project in the area of text classification with neural networks. So far, OpenNLP is a great library and very useful. 
There are just three things that I haven't been able to find, but maybe they do exist: 
- language models: e.g. to create a bigram language model with relative and absolute frequencies from several texts 
- stemming: to reduce different word forms in inflected languages to a canonical root form
- stoplist: to remove certain words (e.g. from the language model) that are deemed irrelevant

Do these functions exist in OpenNLP? If not, can you recommend another library to complement these functions? 
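
To make the first item concrete, here is a minimal sketch of the kind of bigram model meant above, in plain Python rather than OpenNLP. The whitespace tokenization and the function name bigram_model are assumptions for illustration only; "relative frequency" is taken here as the share of a bigram among all bigrams counted.

```python
from collections import Counter

def bigram_model(texts):
    """Return (absolute, relative) bigram frequencies over several texts."""
    absolute = Counter()
    for text in texts:
        tokens = text.lower().split()  # naive whitespace tokenization (assumption)
        # Pair each token with its successor; bigrams never span two texts.
        absolute.update(zip(tokens, tokens[1:]))
    total = sum(absolute.values())
    # Relative frequency = count of this bigram / count of all bigrams.
    relative = {bg: n / total for bg, n in absolute.items()} if total else {}
    return absolute, relative

absolute, relative = bigram_model(["the cat sat", "the cat ran"])
print(absolute[("the", "cat")], relative[("the", "cat")])
```

A conditional estimate such as P(cat | the) would divide by the count of the first word instead of the overall total.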

Kind regards, 

Martin

Re: Stemming, Stoplists and Language Models?

Posted by Alexandre Patry <al...@nlpfu.com>.

On 14-02-23 08:35 AM, Martin Wunderlich wrote:
> Hi all,
>
> I recently started working with OpenNLP for a project in the area of 
> text classification with neural networks. So far, OpenNLP is a great 
> library and very useful.
> There are just three things that I haven't been able to find, but 
> maybe they do exist:
> - language models: e.g. to create a bigram language model with 
> relative and absolute frequencies from several texts
> - stemming: to reduce different word forms in inflected languages to a 
> canonical root form
> - stoplist: to remove certain words (e.g. from the language model) 
> that are deemed irrelevant
>
> Do these functions exist in OpenNLP? If not, can you recommend another 
> library to complement these functions?
Lucene's analyzers-common module [1] has stemming algorithms and stoplists for 
many languages (for examples, see [2] and [3]). It might be a good 
starting point.
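
For readers unfamiliar with what such an analyzer does, the sketch below mimics the general shape of the pipeline (tokenize, drop stop words, stem) in plain Python. It does not use Lucene's API: the stoplist is a tiny hypothetical one, and crude_stem is a toy suffix stripper, not the Porter or Snowball algorithm.

```python
STOPLIST = {"the", "a", "an", "of", "and", "is", "are"}  # hypothetical, tiny stoplist

def crude_stem(token):
    """Strip a few common English suffixes. A toy stand-in, NOT Porter."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    tokens = text.lower().split()                     # naive tokenization
    kept = [t for t in tokens if t not in STOPLIST]   # stop word removal
    return [crude_stem(t) for t in kept]              # stemming

print(analyze("The dogs are chasing a ball"))  # ['dog', 'chas', 'ball']
```

The stop filter runs before the stemmer here so that the stoplist can be written in unstemmed word forms.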

Hope this helps,

Alexandre

[1] http://lucene.apache.org/core/4_6_1/analyzers-common/index.html
[2] http://lucene.apache.org/core/4_6_1/analyzers-common/org/apache/lucene/analysis/en/EnglishAnalyzer.html
[3] http://lucene.apache.org/core/4_6_1/analyzers-common/org/apache/lucene/analysis/fr/FrenchAnalyzer.html

Re: Stemming, Stoplists and Language Models?

Posted by Martin Wunderlich <ma...@gmx.net>.

Hi all, 

Thanks a lot for all the replies. I need to look into what Lucene provides and see how far I get. 
@Jörn, I will make sure to log the Jira tickets and think about making a contribution. I am not sure whether my programming skills are sufficient, and I'd need to look into the source code first, but I'll definitely check it out when / if time allows.

Cheers, 

Martin
 


Re: Stemming, Stoplists and Language Models?

Posted by Martin Wunderlich <ma...@gmx.net>.

Hello, 

Sorry for reviving this thread again, but I have come across another question 
related to it. 

When working with stemming and stop word lists to pre-process the text data, wouldn't this mean that there are as many language models as there are parameter combinations? 
For instance, if I have boolean pre-processing parameters in my application - useStemming yes/no and useStopList yes/no - do I end up with 2^2 = 4 language models? Perhaps a naive question, but it seems that such pre-processing parameters inflate the LM data that I need to manage quite a bit.
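
As a rough illustration of the question, the sketch below builds one bigram count table per combination of the two flags. Everything in it is an assumption made for the example: the stoplist, the toy crude_stem function, and the use of plain bigram counts as the "language model".

```python
from collections import Counter
from itertools import product

STOPLIST = {"the", "a", "of"}  # hypothetical stoplist

def crude_stem(token):
    # Toy stemmer (assumption): drops a trailing "s"; stands in for a real one.
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def preprocess(tokens, use_stemming, use_stoplist):
    if use_stoplist:  # filter first, so the stoplist matches unstemmed forms
        tokens = [t for t in tokens if t not in STOPLIST]
    if use_stemming:
        tokens = [crude_stem(t) for t in tokens]
    return tokens

def build_models(text):
    """Map each (use_stemming, use_stoplist) pair to its bigram counts."""
    tokens = text.lower().split()
    models = {}
    for use_stemming, use_stoplist in product((False, True), repeat=2):
        processed = preprocess(tokens, use_stemming, use_stoplist)
        models[(use_stemming, use_stoplist)] = Counter(zip(processed, processed[1:]))
    return models

models = build_models("the cats chase the dogs of the town")
print(len(models))  # one model per flag combination
```

With k independent boolean options this gives up to 2^k models. Since each one is derived from the same raw text, it may be enough to store only the raw corpus plus the configurations actually in use and rebuild a model on demand.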

Cheers, 

Martin
 


Re: Stemming, Stoplists and Language Models?

Posted by Jörn Kottmann <ko...@gmail.com>.

Thanks, sorry for the delay. I will add it today to the sandbox.

Jörn

On 03/04/2014 08:24 PM, Tommaso Teofili wrote:
> I've attached the adapted source code to be donated into the Jira issue [1]
>
> Regards,
> Tommaso
>
> [1] : https://issues.apache.org/jira/browse/OPENNLP-657
>

Re: Stemming, Stoplists and Language Models?

Posted by Tommaso Teofili <to...@gmail.com>.

I've attached the adapted source code to be donated into the Jira issue [1]

Regards,
Tommaso

[1] : https://issues.apache.org/jira/browse/OPENNLP-657



Re: Stemming, Stoplists and Language Models?

Posted by Tommaso Teofili <to...@gmail.com>.

2014-02-27 22:55 GMT+01:00 Tommaso Teofili <to...@gmail.com>:

>
>
>
> 2014-02-27 12:16 GMT+01:00 Jörn Kottmann <ko...@gmail.com>:
>
> On 02/23/2014 06:35 PM, Tommaso Teofili wrote:
>>
>>> I have implemented a very simple set of nlp tools at [1], with
>>> implementations for ngrams [2] and language modeling [3] tasks too.
>>> I'd be happy to donate it to Apache OpenNLP if the community is
>>> interested.
>>>
>>
>> Yes, that sounds very interesting. We already have ngram support, maybe
>> we can merge your implementation
>> with the current one in case there are any missing features.
>>
>
> sure
>
>
>>
>> It would be nice if you could create an issue to contribute the code.
>>
>
> yes, I'll do that
>

done, here it is: https://issues.apache.org/jira/browse/OPENNLP-657

Regards,
Tommaso



Re: Stemming, Stoplists and Language Models?

Posted by Tommaso Teofili <to...@gmail.com>.

2014-02-27 12:16 GMT+01:00 Jörn Kottmann <ko...@gmail.com>:

> On 02/23/2014 06:35 PM, Tommaso Teofili wrote:
>
>> I have implemented a very simple set of nlp tools at [1], with
>> implementations for ngrams [2] and language modeling [3] tasks too.
>> I'd be happy to donate it to Apache OpenNLP if the community is
>> interested.
>>
>
> Yes, that sounds very interesting. We already have ngram support, maybe we
> can merge your implementation
> with the current one in case there are any missing features.
>

sure

>
> It would be nice if you could create an issue to contribute the code.
>

yes, I'll do that

>
> Do you think we should directly include in opennlp-tools or first ship it
> as an addon or make it part of the sandbox?

maybe I'd put it in the sandbox to start, where to move things after that
would also depend a bit on where the different features best fit: ngram /
language modeling would fit well in opennlp-tools and maybe CFGs too, maybe
gradient descent / regression in opennlp-ml, not sure about naive bayes and
anomaly detection but I guess we can decide that also later on.

Thanks,
Tommaso

>
>
> Jörn
>

Re: Stemming, Stoplists and Language Models?

Posted by Jörn Kottmann <ko...@gmail.com>.

On 02/23/2014 06:35 PM, Tommaso Teofili wrote:
> I have implemented a very simple set of nlp tools at [1], with
> implementations for ngrams [2] and language modeling [3] tasks too.
> I'd be happy to donate it to Apache OpenNLP if the community is interested.

Yes, that sounds very interesting. We already have ngram support, maybe 
we can merge your implementation
with the current one in case there are any missing features.

It would be nice if you could create an issue to contribute the code.

Do you think we should directly include in opennlp-tools or first ship 
it as an addon or make it part of the sandbox?

Jörn

Re: Stemming, Stoplists and Language Models?

Posted by Tommaso Teofili <to...@gmail.com>.

2014-02-23 15:24 GMT+01:00 Jörn Kottmann <ko...@gmail.com>:

> Hello,
>
> the current trunk version includes the Porter and Snowball stemmers. We
> didn't develop them ourselves
> but redistribute them as part of OpenNLP.
> It would be nice to add more stemmers, in case you need a certain one it
> would be nice if you could
> point it out, and we might be able to redistribute it as well. Or maybe
> just implement it.
>
> We don't have stoplists, but I think it will be easy to change that. We
> could probably use the ones from snowball.
>
> There is no language modeling, it would be nice to get a contribution
> there.


I have implemented a very simple set of nlp tools at [1], with
implementations for ngrams [2] and language modeling [3] tasks too.
I'd be happy to donate it to Apache OpenNLP if the community is interested.


> Maybe you are interested in implementing it?
>
> Anyway, it would be nice if you could open two Jira issues to request
> stopword lists and the language model.


Regards,
Tommaso

[1] : https://github.com/tteofili/nlp-utils
[2] : https://github.com/tteofili/nlp-utils/blob/master/src/main/java/com/github/tteofili/nlputils/ngram/NGramUtils.java
[3] : https://github.com/tteofili/nlp-utils/tree/master/src/main/java/com/github/tteofili/nlputils/languagemodel



Re: Stemming, Stoplists and Language Models?

Posted by Martin Wunderlich <ma...@gmx.net>.

Hi Jörn, 

here are the two Jira tickets, as promised (one for stop lists and one for language models): 

https://issues.apache.org/jira/browse/OPENNLP-659 (for this one, I wasn't sure which component it should be assigned to)
https://issues.apache.org/jira/browse/OPENNLP-660

HTH.

Cheers, 

Martin



Re: Stemming, Stoplists and Language Models?

Posted by Jörn Kottmann <ko...@gmail.com>.

Hello,

The current trunk version includes the Porter and Snowball stemmers. We 
didn't develop them ourselves but redistribute them as part of OpenNLP.
It would be nice to add more stemmers. If you need a particular one, 
please point it out, and we might be able to redistribute it as well, 
or maybe just implement it.

We don't have stoplists, but I think it will be easy to change that. We 
could probably use the ones from Snowball.

There is no language modeling; it would be nice to get a contribution 
there. Maybe you are interested in implementing it?

Anyway, it would be nice if you could open two Jira issues to request 
stopword lists and the language model.

Jörn
