Posted to users@opennlp.apache.org by Ling <ma...@gmail.com> on 2017/07/03 16:54:32 UTC

Stemming in openNLP

Hi, I noticed that some words are stemmed like the following:

iphone ->  iphon
tmobile -> T-mobil

Is there a parameter to control this behavior? In such cases the stems are
actually harmful, because they turn the words into unknown words in the
text. Since such words are quite common, I am curious whether there is a way
to change the default behavior.

Thanks.
Ling

Re: Stemming in openNLP

Posted by Joern Kottmann <ko...@gmail.com>.

OpenNLP also includes the Snowball stemmers:
http://snowball.tartarus.org/

Jörn

On Fri, Jul 7, 2017 at 9:10 AM, Rodrigo Agerri <ro...@ehu.eus> wrote:
> Hello,
>
> The stemmer algorithm implemented in OpenNLP is this one:
>
> https://tartarus.org/martin/PorterStemmer/
>
> Regarding the "null" lemma, are you using OpenNLP to lemmatize?
>
> Rodrigo

Re: Stemming in openNLP

Posted by Rodrigo Agerri <ro...@ehu.eus>.

Hello,

The stemmer algorithm implemented in OpenNLP is this one:

https://tartarus.org/martin/PorterStemmer/
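
One possible way to keep names such as "iphone" intact, sketched here under
assumptions: wrap whatever stemmer you already call in a small class that
skips a list of protected words. ProtectedStemmer and its delegate are
hypothetical names used for illustration and are not part of OpenNLP. The
stand-in delegate in main only strips a final "e" so that the example runs
without any library; in real use it would forward to your actual stemmer.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical wrapper (not an OpenNLP class): returns protected words
// unchanged and hands everything else to the wrapped stemmer.
final class ProtectedStemmer {
    private final Set<String> protectedWords = new HashSet<>();
    private final UnaryOperator<String> delegate;

    ProtectedStemmer(Set<String> words, UnaryOperator<String> delegate) {
        // Store the protected words lowercased so matching ignores case.
        for (String w : words) {
            protectedWords.add(w.toLowerCase(Locale.ROOT));
        }
        this.delegate = delegate;
    }

    String stem(String token) {
        if (protectedWords.contains(token.toLowerCase(Locale.ROOT))) {
            return token; // leave brand names and similar words as they are
        }
        return delegate.apply(token);
    }

    public static void main(String[] args) {
        // Stand-in for a real stemmer (assumption): drop a trailing "e".
        UnaryOperator<String> toyStemmer =
            w -> w.endsWith("e") ? w.substring(0, w.length() - 1) : w;
        ProtectedStemmer stemmer =
            new ProtectedStemmer(Set.of("iphone", "tmobile"), toyStemmer);
        for (String w : new String[] {"iphone", "tmobile", "phone"}) {
            System.out.println(w + " -> " + stemmer.stem(w));
        }
    }
}
```

The protected list has to be maintained by hand, which is the price of
keeping the stemmer itself untouched.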

Regarding the "null" lemma, are you using OpenNLP to lemmatize?

Rodrigo

On Fri, Jul 7, 2017 at 5:47 AM, Ling <li...@gmail.com> wrote:
> I use it indirectly through another library, there is a function
> token.getLemma().

Re: Stemming in openNLP

Posted by Ling <li...@gmail.com>.

This is the getter from the "org.cleartk.token.type" package (the generated
code shown here is the one for the stem feature):

  //*--------------*
  //* Feature: stem

  /** getter for stem - gets
   * @generated
   * @return value of the feature
   */
  public String getStem() {
    if (Token_Type.featOkTst && ((Token_Type)jcasType).casFeat_stem == null)
      jcasType.jcas.throwFeatMissing("stem", "org.cleartk.token.type.Token");
    return jcasType.ll_cas.ll_getStringValue(addr, ((Token_Type)jcasType).casFeatCode_stem);
  }

Anyway, I will use OpenNLP's new release directly, without involving other
libraries. For OpenNLP, a lemmatization or stemming approach similar to
WordNet's seems to work better than the Porter stemmer.
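
For illustration, a WordNet-like approach can be sketched roughly as
follows: strip a suffix by rule, but accept the result only if it is a known
word, otherwise return the input unchanged. The noun suffix rules are
modelled on the detachment rules in the WordNet morphy documentation; the
class name MorphyStyleLemmatizer and the tiny lexicon are made up for this
example, and nothing here is OpenNLP or WordNet code.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of rule-plus-lexicon lemmatization for nouns.
final class MorphyStyleLemmatizer {
    // {suffix to remove, replacement}; longer suffixes are tried first.
    private static final String[][] NOUN_RULES = {
        {"ses", "s"}, {"xes", "x"}, {"zes", "z"}, {"ches", "ch"},
        {"shes", "sh"}, {"men", "man"}, {"ies", "y"}, {"s", ""}
    };

    private final Set<String> lexicon;

    MorphyStyleLemmatizer(Set<String> lexicon) {
        this.lexicon = lexicon;
    }

    String lemmatize(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        if (lexicon.contains(lower)) {
            return lower; // already a known base form
        }
        for (String[] rule : NOUN_RULES) {
            if (lower.endsWith(rule[0])) {
                String candidate =
                    lower.substring(0, lower.length() - rule[0].length()) + rule[1];
                if (lexicon.contains(candidate)) {
                    return candidate; // accept only real words
                }
            }
        }
        return word; // unknown word: leave it alone instead of truncating it
    }

    public static void main(String[] args) {
        MorphyStyleLemmatizer lemmatizer =
            new MorphyStyleLemmatizer(Set.of("phone", "box"));
        for (String w : new String[] {"phones", "boxes", "iphone", "tmobile"}) {
            System.out.println(w + " -> " + lemmatizer.lemmatize(w));
        }
    }
}
```

Because every candidate is checked against the lexicon, words such as
"iphone" that no rule maps to a known word come back unchanged.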

On Fri, Jul 7, 2017 at 9:24 AM, John Stewart <ca...@gmail.com> wrote:

> Which library?  They may be providing a trained model or a dictionary,
> distinct from the data files released by the OpenNLP project.
>
> jds

Re: Stemming in openNLP

Posted by John Stewart <ca...@gmail.com>.

Which library?  They may be providing a trained model or a dictionary,
distinct from the data files released by the OpenNLP project.

jds

On Thu, Jul 6, 2017 at 11:47 PM, Ling <li...@gmail.com> wrote:

> I use it indirectly through another library, there is a function
> token.getLemma().

Re: Stemming in openNLP

Posted by Ling <li...@gmail.com>.

I use it indirectly through another library, which provides a function
token.getLemma().

On Jul 6, 2017 7:24 PM, "John Stewart" <ca...@gmail.com> wrote:

> I'm asking because I thought there are no pre-trained models for the
> lemmatizer. How are you using it exactly?  There's also an option to use a
> dictionary, e.g.
> https://stackoverflow.com/questions/38982423/opennlp-lemmatization-example
>
> AFAIK the models in 1.8.1 are the same as 1.5.3
>
> jds

Re: Stemming in openNLP

Posted by John Stewart <ca...@gmail.com>.

I'm asking because I thought there are no pre-trained models for the
lemmatizer. How are you using it exactly?  There's also an option to use a
dictionary, e.g.
https://stackoverflow.com/questions/38982423/opennlp-lemmatization-example
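
To show what a dictionary-based lookup amounts to, here is a minimal sketch
under assumptions. It maps a (word, POS tag) pair to a lemma and, as a
design choice of this example, falls back to the word itself when the pair
is not in the dictionary, so an out-of-dictionary token does not end up with
a null lemma. DictionaryLemmaLookup is a hypothetical class, not the OpenNLP
API, and the entries are invented.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical dictionary lookup keyed by lowercased word plus POS tag.
final class DictionaryLemmaLookup {
    private final Map<String, String> lemmas = new HashMap<>();

    void add(String word, String posTag, String lemma) {
        lemmas.put(key(word, posTag), lemma);
    }

    // Returns the dictionary lemma, or the word itself for unknown entries.
    String lemmatize(String word, String posTag) {
        return lemmas.getOrDefault(key(word, posTag), word);
    }

    private static String key(String word, String posTag) {
        return word.toLowerCase(Locale.ROOT) + "\t" + posTag;
    }

    public static void main(String[] args) {
        DictionaryLemmaLookup lookup = new DictionaryLemmaLookup();
        lookup.add("phones", "NNS", "phone"); // invented sample entry
        System.out.println(lookup.lemmatize("phones", "NNS"));
        System.out.println(lookup.lemmatize("tmobile", "NN"));
    }
}
```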

AFAIK the models in 1.8.1 are the same as in 1.5.3.

jds

On Thu, Jul 6, 2017 at 6:26 PM, Ling <li...@gmail.com> wrote:

> The openNLP1.5.3. I will update to 1.8.1 version after this week, if it's
> an issue due to old models.
>
> Thanks.

Re: Stemming in openNLP

Posted by Ling <li...@gmail.com>.

OpenNLP 1.5.3. I will update to version 1.8.1 after this week if the issue
is due to old models.

Thanks.

On Thu, Jul 6, 2017 at 3:19 PM, John Stewart <ca...@gmail.com> wrote:

> What model or dictionary are you using with the lemmatizer?
>
> jds

Re: Stemming in openNLP

Posted by John Stewart <ca...@gmail.com>.

What model or dictionary are you using with the lemmatizer?

jds

On Thu, Jul 6, 2017 at 6:05 PM, Ling <li...@gmail.com> wrote:

> Hi, the problem with lemma is that, for "tmoble", the lemma returned by
> openNLP is "null", not "tmoble".
>
> Why is it?

Re: Stemming in openNLP

Posted by Ling <li...@gmail.com>.

Hi, the problem with the lemma is that for "tmoble" the lemma returned by
OpenNLP is "null", not "tmoble".

Why is that?

On Mon, Jul 3, 2017 at 6:54 PM, Rakesh P <ra...@gmail.com> wrote:

> Hi,
> Stemmer works based on some predefined rules. Examples for rules are "word
> that ends with 'e'". So, if you want to get a meaning word after
> preprocessing, then better use lemmatization.
>
> Regards,
> Rakesh P

Re: Stemming in openNLP

Posted by Rakesh P <ra...@gmail.com>.

Hi,
A stemmer works from a set of predefined rules; one example of a rule is "the word ends with 'e'". So if you want a meaningful word after preprocessing, it is better to use lemmatization.
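
To make the "ends with 'e'" rule concrete, here is a minimal sketch of the
final-"e" rule (step 5a) from the published Porter algorithm. It is written
from the algorithm description, not taken from OpenNLP's source, it covers
only this one rule, and it assumes lowercase ASCII input. The class name
FinalERule is made up for the example.

```java
// Sketch of Porter step 5a only: drop a final "e" when the remaining stem
// is long enough according to Porter's "measure".
final class FinalERule {
    // A consonant is any letter other than a, e, i, o, u, and other than
    // a "y" that is preceded by a consonant.
    static boolean isConsonant(String w, int i) {
        char c = w.charAt(i);
        if ("aeiou".indexOf(c) >= 0) {
            return false;
        }
        if (c == 'y') {
            return i == 0 || !isConsonant(w, i - 1);
        }
        return true;
    }

    // Measure m: the number of vowel-to-consonant transitions in the stem.
    static int measure(String stem) {
        int m = 0;
        boolean prevVowel = false;
        for (int i = 0; i < stem.length(); i++) {
            boolean consonant = isConsonant(stem, i);
            if (consonant && prevVowel) {
                m++;
            }
            prevVowel = !consonant;
        }
        return m;
    }

    // True if the stem ends consonant-vowel-consonant and the last
    // consonant is not w, x or y.
    static boolean endsCvc(String stem) {
        int n = stem.length();
        return n >= 3
            && isConsonant(stem, n - 3)
            && !isConsonant(stem, n - 2)
            && isConsonant(stem, n - 1)
            && "wxy".indexOf(stem.charAt(n - 1)) < 0;
    }

    // Remove a final "e" if m > 1, or if m == 1 and the stem is not cvc.
    static String stripFinalE(String word) {
        if (!word.endsWith("e")) {
            return word;
        }
        String stem = word.substring(0, word.length() - 1);
        int m = measure(stem);
        if (m > 1 || (m == 1 && !endsCvc(stem))) {
            return stem;
        }
        return word;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"iphone", "tmobile", "rate", "tree"}) {
            System.out.println(w + " -> " + stripFinalE(w));
        }
    }
}
```

The rule looks only at the letter pattern, so it treats "iphone" like any
other word; it has no notion of whether the result is a real word.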

Regards,
Rakesh P

> On 03-Jul-2017, at 10:24 PM, Ling <ma...@gmail.com> wrote:
> 
> Hi, I noticed that some words are stemmed like the following:
> 
> iphone ->  iphon
> tmobile -> T-mobil
> 
> Is there some parameter to control this behavior? In such cases, those
> stems are actually harmful, making them become unknown words in text. Since
> these are quite common, I am just curious whether there is a way to change
> the default behavior.
> 
> Thanks.
> Ling