Posted to users@opennlp.apache.org by "T. Kuro Kurosaka" <ku...@bhlab.com> on 2023/01/10 01:23:22 UTC

Portuguese lemmatization model?

Is there a pre-trained lemmatization model for Portuguese and other popular 
languages?

-- 
T. "Kuro" Kurosaka, Orinda, California, USA

Re: Portuguese lemmatization model?

Posted by le...@interia.eu.

Hi

Let me share my experience. The 'babzel' project currently provides computed models for 19 languages.
I also tried to compute an Arabic model. Training the sentence detector, tokenizer and POS tagger succeeded.
Lemmatizer training took a very long time (a few hours) and then failed with an exception like "Serialization error, string too long".
I don't remember the exact exception message, but the computed model could not be written to a file.
So in the end I could not publish this model.

Regards
Leszek

From: "T. Kuro Kurosaka" <ku...@bhlab.com>
To: users@opennlp.apache.org;
Sent: 21:35 Sunday 2023-02-12
Subject: Re: Portuguese lemmatization model?

> Thank you for the responses to my earlier question.
> So far I'm using the models published in the babzel project, but it doesn't have one
> for Arabic.
> Are there any pre-trained lemmatization models of reasonable accuracy (95+%?)
> available?
>
> On 1/9/23 5:23 PM, T. Kuro Kurosaka wrote:
> > Is there a pre-trained lemmatization model for Portuguese and other popular
> > languages?
> >
> Kuro

Re: Portuguese lemmatization model?

Posted by "T. Kuro Kurosaka" <ku...@bhlab.com>.

Thank you for the responses to my earlier question.
So far I'm using the models published in the babzel project, but it doesn't have one
for Arabic.
Are there any pre-trained lemmatization models of reasonable accuracy (95+%?)
available?

On 1/9/23 5:23 PM, T. Kuro Kurosaka wrote:
> Is there a pre-trained lemmatization model for Portuguese and other popular 
> languages?
>
Kuro

Re: Portuguese lemmatization model?

Posted by le...@interia.eu.

Hi

You have to run these four phases in order, because lemmatization needs the tokens and their part-of-speech tags as input:

sentence-detection
tokenization
pos-tagging
lemmatization

In theory you could do lemmatization with an OpenNLP model and handle the other phases in a different way (some hard-coded algorithm?), but I think the simplest way is to use the four OpenNLP models if they are already precomputed.
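
The ordering described above can be sketched with toy stand-ins for each phase. This is only an illustration of the data flow, not OpenNLP code: the function names, the regular expressions and the tiny Portuguese lexicon are all hypothetical, and a real setup would load the four precomputed models instead.

```python
import re

# Hypothetical toy data standing in for trained models.
TAGS = {"os": "DET", "carros": "NOUN", "são": "AUX", "azuis": "ADJ", ".": "PUNCT"}
LEMMAS = {("os", "DET"): "o", ("carros", "NOUN"): "carro",
          ("são", "AUX"): "ser", ("azuis", "ADJ"): "azul"}


def detect_sentences(text):
    # Phase 1: split after sentence-final punctuation followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    # Phase 2: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence)


def pos_tag(tokens):
    # Phase 3: look up a tag per token; "X" marks unknown words.
    return [TAGS.get(tok.lower(), "X") for tok in tokens]


def lemmatize(tokens, tags):
    # Phase 4: needs BOTH the token and its tag, hence the fixed order.
    return [LEMMAS.get((tok.lower(), tag), tok.lower())
            for tok, tag in zip(tokens, tags)]


def analyze(text):
    result = []
    for sentence in detect_sentences(text):
        tokens = tokenize(sentence)
        tags = pos_tag(tokens)
        lemmas = lemmatize(tokens, tags)
        result.append(list(zip(tokens, tags, lemmas)))
    return result
```

Calling analyze("Os carros são azuis.") yields one sentence whose last word is the triple ("azuis", "ADJ", "azul"); the lemmatizer cannot be called before the tagger has produced its tags.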

Regards
Leszek 

From: "T. Kuro Kurosaka" <ku...@bhlab.com>
To: users@opennlp.apache.org; leszekpi@interia.eu;
Sent: 20:53 Tuesday 2023-01-10
Subject: Re: Portuguese lemmatization model?

> Thank you, Leszek!
> It looks promising. It did lemmatize "azuis" -> "azul".
> Are these 4 char filters absolutely required to run the lemmatizers correctly?
>
> Kuro
>

Re: Portuguese lemmatization model?

Posted by Rodrigo Agerri <ro...@ehu.eus>.

Hello,

What I said is not only about Portuguese and not only about one dataset;
it applies to most languages for which we have evaluation data (e.g. the
SIGMORPHON 2019 data, in which around 50 languages are evaluated). The
best results nowadays for lemmatization in NLP are obtained by
supervised approaches, as is the case for the large majority of NLP
tasks (except when there is a very specific application with a very
ad-hoc fine-tuned model). If you check the state of the art in NLP
tasks, that is the current trend of the field, not my personal claim.

Best regards,

Rodrigo

On Fri, 13 Jan 2023 at 12:41, Alexandre Rademaker <ar...@gmail.com> wrote:
>
>
> OpenNLP is mainly machine learning based, but we have the DictionaryLemmatizer with the ability to pass a dictionary of word forms. See https://opennlp.apache.org/docs/2.1.0/manual/opennlp.html#tools.lemmatizer.tagging.api. So you can use the http://github.com/LR-POR/MorphoBr that I mentioned before to prepare the input file for the DictionaryLemmatizer.
>
> The statistical lemmatizer is also available, and that would require a model to run. You can train yourself or use one already available from the link provided by Leszek.
>
> Rodrigo Agerri made a strong claim saying that supervised lemmatizer works better. I don’t want to go into that discussion, but I believe the decision about an ML-based (supervised or not) and rule-based approach should be based on many more criteria than the performance in a single dataset.
>
> Best,
> Alexandre

Re: Portuguese lemmatization model?

Posted by Rodrigo Agerri <ro...@ehu.eus>.

Hello again,

On Fri, 13 Jan 2023 at 12:41, Alexandre Rademaker <ar...@gmail.com> wrote:
>
>
> OpenNLP is mainly machine learning based, but we have the DictionaryLemmatizer with the ability to pass a dictionary of word forms. See https://opennlp.apache.org/docs/2.1.0/manual/opennlp.html#tools.lemmatizer.tagging.api. So you can use the http://github.com/LR-POR/MorphoBr that I mentioned before to prepare the input file for the DictionaryLemmatizer.
>

The main motivation for providing a DictionaryLemmatizer was to be
able to post-process (correct) the errors of the statistical model.
Note that dictionaries suffer from low coverage, even the large
dictionaries from Freeling etc., so the dictionary-based lemmatizer is
limited to the entries contained in the dictionary.
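
The post-processing idea above could look roughly like the following sketch: a dictionary entry, when present, overrides the statistical prediction, and everything outside the dictionary's coverage falls through to the statistical lemmatizer. Everything here is hypothetical; toy_statistical is a deliberately naive stand-in for a trained model, not how OpenNLP predicts lemmas.

```python
def toy_statistical(token, tag):
    # Naive stand-in for a trained model: strip a plural "s".
    # It gets regular plurals right and irregular ones wrong.
    word = token.lower()
    return word[:-1] if word.endswith("s") else word


def lemmatize_with_override(tokens, tags, dictionary, statistical=toy_statistical):
    lemmas = []
    for token, tag in zip(tokens, tags):
        key = (token.lower(), tag)
        if key in dictionary:
            # Dictionary hit: trust the curated entry.
            lemmas.append(dictionary[key])
        else:
            # Outside dictionary coverage: keep the statistical guess.
            lemmas.append(statistical(token, tag))
    return lemmas


# Hypothetical correction dictionary with a single irregular plural.
CORRECTIONS = {("azuis", "ADJ"): "azul"}
```

With tokens ["carros", "azuis"] and tags ["NOUN", "ADJ"], the stand-in alone returns ["carro", "azui"], while the override version returns ["carro", "azul"]. A small dictionary is enough to fix known errors even though it could never cover the whole vocabulary by itself.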

> The statistical lemmatizer is also available, and that would require a model to run. You can train yourself or use one already available from the link provided by Leszek.

Our experiments at the time showed that in terms of performance
Perceptron was a better choice for lemmatization. It is quite fast and
cheap to train a lemmatizer with UD data.
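
As a rough sketch of the data preparation this implies, the snippet below pulls form, universal POS tag and lemma out of CoNLL-U text (the Universal Dependencies file format) and writes them as tab-separated rows with a blank line between sentences. That output layout (word, POS tag, lemma) is what I understand the OpenNLP lemmatizer trainer to expect, so please check it against the manual; the function names and the sample sentence are made up for illustration.

```python
def conllu_to_rows(conllu_text):
    """Return a list of sentences, each a list of (form, upos, lemma)."""
    sentences, current = [], []
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed CoNLL-U token line
        tok_id, form, lemma, upos = cols[0], cols[1], cols[2], cols[3]
        if "-" in tok_id or "." in tok_id:
            continue  # skip multiword-token ranges and empty nodes
        current.append((form, upos, lemma))
    if current:
        sentences.append(current)
    return sentences


def rows_to_training_text(sentences):
    blocks = ["\n".join("\t".join(row) for row in sent) for sent in sentences]
    return "\n\n".join(blocks) + "\n"


def _row(tok_id, form, lemma, upos):
    # Helper for the sample: fill the six remaining columns with "_".
    return "\t".join([tok_id, form, lemma, upos] + ["_"] * 6)


# Made-up sample; "do" is a contraction that UD splits into "de" + "o".
SAMPLE = "\n".join([
    "# text = Gosto do carro.",
    _row("1", "Gosto", "gostar", "VERB"),
    _row("2-3", "do", "_", "_"),
    _row("2", "de", "de", "ADP"),
    _row("3", "o", "o", "DET"),
    _row("4", "carro", "carro", "NOUN"),
    _row("5", ".", ".", "PUNCT"),
    "",
])
```

The range line "2-3" is dropped, so the contraction contributes its two syntactic words rather than the surface form. Whether that is the right choice depends on how the tokenizer model treats contractions.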

Best regards,

Rodrigo

Re: Portuguese lemmatization model?

Posted by Alexandre Rademaker <ar...@gmail.com>.

OpenNLP is mainly machine-learning based, but we have the DictionaryLemmatizer, which accepts a dictionary of word forms. See https://opennlp.apache.org/docs/2.1.0/manual/opennlp.html#tools.lemmatizer.tagging.api. So you can use MorphoBr (http://github.com/LR-POR/MorphoBr), which I mentioned before, to prepare the input file for the DictionaryLemmatizer.
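
To make the dictionary idea concrete, here is a minimal sketch of a lookup over a tab-separated file with one word form, POS tag and lemma per line, which is how I read the dictionary format in the linked manual section. It is not the DictionaryLemmatizer itself: the function names, the sample entries and the "O" marker for unknown words are assumptions made for this illustration.

```python
UNKNOWN = "O"  # marker chosen for this sketch when no entry matches


def parse_dictionary(text):
    """Map (word, postag) -> lemma from tab-separated lines."""
    entries = {}
    for line in text.splitlines():
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        word, postag, lemma = parts[0], parts[1], parts[2]
        entries[(word, postag)] = lemma
    return entries


def lookup(entries, word, postag):
    # The POS tag is part of the key, so the same form can map to
    # different lemmas depending on its tag.
    return entries.get((word, postag), UNKNOWN)


# Hypothetical sample entries.
SAMPLE_DICT = "azuis\tADJ\tazul\ncarros\tNOUN\tcarro\n"
```

A form outside the file, such as "verdes", comes back as the unknown marker, which is the coverage limit any dictionary-based lemmatizer has to live with.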

The statistical lemmatizer is also available, but it requires a model to run. You can train one yourself or use one already available from the link provided by Leszek.

Rodrigo Agerri made a strong claim that supervised lemmatizers work better. I don’t want to go into that discussion, but I believe the choice between an ML-based (supervised or not) and a rule-based approach should rest on many more criteria than performance on a single dataset.

Best,
Alexandre 

> On 13 Jan 2023, at 01:48, T. Kuro Kurosaka <ku...@bhlab.com> wrote:
> 
> I wrote "model" just because I did not know OpenNLP supports a rule-based approach.
> Are there rule files that I can try for Portuguese and other major languages?
> 
> Kuro
>

Re: Portuguese lemmatization model?

Posted by "T. Kuro Kurosaka" <ku...@bhlab.com>.

I wrote "model" just because I did not know OpenNLP supports a rule-based approach.
Are there rule files that I can try for Portuguese and other major languages?

Kuro

On 1/11/23 3:03 AM, Alexandre Rademaker wrote:
> Your first message asks for a lemmatization ‘model’ for Portuguese. I don’t have numbers right now to support my claim, but I feel that lemmatization (morphosyntactic analysis) is best done with a rule-based approach, finite-state in particular, with the possible support of a lexical resource.

-- 
T. "Kuro" Kurosaka, Orinda, California, USA

Re: Portuguese lemmatization model?

Posted by Rodrigo Agerri <ro...@ehu.eus>.

Hello,

Lemmatization works best in a supervised approach, currently neural
models, unless you have an extremely sophisticated approach for a
specific language.

https://aclanthology.org/W19-4226/

In OpenNLP we implemented Chrupala's approach (it performs better than
Freeling for Spanish and Catalan):

https://doras.dcu.ie/15272/

https://doras.dcu.ie/550/

as interpreted here:

https://github.com/ixa-ehu/ixa-pipe-pos
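
For readers unfamiliar with the idea behind the cited work: as I understand it, lemmatization is cast as classification, where the class to predict is an edit script that turns the word form into its lemma. The sketch below uses a much simpler suffix-rewrite rule (strip N characters, append a string) instead of the shortest edit scripts of the papers, and it does not reproduce the OpenNLP or ixa-pipe-pos code; names and training pairs are hypothetical.

```python
import os


def induce_rule(form, lemma):
    """Return (chars_to_strip, suffix_to_append) turning form into lemma."""
    shared = len(os.path.commonprefix([form, lemma]))
    return (len(form) - shared, lemma[shared:])


def apply_rule(form, rule):
    strip, suffix = rule
    return form[:len(form) - strip] + suffix


# Hypothetical training pairs (form, lemma). A real system would train a
# classifier to predict the rule from features of the word and its tag.
PAIRS = [("azuis", "azul"), ("carros", "carro")]
RULES = {form: induce_rule(form, lemma) for form, lemma in PAIRS}
```

The rule induced from "carros" is (1, ""), and applying it to the unseen form "livros" gives "livro". Because the classes are rewrite rules rather than lemmas, such a model can produce lemmas it never saw in training.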

Best regards,

Rodrigo


On Wed, 11 Jan 2023 at 12:04, Alexandre Rademaker <ar...@gmail.com> wrote:
>
>
> Hi Kuro,
>
> Your first message asks for a lemmatization ‘model’ for Portuguese. I don’t have numbers right now to support my claim, but I feel that lemmatization (morphosyntactic analysis) is best done with a rule-based approach, finite-state in particular, with the possible support of a lexical resource. For Portuguese, I maintain MorphoBr:
>
> https://github.com/LR-POR/MorphoBr
>
> We are also expanding the rules implemented in http://fomafst.github.io to compact the full-form dictionary and better integrate it with the HPSG grammar we are developing.
>
> A similar approach was adopted by Freeling, https://github.com/TALP-UPC/FreeLing. I have collaborated with Lluís Padró (Freeling's author) to expand the Portuguese support of it.
>
> Comments are welcome.
>
> Best,
> Alexandre

Re: Portuguese lemmatization model?

Posted by Alexandre Rademaker <ar...@gmail.com>.

Hi Kuro, 

Your first message asks for a lemmatization ‘model’ for Portuguese. I don’t have numbers right now to support my claim, but I feel that lemmatization (morphosyntactic analysis) is best done with a rule-based approach, finite-state in particular, with the possible support of a lexical resource. For Portuguese, I maintain MorphoBr:

https://github.com/LR-POR/MorphoBr

We are also expanding the rules implemented in http://fomafst.github.io to compact the full-form dictionary and better integrate it with the HPSG grammar we are developing.

A similar approach was adopted by Freeling, https://github.com/TALP-UPC/FreeLing. I have collaborated with Lluís Padró (Freeling's author) to expand its Portuguese support.

Comments are welcome.

Best,
Alexandre

> On 10 Jan 2023, at 19:43, T. Kuro Kurosaka <ku...@bhlab.com> wrote:
> 
> Thank you, Leszek!
> It looks promising. It did lemmatize "azuis" -> "azul".
> Are these 4 char filters absolutely required to run the lemmatizers correctly ?
> 
> Kuro
> 

Re: Portuguese lemmatization model?

Posted by "T. Kuro Kurosaka" <ku...@bhlab.com>.

Thank you, Leszek!
It looks promising. It did lemmatize "azuis" -> "azul".
Are these 4 char filters absolutely required to run the lemmatizers correctly?

Kuro

On 1/10/23 12:45 AM, leszekpi@interia.eu wrote:
> Hi
>
> As far as I know there is no Portuguese lemmatizer on the official OpenNLP site.
> In general such models are not easily available, at least for less popular languages.
>
> I developed an application to automatically compute sentence-detector, tokenizer, pos-tagger and lemmatizer models from Universal Dependencies language files.
> For now models are generated for 19 languages (including Portuguese).
>
> Main app: https://github.com/abzif/babzel
> Pre-trained models: https://abzif.github.io/babzel/models.html
>
> Enjoy!
> Leszek Piotrowicz

-- 
T. "Kuro" Kurosaka, Orinda, California, USA

Re: Portuguese lemmatization model?

Posted by le...@interia.eu.

Hi

As far as I know there is no Portuguese lemmatizer on the official OpenNLP site.
In general such models are not easily available, at least for less popular languages.

I developed an application to automatically compute sentence-detector, tokenizer, pos-tagger and lemmatizer models from Universal Dependencies language files.
For now models are generated for 19 languages (including Portuguese).

Main app: https://github.com/abzif/babzel
Pre-trained models: https://abzif.github.io/babzel/models.html

Enjoy!
Leszek Piotrowicz

From: "T. Kuro Kurosaka" <ku...@bhlab.com>
To: users@opennlp.apache.org;
Sent: 2:29 Tuesday 2023-01-10
Subject: Portuguese lemmatization model?

> Is there a pre-trained lemmatization model for Portuguese and other popular
> languages?
>
> -- 
> T. "Kuro" Kurosaka, Orinda, California, USA