Posted to users@opennlp.apache.org by "T. Kuro Kurosaka" <ku...@bhlab.com> on 2019/07/09 01:04:00 UTC

Where can I find Spanish Lemmatizer training data ?

I downloaded OpenNLP hoping that I could use it to lemmatize Spanish text.

But https://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.lemmatizer
seems to be saying I have to train a model first to use the lemmatizer.

Although it says "The Universal Dependencies Treebank and the CoNLL 2009 
datasets distribute training data for many languages.", I am having difficulty 
finding one.

I thought
https://github.com/UniversalDependencies/UD_Spanish-GSD
may be it, but the files there are in an XML format.

Can someone point me to open-source lemmatizer training data in a format 
the OpenNLP UIMA Lemmatizer can use?

Thank you in advance.

-- 
T. "Kuro" Kurosaka, Berkeley, California, USA


Re: Where can I find Spanish Lemmatizer training data ?

Posted by John Stewart <ca...@gmail.com>.
Freeling will do everything you need in Spanish and more:
http://nlp.lsi.upc.edu/freeling/index.php/node/4

jds

On Wed, Jul 10, 2019 at 7:17 PM T. Kuro Kurosaka <ku...@bhlab.com> wrote:

> Thank you Rodrigo. I've tried the lemmatizer data but I found it's not as
> simple as I hoped. It seems to require extra classes from IXA-PIPEs which is
> a 500 MB download.
>

Re: Where can I find Spanish Lemmatizer training data ?

Posted by "T. Kuro Kurosaka" <ku...@bhlab.com>.
Thank you Rodrigo. I've tried the lemmatizer data but I found it's not as simple 
as I hoped. It seems to require extra classes from IXA-PIPEs which is a 500 MB 
download.

$ echo 'Todo es amor.' | ~/opt/apache-opennlp-1.9.1/bin/opennlp LemmatizerME openNLP/data/es-lemma-perceptron-ancora-2.0.bin
Loading Lemmatizer model ... Exception in thread "main" java.lang.IllegalArgumentException: opennlp.tools.util.InvalidFormatException: Could not instantiate the eus.ixa.ixa.pipe.lemma.LemmatizerFactory. The initialization throw an exception.
     at opennlp.tools.util.model.BaseModel.initializeFactory(BaseModel.java:259)
     at opennlp.tools.util.model.BaseModel.loadModel(BaseModel.java:234)
     at opennlp.tools.util.model.BaseModel.<init>(BaseModel.java:176)
     at opennlp.tools.lemmatizer.LemmatizerModel.<init>(LemmatizerModel.java:74)
     at opennlp.tools.cmdline.lemmatizer.LemmatizerModelLoader.loadModel(LemmatizerModelLoader.java:39)
     at opennlp.tools.cmdline.lemmatizer.LemmatizerModelLoader.loadModel(LemmatizerModelLoader.java:31)
     at opennlp.tools.cmdline.ModelLoader.load(ModelLoader.java:56)
     at opennlp.tools.cmdline.lemmatizer.LemmatizerMETool.run(LemmatizerMETool.java:51)
     at opennlp.tools.cmdline.CLI.main(CLI.java:259)
Caused by: opennlp.tools.util.InvalidFormatException: Could not instantiate the eus.ixa.ixa.pipe.lemma.LemmatizerFactory. The initialization throw an exception.
     at opennlp.tools.util.BaseToolFactory.create(BaseToolFactory.java:116)
     at opennlp.tools.util.model.BaseModel.initializeFactory(BaseModel.java:257)
     ... 8 more
Caused by: opennlp.tools.util.ext.ExtensionNotLoadedException: Unable to find implementation for opennlp.tools.util.BaseToolFactory, the class or service *eus.ixa.ixa.pipe.lemma.LemmatizerFactory could not be located!*
     at opennlp.tools.util.ext.ExtensionLoader.instantiateExtension(ExtensionLoader.java:119)
     at opennlp.tools.util.BaseToolFactory.create(BaseToolFactory.java:108)
     ... 9 more
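
The last "Caused by" above says the model names a factory class that is not on the classpath. As a rough diagnostic, the sketch below reads which factory a model file asks for before trying to load it. It assumes that an OpenNLP model .bin file is a zip archive containing a manifest.properties entry, and that the factory class is stored under the key "factory"; both are assumptions about the model layout, not something shown in this thread. The function name model_factory_class is made up for illustration.

```python
import zipfile


def model_factory_class(model_path):
    """Return the factory class named in a model's manifest, or None.

    Assumption: the model is a zip archive with a 'manifest.properties'
    entry in Java key=value properties syntax, and the factory class is
    stored under the key 'factory'.
    """
    with zipfile.ZipFile(model_path) as zf:
        text = zf.read("manifest.properties").decode("utf-8", errors="replace")
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and properties-file comments.
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "factory":
            return value.strip()
    return None


# Example (path taken from the command above):
# print(model_factory_class("openNLP/data/es-lemma-perceptron-ancora-2.0.bin"))
```

If this prints a class outside the opennlp.* packages, such as the eus.ixa one in the trace, the jar providing that class has to be on the classpath for the model to load.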


On 7/10/19 7:15 AM, Rodrigo Agerri wrote:
> You can also find an already trained lemmatizer (trained with general
> news text) for Spanish here:
>
> http://ixa2.si.ehu.es/ixa-pipes/

-- 
T. "Kuro" Kurosaka, Berkeley, California, USA


Re: Where can I find Spanish Lemmatizer training data ?

Posted by Rodrigo Agerri <ro...@ehu.eus>.
There are also examples of training via API:

https://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.lemmatizer.training.api

An example PerceptronTrainer parameter file can be found here:

https://github.com/apache/opennlp/blob/master/opennlp-tools/lang/ml/PerceptronTrainerParams.txt
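
For orientation, the parameter file is a plain key=value text file. Below is a minimal sketch that writes one; the keys Algorithm, Iterations and Cutoff are the usual OpenNLP training parameters, but the values are only illustrative and should be checked against the linked file. The function name render_trainer_params is made up for this example.

```python
def render_trainer_params(algorithm="PERCEPTRON", iterations=100, cutoff=0):
    """Render a key=value parameter file body for the -params option.

    The default values are illustrative assumptions, not recommended
    settings; compare with the PerceptronTrainerParams.txt linked above.
    """
    lines = [
        f"Algorithm={algorithm}",
        f"Iterations={iterations}",
        f"Cutoff={cutoff}",
    ]
    return "\n".join(lines) + "\n"


# Example: write the file that the training command expects via -params.
# with open("PerceptronTrainerParams.txt", "w", encoding="utf-8") as fh:
#     fh.write(render_trainer_params())
```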

You can also find an already trained lemmatizer (trained with general
news text) for Spanish here:

http://ixa2.si.ehu.es/ixa-pipes/

R

On Tue, 9 Jul 2019 at 21:49, T. Kuro Kurosaka <ku...@bhlab.com> wrote:
>
> Thank you.
>
> Where can I find a sample parameter files, or a syntax of the parameter file?
>
> The training instructions only tells us a command line sample
>
> $ opennlp LemmatizerTrainerME -model en-lemmatizer.bin -params PerceptronTrainerParams.txt -lang en -data en-lemmatizer.train -encoding UTF-8
>
>
> Kuro
>
>

Re: Where can I find Spanish Lemmatizer training data ?

Posted by "T. Kuro Kurosaka" <ku...@bhlab.com>.
Thank you.

Where can I find a sample parameter file, or a description of the parameter file syntax?

The training instructions only give a sample command line:

$ opennlp LemmatizerTrainerME -model en-lemmatizer.bin -params PerceptronTrainerParams.txt -lang en -data en-lemmatizer.train -encoding UTF-8


Kuro



Re: Where can I find Spanish Lemmatizer training data ?

Posted by Martin Krallinger <kr...@gmail.com>.
Dear Daniel,

you can find a Spanish Lemmatizer corpus (medical domain) at:

https://github.com/PlanTL-SANIDAD/SPACCC_TOKEN

Under the corpus folder (validation)



Regards,

Martin

On Tue, 9 Jul 2019 at 15:11, Dan Russ (<da...@gmail.com>) wrote:

> Hello,
>    It looks like the GitHub repo has files in conllu format, which is
> readable by opennlp.
>
>    es_gsd-ud-dev.conllu
> Daniel
>

-- 
=======================================
Martin Krallinger, Dr.
--------------------------------------------------------------------
Head of Biological Text Mining Unit
Barcelona Supercomputing Center (BSC-CNS)
--------------------------------------------------------------------
Oficina Técnica General (OTG) del Plan TL en el
área de Biomedicina de la
*Secretaría de Estado *
*para* el Avance Digital

=======================================

Re: Where can I find Spanish Lemmatizer training data ?

Posted by Dan Russ <da...@gmail.com>.
Hello,
   It looks like the GitHub repo has files in CoNLL-U format, which is readable by OpenNLP.

   es_gsd-ud-dev.conllu
Daniel
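
In case a plain three-column training file is wanted instead, here is a minimal sketch that converts CoNLL-U lines into word, POS tag and lemma columns, one token per line with an empty line between sentences. It assumes that is the layout the lemmatizer trainer expects (please check the manual), and it uses the UPOS column as the tag, which is a choice made here rather than a requirement. The function name conllu_to_opennlp is made up for this example.

```python
def conllu_to_opennlp(lines):
    """Convert CoNLL-U lines to 'form<TAB>upos<TAB>lemma' lines.

    An empty string is emitted after each sentence so that sentences end
    up separated by blank lines when the result is joined with newlines.
    """
    out = []
    sentence = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                out.extend(sentence)
                out.append("")
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed CoNLL-U token line
        tok_id, form, lemma, upos = cols[:4]
        if not tok_id.isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        sentence.append(f"{form}\t{upos}\t{lemma}")
    if sentence:
        # Flush the last sentence if the input lacks a trailing blank line.
        out.extend(sentence)
        out.append("")
    return out


# Example:
# with open("es_gsd-ud-dev.conllu", encoding="utf-8") as src:
#     converted = conllu_to_opennlp(src)
# with open("es-lemmatizer.train", "w", encoding="utf-8") as dst:
#     dst.write("\n".join(converted) + "\n")
```

Whichever tag column is used for training, the tags passed to the lemmatizer at run time would need to come from the same tag set.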
