Posted to users@opennlp.apache.org by "T. Kuro Kurosaka" <ku...@bhlab.com> on 2019/07/09 01:04:00 UTC
Where can I find Spanish Lemmatizer training data ?
I downloaded OpenNLP hoping that I could use it to lemmatize Spanish text.
But https://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.lemmatizer
seems to be saying I have to train a model first to use the lemmatizer.
Although it says "The Universal Dependencies Treebank and the CoNLL 2009
datasets distribute training data for many languages.", I am having difficulty
finding one.
I thought
https://github.com/UniversalDependencies/UD_Spanish-GSD
might be it, but the files there are in an XML format.
Can someone point me to open-source lemmatizer training data in a format
the OpenNLP UIMA Lemmatizer can use?
Thank you in advance.
--
T. "Kuro" Kurosaka, Berkeley, California, USA
Re: Where can I find Spanish Lemmatizer training data ?
Posted by John Stewart <ca...@gmail.com>.
Freeling will do everything you need in Spanish and more:
http://nlp.lsi.upc.edu/freeling/index.php/node/4
jds
On Wed, Jul 10, 2019 at 7:17 PM T. Kuro Kurosaka <ku...@bhlab.com> wrote:
> Thank you Rodrigo. I've tried the lemmatizer data but I found it's not as
> simple as I hoped. It seems to require extra classes from IXA-PIPEs which
> is a 500 MB download.
>
> $ echo 'Todo es amor.' | ~/opt/apache-opennlp-1.9.1/bin/opennlp LemmatizerME openNLP/data/es-lemma-perceptron-ancora-2.0.bin
> Loading Lemmatizer model ... Exception in thread "main" java.lang.IllegalArgumentException: opennlp.tools.util.InvalidFormatException: Could not instantiate the eus.ixa.ixa.pipe.lemma.LemmatizerFactory. The initialization throw an exception.
>     at opennlp.tools.util.model.BaseModel.initializeFactory(BaseModel.java:259)
>     at opennlp.tools.util.model.BaseModel.loadModel(BaseModel.java:234)
>     at opennlp.tools.util.model.BaseModel.<init>(BaseModel.java:176)
>     at opennlp.tools.lemmatizer.LemmatizerModel.<init>(LemmatizerModel.java:74)
>     at opennlp.tools.cmdline.lemmatizer.LemmatizerModelLoader.loadModel(LemmatizerModelLoader.java:39)
>     at opennlp.tools.cmdline.lemmatizer.LemmatizerModelLoader.loadModel(LemmatizerModelLoader.java:31)
>     at opennlp.tools.cmdline.ModelLoader.load(ModelLoader.java:56)
>     at opennlp.tools.cmdline.lemmatizer.LemmatizerMETool.run(LemmatizerMETool.java:51)
>     at opennlp.tools.cmdline.CLI.main(CLI.java:259)
> Caused by: opennlp.tools.util.InvalidFormatException: Could not instantiate the eus.ixa.ixa.pipe.lemma.LemmatizerFactory. The initialization throw an exception.
>     at opennlp.tools.util.BaseToolFactory.create(BaseToolFactory.java:116)
>     at opennlp.tools.util.model.BaseModel.initializeFactory(BaseModel.java:257)
>     ... 8 more
> Caused by: opennlp.tools.util.ext.ExtensionNotLoadedException: Unable to find implementation for opennlp.tools.util.BaseToolFactory, the class or service *eus.ixa.ixa.pipe.lemma.LemmatizerFactory could not be located!*
>     at opennlp.tools.util.ext.ExtensionLoader.instantiateExtension(ExtensionLoader.java:119)
>     at opennlp.tools.util.BaseToolFactory.create(BaseToolFactory.java:108)
>     ... 9 more
>
>
> On 7/10/19 7:15 AM, Rodrigo Agerri wrote:
> > You can also find an already trained lemmatizer (trained with general
> > news text) for Spanish here:
> >
> > http://ixa2.si.ehu.es/ixa-pipes/
>
> --
> T. "Kuro" Kurosaka, Berkeley, California, USA
>
>
Re: Where can I find Spanish Lemmatizer training data ?
Posted by "T. Kuro Kurosaka" <ku...@bhlab.com>.
Thank you Rodrigo. I've tried the lemmatizer data but I found it's not as simple
as I hoped. It seems to require extra classes from IXA-PIPEs which is a 500 MB
download.
$ echo 'Todo es amor.' | ~/opt/apache-opennlp-1.9.1/bin/opennlp LemmatizerME openNLP/data/es-lemma-perceptron-ancora-2.0.bin
Loading Lemmatizer model ... Exception in thread "main" java.lang.IllegalArgumentException: opennlp.tools.util.InvalidFormatException: Could not instantiate the eus.ixa.ixa.pipe.lemma.LemmatizerFactory. The initialization throw an exception.
    at opennlp.tools.util.model.BaseModel.initializeFactory(BaseModel.java:259)
    at opennlp.tools.util.model.BaseModel.loadModel(BaseModel.java:234)
    at opennlp.tools.util.model.BaseModel.<init>(BaseModel.java:176)
    at opennlp.tools.lemmatizer.LemmatizerModel.<init>(LemmatizerModel.java:74)
    at opennlp.tools.cmdline.lemmatizer.LemmatizerModelLoader.loadModel(LemmatizerModelLoader.java:39)
    at opennlp.tools.cmdline.lemmatizer.LemmatizerModelLoader.loadModel(LemmatizerModelLoader.java:31)
    at opennlp.tools.cmdline.ModelLoader.load(ModelLoader.java:56)
    at opennlp.tools.cmdline.lemmatizer.LemmatizerMETool.run(LemmatizerMETool.java:51)
    at opennlp.tools.cmdline.CLI.main(CLI.java:259)
Caused by: opennlp.tools.util.InvalidFormatException: Could not instantiate the eus.ixa.ixa.pipe.lemma.LemmatizerFactory. The initialization throw an exception.
    at opennlp.tools.util.BaseToolFactory.create(BaseToolFactory.java:116)
    at opennlp.tools.util.model.BaseModel.initializeFactory(BaseModel.java:257)
    ... 8 more
Caused by: opennlp.tools.util.ext.ExtensionNotLoadedException: Unable to find implementation for opennlp.tools.util.BaseToolFactory, the class or service *eus.ixa.ixa.pipe.lemma.LemmatizerFactory could not be located!*
    at opennlp.tools.util.ext.ExtensionLoader.instantiateExtension(ExtensionLoader.java:119)
    at opennlp.tools.util.BaseToolFactory.create(BaseToolFactory.java:108)
    ... 9 more
On 7/10/19 7:15 AM, Rodrigo Agerri wrote:
> You can also find an already trained lemmatizer (trained with general
> news text) for Spanish here:
>
> http://ixa2.si.ehu.es/ixa-pipes/
--
T. "Kuro" Kurosaka, Berkeley, California, USA
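
The last "Caused by" in the trace above says the model names a factory class that the JVM could not find on its classpath. One way to see which jar, if any, provides that class is to scan candidate jars for the matching .class entry. The following is only a minimal sketch: the helper names are made up for illustration, and which IXA pipes jar actually ships the class is not confirmed anywhere in this thread.

```python
import zipfile


def class_entry(class_name):
    """Map a fully qualified Java class name to its path inside a jar."""
    return class_name.replace(".", "/") + ".class"


def jars_containing(class_name, jar_paths):
    """Return the jars (from jar_paths) that contain the given class.

    jar_paths is any iterable of paths to .jar files; a jar is a zip
    archive, so the standard zipfile module can list its entries.
    """
    entry = class_entry(class_name)
    hits = []
    for path in jar_paths:
        with zipfile.ZipFile(path) as jar:
            if entry in jar.namelist():
                hits.append(path)
    return hits


if __name__ == "__main__":
    # The class name is taken from the stack trace in this message.
    print(class_entry("eus.ixa.ixa.pipe.lemma.LemmatizerFactory"))
```

Calling jars_containing("eus.ixa.ixa.pipe.lemma.LemmatizerFactory", paths) with the jars from a downloaded distribution would then list the ones that need to be visible to OpenNLP when the model is loaded.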
Re: Where can I find Spanish Lemmatizer training data ?
Posted by Rodrigo Agerri <ro...@ehu.eus>.
There are also examples of training via API:
https://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.lemmatizer.training.api
An example PerceptronTrainer parameter file can be found here:
https://github.com/apache/opennlp/blob/master/opennlp-tools/lang/ml/PerceptronTrainerParams.txt
You can also find an already trained lemmatizer (trained with general
news text) for Spanish here:
http://ixa2.si.ehu.es/ixa-pipes/
R
On Tue, 9 Jul 2019 at 21:49, T. Kuro Kurosaka <ku...@bhlab.com> wrote:
>
> Thank you.
>
> Where can I find a sample parameter files, or a syntax of the parameter file?
>
> The training instructions only tells us a command line sample
>
> $ opennlp LemmatizerTrainerME -model en-lemmatizer.bin -params PerceptronTrainerParams.txt -lang en -data en-lemmatizer.train -encoding UTF-8
>
>
> Kuro
>
>
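
For readers who cannot open the linked file: an OpenNLP training parameter file is a plain key=value text file. The sketch below renders one and builds the training command quoted above, adapted to Spanish. Algorithm, Iterations and Cutoff are standard OpenNLP training parameter keys, but the values shown are illustrative assumptions and are not copied from PerceptronTrainerParams.txt. The file names es-lemmatizer.bin and es-lemmatizer.train are hypothetical.

```python
# Illustrative values only; tune them for your own data.
PARAMS = {
    "Algorithm": "PERCEPTRON",
    "Iterations": "100",
    "Cutoff": "5",
}


def render_params(params):
    """Render the parameters as key=value lines, one per line."""
    return "".join(f"{key}={value}\n" for key, value in params.items())


def training_command(model, params_file, lang, data):
    """Build the LemmatizerTrainerME command line as an argument list.

    The flags mirror the sample command from the OpenNLP manual that is
    quoted in this thread; only the argument values differ.
    """
    return [
        "opennlp", "LemmatizerTrainerME",
        "-model", model,
        "-params", params_file,
        "-lang", lang,
        "-data", data,
        "-encoding", "UTF-8",
    ]


if __name__ == "__main__":
    print(render_params(PARAMS), end="")
    print(" ".join(training_command(
        "es-lemmatizer.bin",            # hypothetical output model name
        "PerceptronTrainerParams.txt",
        "es",
        "es-lemmatizer.train",          # hypothetical training data file
    )))
```

Writing the rendered text to PerceptronTrainerParams.txt and running the printed command should correspond to the manual's English example with Spanish data substituted.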
Re: Where can I find Spanish Lemmatizer training data ?
Posted by "T. Kuro Kurosaka" <ku...@bhlab.com>.
Thank you.
Where can I find a sample parameter file, or a description of the parameter file syntax?
The training instructions only give a sample command line:
$ opennlp LemmatizerTrainerME -model en-lemmatizer.bin -params PerceptronTrainerParams.txt -lang en -data en-lemmatizer.train -encoding UTF-8
Kuro
Re: Where can I find Spanish Lemmatizer training data ?
Posted by Martin Krallinger <kr...@gmail.com>.
Dear Daniel,
you can find a Spanish Lemmatizer corpus (medical domain) at:
https://github.com/PlanTL-SANIDAD/SPACCC_TOKEN
Under the corpus folder (validation)
Regards,
Martin
On Tue, Jul 9, 2019 at 15:11, Dan Russ (<da...@gmail.com>) wrote:
> Hello,
> It looks like the GitHub repo has files in conllu format, which is
> readable by opennlp.
>
> es_gsd-ud-dev.conllu
> Daniel
>
> > On Jul 8, 2019, at 9:04 PM, T. Kuro Kurosaka <ku...@bhlab.com> wrote:
> >
> > I downloaded OpenNLP hoping that I could use it to lemmatize Spanish text.
> >
> > But https://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.lemmatizer
> > seems to be saying I have to train a model first to use the lemmatizer.
> >
> > Although it says "The Universal Dependencies Treebank and the CoNLL 2009
> > datasets distribute training data for many languages.", I am having
> > difficulty finding one.
> >
> > I thought
> > https://github.com/UniversalDependencies/UD_Spanish-GSD
> > may be it, but the files there are in an XML format.
> >
> > Can someone point me to an open-source lemmatizer training data in the
> > format openNLP UIMA Lemmatizer can use ?
> >
> > Thank you in advance.
> >
> > --
> > T. "Kuro" Kurosaka, Berkeley, California, USA
> >
>
>
--
=======================================
Martin Krallinger, Dr.
--------------------------------------------------------------------
Head of Biological Text Mining Unit
Barcelona Supercomputing Center (BSC-CNS)
--------------------------------------------------------------------
Oficina Técnica General (OTG) del Plan TL en el
área de Biomedicina de la
Secretaría de Estado
para el Avance Digital
=======================================
Re: Where can I find Spanish Lemmatizer training data ?
Posted by Dan Russ <da...@gmail.com>.
Hello,
It looks like the GitHub repo has files in CoNLL-U format, which OpenNLP can read, for example:
es_gsd-ud-dev.conllu
Daniel
> On Jul 8, 2019, at 9:04 PM, T. Kuro Kurosaka <ku...@bhlab.com> wrote:
>
> I downloaded OpenNLP hoping that I could use it to lemmatize Spanish text.
>
> But https://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.lemmatizer
> seems to be saying I have to train a model first to use the lemmatizer.
>
> Although it says "The Universal Dependencies Treebank and the CoNLL 2009 datasets distribute training data for many languages.", I am having difficulty finding one.
>
> I thought
> https://github.com/UniversalDependencies/UD_Spanish-GSD
> may be it, but the files there are in an XML format.
>
> Can someone point me to an open-source lemmatizer training data in the format openNLP UIMA Lemmatizer can use ?
>
> Thank you in advance.
>
> --
> T. "Kuro" Kurosaka, Berkeley, California, USA
>
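
If the CoNLL-U files are to be fed to the lemmatizer trainer as plain training data rather than through OpenNLP's own CoNLL-U support, they can be reduced to the three-column layout described in the OpenNLP manual: word, POS tag and lemma separated by tabs, one token per line, with an empty line between sentences. The following is a minimal sketch under those assumptions. The function name is made up, the choice of the universal POS column (UPOS) over the language-specific one is an assumption, and the sample sentence is hand-annotated for illustration and not taken from the treebank.

```python
def conllu_to_lemmatizer_train(lines):
    """Yield 'form<TAB>upos<TAB>lemma' lines, plus '' after each sentence."""
    in_sentence = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if in_sentence:
                yield ""
                in_sentence = False
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        if len(cols) < 4:
            continue  # not a token line
        token_id, form, lemma, upos = cols[0], cols[1], cols[2], cols[3]
        # Skip multiword token ranges ("1-2") and empty nodes ("3.1").
        if "-" in token_id or "." in token_id:
            continue
        yield f"{form}\t{upos}\t{lemma}"
        in_sentence = True
    if in_sentence:
        yield ""  # close the last sentence if the input lacks a final blank line


# Hand-written sample in CoNLL-U layout (10 tab-separated columns).
SAMPLE = (
    "# text = Todo es amor.\n"
    "1\tTodo\ttodo\tPRON\t_\t_\t3\tnsubj\t_\t_\n"
    "2\tes\tser\tAUX\t_\t_\t3\tcop\t_\t_\n"
    "3\tamor\tamor\tNOUN\t_\t_\t0\troot\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for out_line in conllu_to_lemmatizer_train(SAMPLE.splitlines()):
        print(out_line)
```

To convert a real file, pass an open file object (for example es_gsd-ud-dev.conllu opened with encoding="utf-8") instead of SAMPLE.splitlines() and write each yielded line, followed by a newline, to the training file.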