Posted to users@opennlp.apache.org by Leonel de Alencar <le...@yahoo.com.br> on 2013/01/10 21:31:20 UTC

problem in training a new portuguese model


 Hi!

I'm trying to train a new Brazilian Portuguese tagger model from the MAC-Morpho corpus.
I used the command below, but no model was created. Instead, I just got a usage message.

opennlp POSTaggerTrainer -lang pt -model-type maxent -encoding utf-8 -data pt-pos.train -model pt-pos-maxent.bin
Usage: opennlp POSTaggerTrainer [-type maxent|perceptron|perceptron_sequence] [-dict dictionaryPath] [-ngram cutoff] [-params paramsFile] -lang language [-cutoff num] [-iterations num] [-encoding charsetName] -data trainData -model modelFile

Arguments description:
        -type maxent|perceptron|perceptron_sequence
                The type of the token name finder model. One of maxent|perceptron|perceptron_sequence.
        -dict dictionaryPath
                The XML tag dictionary file
        -ngram cutoff
                NGram cutoff. If not specified will not create ngram dictionary.
        -params paramsFile
                Training parameters file.
        -lang language
                specifies the language which is being processed.
        -cutoff num
                specifies the min number of times a feature must be seen. It is ignored if a parameters file is passed.
        -iterations num
                specifies the number of training iterations. It is ignored if a parameters file is passed.
        -encoding charsetName
                specifies the encoding which should be used for reading and writing text. If not specified the system default will be used.
        -data trainData
                the data to be used during training
        -model modelFile
                the output model file

Here is an excerpt from the pt-pos.train file:

Jersei_N atinge_V média_N de_PREP Cr$_CUR 1,4_NUMmilhão_N em_PREP|+ a_ART venda_N de_PREP|+ a_ART Pinhal_NPROP em_PREP São_NPROP Paulo_NPROP ._.

Programe_V sua_PROADJviagem_N a_PREP|+ a_ART Exposição_NPROPNacional_NPROP do_NPROP Zebu_NPROP ,_, que_PRO-KS-REL começa_V dia_N 25_N|AP ._.

Safra_N recorde_ADJ e_KC disponibilidade_N de_PREP crédito_N ativam_V vendas_N de_PREP máquinas_N agrícolas_ADJ ._.

A_ART degradação_N de_PREP|+ as_ART terras_N por_PREP|+ o_ART mau_ADJ uso_N de_PREP|+ os_ART solos_N avança_V em_PREP|+ o_ART ._.

A_ART desertificação_N tornou_V crítica_ADJ a_ART produtividade_N de_PREP 52_NUM mil_NUM km²_N em_PREP|+ a_ART região_N ._.



I would appreciate it if someone could help me!

Best,
Leonel

Re: problem in training a new portuguese model

Posted by James Kosin <ja...@gmail.com>.

On 1/10/2013 3:31 PM, Leonel de Alencar wrote:
>
>   Hi!
>
> I'm trying to train a new Brazilian Portuguese tagger model from the MAC-Morpho corpus.
> I used the command below, but no model was created. Instead, I just got a usage message.
>
> opennlp POSTaggerTrainer -lang pt -model-type maxent -encoding utf-8 -data pt-pos.train -model pt-pos-maxent.bin
Here you have '-model-type'; it should be '-type' instead. The trainer does not recognize '-model-type', which is why you get the usage message.

I'm creating a JIRA issue to add detection and reporting of bad
parameters; that would be a good additional feature.
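
To illustrate the kind of check described above, here is a minimal Python sketch. It is not OpenNLP code and not how the CLI parses arguments; the flag list is simply copied from the usage message quoted in this thread, and the helper name unknown_flags is made up for the example.

```python
# Flags copied from the usage message quoted in this thread.
# This is an illustration only, not OpenNLP's own argument parser.
KNOWN_FLAGS = {
    "-type", "-dict", "-ngram", "-params", "-lang",
    "-cutoff", "-iterations", "-encoding", "-data", "-model",
}


def unknown_flags(args):
    """Return every argument that starts with '-' but is not a known flag.

    Caveat: a value that itself starts with '-' would also be reported.
    """
    return [a for a in args if a.startswith("-") and a not in KNOWN_FLAGS]


# Arguments from the original command in this thread.
original = ("-lang pt -model-type maxent -encoding utf-8 "
            "-data pt-pos.train -model pt-pos-maxent.bin").split()

# Same arguments with '-model-type' renamed to '-type'.
corrected = ["-type" if a == "-model-type" else a for a in original]

print(unknown_flags(original))   # ['-model-type']
print(unknown_flags(corrected))  # []
print("opennlp POSTaggerTrainer " + " ".join(corrected))
```

The last line prints the command with only the flag name changed; everything else is as originally posted.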

James

Re: problem in training a new portuguese model

Posted by James Kosin <ja...@gmail.com>.

JIRA issue has been created here:
     https://issues.apache.org/jira/browse/OPENNLP-556

Re: problem in training a new portuguese model

Posted by "Jim - FooBar();" <ji...@gmail.com>.

Your data doesn't look right. It should look like this (from the docs):

About_IN 10_CD Euro_NNP ,_, I_PRP reckon_VBP ._.
That_DT sounds_VBZ good_JJ ._.

One sentence per line, of course.
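
A rough way to check a training file against that format is sketched below in Python. It assumes the tag is whatever follows the last underscore in each token, and it drops blank lines; whether blank lines actually upset the trainer is not established in this thread. The function names and the last sample line are hypothetical, made up for the example.

```python
def check_sentence(line):
    """Return the tokens in one line that are not of the form word_TAG."""
    bad = []
    for token in line.split():
        # Assumption: the tag is the part after the LAST underscore.
        word, sep, tag = token.rpartition("_")
        if not sep or not word or not tag:
            bad.append(token)
    return bad


def clean(lines):
    """Drop blank lines; return (kept_lines, [(line_number, bad_tokens), ...])."""
    kept, problems = [], []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip empty lines between sentences
        kept.append(line.strip())
        bad = check_sentence(line)
        if bad:
            problems.append((number, bad))
    return kept, problems


sample = [
    "About_IN 10_CD Euro_NNP ,_, I_PRP reckon_VBP ._.",
    "",
    "That_DT sounds_VBZ good_JJ ._.",
    "",
    "broken token_N ._.",  # hypothetical malformed line for the demo
]

kept, problems = clean(sample)
print(len(kept))   # 3
print(problems)    # [(5, ['broken'])]
```

To run it on a real file, read the lines with open(path, encoding="utf-8") and write the kept lines back out, one sentence per line.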

Apart from this, the message you're getting suggests that you're not
using the command correctly, so you're being redirected to the help
message. However, the command you posted seems correct to me. Strange...

I'm not sure if this helps you. Did you try using the API? Do you know
Java or some other JVM-hosted language?

Jim
