Posted to users@opennlp.apache.org by Nicolas Hernandez <ni...@gmail.com> on 2013/05/13 15:44:36 UTC

Error writing model file due to a java writeUTF method problem

Hi All

I've tried to use the postagger command to learn models of various
morphological features. Even though I know it is not designed for this, I
also tried to build a model for lemma tagging.

Below is the error I got [1]. The problem is due to the fact that
java.io.DataOutputStream.writeUTF cannot serialize a string whose
encoded form is longer than 65535 bytes (64KB).
[2] describes the problem and gives some workarounds.
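
To illustrate, here is a minimal sketch of the limit and of one common
workaround: writing an int length prefix followed by the raw UTF-8 bytes
instead of calling writeUTF. This is not OpenNLP code. The class and method
names (LongStringIO, writeLongUTF, readLongUTF) are hypothetical, and a model
written this way would not be readable by the existing model reader.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only (not part of OpenNLP).
class LongStringIO {

    // Writes a 4-byte length prefix followed by the UTF-8 bytes, so the
    // string is not bound by the 2-byte length field used by writeUTF.
    static void writeLongUTF(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Reads back a string written by writeLongUTF.
    static String readLongUTF(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] bytes = new byte[length];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Returns true if DataOutputStream.writeUTF rejects the string.
    static boolean writeUTFFails(String s) throws IOException {
        DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream());
        try {
            out.writeUTF(s);
            return false;
        } catch (UTFDataFormatException e) {
            return true;
        }
    }

    // Serializes and deserializes a string with the length-prefixed format.
    static String roundTrip(String s) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            writeLongUTF(out, s);
        }
        try (DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buffer.toByteArray()))) {
            return readLongUTF(in);
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a string of the same encoded size as in the error report.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 153687; i++) {
            sb.append('a');
        }
        String big = sb.toString();
        System.out.println("writeUTF fails: " + writeUTFFails(big));
        System.out.println("round trip ok: " + roundTrip(big).equals(big));
    }
}
```

The length prefix changes the on-disk format, so the reader and the writer
would have to be changed together.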

What do you think?

/Nicolas

[1] Writing pos tagger model ... failed
Error during writing model file '/tmp/train-lemma.model'
encoded string too long: 153687 bytes
java.io.UTFDataFormatException: encoded string too long: 153687 bytes
    at java.io.DataOutputStream.writeUTF(DataOutputStream.java:364)
    at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)
    at opennlp.maxent.io.BinaryGISModelWriter.writeUTF(BinaryGISModelWriter.java:73)

[2] http://www.drillio.com/en/software-development/java/encoded-string-too-long-64kb-limit/

Re: Error writing model file due to a java writeUTF method problem

Posted by Nicolas Hernandez <ni...@gmail.com>.
This was not intentional. As I said, I wanted to use the
POSTaggerTrainer with a tagset whose values would be word lemmas.
Consequently, instead of having thirty tag values, I had thirty
thousand distinct tag values... I did it out of curiosity; the approach
works fine for predicting gender, number, person...

Here is an excerpt of my corpus:
Il_il est_être vrai_vrai ,_, si_si l'on_on en_en croit_croire le_le
rapport_rapport Delors_Delors que_que c'_ce est_être un_un
organisme_organisme du_de#:#le même_même genre_genre que_que l'on_on
veut_vouloir créer_créer au_à#:#le bénéfice_bénéfice de_de l'_le
Europe_Europe tout_entière_tout#:#entière ._.

I opened the following issue:
https://issues.apache.org/jira/browse/OPENNLP-578

/Nicolas

On Mon, May 13, 2013 at 6:13 PM, Jörn Kottmann <ko...@gmail.com> wrote:
> On 05/13/2013 03:44 PM, Nicolas Hernandez wrote:
>>
>> I've tried to use the postagger command to learn models of various
>> morphological features. Even though I know it is not designed for this, I
>> also tried to build a model for lemma tagging.
>
>
> Looks like we do not support strings for features larger than 64KB. As
> pointed out, this seems to be a bug in our serializer code. Anyway, why do
> you use such large strings for features? Is this intentional?
>
> Would you mind opening a jira issue for this?
>
> Thanks,
> Jörn



-- 
Dr. Nicolas Hernandez
Associate Professor (Maître de Conférences)
Université de Nantes - LINA CNRS UMR 6241
http://enicolashernandez.blogspot.com
http://www.univ-nantes.fr/hernandez-n
+33 (0)2 51 12 53 94
+33 (0)2 40 30 60 67

Re: Error writing model file due to a java writeUTF method problem

Posted by Jörn Kottmann <ko...@gmail.com>.
On 05/13/2013 03:44 PM, Nicolas Hernandez wrote:
> I've tried to use the postagger command to learn models of various
> morphological features. Even though I know it is not designed for this, I
> also tried to build a model for lemma tagging.

Looks like we do not support strings for features larger than 64KB. As
pointed out, this seems to be a bug in our serializer code. Anyway, why do
you use such large strings for features? Is this intentional?

Would you mind opening a jira issue for this?

Thanks,
Jörn