Posted to users@opennlp.apache.org by Nicolas Hernandez <ni...@gmail.com> on 2013/05/13 15:44:36 UTC
Error writing model file due to a java writeUTF method problem
Hi all,
I've tried to use the postagger command to learn models of various
morphological features. Even though I know it is not designed for that, I
also tried to build a model for lemma tagging.
Below you will see the error I got [1]. The problem is that
java.io.DataOutputStream.writeUTF cannot serialize strings whose encoded
form is larger than 64KB.
[2] presents the problem and gives some workarounds.
What do you think?
/Nicolas
[1] Writing pos tagger model ... failed
Error during writing model file '/tmp/train-lemma.model'
encoded string too long: 153687 bytes
java.io.UTFDataFormatException: encoded string too long: 153687 bytes
at java.io.DataOutputStream.writeUTF(DataOutputStream.java:364)
at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)
at opennlp.maxent.io.BinaryGISModelWriter.writeUTF(BinaryGISModelWriter.java:73)
[2] http://www.drillio.com/en/software-development/java/encoded-string-too-long-64kb-limit/
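For readers who want to reproduce the limit outside OpenNLP, here is a minimal sketch. It first checks whether DataOutputStream.writeUTF accepts a string, then shows one possible workaround: a 4-byte length prefix followed by the UTF-8 bytes. The class LongStringDemo and the helpers fitsWriteUTF, writeLong, readLong and repeat are hypothetical names made up for this illustration; they are not part of OpenNLP and not necessarily how the serializer was or should be fixed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of OpenNLP.
class LongStringDemo {

    // Builds a string of n copies of c (avoids String.repeat, which needs Java 11).
    public static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    // Returns true if writeUTF accepts the string, false if it rejects it
    // with UTFDataFormatException ("encoded string too long").
    public static boolean fitsWriteUTF(String s) {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s);
            return true;
        } catch (UTFDataFormatException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Assumed workaround: write an int length followed by the raw UTF-8 bytes,
    // so the length is no longer limited to an unsigned 16-bit value.
    public static byte[] writeLong(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
            out.writeInt(bytes.length);
            out.write(bytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Reads back a string written by writeLong.
    public static String readLong(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            byte[] bytes = new byte[in.readInt()];
            in.readFully(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // 153687 is the encoded size reported in the stack trace above.
        String big = repeat('x', 153687);
        System.out.println("writeUTF accepts it: " + fitsWriteUTF(big));
        System.out.println("round trip ok: " + big.equals(readLong(writeLong(big))));
    }
}
```

Note that a length-prefixed layout like this is not readable with readUTF, so adopting it in a model writer would change the file format and need a matching change in the reader.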
Re: Error writing model file due to a java writeUTF method problem
Posted by Nicolas Hernandez <ni...@gmail.com>.
This was not intentional. As I said, I wanted to use the
POSTaggerTrainer with a tagset whose values would be word lemmas.
Consequently, instead of having thirty tag values, I had thirty
thousand distinct tag values... I did it out of curiosity; the approach
works fine for predicting gender, number, person...
Here is an excerpt of my corpus:
Il_il est_être vrai_vrai ,_, si_si l'on_on en_en croit_croire le_le
rapport_rapport Delors_Delors que_que c'_ce est_être un_un
organisme_organisme du_de#:#le même_même genre_genre que_que l'on_on
veut_vouloir créer_créer au_à#:#le bénéfice_bénéfice de_de l'_le
Europe_Europe tout_entière_tout#:#entière ._.
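To get a feel for how quickly the tagset grows with this kind of data, the distinct tags of a corpus in this word_tag format can be counted with a few lines of Java. This is only a sketch: the class TagCounter and the method distinctTags are hypothetical names, and it assumes the tag is everything after the last underscore of each whitespace-separated token (so that a token such as tout_entière_tout#:#entière yields the tag tout#:#entière).

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of OpenNLP.
class TagCounter {

    // Collects the distinct tags found on one line of word_tag tokens.
    // Assumption: the tag starts after the LAST underscore of each token.
    public static Set<String> distinctTags(String line) {
        Set<String> tags = new LinkedHashSet<>();
        for (String token : line.trim().split("\\s+")) {
            int split = token.lastIndexOf('_');
            if (split < 0) {
                continue; // token carries no tag, skip it
            }
            tags.add(token.substring(split + 1));
        }
        return tags;
    }

    public static void main(String[] args) {
        // A few tokens taken from the excerpt above.
        String line = "si_si l'on_on en_en croit_croire le_le rapport_rapport Delors_Delors que_que";
        Set<String> tags = distinctTags(line);
        System.out.println(tags.size() + " distinct tags: " + tags);
    }
}
```

With lemmas as tags, almost every new word form adds a new outcome, which is how the tagset reaches tens of thousands of values.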
I opened the following issue:
https://issues.apache.org/jira/browse/OPENNLP-578
/Nicolas
On Mon, May 13, 2013 at 6:13 PM, Jörn Kottmann <ko...@gmail.com> wrote:
> On 05/13/2013 03:44 PM, Nicolas Hernandez wrote:
>>
>> I've tried to use the postagger command to learn models of various
>> morphological features. Even though I know it is not designed for that, I
>> also tried to build a model for lemma tagging.
>
>
> Looks like we do not support feature strings larger than 64KB. As pointed
> out, this seems to be a bug in our serializer code. Anyway, why do you use
> such large strings for features? Is this intentional?
>
> Would you mind opening a Jira issue for this?
>
> Thanks,
> Jörn
--
Dr. Nicolas Hernandez
Associate Professor (Maître de Conférences)
Université de Nantes - LINA CNRS UMR 6241
http://enicolashernandez.blogspot.com
http://www.univ-nantes.fr/hernandez-n
+33 (0)2 51 12 53 94
+33 (0)2 40 30 60 67
Re: Error writing model file due to a java writeUTF method problem
Posted by Jörn Kottmann <ko...@gmail.com>.
On 05/13/2013 03:44 PM, Nicolas Hernandez wrote:
> I've tried to use the postagger command to learn models of various
> morphological features. Even though I know it is not designed for that, I
> also tried to build a model for lemma tagging.
Looks like we do not support feature strings larger than 64KB. As pointed
out, this seems to be a bug in our serializer code. Anyway, why do you use
such large strings for features? Is this intentional?
Would you mind opening a Jira issue for this?
Thanks,
Jörn