Posted to users@opennlp.apache.org by Nicolas Hernandez <ni...@gmail.com> on 2011/06/15 16:46:51 UTC

UIMA TokenizerTrainer component : the model file is not created

Hello

Has anyone already used the UIMA TokenizerTrainer component? I
am a bit confused since it does not create any model file.

In my stdout I got this:
Indexing events using cutoff of 5
	Computing event counts...  done. 69669 events
	Indexing...  done.
Sorting and merging events... done. Reduced 69669 events to 16467.
Done indexing.
Incorporating indexed data for training...
done.
	Number of Event Tokens: 16467
	    Number of Outcomes: 1
	  Number of Predicates: 5624
...done.
Computing model parameters...
Performing 100 iterations.
  1:  .. loglikelihood=0.0	1.0
  2:  .. loglikelihood=0.0	1.0

This looks like a problem I had when I trained the model from the
command line without using the '<SPLIT>' tag. The difference is that
on the command line I also got the following exception:
Exception in thread "main" java.lang.IllegalArgumentException: The
maxent model is not compatible!

I solved this problem by adding the tag, as mentioned in the post
"maxent model is not compatible with Tokenizer training" (Fri, 13 May,
09:33):
 http://mail-archives.apache.org/mod_mbox/incubator-opennlp-users/201105.mbox/browser

Does anyone know if it is the same problem? If so, how do I
specify the '<SPLIT>' tag in the UIMA version? As far as I understand
its role, it is important to give the user the possibility of setting
it.
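
For readers unfamiliar with the tag: in the command-line training data,
'<SPLIT>' marks a token boundary that is not already indicated by
whitespace. The following is only an illustrative sketch in Python, not
OpenNLP code; the function name parse_training_line and the sample
sentence are made up for the example.

```python
# Illustrative only: this is NOT OpenNLP code, just a sketch of the idea
# behind the '<SPLIT>' marker in tokenizer training data.
SPLIT_TAG = "<SPLIT>"


def parse_training_line(line):
    """Return (text, spans): the raw text with the markers removed and the
    (begin, end) character offsets of each token in that text."""
    text_parts = []
    spans = []
    offset = 0
    for chunk_index, chunk in enumerate(line.split(" ")):
        if chunk_index > 0:
            # Whitespace between chunks is kept in the raw text.
            text_parts.append(" ")
            offset += 1
        for piece in chunk.split(SPLIT_TAG):
            if not piece:
                continue
            spans.append((offset, offset + len(piece)))
            text_parts.append(piece)
            offset += len(piece)
    return "".join(text_parts), spans


text, spans = parse_training_line("Hello<SPLIT>, world<SPLIT>!")
print(text)                            # Hello, world!
print([text[b:e] for b, e in spans])   # ['Hello', ',', 'world', '!']
```

Without the markers, the comma and the exclamation mark would stay glued
to the preceding words, and the training data would contain no example of
a split inside a whitespace-delimited chunk.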

More generally, I am interested in any feedback from people who have
successfully built models with the UIMA OpenNLP * Trainer
components. So far I have also had some trouble with the SentenceTrainer,
and I have not tested the others.

/Nicolas


-- 
nicolas.hernandez@univ-nantes.fr
#
http://enicolashernandez.blogspot.com
http://www.univ-nantes.fr/hernandez-n
#
Laboratoire LINA-TALN CNRS UMR 6241
tel. +33 (0)2 51 12 58 55
#
Université de Nantes - Institut Universitaire de Technologie -
Département Informatique
tel. +33 (0)2 40 30 60 67

Re: UIMA TokenizerTrainer component : the model file is not created

Posted by Tommaso Teofili <to...@gmail.com>.

Nicolas,
After re-training the sentence detector with the OpenNLP UIMA component, I
noticed the problem. While using the command line tools, I hadn't noticed it.
Regards,
Tommaso

2011/6/22 Nicolas Hernandez <ni...@gmail.com>

> Tommaso,
>
> Concerning the sentence boundaries detection problem: After asking
> Jörn, I opened the following jira [1]
>
> Regards
>
> /Nicolas
>
> [1] https://issues.apache.org/jira/browse/OPENNLP-203
>
>

Re: UIMA TokenizerTrainer component : the model file is not created

Posted by Nicolas Hernandez <ni...@gmail.com>.

Tommaso,

Concerning the sentence boundaries detection problem: After asking
Jörn, I opened the following jira [1]

Regards

/Nicolas

[1] https://issues.apache.org/jira/browse/OPENNLP-203





-- 
nicolas.hernandez@univ-nantes.fr
#
http://enicolashernandez.blogspot.com
http://www.univ-nantes.fr/hernandez-n
#
Laboratoire Informatique de Nantes Atlantique CNRS UMR 6241
tel. +33 (0)2 51 12 58 55
#
Université de Nantes - Institut Universitaire de Technologie -
Département Informatique
tel. +33 (0)2 40 30 60 67

Re: UIMA TokenizerTrainer component : the model file is not created

Posted by Tommaso Teofili <to...@gmail.com>.

Hello Nicolas,

2011/6/17 Nicolas Hernandez <ni...@gmail.com>

> Tommaso, you said you successfully used the OpenNLP UIMA trainers.
>
> I am currently trying to build French models for the various tasks
> OpenNLP can handle. Since I am also involved in UIMA work, I
> wanted to test the OpenNLP UIMA components for that.
> My goal is to donate the models to the OpenNLP community (i.e. in
> http://opennlp.sourceforge.net/models-1.5/)
>
> Before testing the TokenizerTrainer, I tested the SentenceDetector. I
> found at least two problems with the UIMA component
> https://issues.apache.org/jira/browse/OPENNLP-197
> One of them is not yet referenced in the jira, but I am curious to
> know whether you encountered it.
>
> I noticed that models trained with the UIMA component give wrong
> begin/end offsets, even though they manage to split the text into
> sentences. The current sentence begins with the final punctuation
> character of the previous one as its first token, while the
> previous one does not include it as its last token.
>
> Have you noticed the problem?
>

I didn't notice that, but I will rerun my tests to check; I may have
missed it.
I'll let you know how it goes.
Regards,
Tommaso


>
> I think that, above all, my problems are due to the lack of
> documentation for the UIMA integration. I plan to write a blog post
> about my experience. Since I see there is an open issue for that
> (https://issues.apache.org/jira/browse/OPENNLP-49), if I manage to find
> the time to write the post, I can try to write it in such a way that it
> can also be contributed to the documentation (if you are
> interested).

Re: UIMA TokenizerTrainer component : the model file is not created

Posted by Nicolas Hernandez <ni...@gmail.com>.

Tommaso, you said you successfully used the OpenNLP UIMA trainers.

I am currently trying to build French models for the various tasks
OpenNLP can handle. Since I am also involved in UIMA work, I
wanted to test the OpenNLP UIMA components for that.
My goal is to donate the models to the OpenNLP community (i.e. in
http://opennlp.sourceforge.net/models-1.5/)

Before testing the TokenizerTrainer, I tested the SentenceDetector. I
found at least two problems with the UIMA component
https://issues.apache.org/jira/browse/OPENNLP-197
One of them is not yet referenced in the jira, but I am curious to
know whether you encountered it.

I noticed that models trained with the UIMA component give wrong
begin/end offsets, even though they manage to split the text into
sentences. The current sentence begins with the final punctuation
character of the previous one as its first token, while the
previous one does not include it as its last token.

Have you noticed the problem?
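
The symptom described above can be checked mechanically once the sentence
spans are available as (begin, end) character offsets. This is only a
minimal sketch in Python, not part of OpenNLP or UIMA: the helper name
misplaced_boundaries, the punctuation set and the sample text are
assumptions made for the illustration.

```python
# Hypothetical diagnostic helper; the punctuation set is an assumption.
SENTENCE_FINAL = ".!?"


def misplaced_boundaries(text, sentence_spans):
    """Return the indices of sentences whose covered text starts with a
    sentence-final punctuation character."""
    return [
        index
        for index, (begin, end) in enumerate(sentence_spans)
        if begin < end and text[begin] in SENTENCE_FINAL
    ]


text = "Bonjour. Merci."
expected_spans = [(0, 8), (9, 15)]   # "Bonjour." / "Merci."
shifted_spans = [(0, 7), (7, 15)]    # "Bonjour"  / ". Merci."

print(misplaced_boundaries(text, expected_spans))  # []
print(misplaced_boundaries(text, shifted_spans))   # [1]
```

Running such a check over the output of a model trained on the command
line and one trained with the UIMA component would show whether only the
latter produces the shifted offsets.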

I think that, above all, my problems are due to the lack of
documentation for the UIMA integration. I plan to write a blog post
about my experience. Since I see there is an open issue for that
(https://issues.apache.org/jira/browse/OPENNLP-49), if I manage to find
the time to write the post, I can try to write it in such a way that it
can also be contributed to the documentation (if you are
interested).



On Thu, Jun 16, 2011 at 3:52 PM, Nicolas Hernandez
<ni...@gmail.com> wrote:
> Hello Tommaso,
>
> After some more tests, I think I have found how to reproduce my problem.
>
> Tommaso, you're right: it works fine with the pipeline you described
> (i.e. with the WhitespaceTokenizer followed by the token trainer
> (wst-tokenTrainer-AAE)), but only if the input texts are formatted as
> 'normal' texts.
> I tested the pipeline with texts already formatted in a 'wst' way (one
> sentence per line, with tokens separated by a whitespace character), and
> in that case it no longer works (despite the presence of the
> sentence and token annotations).
>
> So my guess is that on the command line the tokenTrainer needs input in
> wst format (with '<SPLIT>' tags), whereas the OpenNLP UIMA tokenTrainer
> needs, in some way, 'detokenized' text.
>
> If needed, I can open a 'question' issue and attach the texts I used
> to reproduce the problem.
>
> /Nicolas
>
> ---------- Forwarded message ----------
> From: Tommaso Teofili <to...@gmail.com>
> Date: Wed, Jun 15, 2011 at 5:30 PM
> Subject: Re: UIMA TokenizerTrainer component : the model file is not created
> To: opennlp-users@incubator.apache.org, nicolas.hernandez@univ-nantes.fr
>
>
> Hello Nicolas,
> I successfully used the OpenNLP UIMA TokenizerTrainer and also the
> other trainers. As a simple proof, I created an aggregate analysis
> engine descriptor with the UIMA WhitespaceTokenizer and the OpenNLP
> TokenizerTrainer in a fixed flow, then used a
> FileSystemCollectionReader to feed the pipeline.
> In the TokenizerTrainer I set:
> <nameValuePair>
>   <name>opennlp.uima.TokenType</name>
>   <value>
>     <string>org.apache.uima.TokenAnnotation</string>
>   </value>
> </nameValuePair>
> <nameValuePair>
>   <name>opennlp.uima.language</name>
>   <value>
>     <string>en-US</string>
>   </value>
> </nameValuePair>
> <nameValuePair>
>   <name>opennlp.uima.ModelName</name>
>   <value>
>     <string>target/Tokens.bin</string>
>   </value>
> </nameValuePair>
>
> which then created the Tokens.bin model that I was able to test from
> the command line and via the APIs.
> Are you using it in a different way?
> Regards,
> Tommaso




Re: UIMA TokenizerTrainer component : the model file is not created

Posted by Nicolas Hernandez <ni...@gmail.com>.

Hello Tommaso,

After some more tests, I think I have found how to reproduce my problem.

Tommaso, you are right: it works fine with the pipeline you described
(i.e. the WhitespaceTokenizer followed by the token trainer,
wst-tokenTrainer-AAE), but only if the input texts are formatted as
'normal' texts.
I tested the pipeline with texts already formatted in a 'wst' way (one
sentence per line, with tokens separated by a whitespace character), and
in that case it no longer works, despite the presence of the sentence
and token annotations.

So my guess is that on the command line the tokenTrainer needs
wst-formatted input (with '<SPLIT>' tags), whereas the OpenNLP UIMA
tokenTrainer needs, in some way, 'detokenized' text.

If needed, I can open a 'question' issue and attach the texts I used
to reproduce the problem.

/Nicolas






Re: UIMA TokenizerTrainer component : the model file is not created

Posted by Tommaso Teofili <to...@gmail.com>.

Hello Nicolas,
I successfully used the OpenNLP UIMA TokenizerTrainer and also the other
trainers. As a simple proof, I created an aggregate analysis engine
descriptor with the UIMA WhitespaceTokenizer and the OpenNLP
TokenizerTrainer in a fixed flow, then used a FileSystemCollectionReader
to feed the pipeline.
In the TokenizerTrainer I set:

<nameValuePair>
  <name>opennlp.uima.TokenType</name>
  <value>
    <string>org.apache.uima.TokenAnnotation</string>
  </value>
</nameValuePair>
<nameValuePair>
  <name>opennlp.uima.language</name>
  <value>
    <string>en-US</string>
  </value>
</nameValuePair>
<nameValuePair>
  <name>opennlp.uima.ModelName</name>
  <value>
    <string>target/Tokens.bin</string>
  </value>
</nameValuePair>

This created the Tokens.bin model, which I was able to test from the
command line and via the APIs.
Are you using it in a different way?
Regards,
Tommaso



Re: UIMA TokenizerTrainer component : the model file is not created

Posted by Jörn Kottmann <ko...@gmail.com>.

On 6/15/11 4:46 PM, Nicolas Hernandez wrote:
> Does anyone know if it is the same problem ? In that case, how to
> specify the '<SPLIT>' tag in the UIMA version? As much as I understand
> its role, it is important to let the user the possibility of setting
> it.
The <SPLIT> tag is not supported by the UIMA version of the trainer; there
you simply annotate your tokens with a UIMA annotation. The training code
does not work when you annotate whitespace-tokenized text, because it then
cannot figure out which tokens were written together in the original text
and which were not.

In UIMA you usually want to work with the original text, which is usually
not whitespace tokenized. To track the tokens, token annotations can be
added to the CAS.

I guess in your test the serialization code failed because the model had
only one outcome. That can be considered a bug and should be fixed in some
way.
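
To make the point above concrete, here is a minimal sketch of how token
annotations over the original text carry the same information as the
'<SPLIT>' tag on the command line. It is not the OpenNLP training code: the
class SplitTagSketch and its methods are hypothetical names used only for
illustration, and tokens are assumed to be plain [begin, end) character
offsets. Two tokens that touch were written together (a split the model has
to learn); two tokens with a gap between them were separated by whitespace.
In whitespace-tokenized text no two tokens ever touch, so there are no
split examples at all, which would be consistent with the
"Number of Outcomes: 1" line in the log.

```java
class SplitTagSketch {

    /**
     * Renders token spans in a <SPLIT>-style format: "<SPLIT>" between
     * tokens that touch in the original text, a space otherwise.
     * Each span is {begin, end} with an exclusive end offset.
     */
    static String toSplitFormat(String text, int[][] spans) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < spans.length; i++) {
            if (i > 0) {
                // No gap between the previous token and this one means
                // they were written together in the original text.
                boolean touching = spans[i - 1][1] == spans[i][0];
                sb.append(touching ? "<SPLIT>" : " ");
            }
            sb.append(text, spans[i][0], spans[i][1]);
        }
        return sb.toString();
    }

    /** Counts token boundaries that are not marked by whitespace. */
    static int countSplits(int[][] spans) {
        int splits = 0;
        for (int i = 1; i < spans.length; i++) {
            if (spans[i - 1][1] == spans[i][0]) {
                splits++;
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // Original text: punctuation is attached to the words.
        String original = "Hello, world!";
        int[][] originalSpans = {{0, 5}, {5, 6}, {7, 12}, {12, 13}};
        System.out.println(toSplitFormat(original, originalSpans));
        // prints: Hello<SPLIT>, world<SPLIT>!
        System.out.println(countSplits(originalSpans)); // prints: 2

        // Whitespace-tokenized text: every token is surrounded by spaces.
        String wst = "Hello , world !";
        int[][] wstSpans = {{0, 5}, {6, 7}, {8, 13}, {14, 15}};
        System.out.println(toSplitFormat(wst, wstSpans));
        // prints: Hello , world !
        System.out.println(countSplits(wstSpans)); // prints: 0
    }
}
```

With the second input there is nothing to distinguish "split here" from
"do not split here", so whatever training data is derived from it contains
only one kind of example.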

Jörn