Posted to users@opennlp.apache.org by vishvAs vAsuki <vi...@gmail.com> on 2011/07/26 01:26:55 UTC
An observation about the MAXENT tagger and CAPS
Here is an observation about the MAXENT tagger which may be of interest to
others.
I recently tried to replicate the tagging results described in the wiki
(http://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Conll06#Train_a_tokenizer_model)
while calling the tagging API
(http://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Postagger)
from my Scala code. As with the command line tool, I used the parameters
numIterations = 100 and event-threshold = 5. The only difference was in how
the sample stream passed to the tagging API was created: I built it with my
own Scala code (which looked fine to the naked eye), but my code read the
words in all CAPS. This resulted in a slight but noticeable decline in
accuracy, e.g. 0.96 vs. 0.95. (More detailed output appended.)
Note that the sample streams for both the test and training data were in CAPS,
so perhaps the model treats “Port” and “port” differently.
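(One way to see why uppercasing would hurt: if the surface form of each token
is itself a predicate, "Port" and "PORT" yield different predicate strings, so
a model trained on mixed-case text never sees the all-caps forms. A toy Scala
sketch of this effect; the feature names here are illustrative, not OpenNLP's
actual feature generators.)

```scala
// Toy predicate extractor: the current word is one feature, plus a flag
// for whether the token starts with a capital letter. Illustrative only,
// not OpenNLP's real feature set.
def predicates(word: String): Set[String] =
  Set(s"w=$word", s"initCap=${word.headOption.exists(_.isUpper)}")

val mixed = predicates("Port")   // contains "w=Port" and "initCap=true"
val caps  = predicates("PORT")   // contains "w=PORT" and "initCap=true"

// The surface-form predicates differ, so only the coarser shape
// feature is shared between the two spellings.
println(mixed == caps)               // false
println(mixed.intersect(caps))       // Set(initCap=true)
```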
=== Command line case===
Sorting and merging events... done. Reduced 206678 events to 193001.
...
Number of Event Tokens: 193001
Number of Outcomes: 22
Number of Predicates: 29155
...done.
Computing model parameters...
Performing 100 iterations.
1: .. loglikelihood=-638850.4721742678 0.13807468622688432
..
100: .. loglikelihood=-13827.506953520902 0.9901537657612325
Accuracy: 0.9659110277825124
=== My code===
Sorting and merging events... done. Reduced 206678 events to 193059.
Done indexing.
Incorporating indexed data for training...
done.
Number of Event Tokens: 193059
Number of Outcomes: 16
Number of Predicates: 27709
...done.
Computing model parameters...
Performing 100 iterations.
1: .. loglikelihood=-573033.0919349034 0.13807468622688432
..
100: .. loglikelihood=-18019.22974368408 0.9831041523529355
Evaluating ... Accuracy: 0.9500596557013806
--
Cheers,
vishvAs
Re: An observation about the MAXENT tagger and CAPS
Posted by Jörn Kottmann <ko...@gmail.com>.
You can find our current documentation here:
http://incubator.apache.org/opennlp/documentation/manual/opennlp.html
Why do you have more event tokens and fewer outcomes in your second run?
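(A quick sanity check for that mismatch is to count the distinct tags each
sample stream actually emits. A Scala sketch, assuming the word_tag per-token
format the POS trainer consumes; the helper name is hypothetical.)

```scala
// Count distinct outcomes (tags) in word_tag formatted lines.
// If two runs report different outcome counts, the two readers are
// probably splitting tokens or tags differently.
def distinctTags(lines: Seq[String]): Set[String] =
  lines.flatMap(_.split("\\s+"))
       .filter(_.contains('_'))
       .map(tok => tok.substring(tok.lastIndexOf('_') + 1))
       .toSet

val sample = Seq("The_DT port_NN opened_VBD ._.")
println(distinctTags(sample))   // the four tags DT, NN, VBD, .
```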
In 1.5.1 we now have built-in converters for conll06; you can see how
to use them with this command:
bin/opennlp POSTaggerConverter conllx
It is still not described in our documentation,
but any help is welcome.
Jörn