Posted to users@opennlp.apache.org by vishvAs vAsuki <vi...@gmail.com> on 2011/07/26 01:26:55 UTC

An observation about the MAXENT tagger and CAPS

Here is an observation about the MAXENT tagger which may be of interest to
others.

I recently tried to replicate the tagging results described in the wiki
(http://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Conll06#Train_a_tokenizer_model)
while calling the tagging API
(http://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Postagger)
from my Scala code. As with the command line tool, I used the parameters
numIterations = 100 and event-threshold = 5. The only difference was in how
the sample stream passed to the tagging API was created: I used my own Scala
code to create it (and it looked fine to the naked eye), but it was reading
the words in all CAPS. This resulted in a slight but noticeable decline in
tagging accuracy, e.g. 0.96 vs. 0.95. (More detailed output appended.)
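
For concreteness, the training call I am talking about looks roughly like
this (a sketch only; the exact POSTaggerME.train overload and the import
paths below are written from memory, so please check them against the
1.5.x javadoc):

import opennlp.tools.postag.{POSModel, POSSample, POSTaggerME}
import opennlp.tools.util.ObjectStream
import opennlp.tools.util.model.ModelType  // import path assumed, not verified

// Same settings as the command line run: MAXENT model, no tag or ngram
// dictionary, event threshold (cutoff) = 5, 100 iterations.
def trainTagger(samples: ObjectStream[POSSample]): POSModel =
  POSTaggerME.train("en", samples, ModelType.MAXENT, null, null, 5, 100)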

Note that the sample streams for both the test and the training data were in
CAPS, so the casing was at least consistent between training and testing;
maybe the model still treats “Port” and “port” differently.
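
And here is a minimal sketch of the kind of sample stream I mean (not my
actual code; the class name and the CoNLL-X column positions are just
illustrative). Presumably keeping the original case of the forms here is
what reproduces the command line behaviour:

import opennlp.tools.postag.POSSample
import opennlp.tools.util.ObjectStream
import scala.collection.mutable.ArrayBuffer
import scala.io.Source

// Reads CoNLL-X data (tab separated, FORM assumed in column 2 and the
// coarse POS tag in column 4, blank line between sentences) into
// POSSample objects. My buggy version was effectively uppercasing the
// forms at this point.
class ConllXSampleStream(path: String) extends ObjectStream[POSSample] {

  private def readSamples(): Iterator[POSSample] = {
    val samples = ArrayBuffer[POSSample]()
    val words = ArrayBuffer[String]()
    val tags = ArrayBuffer[String]()
    for (line <- Source.fromFile(path, "UTF-8").getLines()) {
      if (line.trim.isEmpty) {
        if (words.nonEmpty) samples += new POSSample(words.toArray, tags.toArray)
        words.clear(); tags.clear()
      } else {
        val cols = line.split("\t")
        words += cols(1)  // keep the original case -- do not uppercase
        tags  += cols(3)
      }
    }
    if (words.nonEmpty) samples += new POSSample(words.toArray, tags.toArray)
    samples.iterator
  }

  private var it = readSamples()

  override def read(): POSSample = if (it.hasNext) it.next() else null
  override def reset(): Unit = { it = readSamples() }
  override def close(): Unit = ()
}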


=== Command line case===
Sorting and merging events... done. Reduced 206678 events to 193001.
...
        Number of Event Tokens: 193001
            Number of Outcomes: 22
          Number of Predicates: 29155
...done.
Computing model parameters...
Performing 100 iterations.
  1:  .. loglikelihood=-638850.4721742678       0.13807468622688432
..
100:  .. loglikelihood=-13827.506953520902      0.9901537657612325
Accuracy: 0.9659110277825124

=== My code===
Sorting and merging events... done. Reduced 206678 events to 193059.
Done indexing.
Incorporating indexed data for training...
done.
        Number of Event Tokens: 193059
            Number of Outcomes: 16
          Number of Predicates: 27709
...done.
Computing model parameters...
Performing 100 iterations.
  1:  .. loglikelihood=-573033.0919349034        0.13807468622688432
..
100:  .. loglikelihood=-18019.22974368408        0.9831041523529355
Evaluating ... Accuracy: 0.9500596557013806

--
Cheers,
vishvAs

Re: An observation about the MAXENT tagger and CAPS

Posted by Jörn Kottmann <ko...@gmail.com>.
You can find our current documentation here:
http://incubator.apache.org/opennlp/documentation/manual/opennlp.html

Why do you have more events and fewer outcomes in your second run?

In 1.5.1 we now have built-in converters for conll06; you can see how
to use them with this command:
bin/opennlp POSTaggerConverter conllx

It is still not described in our documentation,
but any help is welcome.

Jörn
