
Posted to users@opennlp.apache.org by Michael Schmitz <sc...@cs.washington.edu> on 2011/06/10 19:27:51 UTC

OpenNLP English maxent pos tagger models for case-insensitive sentences

Hi, I was wondering if the training data for the OpenNLP maxent POS tagger
models is public and available somewhere.  I would like to train models for
the POS tagger and the chunker that work on sentences without case (i.e. all
capitalized).  If I had the training data used for en-pos-maxent.bin, a
first pass would simply mean capitalizing the tokens and running the
trainer.  It appears that the chunker training data comes from CONLL2000
(http://www.cnts.ua.ac.be/conll2000/chunking/).
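
A minimal sketch of that first pass, assuming the training data is in the
OpenNLP POS training format (one sentence per line, each token written as
word_TAG).  The class and method names are hypothetical and are not part of
OpenNLP; only the word part is upper-cased, the tag is left untouched.

```java
import java.util.Locale;
import java.util.StringJoiner;

// Hypothetical helper, not part of OpenNLP.
public class UpperCaseTrainingData {

    /** Upper-cases the word part of every word_TAG pair in one training line. */
    public static String upperCaseTokens(String line) {
        StringJoiner out = new StringJoiner(" ");
        for (String pair : line.trim().split("\\s+")) {
            if (pair.isEmpty()) {
                continue; // blank input line
            }
            // Split on the LAST underscore so words containing '_' survive.
            int sep = pair.lastIndexOf('_');
            if (sep < 0) {
                out.add(pair); // no tag found: pass the token through unchanged
                continue;
            }
            String word = pair.substring(0, sep);
            String tag = pair.substring(sep + 1);
            out.add(word.toUpperCase(Locale.ENGLISH) + "_" + tag);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: THE_DT QUICK_JJ FOX_NN JUMPS_VBZ ._.
        System.out.println(upperCaseTokens("The_DT quick_JJ fox_NN jumps_VBZ ._."));
    }
}
```

Running every line of the training file through upperCaseTokens and feeding
the result to the trainer would give the all-caps variant described above.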

I would be happy to share the models with OpenNLP if anyone thought they
would be of use to others.

Peace.  Michael

Re: OpenNLP English maxent pos tagger models for case-insensitive sentences

Posted by "william.colen@gmail.com" <wi...@gmail.com>.

Hi Michael,

Maybe you could use the CONLL2000 data. What do you think? It includes POS
tags.
To use it you will need to create a new converter:

   1. Create a new POSSample stream for CONLL2000. It is similar to
   ConllXPOSSampleStream
   <http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/formats/ConllXPOSSampleStream.java?view=markup>.
   2. Create a factory for your new class, similar to
   ConllXPOSSampleStreamFactory
   <http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/formats/ConllXPOSSampleStreamFactory.java?view=markup>.
   This class is required to launch the formatter from the command line.
   3. Finally, add the factory to POSTaggerConverter
   <http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/postag/POSTaggerConverter.java?view=markup>.
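
The core of such a converter can be sketched without any OpenNLP classes.
The example below is a standalone illustration, not the POSSample stream or
factory from the steps above: it assumes the CONLL2000 layout of one token
per line ("word POS chunk", whitespace separated) with a blank line between
sentences, and emits one word_TAG sentence per line.  The class and method
names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical standalone converter, not part of OpenNLP.
public class Conll2000ToPosLines {

    /** Turns CONLL2000 token lines into word_TAG sentence lines. */
    public static List<String> convert(List<String> conllLines) {
        List<String> sentences = new ArrayList<>();
        StringJoiner current = new StringJoiner(" ");
        for (String line : conllLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                // A blank line closes the current sentence, if there is one.
                if (current.length() > 0) {
                    sentences.add(current.toString());
                    current = new StringJoiner(" ");
                }
                continue;
            }
            String[] cols = trimmed.split("\\s+");
            if (cols.length < 2) {
                throw new IllegalArgumentException("Expected 'word POS [chunk]': " + line);
            }
            // Keep word and POS tag; the chunk column is dropped.
            current.add(cols[0] + "_" + cols[1]);
        }
        if (current.length() > 0) {
            sentences.add(current.toString()); // input did not end with a blank line
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
            "Confidence NN B-NP", "in IN B-PP", "the DT B-NP", "pound NN I-NP",
            "",
            "He PRP B-NP", "reckons VBZ B-VP");
        // prints: Confidence_NN in_IN the_DT pound_NN
        //         He_PRP reckons_VBZ
        convert(input).forEach(System.out::println);
    }
}
```

A real converter would wrap the same logic in a stream class that yields
POSSample objects, as the ConllX classes linked above do.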

With the converted sample you will be able to train your model as
explained in the documentation.
It would be nice if you could contribute back with a patch adding your new
converter.

Regards,
William


Re: OpenNLP English maxent pos tagger models for case-insensitive sentences

Posted by Jason Baldridge <ja...@gmail.com>.

Michael,

The inability to redistribute training data is a current problem with
retraining and improving models:

https://cwiki.apache.org/OPENNLP/opennlp-annotations.html

Also, see this discussion about "OpenNLP Annotations Proposal" on the
opennlp-dev list:

http://mail-archives.apache.org/mod_mbox/incubator-opennlp-dev/201106.mbox/thread

It might take a little while to get this going, but we're all very keen to
make progress on it!

Jason




-- 
Jason Baldridge
Assistant Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge