Posted to users@opennlp.apache.org by Joern Kottmann <ko...@gmail.com> on 2015/02/02 10:12:00 UTC

Re: Training data sets used in OpenNLP

Hello,

I suggest not using the old models anymore. They, especially the name
finders, don't perform well on recent news articles.

I don't know which data was used to train the sentence detector,
tokenizer and POS tagger. The latter, I guess, could be based on Brown and
Penn Treebank data.

There is support for training with OntoNotes. I think most components can
now be trained on that data, but I only did so for the name finders,
which turned out to work quite well. OntoNotes can be acquired very
cheaply.

Jörn

On Fri, 2015-01-30 at 15:12 +0000, demaidim@cs.man.ac.uk wrote:
> I am using OpenNLP in my research to extract
> terms from an educational corpus, and I would like to ask you about the
> OpenNLP models (chunker, sentence detector, tokenizer, maximum entropy POS
> tagger). What training data set was used? It is clearly stated that
> CoNLL 2000 was used to train the chunker. However, no information is
> provided about the training data used for the sentence detector, tokenizer,
> and maximum entropy POS tagger.