Posted to users@opennlp.apache.org by Aditya Kulkarni <ad...@gmail.com> on 2014/04/02 11:52:49 UTC

Re: obtaining data used to train OpenNLP models

+1
This question hasn't been answered for me either.
It would be a great help to get it answered.

-aditya
On Apr 1, 2014 12:38 AM, "Stuart Robinson" <st...@gmail.com> wrote:

> I've tried using the tokenizer model for English provided by OpenNLP:
>
> http://opennlp.sourceforge.net/models-1.5/en-token.bin
>
> It's listed here, where it's described as "Trained on opennlp training
> data":
>
> http://opennlp.sourceforge.net/models-1.5/
>
> It works pretty well, but I'm working on some social media text that has
> some non-standard punctuation. For example, it's not uncommon for words to
> be separated by a series of punctuation characters, like so:
>
> oooh,,,,go away fever and flu
>
> I want to train up a new model using text like this but don't want to start
> entirely from scratch. Is the training data for this model available from
> OpenNLP? If so, I could experiment with supplementing its training data. It
> seems like sharing training data, and not just trained models, could be a
> great service.
>
> Thanks,
> Stuart Robinson
>
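
For anyone who ends up building supplementary training data for text like the
example above: the OpenNLP tokenizer trainer reads one sentence per line, with
tokens separated either by whitespace or by a <SPLIT> tag where two tokens
touch. Below is a minimal sketch of producing such lines from raw text plus a
hand-made token list. The helper name `annotate` is hypothetical and not part
of OpenNLP, and treating ",,,," as a single token is only an assumption about
how one might want this kind of punctuation handled.

```python
from typing import Sequence

SPLIT = "<SPLIT>"  # boundary tag used in OpenNLP tokenizer training files


def annotate(raw: str, tokens: Sequence[str]) -> str:
    """Rebuild `raw` as one training line (hypothetical helper).

    Tokens that are separated by whitespace in `raw` stay separated by a
    single space; tokens that touch each other get a <SPLIT> tag between them.
    """
    out = []
    pos = 0
    for i, tok in enumerate(tokens):
        start = raw.find(tok, pos)
        if start == -1:
            raise ValueError(f"token {tok!r} not found after offset {pos}")
        gap = raw[pos:start]
        if gap.strip():
            # Something other than whitespace sits between two tokens,
            # so the token list does not cover the raw text.
            raise ValueError(f"text {gap!r} is not covered by any token")
        if i > 0:
            out.append(" " if gap else SPLIT)
        out.append(tok)
        pos = start + len(tok)
    if raw[pos:].strip():
        raise ValueError(f"trailing text {raw[pos:]!r} is not covered by any token")
    return "".join(out)


if __name__ == "__main__":
    line = annotate(
        "oooh,,,,go away fever and flu",
        ["oooh", ",,,,", "go", "away", "fever", "and", "flu"],
    )
    print(line)
```

Lines produced this way could be appended to whatever existing training file
is available before retraining. Whether the run of commas should be one token
or four is a labeling decision that needs to stay consistent across the data.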