Posted to users@opennlp.apache.org by Stuart Robinson <st...@gmail.com> on 2014/03/31 21:08:18 UTC

obtaining data used to train OpenNLP models

I've tried using the tokenizer model for English provided by OpenNLP:

http://opennlp.sourceforge.net/models-1.5/en-token.bin

It's listed here, where it's described as "Trained on opennnlp training
data":

http://opennlp.sourceforge.net/models-1.5/

It works pretty well, but I'm working on some social media text that has
some non-standard punctuation. For example, it's not uncommon for words to
be separated by a run of punctuation characters, like so:

oooh,,,,go away fever and flu

I want to train up a new model using text like this but don't want to start
entirely from scratch. Is the training data for this model available from
OpenNLP? If so, I could experiment with supplementing its training data. It
seems like sharing training data, and not just trained models, could be a
great service.

Thanks,
Stuart Robinson
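[For reference, the model linked above can be loaded and applied with the
OpenNLP 1.5 Java API roughly as follows. This is a minimal sketch; the model
path is an assumption, and class names are taken from the 1.5-series API.]

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Arrays;

import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class TokenizeDemo {
    public static void main(String[] args) throws Exception {
        // Load the pre-trained English tokenizer model (path assumed).
        InputStream modelIn = new FileInputStream("en-token.bin");
        TokenizerModel model = new TokenizerModel(modelIn);
        modelIn.close();

        TokenizerME tokenizer = new TokenizerME(model);

        // The kind of social-media input the model was not trained on:
        String[] tokens = tokenizer.tokenize("oooh,,,,go away fever and flu");
        System.out.println(Arrays.toString(tokens));
    }
}
```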

Re: obtaining data used to train OpenNLP models

Posted by Aditya Kulkarni <ad...@gmail.com>.
+1
I have the same question and haven't found an answer either.
It would be a great help to get it answered.

-aditya

Re: obtaining data used to train OpenNLP models

Posted by Aditya Kulkarni <ad...@gmail.com>.
Well, thanks Jörn. This settles it for me.
Let me see how the two models can be used in tandem. If I observe anything
non-trivial, I'll share it.
-a
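[One stopgap that needs no retraining is a pre-processing pass, not part of
OpenNLP itself, that inserts whitespace around runs of punctuation before the
text reaches the tokenizer, so sequences like ",,,," become separate token
candidates. A hypothetical stdlib-only sketch:]

```java
public class PunctPreSplit {
    // Hypothetical pre-processing pass (not part of OpenNLP): put
    // spaces around runs of punctuation so a downstream tokenizer
    // sees ",,,," as its own token candidate.
    static String preSplit(String text) {
        // $0 is the whole match: either a run of one repeated
        // punctuation character, or a single punctuation character.
        String spaced = text.replaceAll("([\\p{Punct}])\\1+|\\p{Punct}", " $0 ");
        // Trim and collapse any doubled-up whitespace.
        return spaced.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(preSplit("oooh,,,,go away fever and flu"));
        // prints: oooh ,,,, go away fever and flu
    }
}
```

[Whether pre-splitting helps or hurts depends on the downstream model, so it
is worth evaluating against held-out annotated text rather than assuming.]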



Re: obtaining data used to train OpenNLP models

Posted by Jörn Kottmann <ko...@gmail.com>.
Hello,

the training data for the tokenizer is not Open Source and can't be
released due to copyright restrictions.

For best performance you should create your own training data based on 
social media texts.

Jörn
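[For anyone following up: OpenNLP can train a tokenizer from your own
annotated text. The training format is one sentence per line, with token
boundaries that are not whitespace marked by <SPLIT> tags, e.g.
"oooh<SPLIT>,,,,<SPLIT>go away fever and flu". A minimal training sketch is
below; the file names are assumptions, and the exact train(...) overloads
vary a little between 1.5.x releases, so check the manual for your version.]

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class TrainTokenizer {
    public static void main(String[] args) throws Exception {
        // One sentence per line; non-whitespace token boundaries
        // marked with <SPLIT>, e.g.:
        //   oooh<SPLIT>,,,,<SPLIT>go away fever and flu
        ObjectStream<String> lines = new PlainTextByLineStream(
                new FileInputStream("social-train.txt"), "UTF-8");
        ObjectStream<TokenSample> samples = new TokenSampleStream(lines);

        // Train a maxent tokenizer; 'true' enables the alphanumeric
        // optimization (no statistical decisions inside plain words).
        TokenizerModel model = TokenizerME.train("en", samples, true);

        // Write the new model to disk (output name is an assumption).
        OutputStream out = new FileOutputStream("en-token-social.bin");
        model.serialize(out);
        out.close();
    }
}
```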
