
Posted to users@opennlp.apache.org by Chris Spencer <ch...@gmail.com> on 2011/02/14 05:03:38 UTC

Updating Pre-Trained Models

Where would we download the source data and tools used to generate the
pretrained models available at
http://opennlp.sourceforge.net/models-1.5/, specifically for the
English Treebank Parser?

I have a large corpus of hand-corrected sentence/parse-tree pairs, as
well as an extended lexicon, and I'd like to incorporate these into
the training data and retrain a new parser better fitted for my
domain.

Regards,
Chris

Re: Updating Pre-Trained Models

Posted by James Kosin <ja...@gmail.com>.

Chris,

The tools are part of the source code.  The heart is the MAXENT
(maximum entropy) model that is trained on the data.  Most of the
trainers now have command-line interfaces and usually work directly on
the raw training data.  Where the raw training data is in an unsuitable
format, converters have been built; these are also part of the source.
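
As a rough illustration of preparing raw parser training data, here is a
minimal sketch for combining an existing set of parses with extra
hand-corrected ones.  It assumes the trainer reads one bracketed
(Penn Treebank style) parse per line, which is not stated in this
thread and should be checked against the OpenNLP documentation.  The
function names are hypothetical and are not part of OpenNLP.

```python
from typing import Iterable, List


def is_single_tree(line: str) -> bool:
    """Return True if `line` holds exactly one balanced bracketed tree."""
    text = line.strip()
    if not text.startswith("("):
        return False
    depth = 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # closing bracket without a matching opener
            if depth == 0 and text[i + 1:].strip():
                return False  # something follows the end of the first tree
    return depth == 0


def merge_parses(*sources: Iterable[str]) -> List[str]:
    """Merge several collections of parses into one list.

    Whitespace is collapsed so each parse sits on a single line,
    malformed or empty entries are skipped, and exact duplicates are
    dropped while the original order is preserved.
    """
    seen = set()
    merged = []
    for source in sources:
        for raw in source:
            line = " ".join(raw.split())
            if not line or line in seen or not is_single_tree(line):
                continue
            seen.add(line)
            merged.append(line)
    return merged
```

Calling merge_parses(base_lines, my_lines) and writing the result out
one parse per line would give a single file to hand to the trainer,
under the format assumption above.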

The group is currently starting a push to find freely available
corpora (training data).  Most of the training data we currently have
is copyrighted and cannot be released in its raw form.  The models
are fine, because they don't contain any of the original text.
Unfortunately, this means additional training is not possible
without the entire original training set.  Even if you had it, most
of the training takes hours, since the sets contain a great many
samples.  Another unfortunate thing is that most of the data is news
articles, with little taken from other sources.

James

On 2/15/2011 10:37 AM, Chris Spencer wrote:
> I suspected this might be the case. What about the tools used to
> generate the model? Are those freely available or part of OpenNLP?
>
> I tried searching through OpenNLP's codebase, but I'm still new to it,
> so I'm not really sure what I'm looking for.
>
> Regards,
> Chris

Re: Updating Pre-Trained Models

Posted by Chris Spencer <ch...@gmail.com>.

I suspected this might be the case. What about the tools used to
generate the model? Are those freely available or part of OpenNLP?

I tried searching through OpenNLP's codebase, but I'm still new to it,
so I'm not really sure what I'm looking for.

Regards,
Chris

On Mon, Feb 14, 2011 at 5:58 PM, James Kosin <ja...@gmail.com> wrote:
> Chris,
>
> Unfortunately, most... if not all, of the training data is not FREE or
> openly available due to copyright.  If you would like to start a group
> to engage in collecting non-copyrighted text and parse the data by hand
> you are more than welcome and encouraged to do so.
> Jorn or Jason may have a more complete set of training data and could
> help if you pass on your samples.
>
> James

Re: Updating Pre-Trained Models

Posted by James Kosin <ja...@gmail.com>.

Chris,

Unfortunately, most, if not all, of the training data is not free or
openly available due to copyright.  If you would like to start a group
to collect non-copyrighted text and parse the data by hand, you are
more than welcome and encouraged to do so.
Jorn or Jason may have a more complete set of training data and could
help if you pass on your samples.

James

On 2/13/2011 11:03 PM, Chris Spencer wrote:
> Where would we download the source data and tools used to generate the
> pretrained models available at
> http://opennlp.sourceforge.net/models-1.5/, specifically for the
> English Treebank Parser?
>
> I have a large corpus of hand-corrected sentence/parse-tree pairs, as
> well as an extended lexicon, and I'd like to incorporate these into
> the training data and retrain a new parser better fitted for my
> domain.
>
> Regards,
> Chris