Posted to users@opennlp.apache.org by "Jim - FooBar();" <ji...@gmail.com> on 2012/04/21 18:40:39 UTC

Re: New to opennlp

On 13/02/12 23:07, Michael Collins wrote:
> Does opennlp provide a way to create the *.train file based on a body of text which I provide, or is the *.train file created another way.
Apart from the sentence detector, there is no way to automatically create
training data for other tasks (POS, NER, etc.); these are often language
and domain dependent. For the sentence detector, however, it is easy to
create your own private training data (as Jorn said) targeted specifically
at your problem domain, assuming of course that the pre-trained model
is not good enough for you. I find it's pretty good! :)
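
A minimal sketch of preparing such a private training file, assuming the
sentence detector's documented training format of one sentence per line. The
class name, method names, and the file name en-sent.train are made up for
illustration and are not part of OpenNLP. The sentences still have to be
split and checked by a person first; this only writes them out.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not an OpenNLP class): writes hand-checked sentences, one per line. */
final class SentenceTrainFile {

    /** Normalizes each sentence to a single line and drops blank entries. */
    static List<String> toTrainingLines(List<String> sentences) {
        List<String> lines = new ArrayList<>();
        for (String s : sentences) {
            // Collapse line breaks and runs of whitespace so that each
            // sentence occupies exactly one line of the training file.
            String line = s.replaceAll("\\s+", " ").trim();
            if (!line.isEmpty()) {
                lines.add(line);
            }
        }
        return lines;
    }

    /** Writes the normalized sentences to the target file as UTF-8. */
    static void write(Path target, List<String> sentences) throws IOException {
        Files.write(target, toTrainingLines(sentences), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Sample sentences chosen to include abbreviations with a '.' in them.
        List<String> sentences = List.of(
            "Dr. Smith arrived at 9 a.m. on Monday.",
            "The meeting\nstarted late.",
            "Was anyone surprised?");
        Path out = Path.of("en-sent.train"); // illustrative file name
        write(out, sentences);
        System.out.println("Wrote " + toTrainingLines(sentences).size()
            + " sentences to " + out);
    }
}
```

The resulting file would then be passed to whichever sentence detector
trainer your OpenNLP version provides.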

Jim

Re: New to opennlp

Posted by James Kosin <ja...@gmail.com>.
On 4/21/2012 12:40 PM, Jim - FooBar(); wrote:
> On 13/02/12 23:07, Michael Collins wrote:
>> Does opennlp provide a way to create the *.train file based on a body
>> of text which I provide, or is the *.train file created another way.
> Apart from the sentence detector there is no way to automatically
> create training data for other tasks (POS,NER etc)...these are often
> language and domain dependant. For the sentence detector however it is
> easy to create your own private training data (as Jorn said) targeted
> especially for your problem domain. assuming of course that the
> pre-trained model is not good enough for you...i find it's pretty
> good! :)
>
> Jim
The training data is based on a corpus of text already annotated for POS,
names, or other purposes.  Usually the annotation is done by hand, or
generated and rechecked by humans to verify accuracy.
Unfortunately, most corpora are built on copyrighted text, meaning
they cannot be freely distributed.  Most providers release either only
the annotation data, which has to be merged with the original text (i.e.,
you have to run scripts that take multiple files and merge them with the
data to get the final corpus), or only small samples of a corpus.
Either way, the copyright usually prohibits commercial usage or usage
for any reason other than research.

We do have projects we want to start, to build our own corpus based on
freely available text that we can distribute freely for any purpose
based on OpenNLP.

This is also why our models are currently on SourceForge only: they come
with distribution licenses that are not Apache friendly.

James

Re: New to opennlp

Posted by James Kosin <ja...@gmail.com>.
On 4/22/2012 6:18 AM, Jim - FooBar(); wrote:
> On 21/04/12 23:30, James Kosin wrote:
>> On 4/21/2012 12:40 PM, Jim - FooBar(); wrote:
>>> On 13/02/12 23:07, Michael Collins wrote:
>>>> Does opennlp provide a way to create the *.train file based on a body
>>>> of text which I provide, or is the *.train file created another way.
>>> Apart from the sentence detector there is no way to automatically
>>> create training data for other tasks (POS,NER etc)...these are often
>>> language and domain dependant. For the sentence detector however it is
>>> easy to create your own private training data (as Jorn said) targeted
>>> especially for your problem domain. assuming of course that the
>>> pre-trained model is not good enough for you...i find it's pretty
>>> good! :)
>>>
>>> Jim
>> Also, unlike a lot of the other models, the sentence detector can
>> actually be trained and works quite well with just a few sentences to
>> train on.  ~20-30 does really well.
>>
>> James
>>
>
> Wow!!! did not know that!!! I thought the sentence detector needs
> thousands of sentences just like the other models! Thanks James...
>
> Jim
Jim,

The sentence detector is probably the simplest model; next would be the
tokenizer.

The sentence detector only needs to learn where a sentence ends.  In most
cases this is a '.' or other terminating punctuation.
I even trained with a few sentences containing abbreviations that had a
'.' in them as well.  Of course, in my case, with so few sample sentences,
I have to use the parameter that changes the cutoff to 1 instead of the
default 5.
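
To see why the cutoff matters with so few sentences, here is a small
illustration. It is not OpenNLP's code; it assumes the cutoff means "discard
any feature seen fewer than this many times in the training data", and the
class name and feature strings are made up.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustration only (not OpenNLP code): effect of a feature count cutoff. */
final class CutoffDemo {

    /** Counts how often each feature occurs across all training events. */
    static Map<String, Integer> countFeatures(List<List<String>> events) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (List<String> event : events) {
            for (String feature : event) {
                counts.merge(feature, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Keeps only the features seen at least `cutoff` times. */
    static Map<String, Integer> applyCutoff(Map<String, Integer> counts, int cutoff) {
        Map<String, Integer> kept = new LinkedHashMap<>();
        counts.forEach((feature, n) -> {
            if (n >= cutoff) {
                kept.put(feature, n);
            }
        });
        return kept;
    }

    public static void main(String[] args) {
        // Made-up features for three end-of-sentence candidates.
        List<List<String>> events = List.of(
            List.of("prev=Dr", "next=Smith", "nextIsCapitalized"),
            List.of("prev=Monday", "next=The", "nextIsCapitalized"),
            List.of("prev=late", "next=Was", "nextIsCapitalized"));
        Map<String, Integer> counts = countFeatures(events);
        System.out.println("cutoff 5 keeps " + applyCutoff(counts, 5).keySet());
        System.out.println("cutoff 1 keeps " + applyCutoff(counts, 1).keySet());
    }
}
```

With only a handful of training sentences, hardly any feature reaches a
count of 5, so under this assumption the default would leave the model with
almost nothing to learn from, while a cutoff of 1 keeps every feature.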

The tokenizer, though, is trained for more than just splitting off
punctuation, so it will require a bit more data.

The harder ones, like POS, NameFinder, etc., require large volumes of
data to be trained reliably.

James

Re: New to opennlp

Posted by "Jim - FooBar();" <ji...@gmail.com>.
On 21/04/12 23:30, James Kosin wrote:
> On 4/21/2012 12:40 PM, Jim - FooBar(); wrote:
>> On 13/02/12 23:07, Michael Collins wrote:
>>> Does opennlp provide a way to create the *.train file based on a body
>>> of text which I provide, or is the *.train file created another way.
>> Apart from the sentence detector there is no way to automatically
>> create training data for other tasks (POS,NER etc)...these are often
>> language and domain dependant. For the sentence detector however it is
>> easy to create your own private training data (as Jorn said) targeted
>> especially for your problem domain. assuming of course that the
>> pre-trained model is not good enough for you...i find it's pretty
>> good! :)
>>
>> Jim
> Also, unlike a lot of the other models, the sentence detector can
> actually be trained and works quite well with just a few sentences to
> train on.  ~20-30 does really well.
>
> James
>

Wow!!! did not know that!!! I thought the sentence detector needs 
thousands of sentences just like the other models! Thanks James...

Jim

Re: New to opennlp

Posted by James Kosin <ja...@gmail.com>.
On 4/21/2012 12:40 PM, Jim - FooBar(); wrote:
> On 13/02/12 23:07, Michael Collins wrote:
>> Does opennlp provide a way to create the *.train file based on a body
>> of text which I provide, or is the *.train file created another way.
> Apart from the sentence detector there is no way to automatically
> create training data for other tasks (POS,NER etc)...these are often
> language and domain dependant. For the sentence detector however it is
> easy to create your own private training data (as Jorn said) targeted
> especially for your problem domain. assuming of course that the
> pre-trained model is not good enough for you...i find it's pretty
> good! :)
>
> Jim
Also, unlike a lot of the other models, the sentence detector can
actually be trained with just a few sentences and still work quite
well.  ~20-30 does really well.

James