Posted to users@opennlp.apache.org by Sanjeev Sharma <sa...@evanschambers.com> on 2014/03/27 23:35:29 UTC

Training new models

Hi,



I am new to OpenNLP.  I've been playing with chunking, tokenizing, POS
tagging, and name recognition for a few days, following the example code
and using the pre-trained models from
http://opennlp.sourceforge.net/models-1.5/.  I've been having trouble with
person-name and organization recognition: with those models I can only
identify common names and organizations like "Mike Smith" and "IBM".  In
addition, I need to be able to find date ranges and technical terms like
"Java", "C++", and "HTML" (I should mention that my input is going to be
resumes).



I figure I need to train my own models, especially since training data
that looks like my input (i.e. resumes) should give better context.  I've
been looking for information on how to do this in the documentation and
through Google searches.  I found a few simple examples, but not much more.
I did see the example in the documentation with the "<START:person> <END>"
tags and the command line that turns the training data into a .bin file,
but nothing for organization names.  I looked at one or two of the
annotation guides, and that created more questions than answers.  For
example, the annotation guides are not consistent with each other or with
the example in the documentation.  Are there pros and cons between the
different formats?  Are the examples in the documentation in a native
format?  Is there a conversion utility?  If so, and I'm creating data from
scratch, would it not be better to just put it in the native format?



I lack understanding of OpenNLP and NLP in general, and the OpenNLP Manual
hasn't worked for me.  Maybe I'm misinterpreting the documentation or not
looking in the right place.  I would greatly appreciate it if someone could
point me in the right direction: real-world examples of training a model,
a book I can read through, or just some good examples of training data.
Beyond the specific task I'm trying to accomplish, I would like to get a
deeper understanding of how OpenNLP works.



Thanks for any help.

RE: Training new models

Posted by Sanjeev Sharma <sa...@evanschambers.com>.

Thank you Jorn.

-----Original Message-----
From: Joern Kottmann [mailto:kottmann@gmail.com]
Sent: Sunday, March 30, 2014 12:54 PM
To: users@opennlp.apache.org
Subject: Re: Training new models

You should use a few hundred, maybe up to a bit over a thousand, to get good
performance.

The model training command looks good.  To get anything detected you will
need more data.  I would also use the perceptron with a cutoff of zero
instead of the default maxent with a cutoff of five.

HTH,
Jörn


Re: Training new models

Posted by Joern Kottmann <ko...@gmail.com>.

You should use a few hundred, maybe up to a bit over a thousand, to get good
performance.

The model training command looks good.  To get anything detected you will
need more data.  I would also use the perceptron with a cutoff of zero
instead of the default maxent with a cutoff of five.
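
As a rough sketch of the perceptron suggestion: the command-line trainer
accepts a training-parameters file through its -params option, with keys such
as Algorithm, Iterations and Cutoff.  The small Java program below only writes
such a file and prints the matching command.  The file name perceptron.params
and the iteration count of 100 are assumptions made for illustration, not
values from this thread.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative helper, not part of OpenNLP: builds a training-parameters
// file that selects the perceptron trainer with a given cutoff.
final class PerceptronParams {

    // Returns the file content in simple key=value form.
    static String paramsText(int iterations, int cutoff) {
        return "Algorithm=PERCEPTRON\n"
             + "Iterations=" + iterations + "\n"
             + "Cutoff=" + cutoff + "\n";
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical file name; 100 iterations is an assumed value.
        Path params = Path.of("perceptron.params");
        Files.writeString(params, paramsText(100, 0));

        // Same command as in the thread, with the params file added.
        System.out.println("opennlp TokenNameFinderTrainer -params " + params
            + " -model persons.bin -lang en -data train.txt -encoding UTF-8");
    }
}
```

With a cutoff of zero, features that occur only once in a small training set
are kept rather than discarded, which matters when only a few documents are
labeled.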

HTH,
Jörn


On Sun, Mar 30, 2014 at 7:01 AM, Stuart Robinson <stuartprobinson1@gmail.com> wrote:

> Thanks, Sanjeev. I was actually asking about the data used to train the
> tokenizers provided by OpenNLP. I'll start a new thread to prevent
> confusion. Sorry about that.

Re: Training new models

Posted by Stuart Robinson <st...@gmail.com>.

Thanks, Sanjeev. I was actually asking about the data used to train the
tokenizers provided by OpenNLP. I'll start a new thread to prevent
confusion. Sorry about that.


On Sat, Mar 29, 2014 at 7:23 PM, Sanjeev Sharma <
sanjeev.sharma@evanschambers.com> wrote:

> Sorry, I can't share the data due to privacy concerns.  I produced it by
> extracting text from Word-document resumes, concatenating them into a
> single text file, and tagging only the names with <START:person> and <END>
> tags.  I'm using 20 or so resumes for initial experimentation, but the
> actual training data will have several hundred resumes.

RE: Training new models

Posted by Sanjeev Sharma <sa...@evanschambers.com>.

Sorry, I can't share the data due to privacy concerns.  I produced it by
extracting text from Word-document resumes, concatenating them into a
single text file, and tagging only the names with <START:person> and <END>
tags.  I'm using 20 or so resumes for initial experimentation, but the
actual training data will have several hundred resumes.
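
A minimal sketch of the concatenation step, assuming each resume has already
been converted to text with one tokenized sentence per line.  The OpenNLP
manual describes empty lines in name finder training data as document
boundaries, so this version drops blank lines inside a resume and puts exactly
one empty line between resumes.  The class and method names are made up for
illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative helper, not part of OpenNLP.
final class TrainingFileJoiner {

    // Joins documents into one training text: blank lines inside a document
    // are dropped, and one empty line separates consecutive documents.
    static String join(List<String> documents) {
        StringBuilder sb = new StringBuilder();
        for (String doc : documents) {
            String body = doc.lines()
                .map(String::strip)
                .filter(line -> !line.isEmpty())
                .collect(Collectors.joining("\n"));
            if (body.isEmpty()) {
                continue; // skip documents with no content
            }
            if (sb.length() > 0) {
                sb.append("\n\n"); // empty line marks a document boundary
            }
            sb.append(body);
        }
        return sb.length() == 0 ? "" : sb.append("\n").toString();
    }

    public static void main(String[] args) {
        String first = "<START:person> Mike Smith <END>\n\nJava developer\n";
        String second = "<START:person> Jane Doe <END>\n";
        System.out.print(join(List.of(first, second)));
    }
}
```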

-----Original Message-----
From: Stuart Robinson [mailto:stuartprobinson1@gmail.com]
Sent: Saturday, March 29, 2014 8:01 PM
To: users@opennlp.apache.org
Subject: Re: Training new models

Is the training data used to train the tokenizer models available?
Specifically, I'm interested in the data used to train the English
tokenizer:

http://opennlp.sourceforge.net/models-1.5/en-token.bin

Thanks,
Stuart Robinson

Re: Training new models

Posted by Stuart Robinson <st...@gmail.com>.

Is the training data used to train the tokenizer models available?
Specifically, I'm interested in the data used to train the English
tokenizer:

http://opennlp.sourceforge.net/models-1.5/en-token.bin

Thanks,
Stuart Robinson

> On Mar 29, 2014, at 10:12 AM, Sanjeev Sharma <sa...@evanschambers.com> wrote:
> 
> Jorn,
> 
> Thank you for your reply.  Here is what I tried as a simple test:
> 
> - tagged the names in about 20 resumes using the "<START:person><END>"
>   notation
> - concatenated them into a single text file
> - created a new .bin file using the following command:
> 
>    >opennlp TokenNameFinderTrainer -model persons.bin -lang en -data train.txt -encoding UTF-8
> 
> - using this model file and TokenNameFinderModel, tried to identify a name
>   in one of the resumes I used for training (I can post the code if you
>   need it)
> 
> Should this work?  If not, what am I doing wrong?
> 
> Thanks,
> Sanjeev.

RE: Training new models

Posted by Sanjeev Sharma <sa...@evanschambers.com>.

Jorn,

Thank you for your reply.  Here is what I tried as a simple test:

- tagged the names in about 20 resumes using the "<START:person><END>"
  notation
- concatenated them into a single text file
- created a new .bin file using the following command:

	>opennlp TokenNameFinderTrainer -model persons.bin -lang en -data train.txt -encoding UTF-8

- using this model file and TokenNameFinderModel, tried to identify a name
  in one of the resumes I used for training (I can post the code if you
  need it)

Should this work?  If not, what am I doing wrong?

Thanks,
Sanjeev.
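
One thing that may be worth ruling out before training is malformed tagging.
The sketch below checks a single line of training text, on the assumption that
each tag has to stand alone as a whitespace-separated token and that every
<START:...> needs a matching <END>.  It is a hypothetical stand-alone checker,
not OpenNLP's own parser, and the sample sentences are invented.

```java
import java.util.regex.Pattern;

// Illustrative checker, not part of OpenNLP.
final class NameSampleCheck {

    // A start tag with an optional type, e.g. <START:person> or <START>.
    private static final Pattern START = Pattern.compile("<START(:[^:>\\s]*)?>");

    // Returns null when the line looks well-formed, otherwise a short description.
    static String problem(String line) {
        boolean open = false; // currently inside a <START> ... <END> span?
        int inside = 0;       // tokens seen inside the current span
        for (String tok : line.trim().split("\\s+")) {
            if (START.matcher(tok).matches()) {
                if (open) return "nested <START>";
                open = true;
                inside = 0;
            } else if (tok.equals("<END>")) {
                if (!open) return "<END> without <START>";
                if (inside == 0) return "empty span";
                open = false;
            } else if (tok.contains("<START") || tok.contains("<END>")) {
                return "tag glued to a token: " + tok;
            } else if (open) {
                inside++;
            }
        }
        return open ? "missing <END>" : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "<START:person> Mike Smith <END> is a Java developer .",
            "<START:person>Mike Smith<END> is a Java developer .",
            "<START:person> Mike Smith is a Java developer ."
        };
        for (String s : samples) {
            String p = problem(s);
            System.out.println((p == null ? "ok" : p) + " | " + s);
        }
    }
}
```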

-----Original Message-----
From: Jörn Kottmann [mailto:kottmann@gmail.com]
Sent: Friday, March 28, 2014 5:04 AM
To: users@opennlp.apache.org
Subject: Re: Training new models

Hello,

the OpenNLP Name Finder training format is rather simple, as you already
figured out, you need to use the <START:entity_name> and <END> tags to
mark the name in tokenized plain text documents.

In the example above you could replace <START:person> with
<START:organization> to markup an organization name in your text.

To create a model which performs on your documents you will have to label
quite a few of them and using a text editor to insert the tags is an
approach which does not scale for more than a few documents.

I suggest to have a look at brat:
http://brat.nlplab.org/

Brat has a few issues in the 1.3 release version, but they are now
resolved in the trunk, I recommend to use it instead of 1.3.

The OpenNLP Name Finder in the trunk version can be directly trained on
the brat format.
If you want to use OpenNLP 1.5.3 instead you can still use 1.6.0 to
convert the data into the above discussed OpenNLP format.

I know a few people who have done this successfully. Let us know if you
have an issues, and a contribution about this process to our documentation
would be very welcome!

HTH,
Jörn

Re: Training new models

Posted by Jörn Kottmann <ko...@gmail.com>.

Hello,

The OpenNLP Name Finder training format is rather simple. As you already
figured out, you need to use the <START:entity_name> and <END> tags to mark
the names in tokenized plain text documents.

In the example above you could replace <START:person> with
<START:organization> to mark up an organization name in your text.
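
As an illustration of the format, here is a minimal sketch that renders one
tokenized sentence with its name spans as a training line. The function name
annotate, the span representation (token indexes, end exclusive), and the
example sentence are assumptions made for this sketch; only the
<START:type> ... <END> notation comes from OpenNLP.

```python
def annotate(tokens, spans):
    """Render one tokenized sentence in the Name Finder training format.

    tokens: list of token strings
    spans:  list of (start, end, entity_type) token indexes, end exclusive;
            spans are assumed not to overlap
    """
    starts = {start: etype for start, end, etype in spans}
    ends = {end for start, end, etype in spans}
    out = []
    for i, token in enumerate(tokens):
        if i in ends:
            out.append("<END>")  # close the span that stops before this token
        if i in starts:
            out.append("<START:%s>" % starts[i])
        out.append(token)
    if len(tokens) in ends:
        out.append("<END>")  # a span that runs to the end of the sentence
    # Tags and tokens are all separated by single spaces.
    return " ".join(out)


sentence = "Mike Smith worked at IBM from 2010 to 2013 .".split()
line = annotate(sentence, [(0, 2, "person"), (4, 5, "organization")])
print(line)
# <START:person> Mike Smith <END> worked at <START:organization> IBM <END> from 2010 to 2013 .
```

Each sentence goes on its own line of the training file, and the tags must be
separated from the tokens by whitespace.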

To create a model that performs well on your documents, you will have to
label quite a few of them, and using a text editor to insert the tags is an
approach that does not scale beyond a few documents.

I suggest having a look at brat:
http://brat.nlplab.org/

Brat has a few issues in the 1.3 release version, but they are now resolved
in the trunk. I recommend using the trunk version instead of 1.3.

The OpenNLP Name Finder in the trunk version can be trained directly on the
brat format.
If you want to use OpenNLP 1.5.3 instead, you can still use 1.6.0 to convert
the data into the OpenNLP format discussed above.
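
To show what such a conversion involves, here is a minimal sketch. It is not
the converter that ships with OpenNLP. It assumes brat's standoff format, in
which each text-bound annotation in the .ann file is a line of the form
"T1<TAB>Person 0 10<TAB>Mike Smith" with character offsets into the text
file. It also assumes that annotations do not overlap or cross line breaks
and that the text already has one sentence per line. The function name
brat_to_opennlp and the lower-casing of the type names are choices made for
this sketch.

```python
def brat_to_opennlp(text, ann):
    """Insert Name Finder tags into text using brat text-bound annotations."""
    spans = []
    for row in ann.splitlines():
        if not row.startswith("T"):
            continue  # skip relations, events, attributes, notes
        _id, info, _covered = row.split("\t", 2)
        if ";" in info:
            continue  # discontinuous span, not handled in this sketch
        etype, start, end = info.split()
        spans.append((int(start), int(end), etype.lower()))
    # Insert from the back so that earlier character offsets stay valid.
    for start, end, etype in sorted(spans, reverse=True):
        text = (text[:start] + " <START:" + etype + "> " + text[start:end]
                + " <END> " + text[end:])
    # Normalize whitespace line by line; the line breaks themselves are kept.
    return "\n".join(" ".join(line.split()) for line in text.split("\n"))


text = "Mike Smith worked at IBM .\nHe knows Java ."
ann = "T1\tPerson 0 10\tMike Smith\nT2\tOrganization 21 24\tIBM\n"
converted = brat_to_opennlp(text, ann)
print(converted)
```

The sketch only separates the tags from the text with spaces. Real data would
still have to be tokenized and split into sentences before training.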

I know a few people who have done this successfully. Let us know if you have
any issues, and a contribution about this process to our documentation would
be very welcome!

HTH,
Jörn

Re: Training new models

Posted by swapnil marathe <sp...@gmail.com>.

If you wish to make your own models, you can make them using
"language modelling toolkits", e.g.:
SRILM <http://www.speech.sri.com/projects/srilm/>,
MITLM <http://projects.csail.mit.edu/cgi-bin/wiki/view/SLS/MITLMTutorial>

Thanks,
Swapnil


On Fri, Mar 28, 2014 at 4:05 AM, Sanjeev Sharma <
sanjeev.sharma@evanschambers.com> wrote:

> Hi,
>
>
>
> I am new to OpenNLP.  I've been playing with chunking, tokenizing, POS
> tagging, and Name recognition for a few days.  I've been following the
> example code and using preexisting models from
> http://opennlp.sourceforge.net/models-1.5/.  I've been having some trouble
> with name recognition and organization recognition in that using the above
> mentioned models I can only identify common names or organizations like
> "Mike Smith" and "IBM".  In addition I need to be able to find date ranges
> and technical language like "Java", "C++", and "HTML" (I should mention
> that my input is going to be resumes).
>
>
>
> I figured I need to train my own models, especially since my training data
> should look more like my input to give a better context (i.e. resumes).
> I've been trying to find some information on how to do this in the
> documentation and also doing google searches.  I found a few simple
> examples, but not much more.  I did see the example in the documentation
> with the "<START:person> <END>" tags and the command line to process the
> training data into a .bin file, but nothing with organization names.  I
> tried to look at one or two of the annotation guides and that created more
> questions than answers (for example, the annotation guides not consistent
> with each other or the example in the documentation.  Are there pros and
> cons between the different formats?  Are the examples in the documentation
> in a native format?  Is there a conversion utility?  If so and I'm creating
> data from scratch, would it not be better to just put it in the native
> format?)
>
>
>
> I just lack understanding of OpenNLP and NLP in general and the OpenNLP
> Manual just hasn't worked for me.  Maybe I'm just misinterpreting the
> documentation or just not looking in the right place.  I would appreciate
> it greatly if someone could point me in the right direction in the way of
> real world examples of training a model, recommending a book I can read
> through, or maybe just some good examples of training data.  Beyond the
> specific task I'm trying to accomplish, I would like to get a deeper
> understanding of how OpenNLP works.
>
>
>
> Thanks for any help.
>