Posted to users@opennlp.apache.org by Madhvi Gupta <mg...@gmail.com> on 2017/02/18 10:28:16 UTC

How to train a Named entity detection model

Hi All,

I have obtained the Reuters data from NIST. I now want to generate training
data from it and create a model for detecting named entities. Can anyone tell
me how such a model can be generated from this data?

-- 
With Regards
Madhvi Gupta
*(Senior Software Engineer)*

Re: How to train a Named entity detection model

Posted by Madhav Sharan <ms...@usc.edu>.
As far as I know, you have two options:

   1. Write custom code that reads data in your format and converts
   it to the OpenNLP format.
      1. I haven't seen your data, so I can't tell how entities are tagged
      in it. Since you know your data format, I am sure you can write your
      own converter.
      2. You might want to do this if training the model is a one-time thing
      (which it is in most cases).
   2. Look at the code in [0] in OpenNLP and write your own version of
   `opennlp.tools.namefind.NameSample.parse(String, String, boolean)` that
   parses your format. (That method is static, so it cannot be overridden
   in a subclass.)
      1. You need to return the tokenized sentence and the spans of the
      entities. I think the code is straightforward to understand.
      2. You can look at the JUnit test [1] and the test data [2] to see how
      it's done for the current format.

Either of these should solve your problem. If I had to choose,
I would stay with the current format and write a converter. :)
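
A minimal sketch of the converter route (option 1), in Python. It assumes the input is the four-column CoNLL-2003 layout shown elsewhere in this thread (token, POS tag, chunk tag, entity tag), with blank lines between sentences and optional `-DOCSTART-` lines between documents. The function name `conll_to_opennlp` and the mapping from CoNLL labels to type names such as `person` are illustrative assumptions, not part of OpenNLP. The OpenNLP manual also has a section on CoNLL 2003 data that may be worth checking before writing your own converter.

```python
# Hypothetical mapping from CoNLL-2003 entity labels to OpenNLP type names.
DEFAULT_TYPES = {"PER": "person", "LOC": "location",
                 "ORG": "organization", "MISC": "misc"}


def conll_to_opennlp(lines, type_map=DEFAULT_TYPES):
    """Yield one '<START:type> ... <END>' sentence per CoNLL-2003 sentence."""
    out = []          # tokens and tags of the sentence being built
    open_type = None  # entity type currently open, if any

    for raw in lines:
        fields = raw.split()
        # Blank lines and -DOCSTART- lines both end the current sentence.
        boundary = not fields or fields[0] == "-DOCSTART-"
        tag = "O" if boundary else fields[-1]

        if tag == "O":
            etype = None
        else:
            label = tag.partition("-")[2] or tag
            etype = type_map.get(label, label.lower())

        # Close the open entity when it ends, changes type, or a B- tag restarts it.
        if open_type is not None and (etype != open_type or tag.startswith("B-")):
            out.append("<END>")
            open_type = None
        if etype is not None and open_type is None:
            out.append(f"<START:{etype}>")
            open_type = etype

        if boundary:
            if out:
                yield " ".join(out)
                out = []
        else:
            out.append(fields[0])

    # Flush the last sentence if the input did not end with a blank line.
    if open_type is not None:
        out.append("<END>")
    if out:
        yield " ".join(out)


sample = [
    "by IN I-PP O",
    "U.S. NNP I-NP I-LOC",
    "guitar NN I-NP O",
    "legend NN I-NP O",
    "Jimi NNP I-NP I-PER",
]
for sentence in conll_to_opennlp(sample):
    print(sentence)
# prints: by <START:location> U.S. <END> guitar legend <START:person> Jimi <END>
```

Writing the yielded lines to a file, one sentence per line, gives input in the `<START:type> ... <END>` layout discussed in this thread.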

HTH

[0] Function which parses sentences
https://github.com/apache/opennlp/blob/master/opennlp-tools/src/main/java/opennlp/tools/namefind/NameSample.java#L215-L268
[1] Test case which uses this function -
https://github.com/apache/opennlp/blob/master/opennlp-tools/src/test/java/opennlp/tools/namefind/NameFinderMETest.java#L57
[2] Sample test data -
https://github.com/apache/opennlp/blob/master/opennlp-tools/src/test/resources/opennlp/tools/namefind/AnnotatedSentences.txt

--
Madhav Sharan


On Mon, Feb 27, 2017 at 2:28 AM, Madhvi Gupta <mg...@gmail.com> wrote:

> Hi Madhav,
>
> My training data is not in the format described in [0].
>
> It is in the format produced by following the instructions at this link:
> http://www.clips.uantwerpen.be/conll2003/ner/000README
>
> The format is shown in my earlier mail in this thread.
> I just want to know whether OpenNLP models can be trained using that
> format and, if not, how the required format can be generated.
>
> With Regards
> Madhvi Gupta
> *(Senior Software Engineer)*

Re: How to train a Named entity detection model

Posted by Madhvi Gupta <mg...@gmail.com>.
Hi Madhav,

My training data is not in the format described in [0].

It is in the format produced by following the instructions at this link:
http://www.clips.uantwerpen.be/conll2003/ner/000README

The format is shown in my earlier mail in this thread.
I just want to know whether OpenNLP models can be trained using that
format and, if not, how the required format can be generated.

With Regards
Madhvi Gupta
*(Senior Software Engineer)*

On Mon, Feb 27, 2017 at 12:47 PM, Madhav Sharan <ms...@usc.edu> wrote:

> Hi, can you check that your training data is in the format described in
> the manual [0]?
>
> As described there, the training data should look something like this:
>
> <START:person> Pierre Vinken <END> 61 years old , will join the board as a nonexecutive director Nov. 29
>
> Here the entity type is "person", and "Pierre Vinken" is one of the person
> names in the training data.
>
> I looked at the links you shared, and your data appears to be in a
> different format. Can you check whether your eng.train is in the format
> above?
>
> I think you can write your own code to read the training file and convert
> it into the OpenNLP format. Also look at [1] in case you can use one of
> the pre-trained models available for OpenNLP.
>
> HTH
>
>
>
> [0] https://opennlp.apache.org/documentation/1.7.2/manual/opennlp.html#tools.namefind.training
> [1] http://opennlp.sourceforge.net/models-1.5/
>
>
> --
> Madhav Sharan

Re: How to train a Named entity detection model

Posted by Madhav Sharan <ms...@usc.edu>.
Hi, can you check that your training data is in the format described in
the manual [0]?

As described there, the training data should look something like this:

<START:person> Pierre Vinken <END> 61 years old , will join the board as a nonexecutive director Nov. 29

Here the entity type is "person", and "Pierre Vinken" is one of the person
names in the training data.
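
To make the markup concrete, here is a small Python sketch that reads one sentence in this layout and returns its tokens plus the entity spans. It is only an illustration of the format, not OpenNLP's own parser; the function name `parse_tagged_sentence`, the `"default"` fallback type, and the choice of token-index spans with an exclusive end are assumptions made for this example.

```python
import re

# Matches "<START>" or "<START:type>" as a whole whitespace-separated piece.
START_TAG = re.compile(r"^<START(?::([^>]+))?>$")


def parse_tagged_sentence(line, default_type="default"):
    """Return (tokens, spans); each span is (start, end_exclusive, type)."""
    tokens, spans = [], []
    start, etype = None, None

    for piece in line.split():
        match = START_TAG.match(piece)
        if match:
            if start is not None:
                raise ValueError("<START> found inside an open entity")
            start, etype = len(tokens), match.group(1) or default_type
        elif piece == "<END>":
            if start is None:
                raise ValueError("<END> found without a matching <START>")
            spans.append((start, len(tokens), etype))
            start = None
        else:
            tokens.append(piece)

    if start is not None:
        raise ValueError("sentence ended with an unclosed <START>")
    return tokens, spans


tokens, spans = parse_tagged_sentence(
    "<START:person> Pierre Vinken <END> 61 years old")
print(tokens)  # ['Pierre', 'Vinken', '61', 'years', 'old']
print(spans)   # [(0, 2, 'person')]
```

Note that the tags are separated from the tokens by spaces, so each tag is its own whitespace-delimited piece of the line.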

I looked at the links you shared, and your data appears to be in a
different format. Can you check whether your eng.train is in the format
above?

I think you can write your own code to read the training file and convert
it into the OpenNLP format. Also look at [1] in case you can use one of
the pre-trained models available for OpenNLP.

HTH



[0] https://opennlp.apache.org/documentation/1.7.2/manual/opennlp.html#tools.namefind.training
[1] http://opennlp.sourceforge.net/models-1.5/


--
Madhav Sharan


On Sun, Feb 26, 2017 at 9:42 PM, Madhvi Gupta <mg...@gmail.com> wrote:

> Please let me know if anyone has any idea about this.
>
> With Regards
> Madhvi Gupta
> *(Senior Software Engineer)*

Re: How to train a Named entity detection model

Posted by Madhvi Gupta <mg...@gmail.com>.
Please let me know if anyone has any idea about this.

With Regards
Madhvi Gupta
*(Senior Software Engineer)*

On Tue, Feb 21, 2017 at 10:51 AM, Madhvi Gupta <mg...@gmail.com> wrote:

> Hi Joern,
>
> The training data generated from the Reuters dataset is in the following
> format. The process produced three files: eng.train, eng.testa, and
> eng.testb.
>
> A DT I-NP O
> rare JJ I-NP O
> early JJ I-NP O
> handwritten JJ I-NP O
> draft NN I-NP O
> of IN I-PP O
> a DT I-NP O
> song NN I-NP O
> by IN I-PP O
> U.S. NNP I-NP I-LOC
> guitar NN I-NP O
> legend NN I-NP O
> Jimi NNP I-NP I-PER
>
> When I ran the following command with this training file:
> ./opennlp TokenNameFinderTrainer -model en-ner-person.bin -lang en \
>     -data /home/centos/ner/eng.train -encoding UTF-8
>
> it gave me the following error:
> ERROR: Not enough training data
> The provided training data is not sufficient to create enough events to
> train a model.
> To resolve this error use more training data, if this doesn't help there
> might be some fundamental problem with the training data itself.
>
> The format required for training OpenNLP models consists of tagged
> sentences, but the training data prepared from the Reuters dataset is in
> the format shown above. Please tell me how training data can be generated
> in the required format, or how the existing format can be used for
> generating models.
>
> With Regards
> Madhvi Gupta
> *(Senior Software Engineer)*

Re: How to train a Named entity detection model

Posted by Madhvi Gupta <mg...@gmail.com>.
Hi Joern,

The training data generated from the Reuters dataset is in the following
format. The process produced three files: eng.train, eng.testa, and
eng.testb.

A DT I-NP O
rare JJ I-NP O
early JJ I-NP O
handwritten JJ I-NP O
draft NN I-NP O
of IN I-PP O
a DT I-NP O
song NN I-NP O
by IN I-PP O
U.S. NNP I-NP I-LOC
guitar NN I-NP O
legend NN I-NP O
Jimi NNP I-NP I-PER

When I ran the following command with this training file:
./opennlp TokenNameFinderTrainer -model en-ner-person.bin -lang en \
    -data /home/centos/ner/eng.train -encoding UTF-8

it gave me the following error:
ERROR: Not enough training data
The provided training data is not sufficient to create enough events to
train a model.
To resolve this error use more training data, if this doesn't help there
might be some fundamental problem with the training data itself.

The format required for training OpenNLP models consists of tagged
sentences, but the training data prepared from the Reuters dataset is in
the format shown above. Please tell me how training data can be generated
in the required format, or how the existing format can be used for
generating models.
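
One way to confirm this kind of format mismatch before training is a quick check of the file's lines. The Python sketch below is a rough heuristic only and does not reproduce whatever validation OpenNLP performs internally; the function name `summarize_training_lines` and the "every line has exactly four fields" test for column-style data are assumptions made for illustration.

```python
def summarize_training_lines(lines):
    """Count markup tags and guess whether the data is column-style."""
    rows = [line.split() for line in lines if line.strip()]
    starts = sum(tok.startswith("<START") for row in rows for tok in row)
    ends = sum(tok == "<END>" for row in rows for tok in row)
    four_fields = sum(len(row) == 4 for row in rows)
    return {
        "lines": len(rows),              # non-blank lines seen
        "start_tags": starts,            # <START> / <START:type> tags
        "end_tags": ends,                # <END> tags
        "balanced": starts == ends,      # every START has an END
        # No markup at all and four fields per line suggests CoNLL-style columns.
        "looks_columnar": bool(rows) and starts == 0 and four_fields == len(rows),
    }


conll_like = ["A DT I-NP O", "rare JJ I-NP O", "U.S. NNP I-NP I-LOC"]
print(summarize_training_lines(conll_like)["looks_columnar"])  # True
```

A file for which `start_tags` is zero contains no entity annotations in the sentence-per-line layout, whatever else it contains.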

With Regards
Madhvi Gupta
*(Senior Software Engineer)*

On Mon, Feb 20, 2017 at 5:52 PM, Joern Kottmann <ko...@gmail.com> wrote:

> Please explain what is not working. Are there any error messages or
> exceptions?
>
> By default, the name finder trains on the default format, which you can
> see at the documentation link I shared.
>
> Jörn

Re: How to train a Named entity detection model

Posted by Joern Kottmann <ko...@gmail.com>.
Please explain what is not working. Are there any error messages or
exceptions?

By default, the name finder trains on the default format, which you can
see at the documentation link I shared.

Jörn

On Mon, Feb 20, 2017 at 6:04 AM, Madhvi Gupta <mg...@gmail.com> wrote:

> Hi Joern,
>
> I got the data from the following link; it consists of a corpus of news
> articles.
> http://trec.nist.gov/data/reuters/reuters.html
>
> Following the steps given at the link below, I created training and
> test data, but it does not work with the NameFinder of the OpenNLP API.
> http://www.clips.uantwerpen.be/conll2003/ner/000README
>
> Can you please help me create training data from that corpus
> and use it to create named entity detection models?
>
> With Regards
> Madhvi Gupta
> *(Senior Software Engineer)*

Re: How to train a Named entity detection model

Posted by Madhvi Gupta <mg...@gmail.com>.
Hi Joern,

I got the data from the following link; it consists of a corpus of news
articles.
http://trec.nist.gov/data/reuters/reuters.html

Following the steps given at the link below, I created training and
test data, but it does not work with the NameFinder of the OpenNLP API.
http://www.clips.uantwerpen.be/conll2003/ner/000README

Can you please help me create training data from that corpus
and use it to create named entity detection models?

With Regards
Madhvi Gupta
*(Senior Software Engineer)*

On Mon, Feb 20, 2017 at 1:00 AM, Joern Kottmann <ko...@gmail.com> wrote:

> Hello,
>
> To train the name finder, you need training data that contains the entities
> you would like to detect.
> Is that the case with the data you have?
>
> Take a look at our documentation:
> https://opennlp.apache.org/documentation/1.7.2/manual/opennlp.html#tools.namefind.training
>
> At the beginning of that section you can see how the data has to be marked
> up.
>
> Please note that you need many sentences to train the name finder.
>
> HTH,
> Jörn

Re: How to train a Named entity detection model

Posted by Joern Kottmann <ko...@gmail.com>.
Hello,

To train the name finder, you need training data that contains the entities
you would like to detect.
Is that the case with the data you have?

Take a look at our documentation:
https://opennlp.apache.org/documentation/1.7.2/manual/opennlp.html#tools.namefind.training

At the beginning of that section you can see how the data has to be marked
up.

Please note that you need many sentences to train the name finder.
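
A quick way to see how many sentences and annotated entities a marked-up file contains is sketched below in Python. It assumes one sentence per non-blank line and `<START:type>` tags as in the documentation section linked above; the function name `count_entities` is made up for this example, and untyped `<START>` tags are deliberately not counted.

```python
import re
from collections import Counter

# Captures the type name in tags such as "<START:person>".
TYPED_START = re.compile(r"<START:([^>\s]+)>")


def count_entities(lines):
    """Return (sentence_count, {entity_type: occurrences})."""
    sentences = 0
    per_type = Counter()
    for line in lines:
        if not line.strip():
            continue  # blank lines are not sentences
        sentences += 1
        per_type.update(TYPED_START.findall(line))
    return sentences, dict(per_type)


data = [
    "<START:person> Pierre Vinken <END> 61 years old",
    "",
    "No names in this sentence .",
]
print(count_entities(data))  # (2, {'person': 1})
```

If either number is very small, more annotated sentences are probably needed before training.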

HTH,
Jörn


On Sat, Feb 18, 2017 at 11:28 AM, Madhvi Gupta <mg...@gmail.com> wrote:

> Hi All,
>
> I have obtained the Reuters data from NIST. I now want to generate training
> data from it and create a model for detecting named entities. Can anyone tell
> me how such a model can be generated from this data?
>
> --
> With Regards
> Madhvi Gupta
> *(Senior Software Engineer)*
>