Posted to users@opennlp.apache.org by Robert Logue <rp...@hotmail.co.uk> on 2016/04/15 10:25:21 UTC

Name finder questions

Hello,

I have just started using OpenNLP in a Java application. I am just getting used to the software and have a couple of newbie questions.

I see that for the name finder there is separate model data for people and organizations (en-ner-organization.bin and en-ner-person.bin). Is there any way to combine these into one file so I can do one search that gives me back both person names and organization names? Or is this not possible, and is it best to do two searches?

The second question isn't related to the name finder, and I don't think it is possible, but I thought I would ask anyway. If I had two sentences, say 'Jack climbed the hill. He was very tired.', is there any way to know that the pronoun "he" at the start of the second sentence is actually about Jack, the subject of the first sentence? I know in this simple case it is obvious, but I am wondering if there is anything in the OpenNLP software that will help with this.

Thanks,


Robert

Re: Name finder questions

Posted by Jeffrey Zemerick <jz...@apache.org>.
Robert,

The docs used to say that training on multiple entity types is supported
but experimental, though I can't find that in the documentation now. The
old wording is quoted in a Stack Overflow answer [1]. If this has
changed, perhaps someone will let us know.

I will have to leave your second question for the others. :)

Jeff

[1] http://stackoverflow.com/a/31496897




RE: Name finder questions

Posted by Robert Logue <rp...@hotmail.co.uk>.
OK, thanks. I guess it was my inexperience that made me think it wasn't named entity data.

So in one of the files I see

Daniel|NNP|I-PER Guerin|NNP|I-PER

So I would need to parse each line, remove the POS tags, and replace the |I-PER tags with <START:Person> ... <END> markup around the name, and that would do the job.
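
A minimal sketch of that conversion in Java, assuming each input line holds space-separated token|POS|TAG items as in the sample above, that O marks tokens outside any name, and that a B- prefix starts a new name directly after another one. The class name, the method name and the extra tokens in the demo line are made up for illustration. The corpus labels (PER, LOC, ...) are kept as the type names rather than mapped to Person.

```java
import java.util.ArrayList;
import java.util.List;

class WikinerToOpenNlp {

    /** Converts one line of token|POS|TAG items into <START:type> ... <END> markup. */
    static String convertLine(String line) {
        List<String> out = new ArrayList<>();
        String openType = null; // type of the span currently open, or null

        for (String item : line.trim().split("\\s+")) {
            if (item.isEmpty()) {
                continue;
            }
            // Split from the right so a '|' inside the token itself is preserved.
            int tagBar = item.lastIndexOf('|');
            int posBar = tagBar > 0 ? item.lastIndexOf('|', tagBar - 1) : -1;
            String token = posBar >= 0 ? item.substring(0, posBar) : item;
            String tag = posBar >= 0 ? item.substring(tagBar + 1) : "O";

            String type = null;
            boolean begins = false;
            if (tag.startsWith("I-") || tag.startsWith("B-")) {
                type = tag.substring(2);
                begins = tag.startsWith("B-");
            }

            // Close the open span when the name ends or a new one starts.
            if (openType != null && (type == null || begins || !type.equals(openType))) {
                out.add("<END>");
                openType = null;
            }
            // Open a span for the first token of a name.
            if (type != null && openType == null) {
                out.add("<START:" + type + ">");
                openType = type;
            }
            out.add(token);
        }
        if (openType != null) {
            out.add("<END>");
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        // Only the first two items come from the corpus sample; the rest are illustrative.
        String line = "Daniel|NNP|I-PER Guerin|NNP|I-PER joined|VBD|O Paris|NNP|I-LOC .|.|O";
        // prints: <START:PER> Daniel Guerin <END> joined <START:LOC> Paris <END> .
        System.out.println(convertLine(line));
    }
}
```

Running each line of the downloaded file through convertLine and writing the results one sentence per line should give a file in the shape the -data option of TokenNameFinderTrainer expects. Whether the lines in the corpus are already single sentences is something to check first.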

Thanks, that helps me a lot.

Robert

> From: ragerri@apache.org
> Date: Mon, 25 Apr 2016 16:53:28 +0200
> Subject: Re: Name finder questions
> To: users@opennlp.apache.org
> 
> Hi,
> 
> It is much easier to try with a corpus that is already available. The
> links I sent are about Named Entities, and they all contain persons,
> locations and organizations. The idea is to obtain (one of) those corpora
> and convert it to the OpenNLP format to train a new model. If that does not
> work for you (e.g., the output is very bad) then maybe you could
> consider annotating your own data. But that takes time.
> 
> HTH,
> 
> R
> 
> On Mon, Apr 25, 2016 at 4:32 PM, Robert Logue <rp...@hotmail.co.uk> wrote:
> > I sure did, thanks. I was mainly unsure whether these would work as well for sports specifically, or whether it would be best to make my own.
> >
> > I may have missed something, but it is also unclear to me what the files are for, i.e. whether they are model files. The ones I downloaded and looked at seemed to be POS tagged rather than named entity tagged. Maybe my inexperience is making me miss something?
> >
> > Thanks,
> > Robert
> >
> >
> >
> >> From: ragerri@apache.org
> >> Date: Mon, 25 Apr 2016 15:43:23 +0200
> >> Subject: Re: Name finder questions
> >> To: users@opennlp.apache.org
> >>
> >> Did you look at the links I sent in a previous email?
> >>
> >> R
> >>
> >> On Mon, Apr 25, 2016 at 3:10 PM, Robert Logue <rp...@hotmail.co.uk> wrote:
> >> > The area I would be looking at is sports, and the only things I would be interested in are the 3 things I mentioned, i.e.
> >> >
> >> > People, Organizations and Locations
> >> >
> >> > Do you think there are existing corpora that would cover this? Or would there be a benefit in creating my own?
> >> >
> >> > Thanks,
> >> > Robert
> >> >
> >> >> From: ragerri@apache.org
> >> >> Date: Mon, 25 Apr 2016 09:39:48 +0200
> >> >> Subject: Re: Name finder questions
> >> >> To: users@opennlp.apache.org
> >> >>
> >> >> Hi Robert,
> >> >>
> >> >> Performance varies a lot, and that is still the subject of research.
> >> >> Basically, more data always helps, but depending on the type of data,
> >> >> number of entity types, etc., the quantity required differs. If you
> >> >> need to tag persons, locations and organizations in news or a similar
> >> >> text genre, I recommend that you use one of the already existing corpora
> >> >> and avoid tagging your own data.
> >> >>
> >> >> Which genre are you interested in?
> >> >>
> >> >> R
> >> >>
> >> >> On Fri, Apr 22, 2016 at 10:31 AM, Robert Logue <rp...@hotmail.co.uk> wrote:
> >> >> > Very useful, thank you.
> >> >> >
> >> >> > The only question I have left, for the moment, is about performance. The minimum recommended number of sentences is 15,000. Does anyone know how far this would need to be increased before it became a performance issue, if it ever would? For example, if I created training data with 100,000 sentences, would this be an issue? Is there any number I could go to where it would be?
> >> >> >
> >> >> > Thanks,
> >> >> >
> >> >> > Robert
> >> >> >
> >> >> >> Subject: Re: Name finder questions
> >> >> >> To: users@opennlp.apache.org
> >> >> >> From: post@thomas-zastrow.de
> >> >> >> Date: Fri, 22 Apr 2016 10:22:50 +0200
> >> >> >>
> >> >> >> Here you can find the raw data I used to create a German model; maybe it's
> >> >> >> useful for you:
> >> >> >>
> >> >> >> http://www.thomas-zastrow.de/nlp/
> >> >> >>
> >> >> >> ("Raw trainingdata in OpenNLP format")
> >> >> >>
> >> >> >>
> >> >> >> On 22.04.2016 at 10:17, Robert Logue wrote:
> >> >> >> > Can anyone help here? I don't want to start creating a large training file and find out I have gone about it in the wrong way.
> >> >> >> >
> >> >> >> > The resources I have been looking at are
> >> >> >> >
> >> >> >> > https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.training
> >> >> >> > http://blog.thedigitalgroup.com/sagarg/2015/10/30/open-nlp-name-finder-model-training/
> >> >> >> > http://nishutayaltech.blogspot.co.uk/2015/07/writing-custom-namefinder-model-in.html
> >> >> >> >
> >> >> >> > None of which gives the answers I am looking for.
> >> >> >> >
> >> >> >> > Thanks,
> >> >> >> >
> >> >> >> > Robert
> >> >> >> >
> >> >> >> >> From: rplogue@hotmail.co.uk
> >> >> >> >> To: users@opennlp.apache.org
> >> >> >> >> Subject: RE: Name finder questions
> >> >> >> >> Date: Wed, 20 Apr 2016 09:51:25 +0100
> >> >> >> >>
> >> >> >> >> I have a few questions regarding creating my own training data for the name finder. I would like to distinguish between people, organizations and locations. The example in the documentation shows the tags to use for people, i.e.
> >> >> >> >>
> >> >> >> >> <START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
> >> >> >> >>
> >> >> >> >> So would I use <START:organization> ... <END> and <START:location> ... <END> for organizations and locations respectively? The named entity guidelines in the documentation, i.e.
> >> >> >> >>
> >> >> >> >> https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.annotation_guides
> >> >> >> >>
> >> >> >> >> seem to show different tags being used, which has confused me slightly as to which tags I should actually use.
> >> >> >> >>
> >> >> >> >> Also, I see the 15,000 line recommendation. Is there any performance hit if you use many more lines?
> >> >> >> >>
> >> >> >> >> If I create my plain text training file as outlined above, are there any other params that are recommended beyond the basic ones, i.e.
> >> >> >> >>
> >> >> >> >> opennlp TokenNameFinderTrainer -model OUTPUT_FILE.bin -lang en -data TRAINING_FILE.train -encoding UTF-8
> >> >> >> >>
> >> >> >> >> For instance, what is the -params training parameters file used for? Is it necessary? Should it list the named entities I am looking for, i.e. person, organization and location, and if so, what format should it be in?
> >> >> >> >>
> >> >> >> >> Sorry for the basic questions here, but I can't find the answers in the documentation or from a quick Google search.
> >> >> >> >>
> >> >> >> >> Thanks,
> >> >> >> >>
> >> >> >> >> Robert
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>> From: rodrigo.agerri@ehu.eus
> >> >> >> >>> Date: Mon, 18 Apr 2016 09:36:24 +0200
> >> >> >> >>> Subject: Re: Name finder questions
> >> >> >> >>> To: users@opennlp.apache.org
> >> >> >> >>>
> >> >> >> >>> Hello,
> >> >> >> >>>
> >> >> >> >>> Yes, that is the idea.
> >> >> >> >>>
> >> >> >> >>> R
> >> >> >> >>>
> >> >> >> >>> On Sun, Apr 17, 2016 at 9:10 PM, Robert Logue <rp...@hotmail.co.uk> wrote:
> >> >> >> >>>> I am slightly confused about what I can use the data in those links for. Can I use this data with the training tool, like the following?
> >> >> >> >>>>
> >> >> >> >>>> opennlp TokenNameFinderTrainer -model OUTPUT_FILE_NAME -lang en -data DOWNLOADED_FILE_NAME -encoding UTF-8
> >> >> >> >>>>
> >> >> >> >>>> And should that give me a better model file for when I use the name finder?
> >> >> >> >>>>
> >> >> >> >>>> Thanks,
> >> >> >> >>>>
> >> >> >> >>>> Robert
> >> >> >> >>>>
> >> >> >> >>>>> From: rodrigo.agerri@ehu.eus
> >> >> >> >>>>> Date: Fri, 15 Apr 2016 17:12:20 +0200
> >> >> >> >>>>> Subject: Re: Name finder questions
> >> >> >> >>>>> To: users@opennlp.apache.org
> >> >> >> >>>>>
> >> >> >> >>>>> Hi Robert,
> >> >> >> >>>>>
> >> >> >> >>>>> On Fri, Apr 15, 2016 at 10:25 AM, Robert Logue <rp...@hotmail.co.uk> wrote:
> >> >> >> >>>>>> Hello,
> >> >> >> >>>>>>
> >> >> >> >>>>>> I have just started using OpenNLP in a Java application. I am just getting used to the software and have a couple of newbie questions.
> >> >> >> >>>>>>
> >> >> >> >>>>>> I see for the name finder there is different model data for people and organizations (en-ner-organization.bin and en-ner-person.bin). Is there any way to combine these into one file so I can do 1 search that will give me back person names and organization names. Or is this not possible and is it best to do two searches?
> >> >> >> >>>>> This used to be experimental. It is not anymore; that is, you can train
> >> >> >> >>>>> a name finder model for more than one entity type. The models
> >> >> >> >>>>> available were trained on rather old newswire data, so I would
> >> >> >> >>>>> recommend that you train new models using OpenNLP:
> >> >> >> >>>>>
> >> >> >> >>>>> http://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.tool
> >> >> >> >>>>>
> >> >> >> >>>>> I suppose you do not have manually annotated training data, so I would
> >> >> >> >>>>> recommend getting the OntoNotes corpus.
> >> >> >> >>>>>
> >> >> >> >>>>> https://catalog.ldc.upenn.edu/LDC2013T19
> >> >> >> >>>>>
> >> >> >> >>>>> https://github.com/ontonotes/conll-formatted-ontonotes-5.0
> >> >> >> >>>>>
> >> >> >> >>>>> Another option is to get a silver standard corpus obtained
> >> >> >> >>>>> automatically from Wikipedia:
> >> >> >> >>>>>
> >> >> >> >>>>> http://schwa.org/projects/resources/wiki/Wikiner#Automatic-training-data-from-Wikipedia
> >> >> >> >>>>>
> >> >> >> >>>>> For Dutch, Spanish, German and Italian (that I know of) there are free
> >> >> >> >>>>> resources. Search for Ancora, SONAR-1, GermEval 2014 and Evalita 2009.
> >> >> >> >>>>>
> >> >> >> >>>>>> This question isn't related to the name finder and I don't think it is possible but thought I would ask anyway. If I had two sentences say 'Jack climbed the hill. He was very tired.' Is there any way to know that the pronoun, he, at the start of the second sentence is actually about Jack the subject of the first sentence? I know in this simple case it is obvious but I am wondering if there is anything in the OpenNLP software that will help with this?
> >> >> >> >>>>> The example you mentioned is called "pronominal anaphora", and it
> >> >> >> >>>>> generalizes to the coreference resolution problem. There used to be a
> >> >> >> >>>>> coreference tool in OpenNLP, but it was moved to the Sandbox because many
> >> >> >> >>>>> things need to be updated before it can be distributed.
> >> >> >> >>>>>
> >> >> >> >>>>> See http://conll.cemantix.org/2012/introduction.html for more details.
> >> >> >> >>>>>
> >> >> >> >>>>> HTH,
> >> >> >> >>>>>
> >> >> >> >>>>> R
> >> >> >> >>
> >> >> >> >
> >> >> >>
> >> >> >> --
> >> >> >> Dr. Thomas Zastrow
> >> >> >> Rechenzentrum Garching (RZG) der Max-Planck-Gesellschaft (MPG)
> >> >> >> Gießenbachstr. 2, D-85748 Garching bei München, Germany
> >> >> >> Tel +49-89-3299-1457
> >> >> >> http://www.rzg.mpg.de
> >> >> >>
> >> >> >
> >> >
> >

Re: Name finder questions

Posted by Rodrigo Agerri <ra...@apache.org>.
Hi,

It is much easier to try with a corpus that is already available. The
links I sent are about Named Entities, and they all contain persons,
locations and organizations. The idea is to obtain (one of) those corpora
and convert it to the OpenNLP format to train a new model. If that does not
work for you (e.g., the output is very bad) then maybe you could
consider annotating your own data. But that takes time.

HTH,

R


RE: Name finder questions

Posted by Robert Logue <rp...@hotmail.co.uk>.
I sure did, thanks. I was mainly unsure whether these would work as well for sports specifically, or whether it would be best to make my own.

I may have missed something, but it is also unclear to me what the files are for, i.e. whether they are model files. The ones I downloaded and looked at seemed to be POS tagged rather than named entity tagged. Maybe my inexperience is making me miss something?

Thanks,
Robert



> From: ragerri@apache.org
> Date: Mon, 25 Apr 2016 15:43:23 +0200
> Subject: Re: Name finder questions
> To: users@opennlp.apache.org
> 
> Did you look at the links I sent in a previous email?
> 
> R
> 
> On Mon, Apr 25, 2016 at 3:10 PM, Robert Logue <rp...@hotmail.co.uk> wrote:
> > The area I would be looking in would be sports and the only things I would be interested in would be the 3 things I mentioned ie
> >
> > People, Organizations and Location
> >
> > Do you think there is existing corpora that would cover this? Or would there be benefit in creating my own?
> >
> > Thanks,
> > Robert
> >
> >> From: ragerri@apache.org
> >> Date: Mon, 25 Apr 2016 09:39:48 +0200
> >> Subject: Re: Name finder questions
> >> To: users@opennlp.apache.org
> >>
> >> Hi Robert,
> >>
> >> Performance varies a lot, and that is still the subject of research.
> >> Basically, more data always helps, but depending on the type of data,
> >> number of entity types, etc., the quantity required differs. If you
> >> need to tag persons, locations and organizations on news or similar
> >> text genre I recommend you to use one of the already existing corpora
> >> and avoid tagging your own data.
> >>
> >> Which genre are you interested in?
> >>
> >> R
> >>
> >> On Fri, Apr 22, 2016 at 10:31 AM, Robert Logue <rp...@hotmail.co.uk> wrote:
> >> > Very useful, thank you.
> >> >
> >> > Only question I have left now, for the moment, is on performance. The minimum recommend number of sentences is 15,000 does anyone know how much this would need to be increased to before it would, maybe it never would, become a performance issue? So if I created training data with 100,000 sentences would this be an issue? Is there any number I could go to where it would be an issue?
> >> >
> >> > Thanks,
> >> >
> >> > Robert
> >> >
> >> >> Subject: Re: Name finder questions
> >> >> To: users@opennlp.apache.org
> >> >> From: post@thomas-zastrow.de
> >> >> Date: Fri, 22 Apr 2016 10:22:50 +0200
> >> >>
> >> >> Here you can find raw data I used to create a German model, maybe its
> >> >> useful for you:
> >> >>
> >> >> http://www.thomas-zastrow.de/nlp/
> >> >>
> >> >> ("Raw trainingdata in OpenNLP format")
> >> >>
> >> >>
> >> >> Am 22.04.2016 um 10:17 schrieb Robert Logue:
> >> >> > Can anyone help here? I don't want to start creating a large training file and find out I have gone about it in the wrong way.
> >> >> >
> >> >> > The resources I have been looking at are
> >> >> >
> >> >> > https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.training
> >> >> > http://blog.thedigitalgroup.com/sagarg/2015/10/30/open-nlp-name-finder-model-training/
> >> >> > http://nishutayaltech.blogspot.co.uk/2015/07/writing-custom-namefinder-model-in.html
> >> >> >
> >> >> > None of which gives the answers I am looking for.
> >> >> >
> >> >> > Thanks,
> >> >> >
> >> >> > Robert
> >> >> >
> >> >> >> From: rplogue@hotmail.co.uk
> >> >> >> To: users@opennlp.apache.org
> >> >> >> Subject: RE: Name finder questions
> >> >> >> Date: Wed, 20 Apr 2016 09:51:25 +0100
> >> >> >>
> >> >> >> I have a few questions regarding creating my own training data for the name finder. I would like to distinguish between people, organizations and locations. The example in the documentation shows the tags to use for people ie
> >> >> >>
> >> >> >> <START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .So would I used <START:organization><END> and <START:location><END> for organizations and locations respectively? The name entity guidelines in the documentation ie
> >> >> >>
> >> >> >> https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.annotation_guides
> >> >> >>
> >> >> >> seem to show different tags getting used which has confused me slightly as to which tags I should actually use?
> >> >> >>
> >> >> >> Also I see the 15,000 line recommendation is there any performance hit if you use many more lines?
> >> >> >>
> >> >> >> If I create my plain text training file as I outlined above is there any other params that are recommended to use beyond the basic ie
> >> >> >>
> >> >> >> opennlp TokenNameFinderTrainer -model OUTPUT_FILE.bin -lang en -data TRAINING_FILE.train -encoding UTF-8
> >> >> >>
> >> >> >> For instance what is the -params training parameters file used for? Is this necessary should this list the named entities I am looking for ie person, organization and location if so what format should it be in?
> >> >> >>
> >> >> >> Sorry for the basic questions here but kind find the answers in the documentation or from a quick google.
> >> >> >>
> >> >> >> Thanks,
> >> >> >>
> >> >> >> Robert
> >> >> >>
> >> >> >>
> >> >> >>> From: rodrigo.agerri@ehu.eus
> >> >> >>> Date: Mon, 18 Apr 2016 09:36:24 +0200
> >> >> >>> Subject: Re: Name finder questions
> >> >> >>> To: users@opennlp.apache.org
> >> >> >>>
> >> >> >>> Hello,
> >> >> >>>
> >> >> >>> Yes, that is the idea.
> >> >> >>>
> >> >> >>> R
> >> >> >>>
> >> >> >>> On Sun, Apr 17, 2016 at 9:10 PM, Robert Logue <rp...@hotmail.co.uk> wrote:
> >> >> >>>> I am slightly confused what I can use the data in those links for? So can I use this data with the training tool like the following
> >> >> >>>>
> >> >> >>>> opennlp TokenNameFinderTrainer -model OUTPUT_FILE_NAME -lang en
> >> >> >>>> -data DOWNLOADED_FILE_NAME -encoding UTF-8
> >> >> >>>> And that should give me a better model file for when I use the name finder?
> >> >> >>>>
> >> >> >>>> Thanks,
> >> >> >>>>
> >> >> >>>> Robert
> >> >> >>>>
> >> >> >>>>> From: rodrigo.agerri@ehu.eus
> >> >> >>>>> Date: Fri, 15 Apr 2016 17:12:20 +0200
> >> >> >>>>> Subject: Re: Name finder questions
> >> >> >>>>> To: users@opennlp.apache.org
> >> >> >>>>>
> >> >> >>>>> Hi Robert,
> >> >> >>>>>
> >> >> >>>>> On Fri, Apr 15, 2016 at 10:25 AM, Robert Logue <rp...@hotmail.co.uk> wrote:
> >> >> >>>>>> Hello,
> >> >> >>>>>>
> >> >> >>>>>> I have just started using OpenNLP in the java application. I am just getting my used with the software and have a couple of newbie questions.
> >> >> >>>>>>
> >> >> >>>>>> I see for the name finder there is different model data for people and organizations (en-ner-organization.bin and en-ner-person.bin). Is there any way to combine these into one file so I can do 1 search that will give me back person names and organization names. Or is this not possible and is it best to do two searches?
> >> >> >>>>> This used to be experimental. It is not anymore, namely, you can train
> >> >> >>>>> a name finder model for more than one entity type. The models
> >> >> >>>>> available were trained with rather old newswire data so I would
> >> >> >>>>> recommend you to obtain train new models using OpenNLP:
> >> >> >>>>>
> >> >> >>>>> http://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.tool
> >> >> >>>>>
> >> >> >>>>> I suppose you do not have manually annotated training data so I could
> >> >> >>>>> recommend to get the Ontonotes corpus.
> >> >> >>>>>
> >> >> >>>>> https://catalog.ldc.upenn.edu/LDC2013T19
> >> >> >>>>>
> >> >> >>>>> https://github.com/ontonotes/conll-formatted-ontonotes-5.0
> >> >> >>>>>
> >> >> >>>>> Another option is to get a silver standard corpus obtained
> >> >> >>>>> automatically from the Wikipedia:
> >> >> >>>>>
> >> >> >>>>> http://schwa.org/projects/resources/wiki/Wikiner#Automatic-training-data-from-Wikipedia
> >> >> >>>>>
> >> >> >>>>> For Dutch, Spanish, German and Italian (that I know of) there are free
> >> >> >>>>> resources. Search for Ancora, SONAR-1, GermEval 2014 and Evalita 2009.
> >> >> >>>>>
> >> >> >>>>>> This question isn't related to the name finder and I don't think it is possible but thought I would ask anyway. If I had two sentences say 'Jack climbed the hill. He was very tired.' Is there any way to know that the pronoun, he, at the start of the second sentence is actually about Jack the subject of the first sentence? I know in this simple case it is obvious but I am wondering if there is anything in the OpenNLP software that will help with this?
> >> >> >>>>> The example you mentioned is called "pronominal anaphora" and it
> >> >> >>>>> generalizes to the coreference resolution problem. There used to be a
> >> >> >>>>> coreference tool in OpenNLP, but it was moved to the Sandbox because many
> >> >> >>>>> things need to be updated before it can be distributed.
> >> >> >>>>>
> >> >> >>>>> See http://conll.cemantix.org/2012/introduction.html for more details.
> >> >> >>>>>
> >> >> >>>>> HTH,
> >> >> >>>>>
> >> >> >>>>> R
> >> >> >>
> >> >> >
> >> >>
> >> >> --
> >> >> Dr. Thomas Zastrow
> >> >> Rechenzentrum Garching (RZG) der Max-Planck-Gesellschaft (MPG)
> >> >> Gießenbachstr. 2, D-85748 Garching bei München, Germany
> >> >> Tel +49-89-3299-1457
> >> >> http://www.rzg.mpg.de
> >> >>
> >> >
> >

Re: Name finder questions

Posted by Rodrigo Agerri <ra...@apache.org>.
Did you look at the links I sent in a previous email?

R

On Mon, Apr 25, 2016 at 3:10 PM, Robert Logue <rp...@hotmail.co.uk> wrote:
> The area I am interested in is sports, and the only entity types I need are the three I mentioned, i.e.
>
> People, Organizations and Locations
>
> Do you think there are existing corpora that would cover this? Or would there be a benefit in creating my own?
>
> Thanks,
> Robert
>
>> From: ragerri@apache.org
>> Date: Mon, 25 Apr 2016 09:39:48 +0200
>> Subject: Re: Name finder questions
>> To: users@opennlp.apache.org
>>
>> Hi Robert,
>>
>> Performance varies a lot, and that is still the subject of research.
>> Basically, more data always helps, but depending on the type of data,
>> number of entity types, etc., the quantity required differs. If you
>> need to tag persons, locations and organizations in news or a similar
>> text genre, I recommend that you use one of the existing corpora
>> and avoid tagging your own data.
>>
>> Which genre are you interested in?
>>
>> R
>>
>> On Fri, Apr 22, 2016 at 10:31 AM, Robert Logue <rp...@hotmail.co.uk> wrote:
>> > Very useful, thank you.
>> >
>> > The only question I have left for the moment is on performance. The minimum recommended number of sentences is 15,000. Does anyone know how far this could be increased before it became a performance issue, if it ever would? For example, if I created training data with 100,000 sentences, would that be an issue? Is there any size at which it would be?
>> >
>> > Thanks,
>> >
>> > Robert
>> >
>> >> Subject: Re: Name finder questions
>> >> To: users@opennlp.apache.org
>> >> From: post@thomas-zastrow.de
>> >> Date: Fri, 22 Apr 2016 10:22:50 +0200
>> >>
>> >> Here you can find the raw data I used to create a German model; maybe it's
>> >> useful for you:
>> >>
>> >> http://www.thomas-zastrow.de/nlp/
>> >>
>> >> ("Raw trainingdata in OpenNLP format")
>> >>
>> >>
>> >> Am 22.04.2016 um 10:17 schrieb Robert Logue:
>> >> > Can anyone help here? I don't want to start creating a large training file and find out I have gone about it in the wrong way.
>> >> >
>> >> > The resources I have been looking at are
>> >> >
>> >> > https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.training
>> >> > http://blog.thedigitalgroup.com/sagarg/2015/10/30/open-nlp-name-finder-model-training/
>> >> > http://nishutayaltech.blogspot.co.uk/2015/07/writing-custom-namefinder-model-in.html
>> >> >
>> >> > None of which gives the answers I am looking for.
>> >> >
>> >> > Thanks,
>> >> >
>> >> > Robert
>> >> >
>> >> >> From: rplogue@hotmail.co.uk
>> >> >> To: users@opennlp.apache.org
>> >> >> Subject: RE: Name finder questions
>> >> >> Date: Wed, 20 Apr 2016 09:51:25 +0100
>> >> >>
>> >> >> I have a few questions about creating my own training data for the name finder. I would like to distinguish between people, organizations and locations. The example in the documentation shows the tags to use for people, i.e.
>> >> >>
>> >> >> <START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
>> >> >>
>> >> >> So would I use <START:organization><END> and <START:location><END> for organizations and locations respectively? The named entity guidelines in the documentation, i.e.
>> >> >>
>> >> >> https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.annotation_guides
>> >> >>
>> >> >> seem to show different tags being used, which has confused me slightly as to which tags I should actually use.
>> >> >>
>> >> >> Also, I see the 15,000 line recommendation. Is there any performance hit if you use many more lines?
>> >> >>
>> >> >> If I create my plain text training file as outlined above, are there any other params that are recommended beyond the basic ones, i.e.
>> >> >>
>> >> >> opennlp TokenNameFinderTrainer -model OUTPUT_FILE.bin -lang en -data TRAINING_FILE.train -encoding UTF-8
>> >> >>
>> >> >> For instance, what is the -params training parameters file used for? Is it necessary? Should it list the named entities I am looking for, i.e. person, organization and location, and if so, what format should it be in?
>> >> >>
>> >> >> Sorry for the basic questions, but I can't find the answers in the documentation or from a quick Google search.
>> >> >>
>> >> >> Thanks,
>> >> >>
>> >> >> Robert
>> >> >>
>> >> >>
>> >> >>> From: rodrigo.agerri@ehu.eus
>> >> >>> Date: Mon, 18 Apr 2016 09:36:24 +0200
>> >> >>> Subject: Re: Name finder questions
>> >> >>> To: users@opennlp.apache.org
>> >> >>>
>> >> >>> Hello,
>> >> >>>
>> >> >>> Yes, that is the idea.
>> >> >>>
>> >> >>> R
>> >> >>>
>> >> >>> On Sun, Apr 17, 2016 at 9:10 PM, Robert Logue <rp...@hotmail.co.uk> wrote:
>> >> >>>> I am slightly confused about what I can use the data in those links for. Can I use this data with the training tool like the following?
>> >> >>>>
>> >> >>>> opennlp TokenNameFinderTrainer -model OUTPUT_FILE_NAME -lang en -data DOWNLOADED_FILE_NAME -encoding UTF-8
>> >> >>>>
>> >> >>>> And should that give me a better model file for when I use the name finder?
>> >> >>>>
>> >> >>>> Thanks,
>> >> >>>>
>> >> >>>> Robert
>> >> >>>>
>> >> >>>>> From: rodrigo.agerri@ehu.eus
>> >> >>>>> Date: Fri, 15 Apr 2016 17:12:20 +0200
>> >> >>>>> Subject: Re: Name finder questions
>> >> >>>>> To: users@opennlp.apache.org
>> >> >>>>>
>> >> >>>>> Hi Robert,
>> >> >>>>>
>> >> >>>>> On Fri, Apr 15, 2016 at 10:25 AM, Robert Logue <rp...@hotmail.co.uk> wrote:
>> >> >>>>>> Hello,
>> >> >>>>>>
>> >> >>>>>> I have just started using OpenNLP in a Java application. I am just getting used to the software and have a couple of newbie questions.
>> >> >>>>>>
>> >> >>>>>> I see that for the name finder there is different model data for people and organizations (en-ner-organization.bin and en-ner-person.bin). Is there any way to combine these into one file so that I can do one search that gives me back both person names and organization names? Or is this not possible, and is it best to do two searches?
>> >> >>>>> This used to be experimental. It is not anymore: you can train
>> >> >>>>> a name finder model for more than one entity type. The models
>> >> >>>>> available were trained on rather old newswire data, so I would
>> >> >>>>> recommend that you train new models using OpenNLP:
>> >> >>>>>
>> >> >>>>> http://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.tool
>> >> >>>>>
>> >> >>>>> I suppose you do not have manually annotated training data, so I would
>> >> >>>>> recommend getting the OntoNotes corpus.
>> >> >>>>>
>> >> >>>>> https://catalog.ldc.upenn.edu/LDC2013T19
>> >> >>>>>
>> >> >>>>> https://github.com/ontonotes/conll-formatted-ontonotes-5.0
>> >> >>>>>
>> >> >>>>> Another option is to get a silver standard corpus obtained
>> >> >>>>> automatically from Wikipedia:
>> >> >>>>>
>> >> >>>>> http://schwa.org/projects/resources/wiki/Wikiner#Automatic-training-data-from-Wikipedia
>> >> >>>>>
>> >> >>>>> For Dutch, Spanish, German and Italian (that I know of) there are free
>> >> >>>>> resources. Search for Ancora, SONAR-1, GermEval 2014 and Evalita 2009.
>> >> >>>>>
>> >> >>>>>> This question isn't related to the name finder and I don't think it is possible but thought I would ask anyway. If I had two sentences say 'Jack climbed the hill. He was very tired.' Is there any way to know that the pronoun, he, at the start of the second sentence is actually about Jack the subject of the first sentence? I know in this simple case it is obvious but I am wondering if there is anything in the OpenNLP software that will help with this?
>> >> >>>>> The example you mentioned is called "pronominal anaphora" and it
>> >> >>>>> generalizes to the coreference resolution problem. There used to be a
>> >> >>>>> coreference tool in OpenNLP, but it was moved to the Sandbox because many
>> >> >>>>> things need to be updated before it can be distributed.
>> >> >>>>>
>> >> >>>>> See http://conll.cemantix.org/2012/introduction.html for more details.
>> >> >>>>>
>> >> >>>>> HTH,
>> >> >>>>>
>> >> >>>>> R
>> >> >>
>> >> >
>> >>
>> >> --
>> >> Dr. Thomas Zastrow
>> >> Rechenzentrum Garching (RZG) der Max-Planck-Gesellschaft (MPG)
>> >> Gießenbachstr. 2, D-85748 Garching bei München, Germany
>> >> Tel +49-89-3299-1457
>> >> http://www.rzg.mpg.de
>> >>
>> >
>

RE: Name finder questions

Posted by Robert Logue <rp...@hotmail.co.uk>.
The area I am interested in is sports, and the only entity types I need are the three I mentioned, i.e.

People, Organizations and Locations

Do you think there are existing corpora that would cover this? Or would there be a benefit in creating my own?

Thanks,
Robert


Re: Name finder questions

Posted by Rodrigo Agerri <ra...@apache.org>.
Hi Robert,

Performance varies a lot, and that is still the subject of research.
Basically, more data always helps, but depending on the type of data,
number of entity types, etc., the quantity required differs. If you
need to tag persons, locations and organizations in news or a similar
text genre, I recommend that you use one of the existing corpora
and avoid tagging your own data.
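
Before deciding whether more annotation is needed, it can help to see how many sentences and how many examples of each entity type a training file already contains. The following is only a minimal sketch: it assumes one tokenized sentence per line with `<START:type> ... <END>` markers, the helper name `summarize_training_data` is hypothetical (it is not part of OpenNLP), the sample sentences are invented, and the 15,000 figure is simply the guideline mentioned in this thread.

```python
import re
from collections import Counter

# Opening marker of an annotated span, e.g. "<START:person>".
START_RE = re.compile(r"<START:(\w+)>")

# Guideline figure mentioned in this thread, not a hard limit.
RECOMMENDED_MIN_SENTENCES = 15000


def summarize_training_data(lines):
    """Count non-empty sentences and annotated entities per type.

    Hypothetical helper for inspecting a training file; not an OpenNLP API.
    """
    sentences = 0
    per_type = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines; they are not counted as sentences
        sentences += 1
        per_type.update(START_RE.findall(line))
    return sentences, per_type


# Invented sample data, for illustration only.
sample = [
    "<START:person> Jack <END> climbed the hill .",
    "",
    "<START:person> Jill <END> plays for <START:organization> Acme United <END> .",
    "The match was held in <START:location> Glasgow <END> .",
]

sentences, per_type = summarize_training_data(sample)
print(sentences, dict(per_type))
if sentences < RECOMMENDED_MIN_SENTENCES:
    print("fewer sentences than the guideline figure")
```

To check a real file, the same function could be called on an open file handle, for example `summarize_training_data(open("TRAINING_FILE.train", encoding="utf-8"))`.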

Which genre are you interested in?

R


RE: Name finder questions

Posted by Robert Logue <rp...@hotmail.co.uk>.
Very useful, thank you.

The only question I have left for the moment is on performance. The minimum recommended number of sentences is 15,000. Does anyone know how far this could be increased before it became a performance issue, if it ever would? For example, if I created training data with 100,000 sentences, would that be an issue? Is there any size at which it would be?

Thanks,

Robert 


Re: Name finder questions

Posted by Thomas Zastrow <po...@thomas-zastrow.de>.
Here you can find the raw data I used to create a German model; maybe it's
useful for you:

http://www.thomas-zastrow.de/nlp/

("Raw trainingdata in OpenNLP format")
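
For readers unsure what such a file should look like, here is a rough sketch of a sanity check for a single training sentence. It assumes the `<START:type> ... <END>` markup quoted elsewhere in this thread, with whitespace around each marker; `parse_annotated_sentence` is a hypothetical helper rather than anything shipped with OpenNLP, and the example sentence is invented.

```python
import re

# One annotated span, e.g. "<START:person> Pierre Vinken <END>".
SPAN_RE = re.compile(r"<START:(\w+)>\s+(.*?)\s+<END>")
OPEN_RE = re.compile(r"<START:\w+>")


def parse_annotated_sentence(line):
    """Return (entity_type, text) pairs found in one training sentence.

    Raises ValueError when the number of <START:...> and <END> markers
    differs, which usually points to an annotation mistake.
    """
    opens = len(OPEN_RE.findall(line))
    closes = line.count("<END>")
    if opens != closes:
        raise ValueError(
            "unbalanced markers: %d <START:...>, %d <END>" % (opens, closes)
        )
    return SPAN_RE.findall(line)


# Invented sentence mixing three entity types in one line.
sentence = (
    "<START:person> Jack <END> joined <START:organization> Acme United <END> "
    "in <START:location> Glasgow <END> ."
)
print(parse_annotated_sentence(sentence))
```

Running a check like this over every line before training is one way to catch a missing `<END>` early, rather than after a large file has been annotated.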


Am 22.04.2016 um 10:17 schrieb Robert Logue:
> Can anyone help here? I don't want to start creating a large training file and find out I have gone about it in the wrong way.
>
> The resources I have been looking at are
>
> https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.training
> http://blog.thedigitalgroup.com/sagarg/2015/10/30/open-nlp-name-finder-model-training/
> http://nishutayaltech.blogspot.co.uk/2015/07/writing-custom-namefinder-model-in.html
>
> None of which gives the answers I am looking for.
>
> Thanks,
>
> Robert

-- 
Dr. Thomas Zastrow
Rechenzentrum Garching (RZG) der Max-Planck-Gesellschaft (MPG)
Gießenbachstr. 2, D-85748 Garching bei München, Germany
Tel +49-89-3299-1457
http://www.rzg.mpg.de


RE: Name finder questions

Posted by Robert Logue <rp...@hotmail.co.uk>.
Can anyone help here? I don't want to start creating a large training file and find out I have gone about it in the wrong way.

The resources I have been looking at are

https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.training
http://blog.thedigitalgroup.com/sagarg/2015/10/30/open-nlp-name-finder-model-training/
http://nishutayaltech.blogspot.co.uk/2015/07/writing-custom-namefinder-model-in.html

None of which gives the answers I am looking for.

Thanks,

Robert


RE: Name finder questions

Posted by Robert Logue <rp...@hotmail.co.uk>.
I have a few questions regarding creating my own training data for the name finder. I would like to distinguish between people, organizations and locations. The example in the documentation shows the tags to use for people, i.e.

<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .

So would I use <START:organization><END> and <START:location><END> for organizations and locations respectively? The named entity annotation guidelines in the documentation, i.e.

https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.annotation_guides

seem to show different tags being used, which has confused me slightly as to which tags I should actually use.
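
For illustration, here is a minimal sketch of what a multi-type training line could look like, assuming the same <START:type> ... <END> convention carries over to labels such as organization and location. The sample sentence, the class name AnnotatedSentence and the method entities are made up for this example, and the regular expression is only a quick sanity check on a line, not OpenNLP's own parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP: lists the annotated spans in one training line.
final class AnnotatedSentence {
    // Matches "<START:type> tokens <END>"; group 1 is the type, group 2 the entity text.
    // Assumes tags are separated from tokens by whitespace, as in the documentation example.
    private static final Pattern ENTITY =
        Pattern.compile("<START:([^>\\s]+)>\\s+(.+?)\\s+<END>");

    /** Returns one "type=text" entry per annotated span found in the line. */
    static List<String> entities(String line) {
        List<String> found = new ArrayList<>();
        Matcher m = ENTITY.matcher(line);
        while (m.find()) {
            found.add(m.group(1) + "=" + m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up sentence mixing three entity types on one line.
        String line = "<START:person> Pierre Vinken <END> , 61 years old , will join "
            + "<START:organization> Elsevier N.V. <END> in "
            + "<START:location> Amsterdam <END> .";
        System.out.println(entities(line));
    }
}
```

Running a check like this over a training file before training is a cheap way to spot misspelled type labels or a missing <END>.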

Also, I see the 15,000 line recommendation. Is there any performance hit if you use many more lines?

If I create my plain text training file as outlined above, are there any other parameters that are recommended beyond the basic ones, i.e.

opennlp TokenNameFinderTrainer -model OUTPUT_FILE.bin -lang en -data TRAINING_FILE.train -encoding UTF-8

For instance, what is the -params training parameters file used for? Is it necessary? Should it list the named entities I am looking for, i.e. person, organization and location? If so, what format should it be in?
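
As a hedged sketch of the file format in question: to my understanding the -params file is a plain Java properties file holding machine learning settings (keys such as Algorithm, Iterations and Cutoff), while the entity types come from the <START:type> tags in the training data rather than from this file. The key names and values below are assumptions to check against the OpenNLP version in use, and the class name TrainingParamsSketch is made up for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical illustration: reads properties-style text like the contents of a -params file.
final class TrainingParamsSketch {
    /** Parses key=value lines into a Properties object. */
    static Properties parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            // Reading from an in-memory string should not fail; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return p;
    }

    public static void main(String[] args) {
        // Assumed example contents; note that no entity types are listed here.
        String text = "Algorithm=MAXENT\n"
            + "Iterations=100\n"
            + "Cutoff=5\n";
        Properties p = parse(text);
        System.out.println("Algorithm:  " + p.getProperty("Algorithm"));
        System.out.println("Iterations: " + p.getProperty("Iterations"));
        System.out.println("Cutoff:     " + p.getProperty("Cutoff"));
    }
}
```

If that understanding is right, the file is optional and the trainer falls back to its defaults when -params is omitted.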

Sorry for the basic questions, but I can't find the answers in the documentation or from a quick Google search.

Thanks,

Robert



Re: Name finder questions

Posted by Rodrigo Agerri <ro...@ehu.eus>.
Hello,

Yes, that is the idea.

R

On Sun, Apr 17, 2016 at 9:10 PM, Robert Logue <rp...@hotmail.co.uk> wrote:
> I am slightly confused about what I can use the data in those links for. Can I use this data with the training tool, like the following?
>
> opennlp TokenNameFinderTrainer -model OUTPUT_FILE_NAME -lang en -data DOWNLOADED_FILE_NAME -encoding UTF-8
>
> And should that give me a better model file for when I use the name finder?
>
> Thanks,
>
> Robert

RE: Name finder questions

Posted by Robert Logue <rp...@hotmail.co.uk>.
I am slightly confused about what I can use the data in those links for. Can I use this data with the training tool, like the following?

opennlp TokenNameFinderTrainer -model OUTPUT_FILE_NAME -lang en -data DOWNLOADED_FILE_NAME -encoding UTF-8

And should that give me a better model file for when I use the name finder?

Thanks,

Robert


Re: Name finder questions

Posted by Rodrigo Agerri <ro...@ehu.eus>.
Hi Robert,

On Fri, Apr 15, 2016 at 10:25 AM, Robert Logue <rp...@hotmail.co.uk> wrote:
> Hello,
>
> I have just started using OpenNLP in a Java application. I am just getting used to the software and have a couple of newbie questions.
>
> I see for the name finder there is different model data for people and organizations (en-ner-organization.bin and en-ner-person.bin). Is there any way to combine these into one file so I can do 1 search that will give me back person names and organization names. Or is this not possible and is it best to do two searches?

This used to be experimental. It is not anymore: you can train
a name finder model for more than one entity type. The available
models were trained on rather old newswire data, so I would
recommend that you train new models using OpenNLP:

http://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.tool

I suppose you do not have manually annotated training data, so I would
recommend getting the OntoNotes corpus.

https://catalog.ldc.upenn.edu/LDC2013T19

https://github.com/ontonotes/conll-formatted-ontonotes-5.0

Another option is to get a silver standard corpus obtained
automatically from Wikipedia:

http://schwa.org/projects/resources/wiki/Wikiner#Automatic-training-data-from-Wikipedia

For Dutch, Spanish, German and Italian (that I know of) there are free
resources. Search for Ancora, SONAR-1, GermEval 2014 and Evalita 2009.
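
Corpora like these usually carry per-token labels rather than the <START:type> ... <END> markup that the name finder trainer reads, so some conversion is typically needed before training. Below is a minimal sketch, assuming a simplified input of (token, tag) pairs in a BIO-style scheme such as B-PER, I-PER and O. The real column layout differs from corpus to corpus, the class name BioToOpenNlp is made up for this example, and the OpenNLP manual is worth checking for built-in converters before writing your own.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical converter: BIO-tagged tokens to one line of <START:type> ... <END> markup.
final class BioToOpenNlp {
    /**
     * Converts one sentence given as {token, tag} pairs.
     * A tag is "O" (outside) or "B-TYPE" / "I-TYPE"; an "I-" tag with no open
     * entity of the same type is treated as the start of a new entity.
     */
    static String convert(List<String[]> tagged) {
        StringBuilder out = new StringBuilder();
        String current = null; // type of the entity currently open, or null
        for (String[] pair : tagged) {
            String token = pair[0];
            String tag = pair[1];
            // Extract TYPE from "B-TYPE" or "I-TYPE"; anything else counts as outside.
            String type = (tag.length() > 2 && tag.charAt(1) == '-') ? tag.substring(2) : null;
            boolean startsNew = type != null && (tag.startsWith("B-") || !type.equals(current));
            if (current != null && (type == null || startsNew)) {
                out.append("<END> "); // close the entity that was open
                current = null;
            }
            if (startsNew) {
                out.append("<START:").append(type).append("> ");
                current = type;
            }
            out.append(token).append(' ');
        }
        if (current != null) {
            out.append("<END>"); // sentence ended inside an entity
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        List<String[]> sentence = Arrays.asList(
            new String[] {"Jack", "B-PER"},
            new String[] {"climbed", "O"},
            new String[] {"the", "O"},
            new String[] {"hill", "O"},
            new String[] {".", "O"});
        System.out.println(convert(sentence));
    }
}
```

The type labels are passed through unchanged here; mapping them to names like person or organization would be a small extra step.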

> This question isn't related to the name finder and I don't think it is possible but thought I would ask anyway. If I had two sentences say 'Jack climbed the hill. He was very tired.' Is there any way to know that the pronoun, he, at the start of the second sentence is actually about Jack the subject of the first sentence? I know in this simple case it is obvious but I am wondering if there is anything in the OpenNLP software that will help with this?

The example you mentioned is called "pronominal anaphora", and it
generalizes to the coreference resolution problem. There used to be a
coreference tool in OpenNLP, but it was moved to the Sandbox because many
things need to be updated before it can be distributed.

See http://conll.cemantix.org/2012/introduction.html for more details.

HTH,

R