Posted to users@opennlp.apache.org by Madhav Sharan <ms...@usc.edu> on 2015/11/01 06:15:24 UTC

How is En Location name finder model trained?

Hello opennlp users,

I am facing an issue while extracting locations from file contents. Using
en-ner-location.bin, I can extract a location if it is capitalized, but
not otherwise.

*For example:*
  - I can extract "China" from "A geographically distributed network of *China*"
  - But not from "A geographically distributed network of *china*"

I already tried capitalizing every word in the text, but that makes matters
worse. So instead of trying more solutions based on my intuition, it would
be best if I could get help with the two questions below:

Can someone suggest an enhancement?
Can someone help me understand how the English location name finder model
is trained?
Location name finder model: en-ner-location.bin
<http://opennlp.sourceforge.net/models-1.5/en-ner-location.bin>

*What are we trying to do?*
We are building an open-source tool to extract locations from any file and
then visualize them on a map. These files will mostly come from web content
but can be anything a user wishes.

--
Thanks
Madhav Sharan

Re: How is En Location name finder model trained?

Posted by Madhav Sharan <ms...@usc.edu>.

Thanks for your reply Rodrigo. I will look into what you suggested.

--
Thanks
Madhav Sharan


On Mon, Nov 16, 2015 at 12:38 AM, Rodrigo Agerri <ra...@apache.org> wrote:

> Hello,
>
> I am not entirely sure, but I think the English NER models were trained
> on MUC 7 data. Note that supervised learning approaches to NLP in
> general suffer from the "domain adaptation problem". Basically, that
> means that you are deploying a model learned from one specific type
> of data on another type of data which is quite different. Performance
> degrades as a result.
>
> To improve your results, the best option is to train your own model (you
> need annotated data for that). If you do not have annotated data from your
> own domain, you can use a newer dataset such as OntoNotes and train
> your model with that data.
>
> Optionally, if you have types of locations which occur fairly
> regularly, you can also try the DictionaryNameFinder with lists of
> locations and the RegexNameFinder to create rules using regular
> expressions for location finding.
>
> HTH,
>
> Rodrigo

Re: How is En Location name finder model trained?

Posted by Rodrigo Agerri <ra...@apache.org>.

Hello,

I am not entirely sure, but I think the English NER models were trained
on MUC 7 data. Note that supervised learning approaches to NLP in
general suffer from the "domain adaptation problem". Basically, that
means that you are deploying a model learned from one specific type
of data on another type of data which is quite different. Performance
degrades as a result.

To improve your results, the best option is to train your own model (you
need annotated data for that). If you do not have annotated data from your
own domain, you can use a newer dataset such as OntoNotes and train
your model with that data.
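
For reference, the OpenNLP name finder is trained on plain text with one
whitespace-tokenized sentence per line, where each entity is wrapped in
<START:type> ... <END> markers. The sketch below is not from this thread; the
class and method names are hypothetical, and it only shows how such a line
could be produced from tokens and a known span. The annotated sentence is the
lowercase example from the original question.

```java
import java.util.StringJoiner;

/** Sketch: write one tokenized sentence in the name finder training format. */
final class NameSampleFormatter {

    /**
     * Wraps tokens[start, end) in <START:type> ... <END> markers.
     * Tokens are joined with single spaces, one sentence per line.
     */
    static String annotate(String[] tokens, int start, int end, String type) {
        if (start < 0 || end > tokens.length || start >= end) {
            throw new IllegalArgumentException("invalid span");
        }
        StringJoiner line = new StringJoiner(" ");
        for (int i = 0; i < tokens.length; i++) {
            if (i == start) {
                line.add("<START:" + type + ">");
            }
            line.add(tokens[i]);
            if (i == end - 1) {
                line.add("<END>");
            }
        }
        return line.toString();
    }

    public static void main(String[] args) {
        String[] tokens = "A geographically distributed network of china".split(" ");
        // Hypothetical annotation: the last token (index 5) is a location.
        System.out.println(annotate(tokens, 5, 6, "location"));
    }
}
```

Whether adding lowercase mentions like this one to the training data actually
helps is something you would have to check on your own data.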

Optionally, if you have types of locations which occur fairly
regularly, you can also try the DictionaryNameFinder with lists of
locations and the RegexNameFinder to create rules using regular
expressions for location finding.
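
To illustrate the idea behind those two suggestions, here is a minimal sketch
in plain Java. It does not use OpenNLP's DictionaryNameFinder or
RegexNameFinder classes; the class name, the location list, and the pattern
are all hypothetical. It only shows that a lowercased lookup list and a
case-insensitive pattern do not depend on capitalization.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: case-insensitive list lookup and regex rules for locations. */
final class SimpleLocationRules {

    /** Returns the tokens whose lowercased form appears in the gazetteer. */
    static List<String> findByDictionary(String[] tokens, Set<String> gazetteer) {
        List<String> hits = new ArrayList<>();
        for (String token : tokens) {
            if (gazetteer.contains(token.toLowerCase(Locale.ROOT))) {
                hits.add(token);
            }
        }
        return hits;
    }

    /** Returns every substring matched by the given pattern. */
    static List<String> findByRegex(String text, Pattern pattern) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical location list, stored lowercased.
        Set<String> gazetteer = Set.of("china", "india");
        String text = "A geographically distributed network of china";
        System.out.println(findByDictionary(text.split("\\s+"), gazetteer));

        // Hypothetical rule: "republic of <word>", ignoring case.
        Pattern rule = Pattern.compile("\\brepublic of \\w+", Pattern.CASE_INSENSITIVE);
        System.out.println(findByRegex("the republic of korea and the Republic of Chile", rule));
    }
}
```

This single-token lookup misses multi-word names such as "New York"; a real
list-based finder would have to match token sequences.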

HTH,

Rodrigo
