Posted to issues@opennlp.apache.org by "Mark Giaconia (JIRA)" <ji...@apache.org> on 2013/11/12 13:23:17 UTC
[jira] [Commented] (OPENNLP-615) GeoEntityLinker should score toponyms based on surrounding context via a model

    [ https://issues.apache.org/jira/browse/OPENNLP-615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820054#comment-13820054 ] 

Mark Giaconia commented on OPENNLP-615:
---------------------------------------

I committed an initial implementation of this capability; see here:
http://svn.apache.org/viewvc/opennlp/sandbox/apache-opennlp-addons/src/main/java/org/apache/opennlp/addons/tools/entitylinker/geoentitylinker/ModelBasedScorer.java?revision=1541016&view=markup

Any feedback is welcome. The basic use case is this:
1. The user trains a doccat model on their own data via the static methods in ModelBasedScorer. Each category is a country code, and each sample is a bag of words taken from N chars to the left and right of each country mention in all the documents passed in. The user only has to supply a list of document texts to build this model, because the CountryContext object finds the country references and the static methods do the rest.
2. The user adds the file location of the output model to the entitylinker properties file under the appropriate key.
3. The user runs the GeoEntityLinker, and the ModelBasedScorer populates each linked span's BaseLink score map with a score called countrymodel.
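
To make step 1 more concrete, here is a minimal, dependency-free sketch of how a sample could be built from the text around a country mention. It is not the code in ModelBasedScorer: the class ContextWindowSketch, the Sample type, and the extract method are hypothetical names, and the sketch assumes that the mention itself is left out of the sample and that word fragments cut off at the window edges are kept as they are.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContextWindowSketch {

    /** One training sample: a category (country code) plus the words found around a mention. */
    public static final class Sample {
        public final String countryCode;
        public final List<String> tokens;

        Sample(String countryCode, List<String> tokens) {
            this.countryCode = countryCode;
            this.tokens = tokens;
        }
    }

    /**
     * Collects one sample per occurrence of `mention` in `text`, using up to
     * `window` characters on each side of the mention (assumption: the mention
     * itself is excluded from the bag of words).
     */
    public static List<Sample> extract(String text, String mention, String countryCode, int window) {
        List<Sample> samples = new ArrayList<>();
        Matcher m = Pattern.compile(Pattern.quote(mention), Pattern.CASE_INSENSITIVE).matcher(text);
        while (m.find()) {
            // Clamp the window to the document boundaries.
            int start = Math.max(0, m.start() - window);
            int end = Math.min(text.length(), m.end() + window);
            String context = text.substring(start, m.start()) + " " + text.substring(m.end(), end);

            // Lowercase and split on non-word characters to get the bag of words.
            List<String> tokens = new ArrayList<>();
            for (String t : context.toLowerCase().split("\\W+")) {
                if (!t.isEmpty()) {
                    tokens.add(t);
                }
            }
            samples.add(new Sample(countryCode, tokens));
        }
        return samples;
    }

    public static void main(String[] args) {
        String doc = "Officials in Kabul said the talks with Pakistan would resume next week.";
        for (Sample s : extract(doc, "Pakistan", "PK", 20)) {
            // Prints the category followed by its context words, one sample per line.
            System.out.println(s.countryCode + " " + String.join(" ", s.tokens));
        }
    }
}
```

Each line printed this way has the shape "category word word ...", which is the general form a document categorizer is trained on. Because the window is measured in characters, the first and last words of a sample may be truncated.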

Here are some sample results for scored toponyms. The doccatmodelscore column is the doccat score for the entry. In these examples (as in most I have processed) the doccat model does a nice job of enriching the other, more quantitative scores.
latitude   longitude  locname   countryproxscore  geohashbinningscore  norm_lucene  doccatmodelscore  combscore
34.01347   71.56344   Pakistan  0.860             1                    0.977        0.887             3.72509
34.516895  69.147014  Kabul     1                 1                    1            0.830             3.830
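
In both rows the combscore value is consistent with a plain sum of the four component scores: 1 + 1 + 1 + 0.830 = 3.830 for Kabul, and 0.860 + 0.977 + 1 + 0.887 = 3.724 for Pakistan, which matches 3.72509 if the displayed components are rounded. This is an inference from the two rows, not something stated above, so the sketch below is only an illustration of that assumption; CombinedScoreSketch and combine are hypothetical names.

```java
public class CombinedScoreSketch {

    /** Assumed combination rule: an unweighted sum of the individual scores. */
    public static double combine(double... scores) {
        double sum = 0.0;
        for (double s : scores) {
            sum += s;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Component scores copied from the Kabul and Pakistan rows, in column order:
        // countryproxscore, geohashbinningscore, norm_lucene, doccatmodelscore.
        double kabul = combine(1.0, 1.0, 1.0, 0.830);
        double pakistan = combine(0.860, 1.0, 0.977, 0.887);
        System.out.println("Kabul: " + kabul);
        System.out.println("Pakistan: " + pakistan);
    }
}
```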

I will work on documentation in the next few days, and I am consolidating the Wikipedia entries for every country in the world to generate a prebuilt model for use in the GeoEntityLinker. It would be great if someone could test the whole addon at this point.



> GeoEntityLinker should score toponyms based on surrounding context via a model
> ------------------------------------------------------------------------------
>
>                 Key: OPENNLP-615
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-615
>             Project: OpenNLP
>          Issue Type: Sub-task
>          Components: Entity Linker
>    Affects Versions: 1.6.0
>            Reporter: Mark Giaconia
>            Assignee: Mark Giaconia
>             Fix For: 1.6.0
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> As per the concept in this paper http://www.jasonbaldridge.com/papers/speriosu-baldridge-acl2013.pdf
> the GeoEntityLinker addon should allow a user to score toponyms based on a model. For instance, if the gazetteer returns an ambiguous name associated with multiple countries, X and Y, then features should be generated from around the name, and those features should be used as a test set against a categorizer for the country returned, and a score generated.
> This functionality also implies the need for a rapid way to generate the models based on user-defined data, because the content surrounding country and location mentions is highly variable. Also, this method will be configurable in the GeoEntityLinker.
> The Sandbox contains a model-builder-prototype that I plan to use to generate the models based on user data and the countrycontext data that the GeoEntityLinker requires, which will make it easy to get started.



--
This message was sent by Atlassian JIRA
(v6.1#6144)