Posted to commits@stanbol.apache.org by "Cristian Petroaca (JIRA)" <ji...@apache.org> on 2014/03/29 17:03:15 UTC

[jira] [Updated] (STANBOL-1279) Named Entity co-reference resolution engine based on yago/dbpedia contextual information

     [ https://issues.apache.org/jira/browse/STANBOL-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cristian Petroaca updated STANBOL-1279:
---------------------------------------

    Description: 
Develop an enhancement engine that performs co-reference resolution of Named Entities in a given text. The co-references are noun phrases that refer to those Named Entities by carrying a minimal set of attributes which match contextual information about that Named Entity from dbpedia/yago (the yago rdf:type plus the dbpedia properties that describe location and function; more on this below).

Take the following text as an example: "Microsoft has posted its 2013 earnings. The software company did better than expected. ... The Redmond-based company will hire 500 new developers this year."
The enhancement engine will link "Microsoft" with "The software company" and "The Redmond-based company".

The steps needed to extract the co-references are described below.

Named Entity extraction 
================== 
Extract all Named Entities from the given text. If there are no Named Entities then the process stops here.

Noun Phrases extraction 
===================
Select all noun phrases after the first Named Entity that have:
+ a determiner POS that implies a reference to an entity local to the text (such as "the", "this", "these"), but not one such as "another" or "every", which implies a reference to an entity outside of the text.
+ at least one other noun, aside from the main required noun, that further describes it. For example, I will not count "The company" as a legitimate candidate, since this could create a lot of false positives because of the double meaning of some words, as in "in the company of good people".
	
This step should have different logic implemented for different languages.
This step ensures good recall.
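
The two selection rules above can be sketched in Python as follows. This is only an illustration for English, not the engine's implementation: it assumes each noun phrase arrives as a list of (token, POS) pairs with Penn Treebank-style tags, and the names LOCAL_DETERMINERS and is_candidate are hypothetical.

```python
# Hypothetical sketch of the noun phrase selection rules (English only).
# A phrase is a list of (token, pos) pairs. The Penn Treebank-style tags
# (DT = determiner, NN* = noun) are an assumption about the POS tagger.

LOCAL_DETERMINERS = {"the", "this", "these"}  # illustrative, not exhaustive


def is_candidate(phrase):
    """Return True if the phrase passes both selection rules."""
    if not phrase:
        return False
    first_token, first_pos = phrase[0]
    # Rule 1: the phrase must start with a determiner that points to an
    # entity local to the text.
    if first_pos != "DT" or first_token.lower() not in LOCAL_DETERMINERS:
        return False
    # Rule 2: besides the main noun there must be at least one other noun.
    nouns = [token for token, pos in phrase if pos.startswith("NN")]
    return len(nouns) >= 2


if __name__ == "__main__":
    print(is_candidate([("The", "DT"), ("software", "NN"), ("company", "NN")]))
    print(is_candidate([("The", "DT"), ("company", "NN")]))
```

Whether a modifier such as "Redmond-based" counts as a noun depends on how the tagger labels it; the sketch does not try to settle that.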
	
Noun Phrases matching
===================
This step tries to match the previously selected noun phrases to the Named Entities from step 1 and establish the co-references.
For every noun phrase the following rules will be applied:

Yago:class matching
--------------------------
For each NER prior to the current noun phrase in the text, match the yago:class label against the contents of the noun phrase. If there are no matches, drop the current noun phrase.
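
A rough Python sketch of this rule is shown below. It assumes class labels are underscore-separated strings such as "Software_companies_of_the_United_States", and it uses a crude plural-stripping helper as a stand-in for real lemmatization. All function names and the sample data are hypothetical.

```python
# Hypothetical sketch: match yago class labels against a noun phrase.
# normalize() is a crude stand-in for a real lemmatizer (English only).

def normalize(word):
    """Lowercase a word and strip a simple plural suffix."""
    word = word.lower()
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def class_overlap(np_nouns, class_label):
    """Return the normalized nouns shared by the phrase and the label."""
    label_words = {normalize(w) for w in class_label.split("_")}
    return {normalize(n) for n in np_nouns} & label_words


def matching_entities(np_nouns, entity_classes):
    """Keep the entities that have at least one overlapping class label.

    entity_classes maps an entity name to a list of class labels.
    An empty result means the noun phrase is dropped.
    """
    return [
        name
        for name, labels in entity_classes.items()
        if any(class_overlap(np_nouns, label) for label in labels)
    ]


if __name__ == "__main__":
    # Made-up sample data, not real yago output.
    classes = {
        "Microsoft": ["Software_companies_of_the_United_States"],
        "Redmond": ["Cities_in_Washington"],
    }
    print(matching_entities(["software", "company"], classes))
```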

Group membership rules
-------------------------------

+ Spatial membership : the noun phrase is part of a LOCATION. 
If the noun phrase contains a LOCATION or a demonym then check any location properties of the matching NER. These properties will be part of a generic ontology. For clarity I will describe the dbpedia extracted properties which will be aligned to this generic ontology.

If the matching NER is a:
    - person, match against :birthPlace, :region, :nationality
    - organisation, match against :foundationPlace, :locationCity, :location, :hometown
    - place, match against :country, :subdivisionName, :location.

Example: The Italian President, The Richmond-based company
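
The spatial rule could look roughly like the Python sketch below. The property names are the ones listed above; the entity record layout, the demonym table and the function name spatial_match are assumptions made for illustration, and the sample records are made-up data rather than real dbpedia output.

```python
# Hypothetical sketch of the spatial membership rule.

SPATIAL_PROPERTIES = {
    "person": ["birthPlace", "region", "nationality"],
    "organisation": ["foundationPlace", "locationCity", "location", "hometown"],
    "place": ["country", "subdivisionName", "location"],
}

# Illustrative demonym -> location table; a real one would be much larger.
DEMONYMS = {"italian": "Italy"}


def spatial_match(np_location, entity):
    """Check whether a location (or demonym) found in the noun phrase
    equals any spatial property value of the entity."""
    location = DEMONYMS.get(np_location.lower(), np_location)
    properties = entity.get("properties", {})
    for prop in SPATIAL_PROPERTIES.get(entity["type"], []):
        if location in properties.get(prop, []):
            return True
    return False


if __name__ == "__main__":
    # Made-up sample records.
    company = {"type": "organisation", "properties": {"locationCity": ["Redmond"]}}
    president = {"type": "person", "properties": {"nationality": ["Italy"]}}
    print(spatial_match("Redmond", company))
    print(spatial_match("Italian", president))
```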

+ Organisational membership : the NER is part of an ORGANISATION. 
If the noun phrase contains an ORGANISATION then check the following properties of the matching NER. These properties will be part of a generic ontology. For clarity I will describe the dbpedia extracted properties which will be aligned to this generic ontology.

If the matching NER is a:
    - person, match against :occupation, :associatedActs
    - organisation : no dbpedia properties to match
    - location : no dbpedia properties to match

Example: The Microsoft executive, The Pink Floyd singer

Functional description rules
-----------------------------------
The noun phrase describes what the NER does conceptually.
If there are no NERs in the noun phrase, match the following properties of the matching NER against the contents of the noun phrase (aside from the nouns which are part of the yago:class):

   If NER is a:
   - person : no dbpedia properties to match
   - organisation, match against :service, :industry, :genre
   - location : no dbpedia properties to match

Example:  The software company.
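
As a hedged illustration, the functional description rule might be sketched in Python like this. It assumes the nouns already matched by the yago:class step are passed in separately so they can be excluded; the record layout, the function name functional_match and the sample values are hypothetical.

```python
# Hypothetical sketch of the functional description rule.

FUNCTION_PROPERTIES = {
    "person": [],
    "organisation": ["service", "industry", "genre"],
    "location": [],
}


def functional_match(np_nouns, class_nouns, entity):
    """Match the nouns not already covered by the yago:class against
    the function-describing property values of the entity."""
    remaining = {n.lower() for n in np_nouns} - {n.lower() for n in class_nouns}
    properties = entity.get("properties", {})
    for prop in FUNCTION_PROPERTIES.get(entity["type"], []):
        for value in properties.get(prop, []):
            # Compare word by word so multi-word values can still match.
            if remaining & set(value.lower().split()):
                return True
    return False


if __name__ == "__main__":
    # Made-up sample record, not real dbpedia output.
    company = {"type": "organisation", "properties": {"industry": ["Computer software"]}}
    print(functional_match(["software", "company"], ["company"], company))
```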


The noun phrase matching step as a whole is designed to filter out all bad co-references and ensure good precision.
	
As an additional note, if multiple named entities can match a certain noun phrase, link the noun phrase with the closest named entity prior to it in the text.
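
The tie-breaking rule can be sketched in Python as below, assuming every matching named entity mention is available with its character offset in the text. The function name closest_prior_entity and the tuple layout are hypothetical.

```python
# Hypothetical sketch: pick the closest matching entity before the phrase.

def closest_prior_entity(np_offset, candidates):
    """Return the name of the matching entity mentioned closest before
    the noun phrase, or None if no candidate precedes it.

    candidates is a list of (entity_name, char_offset) pairs.
    """
    prior = [(name, offset) for name, offset in candidates if offset < np_offset]
    if not prior:
        return None
    # The largest offset below np_offset is the nearest preceding mention.
    return max(prior, key=lambda item: item[1])[0]


if __name__ == "__main__":
    mentions = [("Microsoft", 0), ("Apple", 60), ("Google", 150)]
    print(closest_prior_entity(100, mentions))
```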

  was:
Develop an enhancement engine that will perform co-reference resolution of Named Entities in a given text. The co-references will be noun phrases which refer to those Named Entities by having a minimal set of attributes which match contextual information (yago rdf:type and dbpedia spatial and object function giving info - more on this below) from dbpedia/yago for that Named Entity.

We have the following text as an example : "Microsoft has posted its 2013 earnings. The software company did better than expected. ... The Redmond-based company will hire 500 new developers this year."
The enhancement engine will link "Microsoft" with "The software company" and "The Redmond-based company".

The mechanism for performing the resolution is described below:

If we have any Named Entities in the text then :

1. Select all noun phrases after the first Named Entity that have:
	a. a determiner POS that implies a reference to an entity local to the text (such as "the", "this", "these"), but not one such as "another" or "every", which implies a reference to an entity outside of the text.
	b. at least one other noun, aside from the main required noun, that further describes it. For example, I will not count "The company" as a legitimate candidate, since this could create a lot of false positives because of the double meaning of some words, as in "in the company of good people".
	
	This step ensures good recall.
	
2. Match any noun phrase selected above with all Named Entities prior to it in the text.
   
   The core matching mechanism gets all nouns in the noun phrase and compares them with the yago rdf:type of the Named Entity. For example, we will compare "software company" in the example above with any yago rdf:type for "Microsoft", which in our case will contain the category "Software_companies_of_the_United_States". Based on the result with the most matches we can create a confidence level and link the noun phrase with the best matched named entity.
   
   Before the matching is done, the yago rdf:type values need to be lemmatized and POS-tagged so that plural form mismatches are avoided (as can be seen from the example above) and non-noun words such as prepositions are ignored. At the moment it is unclear to me how best to make this happen.
   
   Besides the core matching mechanism, we will also have the following types of matches:
        a. Spatial - if a noun phrase contains a Location entity then we can also match any spatial dbpedia attributes in the Named Entity such as dbpedia-owl:locationCity for Organizations or dbpedia-owl:birthPlace, dbpedia-owl:region for Persons and dbpedia-owl:country for Locations.
	    b. Based on what function they have - check the given nouns against the function describing properties in dbpedia such as : dbpedia-owl:profession, dbpedia-owl:occupation for Persons or dbpedia-owl:industry, dbpprop:services for Organizations.
		
		For both of these types of matches we first need to have the main noun of the noun phrase be matched with the rdf:type from yago.
		
		
	This step is designed to filter out all bad co-references and ensure good precision.
	
As an additional note if there are multiple named entities which can match a certain noun phrase then link the noun phrase with the closest named entity prior to it in the text.


> Named Entity co-reference resolution engine based on yago/dbpedia contextual information
> ----------------------------------------------------------------------------------------
>
>                 Key: STANBOL-1279
>                 URL: https://issues.apache.org/jira/browse/STANBOL-1279
>             Project: Stanbol
>          Issue Type: New Feature
>          Components: Enhancement Engines
>            Reporter: Cristian Petroaca
>              Labels: co-reference, dbpedia, entity, named, yago
>


