
Posted to commits@stanbol.apache.org by "Rupert Westenthaler (JIRA)" <ji...@apache.org> on 2011/06/10 08:35:58 UTC

[jira] [Created] (STANBOL-223) Entity Disambiguation based on Solr MLT

Entity Disambiguation based on Solr MLT
---------------------------------------

Key: STANBOL-223
URL: https://issues.apache.org/jira/browse/STANBOL-223
Project: Stanbol
Issue Type: New Feature
Components: Enhancer
Reporter: Rupert Westenthaler
Assignee: Rupert Westenthaler

In short:

The idea is to use sentences that link to an Entity in a dataset (e.g. Wikipedia) as context and compare them with the text surrounding an Entity extracted from the analyzed content. Solr More Like This (MLT) queries will be used for the ranking.

More details:

Sentences with occurrences of the Entity can be extracted by using https://github.com/ogrisel/pignlproc. Functionality will be added to output the results (entity -[0..*]-> "sentence") as N-Triples (an RDF serialization). This will allow this information to be indexed (together with all the other information about Entities) by using the indexing tools provided by the Stanbol Entityhub (e.g. entityhub/indexing/dbpedia).
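
The (entity -[0..*]-> "sentence") output described above could be serialized roughly as follows. This is only a minimal sketch, not the pignlproc implementation: the property URI and the function names are hypothetical, and the input is assumed to be a list of (entity URI, sentence) pairs.

```python
# Hypothetical property URI; the issue does not name the predicate to use.
OCCURRENCE_PROPERTY = "http://example.org/ontology/occurrence"

# Characters that must be escaped inside an N-Triples string literal.
_ESCAPES = {"\\": "\\\\", '"': '\\"', "\n": "\\n", "\r": "\\r", "\t": "\\t"}


def escape_literal(text):
    """Escape a sentence so it can be used as an N-Triples literal."""
    return "".join(_ESCAPES.get(ch, ch) for ch in text)


def to_ntriples(occurrences):
    """Yield one N-Triples line per (entity_uri, sentence) pair."""
    for entity_uri, sentence in occurrences:
        yield '<%s> <%s> "%s" .' % (
            entity_uri, OCCURRENCE_PROPERTY, escape_literal(sentence))


if __name__ == "__main__":
    pairs = [("http://dbpedia.org/resource/Paris",
              "Paris is the capital of France.")]
    for line in to_ntriples(pairs):
        print(line)
```

An Entity with several occurrences simply produces several lines with the same subject, which matches the 0..* cardinality.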

The following information will be used for EntityDisambiguation:

(1) TextAnnotations providing the label, the type as detected by the NLP framework, and the context of the extraction
(1b) In addition, links to other TextAnnotations about the same Entity could be used to extend the context
(2) A Solr index (ReferencedSite of the Stanbol Entityhub) providing at least the labels, types and occurrences of the Entities

EntityDisambiguation will filter candidates by label and type (filter query) and rank the selected Entities with a "More Like This" query that matches the context against the occurrences.
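
The filter-then-rank request could be assembled roughly like this. It is a sketch under several assumptions: that Solr's MoreLikeThisHandler is registered (for example at /mlt) and accepts the context as a content stream via stream.body, and that the index has fields named label, type and occurrences. Those field names and the function names are hypothetical.

```python
from urllib.parse import urlencode


def _phrase(value):
    """Quote a value as a Solr phrase, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_mlt_params(context, label, entity_type,
                     label_field="label", type_field="type",
                     occurrence_field="occurrences", rows=5):
    """Return request parameters for a More Like This query.

    The two fq parameters restrict the candidates to Entities with the
    given label and type; mlt.fl names the field holding the occurrences
    that the context is compared against.
    """
    return [
        ("stream.body", context),        # text surrounding the mention
        ("mlt.fl", occurrence_field),    # field used for the similarity
        ("mlt.mintf", "1"),              # short contexts: keep rare terms
        ("mlt.mindf", "1"),
        ("fq", "%s:%s" % (label_field, _phrase(label))),
        ("fq", "%s:%s" % (type_field, _phrase(entity_type))),
        ("fl", "id,score"),
        ("rows", str(rows)),
    ]


def build_mlt_query(context, label, entity_type):
    """Encode the parameters as a URL query string."""
    return urlencode(build_mlt_params(context, label, entity_type))
```

The resulting query string would be appended to the handler URL of the ReferencedSite's Solr core. Whether stream.body is enabled depends on the Solr configuration.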

A first prototype of this engine was implemented during the bbuzz Semantic Hackathon (http://berlinbuzzwords.de/wiki/semantic-nlp-hackathon) as a standalone EnhancementEngine that uses a separate Solr index for the MLT queries.

The plan is to implement this as an optional (configurable) feature of the existing ReferencedSiteEntityTaggingEnhancementEngine. Users will be able to activate/deactivate Entity disambiguation via the OSGi console if the required data are available for a ReferencedSite.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (STANBOL-223) Entity Disambiguation based on Solr MLT

Posted by "Olivier Grisel (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/STANBOL-223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13247196#comment-13247196 ] 

Olivier Grisel commented on STANBOL-223:
----------------------------------------

As an alternative to MLT, which computes an unnormalized similarity score that approximates the cosine similarity, one could re-rank the link candidates with a Jaccard coefficient. The coefficient would measure the overlap between the words of the document context and the words of each potential entity's description plus its past mentions, as found by the existing name lookup. The compared words could be restricted to co-occurring names or include any other words.

For instance, papers such as the following might be interesting to study:

  http://aclweb.org/anthology/P/P11/P11-1138.pdf
  http://liuchuan.org/pub/CS475.pdf
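
The Jaccard re-ranking suggested above can be sketched as follows. This is a minimal illustration and not taken from the cited papers: tokenization is a plain lowercase word split, all words are compared (not only names), and the candidate descriptions are assumed to be already available as a mapping from entity URI to text.

```python
import re


def tokens(text):
    """Lowercase a text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rerank(context, candidates):
    """Sort candidates by word overlap with the document context.

    candidates maps an entity URI to its description plus past
    mentions; the result is a list of (score, uri), best first.
    """
    ctx = tokens(context)
    scored = [(jaccard(ctx, tokens(text)), uri)
              for uri, text in candidates.items()]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    ranking = rerank(
        "The Eiffel Tower in Paris attracts visitors to France",
        {
            "http://dbpedia.org/resource/Paris":
                "Paris is the capital of France and home of the Eiffel Tower",
            "http://dbpedia.org/resource/Paris,_Texas":
                "Paris is a city in Lamar County Texas",
        })
    for score, uri in ranking:
        print("%.3f %s" % (score, uri))
```

Unlike the MLT score, the coefficient is bounded between 0 and 1, so scores for different mentions are directly comparable.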

Also, before using complex disambiguation logic such as the Jaccard coefficient or MLT, one should implement simpler approaches such as:

- Add a configuration option to the entity linking engine to perform exact name search only, both on the canonical labels from the Entityhub and on redirect names (for DBpedia only, could be stored as alternative names) or the mention expressions that carry a link as found in the Wikipedia dump (needs a dedicated extraction as explained above).

- Ad-hoc rules could also be interesting: if the name detected by OpenNLP is a first name (as indexed in the Entityhub, for instance), one could mark the name as ambiguous and skip its linking.
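
Both simple approaches fit in a few lines. The sketch below is an assumption about how they might look, not the entity linking engine's actual code: labels are compared case-insensitively after trimming, alternative names (redirects or link texts) are stored next to the canonical label, and the first-name list is a plain set.

```python
def build_label_index(entities):
    """Map each normalized label to the set of entity URIs carrying it.

    entities maps an entity URI to its labels: the canonical label
    plus any alternative names such as redirects.
    """
    index = {}
    for uri, labels in entities.items():
        for label in labels:
            index.setdefault(label.strip().lower(), set()).add(uri)
    return index


def link_mention(mention, label_index, first_names):
    """Return candidate URIs for a mention using exact name lookup.

    A mention that is a known first name is treated as ambiguous and
    is not linked at all.
    """
    key = mention.strip().lower()
    if key in first_names:
        return []
    return sorted(label_index.get(key, set()))


if __name__ == "__main__":
    index = build_label_index({
        "http://dbpedia.org/resource/Paris": ["Paris", "City of Light"],
        "http://dbpedia.org/resource/Paris,_Texas": ["Paris", "Paris, Texas"],
    })
    print(link_mention("Paris", index, {"john"}))
    print(link_mention("John", index, {"john"}))
```

Exact lookup still returns several candidates for "Paris", so it narrows the candidate set but does not replace a ranking step.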
                


[jira] [Updated] (STANBOL-223) Entity Disambiguation

Posted by "Rupert Westenthaler (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/STANBOL-223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rupert Westenthaler updated STANBOL-223:
----------------------------------------

    Description: 
Adding Disambiguation support to the Stanbol Enhancer includes the following points:

1. Dataset: For Disambiguation you need not only a set of Entities but also additional data used for the disambiguation
  * This might need some preprocessing of the data (e.g. using mentions of the entity in sentences; using data from linked Entities to create a context)
  * This data needs to be accessible to the Stanbol Enhancer (e.g. by using the Entityhub, a dedicated SolrIndex or even other means)

2. Deciding on possible algorithms
  * This issue already describes two possible algorithms (see below and the comments)

3. Workflow:
  a) Disambiguate while linking (basically you have the String "Paris" and the Sentence/Document as context and want to know if you should link to Paris, France or Paris, Texas)
  b) Disambiguate already linked Entities (you have 5 suggested Entities by two different Engines and you want to disambiguate (rank) them)

4. Validation of the Disambiguation: We need to compare enhancement quality with/without disambiguation
  * The Benchmarking (enhancer/benchmark) tool could be used for that
  * Question: How much time would be needed to create Benchmarking Examples?

5. What are the expected results?
  * implementation of one (maybe more) disambiguation algorithm(s)
  * integration into the Stanbol Enhancer as one or more EnhancementEngines
  * management of the data needed for disambiguation (e.g. as part of the Entityhub)
  * support (tools) for creating/extracting data needed for disambiguation
  * Validation results using the enhancer/benchmarking tool
  * Documentation on the Stanbol Webpage
  * Simple Web interface showing the improved enhancement results (I am thinking of a single text box for the input text and two enhancement results, one with and one without entity disambiguation)

Optional
  * integration of user feedback to enhance learning/validation set


Disambiguation based on Solr MLT
================================

The idea is to use sentences that link to an Entity in a dataset (e.g. Wikipedia) as context and compare them with the text surrounding an Entity extracted from the analyzed content. Solr More Like This (MLT) queries will be used for the ranking.
 
More details:

Sentences with occurrences of the Entity can be extracted by using https://github.com/ogrisel/pignlproc. Functionality will be added to output the results (entity -[0..*]-> "sentence") as N-Triples (an RDF serialization). This will allow this information to be indexed (together with all the other information about Entities) by using the indexing tools provided by the Stanbol Entityhub (e.g. entityhub/indexing/dbpedia).

The following information will be used for EntityDisambiguation:

(1) TextAnnotations providing the label, the type as detected by the NLP framework, and the context of the extraction
(1b) In addition, links to other TextAnnotations about the same Entity could be used to extend the context
(2) A Solr index (ReferencedSite of the Stanbol Entityhub) providing at least the labels, types and occurrences of the Entities

EntityDisambiguation will filter candidates by label and type (filter query) and rank the selected Entities with a "More Like This" query that matches the context against the occurrences.

A first prototype of this engine was implemented during the bbuzz Semantic Hackathon (http://berlinbuzzwords.de/wiki/semantic-nlp-hackathon) as a standalone EnhancementEngine that uses a separate Solr index for the MLT queries.

The plan is to implement this as an optional (configurable) feature of the existing ReferencedSiteEntityTaggingEnhancementEngine. Users will be able to activate/deactivate Entity disambiguation via the OSGi console if the required data are available for a ReferencedSite.



         Labels: gsoc2012  (was: )
        Summary: Entity Disambiguation  (was: Entity Disambiguation based on Solr MLT)
    
