Posted to dev@stanbol.apache.org by Antonio Perez <ap...@zaizi.com> on 2013/09/18 12:28:36 UTC

[GSoC] [Update] Freebase Disambiguation in Stanbol

Hi All,

By the midterm evaluation, I had successfully implemented two tools needed
for the project:
- Wikilinks parser and TDB generator, plus a service to extract information
[1]: holds information about documents that contain references to
Freebase entities.
- Freebase to Graph Importer tool [2]: imports the Freebase data dump
into a Neo4j graph (for now) through the Tinkerpop Blueprints API.

Since then, I have implemented and tested a first version of the Freebase
entity disambiguation engine with the help of Rupert and Rafa.

The goal of the engine is to adjust the confidence values of Entity
Annotations (EAs) using the relations between the entities those EAs
reference in the Freebase graph generated with the Freebase to Graph
Importer tool.

The algorithm is based on the shortest distance in the graph between the
entities extracted from the text being enhanced.
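
To illustrate what "shortest distance between entities" means, here is a
minimal sketch. It is not the engine's code: it assumes an unweighted,
undirected graph held as a plain dictionary and uses a breadth-first
search, whereas the engine works on a Blueprints graph. The function name
and the node names are hypothetical placeholders.

```python
from collections import deque


def shortest_distance(graph, source, target):
    """Return the number of hops from source to target, or None if unreachable.

    `graph` maps each node to the set of its neighbours (assumption:
    unweighted and undirected).
    """
    if source == target:
        return 0
    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


# Toy graph; the node names are placeholders, not real Freebase ids.
toy_graph = {
    "paris_france": {"france"},
    "france": {"paris_france", "eiffel_tower"},
    "eiffel_tower": {"france"},
    "paris_texas": {"texas"},
    "texas": {"paris_texas"},
}

print(shortest_distance(toy_graph, "paris_france", "eiffel_tower"))  # 2
print(shortest_distance(toy_graph, "paris_texas", "eiffel_tower"))   # None
```

In this toy example the "Paris, France" candidate is two hops away from
the Eiffel Tower, while the "Paris, Texas" candidate is not connected to it
at all.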

The engine requires Entity Annotations produced by earlier engines, and an
Entityhub configured with Freebase entities (using the Freebase indexing
tool).

To keep the algorithm computationally manageable, a subgraph of the whole
graph is generated that contains only the entities extracted from the text
(and the relations between them). The shortest-path algorithm is applied to
this subgraph.
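
The subgraph step can be pictured as follows. This is only a sketch and
assumes the graph is a dictionary of neighbour sets. The helper name
`induced_subgraph` is hypothetical and is not taken from the engine.

```python
def induced_subgraph(graph, entities):
    """Keep only the given entities and the edges whose two endpoints are both kept."""
    keep = set(entities)
    return {node: graph.get(node, set()) & keep for node in keep}


# Toy data: "x" stands for an entity that was not extracted from the text.
full_graph = {
    "a": {"b", "x"},
    "b": {"a", "c"},
    "c": {"b"},
    "x": {"a"},
}

sub = induced_subgraph(full_graph, ["a", "b", "c"])
print(sub["a"])  # {'b'}  (the edge to "x" is gone)
```

One consequence of this reading is that paths running through entities
that were not extracted from the text are not considered, which is what
keeps the search small.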

Here is the algorithm in more detail:
1. Get the Entity Annotations for each Text Annotation and extract the
entities they reference.
2. At this point, we have several sets of entities (one set per Text
Annotation).
3. Generate a subgraph with these entities and their relations.
4. Generate all the possible solutions, i.e. all the possible combinations
that take one entity from each set. A possible solution tuple therefore
cannot contain two entities from the same Text Annotation (that would not
make sense).
5. Filter the possible solutions, i.e. entities that are isolated in the
graph are filtered out of the possible solutions.
6. For each remaining possible solution, calculate the shortest-path
distance for every pair of entities in the tuple. We are looking for the
tuple with the lowest distance between each pair of entities.
7. Normalize the distances and, for each entity, set the disambiguation
score to the highest normalized distance value found in the possible
solution tuples. That is, if an entity has several values (because it
belongs to several possible solutions), its disambiguation score is the
highest of them.
8. Modify the confidence value of each Entity Annotation using its old
confidence value and the new disambiguation score of the entity it
references.
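
Steps 4 to 8 can be sketched roughly as below. This is an illustration
under several assumptions, not the engine's implementation:
- a tuple containing an isolated entity is discarded (one reading of
step 5),
- a tuple with an unreachable pair is skipped,
- a tuple's total distance is turned into a score with 1 / (1 + total) and
divided by the best score, so the closest tuple gets 1.0,
- the new confidence is the plain average of the old confidence and the
score.

The function name, its parameters and the toy data are all hypothetical.

```python
from itertools import combinations, product


def disambiguate(candidate_sets, neighbours, distance, confidences):
    """Return new confidences per entity (assumed scoring, see lead-in)."""
    # Step 4: one entity per Text Annotation in every possible solution.
    solutions = product(*candidate_sets)
    # Step 5 (assumed reading): drop tuples that contain an isolated entity.
    solutions = [s for s in solutions if all(neighbours.get(e) for e in s)]
    # Step 6: sum the shortest-path distances over every pair in the tuple.
    totals = {}
    for s in solutions:
        dists = [distance(a, b) for a, b in combinations(s, 2)]
        if None not in dists:  # assumption: skip tuples with unreachable pairs
            totals[s] = sum(dists)
    # Step 7 (assumed normalisation): a shorter total gives a higher score,
    # and each entity keeps the highest score over the tuples it appears in.
    raw = {s: 1.0 / (1.0 + t) for s, t in totals.items()}
    best = max(raw.values(), default=1.0)
    scores = {}
    for s, r in raw.items():
        for e in s:
            scores[e] = max(scores.get(e, 0.0), r / best)
    # Step 8 (assumed combination): average old confidence and new score.
    return {e: (c + scores.get(e, 0.0)) / 2.0 for e, c in confidences.items()}


# Toy data; the names are placeholders, not real Freebase ids.
neighbours = {
    "paris_france": {"france"},
    "paris_texas": set(),  # isolated in the subgraph
    "france": {"paris_france"},
}
pairwise = {frozenset(["paris_france", "france"]): 1}


def distance(a, b):
    # Precomputed lookup standing in for a shortest-path search.
    return pairwise.get(frozenset([a, b]))


candidate_sets = [["paris_france", "paris_texas"], ["france"]]
confidences = {"paris_france": 0.5, "paris_texas": 0.5, "france": 0.8}
result = disambiguate(candidate_sets, neighbours, distance, confidences)
print(result)
```

With these toy values the confidence of the "Paris, France" candidate goes
up and that of the isolated "Paris, Texas" candidate goes down. The exact
numbers depend entirely on the assumed normalisation and averaging.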

The engine's source is hosted on GitHub [3].

The project contains a README file with instructions for running the
engine, but the steps are basically the following:
1. Generate the Freebase index from the Freebase data dump and configure a
Freebase site in Stanbol.
2. Build the Freebase Disambiguation Engine from the Maven project in the
GitHub repository with the command "mvn clean package".
3. Download the blueprints-core and blueprints-neo4j-graph projects and use
the new pom files (located in the src/main/resources folder of the engine
project) to build those dependencies as OSGi bundles.
4. Install these bundles and the Freebase Disambiguation Engine bundle
(gsoc-freebase-disambiguation-engine-0.0.1-SNAPSHOT.jar).
5. Configure a new chain using a new Entity Linking engine with Freebase and
the new engine (identified by the 'freebase-disambiguation' engine name).

At the moment, the Wikilinks information is not being used. In a second
version of the algorithm, it could be used to calculate a local
disambiguation score for each entity based on the current context, and then
both values (local score and algorithm score) could be used to refine the
confidence values.

Please have a look at the freebase-disambiguation engine and share any
suggestions for improvement.

The related JIRA issues are:
- WikiLinks Parser and TDB Generator:
https://issues.apache.org/jira/browse/STANBOL-1141
- Freebase to Graph Importer:
https://issues.apache.org/jira/browse/STANBOL-1140
- Freebase Disambiguation Algorithm:
https://issues.apache.org/jira/browse/STANBOL-1157

I'll upload the source code of the projects to the corresponding JIRA
issues.

Thanks for all the support you have given throughout the project.

Regards,
Antonio

[1] https://github.com/adperezmorales/gsoc-wikilinks/tree/master/gsoc-wikilinks
[2] https://github.com/adperezmorales/gsoc-freebase-graph-importer/tree/master/gsoc-freebase-graph-importer
[3] https://github.com/adperezmorales/gsoc-freebase-disambiguation-engine/tree/master/gsoc-freebase-disambiguation-engine
