Posted to dev@stanbol.apache.org by Rupert Westenthaler <ru...@gmail.com> on 2012/09/01 12:40:32 UTC

Re: Entity Disambiguation Engine

Hi all,

I have made a lot of progress with the Disambiguation Engine
(STANBOL-723) this week. So let me provide you with an update.


All work described in this mail takes place in the
"disambiguation-engine" branch [1]. So if you want to test the
features described in this mail you will need to check out this
branch.


### Disambiguation Engine

This Engine disambiguates existing Entity suggestions
(fise:EntityAnnotations with a dc:relation to a fise:TextAnnotation)
by modifying their fise:confidence values. It does not create any new
suggestions. So the Disambiguation Engine MUST be used in combination
with some other engine that suggests Entities managed by the Stanbol
Entityhub (NamedEntityTaggingEngine or KeywordLinkingEngine).

This Engine is based on the SimilarityConstraint [2] supported by the
FieldQuery interface implemented by the Stanbol Entityhub. The
implementation of this constraint is based on Solr MLT [3]. The Engine
can disambiguate with any Entityhub Site. By default the "full text
field" is used for the similarity. The Entityhub Site used to
disambiguate suggested Entities does not need to be configured, as the
fise:EntityAnnotations provide this information via the value of the
"entityhub:site" property [4]. This means that if you have an
Enhancement Chain that suggests Entities from different Entityhub
Sites, the Disambiguation Engine will be able to disambiguate Entities
from any of those sites.

The confidence of a disambiguated Entity is calculated by combining
the original confidence with the disambiguation score. For this a
user-configured ratio
'{disambiguation-weight}:{original-confidence-weight}' (default is
'2:1') is used.

The algorithm uses:

    dc := (oc * cw / (cw + dw)) + (ds * dw / (cw + dw))

    oc ... original-confidence [0..1]
    ds ... disambiguation-score [0..1]
    dc ... disambiguated-confidence [0..1]
    cw ... original-confidence-weight
    dw ... disambiguation-weight
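
To make the formula easier to check by hand, here is a minimal Python
sketch of it. The function name `combine_confidence` and its parameter
names are made up for this mail and are not taken from the Engine's
code; the default weights follow the '2:1' ratio mentioned above.

```python
def combine_confidence(oc, ds, dw=2.0, cw=1.0):
    """Weighted mean of the original confidence and the disambiguation score.

    oc ... original-confidence [0..1]
    ds ... disambiguation-score [0..1]
    dw ... disambiguation-weight (default 2, from the default ratio '2:1')
    cw ... original-confidence-weight (default 1)
    """
    if dw < 0 or cw < 0 or dw + cw == 0:
        raise ValueError("weights must be non-negative and not both zero")
    total = cw + dw
    return (oc * cw / total) + (ds * dw / total)


# A suggestion with an original confidence of 0.6 and a disambiguation
# score of 0.9 ends up at roughly 0.8 with the default 2:1 ratio.
print(combine_confidence(0.6, 0.9))
```

Because both weights are divided by their sum, the result always stays
between the original confidence and the disambiguation score.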

Notes:

* If none of the suggestions for a fise:TextAnnotation is found by the
Disambiguation Engine, the confidences of those suggestions are
currently not modified
* The Disambiguation Engine currently ignores all fise:TextAnnotations
with only a single suggestion
* Currently the Disambiguation Engine cannot be configured. However
this will change in the near future.
* No updates to the semantic contexts. The Engine uses all
'fise:selected-text' values of other fise:TextAnnotations within a
window of 100 characters surrounding the currently processed
fise:TextAnnotation.
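
The last note can be illustrated with a small Python sketch. It reads
the note as "within a window of 100 characters" and assumes the window
extends 100 characters to each side of the current mention; both are
assumptions of this sketch. `Mention` and `context_for` are
hypothetical names and do not exist in Stanbol.

```python
from typing import List, NamedTuple


class Mention(NamedTuple):
    """Hypothetical stand-in for a fise:TextAnnotation."""
    start: int          # character offset where the mention starts
    end: int            # character offset where the mention ends
    selected_text: str  # value of fise:selected-text


def context_for(current: Mention, mentions: List[Mention],
                window: int = 100) -> List[str]:
    """Collect the selected-text of all other mentions inside the window."""
    lower = current.start - window
    upper = current.end + window
    return [
        m.selected_text
        for m in mentions
        if m is not current and m.start >= lower and m.end <= upper
    ]


apple = Mention(0, 5, "Apple")
california = Mention(28, 38, "California")
# "California" lies within 100 characters of "Apple", so it becomes
# part of the context used to disambiguate "Apple".
print(context_for(apple, [apple, california]))
```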


### Stanbol Enhancer UI

In the disambiguation branch I implemented a lot of improvements to
the Web UI of the Stanbol Enhancer as the current UI (in the trunk
version) was not able to visualize disambiguation results.

Most importantly, the new version shows multiple entries for
fise:TextAnnotations with the same "fise:selected-text" if there is a
different set of suggested entities. In addition the new interface
shows additional metadata (mentions, occurrence, confidence) and lists
all mentions if an entity was found several times in the text (with
the same list of suggested entities).

### KeywordLinkingEngine

The KeywordLinkingEngine in this branch calculates matches slightly
differently from the version in the trunk. The main differences are

* Only "processable" Tokens are counted as matches. "Processable" are
only Tokens that are Nouns, or - if no POS tagging is available or the
confidence of the POS tag is too low - all Tokens that are equal to or
longer than the configured "Min Token Length".
* No restriction on the minimum number of matching tokens relative
to the overall number of tokens in the matched Label.
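
The "processable" rule from the first point can be sketched in Python
as follows. This is only an illustration of the rule as described in
this mail, not the Engine's code: `is_processable` is a hypothetical
name, the threshold values are placeholders, and treating every POS
tag starting with "N" as a noun is an assumption that fits Penn
Treebank style tags.

```python
from typing import Optional


def is_processable(token: str,
                   pos_tag: Optional[str] = None,
                   pos_confidence: float = 0.0,
                   min_pos_confidence: float = 0.8,  # placeholder value
                   min_token_length: int = 3) -> bool:  # placeholder value
    """Decide whether a token may be counted as a match."""
    # A POS tag is only trusted if it is present and confident enough.
    if pos_tag is not None and pos_confidence >= min_pos_confidence:
        # Assumption: noun tags start with "N" (e.g. NN, NNS, NNP).
        return pos_tag.startswith("N")
    # Fallback without a usable POS tag: rely on the token length.
    return len(token) >= min_token_length


print(is_processable("Jaguar", "NNP", 0.95))  # confident noun tag
print(is_processable("eat", "VB", 0.95))      # confident non-noun tag
print(is_processable("an"))                   # no tag, too short
```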

Both those changes improve the performance of the engines with
configurations that do allow a lot of Entities to match (e.g. when
setting the "Minimum Found Tokens" to 1). While those configurations
are not typical in current settings they do become much more desirable
assuming that a DisambiguationEngine post-processes results.


### Default Configuration

The disambiguation branch also provides a modified default
configuration. This configuration adds the Disambiguation Engine to
the default chain and also provides an additional Enhancement Chain
with the name "dbpedia-keyword-disambiguation". The modified default
chain just adds the DisambiguationEngine at the end of the default
"langdetect, ner, dbpediaLinking" chain. The
"dbpedia-keyword-disambiguation" chain is intended to validate the
performance of the Disambiguation Engine, as it uses a configuration
of the KeywordLinkingEngine that suggests up to 20 Entities and only
requires a single Token to match.

NOTE that in both cases disambiguation is based on Solr MLT queries
over the short abstract of DBpedia entities. It is planned to provide
other vocabularies with better disambiguation contexts (see also
section "Managing "Shallow KB"s with the Stanbol Entityhub" in the
previous mail of this thread).

### Testing the Branch

This section explains how to test the changes of this branch.

The following steps are required (current Stanbol users might have
already completed 1. and 2.)

1. check out the Apache Stanbol trunk

    svn co http://svn.apache.org/repos/asf/incubator/stanbol/trunk/ stanbol-trunk

2. build the Stanbol trunk

    cd stanbol-trunk
    export MAVEN_OPTS="-Xmx512M -XX:MaxPermSize=128M"
    mvn clean install

3. check out the disambiguation branch

    cd ..
    svn co http://svn.apache.org/repos/asf/incubator/stanbol/branches/disambiguation-engine/ stanbol-disambiguation

4. build the Stanbol disambiguation branch

    cd stanbol-disambiguation
    mvn clean install

5. build the full launcher of the stanbol-trunk a 2nd time - this will
now use/add the modified bundles of the stanbol-disambiguation branch
that were installed to the local repository as part of step (4).

    cd ..
    cd stanbol-trunk/launchers/full
    mvn clean install

6. run the full launcher

    cd target
    java -Xmx1024m -XX:MaxPermSize=256m -jar org.apache.stanbol.launchers.full-0.10.0-incubating-SNAPSHOT.jar

7. Install the bigger DBpedia Index available at [5] by copying it to
"{stanbol-working-dir}/stanbol/datafiles". While this is not
required it is still recommended, as the bigger index contains much
more Entities and is therefore much better suited to test
disambiguation. However, note that the examples in (8) do also work
with the small index included by the Stanbol launcher.

8. Try the disambiguation Engine at

    http://localhost:8080/enhancer/chain/dbpedia-keyword-disambiguation

using texts like

    "Apple is a company based in California"
    "A Jaguar would not eat an Apple"
    "I am impressed by the performance of Jaguar in this years F1 season."
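
If you prefer a script over the Web UI, the texts can also be sent to
the chain with an HTTP POST. The following Python sketch uses only the
standard library. The helper names `build_request` and `enhance` are
made up for this mail, and the "text/turtle" Accept header is an
assumption; change it to whatever RDF serialization your installation
returns.

```python
import urllib.request

CHAIN_URL = "http://localhost:8080/enhancer/chain/dbpedia-keyword-disambiguation"


def build_request(text: str, accept: str = "text/turtle") -> urllib.request.Request:
    """Build a POST request that sends plain text to the enhancement chain."""
    return urllib.request.Request(
        CHAIN_URL,
        data=text.encode("utf-8"),
        headers={
            "Content-Type": "text/plain; charset=UTF-8",
            "Accept": accept,  # assumed serialization, adjust as needed
        },
        method="POST",
    )


def enhance(text: str) -> str:
    """Send the text to the chain and return the response body."""
    with urllib.request.urlopen(build_request(text)) as response:
        return response.read().decode("utf-8")


# Needs a running launcher from step (6):
# print(enhance("A Jaguar would not eat an Apple"))
```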

If you want more details, open the Stanbol log file
({stanbol-working-dir}/stanbol/logs/error.log) and look for log
messages of the
"org.apache.stanbol.enhancer.engine.disambiguation.mlt.DisambiguatorEngine"
component.

For each disambiguated fise:TextAnnotation the following messages are logged

1. "Use Window: '{window}'" - the text of the window
2. "Query '{site-name}' for {selected-text}@{language} with context
'{context}'" - the Entityhub {site-name}, the {selected-text} of the
EntityAnnotation as well as the {context} extracted from {window}
3. "disambiguate {label}: " with the results in the following lines.

    " - not found {uri}" means that this Entity was returned by the
Solr MLT query, but was not part of the suggested Entities
    " - found {uri} origConf:{oc}, disScore:{ds}, disConf:{dc}" if an
entity was disambiguated
    " - none found" in case none of the MLT results match the
suggestions.
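
To collect the "found" lines from the log for a quick comparison, a
small Python sketch like the following could be used. It assumes the
log lines follow the pattern quoted above and that the three values
are printed as plain numbers; `parse_found` is a hypothetical helper
and the sample line is made up, not copied from a real log.

```python
import re

# Pattern for: " - found {uri} origConf:{oc}, disScore:{ds}, disConf:{dc}"
FOUND_LINE = re.compile(
    r"- found (?P<uri>\S+) origConf:(?P<oc>[^,\s]+), "
    r"disScore:(?P<ds>[^,\s]+), disConf:(?P<dc>[^,\s]+)"
)


def parse_found(line):
    """Return the values of a ' - found ...' log line, or None otherwise."""
    match = FOUND_LINE.search(line)
    if match is None:
        return None
    return {
        "uri": match["uri"],
        "origConf": float(match["oc"]),
        "disScore": float(match["ds"]),
        "disConf": float(match["dc"]),
    }


# Made-up sample line that follows the pattern above.
sample = (" - found http://dbpedia.org/resource/Apple_Inc. "
          "origConf:0.5, disScore:0.8, disConf:0.7")
print(parse_found(sample))
```

Lines such as " - not found {uri}" and " - none found" do not match
the pattern and yield None, so they can simply be skipped.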


Happy testing
Rupert Westenthaler


[1] http://svn.apache.org/repos/asf/incubator/stanbol/branches/disambiguation-engine/
[2] see STANBOL-202, STANBOL-589 and STANBOL-596
[3] http://wiki.apache.org/solr/MoreLikeThis
[4] http://incubator.apache.org/stanbol/docs/trunk/components/enhancer/enhancementstructure.html#fiseentityannotation

[5] http://dev.iks-project.eu/downloads/stanbol-indices/dbpedia-3.7/dbpedia.solrindex.zip

-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen