Posted to commits@stanbol.apache.org by rw...@apache.org on 2013/06/10 07:29:06 UTC

svn commit: r1491336 - /stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext

Author: rwesten
Date: Mon Jun 10 05:29:06 2013
New Revision: 1491336

URL: http://svn.apache.org/r1491336
Log:
STANBOL-1100: changed all mentions of the 'prop' property to 'prob'. STANBOL-1070: Added Documentation for the LinkinStateAware extension point. Also fixed some wrong property names

Modified:
    stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext

Modified: stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext
URL: http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext?rev=1491336&r1=1491335&r2=1491336&view=diff
==============================================================================
--- stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext (original)
+++ stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext Mon Jun 10 05:29:06 2013
@@ -20,7 +20,7 @@ The Linking Process consists of three ma
 
 ### Token Types
 
-The KeywordLinkingEngine operates based on tokens (words). Those tokens are divided in the following Categories
+The EntityLinkingEngine operates based on tokens (words). Those tokens are divided in the following Categories
 
 * __Linkable Tokens__: This are words that are linked with the Vocabulary. This means that the engine will issue quires in the controlled vocabulary for those tokens
 * __Matchable Tokens__: Matchable tokens are used to refine quires. For the matching of entity labels with the text those words are treated in the same way as linkable words. So the main difference is that matchable words alone will not cause the engine to query for Entities in the Controlled Vocabulary.
@@ -38,7 +38,7 @@ In addition to the token type the engine
 
 ### Consumed NLP Processing Results:
 
-The KeywordLinkingEngine consumes NLP processing results from the AnalyzedText ContentPart of the processed ContentItem. The following list describes the consumed information and their usage in the linking process: 
+The EntityLinkingEngine consumes NLP processing results from the AnalyzedText ContentPart of the processed ContentItem. The following list describes the consumed information and their usage in the linking process: 
 
 1. __Language_ _(required)_: The Language of the Text is acquired from the Metadata of the ContentItem. It is required to search for labels in the correct language and also to correctly apply language specific configurations of the engine.
 2. __Sentences__ _(optional)_: Sentence annotations are used as segments for the matching process. In addition for the first word of an Sentence the _Upper Case_ feature is NOT set. In the case that no Sentence Annotations are present the whole text is treated as a single Sentence.
@@ -128,7 +128,7 @@ This specifies that all Languages other 
 
 Values MUST BE parsed as Array or Vector. This is done by using the ["elem1","elem2",...] syntax as defined by OSGI ".config" files. The following example shows the two above examples combined to a single configuration.
 
-    org.apache.stanbol.enhancer.engines.keywordextraction.processedLanguages=["!fr","!it","de","en","*"]
+    enhancer.engines.linking.processedLanguages=["!fr","!it","de","en","*"]
 
 
 __2. Language specific Parameter Configuration__
@@ -141,7 +141,7 @@ In addition to specifying the processed 
 
 The first line sets the parameter for {language}. The 2nd and 3rd line show that either the wildcard language '*' or the empty language '' can be used to configure parameters that are used as defaults for all languages. 
 
-The following param-names are supported by the KeywordLinkingEngine
+The following param-names are supported by the EntityLinkingEngine
 
 __Phrase level Parameters:__
 
@@ -162,20 +162,20 @@ NOTE: that tokens are linked if any of "
 
 __Examples:__
 
-The default configuration for the KeywordLinkingEngine uses the following setting
+The default configuration for the EntityLinkingEngine uses the following setting
 
-    *;lmmtip;uc=LINK;prop=0.75;pprob=0.75
+    *;lmmtip;uc=LINK;prob=0.75;pprob=0.75
     de;uc=MATCH
     es;lc=Noun
     nl;lc=Noun
 
 The first line enable _Link Multiple Matchable Tokens in Phrases_ and linking of upper case tokens for all languages. In addition it sets the minimum probabilities for Pos- and Phrase annotations to 0.75 (what would be also the default). The following three lines provide additional language specific defaults. For German the upper case mode is reset to MATCH as in German all Nouns use upper case. For Spain and Dutch linking for the LexicalCategory Noun is enabled. This is because the OpenNLP POS tagger for those languages does not support ProperNoun's and therefore the Engine would not link any tokens if _Link ProperNouns only_ is enabled. The same configuration in the OSGI '.config' file syntax would look like follows
 
-    org.apache.stanbol.enhancer.engines.keywordextraction.processedLanguages=["*;lmmtip;uc\=LINK;prop\=0.75;pprob\=0.75","de;uc\=MATCH","es;lc\=Noun","nl;lc\=Noun"]
+    enhancer.engines.linking.processedLanguages=["*;lmmtip;uc\=LINK;prob\=0.75;pprob\=0.75","de;uc\=MATCH","es;lc\=Noun","nl;lc\=Noun"]
 
 The 2nd example shows how to define default settings without using the wildcard '*' that would enable processing of all languages. The following example shows an configuration that only enables English and ignores text in all other languages.
 
-    ;lmmtip;uc=LINK;prop=0.75;pprob=0.75
+    ;lmmtip;uc=LINK;prob=0.75;pprob=0.75
     en
     de;uc=MATCH
 
@@ -187,7 +187,7 @@ This configuration allows to configure t
 * __Label Field__ _(enhancer.engines.linking.labelField)_: The name of the field/property used to link (search and match) Entities. Only a single field is supported for performance reasons.
 * __Case Sensitivity__ _(enhancer.engines.linking.caseSensitive)_: Boolean switch that allows to activate/deactivate case sensitive matching. It is important to understand that even with case sensitivity activated an Entity with the label such as "Anaconda" will be suggested for the mention of "anaconda" in the text. The main difference will be the confidence value of such a suggestion as with case sensitivity activated the starting letters "A" and "a" are NOT considered to be matching. See the second technical part for details about the matching process. Case Sensitivity is deactivated by default. It is recommended to be activated if controlled vocabularies contain abbreviations similar to commonly used words e.g. CAN for Canada.
 * __Type Field__ _(enhancer.engines.linking.typeField)_: Values of this field are used as values of the "fise:entity-types" property of created "[fise:EntityAnnotation](../enhancementstructure.html#fiseentityannotation)"s. The default is "rdf:type". _NOTE_ that in contrast to the [NamedEntityLinking](namedentityextractionengine) the types are not used for the linking process. They are only used while writing the 'fise:EntityAnnotation's and to determine the 'dc:type' values of 'fise:TextAnnotation's.
-* __Type Mappings__ _(enhancer.engines.linking.typeMappings)_: The FISE enhancement structure (as used by the Stanbol Enhancer) distinguishes [TextAnnotation](../enhancementstructure.html#fisetextannotation) and [EntityAnnotation](../enhancementstructure.html#fiseentityannotation)s. The Keyword linking engine needs to create both types of Annotations: TextAnnotations selecting the words that match some Entities in the Controlled Vocabulary and EntityAnnotations that represent an Entity suggested for a TextAnnotation. The Type Mappings are used to determine the "dc:type" of the TextAnnotation based on the types of the suggested Entity. The default configuration comes with mappings for Persons, Organizations, Places and Concepts but this fields allows to define additional mappings. For details about the syntax see the sub-section "Type Mapping Syntax" below.
+* __Type Mappings__ _(enhancer.engines.linking.typeMappings)_: The FISE enhancement structure (as used by the Stanbol Enhancer) distinguishes [TextAnnotation](../enhancementstructure.html#fisetextannotation) and [EntityAnnotation](../enhancementstructure.html#fiseentityannotation)s. The EntityLinkingEngine needs to create both types of Annotations: TextAnnotations selecting the words that match some Entities in the Controlled Vocabulary and EntityAnnotations that represent an Entity suggested for a TextAnnotation. The Type Mappings are used to determine the "dc:type" of the TextAnnotation based on the types of the suggested Entity. The default configuration comes with mappings for Persons, Organizations, Places and Concepts but this field allows to define additional mappings. For details about the syntax see the sub-section "Type Mapping Syntax" below.
 * __Redirect Field__ _(enhancer.engines.linking.redirectField)_ and __Redirect Mode__ _(enhancer.engines.linking.redirectMode)_: Redirects allow to follow links to other entities defined in the vocabulary linked against. This is useful in cases where matched Entities are not equals to the Entities that users want to suggest. A good example is [DBpedia](http://dbpedia.org) where the Entity 'dbpedia:USA' defines only the label "USA" and an redirect to the Entity 'dbpedia:United_States' with all the information. The _Redirect Mode_ can now be used to define if redirects should be "IGNORE"; "ADD_VALUES" causes information of the redirected entity ('dbpedia:United_States') to be added to the matched one ('dbpedia:USA'); "FOLLOW" will suggest the redirected Entity ('dbpedia:United_States') instead of the matched one ('dbpedia:USA'). The _Redirect Field_ defines the field/property used for redirects.
 * __Suggestions__ _(enhancer.engines.linking.suggestions)_: The maximum number of suggestions. The default value for this is '3'. If the engine is used in combination with an post processing engine (e.g. disambiguation) that users might want to increase this value.
 
@@ -220,7 +220,7 @@ The parameters below are used to configu
 
 #### Type Mappings Syntax
 
-The Type Mappings are used to determine the "dc:type" of the [TextAnnotation](../enhancementstructure.html#fisetextannotation) based on the types of the suggested Entity. The field "Type Mappings" (property: _org.apache.stanbol.enhancer.engines.keywordextraction.typeMappings_) can be used to customize such mappings.
+The Type Mappings are used to determine the "dc:type" of the [TextAnnotation](../enhancementstructure.html#fisetextannotation) based on the types of the suggested Entity. The field "Type Mappings" (property: _enhancer.engines.linking.typeMappings_) can be used to customize such mappings.
 
 This field uses the following syntax
 
@@ -242,7 +242,7 @@ Some Examples of additional Mappings for
 
 The first two lines map some will known Classes that represent drugs and diseases to 'drugbank:drugs' and 'diseasome:diseases'. The third and fourth line define 1:1 mappings for side effects and ingredients and the last line adds 'dailymed:organization' as an additional mapping to DBpedia Ontology Organisation.
 
-The following mappings are predefined by the KeywordLinkingEngine.
+The following mappings are predefined by the EntityLinkingEngine.
 
     dbp-ont:Person; foaf:Person; schema:Person > dbp-ont:Person
     dbp-ont:Organisation; dbp-ont:Newspaper; schema:Organization > dbp-ont:Organisation
@@ -251,7 +251,7 @@ The following mappings are predefined by
 
 ## Extension Points
 
-This section describes Interfaces that are used as Extension Points by the KeywordLinkingEngine
+This section describes Interfaces that are used as Extension Points by the EntityLinkingEngine
 
 ### EntitySearcher
 
@@ -273,11 +273,11 @@ This method is used for searching entiti
 
 The [EntityhubLinkingEngine](entityhublinking) includes EntitySearcher implementations based on the FieldQuery search interface implemented by the Stanbol Entityhub.
 
-Currently the StanbolEntityhub based implementations are instantiated based on the value of the _'org.apache.stanbol.enhancer.engines.keywordextraction.referencedSiteId'_. Users that want to use a different implementation of this Interface to be used for linking will need to extend the KeywordLinkingEngine and override the #activateEntitySearcher(ComponentContext context, Dictionary<String,Object> configuration) and #deactivateEntitySearcher(). Those methods are called during activation/deactivation of the KeywordLinkingEngine and are expected to set/unset the #entitySearcher field.
+Currently the StanbolEntityhub based implementations are instantiated based on the value of the _'enhancer.engines.linking.entityhub.siteId'_. Users that want to use a different implementation of this Interface to be used for linking will need to extend the EntityLinkingEngine and override the #activateEntitySearcher(ComponentContext context, Dictionary<String,Object> configuration) and #deactivateEntitySearcher(). Those methods are called during activation/deactivation of the EntityLinkingEngine and are expected to set/unset the #entitySearcher field.
 
 ### LabelTokenizer
 
-The LabelTokenizer interface is used to tokenize labels of Entity suggestions as returned by the [EntitySearcer](#entitysearcher). As the matching process of the KeywordLinkingEngine is based on Tokens (words) multi-word labels (e.g. Univerity of Munich) need to be tokenized before they can be matched against the current context in the Text.
+The LabelTokenizer interface is used to tokenize labels of Entity suggestions as returned by the [EntitySearcher](#entitysearcher). As the matching process of the EntityLinkingEngine is based on Tokens (words) multi-word labels (e.g. University of Munich) need to be tokenized before they can be matched against the current context in the Text.
 
 The _LabelTokenizer_ interface defines only the single _tokenize(String label, String language)::String[]_ method that gets the label and the language as parameter and returns the tokens as a String array. If the tokenizer where not able to tokenize the label (e.g. because he does not support the language) it MUST return NULL. In this case the NamedEntityLinking engine will try to match the label as a single token.
 
@@ -324,3 +324,70 @@ This _LabelTokenizer_ supports the confi
 
 Internally the OpenNLP service to load tokenizer models for languages. That means that tokenizer models are loaded via the DataFileProvider infrastructure. For user that means that custom tokenizer models are loaded from the Stanbol Datafiles directory ({stanbol-working-dir}/stanbol/datafiles).
 
+### LinkingStateAware
+
+Added with [STANBOL-1070](https://issues.apache.org/jira/browse/STANBOL-1070), this interface allows implementations to receive callbacks about the processing state of the entity linking process. It defines methods for the start and end of a section as well as the start and end of a token. Both the start and the end methods are passed the active Span as parameter. An instance of this interface can be passed to the constructor of the EntityLinker implementation.
+
+The typical usage of this extension point is as follows:
+
+    :::java
+    @Reference 
+    protected LabelTokenizer labelTokenizer; 
+
+    private TextProcessingConfig textProcessingConfig;
+    private EntityLinkerConfig linkerConfig;
+
+    private EntitySearcher entitySearcher;
+
+    @Activate
+    @SuppressWarnings("unchecked")
+    protected void activate(ComponentContext ctx) throws ConfigurationException {
+        super.activate(ctx);
+        Dictionary<String,Object> properties = ctx.getProperties();
+        //extract TextProcessing and EntityLinking config from the provided properties
+        textProcessingConfig = TextProcessingConfig.createInstance(properties);
+        linkerConfig = EntityLinkerConfig.createInstance(properties,prefixService);
+
+        //create/init the entitySearcher
+        entitySearcher = new MyEntitySearcher();
+
+        //parse additional properties
+    }
+    
+    public void computeEnhancements(ContentItem ci) throws EngineException {
+        AnalysedText at = NlpEngineHelper.getAnalysedText(this, ci, true);
+        String language = NlpEngineHelper.getLanguage(this, ci, true);
+        
+        //create an instance of your LinkingStateAware implementation
+        LinkingStateAware linkingStateAware; //= new YourImpl(..);
+
+        //create one EntityLinker instance per enhancement request
+        EntityLinker entityLinker = new EntityLinker(at,language, 
+            languageConfig, entitySearcher, linkerConfig, 
+            labelTokenizer, linkingStateAware);
+
+        //during processing we will receive callbacks to the 
+        //linkingStateAware instance
+        try {
+            entityLinker.process();
+        } catch (EntitySearcherException e) {
+            log.error("Unable to link Entities with "+entityLinker,e);
+            throw new EngineException(this, ci, "Unable to link Entities with "+entityLinker, e);
+        }
+    }
+        
+Note that it is also possible to use a single EntityLinker/LinkingStateAware pair to process multiple ContentItems. However, in this case received callbacks need to be filtered based on the AnalysedText that is the context of the Span instances passed to the callback methods.
+
+    :::java
+    @Override
+    public void startToken(Token token) {
+        //process based on the context
+        AnalysedText at = token.getContext();
+        // …
+    }
+
+In addition, such a usage would require the LinkingStateAware implementation to be thread-safe.
+ 
+
+
+
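The filtering and thread-safety requirement described for a shared LinkingStateAware instance could be met along the following lines. This is a minimal sketch, not the Stanbol implementation. The `AnalysedText`, `Span`, `Section`, `Token` and `LinkingStateAware` types are simplified stand-ins declared locally so that the example compiles on its own. The four callback names are assumed from the description above (only `startToken(Token)` and `Token.getContext()` appear in the committed snippet), and `TokenCountingListener` with its `tokensSeen` and `release` methods is hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Stand-ins for the Stanbol NLP types, reduced to what this sketch needs.
// They are NOT the real Stanbol interfaces.
interface AnalysedText { }
interface Span { AnalysedText getContext(); }
interface Section extends Span { }
interface Token extends Span { }

// Assumed callback shape: start/end for sections and for tokens.
interface LinkingStateAware {
    void startSection(Section section);
    void endSection(Section section);
    void startToken(Token token);
    void endToken(Token token);
}

/**
 * Hypothetical listener that counts finished tokens per AnalysedText.
 * All state is keyed by the context of the Span passed to the callback,
 * so one instance can be shared by concurrently processed ContentItems.
 */
final class TokenCountingListener implements LinkingStateAware {

    private final ConcurrentMap<AnalysedText, AtomicInteger> counts =
            new ConcurrentHashMap<>();

    @Override public void startSection(Section section) { /* not needed here */ }
    @Override public void endSection(Section section) { /* not needed here */ }
    @Override public void startToken(Token token) { /* not needed here */ }

    @Override
    public void endToken(Token token) {
        // filter by context: each AnalysedText gets its own counter
        counts.computeIfAbsent(token.getContext(), k -> new AtomicInteger())
              .incrementAndGet();
    }

    /** Number of tokens reported so far for the given text. */
    int tokensSeen(AnalysedText at) {
        AtomicInteger n = counts.get(at);
        return n == null ? 0 : n.get();
    }

    /** Drops the state of a finished request so the map does not grow forever. */
    void release(AnalysedText at) {
        counts.remove(at);
    }
}

final class LinkingStateAwareSketch {
    public static void main(String[] args) {
        AnalysedText first = new AnalysedText() { };
        AnalysedText second = new AnalysedText() { };
        Token tokenOfFirst = () -> first;
        Token tokenOfSecond = () -> second;

        TokenCountingListener listener = new TokenCountingListener();
        listener.endToken(tokenOfFirst);
        listener.endToken(tokenOfSecond);
        listener.endToken(tokenOfFirst);

        System.out.println(listener.tokensSeen(first));
        System.out.println(listener.tokensSeen(second));
    }
}
```

In an engine such a listener would be passed as the last constructor argument of the EntityLinker, as in the committed snippet, and `release` would be called once `entityLinker.process()` has returned for a ContentItem. Keying the state by the Span context avoids a shared "current text" field, which is what would make a naive implementation unsafe under concurrent requests.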