Posted to commits@stanbol.apache.org by rw...@apache.org on 2012/07/12 08:21:24 UTC
svn commit: r1360538 - in /incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines: keywordlinkingengine.mdtext keywordlinkingengineconfig.png

Author: rwesten
Date: Thu Jul 12 06:21:24 2012
New Revision: 1360538

URL: http://svn.apache.org/viewvc?rev=1360538&view=rev
Log:
updated the documentation of the KeywordLinkingEngine to reflect changes introduced by STANBOL-685 and STANBOL-686

Modified:
    incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.mdtext
    incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengineconfig.png

Modified: incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.mdtext
URL: http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.mdtext?rev=1360538&r1=1360537&r2=1360538&view=diff
==============================================================================
--- incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.mdtext (original)
+++ incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.mdtext Thu Jul 12 06:21:24 2012
@@ -21,6 +21,7 @@ The example in the scene shows an config
 * __Type Field__ _(org.apache.stanbol.enhancer.engines.keywordextraction.typeField)_: Values of this field are used as values of the "fise:entity-types" property of created "[fise:EntityAnnotation](../enhancementstructure.html#fiseentityannotation)"s. The default is "rdf:type".
 * __Redirect Field__ _(org.apache.stanbol.enhancer.engines.keywordextraction.redirectField)_ and __Redirect Mode__ _(org.apache.stanbol.enhancer.engines.keywordextraction.redirectMode)_: Redirects tell the KeywordLinkingEngine to follow a specific property in the knowledge base for matched entities. This feature allows, e.g., following redirects from "USA" to "United States" as defined in Wikipedia. See "Processing of Entity Suggestions" for details. Possible values for the Redirect Mode are "IGNORE" - deactivates this feature; "ADD_VALUES" - uses label and type information of redirected entities, but keeps the URI of the extracted entity; "FOLLOW" - follows the redirect
 * __Min Token Length__ _(org.apache.stanbol.enhancer.engines.keywordextraction.minSearchTokenLength)_: While the KeywordLinkingEngine preferably uses POS (part-of-speech) taggers to determine if a word should be matched with the controlled vocabulary, the minimum token length provides a fallback if (a) no POS tagger is available for the language of the parsed text or (b) the confidence of the POS tagger is lower than the threshold.
+* __Minimum Token Match Factor__ _(org.apache.stanbol.enhancer.engines.keywordextraction.minTokenMatchFactor)_: When a Token of the text is compared with a Token of an Entity Label, the similarity of the two is expressed in the range [0..1]. The minimum token match factor specifies the minimum similarity two Tokens need so that they are considered to match. Lower similarity scores are not considered a match. This parameter is important because it allows, e.g., inflected forms of words to match. However, it may also produce false positives for similar words. Users should note that the similarity score is also used for calculating the confidence, so similarity scores below 1 that still satisfy the configured minimum token match factor will reduce the confidence of suggested Entities.
 * __Keyword Tokenizer__ _(org.apache.stanbol.enhancer.engines.keywordextraction.keywordTokenizer)_: This allows to use a special Tokenizer for matching keywords and alpha numeric IDs. Typical language specific Tokenizers tend to split such IDs in several tokens and therefore might prevent a correct matching. This Tokenizer should only be activated if the KeywordLinkingEngine is configured to match against IDs like ISBN numbers, Product IDs ... It should not be used to match against natural language labels. 
 * __Suggestions__ _(org.apache.stanbol.enhancer.engines.keywordextraction.maxSuggestions)_: The maximum number of suggested Entities.
 * __Languages__ _(org.apache.stanbol.enhancer.engines.keywordextraction.processedLanguages)_ and __Default Matching Language__ _(org.apache.stanbol.enhancer.engines.keywordextraction.defaultMatchingLanguage)_: The first allows to specify languages that should be processed by this engine. This is e.g. useful if the controlled vocabulary only contains labels for a specific language but does not formally specify this information (by setting the "xml:lang" property for labels). The default matching language can be used to work around the exact opposite case. As an example, in DBpedia labels do get the language of the dataset they are extracted from (e.g. all data extracted from en.wikipedia.org will get "xml:lang=en"). The default matching language allows to tell the KeywordLinkingEngine to use labels of that language for matching regardless of the language of the parsed content. In the case of DBpedia this allows e.g. to match persons mentioned in an Italian text with the English labels extracted from en.wikipedia.org. Details about natural language processing features used by this engine are provided in the section "Multiple Language Support"
@@ -93,11 +94,20 @@ The current state of the processing is r
 * __Token:__ The currently processed word part of the chunk and the sentence.
 * __TokenIndex:__ The index of the currently active token relative to the AnalysedSentence.
 
-The ProcessingState provides means to navigate to the next token. If chunks are present tokens that are outside of chunks are ignored.
+Processing is done based on Tokens (words). The ProcessingState provides means to navigate to the next token. If Chunks are present, tokens that are outside of chunks are ignored. Only 'processable' tokens are used to look up entities (see the next section for details). Whether a Token is processable is determined as follows:
+
+* Only Tokens within a Chunk are considered. If no Chunks are available, all Tokens are considered.
+* If POS tags are available AND POS tags considered as NOUNS are configured (see [PosTagsCollectionEnum](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/opennlp/src/main/java/org/apache/stanbol/commons/opennlp/PosTagsCollectionEnum.java)), then POS tags are used for deciding if a Token is processable
+    * The minimum POS tag probability is <code>0.667</code>
+    * Tokens with a POS tag representing a NOUN and a probability >= minPosTagProb are marked as processable
+    * Tokens with a POS tag NOT representing a NOUN and a probability >= minPosTagProb/2 are marked as NOT processable
+* If POS tags are NOT available or the NOUN POS tags configuration is missing, the minimum token length _(org.apache.stanbol.enhancer.engines.keywordextraction.minSearchTokenLength)_ is used as fallback. This means that all Tokens with a length equal to or greater than this value are marked as processable.
+
+This algorithm was introduced by [STANBOL-685](https://issues.apache.org/jira/browse/STANBOL-685)
 
 ### Entity Lookup ###
 
-A "OR" query with [1..MAX_SEARCH_TOKENS] tokens is used to lookup entities via the [EntitySearcher](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntitySearcher.java) interface. If the actual implementation cut off results, than it must be ensured that Entities that match both tokens are ranked first.
+An "OR" query with [1..MAX_SEARCH_TOKENS] processable tokens is used to look up entities via the [EntitySearcher](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntitySearcher.java) interface. If the actual implementation cuts off results, then it must be ensured that Entities that match both tokens are ranked first.
 Currently there are two implementations of this interface: (1) for the Entityhub ([EntityhubSearcher](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/EntityhubSearcher.java)) and (2) for ReferencedSites ([ReferencedSiteSearcher](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/ReferencedSiteSearcher.java)). There is also an [Implementation](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/test/java/org/apache/stanbol/enhancer/engines/keywordextraction/impl/TestSearcherImpl.java) that holds entities in-memory, however currently this is only used for unit tests.
 
 Queries do use the configured [EntityLinkerConfig](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java).getNameField() and the language of labels is restricted to the current language or labels that do not define any language.
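
The rules for marking a Token as processable, as described in the hunk above, can be sketched as follows. This is a minimal illustration and not the engine's code: the class `ProcessableTokenSketch`, the method `isProcessable` and its parameters are hypothetical names, the Chunk check is left out, and falling back to the token length when the POS tag probability is too low to decide is an assumption based on the "Min Token Length" description.

```java
import java.util.Set;

/** Hypothetical sketch; none of these names come from the Stanbol code base. */
final class ProcessableTokenSketch {

    /** Minimum POS tag probability as stated in the documentation. */
    static final double MIN_POS_TAG_PROB = 0.667;

    /**
     * @param token                the token text
     * @param posTag               the POS tag, or null if no POS tagger is available
     * @param posProb              the probability reported for the POS tag
     * @param nounTags             the configured NOUN tags, or null/empty if missing
     * @param minSearchTokenLength the configured minimum token length (fallback)
     */
    static boolean isProcessable(String token, String posTag, double posProb,
                                 Set<String> nounTags, int minSearchTokenLength) {
        boolean posUsable = posTag != null && nounTags != null && !nounTags.isEmpty();
        if (posUsable) {
            boolean isNoun = nounTags.contains(posTag);
            if (isNoun && posProb >= MIN_POS_TAG_PROB) {
                return true;   // confident NOUN: processable
            }
            if (!isNoun && posProb >= MIN_POS_TAG_PROB / 2) {
                return false;  // sufficiently confident non-NOUN: not processable
            }
            // Undecided: assumption, fall through to the length-based fallback.
        }
        return token.length() >= minSearchTokenLength;
    }
}
```

With NOUN tags `NN` and `NNP` and a minimum token length of 3, `isProcessable("Obama", "NNP", 0.9, nouns, 3)` yields `true`, while `isProcessable("of", null, 0.0, nouns, 3)` yields `false` through the length fallback.
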
@@ -115,15 +125,18 @@ All labels (values of the [EntityLinkerC
 
 For each label that fulfills the above criteria the following steps are processed. The best result is used as the result of the whole matching process:
 
-* All tokens (of the text) following the current position are searched within the label.
-* As of now, tokens MUST appear in the correct order within a label (e.g. "Murdoch Rupert" will NOT match "Rupert Murdoch")
-* On the first processable token of the text that is not present within the label matching is canceled. (see the definition of processable token above)
-* On the second non-processable token not found in the label the matching is also canceled (e.g. "University of Michigan" will match "University Michigan")
+* Tokens (of the text) following the current position are searched within the label. This also includes non-processable Tokens. 
+    * Processable Tokens MUST match with Tokens in the Label. At most [EntityLinkerConfig](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java).getMaxNotFound() non-processable Tokens are allowed to remain unmatched.
+    * Token order is important. Tokens in the Entity Label are allowed to be skipped (e.g. the text 'Barack Obama' will match the label 'Barack Hussein Obama' because 'Hussein' is allowed to be skipped. The other way around there would be no match, because processable Tokens in the Text are not allowed to be skipped).
+* If the first Token of the Label does not match, preceding Tokens of the Text are matched against the Label. This is done to ensure that Entities that use adjectives in their labels (e.g. "great improvement", "Gute Deutschkenntnisse") are matched. In addition, this also helps to match named entities (e.g. person names), as the first token of such mentions is sometimes erroneously classified as an adjective by POS taggers.
+* Tokens that appear in the wrong order (e.g. the text 'Obama, Barack' with the label 'Barack Obama') are matched with a factor of <code>0.7</code>. Currently only exact matches are considered.
+
+Whether two tokens match is calculated by dividing the length of the longest matching part, counted from the beginning of the Tokens, by the maximum length of the two tokens. E.g. 'German' would match 'Germany' with <code>6/7=0.86</code>. The result of this comparison is the token similarity. If this similarity is greater than or equal to the configured minimum token match factor _(org.apache.stanbol.enhancer.engines.keywordextraction.minTokenMatchFactor)_, then those tokens are considered to match. The token similarity is also used for calculating the confidence.  
 
 Entities are [Suggested](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/Suggestion.java) if:
 
-* a label does match exactly with the text following the current position it the entity is suggested. (e.g. [Passerine](http://en.wikipedia.org/wiki/Passerine))
-* a label matches at least [EntityLinkerConfig](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java).getMinFoundTokens() (default=2) are matching with the text. This ensures that "[Rupert Murdoch](http://en.wikipedia.org/wiki/Rupert_Murdoch)" is not suggested for "[Rupert](http://en.wikipedia.org/wiki/Rupert)" but on the other hand "Barack Hussein Obama" is suggested for "Barack Obama". Setting "minFoundToken" to values less than two will usually cause a lot of false positives, but would also come up with a suggestion for "Barack Obama" if the content contains the word "Obama".
+* a label matches exactly at the current position in the text, that is, all tokens of the Label match the Tokens of the text. Note that tokens are considered to match if their similarity is greater than or equal to the minimum token match factor.
+* partial matches are considered if at least [EntityLinkerConfig](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java).getMinFoundTokens() (default=2) processable tokens match. Non-processable tokens are not considered for this. This ensures that "[Rupert Murdoch](http://en.wikipedia.org/wiki/Rupert_Murdoch)" is not suggested for "[Rupert](http://en.wikipedia.org/wiki/Rupert)" but on the other hand "Barack Hussein Obama" is suggested for "Barack Obama".
 
 The described matching process is currently directly part of the [EntityLinker](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinker.java). To support different matching strategies this would need to be externalized into its own "EntityLabelMatcher" interface.
 

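The token similarity described in the hunk above (length of the common prefix divided by the length of the longer token) can be sketched as follows. The class `TokenSimilaritySketch` and its methods are hypothetical names, not the engine's code, and the case-sensitive comparison is an assumption; the engine may normalize tokens before comparing them.

```java
/** Hypothetical sketch; none of these names come from the Stanbol code base. */
final class TokenSimilaritySketch {

    /** Returns the common-prefix length divided by the longer token's length, in [0..1]. */
    static double similarity(String textToken, String labelToken) {
        int maxLen = Math.max(textToken.length(), labelToken.length());
        if (maxLen == 0) {
            return 0.0; // assumption: two empty tokens do not match
        }
        int minLen = Math.min(textToken.length(), labelToken.length());
        int prefix = 0;
        // Count matching characters from the beginning of both tokens.
        while (prefix < minLen && textToken.charAt(prefix) == labelToken.charAt(prefix)) {
            prefix++;
        }
        return (double) prefix / maxLen;
    }

    /** Tokens match if their similarity reaches the configured minimum token match factor. */
    static boolean matches(String textToken, String labelToken, double minTokenMatchFactor) {
        return similarity(textToken, labelToken) >= minTokenMatchFactor;
    }
}
```

For 'German' and 'Germany' the common prefix has 6 characters and the longer token has 7, so this sketch returns 6/7. With a minimum token match factor of 0.7 the two tokens match, whereas 'Germ' and 'Germany' (4/7) do not.
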
Modified: incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengineconfig.png
URL: http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengineconfig.png?rev=1360538&r1=1360537&r2=1360538&view=diff
==============================================================================
Binary files - no diff available.