Posted to commits@stanbol.apache.org by rw...@apache.org on 2011/09/22 17:38:35 UTC

svn commit: r1174213 - in /incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines: keywordlinkingengine.mdtext keywordlinkingengine.mdtxt

Author: rwesten
Date: Thu Sep 22 15:38:35 2011
New Revision: 1174213

URL: http://svn.apache.org/viewvc?rev=1174213&view=rev
Log:
minor improvements
changed extension mdtext

Added:
    incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.mdtext
      - copied, changed from r1174193, incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.mdtxt
Removed:
    incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.mdtxt

Copied: incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.mdtext (from r1174193, incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.mdtxt)
URL: http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.mdtext?p2=incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.mdtext&p1=incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.mdtxt&r1=1174193&r2=1174213&rev=1174213&view=diff
==============================================================================
--- incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.mdtxt (original)
+++ incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.mdtext Thu Sep 22 15:38:35 2011
@@ -1,21 +1,23 @@
 Title: The Keyword Linking Engine: custom vocabularies and multiple languages
 
-The [KeywordLinkingEngine](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/) is a re-implementation of the [TaxonomyLinkingEngine](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/taxonomylinking/) that is more modular and therefore better suited for future improvements and extensions as requested by [STANBOL-303](https://issues.apache.org/jira/browse/STANBOL-303). Its main improvements are its ability to support multiple languages and provide enhancement results specific to custom vocabulary.
+The [KeywordLinkingEngine](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/) is a re-implementation of the [TaxonomyLinkingEngine](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/taxonomylinking/) that is more modular and therefore better suited for future improvements and extensions as requested by [STANBOL-303](https://issues.apache.org/jira/browse/STANBOL-303). 
 
+Currently the main advantage of using this engine is its ability to support multiple languages and provide enhancement results specific to custom vocabulary. 
 
 ## Multiple Language Support ##
 
-The KeywordLinkingEngine supports multiple languages. However, the performance and to some extend also the quality of the enhancements for a specific language is depended on the following:
+The KeywordLinkingEngine supports the extraction of keywords in multiple languages. However, the performance and, to some extent, the quality of the enhancements depend on how well a language is supported by the NLP framework in use (currently OpenNLP).
+The following list gives a short overview of the language-specific components and configurations:
 
 * **Language detection:** The KeywordLinkingEngine depends on the correct detection of the language by the LanguageIdentificationEngine. If no language is detected or this information is missing then "English" is assumed as default.
-* **Multi-lingual labels of the controlled vocabulary:** Occurrences are searched within labels of the current language and labels without any defined language. e.g. English labels will not be matched against German language texts.
+* **Multi-lingual labels of the controlled vocabulary:** Entities are matched based on labels of the current language and labels without any defined language. For example, English labels will not be matched against German-language texts. It is therefore important to have a controlled vocabulary that includes labels in the language of the texts you want to enhance.
 * **Natural Language Processing support:** The KeywordLinkingEngine is able to use [Sentence Detectors](http://opennlp.sourceforge.net/api/opennlp/tools/sentdetect/SentenceDetector.html), [POS (Part of Speech) taggers](http://opennlp.sourceforge.net/api/opennlp/tools/postag/POSTagger.html) and [Chunkers](http://opennlp.sourceforge.net/api/opennlp/tools/chunker/Chunker.html). If such components are available for a language then they are used to optimize the enhancement process.
   
   **Sentence detector:** If a sentence detector is present the memory footprint of the engines improves, because Tokens, POS tags and Chunks are only kept for the currently active sentence. If no sentence detector is available the entire content is treated as a single sentence.
   
-  **Tokenizer:** A (word) [tokenizer](http://opennlp.sourceforge.net/api/opennlp/tools/tokenize/Tokenizer.html) is required. If no tokenizer is available for a given language, then the [OpenNLP SimpleTokenizer](http://opennlp.sourceforge.net/api/opennlp/tools/tokenize/SimpleTokenizer.html) is used as default.
+  **Tokenizer:** A (word) [tokenizer](http://opennlp.sourceforge.net/api/opennlp/tools/tokenize/Tokenizer.html) is required for the enhancement process. If no specific tokenizer is available for a given language, then the [OpenNLP SimpleTokenizer](http://opennlp.sourceforge.net/api/opennlp/tools/tokenize/SimpleTokenizer.html) is used as default. How well this tokenizer works will depend on the language.
   
-  **POS tagger:** POS taggers annotate tokens with their type. Because of the KeywordLinkingEngine is only interested in Nouns, Foreign Words and Numbers, the presence of such a tagger allows to skip a lot of the tokens and to improve performance. However POS taggers use different sets of tags for different languages. Because of that it is not enough that a POS tagger is available for a language there MUST BE also a configuration of the POS tags for that language that need to be processed.
+  **POS tagger:** POS (Part-of-Speech) taggers annotate tokens with their type. Because the KeywordLinkingEngine is only interested in Nouns, Foreign Words and Numbers, the presence of such a tagger allows it to skip many of the tokens and so improve performance. However, POS taggers use different sets of tags for different languages. It is therefore not enough that a POS tagger is available for a language; there MUST also be a configuration of the POS tags representing Nouns.
   
   **Chunker:** There are two types of Chunkers. First the [Chunkers](http://opennlp.sourceforge.net/api/opennlp/tools/chunker/Chunker.html) as provided by OpenNLP (based on statistical models) and second a [POS tag based Chunker](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/opennlp/src/main/java/org/apache/stanbol/commons/opennlp/PosTypeChunker.java) provided by the openNLP bundle of Stanbol. Currently the availability of a Chunker does not have a big influence on the performance nor the quality of the Enhancements.