Posted to commits@stanbol.apache.org by bu...@apache.org on 2012/11/24 10:54:57 UTC

svn commit: r839412 - in /websites/staging/stanbol/trunk/content: ./ docs/trunk/components/enhancer/engines/entitylinking.html

Author: buildbot
Date: Sat Nov 24 09:54:56 2012
New Revision: 839412

Log:
Staging update by buildbot for stanbol

Modified:
    websites/staging/stanbol/trunk/content/   (props changed)
    websites/staging/stanbol/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.html

Propchange: websites/staging/stanbol/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Sat Nov 24 09:54:56 2012
@@ -1 +1 @@
-1412985
+1413162

Modified: websites/staging/stanbol/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.html
==============================================================================
--- websites/staging/stanbol/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.html (original)
+++ websites/staging/stanbol/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.html Sat Nov 24 09:54:56 2012
@@ -325,7 +325,7 @@ Configuration wise this will pre-set the
 <p>The <em>LabelTokenizer</em> interface defines a single method, <em>tokenize(String label, String language)::String[]</em>, which takes the label and the language as parameters and returns the tokens as a String array. If the tokenizer is not able to tokenize the label (e.g. because it does not support the language) it MUST return NULL. In this case the NamedEntityLinking engine will try to match the label as a single token.</p>
 <h4 id="mainlabeltokenizer">MainLabelTokenizer</h4>
 <p>As users will very likely want to use multiple LabelTokenizers for different languages, the EntityLinkingEngine comes with a MainLabelTokenizer implementation. It registers itself as a LabelTokenizer with the highest possible OSGi 'service.ranking' and tracks all other registered <em>LabelTokenizers</em>.</p>
-<p>So if custom <em>LabelTokenizers</em> register themselves as OSGi services, the MainLabelTokenizer can forward requests to them. It will do so in the order of their '<code>service.ranking</code>'s. In addition, <em>LabelTokenizers</em> can use the '<code>enhancer.engines.keywordextraction.labeltokenizer.languages</code>' property to formally specify the languages they support. This property uses the language configuration syntax (e.g. "en,de" would include English and German; "!it,!fr,*" would specify all languages except Italian and French). If no configuration is provided then "*" (all languages) is assumed, which is fine as a default as long as <em>LabelTokenizers</em> correctly return NULL for languages they do not support.</p>
+<p>So if custom <em>LabelTokenizers</em> register themselves as OSGi services, the MainLabelTokenizer can forward requests to them. It will do so in the order of their '<code>service.ranking</code>'s. In addition, <em>LabelTokenizers</em> can use the '<code>enhancer.engines.entitylinking.labeltokenizer.languages</code>' property to formally specify the languages they support. This property uses the language configuration syntax (e.g. "en,de" would include English and German; "!it,!fr,*" would specify all languages except Italian and French). If no configuration is provided then "*" (all languages) is assumed, which is fine as a default as long as <em>LabelTokenizers</em> correctly return NULL for languages they do not support.</p>
 <p>The MainLabelTokenizer forwards tokenize requests to all available LabelTokenizer implementations that support a specific language, sorted by their '<code>service.ranking</code>', until the first one does NOT return NULL. If no LabelTokenizer was found or all returned NULL it will also return NULL.</p>
 <p>The following code snippet shows how to use the <em>MainLabelTokenizer</em> as the <em>LabelTokenizer</em> for the <em>EntityLinkingEngine</em>:</p>
 <div class="codehilite"><pre><span class="nd">@Reference</span>
@@ -347,10 +347,13 @@ Configuration wise this will pre-set the
 
 
 <p>Configuring the NamedEntityLinkingEngine like this ensures that all registered <em>LabelTokenizers</em> are considered for tokenizing.</p>
+<h4 id="simple-labeltokenizer">Simple LabelTokenizer</h4>
+<p>This is the default implementation of a LabelTokenizer and has no external dependencies. It behaves exactly the same as the <a href="http://opennlp.apache.org">OpenNLP</a> SimpleTokenizer. It is active by default and configured to process all languages. It uses a '<code>service.ranking</code>' of '-1000', so it will typically be overridden by custom registered implementations.</p>
+<p>The main intention of this implementation is to be a reasonable default that ensures LabelTokenizer support for all languages.</p>
 <h4 id="opennlp-labeltokenizer">OpenNLP LabelTokenizer</h4>
-<p>This is the default implementation of an LabelTokenizer based on the <a href="http://opennlp.apache.org">OpenNLP</a> tokenizer API. Internally it uses the OpenNLP service to load tokenizer models for languages. If language specific model is available it uses the OpenNLP SimpleTokenizer implementation. The <em>OpenNlpLabelTokenizer</em> registers itself with a '<code>service.ranking</code>' of '-1000' so it will b</p>
-<p>The <em>LabelTokenizerManager</em> interface extends the _</p>
-<p>The KeywordLinkingEngine will - by default - always use the LabelTokenizer with the highest "service.ranking" for a given language to tokenize labels. By default it comes with an OpenNLP based Tokenizer implementation that registers itself for all languages with a "service.ranking" of "-1000".</p>
+<p>The EntityLinkingEngine also contains an implementation based on the <a href="http://opennlp.apache.org">OpenNLP</a> tokenizer API. As the dependencies on OpenNLP and the Stanbol Commons OpenNLP module are optional, this implementation will only be active if the <code>org.apache.stanbol:org.apache.stanbol.commons.opennlp</code> bundle in version <code>0.10.0</code> or later is active.</p>
+<p>This <em>LabelTokenizer</em> supports the configuration of custom OpenNLP tokenizer models for specific languages, e.g. "de;model=my-de-tokenizermodel.zip;*" would use a custom model for German and the default models for all other languages.</p>
+<p>Internally it uses the OpenNLP service to load tokenizer models for languages. That means that tokenizer models are loaded via the DataFileProvider infrastructure. For users this means that custom tokenizer models are loaded from the Stanbol Datafiles directory ({stanbol-working-dir}/stanbol/datafiles).</p>
   </div>
   
   <div id="footer">