Posted to commits@stanbol.apache.org by bu...@apache.org on 2012/11/23 14:21:40 UTC

svn commit: r839316 - in /websites/staging/stanbol/trunk/content: ./ docs/trunk/components/enhancer/engines/entitylinking.html docs/trunk/components/enhancer/nlp/analyzedtext.html docs/trunk/components/enhancer/nlp/nlpannotations.html

Author: buildbot
Date: Fri Nov 23 13:21:39 2012
New Revision: 839316

Log:
Staging update by buildbot for stanbol

Modified:
    websites/staging/stanbol/trunk/content/   (props changed)
    websites/staging/stanbol/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.html
    websites/staging/stanbol/trunk/content/docs/trunk/components/enhancer/nlp/analyzedtext.html
    websites/staging/stanbol/trunk/content/docs/trunk/components/enhancer/nlp/nlpannotations.html

Propchange: websites/staging/stanbol/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Fri Nov 23 13:21:39 2012
@@ -1 +1 @@
-1412876
+1412877

Modified: websites/staging/stanbol/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.html
==============================================================================
--- websites/staging/stanbol/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.html (original)
+++ websites/staging/stanbol/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.html Fri Nov 23 13:21:39 2012
@@ -162,7 +162,7 @@ Configuration wise this will pre-set the
 <ol>
 <li>the POS tagging for the given languages supports <em>Pos#ProperNoun</em>. If this is not the case for some languages, then language specific configurations need to be used to manually adjust the configuration for such languages. The next section provides examples for that.</li>
 <li>the Entities in the Vocabulary linked against need typically be mentioned as Proper Nouns in the Text. Users that need to link Vocabularies with Entities that use common nouns as their labels (e.g. House, Mountain, Summer, ...) can typically not use "Proper Noun Linking" with the following exceptions:<ul>
-<li>Entities with labels comprised of multiple common nouns (e.g. White House) can be detected in cases where <em>Chunk_s are supported and the _Link Multiple Matchable Tokens in Phrases</em> option is enabled (see the next sub-section for details).</li>
+<li>Entities with labels comprised of multiple common nouns (e.g. White House) can be detected in cases where <em>Chunks</em> are supported and the <em>Link Multiple Matchable Tokens in Phrases</em> option is enabled (see the next sub-section for details).</li>
 <li>In case Entities mentioned in the text are written as upper case tokens, the <em>Upper Case Token Mode</em> can be set to "LINK" (see the next sub-section for details)</li>
 </ul>
 </li>
@@ -252,7 +252,7 @@ Configuration wise this will pre-set the
 <li><strong>Default Matching Language</strong> <em>(enhancer.engines.linking.defaultMatchingLanguage)</em>: Linking is always done in the language of the processed text and in the <em>Default Matching Language</em>. By default, the default matching language covers labels without a language tag, but this parameter allows overriding this with a specific language. This is useful e.g. for <a href="http://dbpedia.org">DBpedia</a>, where all labels are marked with the language of the source Wikipedia data, so it makes sense to configure the default matching language to this value.</li>
 <li><strong>Max Search Token Distance</strong> <em>(enhancer.engines.linking.maxSearchTokenDistance)</em>: The maximum number of Tokens searched around a linked token to search for additional matchable tokens to be included for searches for Entities. The default value is '3'. As an Example in the text section "at the University of Munich a new procedure to" only "Munich" would be marked as linkable token if <em>Proper Noun Linking</em> is activated. However for searching Entities it makes sense to also use the matchable term 'University', because otherwise a search would potentially return an huge number of candidates of Entities mentioning 'Munich' in their labels. This parameter allows to configure the maximum distance of tokens so that the EntityLinkingEngine may include them as additional optional constraints for queries via the EntitySearcher interface. <em>NOTE</em> that this parameter will not allow to include tokens outside of a <em>processable chunk</em> if the <em>
 linked token</em> is within such a chunk.</li>
 <li><strong>Max Search Tokens</strong> <em>(enhancer.engines.linking.maxSearchTokens)</em>: The maximum number of Tokens used for searches via the <em>EntitySearcher</em> interface. The default value is '2'. If more <em>matchable tokens</em> are within the configured <em>Max Search Token Distance</em>, then those closer to the <em>linkable token</em> are preferred and, at equal distance, trailing tokens win. E.g. for the text "president Barack Obama", where 'Barack' is the currently active <em>linkable token</em>, the query uses the tokens 'Barack' OR 'Obama' if <em>Max Search Tokens</em>=2 and <em>Max Search Token Distance</em>&gt;=1, because 'president' and 'Obama' both have a distance of 1 but trailing Tokens are preferred. </li>
-<li><strong>Lemma based Matching</strong> <em>(enhancer.engines.linking.lemmaMatching)</em>: If this feature in enabled than the <em>MorphoFeatures#getLemma()</em> values are used instead of the _Token#getSpan()_s if present.</li>
+<li><strong>Lemma based Matching</strong> <em>(enhancer.engines.linking.lemmaMatching)</em>: If this feature is enabled, then the <em>MorphoFeatures#getLemma()</em> values are used instead of the <em>Token#getSpan()</em> values, if present.</li>
 <li><strong>Min Search Token Length</strong> <em>(enhancer.engines.linking.minSearchTokenLength)</em>: This is used as fallback if the <em>Tokens</em> in the <em><a href="../nlp/analyzedtext">AnalyzedText</a></em> do not contain Part of Speech annotations or if the confidence of those annotations is too low. The default value is '3' meaning that in such cases all tokens with more than '3' characters are linked with the vocabulary. <em>NOTE</em> that this configuration might move to the <em>Text Processing Configuration</em> in future versions.</li>
 </ul>
 <p>The parameters below are used to configure the matching process.</p>
@@ -312,7 +312,7 @@ Configuration wise this will pre-set the
 <ul>
 <li><strong>Entity Search</strong> <em>lookup(String field, Set&lt;String&gt; includeFields, List&lt;String&gt; search, String[] languages,Integer limit)::Collection&lt;Representation&gt;</em></li>
 </ul>
-<p>This method is used for searching entities in the controlled vocabulary. The configured <em>Label Field</em> is parsed in the 'field' parameter. The 'includedFileds' contain all fields required for the linking process. _Representation_s returned as result need to include values for those fields. The 'search' parameter includes the tokens used for the search. Values should be considered optional however Results are considered to rank Entities that match more search tokens first. The array of 'languages' is used to parse the languages that need to be considered for the search. If 'languages' contains NULL or '' it means that also labels without an language tag need to be included in the search (NOTE that this DOES NOT mean to include labels of any language!). Finally the 'limit' parameter is used to specify the maximum number of results. If NULL than the implementation can choose an meaningful default.</p>
+<p>This method is used for searching entities in the controlled vocabulary. The configured <em>Label Field</em> is passed in the 'field' parameter. The 'includeFields' parameter contains all fields required for the linking process. <em>Representations</em> returned as results need to include values for those fields. The 'search' parameter includes the tokens used for the search. Values should be considered optional, however results are expected to rank Entities that match more search tokens first. The array of 'languages' is used to pass the languages that need to be considered for the search. If 'languages' contains NULL or '' it means that labels without a language tag also need to be included in the search (NOTE that this DOES NOT mean to include labels of any language!). Finally the 'limit' parameter is used to specify the maximum number of results. If NULL, then the implementation can choose a meaningful default.</p>
 <ul>
 <li><strong>Offline Mode</strong> <em>supportsOfflineMode()::boolean</em> : indicates if the EntitySearcher implementation needs to connect to a remote service. This is needed to deactivate the EntityLinkingEngine in cases where Apache Stanbol is started in OfflineMode</li>
 <li><strong>Search Result Limit</strong> <em>getLimit()::Integer</em> : The maximum number of search results supported by the EntitySearcher implementation. Can return NULL if not applicable or unknown.</li>
@@ -324,8 +324,8 @@ Configuration wise this will pre-set the
 <p>The LabelTokenizer interface is used to tokenize labels of Entity suggestions as returned by the <a href="#entitysearcher">EntitySearcher</a>. As the matching process of the KeywordLinkingEngine is based on Tokens (words), multi-word labels (e.g. University of Munich) need to be tokenized before they can be matched against the current context in the Text.</p>
 <p>The <em>LabelTokenizer</em> interface defines only the single <em>tokenize(String label, String language)::String[]</em> method that gets the label and the language as parameters and returns the tokens as a String array. If the tokenizer was not able to tokenize the label (e.g. because it does not support the language) it MUST return NULL. In this case the NamedEntityLinking engine will try to match the label as a single token.</p>
 <h4 id="mainlabeltokenizer">MainLabelTokenizer</h4>
-<p>As it might very likely be the case that users will want to use multiple LabelTokenizer for different languages the EntityLinkingEngine comes with an MainLabelTokenizer implementation. It registers itself as LabelTokenizer with highest possible OSGI 'service.ranking' and tracks all other registered _LabelTokenizer_s.</p>
-<p>So if custom <em>LabelTokenizer_s register themselves as OSGI service than the MainLabelTokenizer can forward requests to them. It will do so in the order of the '<code>service.ranking</code>'s. in addition _LabelTokenizer</em> can use the '<code>enhancer.engines.keywordextraction.labeltokenizer.languages</code>' property to formally specify the languages they are supporting. This property does use the language configuration syntax (e.g. "en,de" would include English and German; "!it,!fr,<em>" would specify all languages expect Italian and French). If no configuration is provided than "</em>" (all languages) is assumed - what is fine as default as long as <em>LabelTokenizer</em> correctly return NULL for languages they do not support.</p>
+<p>As users will very likely want to use multiple LabelTokenizers for different languages, the EntityLinkingEngine comes with a MainLabelTokenizer implementation. It registers itself as LabelTokenizer with the highest possible OSGi 'service.ranking' and tracks all other registered <em>LabelTokenizers</em>.</p>
+<p>So if custom <em>LabelTokenizers</em> register themselves as OSGi services, then the MainLabelTokenizer can forward requests to them. It will do so in the order of their '<code>service.ranking</code>' values. In addition, a <em>LabelTokenizer</em> can use the '<code>enhancer.engines.keywordextraction.labeltokenizer.languages</code>' property to formally specify the languages it supports. This property uses the language configuration syntax (e.g. "en,de" would include English and German; "!it,!fr,*" would specify all languages except Italian and French). If no configuration is provided, then "*" (all languages) is assumed, which is fine as a default as long as the <em>LabelTokenizer</em> correctly returns NULL for languages it does not support.</p>
 <p>The MainLabelTokenizer forwards tokenize requests to all available LabelTokenizer implementations that support a specific language sorted by their '<code>service.ranking</code>' until the first one does NOT return NULL. If no LabelTokenizer was found or all returned NULL it will also return NULL.</p>
 <p>The following code snippet shows how to use the <em>MainLabelTokenizer</em> as <em>LabelTokenizer</em> for the <em>EntityLinkingEngine</em></p>
 <div class="codehilite"><pre><span class="nd">@Reference</span>
@@ -346,7 +346,7 @@ Configuration wise this will pre-set the
 </pre></div>
 
 
-<p>Configuring the NamedEntityLinkingEngine like this ensures that all registered _LabelTokenizer_s are considered for tokenizing.</p>
+<p>Configuring the NamedEntityLinkingEngine like this ensures that all registered <em>LabelTokenizers</em> are considered for tokenizing.</p>
 <h4 id="opennlp-labeltokenizer">OpenNLP LabelTokenizer</h4>
 <p>This is the default implementation of an LabelTokenizer based on the <a href="http://opennlp.apache.org">OpenNLP</a> tokenizer API. Internally it uses the OpenNLP service to load tokenizer models for languages. If language specific model is available it uses the OpenNLP SimpleTokenizer implementation. The <em>OpenNlpLabelTokenizer</em> registers itself with a '<code>service.ranking</code>' of '-1000' so it will b</p>
 <p>The <em>LabelTokenizerManager</em> interface extends the _</p>
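
Note on the MainLabelTokenizer text in the hunk above: the forwarding behaviour it describes (try the tracked tokenizers in descending 'service.ranking' order, return the first result that is not NULL, otherwise return NULL) can be sketched in plain Java as below. This is only an illustration under assumptions: SketchLabelTokenizer and MainLabelTokenizerSketch are hypothetical stand-ins and not the Stanbol classes, and the ranking is passed in as a plain int instead of being read from the OSGi service registry.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the LabelTokenizer interface described in the docs.
interface SketchLabelTokenizer {
    /** Returns the tokens of the label, or null if the language is not supported. */
    String[] tokenize(String label, String language);
}

// Hypothetical dispatcher mimicking the documented MainLabelTokenizer behaviour.
final class MainLabelTokenizerSketch implements SketchLabelTokenizer {

    /** A tracked tokenizer together with its (assumed) service.ranking value. */
    private static final class Ranked {
        final int ranking;
        final SketchLabelTokenizer tokenizer;

        Ranked(int ranking, SketchLabelTokenizer tokenizer) {
            this.ranking = ranking;
            this.tokenizer = tokenizer;
        }
    }

    private final List<Ranked> tracked = new ArrayList<>();

    /** Tracks a tokenizer; the list is kept sorted by descending ranking. */
    void register(int ranking, SketchLabelTokenizer tokenizer) {
        tracked.add(new Ranked(ranking, tokenizer));
        tracked.sort(Comparator.comparingInt((Ranked r) -> r.ranking).reversed());
    }

    /** Forwards to each tracked tokenizer until one returns a non-null result. */
    @Override
    public String[] tokenize(String label, String language) {
        for (Ranked r : tracked) {
            String[] tokens = r.tokenizer.tokenize(label, language);
            if (tokens != null) {
                return tokens; // first tokenizer that supports the language wins
            }
        }
        return null; // no tokenizer registered, or all of them returned null
    }
}
```

A low-ranked catch-all tokenizer (the docs mention '-1000' for the OpenNLP one) registered next to a higher-ranked language specific one would then only be consulted when the specific one returns NULL.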

Modified: websites/staging/stanbol/trunk/content/docs/trunk/components/enhancer/nlp/analyzedtext.html
==============================================================================
--- websites/staging/stanbol/trunk/content/docs/trunk/components/enhancer/nlp/analyzedtext.html (original)
+++ websites/staging/stanbol/trunk/content/docs/trunk/components/enhancer/nlp/analyzedtext.html Fri Nov 23 13:21:39 2012
@@ -142,7 +142,7 @@
 </ul>
 <p>This order is used by all Iterators returned by the AnalyzedText API</p>
 <h3 id="concurrent-modifications-and-iterators">Concurrent Modifications and Iterators</h3>
-<p>Iterators returned by the AnalyzedText API MUST throw _ConcurrentModificationException_s but rather reflect changes to the underlaying model. While this is not constant with the default behavior of Iterators in Java this is central for the effective usage of the AnalyzedText API - e.g. when Iterating over Sentences while adding Tokens.</p>
+<p>Iterators returned by the AnalyzedText API MUST NOT throw <em>ConcurrentModificationExceptions</em> but rather reflect changes to the underlying model. While this is not consistent with the default behavior of Iterators in Java, it is central for the effective usage of the AnalyzedText API - e.g. when iterating over Sentences while adding Tokens.</p>
 <h3 id="code-samples">Code Samples:</h3>
 <p>The following Code Snippet shows some typical usages of the API:</p>
 <div class="codehilite"><pre><span class="n">AnalysedText</span> <span class="n">at</span><span class="o">;</span> <span class="c1">//typically retrieved from the contentPart</span>
@@ -203,7 +203,7 @@
 
 </li>
 <li>
-<p>Defined <em>Annotation</em> are used to add information to an <em>Annotated</em> instance (like a Span). For adding annotations the use of _Annotation_s is required to ensure type safety. The following code snippet shows how to add an PosTag with the probability 0.95.</p>
+<p>Defined <em>Annotations</em> are used to add information to an <em>Annotated</em> instance (like a Span). For adding annotations the use of <em>Annotations</em> is required to ensure type safety. The following code snippet shows how to add a PosTag with the probability 0.95.</p>
 <div class="codehilite"><pre><span class="n">PosTag</span> <span class="n">tag</span> <span class="o">=</span> <span class="k">new</span> <span class="n">PosTag</span><span class="o">(</span><span class="s">&quot;N&quot;</span><span class="o">);</span> <span class="c1">//a simple POS tag</span>
 <span class="n">Token</span> <span class="n">token</span><span class="o">;</span> <span class="c1">//The Token we want to add the tag</span>
 <span class="n">token</span><span class="o">.</span><span class="na">addAnnotations</span><span class="o">(</span><span class="n">POS_ANNOTATION</span><span class="o">,</span><span class="n">Value</span><span class="o">.</span><span class="na">value</span><span class="o">(</span><span class="n">tag</span><span class="o">),</span><span class="mf">0.95</span><span class="o">);</span>

Modified: websites/staging/stanbol/trunk/content/docs/trunk/components/enhancer/nlp/nlpannotations.html
==============================================================================
--- websites/staging/stanbol/trunk/content/docs/trunk/components/enhancer/nlp/nlpannotations.html (original)
+++ websites/staging/stanbol/trunk/content/docs/trunk/components/enhancer/nlp/nlpannotations.html Fri Nov 23 13:21:39 2012
@@ -116,7 +116,7 @@
 
 
 <p><em>TagSet</em> is the other important class as it allows managing the set of PosTag instances. <em>TagSet</em> has two main functions: First, it allows an integrator of a POS tagger with Stanbol to define the mappings from the string POS tags used by the POS tagger to the LexicalCategory and Pos enumeration members preferably used by the Stanbol NLP chain. Second, it ensures that there is only a single instance of PosTag used to annotate all Tokens with the same type.</p>
-<p>_TagSet_s are typically specified as static members of utility classes. The following code snippet shows an example</p>
+<p>TagSets are typically specified as static members of utility classes. The following code snippet shows an example</p>
 <div class="codehilite"><pre><span class="c1">//Tagset is generically typed. We need a TagSet for PosTag&#39;s</span>
 <span class="kd">public</span> <span class="kd">static</span> <span class="kd">final</span> <span class="n">TagSet</span><span class="o">&lt;</span><span class="n">PosTag</span><span class="o">&gt;</span> <span class="n">STTS</span> <span class="o">=</span> <span class="k">new</span> <span class="n">TagSet</span><span class="o">&lt;</span><span class="n">PosTag</span><span class="o">&gt;(</span>
     <span class="s">&quot;STTS&quot;</span><span class="o">,</span> <span class="s">&quot;de&quot;</span><span class="o">);</span> <span class="c1">//define a name and the languages it supports</span>
@@ -131,11 +131,12 @@
     <span class="n">STTS</span><span class="o">.</span><span class="na">addTag</span><span class="o">(</span><span class="k">new</span> <span class="n">PosTag</span><span class="o">(</span><span class="s">&quot;ADJA&quot;</span><span class="o">,</span> <span class="n">Pos</span><span class="o">.</span><span class="na">AttributiveAdjective</span><span class="o">));</span>
     <span class="n">STTS</span><span class="o">.</span><span class="na">addTag</span><span class="o">(</span><span class="k">new</span> <span class="n">PosTag</span><span class="o">(</span><span class="s">&quot;ADJD&quot;</span><span class="o">,</span> <span class="n">Pos</span><span class="o">.</span><span class="na">PredicativeAdjective</span><span class="o">));</span>
     <span class="n">STTS</span><span class="o">.</span><span class="na">addTag</span><span class="o">(</span><span class="k">new</span> <span class="n">PosTag</span><span class="o">(</span><span class="s">&quot;ADV&quot;</span><span class="o">,</span> <span class="n">LexicalCategory</span><span class="o">.</span><span class="na">Adverb</span><span class="o">));</span>
+    <span class="c1">//[...]</span>
+<span class="o">}</span>
 </pre></div>
 
 
-<p class="..">//</p>
-<p>The string tag (first parameter) of the <em>PosTag</em> is used as unique key by the <em>TagSet</em>. Adding an 2nd <em>PasTag</em> with the same tag will override the first one. <em>PosTag_s that are added to a _TagSet</em> have the <em>Tag#getAnnotationModel()</em> property set to that model.</p>
+<p>The string tag (first parameter) of the <em>PosTag</em> is used as unique key by the <em>TagSet</em>. Adding a second <em>PosTag</em> with the same tag will override the first one. <em>PosTags</em> that are added to a <em>TagSet</em> have the <em>Tag#getAnnotationModel()</em> property set to that model.</p>
 <p>The final example is a code snippet showing the core part of a POS tagging engine using both the <a href="analyzedtext">AnalyzedText</a> and the <em>PosTag</em> and <em>TagSet</em> APIs.</p>
 <div class="codehilite"><pre><span class="n">TagSet</span><span class="o">&lt;</span><span class="n">PosTag</span><span class="o">&gt;</span> <span class="n">tagSet</span><span class="o">;</span> <span class="c1">//the used TagSet</span>
 <span class="c1">//holds PosTags for tags returned by the POS tagger that</span>