Posted to commits@stanbol.apache.org by bu...@apache.org on 2014/06/02 09:54:08 UTC

svn commit: r910868 - in /websites/staging/stanbol/trunk/content: ./ docs/trunk/customvocabulary.html

Author: buildbot
Date: Mon Jun  2 07:54:08 2014
New Revision: 910868

Log:
Staging update by buildbot for stanbol

Modified:
    websites/staging/stanbol/trunk/content/   (props changed)
    websites/staging/stanbol/trunk/content/docs/trunk/customvocabulary.html

Propchange: websites/staging/stanbol/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Mon Jun  2 07:54:08 2014
@@ -1 +1 @@
-1599100
+1599105

Modified: websites/staging/stanbol/trunk/content/docs/trunk/customvocabulary.html
==============================================================================
--- websites/staging/stanbol/trunk/content/docs/trunk/customvocabulary.html (original)
+++ websites/staging/stanbol/trunk/content/docs/trunk/customvocabulary.html Mon Jun  2 07:54:08 2014
@@ -94,8 +94,7 @@
 <p>The aim of this usage scenario is to provide Apache Stanbol users with all the required knowledge to customize Apache Stanbol to be used in their specific domain. This includes</p>
 <ul>
 <li>Two possibilities to manage custom Vocabularies<ol>
-<li>via the RESTful interface provided by a Managed Site or<br />
-</li>
+<li>via the RESTful interface provided by a Managed Site or  </li>
 <li>by using a ReferencedSite with a full local index</li>
 </ol>
 </li>
@@ -104,11 +103,11 @@
 <li>Configuring the Stanbol Enhancer to make use of the indexed and imported Vocabularies</li>
 </ul>
 <h2 id="overview">Overview</h2>
-<p>The following figure shows the typical Enhancement workflow that may start with some preprocessing steps (e.g. the conversion of rich text formats to plain text) followed by the Natural Language Processing phase. Next 'Semantic Lifting' aims to connect the results of text processing and link it to the application domain of the user. During Postprocessing those results may get further refined.
-<p style="text-align: center;">![Typical Enhancement Workflow](enhancementworkflow.png "The typical Enhancement Chain includes the </p>
+<p>The following figure shows the typical Enhancement workflow that may start with some preprocessing steps (e.g. the conversion of rich text formats to plain text) followed by the Natural Language Processing phase. Next 'Semantic Lifting' aims to connect the results of text processing and link it to the application domain of the user. During Postprocessing those results may get further refined.</p>
+<p style="text-align: center;">![Typical Enhancement Workflow](enhancementworkflow.png)</p>
+
 <p>This usage scenario is all about the Semantic Lifting phase. This phase is central to how well enhancement results match the requirements of the user's application domain. Users who need to process health related documents will need to provide vocabularies containing life science related entities; otherwise the Stanbol Enhancer will not perform as expected on those documents. Similarly, processing customer requests can only work if Stanbol has access to the data managed by the CRM.</p>
-<p>This scenario aims to provide Stanbol users with all information necessary to use Apache Stanbol in scenarios where domain specific vocabularies are required.<br />
-</p>
+<p>This scenario aims to provide Stanbol users with all information necessary to use Apache Stanbol in scenarios where domain specific vocabularies are required.  </p>
 <h2 id="managing-custom-vocabularies-with-the-stanbol-entityhub">Managing Custom Vocabularies with the Stanbol Entityhub</h2>
 <p>By default the Stanbol Enhancer uses the Entityhub component for linking Entities with mentions in the processed text. While users may extend the Enhancer to allow the usage of other sources, this is outside the scope of this scenario.</p>
 <p>The Stanbol Entityhub provides two possibilities to manage vocabularies</p>
@@ -124,13 +123,13 @@
 <li>the <em><a href="components/entityhub/managedsite#configuration-of-the-yardsite">YardSite</a></em> - the component that implements the ManagedSite interface.</li>
 </ol>
 <p>After completing those two steps an empty Managed Site should be ready to use and available under</p>
-<div class="codehilite"><pre><span class="n">http:</span><span class="sr">//</span><span class="p">{</span><span class="n">stanbol</span><span class="o">-</span><span class="n">host</span><span class="p">}</span><span class="sr">/entityhub/si</span><span class="n">tes</span><span class="sr">/{managed-site-name}/</span>
+<div class="codehilite"><pre><span class="n">http</span><span class="p">:</span><span class="o">//</span><span class="p">{</span><span class="n">stanbol</span><span class="o">-</span><span class="n">host</span><span class="p">}</span><span class="o">/</span><span class="n">entityhub</span><span class="o">/</span><span class="n">sites</span><span class="o">/</span><span class="p">{</span><span class="n">managed</span><span class="o">-</span><span class="n">site</span><span class="o">-</span><span class="n">name</span><span class="p">}</span><span class="o">/</span>
 </pre></div>
 
 
 <p>and users can start to upload the Entities of the controlled vocabulary by using the RESTful interface such as</p>
-<div class="codehilite"><pre><span class="n">curl</span> <span class="o">-</span><span class="n">i</span> <span class="o">-</span><span class="n">X</span> <span class="n">PUT</span> <span class="o">-</span><span class="n">H</span> <span class="s">&quot;Content-Type: application/rdf+xml&quot;</span> <span class="o">-</span><span class="n">T</span> <span class="p">{</span><span class="n">rdf</span><span class="o">-</span><span class="n">xml</span><span class="o">-</span><span class="n">data</span><span class="p">}</span> <span class="o">\</span>
-    <span class="s">&quot;http://{stanbol-host}/entityhub/site/{managed-site-name}/entity&quot;</span>
+<div class="codehilite"><pre><span class="n">curl</span> <span class="o">-</span><span class="nb">i</span> <span class="o">-</span><span class="n">X</span> <span class="n">PUT</span> <span class="o">-</span><span class="n">H</span> &quot;<span class="n">Content</span><span class="o">-</span><span class="n">Type</span><span class="p">:</span> <span class="n">application</span><span class="o">/</span><span class="n">rdf</span><span class="o">+</span><span class="n">xml</span>&quot; <span class="o">-</span><span class="n">T</span> <span class="p">{</span><span class="n">rdf</span><span class="o">-</span><span class="n">xml</span><span class="o">-</span><span class="n">data</span><span class="p">}</span> <span class="o">\</span>
+    &quot;<span class="n">http</span><span class="p">:</span><span class="o">//</span><span class="p">{</span><span class="n">stanbol</span><span class="o">-</span><span class="n">host</span><span class="p">}</span><span class="o">/</span><span class="n">entityhub</span><span class="o">/</span><span class="n">site</span><span class="o">/</span><span class="p">{</span><span class="n">managed</span><span class="o">-</span><span class="n">site</span><span class="o">-</span><span class="n">name</span><span class="p">}</span><span class="o">/</span><span class="n">entity</span>&quot;
 </pre></div>
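For illustration, a minimal RDF/XML payload for such a PUT request might look like the following sketch. The entity URI and labels are hypothetical; any SKOS- or RDFS-labeled vocabulary data in a supported RDF serialization can be uploaded the same way:

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <!-- hypothetical entity from a custom health-care vocabulary -->
  <rdf:Description rdf:about="http://example.org/vocab/Aspirin">
    <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#Concept"/>
    <skos:prefLabel xml:lang="en">Aspirin</skos:prefLabel>
    <skos:altLabel xml:lang="en">Acetylsalicylic acid</skos:altLabel>
  </rdf:Description>
</rdf:RDF>
```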
 
 
@@ -208,7 +207,7 @@ org.apache.stanbol.entityhub.indexing.ge
 </ul>
 <p>You find both files in the <code>{indexing-working-dir}/indexing/dist/</code> folder.</p>
 <p>After the installation your data will be available at</p>
-<div class="codehilite"><pre><span class="n">http:</span><span class="sr">//</span><span class="p">{</span><span class="n">stanbol</span><span class="o">-</span><span class="n">instance</span><span class="p">}</span><span class="sr">/entityhub/si</span><span class="n">te</span><span class="o">/</span><span class="p">{</span><span class="n">name</span><span class="p">}</span>
+<div class="codehilite"><pre><span class="n">http</span><span class="p">:</span><span class="o">//</span><span class="p">{</span><span class="n">stanbol</span><span class="o">-</span><span class="n">instance</span><span class="p">}</span><span class="o">/</span><span class="n">entityhub</span><span class="o">/</span><span class="n">site</span><span class="o">/</span><span class="p">{</span><span class="n">name</span><span class="p">}</span>
 </pre></div>
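Once the site is online, individual entities can be looked up via the Entityhub's RESTful interface, e.g. (placeholders as above; the entity id is hypothetical):

```
curl "http://{stanbol-host}/entityhub/site/{name}/entity?id=http://example.org/vocab/Aspirin"
```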
 
 
@@ -242,14 +241,22 @@ For the configuration of this engine you
 <li>"{name}Linking" - the <a href="components/enhancer/engines/namedentitytaggingengine.html">Named Entity Tagging Engine</a> for your vocabulary as configured above.</li>
 </ul>
 <p>Both the <a href="components/enhancer/chains/weightedchain.html">weighted chain</a> and the <a href="components/enhancer/chains/listchain.html">list chain</a> can be used for the configuration of such a chain.</p>
-<h3 id="configuring-named-entity-linking_1">Configuring Named Entity Linking</h3>
-<p>First it is important to note the difference between <em>Named Entity Linking</em> and <em>Entity Linking</em>. While <em>Named Entity Linking</em> only considers <em>Named Entities</em> detected by NER (Named Entity Recognition) <em>Entity Linking</em> does work on Words (Tokens). Because of that is has much lower NLP requirements and can even operate for languages where only word tokenization is supported. However extraction results AND performance do greatly improve with POS (Part of Speech) tagging support. Also Chunking (Noun Phrase detection), NER and Lemmatization results can be consumed by Entity Linking to further improve extraction results. For details see the documentation of the <a href="components/enhancer/engines/entitylinking#linking-process">Entity Linking Process</a>.</p>
+<h3 id="configuring-entity-linking">Configuring Entity Linking</h3>
+<p>First it is important to note the difference between <em>Named Entity Linking</em> and <em>Entity Linking</em>. While <em>Named Entity Linking</em> only considers <em>Named Entities</em> detected by NER (Named Entity Recognition), <em>Entity Linking</em> works on Words (Tokens). As NER support is only available for a limited number of languages, <em>Named Entity Linking</em> is only an option for those languages. <em>Entity Linking</em> only requires correct tokenization of the text, so it can be used for nearly every language. However <em>NOTE</em> that POS (Part of Speech) tagging will greatly improve quality and also speed, as it allows the engine to look up only Nouns. Also Chunking (Noun Phrase detection), NER and Lemmatization results are considered by Entity Linking to improve vocabulary lookups. For details see the documentation of the <a href="components/enhancer/engines/entitylinking#linking-process">Entity Linking Process</a>.</p>
 <p>The second big difference is that <em>Named Entity Linking</em> can only support Entity types covered by the NER models (Persons, Organizations and Places). <em>Entity Linking</em> does not have this restriction. This advantage comes with the disadvantage that Entity lookups against the Controlled Vocabulary are only based on Label similarities, while <em>Named Entity Linking</em> also uses the type information provided by NER.</p>
-<p>To use <em>Entity Linking</em> with a custom Vocabulary Users need to configure an instance of the <a href="components/enhancer/engines/entityhublinking">Entityhub Linking Engine</a>. While this Engine provides more than twenty configuration parameters the following list provides an overview about the most important. For detailed information please see the documentation of the Engine.</p>
+<p>To use <em>Entity Linking</em> with a custom Vocabulary, users need to configure an instance of the <a href="components/enhancer/engines/entityhublinking">Entityhub Linking Engine</a> or an <a href="components/enhancer/engines/lucenefstlinking">FST Linking Engine</a>. While both of these engines provide 20+ configuration parameters, only very few of them are required for a working configuration.</p>
 <ol>
-<li>The "Name" of the enhancement engine. It is recommended to use something like "{name}Extraction" - where {name} is the name of the Entityhub Site</li>
-<li>The name of the "Managed- / Referenced Site" holding your vocabulary. Here you have to configure the {name}</li>
-<li>The "Label Field" is the URI of the property in your vocabulary providing the labels used for matching. You can only use a single field. If you want to use values of several fields you have two options: (1) to adapt your indexing configuration to copy the values of those fields to a single one (e.g. the values of "skos:prefLabel" and "skos:altLabel" are copied to "rdfs:label" in the default configuration of the Entityhub indexing tool (see {indexing-working-dir}/indexing/config/mappings.txt) (2) to configure multiple EntityubLinkingEngines - one for each label field. Option (1) is preferable as long as you do not need to use different configurations for the different labels.</li>
+<li>The "Name" of the enhancement engine. It is recommended to use something like "{name}Extraction" or "{name}-linking" - where {name} is the name of the Entityhub Site</li>
+<li>The link to the data source<ul>
+<li>in case of the Entityhub Linking Engine this is the name of the "Managed- / Referenced Site" holding your vocabulary - so if you followed this scenario you need to configure the {name}</li>
+<li>in case of the FST linking engine this is the link to the SolrCore with the index of your custom vocabulary. If you followed this scenario you need to configure the {name} and set the field name encoding to "SolrYard".</li>
+</ul>
+</li>
+<li>The configuration of the field used for linking<ul>
+<li>in case of the Entityhub Linking Engine the "Label Field" needs to be set to the URI of the property holding the labels. You can only use a single field. If you want to use values of several fields you need to adapt your indexing configuration to copy the values of those fields to a single one (e.g. by adding <code>skos:prefLabel &gt; rdfs:label</code> and <code>skos:altLabel &gt; rdfs:label</code> to the <code>{indexing-working-dir}/indexing/config/mappings.txt</code> config).</li>
+<li>in case of the FST Linking engine you need to provide the <a href="components/enhancer/engines/lucenefstlinking#fst-tagging-configuration">FST Tagging Configuration</a>. If you store your labels in the <code>rdfs:label</code> field and you want to support all languages present in your vocabulary use <code>*;field=rdfs:label;generate=true</code>. <em>NOTE</em> that <code>generate=true</code> is required to allow the engine to (re)create FST models at runtime.</li>
+</ul>
+</li>
 <li>The "Link ProperNouns only" option: If the custom Vocabulary only contains Proper Nouns (Named Entities) then this parameter should be activated. This option causes the Entity Linking process to not make queries for common nouns, thereby reducing the number of queries against the controlled vocabulary by ~70%. However this is not feasible if the vocabulary contains Entities that are common nouns in the language. </li>
 <li>The "Type Mappings" might be interesting for you if your vocabulary contains custom types as those mappings can be used to map 'rdf:type's of entities in your vocabulary to 'dc:type's used for 'fise:TextAnnotation's - created by the Apache Stanbol Enhancer to annotate occurrences of extracted entities in the parsed text. See the <a href="components/enhancer/engines/keywordlinkingengine.html#type-mappings-syntax">type mapping syntax</a> and the <a href="enhancementusage.html#entity-tagging-with-disambiguation-support">usage scenario for the Apache Stanbol Enhancement Structure</a> for details.</li>
 </ol>
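The label-field setup described in point 3 can be sketched as a fragment of the indexing tool's <code>mappings.txt</code>. The property URIs follow the default configuration mentioned above; treat the exact file contents as an assumption:

```
# copy SKOS labels into the single rdfs:label field used for linking
skos:prefLabel > rdfs:label
skos:altLabel > rdfs:label
```

For the FST Linking Engine the equivalent step is the tagging configuration line <code>*;field=rdfs:label;generate=true</code> quoted above.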
@@ -260,7 +267,7 @@ For the configuration of this engine you
 <li>opennlp-token - <a href="components/enhancer/engines/opennlptokenizer">OpenNLP based Word tokenization</a>. Works for all languages where white spaces can be used to tokenize.</li>
 <li>opennlp-pos - <a href="components/enhancer/engines/opennlppos">OpenNLP Part of Speech tagging</a></li>
 <li>opennlp-chunker - The <a href="components/enhancer/engines/opennlpchunker">OpenNLP chunker</a> provides Noun Phrases</li>
-<li>"{name}Extraction - the <a href="components/enhancer/engines/entityhublinking">Entityhub Linking Engine</a> configured for the custom vocabulary.</li>
+<li>"{name}Extraction" - the <a href="components/enhancer/engines/entityhublinking">Entityhub Linking Engine</a> or <a href="components/enhancer/engines/lucenefstlinking">FST Linking Engine</a> configured for the custom vocabulary.</li>
 </ul>
 <p>Both the <a href="components/enhancer/chains/weightedchain.html">weighted chain</a> and the <a href="components/enhancer/chains/listchain.html">list chain</a> can be used for the configuration of such a chain.</p>
 <p>The documentation of the Stanbol NLP processing module provides <a href="components/enhancer/nlp/#stanbol-enhancer-nlp-support">detailed information</a> about integrated NLP frameworks and supported languages.</p>
@@ -271,7 +278,7 @@ For the configuration of this engine you
 3) a "dbpedia-proper-noun-linking" chain showing <em>Named Entity Linking</em> based on DBpedia</p>
 <p><strong>Change the enhancement chain bound to "/enhancer"</strong></p>
 <p>The enhancement chain bound to </p>
-<div class="codehilite"><pre><span class="n">http:</span><span class="sr">//</span><span class="p">{</span><span class="n">stanbol</span><span class="o">-</span><span class="n">host</span><span class="p">}</span><span class="o">/</span><span class="n">enhancer</span>
+<div class="codehilite"><pre><span class="n">http</span><span class="p">:</span><span class="o">//</span><span class="p">{</span><span class="n">stanbol</span><span class="o">-</span><span class="n">host</span><span class="p">}</span><span class="o">/</span><span class="n">enhancer</span>
 </pre></div>
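As a usage sketch (assuming a chain is bound at the default <code>/enhancer</code> endpoint as described above), content can then be posted for enhancement with a request like the following; the sample text is hypothetical:

```
curl -X POST -H "Content-Type: text/plain" \
    --data "John Smith visited Paris last week." \
    "http://{stanbol-host}/enhancer"
```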