Posted to commits@stanbol.apache.org by rw...@apache.org on 2013/04/18 14:20:02 UTC
svn commit: r1469293 - in /stanbol/site/trunk/content/docs/trunk: ./ components/enhancer/ components/enhancer/engines/

Author: rwesten
Date: Thu Apr 18 12:20:02 2013
New Revision: 1469293

URL: http://svn.apache.org/r1469293
Log:
updated custom vocabulary usage scenario, fixed broken links to usage scenarios (STANBOL-972)

Added:
    stanbol/site/trunk/content/docs/trunk/enhancementworkflow.png   (with props)
Modified:
    stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext
    stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/keywordlinkingengine.mdtext
    stanbol/site/trunk/content/docs/trunk/components/enhancer/enhancementstructure.mdtext
    stanbol/site/trunk/content/docs/trunk/customvocabulary.mdtext

Modified: stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext
URL: http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext?rev=1469293&r1=1469292&r2=1469293&view=diff
==============================================================================
--- stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext (original)
+++ stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext Thu Apr 18 12:20:02 2013
@@ -216,6 +216,7 @@ The parameters below are used to configu
 
 * __Min Text Score__ _(enhancer.engines.linking.minTextScore)_ [0..1]::double: The "Text Score" [0..1] represents how well the Label of an Entity matches to the selected Span in the Text. It compares the number of matched {@link Token} from the label with the number of Tokens enclosed by the Span in the Text an Entity is suggested for. Not exact matches for Tokens, or if the Tokens within the label do appear in an other order than in the text do also reduce this score. Entities are only considered if at least one of their labels cores higher than the minimum for all tree of _Min Labe Score_, _Min Text Match Score_ and _Min Match Score_.
 * __Min Match Score__ _(enhancer.engines.linking.minMatchScore)_ [0..1]::double: Defined as the product of the "Text Score" with the "Label Score" - meaning that this value represents both how well the label matches the text and how much of the label is matched with the text. Entities are only considered if at least one of their labels cores higher than the minimum for all tree of _Min Labe Score_, _Min Text Match Score_ and _Min Match Score_. 
+* __Use EntityRankings__ _(enhancer.engines.linking.useEntityRankings)_ ::boolean (default=true): Entity Rankings can be used to define the ranking (popularity, importance, connectivity, ...) of an entity relative to others within the knowledge base. While fise:confidence values calculated by the EntityLinkingEngine only represent how well a label of the entity matches the given section in the processed text, it makes sense for many use cases to sort Entities with the same score based on their entity rankings (e.g. users would expect to get "Paris (France)" suggested before "Paris (Texas)" for Paris appearing in a text). Enabling this feature will slightly (&lt; 0.1) change the score of suggestions to ensure such an ordering.
 
 #### Type Mappings Syntax
 

Modified: stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/keywordlinkingengine.mdtext
URL: http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/keywordlinkingengine.mdtext?rev=1469293&r1=1469292&r2=1469293&view=diff
==============================================================================
--- stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/keywordlinkingengine.mdtext (original)
+++ stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/keywordlinkingengine.mdtext Thu Apr 18 12:20:02 2013
@@ -22,7 +22,7 @@ The example in the scene shows an config
 
 * __Name__ _(stanbol.enhancer.engine.name)_: The name of the Enhancement Engine. This name is used to refer an [EnhancementEngine](index.html) in [EnhancementChain](enhancementchain.html)s
 * __Referenced Site__ _(org.apache.stanbol.enhancer.engines.keywordextraction.referencedSiteId)_: The name of the ReferencedSite of the Stanbol Entityhub that holds the controlled vocabulary to be used for extracting Entities. "entityhub" or "local" can be used to extract Entities managed directly by the Entityhub.
-* __Label Field__ _(org.apache.stanbol.enhancer.engines.keywordextraction.nameField)_: The name of the property used to lookup Entities. Only a single field is supported for performance reasons. Users that want to use values of several fields should collect such values by an according configuration in the mappings.txt used during indexing. This [usage scenario](../../customvocabulary.html) provides more information on this.
+* __Label Field__ _(org.apache.stanbol.enhancer.engines.keywordextraction.nameField)_: The name of the property used to lookup Entities. Only a single field is supported for performance reasons. Users that want to use values of several fields should collect such values by an according configuration in the mappings.txt used during indexing. This [usage scenario](../../../customvocabulary.html) provides more information on this.
 * __Case Sensitivity__ _(org.apache.stanbol.enhancer.engines.keywordextraction.caseSensitive)_: This allows to activate/deactivate case sensitive matching. It is important to understand that even with case sensitivity activated an Entity with the label such as "Anaconda" will be suggested for the mention of "anaconda" in the text. The main difference will be the confidence value of such a suggestion as with case sensitivity activated the starting letters "A" and "a" are NOT considered to be matching. See the second technical part for details about the matching process. Case Sensitivity is deactivated by default. It is recommended to be activated if controlled vocabularies contain abbreviations similar to commonly used words e.g. CAN for Canada.
 * __Type Field__ _(org.apache.stanbol.enhancer.engines.keywordextraction.typeField)_: Values of this field are used as values of the "fise:entity-types" property of created "[fise:EntityAnnotation](../enhancementstructure.html#fiseentityannotation)"s. The default is "rdf:type".
 * __Redirect Field__ _(org.apache.stanbol.enhancer.engines.keywordextraction.redirectField)_ and __Redirect Mode__ _(org.apache.stanbol.enhancer.engines.keywordextraction.redirectMode)_: Redirects allow to tell the KeywordLinkingEngine to follow a specific property in the knowledge base for matched entities. This feature e.g. allows to follow redirects from "USA" to "United States" as defined in Wikipedia. See "Processing of Entity Suggestions" for details. Possible valued for the Redirect Mode are "IGNORE" - deactivates this feature; "ADD_VALUES" - uses label, type informations of redirected entities, but keeps the URI of the extracted entity; "FOLLOW" - follows the redirect

Modified: stanbol/site/trunk/content/docs/trunk/components/enhancer/enhancementstructure.mdtext
URL: http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/enhancementstructure.mdtext?rev=1469293&r1=1469292&r2=1469293&view=diff
==============================================================================
--- stanbol/site/trunk/content/docs/trunk/components/enhancer/enhancementstructure.mdtext (original)
+++ stanbol/site/trunk/content/docs/trunk/components/enhancer/enhancementstructure.mdtext Thu Apr 18 12:20:02 2013
@@ -11,7 +11,7 @@ Its two main purposes are to facilitate 
     * group Entity suggestion based on detected "Named Entities" (disambiguation support)
     * show the occurrence of detected Entities within the analyzed text (similar to spell checker UIs)
 
-While this document focuses on the first Engine and provides details on how the Stanbol Enhancement Structure it the integral part of the Stanbol Enhancer there is also a [Usage Scenario](../enhancementusage.html) available that focuses on how the Enhancements can be consumed by Stanbol Enhancer users.
+While this document focuses on the first Engine and provides details on how the Stanbol Enhancement Structure it the integral part of the Stanbol Enhancer there is also a [Usage Scenario](../../enhancementusage.html) available that focuses on how the Enhancements can be consumed by Stanbol Enhancer users.
 
 ## Overview on the Stanbol Enhancement Structure
 

Modified: stanbol/site/trunk/content/docs/trunk/customvocabulary.mdtext
URL: http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/customvocabulary.mdtext?rev=1469293&r1=1469292&r2=1469293&view=diff
==============================================================================
--- stanbol/site/trunk/content/docs/trunk/customvocabulary.mdtext (original)
+++ stanbol/site/trunk/content/docs/trunk/customvocabulary.mdtext Thu Apr 18 12:20:02 2013
@@ -5,27 +5,62 @@ The ability to work with custom vocabula
 
 The aim of this usage scenario is to provide Apache Stanbol users with all the required knowledge to customize Apache Stanbol to be used in their specific domain. This includes
 
-* Index custom Vocabularies using the Entityhub Indexing Tool
+* Two possibilities to manage custom Vocabularies
+    1. via the RESTful interface provided by a Managed Site or  
+    2. by using a ReferencedSite with a full local index
+* Building full local indexes with the Entityhub Indexing Tool
 * Importing Indexes to Apache Stanbol
 * Configuring the Stanbol Enhancer to make use of the indexed and imported Vocabularies
 
 ## Overview
 
-For text enhancement and linking to external sources, the Entityhub component of Apache Stanbol allows to work with local indexes of datasets. This has several advantages. 
+The following figure shows the typical Enhancement workflow that may start with some preprocessing steps (e.g. the conversion of rich text formats to plain text) followed by the Natural Language Processing phase. Next 'Semantic Lifting' aims to connect the results of text processing and link it to the application domain of the user. During Postprocessing those results may get further refined.
+<p style="text-align: center;">![Typical Enhancement Workflow](enhancementworkflow.png "The typical Enhancement Chain includes the 
 
-- You do not rely on internet connectivity, thus it is possible to operate offline with a huge set of entities.
-- You can do local updates of these datasets.
-- You can work with local resources, such as your LDAP directory or a specific and private enterprise vocabulary of a specific domain.
+This usage scenario is all about the Semantic Lifting phase. This phase is central to how well enhancement results match the requirements of the user's application domain. Users that need to process health related documents will need to provide vocabularies containing life science related entities, otherwise the Stanbol Enhancer will not perform as expected on those documents. Similarly, processing customer requests can only work if Stanbol has access to data managed by the CRM.
 
-Creating your own indexes is the preferred way of working with custom vocabularies. Small vocabularies can also be uploaded to the Entityhub as ontologies, directly. A downside to this approach is that only one ontology per installation is supported.
+This scenario aims to provide Stanbol users with all information necessary to use Apache Stanbol in scenarios where domain specific vocabularies are required.  
 
-If you want to use multiple datasets in parallel, you have to create a local index for these datasets and configure the Entityhub to use them. In the following we will focuses on the main case, which is: Creating and using a local [Apache Solr](http://lucene.apache.org/solr/) index of a custom vocabulary, e.g. a SKOS thesaurus or taxonomy of your domain.
+## Managing Custom Vocabularies with the Stanbol Entityhub
 
-## Creating and working with custom local indexes
+By default the Stanbol Enhancer does use the Entityhub component for linking Entities with mentions in the processed text. While Users may extend the Enhancer to allow the usage of other sources this is outside of the scope of this scenario.
 
-Apache Stanbol provides the machinery to start with vocabularies in standard languages such as [SKOS](http://www.w3.org/2004/02/skos/) or [RDF](http://www.w3.org/TR/rdf-primer/) encoded data sets. The Apache Stanbol components, which are needed for this functionality are the Entityhub and its indexing tool for creating and managing the index and [enhancement engines](components/enhancer/engines) that make use of the indexes during the enhancement process.
+The Stanbol Entityhub provides two possibilities to manage vocabularies
 
-To create and import your own vocabulary to the Apache Stanbol Entityhub you need to follow the following Steps
+1. __[Managed Sites](components/entityhub/managedsite)__: A fully readable and writable storage for Entities. Once created, users can use a RESTful interface to create, update, retrieve, query and delete entities.
+2. __Referenced Site__: A read-only version of a Site that can either be used as a local cache of remotely managed data (such as a [Linked Data](http://linkeddata.org/) server) or use a fully local index of the knowledge base - the relevant case in the context of this scenario.
+
+As a rule of thumb users should prefer to use a __Managed Site__ if the vocabulary changes regularly and those changes need to be reflected in enhancement results of processed documents. A __Referenced Site__ is typically the better choice for vocabularies that do not change on a regular basis and/or for users that want to apply advanced rules while indexing a dataset.
+
+### Using an Entityhub Managed Site
+
+How to use a Managed Site is already described in detail by the [Documentation of Managed Sites](components/entityhub/managedsite). To configure a new Managed Site on the Entityhub users need to create two components:
+
+1. the _Yard_ - the storage component of the Stanbol Entityhub. While there are multiple Yard implementations, when used for EntityLinking the [SolrYard implementation](components/entityhub/managedsite#configuration-of-a-solryard) should be used.
+2. the _[YardSite](components/entityhub/managedsite#configuration-of-the-yardsite)_ - the component that implements the ManagedSite interface.
+
+After completing those two steps an empty Managed Site should be ready to use and available under
+
+    http://{stanbol-host}/entityhub/site/{managed-site-name}/
+
+and users can start to upload the Entities of the controlled vocabulary using the RESTful interface, for example
+
+    curl -i -X PUT -H "Content-Type: application/rdf+xml" -T {rdf-xml-data} \
+        "http://{stanbol-host}/entityhub/site/{managed-site-name}/entity"
+
+In case you have opted to use a _Managed Site_ for managing your entities you can now skip ahead to the section 'Configuring the Stanbol Enhancer for your custom Vocabularies'.
+
+### Using an Entityhub Referenced Site
+
+Referenced Sites are used by the Stanbol Entityhub to reference external knowledge bases. This can be done by configuring remote services for dereferencing and querying information, but also by providing a full local index of the referenced knowledge base. 
+
+When using a Referenced Site in combination with the Stanbol Enhancer it is highly recommended for performance considerations to provide a full local index. To create such local indexes Stanbol provides the _Entityhub Indexing Tool_. See the following section for detailed information on how to use this tool.
+
+## Building full local indexes with the Entityhub Indexing Tool
+
+The Entityhub Indexing Tool allows you to create full local indexes of knowledge bases that can be loaded into the Stanbol Entityhub as Referenced Sites. Users of Managed Sites may want to skip this section.
+
+Users of the Entityhub Indexing Tool will typically need to complete the steps described in the following sub sections.
 
 ### Step 1 : Compile and assemble the indexing tool
 
@@ -66,7 +101,9 @@ This will create/initialize the default 
 After the initialization you will need to provide the following configurations in files located in the configuration folder (<code>{indexing-working-dir}/indexing/config</code>)
 
 * Within the <code>indexing.properties</code> file you need to set the {name} of your index by changing the value of the "name" property. In addition you should also provide a "description". At the end of the indexing.properties file you can also specify the license and attribution for the data you index. The Apache Entityhub will ensure that those information will be included with any entity data returned for requests.
-* If the data you index do use some none common namespaces you will need to add those to the <code>mapping.txt</code> file (here is an [example](examples/anl-mappings.txt)  including default and specific mappings for one dataset)
+* Optionally, if your data use namespaces that are not present in [prefix.cc](http://prefix.cc) (or the server used for indexing does not have internet connectivity) you can manually define the required prefixes by creating/using an <code>indexing/config/namespaceprefix.mappings</code> file. The syntax is '<code>{prefix}\t{namespace}\n</code>' where '<code>{prefix} ... [0..9A..Za..z-_]</code>' and '<code>{namespace} ... must end with '#' or '/' for URLs and ':' for URNs</code>'.
+* Optionally, if the data you index do use some none common namespaces you will need to add those to the <code>mapping.txt</code> file (here is an [example](examples/anl-mappings.txt)  including default and specific mappings for one dataset)
+* Optionally, if you want to use a custom SolrCore configuration, the core configuration needs to be copied to <code>indexing/config/{core-name}</code>. A default configuration to start from can be downloaded from the [Stanbol SVN](https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/yard/solr/src/main/resources/solr/core/) and extracted to the <code>indexing/config/</code> folder. If the {core-name} is different from the 'name' configured in the <code>indexing.properties</code> then the '<code>solrConf</code>' parameter of the '<code>indexingDestination</code>' MUST be set to '<code>solrConf:{core-name}</code>'. After those configurations users can make custom adaptations to the SolrCore configuration used for indexing.
 
 Finally you will also need to copy your source files into the source directory <code>{indexing-working-dir}/indexing/resources/rdfdata</code>. All files within this directory will be indexed. THe indexing tool support most common RDF serialization. You can also directly index compressed RDF files.
 
@@ -78,23 +115,23 @@ Once all source files are in place, you 
     $ cd {indexing-working-dir}
     $ java -Xmx1024m -jar org.apache.stanbol.entityhub.indexing.genericrdf-*-jar-with-dependencies.jar index
 
-Depending on your hardware and on complexity and size of your sources, it may take several hours to built the index. As a result, you will get an archive of an [Apache Solr](http://lucene.apache.org/solr/) index together with an OSGI bundle to work with the index in Stanbol.
+Depending on your hardware and on complexity and size of your sources, it may take several hours to built the index. As a result, you will get an archive of an [Apache Solr](http://lucene.apache.org/solr/) index together with an OSGI bundle to work with the index in Stanbol. Both files will be located within the <code>indexing/dist</code> folder.
 
 _IMPORTANT NOTES:_ 
 
 * The import of the RDF files to the Jena TDB triple store - used as source for the indexing - takes a lot of time. Because of that imported data are reused for multiple runs of the indexing tool. This has two important effects users need to be aware of:
 
-    1. Already imported RDF files should be removed from the <code>{indexing-working-dir}/indexing/resources/rdfdata</code> to avoid to re-import them on every run of the tool
+    1. Already imported RDF files should be removed from the <code>{indexing-working-dir}/indexing/resources/rdfdata</code> to avoid to re-import them on every run of the tool. NOTE: newer versions of the Entityhub indexing tool might automatically move successfully imported RDF files to a different folder.
     2. If the RDF data change you will need to delete the Jena TDB store so that those changes are reflected in the created index. To do this delete the <code>{indexing-working-dir}/indexing/resources/tdb</code> folder
 
 * Also the destination folder <code>{indexing-working-dir}/indexing/destination</code> is NOT deleted between multiple calls to index. This has the effect that Entities indexed by previous indexing calls are not deleted. While this allows to index a dataset in multiple steps - or even to combine data of multiple datasets in a single index - this also means that you will need to delete the destination folder if the RDF data you index have changed - especially if some Entities where deleted. 
 
 
-###Step 3 : Initialize the index within Apache Stanbol
+### Step 3 : Initialize the index within Apache Stanbol
 
 We assume that you already have a running Apache Stanbol instance at http://{stanbol-host} and that {stanbol-working-dir} is the working directory of that instance on the local hard disk. To install the created index you need to 
 
-* copy the "{name}.solrindex.zip" file to the <code>{stanbol-working-dir}/stanbol/datafiles</code> directory (NOTE if you run the 0.9.0-incubating version the path is <code>{stanbol-working-dir}/sling/datafiles</code>.
+* copy the "{name}.solrindex.zip" file to the <code>{stanbol-working-dir}/stanbol/datafiles</code> directory (NOTE if you run the 0.9.0-incubating version the path is <code>{stanbol-working-dir}/sling/datafiles</code>).
 * install the <code>org.apache.stanbol.data.site.{name}-{version}.jar</code> to the OSGI environment of your Stanbol instance e.g. by using the Bundle tab of the Apache Felix web console at </code>http://{stanbol-host}/system/console/bundles</code>
 
 You find both files in the <code>{indexing-working-dir}/indexing/dist/</code> folder.
@@ -105,7 +142,8 @@ After the installation your data will be
 
 You can use the Web UI of the Stanbol Enhancer to explore your vocabulary. Note, that in case of big vocabulary it might take some time until the site becomes functional.
 
-## B. Configure and use the index with the Apache Stanbol Enhancer
+
+## Configuring the Stanbol Enhancer for your custom Vocabularies
 
 This section covers how to configure the Apache Stanbol Enhancer to recognize and link entities of your custom vocabulary with processed documents.
 
@@ -125,54 +163,59 @@ Depending on if you want to use named en
 In case named entity linking is used the linking with the custom vocabulary is done by the [Named Entity Tagging Engine](components/enhancer/engines/namedentitytaggingengine.html).
 For the configuration of this engine you need to provide the following parameters
 
-1. The "name" of the enhancement engine. It is recommended to use "{name}Linking" - where {name} is the name of your vocabulary as used in part A. of this scenario.
+1. The "name" of the enhancement engine. It is recommended to use "{name}Linking" - where {name} is the name of the Entityhub Site (ReferencedSite or ManagedSite).
 2. The name of the referenced site holding your vocabulary. Here you have to configure the {name}.
 3. Enable/disable persons, organizations and places and if enabled configure the <code>rdf:type</code> used by your vocabulary for those type. If you do not want to restrict the type, you can also leave the type field empty.
 4. Define the property used to match against the named entities detected by the used NER engine(s).
 
 For more detailed information please see the documentation of the [Named Entity Tagging Engine](components/enhancer/engines/namedentitytaggingengine.html).
 
-Note, that for using named entity linking you need also ensure that an enhancement engine that provides NER is available in the [enhancement chain](components/enhancer/chains). By default Apache Stanbol includes three different engines that provide this feature: (1) [Named Entity Extraction Enhancement Engine](components/enhancer/engines/namedentityextractionengine.html) based on [OpenNLP](http://opennlp.apache.org), (2) CELI NER engine based on the [linguagrid.org](http://Linguagrid.org) service and (3) [OpenCalais Enhancement Engine](components/enhancer/engines/opencalaisengine.html) based on [OpenCalais](http://opencalais.com). Note that the later two options will require to send your content to the according services that are not part of your local Apache Stanbol instance.
+Note that for using named entity linking you also need to ensure that an enhancement engine that provides NER (Named Entity Recognition) is available in the [enhancement chain](components/enhancer/chains). See the [Stanbol NLP processing Language Support](components/enhancer/nlp/#stanbol-enhancer-nlp-support) section for detailed information on Languages with NER support.
 
-A typical [enhancement chain](components/enhancer/chains) for named entity linking with your custom vocabulary might look like
+The following example shows an [enhancement chain](components/enhancer/chains) for named entity linking based on OpenNLP and CELI as NLP processing modules
 
-* "langid" - [Language Identification Engine](components/enhancer/engines/langidengine.html) - to detect the language of the parsed content - a pre-requirement of all NER engines
-* "ner" - for NER support in English, Spanish and Dutch via the [Named Entity Extraction Enhancement Engine](components/enhancer/engines/namedentityextractionengine.html)
+* "langdetect" - [Language Detection Engine](components/enhancer/engines/langdetectengine) - to detect the language of the parsed content - a pre-requirement of all NER engines
+* "opennlp-ner" - for NER support in English, Spanish and Dutch via the [Named Entity Extraction Enhancement Engine](components/enhancer/engines/namedentityextractionengine.html)
 * "celiNer" - for NER support in French and Italien via the CELI NER engine
 * "{name}Linking - the [Named Entity Tagging Engine](components/enhancer/engines/namedentitytaggingengine.html) for your vocabulary as configured above.
 
 Both the [weighted chain](components/enhancer/chains/weightedchain.html) and the [list chain](components/enhancer/chains/listchain.html) can be used for the configuration of such a chain.
 
-### Configure Keyword Linking
-
-In case you want to use keyword linking to extract and link entities of your vocabulary you will need to configure the [Keyword Linking Engine](components/enhancer/engines/keywordlinkingengine.html) accordingly.
+### Configuring Entity Linking
 
-Here are the most important configuration options provided by the Keyword Linking Engine when configured via the [configuration tab](http://localhost:8080/system/console/configMgr) of the Apache Felix web console - http://{host}:{port}/system/console/configMgr. For the full list and detailed information please see the [documentation](components/enhancer/engines/keywordlinkingengine.html)).
+First it is important to note the difference between _Named Entity Linking_ and _Entity Linking_. While _Named Entity Linking_ only considers _Named Entities_ detected by NER (Named Entity Recognition), _Entity Linking_ works on Words (Tokens). Because of that it has much lower NLP requirements and can even operate for languages where only word tokenization is supported. However, extraction results AND performance greatly improve with POS (Part of Speech) tagging support. Also Chunking (Noun Phrase detection), NER and Lemmatization results can be consumed by Entity Linking to further improve extraction results. For details see the documentation of the [Entity Linking Process](components/enhancer/engines/entitylinking#linking-process).
 
-1. The "Name" of the enhancement engine. It is recommended to use "{name}Keyword" - where {name} is the name of your vocabulary as used in part A. of this scenario
-2. The name of the "Referenced Site" holding your vocabulary. Here you have to configure the {name}
-3. The "Label Field" is the URI of the property in your vocabulary providing the labels used for matching. You can only use a single field. If you want to use values of several fields you have two options: (1) to adapt your indexing configuration to copy the values of those fields to a single one (e.g. the values of "skos:prefLabel" and "skos:altLabel" are copied to "rdfs:label" in the default configuration of the Entityhub indexing tool (see {indexing-working-dir}/indexing/config/mappings.txt) (2) to configure multiple Keyword Linking Engine(s) - one for each label field. Option (1) is preferable as long as you do not need to use different configurations for the different labels.
-4. The "Type Mappings" might be interesting for you if your vocabulary contains custom types as those mappings can be used to map 'rdf:type's of entities in your vocabulary to 'dc:type's used for 'fise:TextAnnotation's - created by the Apache Stanbol Enhancer to annotate occurrences of extracted entities in the parsed text. See the [type mapping syntax](components/enhancer/engines/keywordlinkingengine.html#type-mappings-syntax) and the [usage scenario for the Apache Stanbol Enhancement Structure](enhancementusage.html#entity-tagging-with-disambiguation-support) for details.
+The second big difference is that _Named Entity Linking_ can only support Entity types supported by the NER models (Persons, Organizations and Places). _Entity Linking_ does not have this restriction. This advantage comes with the disadvantage that Entity lookups against the Controlled Vocabulary are based only on Label similarities. _Named Entity Linking_ also uses the type information provided by NER.
 
-A typical [enhancement chain](components/enhancer/chains) for named entity linking with your vocabulary might look like
+To use _Entity Linking_ with a custom Vocabulary Users need to configure an instance of the [Entityhub Linking Engine](components/enhancer/engines/entityhublinking). While this Engine provides more than twenty configuration parameters, the following list gives an overview of the most important ones. For detailed information please see the documentation of the Engine.
 
-* "langid" - [Language Identification Engine](components/enhancer/engines/langidengine.html) - to detect the language of the parsed content - a pre-requirement of the Keyword Linking Engine.
-* "{name}Keyword - the [Keyword Linking Engine](components/enhancer/engines/keywordlinkingengine.html) for your vocabulary as configured above.
+1. The "Name" of the enhancement engine. It is recommended to use something like "{name}Extraction" - where {name} is the name of the Entityhub Site
+2. The name of the "Managed- / Referenced Site" holding your vocabulary. Here you have to configure the {name}
+3. The "Label Field" is the URI of the property in your vocabulary providing the labels used for matching. You can only use a single field. If you want to use values of several fields you have two options: (1) to adapt your indexing configuration to copy the values of those fields to a single one (e.g. the values of "skos:prefLabel" and "skos:altLabel" are copied to "rdfs:label" in the default configuration of the Entityhub indexing tool, see {indexing-working-dir}/indexing/config/mappings.txt) or (2) to configure multiple EntityhubLinkingEngines - one for each label field. Option (1) is preferable as long as you do not need to use different configurations for the different labels.
+4. The "Link ProperNouns only": If the custom Vocabulary contains Proper Nouns (Named Entities) then this parameter should be activated. This option causes the Entity Linking process to not make queries for common nouns, thereby reducing the number of queries against the controlled vocabulary by ~70%. However, this is not feasible if the vocabulary contains Entities that are common nouns in the language.
+5. The "Type Mappings" might be interesting for you if your vocabulary contains custom types as those mappings can be used to map 'rdf:type's of entities in your vocabulary to 'dc:type's used for 'fise:TextAnnotation's - created by the Apache Stanbol Enhancer to annotate occurrences of extracted entities in the parsed text. See the [type mapping syntax](components/enhancer/engines/keywordlinkingengine.html#type-mappings-syntax) and the [usage scenario for the Apache Stanbol Enhancement Structure](enhancementusage.html#entity-tagging-with-disambiguation-support) for details.
+
+The following example shows an [enhancement chain](components/enhancer/chains) that uses OpenNLP for NLP:
+
+* "langdetect" - [Language Detection Engine](components/enhancer/engines/langdetectengine) - to detect the language of the parsed content - a prerequisite for all NER engines
+* "opennlp-sentence" - [Sentence detection with OpenNLP](components/enhancer/engines/opennlpsentence)
+* "opennlp-token" - [OpenNLP based Word tokenization](components/enhancer/engines/opennlptokenizer). Works for all languages where white spaces can be used to tokenize.
+* "opennlp-pos" - [OpenNLP Part of Speech tagging](components/enhancer/engines/opennlppos)
+* "opennlp-chunker" - The [OpenNLP chunker](components/enhancer/engines/opennlpchunker) provides Noun Phrases
+* "{name}Extraction" - the [Entityhub Linking Engine](components/enhancer/engines/entityhublinking) configured for the custom vocabulary.
 
 Both the [weighted chain](components/enhancer/chains/weightedchain.html) and the [list chain](components/enhancer/chains/listchain.html) can be used for the configuration of such a chain.
 
+The documentation of the Stanbol NLP processing module provides [detailed information](components/enhancer/nlp/#stanbol-enhancer-nlp-support) about integrated NLP frameworks and supported languages.
+
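The engine order of the example chain above can be written down as a small Python sketch. This is only an illustration: the helper `build_chain` and the site name "myvocab" are hypothetical and not part of Stanbol, and the sketch only assembles the engine names listed above in order. It does not create or configure anything in a running Stanbol instance.

```python
from typing import List


def build_chain(site_name: str) -> List[str]:
    """Return the engine names of the example chain, in processing order.

    `site_name` stands for the {name} of the Entityhub Site holding the
    custom vocabulary (hypothetical helper, for illustration only).
    """
    if not site_name:
        raise ValueError("site_name must not be empty")
    return [
        "langdetect",              # language detection
        "opennlp-sentence",        # sentence detection
        "opennlp-token",           # word tokenization
        "opennlp-pos",             # part-of-speech tagging
        "opennlp-chunker",         # noun phrase detection
        f"{site_name}Extraction",  # Entityhub Linking Engine for the vocabulary
    ]


if __name__ == "__main__":
    # "myvocab" is an assumed site name used only for this example.
    for engine in build_chain("myvocab"):
        print(engine)
```

The resulting names are the values one would enter, in this order, when configuring the chain.
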
 ### How to use enhancement chains
 
-In the default configuration the Apache Stanbol Enhancer provides two enhancement chains:
+In the default configuration the Apache Stanbol Enhancer provides several enhancement chains including:
 
-1) a "default" chain that includes all currently active [enhancement engines](components/enhancer/engines) and 
+1) a "default" chain providing _Named Entity Linking_ based on DBpedia and _Entity Linking_ based on the Entityhub
 2) the "language" chain that is intended to be used to detect the language of parsed content.
+3) a "dbpedia-proper-noun-linking" chain showing _Named Entity Linking_ based on DBpedia
 
-As soon as Apache Stanbol users start to add own vocabularies to the Apache Stanbol Entityhub and configure [Named Entity Tagging Engine](components/enhancer/engines/namedentitytaggingengine.html) or [Keyword Linking Engine](components/enhancer/engines/keywordlinkingengine.html), the default chain, which includes all active engines, may become unusable. Most likely users want to deactivate the "default" chain and configure their own. This section provides more information on how to do that.
-
-__Deactivate the chain of all active enhancement engines__
-
-Users that add additional enhancement engines might need to deactivate the enhancement chain that includes all active engines. This can be done in the configuration tab of the Apache Felix web console - [http://{stabol-host}/system/console/configMgr](http://localhost:8080/system/console/configMgr). Open the configuration dialog of the "Apache Stanbol Enhancer Chain: Default Chain" component and deactivate it.
 
 __Change the enhancement chain bound to "/enhancer"__
 

Added: stanbol/site/trunk/content/docs/trunk/enhancementworkflow.png
URL: http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/enhancementworkflow.png?rev=1469293&view=auto
==============================================================================
Binary file - no diff available.

Propchange: stanbol/site/trunk/content/docs/trunk/enhancementworkflow.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream