Posted to commits@stanbol.apache.org by rw...@apache.org on 2012/06/18 20:07:33 UTC

svn commit: r1351433 - /incubator/stanbol/site/trunk/content/stanbol/docs/trunk/contentenhancement.mdtext

Author: rwesten
Date: Mon Jun 18 18:07:32 2012
New Revision: 1351433

URL: http://svn.apache.org/viewvc?rev=1351433&view=rev
Log:
updated content enhancement usage scenario

Modified:
    incubator/stanbol/site/trunk/content/stanbol/docs/trunk/contentenhancement.mdtext

Modified: incubator/stanbol/site/trunk/content/stanbol/docs/trunk/contentenhancement.mdtext
URL: http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/contentenhancement.mdtext?rev=1351433&r1=1351432&r2=1351433&view=diff
==============================================================================
--- incubator/stanbol/site/trunk/content/stanbol/docs/trunk/contentenhancement.mdtext (original)
+++ incubator/stanbol/site/trunk/content/stanbol/docs/trunk/contentenhancement.mdtext Mon Jun 18 18:07:32 2012
@@ -2,155 +2,90 @@ Title: Using Apache Stanbol for enhancin
 
 For enhancing content you simply post plain text content to the Enhancement Engines and you will get back enhancement data. The enhancement process is stateless, so neither your content item, nor the enhancements will be stored. 
 
-You can test this via the [web interface of the engines][stan-engines] or from console via
+You can test this via the [Web interface](http://localhost:8080/enhancer) of the Stanbol Enhancer - http://{host}:{port}/enhancer - or from the console using curl.
 
+    :::bash
     curl -X POST -H "Accept: text/turtle" -H "Content-type: text/plain" \
     --data "The Stanbol enhancer can detect famous cities such as Paris \
     and people such as Bob Marley." http://localhost:8080/engines
 
-or by using the text examples delivered with Stanbol.
+The following script sends the contents of the text-examples folder to the Stanbol Enhancer.
 
-	for file in enhancer/data/text-examples/*.txt;
+    for file in enhancer/data/text-examples/*.*;
     do
-    curl -X POST -H "Accept: text/turtle" -H "Content-type: text/plain" -T $file http://localhost:8080/engines;
+        curl -X POST -H "Accept: text/turtle" -H "Content-type: text/plain" \
+            -T $file http://localhost:8080/enhancer;
     done
 
-Content items in formats other than plain text can be tested via the [web interface of contenthub][stan-contenthub] or via the console by attaching files. (The Metaxa Engine needs to be activated).
+The Stanbol Enhancer can also enhance non-plain-text files. In this case [Apache Tika](http://tika.apache.org) - via the [Tika Engine](enhancer/engines/tikaengine.html) - is used to extract the plain text from those files (see the Apache Tika documentation for supported file formats).
 
+## Configuring and Using Enhancement Chains
 
-## Using the enhancement engines
+The Stanbol Enhancer supports multiple [Enhancement Chains](enhancer/chains). This feature allows configuring multiple processing chains for parsed content within the same Stanbol instance.
 
-Apache Stanbol starts with a number of active enhancement engines by default. You can activate or deactivate engines as well as configure them to your needs via the [OSGI administration console][stan-admin].
+Chains are built based on an [Execution Plan](enhancer/chains/executionpla.html) referencing one or more [Enhancement Engines](enhancer/engines) by their name. Users can create and modify Enhancement Chains using the [Configuration Tab](http://localhost:8080/system/console/configMgr) of the Apache Felix Web Console - http://{host}:{port}/system/console/configMgr. There are three different implementations: (1) the self-sorting [Weighted Chain](enhancer/chains/weightedchain.html), (2) the [List Chain](enhancer/chains/listchain.html) and (3) the [Graph Chain](enhancer/chains/graphchain.html), which allows the direct configuration of the execution graph. There is also a (4) [Default Chain](enhancer/chains/defaultchain.html) that includes all currently active Enhancement Engines. While this chain is enabled by default, most users will want to deactivate it as soon as they have configured their own chains.
 
-For the enhancement engines, a workflow for the enhancement process is defined as pre-processing, content-extraction, extraction-enhancement, default and post-processing. 
+To configure Enhancement Chains it is essential to understand the intention of the different [Enhancement Engine](enhancer/engines) implementations. The [list of all Enhancement Engines](enhancer/engines/list.html) managed by the Apache Stanbol community gives an overview; see the documentation of the listed engines for detailed information.
 
-The following pre-processing engines are available:
+The list groups engines by categories: _Preprocessing Engines_ typically perform operations on the content itself. This includes plain-text extraction, metadata extraction and language detection. This is followed by engines that analyze the parsed content. This category currently includes all Natural Language Processing related engines but would also include image, audio and video processing. The third category consists of engines that consume extracted features from the content and perform some kind of semantic lifting on them - e.g. linking extracted features with Entities/Concepts contained in controlled vocabularies. Finally, _Post-Processing Engines_ can be used to adjust rankings, filter out unwanted enhancements or perform other kinds of transformations on the enhancement results.
 
-- The __Language Identification Engine__ detects several European languages of the content items you want to process.
+A typical Text Processing Enhancement Chain might look like this:
 
-- The __Metaxa Engine__ extracts embedded metadata and textual content from a large variety of document types and formats.
+* [tika](enhancer/engines/tikaengine.html) - to convert parsed content to "text/plain"
+* [langid](enhancer/engines/langidengine.html) - to detect the language of the parsed text
+* [ner](enhancer/engines/namedentityextractionengine.html) - to extract named entities (Persons, Organizations, Places) from the parsed text
+* [dbpediaLinking](enhancer/engines/namedentitytaggingengine.html) - to link extracted named entities with Entities defined by [dbpedia.org](http://dbpedia.org)
+* [myCustomVocExtraction](enhancer/engines/keywordlinkingengine.html) - to extract keywords based on a custom-built vocabulary - as described in this [usage scenario](customvocabulary.html)
 
-For content extraction / natural language processing one engine is available:
+Another Enhancement Chain using an external service:
 
-- The __Named Entity Extraction Enhancement Engine__ leverages the sentence detector and name finder tools of the OpenNLP project bundled with statistical models trained to detect occurrences of names of persons, places and organizations.
+* [tika](enhancer/engines/tikaengine.html) - assuming we want to send MS Word documents to Zemanta
+* [zemanta](enhancer/engines/zemantaengine.html) - this wraps [Zemanta.com](http://www.zemanta.com/) as a Stanbol Enhancement Engine
 
+_Tips for configuring Enhancement Chains:_
 
-The extracted items will then be enhanced by a dedicated engine:
+* [http://{host}:{port}/enhancer/chain](http://localhost:8080/enhancer/chain) provides a list of all configured [Enhancement Chains](enhancer/chains). It also includes direct links to their configurations.
+* As the names of active [Enhancement Engines](enhancer/engines) are needed for the configuration of Enhancement Chains, it is very useful to open [http://{host}:{port}/enhancer/engine](http://localhost:8080/enhancer/engine) in another browser window.
 
-- The __Named Entity Tagging Engine__ provides according suggestions from dbpedia (default) and other references sites for entities extracted by the NER engine .
+After configuring the Enhancement Engines and combining them into Enhancement Chains, it is important to understand how to inspect and call the configured components via the RESTful API of the Stanbol Enhancer.
 
+Enhancement requests issued directly to the <code>/enhancer</code> (or the old, deprecated <code>/engines</code>) endpoint are processed using the Enhancement Chain with the name "default" or, if no chain has that name, the one with the highest "service.ranking" (see [here](enhancer/chains/#default-chain) for details). To process content with a specific chain, requests need to be issued against <code>/enhancer/chain/{chain-name}</code>.
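
As a sketch, both request styles can be issued with curl. This assumes a local Stanbol instance; the chain name "dbpedia-linking" is hypothetical - substitute the name of one of your configured chains:

```shell
STANBOL="http://localhost:8080"    # base URL of the Stanbol instance
CHAIN="dbpedia-linking"            # hypothetical chain name - use one of yours

# Processed by the chain named "default" (or the highest ranked chain if none
# has that name). The "|| true" keeps the sketch going if Stanbol is down.
curl -X POST -H "Accept: text/turtle" -H "Content-type: text/plain" \
    --data "Paris is the capital of France." "$STANBOL/enhancer" || true

# Processed by the specific chain with the given name.
curl -X POST -H "Accept: text/turtle" -H "Content-type: text/plain" \
    --data "Paris is the capital of France." "$STANBOL/enhancer/chain/$CHAIN" || true
```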
 
-Specific additional enhancement engines are: 
+Note that it is also possible to enhance content by using a single [Enhancement Engine](enhancer/engines). For that, requests can be sent to <code>/enhancer/engine/{engine-name}</code>. A typical example would be sending text directly to the [Language Identification Engine](enhancer/engines/langidengine.html) to use the Stanbol Enhancer to detect the language of the parsed content.
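
For example, language detection with only the Language Identification Engine might look like this (a sketch, assuming a local instance and that the engine is active under the name "langid" as shown above):

```shell
ENGINE="langid"                    # name of the Language Identification Engine
STANBOL="http://localhost:8080"    # base URL of the Stanbol instance

# Only the referenced engine is executed; no full chain runs.
# The "|| true" keeps the sketch going if Stanbol is down.
curl -X POST -H "Accept: text/turtle" -H "Content-type: text/plain" \
    --data "Bonjour tout le monde" "$STANBOL/enhancer/engine/$ENGINE" || true
```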
 
-- The __Location Enhancement Engine__ takes its suggestions from geonames.org only.
+To sum up, the RESTful API of the Stanbol Enhancer is structured as follows:
 
-- The __OpenCalais Enhancement Engine__ uses services from Open Calais. (Note: You need to provide a key in order to use this engine)
+    GET /enhancer - returns the configuration of the Stanbol Enhancer
+    GET /enhancer/chain - returns the configuration of all active Enhancement Chains
+    GET /enhancer/engine - returns the configuration of all active Enhancement Engines
+    POST /enhancer - enhances parsed content by using the default Enhancement Chain
+    POST /enhancer/chain/{chain-name} - enhances parsed content by using the Enhancement Chain with the given name
+    POST /enhancer/engine/{engine-name} - enhances parsed content by using only the referenced Enhancement Engine
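
The GET endpoints above can be queried the same way; a sketch, assuming a local instance (the exact response format depends on the Accept header):

```shell
STANBOL="http://localhost:8080"    # base URL of the Stanbol instance

# List the active Enhancement Chains and Engines.
# The "|| true" keeps the sketch going if Stanbol is down.
curl -H "Accept: text/turtle" "$STANBOL/enhancer/chain" || true
curl -H "Accept: text/turtle" "$STANBOL/enhancer/engine" || true
```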
 
-- The __Zemanta Enhancement Engine__ uses the Zemanta services. (Note: You need to provide a key in order to use this engine)
+See the [documentation](enhancer/enhancerrest.html) of the RESTful API for all services and parameters of the Stanbol Enhancer.
 
+## Using an index of linked open data locally
 
-For post-processing the results of the enhancement engines
+Both the [Named Entity Tagging Engine](enhancer/engines/namedentitytaggingengine.html) and the [Keyword Linking Engine](enhancer/engines/keywordlinkingengine.html) need to be configured with a dataset containing the Entities to link/extract from parsed content. As those engines typically make a lot of requests against those datasets, it is important to make the data available locally - a feature of the [Apache Stanbol Entityhub](entityhub).
 
-- The __CachingDereferencerEngine__ is used for the Web UI and fetches files such as images for locations from external sites and is used to present the enhancement results. 
+Because of this, Apache Stanbol allows creating/installing local indexes of datasets. How to create those indexes is described in detail in this [usage scenario](customvocabulary.html). A set of pre-computed indexes can be downloaded from the [IKS development server](http://dev.iks-project.eu/downloads/stanbol-indices/).
 
+Indexes always consist of two parts:
 
-## Using an index of linked open data locally
+* org.apache.stanbol.data.site.{name}-{version}.jar - An OSGI bundle containing the configuration for
+    - the Apache Entityhub "ReferencedSite" accessible at "http://{host}/{root}/entityhub/site/{name}"
+    - the "Cache" used to connect the ReferencedSite with your Data and 
+    - the "SolrYard" component managing the installed data.
+* {name}.solrindex.zip - The index data of the dataset (basically a ZIP archive of a [Solr](http://lucene.apache.org/solr/) Core)
+
+To install the local index of a dataset, the following two steps need to be performed:
 
-To use the pre-configured indexes you can download them from [here][stan-download]. You will get two files for each index:
+* copying the zip archive into the "{stanbol-working-dir}/stanbol/datafiles" folder
+* adding the OSGI bundle to the Stanbol environment (e.g. by using the [Bundle Tab](http://localhost:8080/system/console/bundles) of the Apache Felix Web Console - http://{host}:{port}/system/console)
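
The first step can be sketched as shell commands; the working directory "/tmp/stanbol" and the index name "myvocabulary" are hypothetical - adjust them to your setup:

```shell
STANBOL_DIR="/tmp/stanbol"                 # {stanbol-working-dir} - hypothetical
INDEX="myvocabulary.solrindex.zip"         # downloaded/created index archive

mkdir -p "$STANBOL_DIR/stanbol/datafiles"  # datafiles folder of the launcher
touch "$INDEX"                             # stands in for the real archive here
cp "$INDEX" "$STANBOL_DIR/stanbol/datafiles/"   # step 1: provide the index data

# Step 2 is done in the browser: install the org.apache.stanbol.data.site.{name}
# bundle via the Bundle Tab of the Apache Felix Web Console
# (http://{host}:{port}/system/console/bundles).
```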
 
-* org.apache.stanbol.data.site.{name}-{version}.jar 
-* {name}.solrindex.zip
+_NOTE:_ In case of "dbpedia" the OSGI bundle with the configuration does not need to be installed, as the default configuration of the Apache Stanbol launcher already includes the configuration of the necessary components.
 
 
-By copying the zip archive into the "/sling/datafiles" folder before installing the bundle, the data will used during the installation of the bundle automatically. If you provide the file after installing the bundle, you will need to restart the SolrYard installed by the bundle.
+## Processing the Enhancement Results
 
-The jar can be installed at any OSGI environment running the Apache Stanbol Entityhub. When started it will create and configure:
-
-- a "ReferencedSite" accessible at "http://{host}/{root}/entityhub/site/{name}"
-- a "Cache" used to connect the ReferencedSite with your Data and
-- a "SolrYard" that manages the data indexed by this utility.
-
-This bundle does not contain the indexed data but only the configuration for the Solr Index.
-
-If one has not copied the archive beforehand, the ZIP archive will be requested by the Apache Stanbol Data File Provider after installing the Bundle. To install the data you need copy this file to the "/sling/datafiles" folder within the working directory of your Stanbol Server.
-
-_Note: {name} denotes to the value you configured for the "name" property within the "indexing.properties" file._
-
-
-## Enhancement Example
-
-The text "The Stanbol enhancer can detect famous cities such as Paris and people such as Bob Marley." with the default configuration of enhancement engines and with a local index of dbpedia entities will result in the following output graph of several __Entity Annotations__ and __Text Annotations__. 
-
-Two of the relevant fragments for "Paris" are listed below in Turtle-Syntax:
-
-### Example for Text Annotation
-
-    <urn:enhancement-4a2543d8-4d83-43ce-3a33-2924f457c872>
-      a       <http://fise.iks-project.eu/ontology/TextAnnotation> , 
-              <http://fise.iks-project.eu/ontology/Enhancement> ;
-      
-      <http://fise.iks-project.eu/ontology/confidence>
-              "0.9322403510215739"^^<http://www.w3.org/2001/XMLSchema#double> ;
-    
-      <http://fise.iks-project.eu/ontology/end>
-              "59"^^<http://www.w3.org/2001/XMLSchema#int> ;
-      
-      <http://fise.iks-project.eu/ontology/extracted-from>
-              <urn:content-item-sha1-37c8a8244041cf6113d4ee04b3a04d0a014f6e10> ;
-      
-      <http://fise.iks-project.eu/ontology/selected-text>
-              "Paris"^^<http://www.w3.org/2001/XMLSchema#string> ;
-      
-      <http://fise.iks-project.eu/ontology/selection-context>
-              "The Stanbol enhancer can detect famous cities such as 
-              Paris and people such as Bob Marley."
-              ^^<http://www.w3.org/2001/XMLSchema#string> ;
-    
-      <http://fise.iks-project.eu/ontology/start>
-              "54"^^<http://www.w3.org/2001/XMLSchema#int> ;
-      
-      <http://purl.org/dc/terms/created>
-              "2012-02-29T11:18:36.282Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
-      
-      <http://purl.org/dc/terms/creator>
-              "org.apache.stanbol.enhancer.engines.opennlp.impl.NEREngineCore"
-              ^^<http://www.w3.org/2001/XMLSchema#string> ;
-      
-      <http://purl.org/dc/terms/type>
-              <http://dbpedia.org/ontology/Place> .
-
-### Example for Entity Annotation
-    
-    <urn:enhancement-b5e71f70-4978-a70b-7111-8d6e31283a58>
-	  a       <http://fise.iks-project.eu/ontology/EntityAnnotation> , 
-	          <http://fise.iks-project.eu/ontology/Enhancement> ;
-	
-	  <http://fise.iks-project.eu/ontology/confidence>
-	          "1323049.5"^^<http://www.w3.org/2001/XMLSchema#double> ;
-	
-	  <http://fise.iks-project.eu/ontology/entity-label>
-	           "Paris"@en ;
-	
-	  <http://fise.iks-project.eu/ontology/entity-reference>
-	           <http://dbpedia.org/resource/Paris> ;
-	
-	  <http://fise.iks-project.eu/ontology/entity-type>
-	           <http://www.w3.org/2002/07/owl#Thing> , 
-	           <http://www.opengis.net/gml/_Feature> , 
-	           <http://dbpedia.org/ontology/Place> , 
-	           <http://dbpedia.org/ontology/Settlement> , 
-	           <http://dbpedia.org/ontology/PopulatedPlace> ;
-	
-	  <http://fise.iks-project.eu/ontology/extracted-from>
-	           <urn:content-item-sha1-37c8a8244041cf6113d4ee04b3a04d0a014f6e10> ;
-	
-	  <http://purl.org/dc/terms/created>
-	           "2012-02-29T11:18:36.320Z"
-	           ^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
-
-      <http://purl.org/dc/terms/creator>
-	           "org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEngine"
-	           ^^<http://www.w3.org/2001/XMLSchema#string> ;
-    
-      <http://purl.org/dc/terms/relation>
-	           <urn:enhancement-4a2543d8-4d83-43ce-3a33-2924f457c872> .
\ No newline at end of file
+The final step in using the Stanbol Enhancer is processing the enhancement results. As this is a central task for developers of client applications, it is described in its own [Usage Scenario](enhancementusage.html).
\ No newline at end of file