Posted to commits@stanbol.apache.org by rw...@apache.org on 2012/03/05 14:28:39 UTC

svn commit: r1297047 - in /incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines: list.mdtext tikaengine.mdtext

Author: rwesten
Date: Mon Mar  5 13:28:39 2012
New Revision: 1297047

URL: http://svn.apache.org/viewvc?rev=1297047&view=rev
Log:
Documentation for the TikaEngine

Added:
    incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/tikaengine.mdtext
Modified:
    incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/list.mdtext

Modified: incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/list.mdtext
URL: http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/list.mdtext?rev=1297047&r1=1297046&r2=1297047&view=diff
==============================================================================
--- incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/list.mdtext (original)
+++ incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/list.mdtext Mon Mar  5 13:28:39 2012
@@ -4,54 +4,57 @@ This provides an overview about all [Enh
 
 ## Preprocessing
 
-- __[Language Identification Engine](langidengine.html)__
-	- language detection for textual content utilizing [Apache Tika](http://tika.apache.org/)
+* __[Language Identification Engine](langidengine.html)__
+	* language detection for textual content utilizing [Apache Tika](http://tika.apache.org/)
 	
-
-- __[Metaxa Engine](metaxaengine.html)__
-	- text extraction from various document formats
-	- extraction of metadata from document formats
-	-
+* __[Tika Engine](tikaengine.html)__ (based on [Apache Tika](http://tika.apache.org/))
+	* content type detection
+	* text extraction from various document formats
+	* extraction of metadata from document formats
+
+* __[Metaxa Engine](metaxaengine.html)__
+	* text extraction from various document formats
+	* extraction of metadata from document formats
 	
 ## Natural Language Processing
 
-- __[Named Entity Extraction Enhancement Engine](namedentityextractionengine.html)__ 
-	- NLP processing using OpenNLP NER
-	- detects occurrences of persons, places and organizations only
+* __[Named Entity Extraction Enhancement Engine](namedentityextractionengine.html)__ 
+	* NLP processing using OpenNLP NER
+	* detects occurrences of persons, places and organizations only
 	
 	
-- __[KeywordLinkingEngine](keywordlinkingengine.html)__
-	- NLP processing using OpenNLP
-	- supports multiple languages
-	- detects occurrences of untyped entities as concepts, takes local taxonomies as linking target
+* __[KeywordLinkingEngine](keywordlinkingengine.html)__
+	* NLP processing using OpenNLP
+	* supports multiple languages
+	* detects occurrences of untyped entities as concepts, takes local taxonomies as linking target
 
 	
-- _Taxonomy Linking Engine_ (deprecated, see KeywordLinkingEngine)
-	- NLP processing using OpenNLP POS
-	- detect occurrences of untyped entities as concepts, takes local taxonomies as linking target
+* _Taxonomy Linking Engine_ (deprecated, see KeywordLinkingEngine)
+	* NLP processing using OpenNLP POS
+	* detect occurrences of untyped entities as concepts, takes local taxonomies as linking target
 	
 
 ## Linking Suggestions
 
-- __[Named Entity Tagging Engine](namedentitytaggingengine.html)__
-	- suggest links to several Linked Data Sources (e.g. DBpedia)
+* __[Named Entity Tagging Engine](namedentitytaggingengine.html)__
+	* suggest links to several Linked Data Sources (e.g. DBpedia)
 
-- __[Geonames Enhancement Engine](geonamesengine.html)__ 
-	- suggests links to geonames.org
-	- provides hierarchical links for locations
+* __[Geonames Enhancement Engine](geonamesengine.html)__ 
+	* suggests links to geonames.org
+	* provides hierarchical links for locations
 
-- __[OpenCalais Enhancement Engine](opencalaisengine.html)__
- 	- integrates service from Open Calais. (Note: You need to provide a key in order to use this engine)
+* __[OpenCalais Enhancement Engine](opencalaisengine.html)__
+ 	* integrates service from Open Calais. (Note: You need to provide a key in order to use this engine)
 
-- __[Zemanta Enhancement Engine](zemantaengine.html)__
-	- integrates the Zemanta services. (Note: You need to provide a key in order to use this engine)
+* __[Zemanta Enhancement Engine](zemantaengine.html)__
+	* integrates the Zemanta services. (Note: You need to provide a key in order to use this engine)
 
 
 
 ## Postprocessing / Other
 
-- _CachingDereferencerEngine_ (deprecated, see dereferencing support of individual engines as well as  [STANBOL-336](https://issues.apache.org/jira/browse/STANBOL-336))
-	- retrieves additional content for presenting the enhancement results.
+* _CachingDereferencerEngine_ (deprecated, see dereferencing support of individual engines as well as  [STANBOL-336](https://issues.apache.org/jira/browse/STANBOL-336))
+	* retrieves additional content for presenting the enhancement results.
 	
-- __[Refactor Engine](refactorengine.html)__
-		- transforms enhancements according to a target ontology, requires KRES launcher.
+* __[Refactor Engine](refactorengine.html)__
+	* transforms enhancements according to a target ontology, requires KRES launcher.

Added: incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/tikaengine.mdtext
URL: http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/tikaengine.mdtext?rev=1297047&view=auto
==============================================================================
--- incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/tikaengine.mdtext (added)
+++ incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/tikaengine.mdtext Mon Mar  5 13:28:39 2012
@@ -0,0 +1,67 @@
+Title: Tika Engine
+
+
+The Tika Engine is an Apache Stanbol Enhancement Engine based on Apache Tika. It has three main functions:
+
+1. To detect the content type of parsed content. This is only performed if no content type is provided or the content type is set to "application/octet-stream". The detected content type is added to the metadata of the ContentItem.
+2. To extract the plain text (and XHTML) from parsed content and add it to the [ContentItem](../contentitem.html) as content parts with the type Blob.
+3. To extract metadata from the parsed content and add it to the metadata of the [ContentItem](../contentitem.html).
+
+
+## Supported Media Types
+
+Because this engine uses Apache Tika, the supported media types are the same as those listed on the [Tika Homepage](http://tika.apache.org/1.0/formats.html).
+
+## Extracted Metadata
+
+Tika provides metadata as 'key:values' pairs. To use them efficiently within Stanbol, they need to be converted to valid RDF and aligned with existing Ontologies.
+
+The TikaEngine supports alignments to several different Ontologies. Such alignment rules can be activated/deactivated within the configuration of the TikaEngine.
+
+Supported Ontologies:
+
+* [Ontology for Media Resources](http://www.w3.org/TR/mediaont-10/): This is the most complete mapping to a single Ontology. It includes mappings for all Dublin Core metadata, geo locations, some image-specific data and most of the audio and video related metadata.
+
+* [DC terms](http://dublincore.org/documents/dcmi-terms/): Provides good mappings for text documents (HTML, Office, OpenOffice, PDF ...)
+
+* [Nepomuk EXIF ontology](http://www.semanticdesktop.org/ontologies/2007/05/10/nexif/): Interesting for users that want to work with EXIF metadata extracted from images.
+
+* [Nepomuk Message Ontology](http://www.semanticdesktop.org/ontologies/2007/03/22/nmo/): Used for sender and receiver information of mail messages.
+
+* SKOS: Allows mapping of labels and notes to [SKOS](http://www.w3.org/2009/08/skos-reference/skos.html). This is deactivated by default.
+
+* RDFS: Allows mapping of labels and comments to "rdfs:label" and "rdfs:comment".
+
+### ContentType:
+
+The detected content type for the parsed contentItem is added by using the following two properties:
+
+* 'http://purl.org/dc/terms/format': Dublin Core terms 'format'
+* 'http://www.w3.org/ns/ma-ont#hasFormat': Media Resource Ontology 'hasFormat'
+
+Note that these properties will only be present if the related Ontology is activated in the TikaEngine configuration.
+
+
+## Sending Requests directly to the Tika Engine
+
+The Stanbol Enhancer allows enhancement requests to be sent directly to a specific EnhancementEngine. This feature can be used in combination with the Tika Engine to request
+
+1. the "text/plain" or "application/xhtml+xml" version of parsed content
+2. the extracted metadata as RDF aligned to the activated Ontologies
+
+The first example requests the plain text version of a PDF file with the name "test.pdf". Note that
+
+* the 'Accept' header is set to the content type of the requested content, and
+* the 'omitMetadata=true' query parameter tells the Enhancer not to return the RDF metadata.
+
+    :::bash
+    curl -v -X POST -H "Accept: text/plain" -T test.pdf \
+        "http://localhost:8080/enhancer/engine/tika?omitMetadata=true"
+
+The second example returns the metadata extracted from the parsed "song.mp3":
+
+    :::bash
+    curl -v -X POST -H "Accept: application/rdf+xml" -T song.mp3 \
+        "http://localhost:8080/enhancer/engine/tika"
+
+
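
The documentation added above says the Tika Engine can also return the "application/xhtml+xml" version of parsed content, but only the "text/plain" request is shown. A minimal sketch of the XHTML variant follows, assuming the same endpoint and `omitMetadata=true` parameter as in the first example. The `STANBOL_URL` variable and the `tika_url` and `tika_xhtml` helpers are hypothetical names introduced here for illustration; they are not part of Stanbol.

```shell
#!/bin/sh
# Base URL of the Stanbol instance (assumption: same host and port as in the
# documented examples); can be overridden from the environment.
STANBOL_URL="${STANBOL_URL:-http://localhost:8080}"

# Hypothetical helper: prints the Tika Engine endpoint URL with the
# 'omitMetadata=true' parameter, so no RDF metadata is returned.
tika_url() {
    printf '%s/enhancer/engine/tika?omitMetadata=true' "$STANBOL_URL"
}

# Hypothetical helper: uploads the file given as $1 and requests the
# XHTML version of its content via the 'Accept' header.
tika_xhtml() {
    curl -s -X POST -H "Accept: application/xhtml+xml" -T "$1" "$(tika_url)"
}

# Usage (requires a running Stanbol instance):
#   tika_xhtml test.pdf > test.xhtml
```

Only the 'Accept' header differs from the plain text request; the helpers merely keep the endpoint in one place so it can be pointed at another host.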