Posted to commits@stanbol.apache.org by rw...@apache.org on 2012/03/05 14:28:39 UTC
svn commit: r1297047 - in
/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines:
list.mdtext tikaengine.mdtext
Author: rwesten
Date: Mon Mar 5 13:28:39 2012
New Revision: 1297047
URL: http://svn.apache.org/viewvc?rev=1297047&view=rev
Log:
Documentation for the TikaEngine
Added:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/tikaengine.mdtext
Modified:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/list.mdtext
Modified: incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/list.mdtext
URL: http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/list.mdtext?rev=1297047&r1=1297046&r2=1297047&view=diff
==============================================================================
--- incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/list.mdtext (original)
+++ incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/list.mdtext Mon Mar 5 13:28:39 2012
@@ -4,54 +4,57 @@ This provides an overview about all [Enh
## Preprocessing
-- __[Language Identification Engine](langidengine.html)__
- - language detection for textual content utilizing [Apache Tika](http://tika.apache.org/)
+* __[Language Identification Engine](langidengine.html)__
+ * language detection for textual content utilizing [Apache Tika](http://tika.apache.org/)
-
-- __[Metaxa Engine](metaxaengine.html)__
- - text extraction from various document formats
- - extraction of metadata from document formats
- -
+* __[Tika Engine](tikaengine.html)__ (based on [Apache Tika](http://tika.apache.org/))
+ * content type detection
+ * text extraction from various document formats
+ * extraction of metadata from document formats
+
+* __[Metaxa Engine](metaxaengine.html)__
+ * text extraction from various document formats
+ * extraction of metadata from document formats
## Natural Language Processing
-- __[Named Entity Extraction Enhancement Engine](namedentityextractionengine.html)__
- - NLP processing using OpenNLP NER
- - detects occurrences of persons, places and organizations only
+* __[Named Entity Extraction Enhancement Engine](namedentityextractionengine.html)__
+ * NLP processing using OpenNLP NER
+ * detects occurrences of persons, places and organizations only
-- __[KeywordLinkingEngine](keywordlinkingengine.html)__
- - NLP processing using OpenNLP
- - supports multiple languages
- - detects occurrences of untyped entities as concepts, takes local taxonomies as linking target
+* __[KeywordLinkingEngine](keywordlinkingengine.html)__
+ * NLP processing using OpenNLP
+ * supports multiple languages
+ * detects occurrences of untyped entities as concepts, takes local taxonomies as linking target
-- _Taxonomy Linking Engine_ (deprecated, see KeywordLinkingEngine)
- - NLP processing using OpenNLP POS
- - detect occurrences of untyped entities as concepts, takes local taxonomies as linking target
+* _Taxonomy Linking Engine_ (deprecated, see KeywordLinkingEngine)
+ * NLP processing using OpenNLP POS
+ * detect occurrences of untyped entities as concepts, takes local taxonomies as linking target
## Linking Suggestions
-- __[Named Entity Tagging Engine](namedentitytaggingengine.html)__
- - suggest links to several Linked Data Sources (e.g. DBpedia)
+* __[Named Entity Tagging Engine](namedentitytaggingengine.html)__
+ * suggest links to several Linked Data Sources (e.g. DBpedia)
-- __[Geonames Enhancement Engine](geonamesengine.html)__
- - suggests links to geonames.org
- - provides hierarchical links for locations
+* __[Geonames Enhancement Engine](geonamesengine.html)__
+ * suggests links to geonames.org
+ * provides hierarchical links for locations
-- __[OpenCalais Enhancement Engine](opencalaisengine.html)__
- - integrates service from Open Calais. (Note: You need to provide a key in order to use this engine)
+* __[OpenCalais Enhancement Engine](opencalaisengine.html)__
+ * integrates service from Open Calais. (Note: You need to provide a key in order to use this engine)
-- __[Zemanta Enhancement Engine](zemantaengine.html)__
- - integrates the Zemanta services. (Note: You need to provide a key in order to use this engine)
+* __[Zemanta Enhancement Engine](zemantaengine.html)__
+ * integrates the Zemanta services. (Note: You need to provide a key in order to use this engine)
## Postprocessing / Other
-- _CachingDereferencerEngine_ (deprecated, see dereferencing support of individual engines as well as [STANBOL-336](https://issues.apache.org/jira/browse/STANBOL-336))
- - retrieves additional content for presenting the enhancement results.
+* _CachingDereferencerEngine_ (deprecated, see dereferencing support of individual engines as well as [STANBOL-336](https://issues.apache.org/jira/browse/STANBOL-336))
+ * retrieves additional content for presenting the enhancement results.
-- __[Refactor Engine](refactorengine.html)__
- - transforms enhancements according to a target ontology, requires KRES launcher.
+* __[Refactor Engine](refactorengine.html)__
+ * transforms enhancements according to a target ontology, requires KRES launcher.
Added: incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/tikaengine.mdtext
URL: http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/tikaengine.mdtext?rev=1297047&view=auto
==============================================================================
--- incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/tikaengine.mdtext (added)
+++ incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/engines/tikaengine.mdtext Mon Mar 5 13:28:39 2012
@@ -0,0 +1,67 @@
+Title: Tika Engine
+
+
+The Tika Engine is an Apache Stanbol Enhancement Engine based on Apache Tika. It has three main functions:
+
+1. To detect the content type of the parsed content. This is only performed if no content type is provided or the content type is set to "application/octet-stream". The detected content type is added to the metadata of the [ContentItem](../contentitem.html).
+2. To extract the plain text (and XHTML) from the parsed content and add it to the [ContentItem](../contentitem.html) as content parts with the type Blob.
+3. To extract metadata from the parsed content and add it to the metadata of the [ContentItem](../contentitem.html).
+
+
+## Supported Media Types
+
+As this engine uses Apache Tika, the supported media types are the same as those listed on the [Tika Homepage](http://tika.apache.org/1.0/formats.html).
+
+## Extracted Metadata
+
+Tika provides metadata as 'key:values' pairs. To use them efficiently within Stanbol, they need to be converted to valid RDF and aligned with existing Ontologies.
+
+The TikaEngine supports alignments to several different Ontologies. Such alignment rules can be activated/deactivated within the configuration of the TikaEngine.
+
+Supported Ontologies:
+
+* [Ontology for Media Resources](http://www.w3.org/TR/mediaont-10/): This is the most complete mapping to a single Ontology. It includes mappings for all Dublin Core metadata, geo locations, some image-specific data, and most of the audio- and video-related metadata.
+
+* [DC terms](http://dublincore.org/documents/dcmi-terms/): Provides good mappings for text documents (HTML, Office, OpenOffice, PDF ...)
+
+* [Nepomuk EXIF ontology](http://www.semanticdesktop.org/ontologies/2007/05/10/nexif/): Interesting for users that want to work with EXIF metadata extracted from images.
+
+* [Nepomuk Message Ontology](http://www.semanticdesktop.org/ontologies/2007/03/22/nmo/): Used for sender and receiver information of mail messages.
+
+* SKOS: Allows mapping of labels and notes to [SKOS](http://www.w3.org/2009/08/skos-reference/skos.html). This is deactivated by default.
+
+* RDFS: Allows mapping of labels and comments to "rdfs:label" and "rdfs:comment".
+
+### ContentType:
+
+The detected content type for the parsed ContentItem is added using the following two properties:
+
+* 'http://purl.org/dc/terms/format': Dublin Core terms 'format'
+* 'http://www.w3.org/ns/ma-ont#hasFormat': Media Resource Ontology 'hasFormat'
+
+Note that these properties will only be present if the related Ontology is activated in the TikaEngine configuration.
+
+
+## Sending Requests directly to the Tika Engine
+
+The Stanbol Enhancer allows sending enhancement requests directly to a specific EnhancementEngine. This feature can be used in combination with the Tika Engine to request
+
+1. the "text/plain" or "application/xhtml+xml" version of parsed content
+2. the extracted metadata as RDF aligned to the activated Ontologies
+
+The first example requests the plain text version of a PDF file with the name "test.pdf". Note that
+
+* the 'Accept' header is set to the content type of the requested content and
+* the 'omitMetadata=true' parameter tells the Enhancer not to return the RDF metadata.
+
+    :::bash
+    curl -v -X POST -H "Accept: text/plain" -T test.pdf \
+        "http://localhost:8080/enhancer/engine/tika?omitMetadata=true"
+
+The second example returns the metadata extracted from the parsed "song.mp3" file as RDF/XML:
+
+    :::bash
+    curl -v -X POST -H "Accept: application/rdf+xml" -T song.mp3 \
+        "http://localhost:8080/enhancer/engine/tika"
+
+