Posted to commits@stanbol.apache.org by bu...@apache.org on 2012/03/05 14:28:45 UTC

svn commit: r807424 - in /websites/staging/stanbol/trunk/content: ./ stanbol/docs/trunk/enhancer/engines/list.html stanbol/docs/trunk/enhancer/engines/tikaengine.html

Author: buildbot
Date: Mon Mar  5 13:28:45 2012
New Revision: 807424

Log:
Staging update by buildbot for stanbol

Added:
    websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/tikaengine.html
Modified:
    websites/staging/stanbol/trunk/content/   (props changed)
    websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/list.html

Propchange: websites/staging/stanbol/trunk/content/
------------------------------------------------------------------------------
    cms:source-revision = 1297047

Modified: websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/list.html
==============================================================================
--- websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/list.html (original)
+++ websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/list.html Mon Mar  5 13:28:45 2012
@@ -46,7 +46,7 @@
 <ul>
 <li><a href="/stanbol/docs/trunk/downloads.html">Overview</a></li>
 </ul>
-<h1 id="the_asf">The ASF</h1>
+<h1 id="the-asf">The ASF</h1>
 <ul>
 <li><a href="http://www.apache.org">Apache Software Foundation</a></li>
 <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
@@ -60,22 +60,33 @@
     <p>This provides an overview about all <a href="index.html">Enhancement Engine</a> implementations managed by the Apache Stanbol community.</p>
 <h2 id="preprocessing">Preprocessing</h2>
 <ul>
-<li><strong><a href="langidengine.html">Language Identification Engine</a></strong><ul>
+<li>
+<p><strong><a href="langidengine.html">Language Identification Engine</a></strong></p>
+<ul>
 <li>language detection for textual content utilizing <a href="http://tika.apache.org/">Apache Tika</a></li>
 </ul>
 </li>
 <li>
+<p><strong><a href="tikaengine.html">Tika Engine</a></strong> (based on <a href="http://tika.apache.org/">Apache Tika</a>)</p>
+<ul>
+<li>content type detection</li>
+<li>text extraction from various document formats</li>
+<li>extraction of metadata from document formats</li>
+</ul>
+</li>
+<li>
 <p><strong><a href="metaxaengine.html">Metaxa Engine</a></strong></p>
 <ul>
 <li>text extraction from various document formats</li>
-<li>extraction of metadata from document formats
--</li>
+<li>extraction of metadata from document formats</li>
 </ul>
 </li>
 </ul>
-<h2 id="natural_language_processing">Natural Language Processing</h2>
+<h2 id="natural-language-processing">Natural Language Processing</h2>
+<ul>
+<li>
+<p><strong><a href="namedentityextractionengine.html">Named Entity Extraction Enhancement Engine</a></strong> </p>
 <ul>
-<li><strong><a href="namedentityextractionengine.html">Named Entity Extraction Enhancement Engine</a></strong> <ul>
 <li>NLP processing using OpenNLP NER</li>
 <li>detects occurrences of persons, places and organizations only</li>
 </ul>
@@ -96,9 +107,11 @@
 </ul>
 </li>
 </ul>
-<h2 id="linking_suggestions">Linking Suggestions</h2>
+<h2 id="linking-suggestions">Linking Suggestions</h2>
+<ul>
+<li>
+<p><strong><a href="namedentitytaggingengine.html">Named Entity Tagging Engine</a></strong></p>
 <ul>
-<li><strong><a href="namedentitytaggingengine.html">Named Entity Tagging Engine</a></strong><ul>
 <li>suggest links to several Linked Data Sources (e.g. DBpedia)</li>
 </ul>
 </li>
@@ -122,15 +135,19 @@
 </ul>
 </li>
 </ul>
-<h2 id="postprocessing__other">Postprocessing / Other</h2>
+<h2 id="postprocessing-other">Postprocessing / Other</h2>
+<ul>
+<li>
+<p><em>CachingDereferencerEngine</em> (deprecated, see dereferencing support of individual engines as well as  <a href="https://issues.apache.org/jira/browse/STANBOL-336">STANBOL-336</a>)</p>
 <ul>
-<li><em>CachingDereferencerEngine</em> (deprecated, see dereferencing support of individual engines as well as  <a href="https://issues.apache.org/jira/browse/STANBOL-336">STANBOL-336</a>)<ul>
 <li>retrieves additional content for presenting the enhancement results.</li>
 </ul>
 </li>
 <li>
-<p><strong><a href="refactorengine.html">Refactor Engine</a></strong>
-        - transforms enhancements according to a target ontology, requires KRES launcher.</p>
+<p><strong><a href="refactorengine.html">Refactor Engine</a></strong></p>
+<ul>
+<li>transforms enhancements according to a target ontology, requires KRES launcher.</li>
+</ul>
 </li>
 </ul>
   </div>

Added: websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/tikaengine.html
==============================================================================
--- websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/tikaengine.html (added)
+++ websites/staging/stanbol/trunk/content/stanbol/docs/trunk/enhancer/engines/tikaengine.html Mon Mar  5 13:28:45 2012
@@ -0,0 +1,133 @@
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
+<html>
+<head>
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one or more
+    contributor license agreements.  See the NOTICE file distributed with
+    this work for additional information regarding copyright ownership.
+    The ASF licenses this file to You under the Apache License, Version 2.0
+    (the "License"); you may not use this file except in compliance with
+    the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+-->
+
+  <link href="/stanbol/css/stanbol.css" rel="stylesheet" type="text/css">
+  <title>Apache Stanbol - Tika Engine</title>
+  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
+  <link rel="icon" type="image/png" href="/stanbol/images/stanbol-logo/stanbol-favicon.png"/>
+</head>
+
+<body>
+  <div id="navigation"> 
+  <a href="/stanbol/index.html"><img alt="Apache Stanbol" width="220" height="101" border="0" src="/stanbol/images/stanbol-logo/stanbol-2010-12-14.png"/></a>
+  <h1 id="stanbol">Stanbol</h1>
+<ul>
+<li><a href="/stanbol/index.html">Home</a></li>
+<li><a href="/stanbol/docs/trunk/tutorial.html">Tutorial</a></li>
+<li><a href="/stanbol/docs/trunk/">Documentation</a></li>
+<li><a href="/stanbol/docs/trunk/building.html">Building</a></li>
+</ul>
+<h1 id="project">Project</h1>
+<ul>
+<li><a href="/stanbol/docs/trunk/mailinglists.html">Mailing Lists</a></li>
+<li><a href="https://issues.apache.org/jira/browse/STANBOL">Issue Tracker</a></li>
+<li><a href="/stanbol/team.html">Project Team</a></li>
+<li><a href="http://www.apache.org/licenses/LICENSE-2.0">License</a></li>
+</ul>
+<h1 id="downloads">Downloads</h1>
+<ul>
+<li><a href="/stanbol/docs/trunk/downloads.html">Overview</a></li>
+</ul>
+<h1 id="the-asf">The ASF</h1>
+<ul>
+<li><a href="http://www.apache.org">Apache Software Foundation</a></li>
+<li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
+<li><a href="http://www.apache.org/foundation/sponsorship.html">Become a Sponsor</a></li>
+<li><a href="http://www.apache.org/security/">Security</a></li>
+</ul>
+  </div>
+  
+  <div id="content">
+    <h1 class="title">Tika Engine</h1>
+    <p>The Tika Engine is an Apache Stanbol Enhancement Engine based on Apache Tika. It has three main functions:</p>
+<ol>
+<li>To detect the content type of parsed content. This is only performed if no content type is provided or the content type is set to "application/octet-stream". The detected content type is added to the metadata of the ContentItem.</li>
+<li>To extract the plain text (and XHTML) from parsed content and add it to the <a href="../contentitem.html">ContentItem</a> as content parts with the type Blob.</li>
+<li>To extract metadata from the parsed content and add it to the metadata of the <a href="../contentitem.html">ContentItem</a>.</li>
+</ol>
+<h2 id="supported-media-types">Supported Media Types</h2>
+<p>As this engine uses Apache Tika, the supported media types are the same as those listed on the <a href="http://tika.apache.org/1.0/formats.html">Tika Homepage</a>.</p>
+<h2 id="extracted-metadata">Extracted Metadata</h2>
+<p>Tika provides metadata as 'key:values' pairs. To use them efficiently within Stanbol, they need to be converted to valid RDF and aligned with existing Ontologies.</p>
+<p>The TikaEngine supports alignments to several different Ontologies. These alignment rules can be activated or deactivated in the configuration of the TikaEngine.</p>
+<p>Supported Ontologies:</p>
+<ul>
+<li>
+<p><a href="http://www.w3.org/TR/mediaont-10/">Ontology for Media Resources</a>: This is the most complete mapping to a single Ontology. It includes mappings for all Dublin Core metadata, geo locations, some image-specific data and most of the audio and video related metadata.</p>
+</li>
+<li>
+<p><a href="http://dublincore.org/documents/dcmi-terms/">DC terms</a>: Provides good mappings for text documents (HTML, Office, OpenOffice, PDF ...)</p>
+</li>
+<li>
+<p><a href="http://www.semanticdesktop.org/ontologies/2007/05/10/nexif/">Nepomuk EXIF ontology</a>: Interesting for users that want to work with EXIF metadata extracted from images.</p>
+</li>
+<li>
+<p><a href="http://www.semanticdesktop.org/ontologies/2007/03/22/nmo/">Nepomuk Message Ontology</a>: Used for sender and receiver information of mail messages.</p>
+</li>
+<li>
+<p>SKOS: Allows mapping of labels and notes to <a href="http://www.w3.org/2009/08/skos-reference/skos.html">SKOS</a>. This is deactivated by default.</p>
+</li>
+<li>
+<p>RDFS: Allows mapping of labels and comments to "rdfs:label" and "rdfs:comment".</p>
+</li>
+</ul>
+<h3 id="contenttype">ContentType:</h3>
+<p>The detected content type of the parsed ContentItem is added using the following two properties:</p>
+<ul>
+<li>'http://purl.org/dc/terms/format': Dublin Core terms 'format'</li>
+<li>'http://www.w3.org/ns/ma-ont#hasFormat': Media Resource Ontology 'hasFormat'</li>
+</ul>
+<p>Note that each of these properties will only be present if the related Ontology is activated in the TikaEngine configuration.</p>
+<h2 id="sending-requests-directly-to-the-tika-engine">Sending Requests directly to the Tika Engine</h2>
+<p>The Stanbol Enhancer allows enhancement requests to be sent directly to a specific EnhancementEngine. This feature can be used in combination with the Tika Engine to request</p>
+<ol>
+<li>the "text/plain" or "application/xhtml+xml" version of parsed content</li>
+<li>the extracted metadata as RDF aligned to the activated Ontologies</li>
+</ol>
+<p>The first example requests the plain text version of a PDF file with the name "test.pdf". Note that</p>
+<ul>
+<li>the 'Accept' header is set to the content type of the requested content, and</li>
+<li>
+<p>the 'omitMetadata=true' query parameter tells the Enhancer not to return the RDF metadata.</p>
+</li>
+</ul>
+<div class="codehilite"><pre>curl -v -X POST -H <span class="s2">&quot;Accept: text/plain&quot;</span> -T test.pdf <span class="se">\</span>
+    <span class="s2">&quot;http://localhost:8080/enhancer/engine/tika?omitMetadata=true&quot;</span>
+</pre></div>
+<p>The second example returns the metadata extracted from the parsed "song.mp3" file:</p>
+<div class="codehilite"><pre>curl -v -X POST -H <span class="s2">&quot;Accept: application/rdf+xml&quot;</span> -T song.mp3 <span class="se">\</span>
+    <span class="s2">&quot;http://localhost:8080/enhancer/engine/tika&quot;</span>
+</pre></div>
+  </div>
+  
+  <div id="footer">
+    <div class="copyright">
+      <p>
+        Copyright &copy; 2010 The Apache Software Foundation, Licensed under 
+        the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>.
+        <br />
+        Apache, Stanbol and the Apache feather and Stanbol logos are trademarks of The Apache Software Foundation.
+      </p>
+    </div>
+  </div>
+  
+</body>
+</html>
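
For readers who prefer a script over curl, the two requests documented in
tikaengine.html above could be built roughly as follows. This is only a
sketch under assumptions: the endpoint URL and the omitMetadata parameter
are copied from the curl examples, while the helper name
build_tika_request and the choice to send "application/octet-stream" as
the Content-Type (one of the two cases in which the page says the engine
detects the content type itself) are illustrative and not taken from the
Stanbol sources.

```python
import urllib.parse
import urllib.request

# Assumed endpoint, copied from the curl examples in the page above.
TIKA_ENGINE_URL = "http://localhost:8080/enhancer/engine/tika"


def build_tika_request(content, accept, omit_metadata=False,
                       base_url=TIKA_ENGINE_URL):
    """Build (but do not send) a POST request for the Tika Engine.

    content       -- raw bytes of the document to analyse
    accept        -- value of the Accept header, e.g. "text/plain"
    omit_metadata -- if True, append "omitMetadata=true" to the URL
    """
    url = base_url
    if omit_metadata:
        url += "?" + urllib.parse.urlencode({"omitMetadata": "true"})
    headers = {
        "Accept": accept,
        # Assumption: declare the body as octet-stream so that the engine
        # performs content type detection, as described in the page.
        "Content-Type": "application/octet-stream",
    }
    return urllib.request.Request(url, data=content, headers=headers,
                                  method="POST")
```

Passing the returned object to urllib.request.urlopen would send the
request. That step is left out here because it needs a running Stanbol
instance.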