Posted to commits@stanbol.apache.org by bu...@apache.org on 2012/07/16 15:02:48 UTC
svn commit: r825985 [4/12] - in /websites/staging/stanbol/trunk/content: ./ stanbol/docs/trunk/ stanbol/docs/trunk/cmsadapter/ stanbol/docs/trunk/components/ stanbol/docs/trunk/components/cmsadapter/ stanbol/docs/trunk/components/contenthub/ stanbol/do...

Added: websites/staging/stanbol/trunk/content/stanbol/docs/trunk/components/enhancer/engines/keywordlinkingengine.html
==============================================================================
--- websites/staging/stanbol/trunk/content/stanbol/docs/trunk/components/enhancer/engines/keywordlinkingengine.html (added)
+++ websites/staging/stanbol/trunk/content/stanbol/docs/trunk/components/enhancer/engines/keywordlinkingengine.html Mon Jul 16 13:02:45 2012
@@ -0,0 +1,265 @@
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
+<html>
+<head>
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one or more
+    contributor license agreements.  See the NOTICE file distributed with
+    this work for additional information regarding copyright ownership.
+    The ASF licenses this file to You under the Apache License, Version 2.0
+    (the "License"); you may not use this file except in compliance with
+    the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+-->
+
+  <link href="/stanbol/css/stanbol.css" rel="stylesheet" type="text/css">
+  <title>Apache Stanbol - The Keyword Linking Engine: custom vocabularies and multiple languages</title>
+  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
+  <link rel="icon" type="image/png" href="/stanbol/images/stanbol-logo/stanbol-favicon.png"/>
+  <script type="text/javascript">
+    // Google Analytics Tracking Code
+    var _gaq = _gaq || [];
+    _gaq.push(['_setAccount', 'UA-32086816-1']);
+    _gaq.push(['_trackPageview']);
+
+    (function() {
+      var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
+      ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
+      var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
+    })();
+  </script>  
+</head>
+
+<body>
+  <div id="logo"> <!-- do not scroll the logo -->
+  <a href="/stanbol/index.html"><img alt="Apache Stanbol" width="220" height="101" border="0" src="/stanbol/images/stanbol-logo/stanbol-2010-12-14.png"/></a></div>
+  <div id="navigation"> <!-- but auto scroll the menue -->
+      <h1 id="stanbol">Stanbol</h1>
+<ul>
+<li><a href="/stanbol/index.html">Home</a></li>
+<li><a href="/stanbol/docs/trunk/tutorial.html">Getting Started</a></li>
+<li><a href="/stanbol/docs/trunk/">Documentation</a><ul>
+<li><a href="/stanbol/docs/trunk/scenarios.html">Usage Scenarios</a></li>
+<li><a href="/stanbol/docs/trunk/components.html">Components</a></li>
+</ul>
+</li>
+<li><a href="/stanbol/development/">Development</a></li>
+</ul>
+<h1 id="project">Project</h1>
+<ul>
+<li><a href="/stanbol/docs/trunk/mailinglists.html">Mailing Lists</a></li>
+<li><a href="https://issues.apache.org/jira/browse/STANBOL">Issue Tracker</a></li>
+<li><a href="/stanbol/team.html">Project Team</a></li>
+<li><a href="http://www.apache.org/licenses/LICENSE-2.0">License</a></li>
+<li><a href="/stanbol/privacy-policy.html">Privacy Policy</a></li>
+</ul>
+<h1 id="downloads">Downloads</h1>
+<ul>
+<li><a href="/stanbol/downloads/">Overview</a><ul>
+<li><a href="/stanbol/downloads/releases.html">Releases</a></li>
+<li><a href="/stanbol/downloads/launchers.html">Launchers</a></li>
+</ul>
+</li>
+</ul>
+<h1 id="archive">Archive</h1>
+<ul>
+<li><a href="/stanbol/docs/0.9.0-incubating/">0.9.0-incubating</a></li>
+</ul>
+<h1 id="the-asf">The ASF</h1>
+<ul>
+<li><a href="http://www.apache.org">Apache Software Foundation</a></li>
+<li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
+<li><a href="http://www.apache.org/foundation/sponsorship.html">Become a Sponsor</a></li>
+<li><a href="http://www.apache.org/security/">Security</a></li>
+</ul>
+  </div>
+  <div id="content">
+    <div class="breadcrump" style="font-size: 80%;">
+      <a href="/">Home</a>&nbsp;&raquo&nbsp;<a href="/stanbol/">Stanbol</a>&nbsp;&raquo&nbsp;<a href="/stanbol/docs/">Docs</a>&nbsp;&raquo&nbsp;<a href="/stanbol/docs/trunk/">Trunk</a>&nbsp;&raquo&nbsp;<a href="/stanbol/docs/trunk/components/">Components</a>&nbsp;&raquo&nbsp;<a href="/stanbol/docs/trunk/components/enhancer/">Enhancer</a>&nbsp;&raquo&nbsp;<a href="/stanbol/docs/trunk/components/enhancer/engines/">Engines</a>
+    </div>
+    <h1 class="title">The Keyword Linking Engine: custom vocabularies and multiple languages</h1>
+    <p>The KeywordLinkingEngine extracts occurrences of Entities that are part of a Controlled Vocabulary from content passed to the Stanbol Enhancer. To do this, words appearing in the text are compared with the labels of entities. The Stanbol Entityhub is used to look up Entities based on their labels.</p>
+<p>This documentation first describes the configuration options of this engine; that section is mainly intended for users. The remainder of the document is more technical and is intended for developers who want to extend the engine or understand its internals.</p>
+<h2 id="configuration">Configuration</h2>
+<p>The KeywordLinkingEngine provides many configuration options. This section describes the different options based on the configuration dialog as shown by the Apache Felix Webconsole.</p>
+<p><img alt="KeywordLinkingEngine configuration" src="keywordlinkingengineconfig.png" title="The configuration dialog as shown by the Apache Felix web console" /></p>
+<p>The example in the screenshot shows a configuration that is used to extract Drugs based on various IDs (e.g. the ATC code and the InChI key) that are all stored as values of the skos:notation property. This example is used to emphasize newer features like case sensitive mapping, the keyword tokenizer and customized type mappings. Similar configurations would also be needed to extract product IDs, ISBN numbers or, more generally, concepts of a thesaurus based on their notation.</p>
+<h3 id="configuration-parameter">Configuration Parameter</h3>
+<ul>
+<li><strong>Name</strong> <em>(stanbol.enhancer.engine.name)</em>: The name of the Enhancement Engine. This name is used to refer to an <a href="index.html">EnhancementEngine</a> in <a href="enhancementchain.html">EnhancementChain</a>s.</li>
+<li><strong>Referenced Site</strong> <em>(org.apache.stanbol.enhancer.engines.keywordextraction.referencedSiteId)</em>: The name of the ReferencedSite of the Stanbol Entityhub that holds the controlled vocabulary to be used for extracting Entities. "entityhub" or "local" can be used to extract Entities managed directly by the Entityhub.</li>
+<li><strong>Label Field</strong> <em>(org.apache.stanbol.enhancer.engines.keywordextraction.nameField)</em>: The name of the property used to look up Entities. Only a single field is supported for performance reasons. Users who want to use values of several fields should collect those values into a single field through a corresponding configuration in the mappings.txt used during indexing. This <a href="../../customvocabulary.html">usage scenario</a> provides more information on this.</li>
+<li><strong>Case Sensitivity</strong> <em>(org.apache.stanbol.enhancer.engines.keywordextraction.caseSensitive)</em>: This option activates or deactivates case sensitive matching. It is important to understand that even with case sensitivity activated, an Entity with a label such as "Anaconda" will be suggested for the mention of "anaconda" in the text. The main difference is the confidence value of such a suggestion, because with case sensitivity activated the starting letters "A" and "a" are NOT considered to be matching. See the technical part of this document for details about the matching process. Case sensitivity is deactivated by default. Activating it is recommended if the controlled vocabulary contains abbreviations similar to commonly used words, e.g. CAN for Canada.</li>
+<li><strong>Type Field</strong> <em>(org.apache.stanbol.enhancer.engines.keywordextraction.typeField)</em>: Values of this field are used as values of the "fise:entity-types" property of created "<a href="../enhancementstructure.html#fiseentityannotation">fise:EntityAnnotation</a>"s. The default is "rdf:type".</li>
+<li><strong>Redirect Field</strong> <em>(org.apache.stanbol.enhancer.engines.keywordextraction.redirectField)</em> and <strong>Redirect Mode</strong> <em>(org.apache.stanbol.enhancer.engines.keywordextraction.redirectMode)</em>: Redirects tell the KeywordLinkingEngine to follow a specific property in the knowledge base for matched entities. This feature allows, for example, following redirects from "USA" to "United States" as defined in Wikipedia. See "Processing of Entity Suggestions" for details. Possible values for the Redirect Mode are "IGNORE" - deactivates this feature; "ADD_VALUES" - uses the label and type information of redirected entities, but keeps the URI of the extracted entity; "FOLLOW" - follows the redirect.</li>
+<li><strong>Min Token Length</strong> <em>(org.apache.stanbol.enhancer.engines.keywordextraction.minSearchTokenLength)</em>: While the KeywordLinkingEngine preferably uses POS (part-of-speech) taggers to determine whether a word should be matched with the controlled vocabulary, the minimum token length provides a fallback if (a) no POS tagger is available for the language of the parsed text or (b) the confidence of the POS tagger is lower than the threshold.</li>
+<li><strong>Minimum Token Match Factor</strong> <em>(org.apache.stanbol.enhancer.engines.keywordextraction.minTokenMatchFactor)</em>: When a Token of the text is compared with a Token of an Entity Label, the similarity of the two is expressed in the range [0..1]. The minimum token match factor specifies the minimum similarity two Tokens need so that they are considered to match; lower similarity scores are not considered a match. This parameter is important because it allows, for example, inflected forms of words to match. However, it may also result in false positives for similar words. Users should note that the similarity score is also used for calculating the confidence, so similarity scores &lt; 1 but higher than the configured minimum token match factor will reduce the confidence of suggested Entities.</li>
+<li><strong>Keyword Tokenizer</strong> <em>(org.apache.stanbol.enhancer.engines.keywordextraction.keywordTokenizer)</em>: This option enables a special Tokenizer for matching keywords and alphanumeric IDs. Typical language specific Tokenizers tend to split such IDs into several tokens and therefore might prevent a correct match. This Tokenizer should only be activated if the KeywordLinkingEngine is configured to match against IDs such as ISBN numbers or product IDs. It should not be used to match against natural language labels.</li>
+<li><strong>Suggestions</strong> <em>(org.apache.stanbol.enhancer.engines.keywordextraction.maxSuggestions)</em>: The maximum number of suggested Entities.</li>
+<li><strong>Languages</strong> <em>(org.apache.stanbol.enhancer.engines.keywordextraction.processedLanguages)</em> and <strong>Default Matching Language</strong> <em>(org.apache.stanbol.enhancer.engines.keywordextraction.defaultMatchingLanguage)</em>: The first option specifies the languages that should be processed by this engine. This is useful, for example, if the controlled vocabulary only contains labels for a specific language but does not formally specify this information (by setting the "xml:lang" property for labels). The default matching language can be used to work around the exact opposite case. As an example, in DBpedia labels get the language of the dataset they are extracted from (e.g. all data extracted from en.wikipedia.org will get "xml:lang=en"). The default matching language tells the KeywordLinkingEngine to use labels of that language for matching regardless of the language of the parsed content. In the case of DBpedia this allows, for example, matching persons mentioned in an Italian text with the English labels extracted from en.wikipedia.org. Details about natural language processing features used by this engine are provided in the section "Multiple Language Support".</li>
+<li><strong>Type Mappings</strong> <em>(org.apache.stanbol.enhancer.engines.keywordextraction.typeMappings)</em>: The FISE enhancement structure (as used by the Stanbol Enhancer) distinguishes <a href="../enhancementstructure.html#fisetextannotation">TextAnnotation</a> and <a href="../enhancementstructure.html#fiseentityannotation">EntityAnnotation</a>s. The KeywordLinkingEngine needs to create both types of Annotations: TextAnnotations selecting the words that match some Entities in the Controlled Vocabulary and EntityAnnotations that represent an Entity suggested for a TextAnnotation. The Type Mappings are used to determine the "dc:type" of the TextAnnotation based on the types of the suggested Entity. The default configuration comes with mappings for Persons, Organizations, Places and Concepts, but this field allows additional mappings to be defined. For details see the sections "Type Mappings Syntax" and "Processing of Entity Suggestions".</li>
+<li><strong>Dereference Entities</strong> <em>(org.apache.stanbol.enhancer.engines.keywordextraction.dereference)</em>: If enabled this engine adds additional information about the suggested Entities to the Metadata of the enhanced content item.</li>
+<li><strong>Ranking</strong> <em>(service.ranking)</em>: This property is used if two engines use the same <strong>Name</strong>. In such cases the one with the higher ranking will be used to enhance content items. Typically users will not need to change this.</li>
+</ul>
+<p>Additionally the following properties can be configured via a configuration file:</p>
+<ul>
+<li><strong>Minimum Found Tokens</strong> <em>(org.apache.stanbol.enhancer.engines.keywordextraction.minFoundTokens)</em>: This tells the KeywordLinkingEngine how to deal with Entities that do not exactly match words in the text. A typical example is "George W. Bush" -&gt; "George Walker Bush". This parameter sets the minimum number of tokens that need to match. The default value is '2'. Note that this does not apply to exact matches. Setting this to a high value can be used to force a mode that only considers entities where all tokens of the label match the mention in the text.</li>
+<li><strong>Minimum Pos Tag Probability</strong> <em>(org.apache.stanbol.enhancer.engines.keywordextraction.minPosTagProbability)</em>: The minimum probability of a POS (part-of-speech) tag. Tags with a lower probability will be ignored. In such cases the configured value for the <strong>Min Token Length</strong> will apply. The value MUST BE in the range [0..1]</li>
+</ul>
+<h3 id="type-mappings-syntax">Type Mappings Syntax</h3>
+<p>The Type Mappings are used to determine the "dc:type" of the <a href="../enhancementstructure.html#fisetextannotation">TextAnnotation</a> based on the types of the suggested Entity. The field "Type Mappings" (property: <em>org.apache.stanbol.enhancer.engines.keywordextraction.typeMappings</em>) can be used to customize such mappings.</p>
+<p>This field uses the following syntax</p>
+<div class="codehilite"><pre><span class="p">{</span><span class="n">uri</span><span class="p">}</span>
+<span class="p">{</span><span class="n">source</span><span class="p">}</span> <span class="o">&gt;</span> <span class="p">{</span><span class="n">target</span><span class="p">}</span>
+<span class="p">{</span><span class="n">source1</span><span class="p">};</span> <span class="p">{</span><span class="n">source2</span><span class="p">};</span> <span class="o">...</span> <span class="p">{</span><span class="n">sourceN</span><span class="p">}</span> <span class="o">&gt;</span> <span class="p">{</span><span class="n">target</span><span class="p">}</span>
+</pre></div>
+
+
+<p>The first variant is a shorthand for {uri} &gt; {uri} and therefore specifies that the {uri} should be used as 'dc:type' for <a href="../enhancementstructure.html#fisetextannotation">TextAnnotation</a>s if the matched entity is of type {uri}. The second variant matches a {source} URI to a {target}. Variant three shows the possibility to match multiple URIs to the same target in a single configuration line.</p>
+<p>Both 'ns:localName' and fully qualified URIs are supported. For supported namespaces see the <a href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/entityhub/generic/servicesapi/src/main/java/org/apache/stanbol/entityhub/servicesapi/defaults/NamespaceEnum.java">NamespaceEnum</a>. Information about accepted (INFO) and ignored (WARN) type mappings is available in the logs.</p>
+<p>Some Examples of additional Mappings for the e-health domain:</p>
+<div class="codehilite"><pre><span class="err">drugbank:drugs;</span> <span class="err">dbp-ont:Drug;</span> <span class="err">dailymed:drugs;</span> <span class="err">sider:drugs;</span> <span class="err">tcm:Medicine</span> <span class="err">&gt;</span> <span class="err">drugbank:drugs</span>
+<span class="err">diseasome:diseases;</span> <span class="err">linkedct:condition;</span> <span class="err">tcm:Disease</span> <span class="err">&gt;</span> <span class="err">diseasome:diseases</span> 
+<span class="err">sider:side_effects</span>
+<span class="err">dailymed:ingredients</span>
+<span class="err">dailymed:organization</span> <span class="err">&gt;</span> <span class="err">dbp-ont:Organisation</span>
+</pre></div>
+
+
+<p>The first two lines map some well-known Classes that represent drugs and diseases to 'drugbank:drugs' and 'diseasome:diseases'. The third and fourth lines define 1:1 mappings for side effects and ingredients, and the last line adds 'dailymed:organization' as an additional mapping to the DBpedia Ontology Organisation.</p>
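+<p>To make the three syntax variants concrete, the following is a minimal Python sketch of how such lines could be read into a source-to-target table. It is not the engine's own parser: the function name <code>parse_type_mappings</code> is hypothetical, namespace prefixes are kept as plain strings rather than being expanded, and it assumes that neither ';' nor '&gt;' occurs inside a URI.</p>

```python
def parse_type_mappings(lines):
    """Parse type-mapping lines into a {source: target} dictionary.

    Hypothetical helper for illustration only; prefixes such as
    'dbp-ont:' are kept as-is and not expanded to full URIs.
    """
    mappings = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if ">" in line:
            # "{source1}; {source2}; ... > {target}"
            sources_part, target = line.split(">", 1)
            target = target.strip()
            sources = [s.strip() for s in sources_part.split(";") if s.strip()]
        else:
            # "{uri}" is shorthand for "{uri} > {uri}"
            target = line
            sources = [line]
        for source in sources:
            mappings[source] = target
    return mappings


example = [
    "drugbank:drugs; dbp-ont:Drug; dailymed:drugs; sider:drugs; tcm:Medicine > drugbank:drugs",
    "sider:side_effects",
    "dailymed:organization > dbp-ont:Organisation",
]
mappings = parse_type_mappings(example)
```

+<p>With this input, every drug class is mapped to 'drugbank:drugs', 'sider:side_effects' is mapped to itself and 'dailymed:organization' is mapped to 'dbp-ont:Organisation'.</p>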
+<p>The following mappings are predefined by the KeywordLinkingEngine.</p>
+<div class="codehilite"><pre><span class="n">dbp</span><span class="o">-</span><span class="n">ont:Person</span><span class="p">;</span> <span class="n">foaf:Person</span><span class="p">;</span> <span class="n">schema:Person</span> <span class="o">&gt;</span> <span class="n">dbp</span><span class="o">-</span><span class="n">ont:Person</span>
+<span class="n">dbp</span><span class="o">-</span><span class="n">ont:Organisation</span><span class="p">;</span> <span class="n">dbp</span><span class="o">-</span><span class="n">ont:Newspaper</span><span class="p">;</span> <span class="n">schema:Organization</span> <span class="o">&gt;</span> <span class="n">dbp</span><span class="o">-</span><span class="n">ont:Organisation</span>
+<span class="n">dbp</span><span class="o">-</span><span class="n">ont:Place</span><span class="p">;</span> <span class="n">schema:Place</span><span class="p">;</span> <span class="n">gml:_Feature</span> <span class="o">&gt;</span> <span class="n">dbp</span><span class="o">-</span><span class="n">ont:Place</span>
+<span class="n">skos:Concept</span>
+</pre></div>
+
+
+<h2 id="multiple-language-support">Multiple Language Support</h2>
+<p>The KeywordLinkingEngine supports the extraction of keywords in multiple languages. However, the performance and to some extent also the quality of the enhancements depend on how well a language is supported by the NLP framework in use (currently OpenNLP).
+The following list provides a short overview of the different language specific components and configurations:</p>
+<ul>
+<li><strong>Language detection:</strong> The KeywordLinkingEngine depends on the correct detection of the language by the LanguageIdentificationEngine. If no language is detected or this information is missing then "English" is assumed as default.</li>
+<li><strong>Multi-lingual labels of the controlled vocabulary:</strong> Entities are matched based on labels of the current language and labels without any defined language. e.g. English labels will not be matched against German language texts. Therefore it is important to have a controlled vocabulary that includes labels in the language of the texts you want to enhance.</li>
+<li><strong>Natural Language Processing support:</strong> The KeywordLinkingEngine is able to use <a href="http://opennlp.sourceforge.net/api/opennlp/tools/sentdetect/SentenceDetector.html">Sentence Detectors</a>, <a href="http://opennlp.sourceforge.net/api/opennlp/tools/postag/POSTagger.html">POS (Part of Speech) taggers</a> and <a href="http://opennlp.sourceforge.net/api/opennlp/tools/chunker/Chunker.html">Chunkers</a>. If such components are available for a language then they are used to optimize the enhancement process.</li>
+<li><strong>Sentence detector:</strong> If a sentence detector is present, the memory footprint of the engine is reduced, because Tokens, POS tags and Chunks are only kept for the currently active sentence. If no sentence detector is available, the entire content is treated as a single sentence.</li>
+<li><strong>Tokenizer:</strong> A (word) <a href="http://opennlp.sourceforge.net/api/opennlp/tools/tokenize/Tokenizer.html">tokenizer</a> is required for the enhancement process. If no specific tokenizer is available for a given language, then the <a href="http://opennlp.sourceforge.net/api/opennlp/tools/tokenize/SimpleTokenizer.html">OpenNLP SimpleTokenizer</a> is used as default. The parameter <strong>Keyword Tokenizer</strong> can be used to force the usage of a special Tokenizer that is optimized for matching keywords. This Tokenizer does not split alphanumeric IDs, so that such tokens are matched correctly. If this option is enabled, then any language specific Tokenizer will be ignored in favor of the KeywordTokenizer.</li>
+<li><strong>POS tagger:</strong> POS (Part-of-Speech) taggers annotate tokens with their type. Because the KeywordLinkingEngine is only interested in Nouns, Foreign Words and Numbers, the presence of such a tagger allows many tokens to be skipped, which improves performance. However, POS taggers use different sets of tags for different languages. Because of that, it is not enough that a POS tagger is available for a language; there MUST also be a configuration of the POS tags representing Nouns.</li>
+<li><strong>Chunker:</strong> There are two types of Chunkers: first, the <a href="http://opennlp.sourceforge.net/api/opennlp/tools/chunker/Chunker.html">Chunkers</a> as provided by OpenNLP (based on statistical models) and second, a <a href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/opennlp/src/main/java/org/apache/stanbol/commons/opennlp/PosTypeChunker.java">POS tag based Chunker</a> provided by the openNLP bundle of Stanbol. Currently the availability of a Chunker does not have a big influence on either the performance or the quality of the Enhancements.</li>
+</ul>
+<h2 id="keyword-extraction-and-linking-workflow">Keyword extraction and linking workflow</h2>
+<p>Basically the text is parsed from the beginning to the end and words are looked up in the configured controlled vocabulary.</p>
+<h3 id="text-processing">Text Processing</h3>
+<p>The <a href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/AnalysedContent.java">AnalysedContent</a> Interface is used to access natural language text that was already processed by a NLP framework. Currently there is only a single implementation based on the commons.opennlp <a href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/opennlp/src/main/java/org/apache/stanbol/commons/opennlp/TextAnalyzer.java">TextAnalyzer</a> utility. In general this part is still very focused on OpenNLP. Making it also usable together with other NLP frameworks would probably need some re-factoring.</p>
+<p>The current state of the processing is represented by the <a href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/impl/ProcessingState.java">ProcessingState</a>. Based on the capabilities of the NLP framework for the current language, it provides the following set of information:</p>
+<ul>
+<li><strong>AnalysedSentence:</strong> If a sentence detector is present, then this represents the current sentence of the text. If not, then the whole text is represented as a single sentence. The AnalysedSentence also provides access to POS tags and Chunks (if available).</li>
+<li><strong>Chunk:</strong> If a chunker is present, then this represents the current chunk. Otherwise this will be null. </li>
+<li><strong>Token:</strong> The currently processed word, which is part of the current chunk and sentence.</li>
+<li><strong>TokenIndex:</strong> The index of the currently active token relative to the AnalysedSentence.</li>
+</ul>
+<p>Processing is done based on Tokens (words). The ProcessingState provides means to navigate to the next token. If Chunks are present, tokens outside of chunks are ignored. Only 'processable' tokens are used to look up entities (see the next section for details). Whether a Token is processable is determined as follows:</p>
+<ul>
+<li>Only Tokens within a Chunk are considered. If no Chunks are available, all Tokens are considered.</li>
+<li>If POS tags are available AND POS tags considered as NOUNS are configured (see <a href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/opennlp/src/main/java/org/apache/stanbol/commons/opennlp/PosTagsCollectionEnum.java">PosTagsCollectionEnum</a>), then POS tags are considered for deciding whether a Token is processable<ul>
+<li>The minimum POS tag probability is <code>0.667</code></li>
+<li>Tokens with a POS tag representing a NOUN and a probability &gt;= minPosTagProb are marked as processable</li>
+<li>Tokens with a POS tag NOT representing a NOUN and a probability &gt;= minPosTagProb/2 are marked as NOT processable</li>
+</ul>
+</li>
+<li>If POS tags are NOT available or the NOUN POS tags configuration is missing, the minimum token length <em>(org.apache.stanbol.enhancer.engines.keywordextraction.minSearchTokenLength)</em> is used as fallback. This means that all Tokens with a length equal to or greater than this value are marked as processable.</li>
+</ul>
+<p>This algorithm was introduced by <a href="https://issues.apache.org/jira/browse/STANBOL-685">STANBOL-658</a></p>
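+<p>The decision rules listed above can be summarized in a short Python sketch. It is an illustration under stated assumptions, not the engine's code: the function name and parameters are hypothetical, the chunk rule is left out, and a token whose tag probability falls below the relevant threshold is assumed to fall back to the length check, as described for the <strong>Minimum Pos Tag Probability</strong> parameter. The default length of 3 is the documented default of getMinSearchTokenLength().</p>

```python
def is_processable(token, pos_tag=None, pos_prob=0.0, noun_tags=None,
                   min_pos_tag_prob=0.667, min_search_token_length=3):
    """Decide whether a token should be used for entity lookups.

    Hypothetical helper: 'noun_tags' stands for the configured set of
    POS tags that represent nouns for the current language.
    """
    if pos_tag is not None and noun_tags:
        if pos_tag in noun_tags:
            # confident noun: processable
            if pos_prob >= min_pos_tag_prob:
                return True
        elif pos_prob >= min_pos_tag_prob / 2:
            # reasonably confident non-noun: not processable
            return False
    # no usable POS information (or probability too low):
    # fall back to the minimum token length
    return len(token) >= min_search_token_length


NOUNS = {"NN", "NNP"}  # assumed example tag set, not taken from Stanbol
```

+<p>For example, <code>is_processable("Paris", "NNP", 0.9, NOUNS)</code> accepts the token because of its POS tag, while <code>is_processable("it")</code> rejects it because no POS tag is given and the token is shorter than three characters.</p>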
+<h3 id="entity-lookup">Entity Lookup</h3>
+<p>An "OR" query with [1..MAX_SEARCH_TOKENS] processable tokens is used to look up entities via the <a href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntitySearcher.java">EntitySearcher</a> interface. If the actual implementation cuts off results, then it must be ensured that Entities that match both tokens are ranked first.
+Currently there are two implementations of this interface: (1) for the Entityhub (<a href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/EntityhubSearcher.java">EntityhubSearcher</a>) and (2) for ReferencedSites (<a href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/ReferencedSiteSearcher.java">ReferencedSiteSearcher</a>). There is also an <a href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/test/java/org/apache/stanbol/enhancer/engines/keywordextraction/impl/TestSearcherImpl.java">Implementation</a> that holds entities in-memory, however currently this is only used for unit tests.</p>
+<p>Queries use the configured <a href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java">EntityLinkerConfig</a>.getNameField(), and the language of labels is restricted to the current language or labels that do not define any language.</p>
+<p>Only "processable" tokens are used to look up entities. Whether a token is processable is determined as follows:</p>
+<ul>
+<li>If POS tags are available the "Boolean processPOS(String posTag)" method of the <a href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/AnalysedContent.java">AnalysedContent</a> is used to check if a Token needs to be processed.</li>
+<li>If this method returns NULL or no POS tags are available, then all Tokens longer than <a href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java">EntityLinkerConfig</a>.getMinSearchTokenLength() (default=3) are considered as processable.</li>
+</ul>
+<p>Typically the next MAX_SEARCH_TOKENS processable tokens are used for a lookup. However the current Chunk/Sentence is never left in the search for processable tokens.</p>
+<h3 id="matching-of-found-entities">Matching of found Entities</h3>
+<p>All labels (values of the <a href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java">EntityLinkerConfig</a>.getNameField() field) in the language of the content or without any defined language are candidates for matches.</p>
+<p>For each label that fulfills the above criteria the following steps are processed. The best result is used as the result of the whole matching process:</p>
+<ul>
+<li>Tokens (of the text) following the current position are searched within the label. This also includes non-processable Tokens. <ul>
+<li>Processable Tokens MUST match with Tokens in the Label. At most <a href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java">EntityLinkerConfig</a>.getMaxNotFound() non-processable Tokens may fail to match.</li>
+<li>Token order is important. Tokens in the Entity Label are allowed to be skipped (e.g. the text 'Barack Obama' will match the label 'Barack Hussein Obama' because Hussein is allowed to be skipped. The other way around there would be no match, because processable Tokens in the Text are not allowed to be skipped).</li>
+</ul>
+</li>
+<li>If the first Token of the Label is not matched, preceding Tokens of the Text are matched against the Label. This is done to ensure that Entities that use adjectives in their labels (e.g. "great improvement", "Gute Deutschkenntnisse") are matched. In addition, this also helps to match named entities (e.g. person names), as the first token of such mentions is sometimes erroneously classified as an adjective by POS taggers.</li>
+<li>Tokens that appear in the wrong order (e.g. the text 'Obama, Barack' with the label 'Barack Obama') are matched with a factor of <code>0.7</code>. Currently only exact matches are considered.</li>
+</ul>
+<p>Whether two tokens match is calculated by dividing the length of the longest matching part, counted from the beginning of the Tokens, by the maximum length of the two tokens. For example, 'German' would match 'Germany' with <code>6/7=0.86</code>. The result of this comparison is the token similarity. If this similarity is greater than or equal to the configured minimum token match factor <em>(org.apache.stanbol.enhancer.engines.keywordextraction.minTokenMatchFactor)</em>, then those tokens are considered to match. The token similarity is also used for calculating the confidence.<br />
+</p>
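+<p>The token similarity described above can be sketched as follows. This is a minimal illustration and not the EntityLinker code: the function names, the case-sensitive comparison and the threshold value used in the example are assumptions.</p>

```python
def token_similarity(text_token: str, label_token: str) -> float:
    """Length of the common prefix divided by the length of the longer token."""
    longest = max(len(text_token), len(label_token))
    if longest == 0:
        return 0.0
    common = 0
    for a, b in zip(text_token, label_token):
        if a != b:
            break
        common += 1
    return common / longest


def tokens_match(text_token: str, label_token: str, min_token_match_factor: float) -> bool:
    """Tokens match if their similarity reaches the configured minimum factor."""
    return token_similarity(text_token, label_token) >= min_token_match_factor


# 'Berlin' and 'Berliner' share a prefix of 6 characters; the longer token has 8.
similarity = token_similarity("Berlin", "Berliner")
# 0.7 is a hypothetical threshold chosen only for this example.
print(similarity, tokens_match("Berlin", "Berliner", 0.7))
```

+<p>Because only the common prefix is counted, tokens that differ in their first character get a similarity of 0, however similar the rest of the tokens may be.</p>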
+<p>Entities are <a href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/Suggestion.java">Suggested</a> if:</p>
+<ul>
+<li>a label matches exactly at the current position in the text, that is, all tokens of the Label match Tokens of the text. Note that tokens are considered to match if their similarity is greater than or equal to the minimum token match factor.</li>
+<li>a label matches partially and at least <a href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java">EntityLinkerConfig</a>.getMinFoundTokens() (default=2) processable tokens match. Non-processable tokens are not counted. This ensures that "<a href="http://en.wikipedia.org/wiki/Rupert_Murdoch">Rupert Murdoch</a>" is not suggested for "<a href="http://en.wikipedia.org/wiki/Rupert">Rupert</a>", while "Barack Hussein Obama" is still suggested for "Barack Obama".</li>
+</ul>
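+<p>The two suggestion rules can be sketched as a single check. This is an illustration only: the function and parameter names are hypothetical, and the threshold is treated as inclusive because the "Barack Obama" example, with two matching tokens and a default of 2, requires that reading.</p>

```python
def is_suggested(label_token_count: int,
                 matched_tokens: int,
                 matched_processable_tokens: int,
                 min_found_tokens: int = 2) -> bool:
    """Decide whether an entity label qualifies as a suggestion."""
    # Exact match: every token of the label was matched in the text.
    if matched_tokens == label_token_count:
        return True
    # Partial match: enough processable tokens were matched
    # (non-processable tokens are not counted here).
    return matched_processable_tokens >= min_found_tokens


# Label "Rupert Murdoch" against the text "Rupert": one token matches.
print(is_suggested(2, 1, 1))
# Label "Barack Hussein Obama" against the text "Barack Obama": two tokens match.
print(is_suggested(3, 2, 2))
```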
+<p>The matching process described above is currently implemented directly in the <a href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinker.java">EntityLinker</a>. To support different matching strategies it would need to be externalized into a separate "EntityLabelMatcher" interface.</p>
+<h3 id="processing-of-entity-suggestions">Processing of Entity Suggestions</h3>
+<p>In case there are one or more <a href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/Suggestion.java">Suggestion</a>s of Entities for the current position within the text a <a href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/LinkedEntity.java">LinkedEntity</a> instance is created.</p>
+<p>LinkedEntity is an object model representing the Stanbol Enhancement Structure. After the processing of the parsed content is completed, the LinkedEntities are "serialized" as RDF triples to the metadata of the ContentItem.</p>
+<p><a href="../enhancementstructure.html#fisetextannotation">TextAnnotation</a>s as defined in the <a href="../enhancementstructure.html">Stanbol Enhancement Structure</a> use the <a href="http://www.dublincore.org/documents/dcmi-terms/#terms-type">dc:type</a> property to provide the general type of the extracted Entity. However, suggested Entities might have very specific types. Therefore the <a href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java">EntityLinkerConfig</a> provides the possibility to map the specific types of the Entity to types used for the dc:type property of TextAnnotations. The <a href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinkerConfig.java">EntityLinkerConfig</a>.DEFAULT_ENTITY_TYPE_MAPPINGS contains some predefined mappings.
+<em>Note that the field used to retrieve the types of a suggested Entity can be configured by the EntityLinkerConfig. The default value for the type field is "rdf:type".</em></p>
+<p>In some cases suggested entities might redirect to others. In the case of Wikipedia/DBpedia this is often used to link from acronyms like <a href="http://en.wikipedia.org/w/index.php?title=IMF&amp;redirect=no">IMF</a> to the real entity <a href="http://en.wikipedia.org/wiki/International_Monetary_Fund">International Monetary Fund</a>. Some thesauri also define labels as separate Entities with their own URI, and users might want to use the URI of the Concept rather than that of the label.
+To support such use cases the KeywordLinkingEngine has support for redirects. Users can configure, first, the redirect mode (ignore, copy values, follow) and, second, the field used to search for redirects (default=rdfs:seeAlso).
+If the redirect mode is not "ignore", the Entities referenced by the configured redirect field are retrieved for each suggestion. In the "copy values" mode the values of the name and type fields are copied. In the "follow" mode the suggested entity is replaced with the first redirected entity.</p>
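+<p>The three redirect modes can be sketched as follows. This is a simplified illustration, not the engine's implementation: entities are modelled as plain dictionaries, the <code>lookup</code> mapping stands in for the entity search, the name field <code>rdfs:label</code> is a hypothetical default, and "copy values" is assumed to mean that values of the redirected entities are added to the suggested entity.</p>

```python
from enum import Enum


class RedirectMode(Enum):
    IGNORE = "ignore"
    COPY_VALUES = "copy values"
    FOLLOW = "follow"


def process_redirects(suggestion, lookup, mode,
                      redirect_field="rdfs:seeAlso",
                      name_field="rdfs:label",   # hypothetical default
                      type_field="rdf:type"):
    """Apply the configured redirect mode to one suggested entity."""
    if mode is RedirectMode.IGNORE:
        return suggestion
    # Retrieve the entities referenced by the redirect field.
    targets = [lookup[uri] for uri in suggestion.get(redirect_field, [])
               if uri in lookup]
    if not targets:
        return suggestion
    if mode is RedirectMode.FOLLOW:
        # Replace the suggestion with the first redirected entity.
        return targets[0]
    # COPY_VALUES: keep the suggestion, add name and type values of the targets.
    merged = dict(suggestion)
    for field in (name_field, type_field):
        values = list(merged.get(field, []))
        for target in targets:
            for value in target.get(field, []):
                if value not in values:
                    values.append(value)
        merged[field] = values
    return merged


fund = {"uri": "ex:International_Monetary_Fund",
        "rdfs:label": ["International Monetary Fund"],
        "rdf:type": ["ex:Organisation"]}
imf = {"uri": "ex:IMF", "rdfs:label": ["IMF"], "rdf:type": [],
       "rdfs:seeAlso": ["ex:International_Monetary_Fund"]}
lookup = {fund["uri"]: fund}

print(process_redirects(imf, lookup, RedirectMode.FOLLOW)["uri"])
print(process_redirects(imf, lookup, RedirectMode.COPY_VALUES)["rdfs:label"])
```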
+<h3 id="confidence-for-suggestions">Confidence for Suggestions</h3>
+<p>The confidence for suggestions is calculated based on the following algorithm:</p>
+<p>Input Parameters</p>
+<ul>
+<li>max_matched: maximum number of matched tokens over all suggestions, e.g. the text contains "Barack Obama" -&gt; 2</li>
+<li>matched: number of tokens that match for the current suggestion, e.g. "Barack Hussein Obama" -&gt; 2</li>
+<li>span: number of tokens of the text selected by the current suggestion, e.g. "Barack Hussein Obama" -&gt; 2</li>
+<li>label_tokens: number of tokens of the matched label of the current entity, e.g. "Barack Hussein Obama" -&gt; 3</li>
+</ul>
+<p>The confidence is calculated as follows: </p>
+<div class="codehilite"><pre><span class="n">confidence</span> <span class="o">=</span> <span class="o">(</span><span class="n">matched</span><span class="o">/</span><span class="n">max_matched</span><span class="o">)^</span><span class="mi">2</span> <span class="o">*</span> <span class="o">(</span><span class="n">matched</span><span class="o">/</span><span class="n">span</span><span class="o">)</span> <span class="o">*</span> <span class="o">(</span><span class="n">matched</span><span class="o">/</span><span class="n">label_tokens</span><span class="o">)</span>
+</pre></div>
+
+
+<p>Some Examples:</p>
+<ul>
+<li>"Barack Hussein Obama" matched against the text "Barack Obama" results in a confidence of (2/2)^2 * (2/2) * (2/3) = 0.67</li>
+<li>"University Michigan" matched against the text "University of Michigan" results in a confidence of (2/2)^2 * (2/3) * (2/2) = 0.67</li>
+<li>"New York City" matched against the text "New York Rangers" - assuming that "New York Rangers" is the best match - results in a confidence of (2/3)^2 * (2/2) * (2/3) = 0.30. Note that the best match "New York Rangers" has max_matched=3 and gets a confidence of 1.</li>
+</ul>
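+<p>The examples above can be reproduced with a few lines of code. This is a sketch of the formula using the parameter names from the list of input parameters; the function itself is illustrative and not part of the EntityLinker API.</p>

```python
def confidence(matched: int, max_matched: int, span: int, label_tokens: int) -> float:
    """Confidence of a suggestion, following the formula given in the text."""
    return (matched / max_matched) ** 2 * (matched / span) * (matched / label_tokens)


# Label "Barack Hussein Obama", text "Barack Obama"
print(round(confidence(matched=2, max_matched=2, span=2, label_tokens=3), 2))
# Label "University Michigan", text "University of Michigan"
print(round(confidence(matched=2, max_matched=2, span=3, label_tokens=2), 2))
# Label "New York City", text "New York Rangers" (best match has 3 tokens)
print(round(confidence(matched=2, max_matched=3, span=2, label_tokens=3), 2))
```

+<p>The first factor is squared, so a suggestion that matches fewer tokens than the best suggestion at the same position is penalized more strongly than one that merely skips tokens in the text or in the label.</p>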
+<p>The calculation of the confidence is currently implemented directly in the <a href="http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/EntityLinker.java">EntityLinker</a>. To support different matching strategies it would need to be externalized into a separate interface.</p>
+<h2 id="notes-about-the-taxonomylinkingengine">Notes about the TaxonomyLinkingEngine</h2>
+<p>The KeywordLinkingEngine is a re-implementation of the TaxonomyLinkingEngine which is more modular and therefore better suited for future improvements and extensions as requested by <a href="https://issues.apache.org/jira/browse/STANBOL-303">STANBOL-303</a>. As of <a href="https://issues.apache.org/jira/browse/STANBOL-506">STANBOL-506</a> the TaxonomyLinkingEngine is deprecated and will be deleted from the SVN repository.</p>
+<!--
+However there would be now the possibility to implement a new version of an TaxonomyLinkingEngine with special support for hierarchical taxonomies. Such an engine would feature:
+
+* default configuration optimized for SKOS
+* support for term hierarchies - adding suggestions for parent concepts. Optionally by using a transitive closure over the hierarchy.
+* support for SKOS matching relations
+* support for restricting enhancements to a specific Taxonomy (skos:ConceptScheme) - this would allow to index several taxonomies in the same ReferencedSite but still use only a specific one for the enhancements.
+
+One Idea would be to allow users to use [LDPath](http://code.google.com/p/ldpath/) to configure post processing rules applied to extracted concepts of the Taxonomy.
+-->
+  </div>
+  
+  <div id="footer">
+    <div class="copyright">
+      <p>
+        Copyright &copy; 2010 The Apache Software Foundation, Licensed under 
+        the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>.
+        <br />
+        Apache, Stanbol and the Apache feather and Stanbol logos are trademarks of The Apache Software Foundation.
+      </p>
+    </div>
+  </div>
+  
+</body>
+</html>

Added: websites/staging/stanbol/trunk/content/stanbol/docs/trunk/components/enhancer/engines/keywordlinkingengineconfig.png
==============================================================================
Binary file - no diff available.

Propchange: websites/staging/stanbol/trunk/content/stanbol/docs/trunk/components/enhancer/engines/keywordlinkingengineconfig.png
------------------------------------------------------------------------------
    svn:mime-type = image/png

Added: websites/staging/stanbol/trunk/content/stanbol/docs/trunk/components/enhancer/engines/langidengine.html
==============================================================================
--- websites/staging/stanbol/trunk/content/stanbol/docs/trunk/components/enhancer/engines/langidengine.html (added)
+++ websites/staging/stanbol/trunk/content/stanbol/docs/trunk/components/enhancer/engines/langidengine.html Mon Jul 16 13:02:45 2012
@@ -0,0 +1,178 @@
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
+<html>
+<head>
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one or more
+    contributor license agreements.  See the NOTICE file distributed with
+    this work for additional information regarding copyright ownership.
+    The ASF licenses this file to You under the Apache License, Version 2.0
+    (the "License"); you may not use this file except in compliance with
+    the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+-->
+
+  <link href="/stanbol/css/stanbol.css" rel="stylesheet" type="text/css">
+  <title>Apache Stanbol - The Language Identification Engine: detect the language of a text</title>
+  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
+  <link rel="icon" type="image/png" href="/stanbol/images/stanbol-logo/stanbol-favicon.png"/>
+  <script type="text/javascript">
+    // Google Analytics Tracking Code
+    var _gaq = _gaq || [];
+    _gaq.push(['_setAccount', 'UA-32086816-1']);
+    _gaq.push(['_trackPageview']);
+
+    (function() {
+      var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
+      ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
+      var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
+    })();
+  </script>  
+</head>
+
+<body>
+  <div id="logo"> <!-- do not scroll the logo -->
+  <a href="/stanbol/index.html"><img alt="Apache Stanbol" width="220" height="101" border="0" src="/stanbol/images/stanbol-logo/stanbol-2010-12-14.png"/></a></div>
+  <div id="navigation"> <!-- but auto scroll the menue -->
+      <h1 id="stanbol">Stanbol</h1>
+<ul>
+<li><a href="/stanbol/index.html">Home</a></li>
+<li><a href="/stanbol/docs/trunk/tutorial.html">Getting Started</a></li>
+<li><a href="/stanbol/docs/trunk/">Documentation</a><ul>
+<li><a href="/stanbol/docs/trunk/scenarios.html">Usage Scenarios</a></li>
+<li><a href="/stanbol/docs/trunk/components.html">Components</a></li>
+</ul>
+</li>
+<li><a href="/stanbol/development/">Development</a></li>
+</ul>
+<h1 id="project">Project</h1>
+<ul>
+<li><a href="/stanbol/docs/trunk/mailinglists.html">Mailing Lists</a></li>
+<li><a href="https://issues.apache.org/jira/browse/STANBOL">Issue Tracker</a></li>
+<li><a href="/stanbol/team.html">Project Team</a></li>
+<li><a href="http://www.apache.org/licenses/LICENSE-2.0">License</a></li>
+<li><a href="/stanbol/privacy-policy.html">Privacy Policy</a></li>
+</ul>
+<h1 id="downloads">Downloads</h1>
+<ul>
+<li><a href="/stanbol/downloads/">Overview</a><ul>
+<li><a href="/stanbol/downloads/releases.html">Releases</a></li>
+<li><a href="/stanbol/downloads/launchers.html">Launchers</a></li>
+</ul>
+</li>
+</ul>
+<h1 id="archive">Archive</h1>
+<ul>
+<li><a href="/stanbol/docs/0.9.0-incubating/">0.9.0-incubating</a></li>
+</ul>
+<h1 id="the-asf">The ASF</h1>
+<ul>
+<li><a href="http://www.apache.org">Apache Software Foundation</a></li>
+<li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
+<li><a href="http://www.apache.org/foundation/sponsorship.html">Become a Sponsor</a></li>
+<li><a href="http://www.apache.org/security/">Security</a></li>
+</ul>
+  </div>
+  <div id="content">
+    <div class="breadcrump" style="font-size: 80%;">
+      <a href="/">Home</a>&nbsp;&raquo&nbsp;<a href="/stanbol/">Stanbol</a>&nbsp;&raquo&nbsp;<a href="/stanbol/docs/">Docs</a>&nbsp;&raquo&nbsp;<a href="/stanbol/docs/trunk/">Trunk</a>&nbsp;&raquo&nbsp;<a href="/stanbol/docs/trunk/components/">Components</a>&nbsp;&raquo&nbsp;<a href="/stanbol/docs/trunk/components/enhancer/">Enhancer</a>&nbsp;&raquo&nbsp;<a href="/stanbol/docs/trunk/components/enhancer/engines/">Engines</a>
+    </div>
+    <h1 class="title">The Language Identification Engine: detect the language of a text</h1>
+    <p>The <strong>LangId</strong> engine determines the language of text.</p>
+<h2 id="technical-description">Technical Description</h2>
+<p>The provided engine is based on the language identifier of <a href="http://tika.apache.org/">Apache Tika</a>.
+The text to be checked must be provided in plain text format in one of two forms:</p>
+<ul>
+<li>a plain text content item</li>
+<li>
+<p>in the content item's metadata, as the string value of the property</p>
+<div class="codehilite"><pre>http://www.semanticdesktop.org/ontologies/2007/01/19/nie#plainTextContent
+</pre></div>
+</li>
+</ul>
+<p>The result of language identification is added as <a href="../enhancementstructure.html#fisetextannotation">fise:TextAnnotation</a> to the content item's metadata as string value of the property</p>
+<div class="codehilite"><pre>http://purl.org/dc/terms/language
+</pre></div>
+
+
+<p>This RDF snippet illustrates the output:</p>
+<div class="codehilite"><pre><span class="nt">&lt;fise:TextAnnotation</span> <span class="na">rdf:about=</span><span class="s">&quot;urn:enhancement-a147957b-41f9-58f7-bbf1-b880b3aa4b49&quot;</span><span class="nt">&gt;</span>
+    <span class="nt">&lt;dc:language&gt;</span>en<span class="nt">&lt;/dc:language&gt;</span>
+    <span class="nt">&lt;dc:creator&gt;</span>org.apache.stanbol.enhancer.engines.langid.LangIdEnhancementEngine<span class="nt">&lt;/dc:creator&gt;</span>
+<span class="nt">&lt;/fise:TextAnnotation&gt;</span>
+</pre></div>
+
+
+<p>By default the language identifier distinguishes the languages listed below. After the colon the value of the language label in the metadata is given.</p>
+<ul>
+<li>German: de</li>
+<li>English: en</li>
+<li>Estonian: et</li>
+<li>French: fr</li>
+<li>Spanish: es</li>
+<li>Italian: it</li>
+<li>Swedish: sv</li>
+<li>Polish: pl</li>
+<li>Dutch: nl</li>
+<li>Norwegian: no</li>
+<li>Finnish: fi</li>
+<li>Greek: el</li>
+<li>Danish: da</li>
+<li>Hungarian: hu</li>
+<li>Icelandic: is</li>
+<li>Lithuanian: lt</li>
+<li>Portuguese: pt</li>
+<li>Russian: ru</li>
+<li>Thai: th</li>
+</ul>
+<p>Additional language models can be created as Tika <a href="org.apache.tika.language.LanguageProfile">LanguageProfile</a>.</p>
+<h2 id="configuration-options">Configuration options</h2>
+<ul>
+<li><code>org.apache.stanbol.enhancer.engines.langid.probe-length</code>: an integer specifying how many characters will be used for identification. A value of 0 or below means to use the complete text. Otherwise only a substring of the specified length taken from the middle of the text will be used. The default value is 400 characters.</li>
+</ul>
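+<p>The effect of the probe length can be sketched as follows. This is an illustration of the behaviour described above, not the engine's code: the function name, the exact centering of the substring and the handling of texts shorter than the probe length are assumptions.</p>

```python
def probe_text(text: str, probe_length: int = 400) -> str:
    """Select the part of the text that is used for language identification."""
    # A value of 0 or below means that the complete text is used.
    if probe_length <= 0 or len(text) <= probe_length:
        return text
    # Otherwise take a substring of the given length from the middle of the text.
    start = (len(text) - probe_length) // 2
    return text[start:start + probe_length]


print(probe_text("abcdefghij", 4))
print(len(probe_text("x" * 1000)))
```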
+<h2 id="usage">Usage</h2>
+<p>Assuming that the Stanbol endpoint with the full launcher is running at</p>
+<div class="codehilite"><pre>http://localhost:8080
+</pre></div>
+
+
+<p>and that the engine is activated, commands like the following can be used
+on the command line to submit a text file as a content item:</p>
+<ul>
+<li>
+<p>stateless interface</p>
+<div class="codehilite"><pre>curl -i -X POST -H "Content-Type:text/plain" -T testfile.txt http://localhost:8080/engines
+</pre></div>
+</li>
+<li>
+<p>stateful interface</p>
+<div class="codehilite"><pre>curl -i -X PUT -H "Content-Type:text/plain" -T testfile.txt http://localhost:8080/contenthub/content/someFileId
+</pre></div>
+</li>
+</ul>
+<p>Alternatively, the Stanbol web interface can be used for submitting documents
+and viewing the metadata at</p>
+<div class="codehilite"><pre>http://localhost:8080/contenthub
+</pre></div>
+  </div>
+  
+  <div id="footer">
+    <div class="copyright">
+      <p>
+        Copyright &copy; 2010 The Apache Software Foundation, Licensed under 
+        the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>.
+        <br />
+        Apache, Stanbol and the Apache feather and Stanbol logos are trademarks of The Apache Software Foundation.
+      </p>
+    </div>
+  </div>
+  
+</body>
+</html>

Added: websites/staging/stanbol/trunk/content/stanbol/docs/trunk/components/enhancer/engines/list.html
==============================================================================
--- websites/staging/stanbol/trunk/content/stanbol/docs/trunk/components/enhancer/engines/list.html (added)
+++ websites/staging/stanbol/trunk/content/stanbol/docs/trunk/components/enhancer/engines/list.html Mon Jul 16 13:02:45 2012
@@ -0,0 +1,188 @@
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
+<html>
+<head>
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one or more
+    contributor license agreements.  See the NOTICE file distributed with
+    this work for additional information regarding copyright ownership.
+    The ASF licenses this file to You under the Apache License, Version 2.0
+    (the "License"); you may not use this file except in compliance with
+    the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+-->
+
+  <link href="/stanbol/css/stanbol.css" rel="stylesheet" type="text/css">
+  <title>Apache Stanbol - Enhancement Engines and their main features</title>
+  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
+  <link rel="icon" type="image/png" href="/stanbol/images/stanbol-logo/stanbol-favicon.png"/>
+  <script type="text/javascript">
+    // Google Analytics Tracking Code
+    var _gaq = _gaq || [];
+    _gaq.push(['_setAccount', 'UA-32086816-1']);
+    _gaq.push(['_trackPageview']);
+
+    (function() {
+      var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
+      ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
+      var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
+    })();
+  </script>  
+</head>
+
+<body>
+  <div id="logo"> <!-- do not scroll the logo -->
+  <a href="/stanbol/index.html"><img alt="Apache Stanbol" width="220" height="101" border="0" src="/stanbol/images/stanbol-logo/stanbol-2010-12-14.png"/></a></div>
+  <div id="navigation"> <!-- but auto scroll the menue -->
+      <h1 id="stanbol">Stanbol</h1>
+<ul>
+<li><a href="/stanbol/index.html">Home</a></li>
+<li><a href="/stanbol/docs/trunk/tutorial.html">Getting Started</a></li>
+<li><a href="/stanbol/docs/trunk/">Documentation</a><ul>
+<li><a href="/stanbol/docs/trunk/scenarios.html">Usage Scenarios</a></li>
+<li><a href="/stanbol/docs/trunk/components.html">Components</a></li>
+</ul>
+</li>
+<li><a href="/stanbol/development/">Development</a></li>
+</ul>
+<h1 id="project">Project</h1>
+<ul>
+<li><a href="/stanbol/docs/trunk/mailinglists.html">Mailing Lists</a></li>
+<li><a href="https://issues.apache.org/jira/browse/STANBOL">Issue Tracker</a></li>
+<li><a href="/stanbol/team.html">Project Team</a></li>
+<li><a href="http://www.apache.org/licenses/LICENSE-2.0">License</a></li>
+<li><a href="/stanbol/privacy-policy.html">Privacy Policy</a></li>
+</ul>
+<h1 id="downloads">Downloads</h1>
+<ul>
+<li><a href="/stanbol/downloads/">Overview</a><ul>
+<li><a href="/stanbol/downloads/releases.html">Releases</a></li>
+<li><a href="/stanbol/downloads/launchers.html">Launchers</a></li>
+</ul>
+</li>
+</ul>
+<h1 id="archive">Archive</h1>
+<ul>
+<li><a href="/stanbol/docs/0.9.0-incubating/">0.9.0-incubating</a></li>
+</ul>
+<h1 id="the-asf">The ASF</h1>
+<ul>
+<li><a href="http://www.apache.org">Apache Software Foundation</a></li>
+<li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
+<li><a href="http://www.apache.org/foundation/sponsorship.html">Become a Sponsor</a></li>
+<li><a href="http://www.apache.org/security/">Security</a></li>
+</ul>
+  </div>
+  <div id="content">
+    <div class="breadcrump" style="font-size: 80%;">
+      <a href="/">Home</a>&nbsp;&raquo&nbsp;<a href="/stanbol/">Stanbol</a>&nbsp;&raquo&nbsp;<a href="/stanbol/docs/">Docs</a>&nbsp;&raquo&nbsp;<a href="/stanbol/docs/trunk/">Trunk</a>&nbsp;&raquo&nbsp;<a href="/stanbol/docs/trunk/components/">Components</a>&nbsp;&raquo&nbsp;<a href="/stanbol/docs/trunk/components/enhancer/">Enhancer</a>&nbsp;&raquo&nbsp;<a href="/stanbol/docs/trunk/components/enhancer/engines/">Engines</a>
+    </div>
+    <h1 class="title">Enhancement Engines and their main features</h1>
+    <p>This page provides an overview of all <a href="index.html">Enhancement Engine</a> implementations managed by the Apache Stanbol community.</p>
+<h2 id="preprocessing">Preprocessing</h2>
+<ul>
+<li>
+<p><strong><a href="langidengine.html">Language Identification Engine</a></strong></p>
+<ul>
+<li>language detection for textual content utilizing <a href="http://tika.apache.org/">Apache Tika</a></li>
+</ul>
+</li>
+<li>
+<p><strong><a href="tikaengine.html">Tika Engine</a></strong> (based on <a href="http://tika.apache.org/">Apache Tika</a>)</p>
+<ul>
+<li>content type detection</li>
+<li>text extraction from various document formats</li>
+<li>extraction of metadata from document formats</li>
+</ul>
+</li>
+<li>
+<p><strong><a href="metaxaengine.html">Metaxa Engine</a></strong></p>
+<ul>
+<li>text extraction from various document formats</li>
+<li>extraction of metadata from document formats</li>
+</ul>
+</li>
+</ul>
+<h2 id="natural-language-processing">Natural Language Processing</h2>
+<ul>
+<li>
+<p><strong><a href="namedentityextractionengine.html">Named Entity Extraction Enhancement Engine</a></strong> </p>
+<ul>
+<li>NLP processing using OpenNLP NER</li>
+<li>detects occurrences of persons, places and organizations only</li>
+</ul>
+</li>
+<li>
+<p><strong><a href="keywordlinkingengine.html">KeywordLinkingEngine</a></strong></p>
+<ul>
+<li>NLP processing using OpenNLP</li>
+<li>supports multiple languages</li>
+<li>detects occurrences of untyped entities as concepts, takes local taxonomies as linking target </li>
+</ul>
+</li>
+</ul>
+<h2 id="linking-suggestions">Linking Suggestions</h2>
+<ul>
+<li>
+<p><strong><a href="namedentitytaggingengine.html">Named Entity Tagging Engine</a></strong></p>
+<ul>
+<li>suggest links to several Linked Data Sources (e.g. DBpedia)</li>
+</ul>
+</li>
+<li>
+<p><strong><a href="geonamesengine.html">Geonames Enhancement Engine</a></strong> </p>
+<ul>
+<li>suggests links to geonames.org</li>
+<li>provides hierarchical links for locations</li>
+</ul>
+</li>
+<li>
+<p><strong><a href="opencalaisengine.html">OpenCalais Enhancement Engine</a></strong></p>
+<ul>
+<li>integrates the Open Calais service. (Note: You need to provide a key in order to use this engine)</li>
+</ul>
+</li>
+<li>
+<p><strong><a href="zemantaengine.html">Zemanta Enhancement Engine</a></strong></p>
+<ul>
+<li>integrates the Zemanta services. (Note: You need to provide a key in order to use this engine)</li>
+</ul>
+</li>
+</ul>
+<h2 id="postprocessing-other">Postprocessing / Other</h2>
+<ul>
+<li>
+<p><em>CachingDereferencerEngine</em> (deprecated, see dereferencing support of individual engines as well as  <a href="https://issues.apache.org/jira/browse/STANBOL-336">STANBOL-336</a>)</p>
+<ul>
+<li>retrieves additional content for presenting the enhancement results.</li>
+</ul>
+</li>
+<li>
+<p><strong><a href="refactorengine.html">Refactor Engine</a></strong></p>
+<ul>
+<li>transforms enhancements according to a target ontology, requires KRES launcher.</li>
+</ul>
+</li>
+</ul>
+  </div>
+  
+  <div id="footer">
+    <div class="copyright">
+      <p>
+        Copyright &copy; 2010 The Apache Software Foundation, Licensed under 
+        the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>.
+        <br />
+        Apache, Stanbol and the Apache feather and Stanbol logos are trademarks of The Apache Software Foundation.
+      </p>
+    </div>
+  </div>
+  
+</body>
+</html>

Added: websites/staging/stanbol/trunk/content/stanbol/docs/trunk/components/enhancer/engines/metaxaengine.html
==============================================================================
--- websites/staging/stanbol/trunk/content/stanbol/docs/trunk/components/enhancer/engines/metaxaengine.html (added)
+++ websites/staging/stanbol/trunk/content/stanbol/docs/trunk/components/enhancer/engines/metaxaengine.html Mon Jul 16 13:02:45 2012
@@ -0,0 +1,394 @@
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
+<html>
+<head>
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one or more
+    contributor license agreements.  See the NOTICE file distributed with
+    this work for additional information regarding copyright ownership.
+    The ASF licenses this file to You under the Apache License, Version 2.0
+    (the "License"); you may not use this file except in compliance with
+    the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+-->
+
+  <link href="/stanbol/css/stanbol.css" rel="stylesheet" type="text/css">
+  <title>Apache Stanbol - The Metaxa Enhancement Engine: extracting content and metadata from various formats</title>
+  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
+  <link rel="icon" type="image/png" href="/stanbol/images/stanbol-logo/stanbol-favicon.png"/>
+  <script type="text/javascript">
+    // Google Analytics Tracking Code
+    var _gaq = _gaq || [];
+    _gaq.push(['_setAccount', 'UA-32086816-1']);
+    _gaq.push(['_trackPageview']);
+
+    (function() {
+      var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
+      ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
+      var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
+    })();
+  </script>  
+</head>
+
+<body>
+  <div id="logo"> <!-- do not scroll the logo -->
+  <a href="/stanbol/index.html"><img alt="Apache Stanbol" width="220" height="101" border="0" src="/stanbol/images/stanbol-logo/stanbol-2010-12-14.png"/></a></div>
+  <div id="navigation"> <!-- but auto scroll the menue -->
+      <h1 id="stanbol">Stanbol</h1>
+<ul>
+<li><a href="/stanbol/index.html">Home</a></li>
+<li><a href="/stanbol/docs/trunk/tutorial.html">Getting Started</a></li>
+<li><a href="/stanbol/docs/trunk/">Documentation</a><ul>
+<li><a href="/stanbol/docs/trunk/scenarios.html">Usage Scenarios</a></li>
+<li><a href="/stanbol/docs/trunk/components.html">Components</a></li>
+</ul>
+</li>
+<li><a href="/stanbol/development/">Development</a></li>
+</ul>
+<h1 id="project">Project</h1>
+<ul>
+<li><a href="/stanbol/docs/trunk/mailinglists.html">Mailing Lists</a></li>
+<li><a href="https://issues.apache.org/jira/browse/STANBOL">Issue Tracker</a></li>
+<li><a href="/stanbol/team.html">Project Team</a></li>
+<li><a href="http://www.apache.org/licenses/LICENSE-2.0">License</a></li>
+<li><a href="/stanbol/privacy-policy.html">Privacy Policy</a></li>
+</ul>
+<h1 id="downloads">Downloads</h1>
+<ul>
+<li><a href="/stanbol/downloads/">Overview</a><ul>
+<li><a href="/stanbol/downloads/releases.html">Releases</a></li>
+<li><a href="/stanbol/downloads/launchers.html">Launchers</a></li>
+</ul>
+</li>
+</ul>
+<h1 id="archive">Archive</h1>
+<ul>
+<li><a href="/stanbol/docs/0.9.0-incubating/">0.9.0-incubating</a></li>
+</ul>
+<h1 id="the-asf">The ASF</h1>
+<ul>
+<li><a href="http://www.apache.org">Apache Software Foundation</a></li>
+<li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
+<li><a href="http://www.apache.org/foundation/sponsorship.html">Become a Sponsor</a></li>
+<li><a href="http://www.apache.org/security/">Security</a></li>
+</ul>
+  </div>
+  <div id="content">
+    <div class="breadcrump" style="font-size: 80%;">
+      <a href="/">Home</a>&nbsp;&raquo&nbsp;<a href="/stanbol/">Stanbol</a>&nbsp;&raquo&nbsp;<a href="/stanbol/docs/">Docs</a>&nbsp;&raquo&nbsp;<a href="/stanbol/docs/trunk/">Trunk</a>&nbsp;&raquo&nbsp;<a href="/stanbol/docs/trunk/components/">Components</a>&nbsp;&raquo&nbsp;<a href="/stanbol/docs/trunk/components/enhancer/">Enhancer</a>&nbsp;&raquo&nbsp;<a href="/stanbol/docs/trunk/components/enhancer/engines/">Engines</a>
+    </div>
+    <h1 class="title">The Metaxa Enhancement Engine: extracting content and metadata from various formats</h1>
+    <p>The <strong>Metaxa Enhancement Engine</strong> extracts embedded metadata and textual content from a large variety of document types and formats. The text extraction functionality also makes Metaxa suitable as a pre-processor for other components, especially NLP processors and indexing for search.</p>
+<h2 id="technical-description">Technical description</h2>
+<p>The engine is based on the <a href="http://aperture.sourceforge.net/">Aperture
+framework</a> with new extensions for handling structured content embedded in HTML web content, such as <a href="http://microformats.org/">Microformats</a> and <a href="http://www.w3.org/TR/rdfa-syntax/">RDFa</a>.
+Also some of the original extractors of Aperture were replaced by other engines using different base libraries.
+Metaxa introduces a single TextEnhancement instance that refers to the content item by its <em>extracted-from</em> property. The specific metadata extracted by Metaxa are ascribed directly to the content item/document since they represent
+document properties and not text annotations. Various ontologies are employed to describe various types of metadata. An overview will be given below.</p>
+<p>The general structure of the Metaxa annotations consists of three levels of annotations illustrated in the following example:</p>
+<h4 id="the-top-level-wzxhzdk10textannotationwzxhzdk11-instance">The top-level <tt>TextAnnotation</tt> instance</h4>
+<div class="codehilite"><pre>&lt;urn:enhancement-03c9e85e-2681-21b7-a5af-6da62d67ef6b&gt;
+     a       &lt;http://fise.iks-project.eu/ontology/TextAnnotation&gt; ,
+             &lt;http://fise.iks-project.eu/ontology/Enhancement&gt; ;
+             &lt;http://fise.iks-project.eu/ontology/confidence&gt;
+                 &quot;1.0&quot;^^&lt;http://www.w3.org/2001/XMLSchema#double&gt; ;
+     &lt;http://fise.iks-project.eu/ontology/extracted-from&gt;
+             &lt;http://localhost:8080/store/content/mf_example.htm&gt; ;
+     &lt;http://purl.org/dc/terms/created&gt;
+             &quot;2010-09-22T09:06:53.056+02:00&quot;^^&lt;http://www.w3.org/2001/XMLSchema#dateTime&gt; ;
+     &lt;http://purl.org/dc/terms/creator&gt;
+              &quot;org.apache.enhancer.engines.metaxa.MetaxaEngine&quot;^^&lt;http://www.w3.org/2001/XMLSchema#string&gt; .
+</pre></div>
+
+
+<h4 id="the-top-level-document-metadata-referenced-from-the-wzxhzdk12textannotationwzxhzdk13-instance-via-the-extracted-from-property">The top-level document metadata, referenced from the <tt>TextAnnotation</tt> instance via the <em>extracted-from</em> property:</h4>
+<div class="codehilite"><pre>&lt;http://localhost:8080/store/content/mf_example.htm&gt;
+     a       &lt;http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#HtmlDocument&gt; ;
+     &lt;http://www.semanticdesktop.org/ontologies/2007/01/19/nie#contains&gt;
+             &lt;urn:rnd:-9e25553:12b3843df43:-7ffe&gt; ;
+     &lt;http://www.semanticdesktop.org/ontologies/2007/01/19/nie#description&gt;
+             &quot;Cheap Flights to Tenerife, Arrecife, Paphos, Mahon, Las Palmas, Malaga, Alicante, Faro, Heraklion, Palma and the rest of the World. Flightline searches over 100 Airlines and 30,000 Hotels. ABTA, IATA, ATOL Bonded.&quot; ;
+     &lt;http://www.semanticdesktop.org/ontologies/2007/01/19/nie#keyword&gt;
+             &quot;travel&quot; , &quot;bargain flights&quot; , &quot;late deals&quot; , &quot;hotels&quot; , &quot;air tickets&quot; , &quot;air fares&quot; , &quot;discount travel&quot; , &quot;last minute flights&quot; , &quot;cheap airlines&quot; , &quot;cheap holidays&quot; , &quot;cheap flights&quot; , &quot;flightline&quot; , &quot;hotel reservations&quot; , &quot;discount flights&quot; , &quot;air travel&quot; , &quot;package holidays&quot; ;
+     &lt;http://www.semanticdesktop.org/ontologies/2007/01/19/nie#title&gt;
+             &quot;Flightline | Cheap Flights, Package Holidays, Hotels, Travel Insurance &amp;amp; More&quot; .
+</pre></div>
+
+
+<p>NOTE: The extracted plain text is no longer added to the metadata of the ContentItem but is stored in its own <a href="../contentitem.html#content_parts">ContentPart</a> with the media type "text/plain". Both the RESTful service and the Java API allow this data to be requested. See the corresponding documentation for details.</p>
+<h4 id="embedded-wzxhzdk14hcardwzxhzdk15-microformat-data-referenced-via-the-wzxhzdk16niecontainswzxhzdk17-property">Embedded <tt>hCard</tt> microformat data referenced via the <tt>nie:contains</tt> property:</h4>
+<div class="codehilite"><pre>&lt;urn:rnd:-9e25553:12b3843df43:-7ffe&gt;
+     a       &lt;http://www.w3.org/2006/vcard/ns#VCard&gt; ;
+     &lt;http://www.w3.org/2006/vcard/ns#adr&gt;
+           &lt;urn:rnd:-9e25553:12b3843df43:-7ffc&gt; ;
+     &lt;http://www.w3.org/2006/vcard/ns#fn&gt;
+           &quot;Flightline Essex Limited&quot; ;
+     &lt;http://www.w3.org/2006/vcard/ns#geo&gt;
+           &lt;urn:rnd:-9e25553:12b3843df43:-7ffb&gt; ;
+    &lt;http://www.w3.org/2006/vcard/ns#org&gt;
+           &lt;urn:rnd:-9e25553:12b3843df43:-7ffd&gt; ;
+    &lt;http://www.w3.org/2006/vcard/ns#photo&gt;
+           &lt;https://www.flightline.co.uk/common/images/building_banner_sm.jpg&gt; ;
+    &lt;http://www.w3.org/2006/vcard/ns#url&gt;
+           &lt;http://www.flightline.co.uk&gt; ;
+    &lt;http://www.w3.org/2006/vcard/ns#workTel&gt;
+           &lt;tel:0800541541&gt; .
+
+&lt;urn:rnd:-9e25553:12b3843df43:-7ffd&gt;
+     a       &lt;http://www.w3.org/2006/vcard/ns#Organization&gt; ;
+     &lt;http://www.w3.org/2006/vcard/ns#organization-name&gt;
+           &quot;Flightline Essex Limited&quot; .
+
+&lt;urn:rnd:-9e25553:12b3843df43:-7ffc&gt;
+     a       &lt;http://www.w3.org/2006/vcard/ns#Address&gt; ;
+     &lt;http://www.w3.org/2006/vcard/ns#countryName&gt;
+           &quot;UK&quot; ;
+     &lt;http://www.w3.org/2006/vcard/ns#extendedAddress&gt;
+          &quot;Flightline House&quot; ;
+     &lt;http://www.w3.org/2006/vcard/ns#locality&gt;
+          &quot;Westcliff-on-Sea&quot; ;
+     &lt;http://www.w3.org/2006/vcard/ns#postalCode&gt;
+          &quot;SS0 7JE&quot; ;
+     &lt;http://www.w3.org/2006/vcard/ns#region&gt;
+          &quot;Essex&quot; ;
+     &lt;http://www.w3.org/2006/vcard/ns#streetAddress&gt;
+          &quot;32-38 Milton Road&quot; .
+
+&lt;urn:rnd:-9e25553:12b3843df43:-7ffb&gt;
+     a       &lt;http://www.w3.org/2006/vcard/ns#Location&gt; ;
+     &lt;http://www.w3.org/2006/vcard/ns#latitude&gt;
+          &quot;51.53894902845868&quot; ;
+     &lt;http://www.w3.org/2006/vcard/ns#longitude&gt;
+          &quot;0.700753927230835&quot; .
+</pre></div>
+
+
+<h3 id="supported-document-types">Supported document types</h3>
+<p>The set of extraction engines for specific document types is defined by the resource <em>extractionregistry.xml</em>. Each engine specifies what MIME types it can handle. By default the extraction registry provides extractors for the
+following set of document formats:</p>
+<ul>
+<li><em>Office documents</em>:
+<ul>
+<li>MS-Works</li>
+<li>MS-Office</li>
+<li>Excel</li>
+<li>PowerPoint</li>
+<li>Word</li>
+<li>Visio</li>
+<li>OpenDocument</li>
+<li>OpenXml</li>
+<li>Publisher</li>
+<li>Corel-Presentations</li>
+<li>QuattroPro</li>
+<li>WordPerfect</li>
+</ul>
+</li>
+<li><em>Multimedia documents</em>:
+<ul>
+<li>JPG</li>
+<li>MP3</li>
+</ul>
+</li>
+<li><em>(X)HTML</em>, also supporting these types of embedded structures/microformats, as defined by the default resource <em>htmlextractors.xml</em>:
+<ul>
+<li>RDFa</li>
+<li>geo</li>
+<li>hAtom</li>
+<li>hCal</li>
+<li>hCard</li>
+<li>hReview</li>
+<li>rel-license</li>
+<li>rel-tag</li>
+<li>xFolk</li>
+</ul>
+</li>
+<li><em>Other</em>:
+<ul>
+<li>PDF</li>
+<li>RTF</li>
+<li>Plain Text</li>
+<li>XML</li>
+</ul>
+</li>
+</ul>
+<h3 id="textual-content">Textual Content</h3>
+<p>The extracted plain text is no longer added to the metadata of the ContentItem but is stored in its own <a href="../contentitem.html#content_parts">ContentPart</a> with the media type "text/plain".</p>
+<p>The following POST request to the Enhancer can be used to directly request the plain text version of the parsed content:</p>
+<div class="codehilite"><pre>curl -v -X POST -H <span class="s2">&quot;Accept: text/plain&quot;</span> <span class="se">\</span>
+    -H <span class="s2">&quot;Content-type: text/html; charset=UTF-8&quot;</span> <span class="se">\</span>
+    --data <span class="s2">&quot;&lt;html&gt;&lt;body&gt;&lt;p&gt;The Stanbol enhancer can detect \</span>
+<span class="s2">      famous cities such as Paris and people such as Bob Marley.&lt;/p&gt;&lt;/body&gt;&lt;/html&gt;&quot;</span> <span class="se">\</span>
+    <span class="s2">&quot;http://localhost:8080/enhancer/chain/language?omitMetadata=true&quot;</span>
+</pre></div>
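+<p>The same request can also be assembled programmatically. The following is a minimal Python sketch, not part of Stanbol itself: the function names <tt>build_plain_text_request</tt> and <tt>fetch_plain_text</tt> are hypothetical, the URL is the one used in the curl example, and the response is assumed to be UTF-8 encoded.</p>

```python
import urllib.request

# Assumption: same chain URL as in the curl example; adjust to your installation.
ENHANCER_URL = "http://localhost:8080/enhancer/chain/language?omitMetadata=true"


def build_plain_text_request(html, url=ENHANCER_URL):
    """Build (but do not send) a POST request that asks for the text/plain part."""
    return urllib.request.Request(
        url,
        data=html.encode("utf-8"),
        headers={
            "Accept": "text/plain",
            "Content-Type": "text/html; charset=UTF-8",
        },
        method="POST",
    )


def fetch_plain_text(html, url=ENHANCER_URL):
    """Send the request; this needs a running Stanbol instance at `url`."""
    request = build_plain_text_request(html, url)
    with urllib.request.urlopen(request) as response:
        # Assumption: the plain text is returned as UTF-8.
        return response.read().decode("utf-8")
```

+<p>Building the request separately from sending it makes it possible to inspect the headers without a running server.</p>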
+
+
+<p>It is also possible to request both the extracted metadata and the plain text version. Please see the documentation of the RESTful API (<a href="http://localhost:8080/enhancer">http://localhost:8080/enhancer</a> if Stanbol runs on localhost).</p>
+<p>NOTE: previous versions of this engine had stored the plain text version by using the "http://www.semanticdesktop.org/ontologies/2007/01/19/nie#plainTextContent" property directly in the metadata of the ContentItem. This is no longer supported.</p>
+<h3 id="vocabularies">Vocabularies</h3>
+<p>Metaxa uses a set of vocabularies ("ontologies") for structured data representation.</p>
+<h4 id="aperture-core-ontologies">Aperture Core Ontologies</h4>
+<p>These ontologies belong to the underlying Aperture subsystem, contained in the
+package</p>
+<div class="codehilite"><pre>org.semanticdesktop.aperture.vocabulary
+</pre></div>
+
+
+<p>The most important ones with respect to top-level document properties are</p>
+<ul>
+<li>
+<p>NIE (NEPOMUK Information Element Ontology):</p>
+<p><tt>http://www.semanticdesktop.org/ontologies/2007/01/19/nie#</tt></p>
+</li>
+<li>
+<p>NFO (NEPOMUK File Ontology):</p>
+<p><tt>http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#</tt></p>
+</li>
+</ul>
+<p>Documentation of Aperture's core ontologies is provided in Aperture's Javadoc <a href="http://aperture.sourceforge.net/doc/javadoc/1.5.0/index.html">http://aperture.sourceforge.net/doc/javadoc/1.5.0/index.html</a> for the packages in </p>
+<div class="codehilite"><pre>org.semanticdesktop.aperture.vocabulary.
+</pre></div>
+
+
+<h4 id="html-microformat-extractors">HTML Microformat Extractors</h4>
+<p>The following table describes which vocabularies are used for representing microformat data in Metaxa: </p>
+<table border="1">
+    <tr>
+        <th>Microformat</th>
+        <th>Vocabulary (Namespace)</th>
+    </tr>
+    <tr>
+        <td>geo</td>
+        <td>wgs84 (<tt>http://www.w3.org/2003/01/geo/wgs84_pos#</tt>)</td>
+    </tr>
+    <tr>
+        <td>hAtom</td>
+        <td>atom (<tt>http://www.w3.org/2005/Atom#</tt>)</td>
+    </tr>
+    <tr>
+        <td></td>
+        <td>tagging (<tt>http://aperture.sourceforge.net/ontologies/tagging#</tt>)</td>
+    </tr>
+    <tr>
+        <td>hCal</td>
+        <td> ical (<tt>http://www.w3.org/2002/12/cal/icaltzd#</tt>)</td>
+    </tr>
+    <tr>
+        <td></td>
+        <td>vcard (<tt>http://www.w3.org/2006/vcard/ns#</tt>)</td>
+    </tr>
+    <tr>
+        <td>hCard</td>
+        <td>vcard (<tt>http://www.w3.org/2006/vcard/ns#</tt>)</td>
+    </tr>
+    <tr>
+        <td>hReview</td>
+        <td>review (<tt>http://www.purl.org/stuff/rev#</tt>)</td></tr>
+    <tr>
+        <td></td>
+        <td>wgs84 (<tt>http://www.w3.org/2003/01/geo/wgs84_pos#</tt>)</td>
+    </tr>
+    <tr>
+        <td></td>
+        <td>dc (<tt>http://purl.org/dc/elements/1.1/</tt>)</td>
+    </tr>
+    <tr>
+        <td></td>
+        <td>dcterms (<tt>http://purl.org/dc/dcmitype/</tt>)</td>
+    </tr>
+    <tr>
+        <td></td>
+        <td>foaf (<tt>http://xmlns.com/foaf/0.1/</tt>)</td>
+    </tr>
+    <tr>
+        <td></td>
+        <td>vcard (<tt>http://www.w3.org/2006/vcard/ns#</tt>)</td>
+    </tr>
+    <tr>
+        <td></td>
+        <td>tag (<tt>http://www.holygoat.co.uk/owl/redwood/0.1/tags/</tt>)</td>
+    </tr>
+    <tr>
+        <td>rel-license</td>
+        <td>dc (<tt>http://purl.org/dc/elements/1.1/</tt>)</td>
+    </tr>
+    <tr>
+        <td>rel-tag</td>
+        <td> tagging (<tt>http://aperture.sourceforge.net/ontologies/tagging#</tt>)</td>
+    </tr>
+    <tr>
+        <td>xFolk</td>
+        <td>nfo (<tt>http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#</tt>)</td>
+    </tr>
+    <tr>
+        <td></td>
+        <td>dc (<tt>http://purl.org/dc/elements/1.1/</tt>)</td>
+    </tr>
+    <tr>
+        <td></td>
+        <td>tagging (<tt>http://aperture.sourceforge.net/ontologies/tagging#</tt>)</td>
+    </tr>
+</table>
+
+<h2 id="configuration-options">Configuration options</h2>
+<p>By default, Metaxa uses the extractors specified in the resource "extractionregistry.xml" and, for HTML pages, in the resource "htmlextractors.xml".
+Alternative configurations and extractors can be attached to Metaxa as fragment bundles that specify the following host bundle:</p>
+<div class="codehilite"><pre>Fragment-Host: org.apache.stanbol.enhancer.engines.metaxa
+</pre></div>
+
+
+<p>The alternative configuration files can then be set as values of the following properties:</p>
+<ul>
+<li>
+<p><code>org.apache.stanbol.enhancer.engines.metaxa.extractionregistry</code></p>
+</li>
+<li>
+<p><code>org.apache.stanbol.enhancer.engines.metaxa.htmlextractors</code></p>
+</li>
+</ul>
+<h2 id="usage">Usage</h2>
+<p>Assuming that the Stanbol endpoint with the full launcher is running at</p>
+<div class="codehilite"><pre>http://localhost:8080
+</pre></div>
+
+
+<p>and that the engine is activated, commands like the following can be used from the command line to submit a file as a content item. The MIME type must match the document type:</p>
+<ul>
+<li>
+<p>stateless interface</p>
+<div class="codehilite"><pre>curl -i -X POST -H &quot;Content-Type:text/html&quot; -T testpage.html http://localhost:8080/engines
+</pre></div>
+</li>
+<li>
+<p>stateful interface</p>
+<div class="codehilite"><pre>curl -i -X PUT -H &quot;Content-Type:text/html&quot; -T testpage.html http://localhost:8080/contenthub/content/someFileId
+</pre></div>
+</li>
+</ul>
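+<p>Because the MIME type has to match the document type, it can help to derive the Content-Type header from the file name. The following Python sketch illustrates this under some assumptions: <tt>build_upload_request</tt> is a hypothetical helper, the endpoint paths are the ones from the curl commands above, and the MIME type is guessed from the file extension only.</p>

```python
import mimetypes
import urllib.request

# Assumption: Stanbol full launcher running at this address.
BASE_URL = "http://localhost:8080"


def build_upload_request(path, content, content_id=None, base_url=BASE_URL):
    """Build (but do not send) an upload request for the given file content.

    Without content_id the stateless endpoint is used (POST); with
    content_id the stateful content hub endpoint is used (PUT).
    """
    # Guess the MIME type from the file extension, e.g. ".html" -> "text/html".
    mime_type, _encoding = mimetypes.guess_type(path)
    if mime_type is None:
        raise ValueError("cannot determine the MIME type of " + path)

    if content_id is None:
        url, method = base_url + "/engines", "POST"
    else:
        url, method = base_url + "/contenthub/content/" + content_id, "PUT"

    return urllib.request.Request(
        url, data=content, headers={"Content-Type": mime_type}, method=method
    )
```

+<p>The returned request object can be passed to <tt>urllib.request.urlopen</tt> once a Stanbol instance is running.</p>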
+<p>Alternatively, the Stanbol web interface can be used for submitting documents
+and viewing the metadata at</p>
+<div class="codehilite"><pre>http://localhost:8080/contenthub
+</pre></div>
+  </div>
+  
+  <div id="footer">
+    <div class="copyright">
+      <p>
+        Copyright &copy; 2010 The Apache Software Foundation, Licensed under 
+        the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>.
+        <br />
+        Apache, Stanbol and the Apache feather and Stanbol logos are trademarks of The Apache Software Foundation.
+      </p>
+    </div>
+  </div>
+  
+</body>
+</html>

Added: websites/staging/stanbol/trunk/content/stanbol/docs/trunk/components/enhancer/engines/namedentityextractionengine.html
==============================================================================
--- websites/staging/stanbol/trunk/content/stanbol/docs/trunk/components/enhancer/engines/namedentityextractionengine.html (added)
+++ websites/staging/stanbol/trunk/content/stanbol/docs/trunk/components/enhancer/engines/namedentityextractionengine.html Mon Jul 16 13:02:45 2012
@@ -0,0 +1,125 @@
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
+<html>
+<head>
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one or more
+    contributor license agreements.  See the NOTICE file distributed with
+    this work for additional information regarding copyright ownership.
+    The ASF licenses this file to You under the Apache License, Version 2.0
+    (the "License"); you may not use this file except in compliance with
+    the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+-->
+
+  <link href="/stanbol/css/stanbol.css" rel="stylesheet" type="text/css">
+  <title>Apache Stanbol - The Named Entity Extraction Engine</title>
+  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
+  <link rel="icon" type="image/png" href="/stanbol/images/stanbol-logo/stanbol-favicon.png"/>
+  <script type="text/javascript">
+    // Google Analytics Tracking Code
+    var _gaq = _gaq || [];
+    _gaq.push(['_setAccount', 'UA-32086816-1']);
+    _gaq.push(['_trackPageview']);
+
+    (function() {
+      var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
+      ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
+      var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
+    })();
+  </script>  
+</head>
+
+<body>
+  <div id="logo"> <!-- do not scroll the logo -->
+  <a href="/stanbol/index.html"><img alt="Apache Stanbol" width="220" height="101" border="0" src="/stanbol/images/stanbol-logo/stanbol-2010-12-14.png"/></a></div>
+      <h1 id="stanbol">Stanbol</h1>
+<ul>
+<li><a href="/stanbol/index.html">Home</a></li>
+<li><a href="/stanbol/docs/trunk/tutorial.html">Getting Started</a></li>
+<li><a href="/stanbol/docs/trunk/">Documentation</a><ul>
+<li><a href="/stanbol/docs/trunk/scenarios.html">Usage Scenarios</a></li>
+<li><a href="/stanbol/docs/trunk/components.html">Components</a></li>
+</ul>
+</li>
+<li><a href="/stanbol/development/">Development</a></li>
+</ul>
+<h1 id="project">Project</h1>
+<ul>
+<li><a href="/stanbol/docs/trunk/mailinglists.html">Mailing Lists</a></li>
+<li><a href="https://issues.apache.org/jira/browse/STANBOL">Issue Tracker</a></li>
+<li><a href="/stanbol/team.html">Project Team</a></li>
+<li><a href="http://www.apache.org/licenses/LICENSE-2.0">License</a></li>
+<li><a href="/stanbol/privacy-policy.html">Privacy Policy</a></li>
+</ul>
+<h1 id="downloads">Downloads</h1>
+<ul>
+<li><a href="/stanbol/downloads/">Overview</a><ul>
+<li><a href="/stanbol/downloads/releases.html">Releases</a></li>
+<li><a href="/stanbol/downloads/launchers.html">Launchers</a></li>
+</ul>
+</li>
+</ul>
+<h1 id="archive">Archive</h1>
+<ul>
+<li><a href="/stanbol/docs/0.9.0-incubating/">0.9.0-incubating</a></li>
+</ul>
+<h1 id="the-asf">The ASF</h1>
+<ul>
+<li><a href="http://www.apache.org">Apache Software Foundation</a></li>
+<li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
+<li><a href="http://www.apache.org/foundation/sponsorship.html">Become a Sponsor</a></li>
+<li><a href="http://www.apache.org/security/">Security</a></li>
+</ul>
+  </div>
+  <div id="content">
+    <div class="breadcrump" style="font-size: 80%;">
+      <a href="/">Home</a>&nbsp;&raquo&nbsp;<a href="/stanbol/">Stanbol</a>&nbsp;&raquo&nbsp;<a href="/stanbol/docs/">Docs</a>&nbsp;&raquo&nbsp;<a href="/stanbol/docs/trunk/">Trunk</a>&nbsp;&raquo&nbsp;<a href="/stanbol/docs/trunk/components/">Components</a>&nbsp;&raquo&nbsp;<a href="/stanbol/docs/trunk/components/enhancer/">Enhancer</a>&nbsp;&raquo&nbsp;<a href="/stanbol/docs/trunk/components/enhancer/engines/">Engines</a>
+    </div>
+    <h1 class="title">The Named Entity Extraction Engine</h1>
+    <p>This engine detects named entities in unstructured text. It is based on the Natural Language Processing (NLP) features of <a href="http://incubator.apache.org/opennlp/">Apache OpenNLP (incubating)</a> and uses maximum entropy models to detect persons, places and organizations.</p>
+<h2 id="example-result">Example Result</h2>
+<p>For the text "The Stanbol enhancer can detect famous cities such as Paris and people such as Bob Marley.", this engine adds (amongst others) the following <a href="../enhancementstructure.html#fisetextannotation">fise:TextAnnotation</a> to the enhancement graph, suggesting Bob Marley (of type: Person) for the string "Bob Marley":</p>
+<div class="codehilite"><pre>{
+  &quot;@subject&quot;: &quot;urn:enhancement-b3d4617d-1760-0374-f471-e0e746003f4e&quot;,
+      &quot;@type&quot;: [ &quot;enhancer:Enhancement&quot;,&quot;enhancer:TextAnnotation&quot;],
+      &quot;dc:created&quot;: &quot;2012-02-29T11:34:56.369Z&quot;,
+      &quot;dc:creator&quot;: &quot;org.apache.stanbol.enhancer.engines.opennlp.impl.NEREngineCore&quot;,
+      &quot;dc:type&quot;: &quot;dbp-ont:Person&quot;,
+      &quot;enhancer:confidence&quot;: 0.94647044,
+      &quot;enhancer:end&quot;: 69,
+      &quot;enhancer:extracted-from&quot;: &quot;urn:content-item-sha1-37c8a8244041cf6113d4ee04b3a04d0a014f6e10&quot;,
+      &quot;enhancer:selected-text&quot;: &quot;Bob Marley&quot;,
+      &quot;enhancer:selection-context&quot;: 
+      &quot;The Stanbol enhancer can detect famous Entities such as Paris or Bob Marley.&quot;,
+      &quot;enhancer:start&quot;: 59
+}
+</pre></div>
+
+
+<p>The following figure provides a visual representation of the above graph:</p>
+<p><img alt="'fise:TextAnnotation'" src="../es_textannotation.png" title="This figure shows a TextAnnotation describing the occurrence of &quot;Bob Marley&quot; located from character 59 to 69 in the given text" /></p>
+<p>See the documentation of the <a href="../enhancementstructure.html">Enhancement Structure</a> for details.</p>
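+<p>A client that consumes such annotations may want to sanity-check the offsets before using them. The following Python sketch is only an illustration: <tt>offsets_match_selection</tt> is a hypothetical helper, the start and end values (59 and 69) are the ones given in the figure caption, and the end offset is assumed to be exclusive, so that end minus start equals the length of the selected text.</p>

```python
import json

# Reduced copy of the example annotation; offsets as stated in the figure caption.
ANNOTATION_JSON = """
{
  "@subject": "urn:enhancement-b3d4617d-1760-0374-f471-e0e746003f4e",
  "dc:type": "dbp-ont:Person",
  "enhancer:selected-text": "Bob Marley",
  "enhancer:start": 59,
  "enhancer:end": 69
}
"""


def offsets_match_selection(annotation):
    """Return True if (end - start) equals the length of the selected text.

    Assumption: the end offset is exclusive.
    """
    start = annotation["enhancer:start"]
    end = annotation["enhancer:end"]
    if start > end:
        raise ValueError("start offset %d is after end offset %d" % (start, end))
    return end - start == len(annotation["enhancer:selected-text"])


annotation = json.loads(ANNOTATION_JSON)
consistent = offsets_match_selection(annotation)
```

+<p>A check like this catches swapped or truncated offsets early, before the selection is highlighted in the original text.</p>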
+  </div>
+  
+  <div id="footer">
+    <div class="copyright">
+      <p>
+        Copyright &copy; 2010 The Apache Software Foundation, Licensed under 
+        the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>.
+        <br />
+        Apache, Stanbol and the Apache feather and Stanbol logos are trademarks of The Apache Software Foundation.
+      </p>
+    </div>
+  </div>
+  
+</body>
+</html>