Posted to commits@stanbol.apache.org by bu...@apache.org on 2011/09/14 17:16:14 UTC

svn commit: r795754 - in /websites/staging/stanbol/trunk/content/stanbol/docs/trunk: customvocabulary.html index.html

Author: buildbot
Date: Wed Sep 14 15:16:13 2011
New Revision: 795754

Log:
Staging update by buildbot

Added:
    websites/staging/stanbol/trunk/content/stanbol/docs/trunk/customvocabulary.html
Modified:
    websites/staging/stanbol/trunk/content/stanbol/docs/trunk/index.html

Added: websites/staging/stanbol/trunk/content/stanbol/docs/trunk/customvocabulary.html
==============================================================================
--- websites/staging/stanbol/trunk/content/stanbol/docs/trunk/customvocabulary.html (added)
+++ websites/staging/stanbol/trunk/content/stanbol/docs/trunk/customvocabulary.html Wed Sep 14 15:16:13 2011
@@ -0,0 +1,140 @@
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
+<html>
+<head>
+<!--
+
+    Licensed to the Apache Software Foundation (ASF) under one or more
+    contributor license agreements.  See the NOTICE file distributed with
+    this work for additional information regarding copyright ownership.
+    The ASF licenses this file to You under the Apache License, Version 2.0
+    (the "License"); you may not use this file except in compliance with
+    the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+-->
+
+  <link href="/stanbol/css/stanbol.css" rel="stylesheet" type="text/css">
+  <title>Apache Stanbol - Using custom/local vocabularies with Apache Stanbol</title>
+  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
+  <link rel="icon" type="image/png" href="/stanbol/images/stanbol-logo/stanbol-favicon.png"/>
+</head>
+
+<body>
+  <div id="navigation"> 
+  <img alt="Apache Stanbol" width="220" height="101" src="/stanbol/images/stanbol-logo/stanbol-2010-12-14.png"/>
+  <h1 id="stanbol_links">Stanbol links</h1>
+<ul>
+<li><a href="/stanbol/index.html">Home</a></li>
+<li><a href="/stanbol/team.html">Project Team</a></li>
+<li><a href="/stanbol/docs/trunk/">Documentation</a></li>
+</ul>
+<h1 id="asf_links">ASF links</h1>
+<ul>
+<li><a href="http://www.apache.org">Apache Software Foundation</a></li>
+<li><a href="http://www.apache.org/licenses/LICENSE-2.0">License</a></li>
+<li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
+<li><a href="http://www.apache.org/foundation/sponsorship.html">Become a Sponsor</a></li>
+<li><a href="http://www.apache.org/security/">Security</a></li>
+</ul>
+  </div>
+  
+  <div id="content">
+    <h1 class="title">Using custom/local vocabularies with Apache Stanbol</h1>
+    <p>For text enhancement and linking to external sources, the Entityhub lets you work with local indexes of datasets. There are several reasons to do so: firstly, you may not want to rely on internet connectivity to these services; secondly, you may want to manage local changes to these public repositories; and thirdly, you may want to work with local resources only, such as your LDAP directory or a specific and private enterprise vocabulary of your domain.</p>
+<p>The main alternative is to upload ontologies to the ontology manager and to use the reasoning components over them.</p>
+<p>This document focuses on two cases:</p>
+<ul>
+<li>Creating and using a local Solr index of a given vocabulary, e.g. a SKOS thesaurus or taxonomy of your domain.</li>
+<li>Directly working with individual instance entities from given ontologies, e.g. a FOAF repository.</li>
+</ul>
+<h2 id="creating_and_working_with_local_indexes">Creating and working with local indexes</h2>
+<p>The ability to work with custom vocabularies in Stanbol is necessary for many organizational use cases, such as being able to detect various types of named entities specific to a company or to detect and work with concepts from a specific domain. Stanbol provides the machinery to start with vocabularies in standard languages such as <a href="http://www.w3.org/2004/02/skos/">SKOS - Simple Knowledge Organization System</a> or more general <a href="http://www.w3.org/TR/rdf-primer/">RDF</a> encoded data sets. The Stanbol components needed for this functionality are the Entityhub, for creating and managing the index, and several <a href="engines.html">Enhancement Engines</a>, which make use of the index during the enhancement process.</p>
+<h3 id="create_your_own_index">Create your own index</h3>
+<p><strong>Step 1 : Create the indexing tool</strong></p>
+<p>The indexing tool provides a default configuration for creating a Solr index of RDF files (e.g. a SKOS export of a thesaurus or a set of FOAF files).</p>
+<p>(1) If the tool was not already built during the Stanbol build of the Entityhub, call</p>
+<div class="codehilite"><pre><span class="n">mvn</span> <span class="n">install</span>
+</pre></div>
+
+
+<p>in the directory <code>{root}/entityhub/indexing/genericrdf/</code> and then</p>
+<div class="codehilite"><pre><span class="n">mvn</span> <span class="n">assembly:single</span>
+</pre></div>
+
+
+<p>Move the generated tool from</p>
+<div class="codehilite"><pre><span class="n">target</span><span class="o">/</span><span class="n">org</span><span class="o">.</span><span class="n">apache</span><span class="o">.</span><span class="n">stanbol</span><span class="o">.</span><span class="n">entityhub</span><span class="o">.</span><span class="n">indexing</span><span class="o">.</span><span class="n">genericrdf</span><span class="o">-*-</span><span class="n">jar</span><span class="o">-</span><span class="n">with</span><span class="o">-</span><span class="n">dependencies</span><span class="o">.</span><span class="n">jar</span>
+</pre></div>
+
+
+<p>into a custom directory, where you want to index your files.</p>
+<p><strong>Step 2 : Create the index</strong></p>
+<p>Initialize the tool with</p>
+<div class="codehilite"><pre><span class="n">java</span> <span class="o">-</span><span class="n">jar</span> <span class="n">org</span><span class="o">.</span><span class="n">apache</span><span class="o">.</span><span class="n">stanbol</span><span class="o">.</span><span class="n">entityhub</span><span class="o">.</span><span class="n">indexing</span><span class="o">.</span><span class="n">genericrdf</span><span class="o">-*-</span><span class="n">jar</span><span class="o">-</span><span class="n">with</span><span class="o">-</span><span class="n">dependencies</span><span class="o">.</span><span class="n">jar</span> <span class="n">init</span>
+</pre></div>
+
+
+<p>This creates a directory with the default configuration files, a directory for the sources, and a distribution directory for the resulting files. Make sure that you adapt the default configuration with at least the name of your index and the namespaces and properties you need to include in the index, and copy your source files into the respective directory <code>indexing/resources/rdfdata</code>. Several standard RDF formats are supported, as are multiple files and archives of them. <em>For details of possible configurations, please consult the <code>{root}/entityhub/indexing/genericrdf/readme.md</code>.</em></p>
+<p>Then you can start the indexing by running</p>
+<div class="codehilite"><pre><span class="n">java</span> <span class="o">-</span><span class="n">Xmx1024m</span> <span class="o">-</span><span class="n">jar</span> <span class="n">org</span><span class="o">.</span><span class="n">apache</span><span class="o">.</span><span class="n">stanbol</span><span class="o">.</span><span class="n">entityhub</span><span class="o">.</span><span class="n">indexing</span><span class="o">.</span><span class="n">genericrdf</span><span class="o">-*-</span><span class="n">jar</span><span class="o">-</span><span class="n">with</span><span class="o">-</span><span class="n">dependencies</span><span class="o">.</span><span class="n">jar</span> <span class="nb">index</span>
+</pre></div>
+
+
+<p>Depending on your hardware and on the complexity and size of your sources, it may take several hours to build the index. As a result, you will get an archive of a <a href="http://lucene.apache.org/solr/">Solr</a> index together with an OSGi bundle to work with the index in Stanbol.</p>
+<p><strong>Step 3 : Initialise the index within Stanbol</strong></p>
+<p>In your running Stanbol instance, copy the ZIP archive into <code>{root}/sling/datafiles</code>. Then, on the "Bundles" tab of the administration console, add and start the <code>org.apache.stanbol.data.site.{name}-{version}.jar</code> bundle.</p>
+<h3 id="configuring_the_enhancement_engines">Configuring the enhancement engines</h3>
+<p>Before you can make use of the custom vocabulary, you need to decide which kind of enhancements you want to support. If your enhancements are NamedEntities in the stricter sense (Persons, Locations, Organizations), then you can use the standard NER engine together with its EntityLinkingEngine to configure the destination of your links.</p>
+<p>If you want to match all kinds of named entities and concepts from your custom vocabulary, you should work with the TaxonomyLinkingEngine, which both finds occurrences and links them to custom entities. In this case you only get results where there is a match, whereas in the case above you also get entities for which no exact link is found. This approach has its advantages when you need a high recall rate on your custom entities.</p>
+<p>The configuration options are described briefly below.</p>
+<p><strong>Use the TaxonomyLinkingEngine only</strong></p>
+<p>(1) To make sure that the enhancement process uses the TaxonomyLinkingEngine only, deactivate the "standard NLP" enhancement engines, especially the NamedEntityExtractionEnhancementEngine (NER) and the EntityLinkingEngine, before working with the TaxonomyLinkingEngine.</p>
+<p>(2) Open the configuration console at http://localhost:8080/system/console/configMgr and navigate to the TaxonomyLinkingEngine. Its main options are configurable via the UI.</p>
+<ul>
+<li>Referenced Site: {put the id/name of your index} (required)</li>
+<li>Label Field: {the property to search for}</li>
+<li>Use Simple Tokenizer: {deactivate to use language specific tokenizers}</li>
+<li>Min Token Length: {set minimal token length}</li>
+<li>Use Chunker: {disable/enable language specific chunkers}</li>
+<li>Suggestions: {maximum number of suggestions}</li>
+<li>Number of Required Tokens: {minimal required tokens}</li>
+</ul>
+<p><em>For further details on the engine and its configuration, please consult the corresponding Readme file at <code>{root}/stanbol/enhancer/engines/taxonomylinking/</code> (TODO: create the readme).</em></p>
+<p><strong>Use several instances of the TaxonomyLinkingEngine</strong></p>
+<p>Working with several instances of the TaxonomyLinkingEngine at the same time can be useful when you have two or more distinct custom vocabularies/indexes, or when you want to combine your specific domain vocabulary with general-purpose datasets such as dbpedia or others.</p>
+<p><strong>Use the TaxonomyLinkingEngine together with the NER engine and the EntityLinkingEngine</strong></p>
+<p>If your text corpus contains and you are interested in both, generic NamedEntities and custom thesaurus you may use <br />
+</p>
+<h3 id="demos_and_examples">Demos and Examples</h3>
+<ul>
+<li>The full demo installation of Stanbol is configured to also work with an environmental thesaurus - if you test it with unstructured text from the domain, you should get enhancements with additional results for specific "concepts".</li>
+<li>One example, using metadata from the Austrian National Library, is described here (TODO: link).</li>
+</ul>
+<p>(TODO) - Examples</p>
+<h2 id="create_a_custom_index_for_dbpedia">Create a custom index for dbpedia</h2>
+<p>(TODO) dbpedia indexing (&lt;-- olivier)</p>
+<h2 id="working_with_ontologies_in_entityhub">Working with ontologies in EntityHub</h2>
+<p>(TODO)</p>
+<h3 id="demos_and_examples_1">Demos and Examples</h3>
+<p>(TODO)</p>
+  </div>
+  
+  <div id="footer">
+    <div class="copyright">
+      <p>
+        Copyright &copy; 2010 The Apache Software Foundation, Licensed under 
+        the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>.
+        <br />
+        Apache, Stanbol and the Apache feather and Stanbol logos are trademarks of The Apache Software Foundation.
+      </p>
+    </div>
+  </div>
+  
+</body>
+</html>

Modified: websites/staging/stanbol/trunk/content/stanbol/docs/trunk/index.html
==============================================================================
--- websites/staging/stanbol/trunk/content/stanbol/docs/trunk/index.html (original)
+++ websites/staging/stanbol/trunk/content/stanbol/docs/trunk/index.html Wed Sep 14 15:16:13 2011
@@ -86,7 +86,7 @@ contains Stanbol's persistent data, depl
 </ul>
 <p>Analyze textual content, enhance it with named entities (person, place, organization), suggest links to open data sources.</p>
 <ul>
-<li>Working with "local" Entities</li>
+<li><a href="customvocabulary.html">Working with "local" Entities</a></li>
 </ul>
 <p>Use locally defined entities (e.g. thesaurus concepts) from an organization's context.<br />
 </p>