Posted to commits@stanbol.apache.org by bu...@apache.org on 2012/06/21 17:43:51 UTC

svn commit: r822652 - in /websites/staging/stanbol/trunk/content: ./ stanbol/docs/trunk/customvocabulary.html

Author: buildbot
Date: Thu Jun 21 15:43:50 2012
New Revision: 822652

Log:
Staging update by buildbot for stanbol

Modified:
    websites/staging/stanbol/trunk/content/   (props changed)
    websites/staging/stanbol/trunk/content/stanbol/docs/trunk/customvocabulary.html

Propchange: websites/staging/stanbol/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Thu Jun 21 15:43:50 2012
@@ -1 +1 @@
-1351720
+1352576

Modified: websites/staging/stanbol/trunk/content/stanbol/docs/trunk/customvocabulary.html
==============================================================================
--- websites/staging/stanbol/trunk/content/stanbol/docs/trunk/customvocabulary.html (original)
+++ websites/staging/stanbol/trunk/content/stanbol/docs/trunk/customvocabulary.html Thu Jun 21 15:43:50 2012
@@ -78,60 +78,60 @@
   
   <div id="content">
     <h1 class="title">Using custom/local vocabularies with Apache Stanbol</h1>
-    <p>The ability to work with custom vocabularies is necessary for many organisations. Use cases range from being able to detect various types of named entities specific of a company or to detect and work with concepts from a specific domain.</p>
-<p>For text enhancement and linking to external sources, the Entityhub component of Apache Stanbol allows to work with local indexes of datasets for several reasons: </p>
+    <p>The ability to work with custom vocabularies is necessary for many use cases. These range from detecting various types of named entities specific to a company to detecting and working with concepts from a specific domain.</p>
+<p>For text enhancement and linking to external sources, the Entityhub component of Apache Stanbol allows you to work with local indexes of datasets. This has several advantages: </p>
 <ul>
-<li>do not want to rely on internet connectivity to these services, thus working offline with a huge set of entities</li>
-<li>want to manage local updates of these public repositories and </li>
-<li>want to work with local resources only, such as your LDAP directory or a specific and private enterprise vocabulary of a specific domain.</li>
+<li>You do not rely on internet connectivity, thus it is possible to operate offline with a huge set of entities.</li>
+<li>You can do local updates of these datasets.</li>
+<li>You can work with local resources, such as your LDAP directory or a private enterprise vocabulary for a specific domain.</li>
 </ul>
-<p>Creating your custom indexes the preferred way of working with custom vocabularies. For small vocabularies, with Entithub one can also upload simple ontologies together instance data directly to the Entityhub and manage them - but as a major downside to this approach, one can only manage one ontology per installation.</p>
-<p>This document focuses on the main case: Creating and using a local SOLr indexes of a custom vocabularies e.g. a SKOS thesaurus or taxonomy of your domain.</p>
+<p>Creating your own indexes is the preferred way of working with custom vocabularies. Small vocabularies can also be uploaded directly to the Entityhub as ontologies. A downside of this approach is that only one ontology per installation is supported.</p>
+<p>If you want to use multiple datasets in parallel, you have to create a local index for each of these datasets and configure the Entityhub to use them. In the following we will focus on the main case: creating and using a local <a href="http://lucene.apache.org/solr/">Apache Solr</a> index of a custom vocabulary, e.g. a SKOS thesaurus or taxonomy of your domain.</p>
 <h2 id="creating-and-working-with-custom-local-indexes">Creating and working with custom local indexes</h2>
-<p>Stanbol provides the machinery to start with vocabularies in standard languages such as <a href="http://www.w3.org/2004/02/skos/">SKOS - Simple Knowledge Organization Systems</a> or more general <a href="http://www.w3.org/TR/rdf-primer/">RDF</a> encoded data sets. The respective Stanbol components, which are needed for this functionality are the Entityhub for creating and managing the index and several <a href="enhancer/engines/list.html">Enhancement Engines</a> to make use of the indexes during the enhancement process.</p>
+<p>Apache Stanbol provides the machinery to start with vocabularies in standard languages such as <a href="http://www.w3.org/2004/02/skos/">SKOS</a> or <a href="http://www.w3.org/TR/rdf-primer/">RDF</a> encoded data sets. The Apache Stanbol components needed for this functionality are the Entityhub with its indexing tool for creating and managing the index, and <a href="enhancer/engines/list.html">enhancement engines</a> that make use of the indexes during the enhancement process.</p>
 <h3 id="a-create-your-own-index">A. Create your own index</h3>
-<p><strong>Step 1 : Create the indexing tool</strong></p>
-<p>The indexing tool provides a default configuration for creating a SOLr index of RDF files (e.g. a SKOS export of a thesaurus or a set of foaf files).</p>
-<p>If not yet built during the Stanbol build process of the Entityhub call</p>
-<div class="codehilite"><pre><span class="n">mvn</span> <span class="n">install</span>
+<p><strong>Step 1 : Compile and assemble the indexing tool</strong></p>
+<p>The indexing tool provides a default configuration for creating an <a href="http://lucene.apache.org/solr/">Apache Solr</a> index of RDF files (e.g. a SKOS export of a thesaurus or a set of foaf files).</p>
+<p>If it has not yet been built during the Apache Stanbol build process, build the Entityhub by calling</p>
+<div class="codehilite"><pre><span class="p">{</span><span class="n">root</span><span class="p">}</span><span class="o">/</span><span class="n">entityhub</span> <span class="nv">$</span> <span class="nv">mvn</span> <span class="n">install</span>
 </pre></div>
 
 
-<p>in the directory <code> {root}/entityhub/indexing/genericrdf/</code>and than</p>
-<div class="codehilite"><pre><span class="n">mvn</span> <span class="n">assembly:single</span>
+<p>and then</p>
+<div class="codehilite"><pre><span class="p">{</span><span class="n">root</span><span class="p">}</span><span class="sr">/entityhub/i</span><span class="n">ndexing</span><span class="sr">/genericrdf/</span> <span class="nv">$</span> <span class="nv">mvn</span> <span class="n">assembly:single</span>
 </pre></div>
 
 
 <p>Move the generated tool from</p>
-<div class="codehilite"><pre><span class="n">target</span><span class="o">/</span><span class="n">org</span><span class="o">.</span><span class="n">apache</span><span class="o">.</span><span class="n">stanbol</span><span class="o">.</span><span class="n">entityhub</span><span class="o">.</span><span class="n">indexing</span><span class="o">.</span><span class="n">genericrdf</span><span class="o">-*-</span><span class="n">jar</span><span class="o">-</span><span class="n">with</span><span class="o">-</span><span class="n">dependencies</span><span class="o">.</span><span class="n">jar</span>
+<div class="codehilite"><pre><span class="p">{</span><span class="n">root</span><span class="p">}</span><span class="sr">/entityhub/i</span><span class="n">ndexing</span><span class="sr">/genericrdf/</span><span class="n">target</span><span class="o">/</span><span class="n">org</span><span class="o">.</span><span class="n">apache</span><span class="o">.</span><span class="n">stanbol</span><span class="o">.</span><span class="n">entityhub</span><span class="o">.</span><span class="n">indexing</span><span class="o">.</span><span class="n">genericrdf</span><span class="o">-*-</span><span class="n">jar</span><span class="o">-</span><span class="n">with</span><span class="o">-</span><span class="n">dependencies</span><span class="o">.</span><span class="n">jar</span>
 </pre></div>
 
 
-<p>into a custom directory, where you want to index your files.</p>
+<p>into a new directory. We will refer to this directory as <code>{indexroot}</code>.</p>
 <p><strong>Step 2 : Create the index</strong></p>
 <p>Initialize the tool with</p>
-<div class="codehilite"><pre><span class="n">java</span> <span class="o">-</span><span class="n">jar</span> <span class="n">org</span><span class="o">.</span><span class="n">apache</span><span class="o">.</span><span class="n">stanbol</span><span class="o">.</span><span class="n">entityhub</span><span class="o">.</span><span class="n">indexing</span><span class="o">.</span><span class="n">genericrdf</span><span class="o">-*-</span><span class="n">jar</span><span class="o">-</span><span class="n">with</span><span class="o">-</span><span class="n">dependencies</span><span class="o">.</span><span class="n">jar</span> <span class="n">init</span>
+<div class="codehilite"><pre><span class="p">{</span><span class="n">indexroot</span><span class="p">}</span> <span class="nv">$</span> <span class="nv">java</span> <span class="o">-</span><span class="n">jar</span> <span class="n">org</span><span class="o">.</span><span class="n">apache</span><span class="o">.</span><span class="n">stanbol</span><span class="o">.</span><span class="n">entityhub</span><span class="o">.</span><span class="n">indexing</span><span class="o">.</span><span class="n">genericrdf</span><span class="o">-*-</span><span class="n">jar</span><span class="o">-</span><span class="n">with</span><span class="o">-</span><span class="n">dependencies</span><span class="o">.</span><span class="n">jar</span> <span class="n">init</span>
 </pre></div>
 
 
-<p>You will get a directory with the default configuration files, one for the sources and a distribution directory for the resulting files. Make sure, that you adapt the default configuration with at least </p>
+<p>This will create a directory for the configuration files with a default configuration, another directory for the sources, and a distribution directory for the resulting files. Make sure that you adapt the default configuration with at least</p>
 <ul>
 <li>the id/name and license information of your data and </li>
-<li>namespaces and properties mapping you want to include to the index (see example of a <a href="examples/anl-mappings.txt">mappings.txt</a> including default and specific mappings for one dataset)</li>
+<li>the namespaces and property mappings you want to include in the index (see the example of a <a href="examples/anl-mappings.txt">mappings.txt</a> including default and specific mappings for one dataset)</li>
 </ul>
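<p>For a SKOS thesaurus, a minimal mappings.txt could look like the following sketch. The property selection here is an example, not a recommendation for any particular dataset; see the tool's README for the full mapping syntax.</p>

```
# Sketch of a mappings.txt for a SKOS vocabulary (property choices are examples)

# index the labels as natural language text
skos:prefLabel | d=entityhub:text
skos:altLabel | d=entityhub:text

# keep type information and the concept hierarchy as entity references
rdf:type | d=entityhub:ref
skos:broader | d=entityhub:ref
skos:narrower | d=entityhub:ref
```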
-<p>Then, copy your source files into the respective directory <code>indexing/resources/rdfdata</code>. Several standard formats for RDF, multiple files and archives of them are supported. </p>
-<p><em>For more details of possible configurations, please consult the <a href="https://github.com/apache/stanbol/blob/trunk/entityhub/indexing/genericrdf/README.md">README</a>.</em></p>
-<p>Then, you can start the index by running</p>
-<div class="codehilite"><pre><span class="n">java</span> <span class="o">-</span><span class="n">Xmx1024m</span> <span class="o">-</span><span class="n">jar</span> <span class="n">org</span><span class="o">.</span><span class="n">apache</span><span class="o">.</span><span class="n">stanbol</span><span class="o">.</span><span class="n">entityhub</span><span class="o">.</span><span class="n">indexing</span><span class="o">.</span><span class="n">genericrdf</span><span class="o">-*-</span><span class="n">jar</span><span class="o">-</span><span class="n">with</span><span class="o">-</span><span class="n">dependencies</span><span class="o">.</span><span class="n">jar</span> <span class="nb">index</span>
+<p>Then, copy your source files into the source directory <code>indexing/resources/rdfdata</code>. The Entityhub indexing tool supports several standard RDF formats, multiple files, and archives of them as source input. </p>
+<p><em>For more details about possible configurations, please consult the <a href="https://github.com/apache/stanbol/blob/trunk/entityhub/indexing/genericrdf/README.md">README</a>.</em></p>
+<p>Once all source files are in place, you can start the indexing process by running</p>
+<div class="codehilite"><pre><span class="p">{</span><span class="n">indexroot</span><span class="p">}</span> <span class="nv">$</span> <span class="nv">java</span> <span class="o">-</span><span class="n">Xmx1024m</span> <span class="o">-</span><span class="n">jar</span> <span class="n">org</span><span class="o">.</span><span class="n">apache</span><span class="o">.</span><span class="n">stanbol</span><span class="o">.</span><span class="n">entityhub</span><span class="o">.</span><span class="n">indexing</span><span class="o">.</span><span class="n">genericrdf</span><span class="o">-*-</span><span class="n">jar</span><span class="o">-</span><span class="n">with</span><span class="o">-</span><span class="n">dependencies</span><span class="o">.</span><span class="n">jar</span> <span class="nb">index</span>
 </pre></div>
 
 
-<p>Depending on your hardware and on complexity and size of your sources, it may take several hours to built the index. As a result, you will get an archive of a <a href="http://lucene.apache.org/solr/">SOLr</a> index together with an OSGI bundle to work with the index in Stanbol.</p>
-<p><strong>Step 3 : Initialize the index within Stanbol</strong></p>
-<p>At your running Stanbol instance, copy the ZIP archive into <code>{root}/sling/datafiles</code>. Then, at the "Bundles" tab of the administration console add and start the <code>org.apache.stanbol.data.site.{name}-{version}.jar</code>.</p>
+<p>Depending on your hardware and on the complexity and size of your sources, it may take several hours to build the index. As a result, you will get an archive of an <a href="http://lucene.apache.org/solr/">Apache Solr</a> index together with an OSGi bundle to work with the index in Apache Stanbol.</p>
+<p><strong>Step 3 : Initialize the index within Apache Stanbol</strong></p>
+<p>We assume that you already have a running Apache Stanbol instance. Copy the ZIP archive into the <code>datafiles</code> folder of that instance. Now open the OSGi administration console of your instance in a web browser. Navigate to the "Bundles" tab and start the newly created bundle named <code>org.apache.stanbol.data.site.{name}-{version}.jar</code>.</p>
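Once the bundle is active, you can check that the new referenced site is available via the Entityhub REST interface. The sketch below assumes a default Apache Stanbol instance on localhost:8080, a site id of {name}, and a purely hypothetical entity URI:

```shell
# Inspect the referenced site (replace {name} with the id of your index)
curl "http://localhost:8080/entityhub/site/{name}"

# Look up a single entity from the new index (URL-encode the entity id)
curl "http://localhost:8080/entityhub/site/{name}/entity?id=http://example.org/thesaurus/concept1"
```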
 <h3 id="b-configure-and-use-the-index-with-enhancement-engines">B. Configure and use the index with enhancement engines</h3>
-<p>Before you can make use of the custom vocabulary you need to decide, which kind of enhancements you want to support. If your enhancements are Named Entities in its strict sense (Persons, Locations, Organizations), then you may use the standard NER engine together with its EntityLinkingEngine to configure the destination of your links.</p>
-<p>In cases, where you want to match all kinds of named entities and concepts from your custom vocabulary, you should work with the <a href="enhancer/engines/keywordlinkingengine.html">KeywordLinkingEngine</a> to both, find occurrences and to link them to custom entities. In this case, you'll get only results, if there is a match, while in the case above, you even get entities, where you don't find exact links. This approach will have its advantages when you need to have a high recall rate on your custom entities.</p>
+<p>Before you can make use of the custom vocabulary you need to decide which kinds of enhancements you want to support. If your enhancements are named entities in the strict sense (e.g. persons, locations, organizations), then you may use the standard NER engine in combination with the EntityLinkingEngine to configure the link destinations of the found entities.</p>
+<p>In case you want to match all kinds of named entities and concepts from your custom vocabulary, you should work with the <a href="enhancer/engines/keywordlinkingengine.html">KeywordLinkingEngine</a> to both find occurrences and link them to custom entities. In this case you will only get results if there is a match, while with the approach above you also get entities for which no exact link is found. This approach has its advantages when you need a high recall rate on your custom entities.</p>
 <p>In the following the configuration options are described briefly.</p>
 <p><strong>Use the KeywordLinkingEngine only</strong></p>
 <p>(1) To make sure, that the enhancement process uses the KeywordLinkingEngine only, deactivate the "standard NLP" enhancement engines, especially the NamedEntityExtractionEnhancementEngine (NER) and the EntityLinkingEngine before to work with the TaxonomyLinkingEngine.</p>
@@ -148,9 +148,10 @@
 </ul>
 <p><em>Full details on the engine and its configuration are available <a href="enhancer/engines/keywordlinkingengine.html">here</a>.</em></p>
 <p><strong>Use several instances of the KeywordLinkingEngine</strong></p>
-<p>To work at the same time with different instances of the KeywordLinkingEngine can be useful in cases, where you have two or more distinct custom vocabularies/indexes and/or if you want to combine your specific domain vocabulary with general purpose datasets such as dbpedia or others.</p>
+<p>You can work with several instances of the KeywordLinkingEngine at the same time.</p>
+<p>This can be useful in cases where you have two or more distinct custom vocabularies/indexes, and/or if you want to combine your specific domain vocabulary with general purpose datasets such as DBpedia.</p>
 <p><strong>Use the KeywordLinkingEngine together with the NER engine and the EntityLinkingEngine</strong></p>
-<p>If your text corpus contains common entities and enterprise specific as well and you are interested getting enhancements for both, you may also use the KeywordLinkingEngine for your custom thesaurus and the NERengine together with the EntityLinkingEngine targeting at e.g. dbpedia at the same time. </p>
+<p>If your text corpus contains common entities as well as enterprise-specific entities and you are interested in getting enhancements for both, you may also use the KeywordLinkingEngine for your custom thesaurus and the NER engine in combination with the EntityLinkingEngine, targeting e.g. DBpedia, at the same time. </p>
 <h2 id="examples">Examples</h2>
 <p>You can find guidance for the following indexers in the README files at <code>{root}/entityhub/indexing/{name-for-indexer}</code></p>
 <ul>
@@ -161,9 +162,9 @@
 </ul>
 <h2 id="demos-and-resources">Demos and Resources</h2>
 <ul>
-<li>The full <a href="http://dev.iks-project.eu:8081/">demo</a> installation of Stanbol is configured to also work with an environmental thesaurus - if you test it with unstructured text from the domain, you should get enhancements with additional results for specific "concepts".</li>
-<li>Download custom test indexes and installer bundles for Stanbol from <a href="http://dev.iks-project.eu/downloads/stanbol-indices/">here</a> (e.g. for GEMET environmental thesaurus, or a big dbpedia index).</li>
-<li>A very concrete example using metadata from the Austrian National Library is described <a href="http://blog.iks-project.eu/using-custom-vocabularies-with-apache-stanbol/">here</a>.</li>
+<li>The full IKS <a href="http://dev.iks-project.eu:8081/">demo</a> installation of Apache Stanbol is configured to also work with an environmental thesaurus - if you test it with unstructured text from that domain, you should get enhancements with additional results for specific concepts.</li>
+<li>Download custom test indexes and installer bundles for Apache Stanbol from <a href="http://dev.iks-project.eu/downloads/stanbol-indices/">here</a> (e.g. for the GEMET environmental thesaurus, or a big DBpedia index).</li>
+<li>Another example using metadata from the Austrian National Library is described <a href="http://blog.iks-project.eu/using-custom-vocabularies-with-apache-stanbol/">here</a>.</li>
 </ul>
   </div>