Posted to commits@stanbol.apache.org by ag...@apache.org on 2011/09/16 11:59:14 UTC

svn commit: r1171482 - in /incubator/stanbol/site/trunk/content/stanbol/docs/trunk: customvocabulary.mdtext examples/ examples/anl-mappings.txt

Author: agruber
Date: Fri Sep 16 09:59:13 2011
New Revision: 1171482

URL: http://svn.apache.org/viewvc?rev=1171482&view=rev
Log:
updated customvocabulary description with examples

Added:
    incubator/stanbol/site/trunk/content/stanbol/docs/trunk/examples/
    incubator/stanbol/site/trunk/content/stanbol/docs/trunk/examples/anl-mappings.txt
Modified:
    incubator/stanbol/site/trunk/content/stanbol/docs/trunk/customvocabulary.mdtext

Modified: incubator/stanbol/site/trunk/content/stanbol/docs/trunk/customvocabulary.mdtext
URL: http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/customvocabulary.mdtext?rev=1171482&r1=1171481&r2=1171482&view=diff
==============================================================================
--- incubator/stanbol/site/trunk/content/stanbol/docs/trunk/customvocabulary.mdtext (original)
+++ incubator/stanbol/site/trunk/content/stanbol/docs/trunk/customvocabulary.mdtext Fri Sep 16 09:59:13 2011
@@ -1,25 +1,28 @@
 Title: Using custom/local vocabularies with Apache Stanbol
 
-For text enhancement and linking to external sources, the Entityhub provides you with the possibility to work with local indexes of datasets for several reasons. Firstly, you do not want to rely on internet connectivity to these services, secondly you may want to manage local changes to these public repository and thirdly, you may want to work with local resources only, such as your LDAP directory or a specific and private enterprise vocabulary of your domain.
+The ability to work with custom vocabularies is necessary for many organisations. Use cases range from detecting various types of named entities specific to a company to detecting and working with concepts from a specific domain.
 
-The main other possibility is to upload ontologies to the ontology manager and to use the reasoning components over it.
+For text enhancement and linking to external sources, the Entityhub component of Apache Stanbol allows you to work with local indexes of datasets, for several reasons: 
 
-This document focuses on two cases:
+- you do not want to rely on internet connectivity to these services and prefer to work offline, even with a huge set of entities,
+- you want to manage local updates of these public repositories, and 
+- you want to work with local resources only, such as your LDAP directory or a specific, private enterprise vocabulary of your domain.
 
-- Creating and using a local SOLr index of a given vocabulary e.g. a SKOS thesaurus or taxonomy of your domain
-- Directly working with individual instance entities from given ontologies e.g. a FOAF repository.
+Creating your own custom indexes is the preferred way of working with custom vocabularies. For small vocabularies, you can also upload simple ontologies together with instance data directly to the Entityhub and manage them there - but as a major downside to this approach, you can only manage one ontology per installation.
 
-## Creating and working with local indexes
+This document focuses on the main case: creating and using a local Solr index of a custom vocabulary, e.g. a SKOS thesaurus or taxonomy of your domain.
 
-The ability to work with custom vocabularies in Stanbol is necessary for many organizational use cases such as beeing able to detect various types of named entities specific to a company or to detect and work with concepts from a specific domain. Stanbol provides the machinery to start with vocabularies in standard languages such as [SKOS - Simple Knowledge Organization Systems](http://www.w3.org/2004/02/skos/) or more general [RDF](http://www.w3.org/TR/rdf-primer/) encoded data sets. The respective Stanbol components, which are needed for this functionality are the Entityhub for creating and managing the index and several [Enhancement Engines](engines.html) to make use of the index during the enhancement process.
+## Creating and working with custom local indexes
 
-### Create your own index
+Stanbol provides the machinery to start with vocabularies in standard languages such as [SKOS - Simple Knowledge Organization Systems](http://www.w3.org/2004/02/skos/) or, more generally, [RDF](http://www.w3.org/TR/rdf-primer/)-encoded data sets. The Stanbol components needed for this functionality are the Entityhub, for creating and managing the index, and several [Enhancement Engines](engines.html) that make use of the indexes during the enhancement process.
+
+### A. Create your own index
 
 **Step 1 : Create the indexing tool**
 
 The indexing tool provides a default configuration for creating a SOLr index of RDF files (e.g. a SKOS export of a thesaurus or a set of foaf files).
 
-(1) If not yet built during the Stanbol build process of the entityhub call
+If the indexing tool was not yet built during the Stanbol build process of the Entityhub, call
 
     mvn install
 
@@ -40,7 +43,14 @@ Initialize the tool with
 
     java -jar org.apache.stanbol.entityhub.indexing.genericrdf-*-jar-with-dependencies.jar init
 
-You will get a directory with the default configuration files, one for the sources and a distribution directory for the resulting files. Make sure, that you adapt the default configuration with at least the name of your index and namespaces and properties you need to include to the index and copy your source files into the respective directory <code>indexing/resources/rdfdata</code>. Several standard formats for RDF, multiple files and archives of them are supported. *For details of possible configurations, please consult the <code>{root}/entityhub/indexing/genericrdf/readme.md</code>.*
+You will get a directory with the default configuration files, one for the sources, and a distribution directory for the resulting files. Make sure that you adapt the default configuration with at least 
+
+- the id/name and licence information of your data, and 
+- the namespace and property mappings you want to include in the index (see this example of a [mappings.txt](examples/anl-mappings.txt) including default and specific mappings for one dataset)
+
+Then, copy your source files into the respective directory <code>indexing/resources/rdfdata</code>. Several standard RDF formats, multiple files and archives of them are supported. 
+
+*For more details on possible configurations, please consult the README at <code>{root}/entityhub/indexing/genericrdf/</code>.*
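
The adaptations described above can be sketched as follows. This is an illustrative example only: the key names in <code>indexing.properties</code> are assumptions that may differ between versions of the indexing tool, so verify them against the defaults generated by the <code>init</code> call; the mapping syntax follows the linked mappings example.

    # indexing/config/indexing.properties (key names are assumptions,
    # verify against the generated defaults)
    name=myvocabulary
    description=Concepts of an example domain thesaurus

    # indexing/config/mappings.txt
    # restrict imported labels to untagged, English and German literals
    | @=null;en;de
    # index all SKOS properties and use skos:prefLabel as the label
    skos:*
    skos:prefLabel > rdfs:label
    # index rdf:type values as entity references
    rdf:type | d=entityhub:ref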
 
 Then, you can start the index by running
 
@@ -54,7 +64,7 @@ Depending on your hardware and on comple
 At your running Stanbol instance, copy the ZIP archive into <code>{root}/sling/datafiles</code>. Then, at the "Bundles" tab of the administration console add and start the <code>org.apache.stanbol.data.site.{name}-{version}.jar</code>.
 
 
-### Configuring the enhancement engines
+### B. Configure and use the index with enhancement engines
 
 Before you can make use of the custom vocabulary you need to decide which kind of enhancements you want to support. If your enhancements are NamedEntities in the more strict sense (Persons, Locations, Organizations), then you may use the standard NER engine together with the EntityLinkingEngine to configure the destination of your links.
 
@@ -69,15 +79,15 @@ In the following the configuration optio
 
 (2) Open the configuration console at http://localhost:8080/system/console/configMgr and navigate to the TaxonomyLinkingEngine. Its main options are configurable via the UI.
 
-- Referenced Site: {put the id/name of your index} (required)
-- Label Field: {the property to search for}
+- Referenced Site: {put the id/name of your index}
+- Label Field: {the property to search for} 
 - Use Simple Tokenizer: {deactivate to use language specific tokenizers}
 - Min Token Length: {set minimal token length}
 - Use Chunker: {disable/enable language specific chunkers}
 - Suggestions: {maximum number of suggestions}
 - Number of Required Tokens: {minimal required tokens}
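 
 As an illustration, a minimal setup for an index registered under the id <code>myvocabulary</code> might use the following values (all values here are examples, not recommendations; the defaults shipped with your engine version may differ):
 
     Referenced Site:           myvocabulary
     Label Field:               rdfs:label
     Use Simple Tokenizer:      disabled (use language specific tokenizers)
     Min Token Length:          3
     Use Chunker:               disabled
     Suggestions:               3
     Number of Required Tokens: 1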
 
-*For further details please on the engine and its configuration please consult the according Readme file at TODO: create the readme <code>{root}/stanbol/enhancer/engines/taxonomylinking/<code>.*
+*For further details on the engine and its configuration, please refer to the corresponding README at <code>{root}/stanbol/enhancer/engines/taxonomylinking/</code>.* (TODO: create the Readme)
 	
 
 **Use several instances of the TaxonomyLinkingEngine**
@@ -87,28 +97,18 @@ To work at the same time with different 
 
 **Use the TaxonomyLinkingEngine together with the NER engine and the EntityLinkingEngine**
 
-If your text corpus contains and you are interested in both, generic NamedEntities and custom thesaurus you may use   
-
-
-
-### Demos and Examples
-
-- The full demo installation of Stanbol is configured to also work with an environmental thesaurus - if you test it with unstructured text from the domain, you should get enhancements with additional results for specific "concepts".
-- One example can be found with metadata from the Austrian National Library is described (TODO: link) here.
-
-(TODO) - Examples
-
+If your text corpus contains both generic NamedEntities and concepts from a custom thesaurus, and you are interested in both, you may use (TODO)  
 
-## Create a custom index for dbpedia
 
-(TODO) dbpedia indexing (<-- olivier)
+## Specific Examples
 
+**Create your custom index for DBpedia:** (TODO: dbpedia indexing (<-- olivier))
 
-## Working with ontologies in EntityHub
 
-(TODO)
+## Resources
 
-### Demos and Examples
+- The full [demo](http://dev.iks-project.eu:8081/) installation of Stanbol is configured to also work with an environmental thesaurus - if you test it with unstructured text from the domain, you should get enhancements with additional results for specific "concepts".
+- Download custom test indexes and installer bundles for Stanbol from [here](http://dev.iks-project.eu/downloads/stanbol-indices/) (e.g. for GEMET environmental thesaurus, or a big dbpedia index).
+- Another concrete example with metadata from the Austrian National Library is described (TODO: link) here.
 
-(TODO)
 

Added: incubator/stanbol/site/trunk/content/stanbol/docs/trunk/examples/anl-mappings.txt
URL: http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/examples/anl-mappings.txt?rev=1171482&view=auto
==============================================================================
--- incubator/stanbol/site/trunk/content/stanbol/docs/trunk/examples/anl-mappings.txt (added)
+++ incubator/stanbol/site/trunk/content/stanbol/docs/trunk/examples/anl-mappings.txt Fri Sep 16 09:59:13 2011
@@ -0,0 +1,164 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+#NOTE: THIS IS A DEFAULT MAPPING SPECIFICATION THAT INCLUDES MAPPINGS FOR
+#      COMMON ONTOLOGIES. USERS MIGHT WANT TO ADAPT THIS CONFIGURATION BY
+#      COMMENTING/UNCOMMENTING AND/OR ADDING NEW MAPPINGS
+
+# --- Define the Languages for all fields ---
+# to restrict languages to be imported (for all fields)
+#| @=null;en;de;fr;it
+
+#NOTE: null is used to import labels with no specified language
+
+# to import all languages leave this empty
+
+# --- RDF RDFS and OWL Mappings ---
+# This configuration only indexes properties that are typically used to store
+# instance data defined by such namespaces. This excludes Ontology definitions.
+
+# NOTE that nearly all other ontologies are using properties of these three
+#      schemas, therefore it is strongly recommended to include such information!
+
+rdf:type | d=entityhub:ref
+
+rdfs:label 
+rdfs:comment
+rdfs:seeAlso | d=entityhub:ref
+
+
+owl:sameAs | d=entityhub:ref
+
+#If one likes to also index Ontologies one should add the following statements
+#owl:*
+#rdfs:*
+
+# --- Dublin Core (DC) ---
+# The default configuration imports all dc-terms data and copies values for the
+# old dc-elements standard over to the corresponding properties of the dc-terms
+# standard.
+
+# NOTE that a lot of other ontologies are also using DC for some of their data,
+#      therefore it is strongly recommended to include such information!
+
+#mapping for all dc-terms properties
+dc:*
+
+# copy dc:title to rdfs:label
+dc:title > rdfs:label
+
+# deactivated by default, because such mappings are mapped to dc-terms
+#dc-elements:*
+
+# mappings for the dc-elements properties to the dc-terms
+dc-elements:contributor > dc:contributor
+dc-elements:coverage > dc:coverage
+dc-elements:creator > dc:creator
+dc-elements:date > dc:date
+dc-elements:description > dc:description
+dc-elements:format > dc:format
+dc-elements:identifier > dc:identifier
+dc-elements:language > dc:language
+dc-elements:publisher > dc:publisher
+dc-elements:relation > dc:relation
+dc-elements:rights > dc:rights
+dc-elements:source > dc:source
+dc-elements:subject > dc:subject
+dc-elements:title > dc:title
+dc-elements:type > dc:type
+#also use dc-elements:title as label
+dc-elements:title > rdfs:label
+
+# --- Social Networks (via foaf) ---
+#The Friend of a Friend schema often used to describe social relations between people
+foaf:*
+
+# copy the name of a person over to rdfs:label
+foaf:name > rdfs:label
+
+# additional data types checks
+foaf:knows | d=entityhub:ref
+foaf:made | d=entityhub:ref
+foaf:maker | d=entityhub:ref
+foaf:member | d=entityhub:ref
+foaf:homepage | d=xsd:anyURI
+foaf:depiction | d=xsd:anyURI
+foaf:img | d=xsd:anyURI
+foaf:logo | d=xsd:anyURI
+#page about the entity
+foaf:page | d=xsd:anyURI
+
+
+# --- Simple Knowledge Organization System (SKOS) ---
+
+# A common data model for sharing and linking knowledge organization systems 
+# via the Semantic Web. Typically used to encode controlled vocabularies such as
+# a thesaurus  
+skos:*
+
+# copy the preferred label  over to rdfs:label
+skos:prefLabel > rdfs:label
+
+# copy values of the **Match relations over to the corresponding related, broader and narrower properties
+skos:relatedMatch > skos:related
+skos:broadMatch > skos:broader
+skos:narrowMatch > skos:narrower
+
+#similar mappings for transitive variants are not contained, because transitive
+#reasoning is not directly supported by the Entityhub.
+
+# Some SKOS thesauri do use "skos:broaderTransitive" and "skos:narrowerTransitive",
+# however such properties are only intended to be used by reasoners to
+# calculate transitive closures over broader/narrower hierarchies.
+# see http://www.w3.org/TR/skos-reference/#L2413 for details
+# to correct such cases we will copy transitive relations to their counterparts
+skos:narrowerTransitive > skos:narrower
+skos:broaderTransitive > skos:broader
+
+
+# --- Semantically-Interlinked Online Communities (SIOC) ---
+
+# an ontology for describing the information in online communities. 
+# This information can be used to export information from online communities 
+# and to link them together. The scope of the application areas that SIOC can 
+# be used for includes (and is not limited to) weblogs, message boards, 
+# mailing lists and chat channels.
+sioc:*
+
+# --- biographical information (bio)
+# A vocabulary for describing biographical information about people, both living
+# and dead. (see http://vocab.org/bio/0.1/)
+bio:*
+
+# --- Rich Site Summary (rss) ---
+rss:*
+
+# --- GoodRelations (gr) ---
+# GoodRelations is a standardised vocabulary for product, price, and company data
+gr:*
+
+# --- Creative Commons Rights Expression Language (cc)
+# The Creative Commons Rights Expression Language (CC REL) lets you describe 
+# copyright licenses in RDF.
+cc:*
+
+# --- Additional namespaces added for the Europeana dataset (http://ckan.net/dataset/europeana-lod) ---
+http://www.europeana.eu/schemas/edm/*
+http://www.openarchives.org/ore/terms/*
+
+
+
+
+