Posted to commits@stanbol.apache.org by rw...@apache.org on 2014/06/02 10:02:59 UTC

svn commit: r1599111 - /stanbol/site/trunk/content/docs/trunk/customvocabulary.mdtext

Author: rwesten
Date: Mon Jun  2 08:02:59 2014
New Revision: 1599111

URL: http://svn.apache.org/r1599111
Log:
changed remaining keyword linking mentions to entity linking

Modified:
    stanbol/site/trunk/content/docs/trunk/customvocabulary.mdtext

Modified: stanbol/site/trunk/content/docs/trunk/customvocabulary.mdtext
URL: http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/customvocabulary.mdtext?rev=1599111&r1=1599110&r2=1599111&view=diff
==============================================================================
--- stanbol/site/trunk/content/docs/trunk/customvocabulary.mdtext (original)
+++ stanbol/site/trunk/content/docs/trunk/customvocabulary.mdtext Mon Jun  2 08:02:59 2014
@@ -67,7 +67,7 @@ Users of the Entityhub Indexing Tool wil
 
 The indexing tool provides a default configuration for creating an [Apache Solr](http://lucene.apache.org/solr/) index of RDF files (e.g. a SKOS export of a thesaurus or a set of foaf files).
 
-To build the indexing tool from source - recommended - you will need to checkout Apache Stanbol form SVN (or [download](../../downloads) a source-release). Instructions for this can be found [here](tutorial.html). However if you want to skip this you can also obtain a [binary version](http://dev.iks-project.eu/downloads/stanbol-launchers/) from the IKS development server (search the sub-folders of the different versions for a file named like "<code>org.apache.stanbol.entityhub.indexing.genericrdf-*-jar-with-dependencies.jar</code>").
+To build the indexing tool from source - recommended - you will need to check out Apache Stanbol from SVN (or [download](../../downloads) a source release). Instructions for this can be found [here](tutorial.html). However, if you want to skip this, you can also obtain a [binary version](http://dev.iks-project.eu/downloads/stanbol-launchers/) from the IKS development server (search the sub-folders of the different versions for a file named like "`org.apache.stanbol.entityhub.indexing.genericrdf-*-jar-with-dependencies.jar`").
 
 In case you downloaded or "svn co" the source to {stanbol-source} and successfully built the source as described in the [Tutorial](tutorial.html), you still need to assemble the indexing tool.
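 A likely invocation for this step - the module path `entityhub/indexing/genericrdf` is an assumption and may differ between Stanbol versions - is:
 
     $ cd {stanbol-source}/entityhub/indexing/genericrdf
     $ mvn assembly:single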
  
@@ -94,19 +94,19 @@ Initialize the tool with
 
 This will create/initialize the default configuration for the Indexing Tool including (relative to {indexing-working-dir}):
 
-*  <code>/indexing/config</code>: Folder containing the default configuration including the "indexing.properties" and "mappings.txt" file.
-*  <code>/indexing/resources</code>: Folder with the source files used for indexing including the "rdfdata" folder where you will need to copy the RDF files to be indexed
-*  <code>/indexing/destination</code>: Folder used to write the data during the indexing process.
-*  <code>/indexing/dist</code>: Folder where you will find the <code>{name}.solrindex.zip</code> and <code>org.apache.stanbol.data.site.{name}-{version}.jar</code> files needed to install your index to the Apache Stanbol Entityhub.
-
-After the initialization you will need to provide the following configurations in files located in the configuration folder (<code>{indexing-working-dir}/indexing/config</code>)
-
-* Within the <code>indexing.properties</code> file you need to set the {name} of your index by changing the value of the "name" property. In addition you should also provide a "description". At the end of the indexing.properties file you can also specify the license and attribution for the data you index. The Apache Entityhub will ensure that those information will be included with any entity data returned for requests.
-* Optionally, if your data do use namespaces that are not present in [prefix.cc](http://prefix.cc) (or the server used for indexing does not have internet connectivity) you can manually define required prefixes by creating/using the a <code>indexing/config/namespaceprefix.mappings</code> file. The syntax is '<code>'{prefix}\t{namespace}\n</code>' where '<code>{prefix} ... [0..9A..Za..z-_]</code>' and '<code>{namespace} ... must end with '#' or '/' for URLs and ':' for URNs</code>'.
-* Optionally, if the data you index do use some none common namespaces you will need to add those to the <code>mapping.txt</code> file (here is an [example](examples/anl-mappings.txt)  including default and specific mappings for one dataset)
-* Optionally, if you want to use a custom SolrCore configuration the core configuration needs to be copied to the <code>indexing/config/{core-name}</code>. Default configuration - to start from - can be downloaded from the [Stanbol SVN](https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/yard/solr/src/main/resources/solr/core/) and extracted to the <code>indexing/config/</code> folder. If the {core-name} is different from the 'name' configured in the <code>indexing.properties</code> than the '<code>solrConf</code>' parameter of the '<code>indexingDestination</code>' MUST be set to '<code>solrConf:{core-name}</code>'. After those configurations users can make custom adaptations to the SolrCore configuration used for indexing. 
+*  `/indexing/config`: Folder containing the default configuration, including the "indexing.properties" and "mappings.txt" files.
+*  `/indexing/resources`: Folder with the source files used for indexing, including the "rdfdata" folder into which you need to copy the RDF files to be indexed.
+*  `/indexing/destination`: Folder used to write the data during the indexing process.
+*  `/indexing/dist`: Folder where you will find the `{name}.solrindex.zip` and `org.apache.stanbol.data.site.{name}-{version}.jar` files needed to install your index to the Apache Stanbol Entityhub.
+
+After the initialization you will need to provide the following configurations in files located in the configuration folder (`{indexing-working-dir}/indexing/config`):
+
+* Within the `indexing.properties` file you need to set the {name} of your index by changing the value of the "name" property. In addition you should also provide a "description". At the end of the indexing.properties file you can also specify the license and attribution for the data you index. The Apache Entityhub will ensure that this information is included with any entity data returned for requests (see the example after this list).
+* Optionally, if your data use namespaces that are not present in [prefix.cc](http://prefix.cc) (or the server used for indexing does not have internet connectivity) you can manually define the required prefixes by creating/editing an `indexing/config/namespaceprefix.mappings` file. The syntax is '`{prefix}\t{namespace}\n`', where `{prefix}` must match `[0-9A-Za-z-_]` and `{namespace}` must end with '#' or '/' for URLs and ':' for URNs.
+* Optionally, if the data you index use some uncommon namespaces you will need to add those to the `mappings.txt` file (here is an [example](examples/anl-mappings.txt) including default and specific mappings for one dataset).
+* Optionally, if you want to use a custom SolrCore configuration, the core configuration needs to be copied to `indexing/config/{core-name}`. A default configuration to start from can be downloaded from the [Stanbol SVN](https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/yard/solr/src/main/resources/solr/core/) and extracted to the `indexing/config/` folder. If the {core-name} is different from the 'name' configured in the `indexing.properties`, then the '`solrConf`' parameter of the '`indexingDestination`' MUST be set to '`solrConf:{core-name}`'. After those configurations users can make custom adaptations to the SolrCore configuration used for indexing.
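+
+As a rough illustration of the first two points, a minimal `indexing.properties` might start with (the values are made up for this example):
+
+    name=myvocabulary
+    description=Test index for a custom SKOS thesaurus
+
+and a `namespaceprefix.mappings` file contains one tab-separated mapping per line, e.g. (the `myvoc` prefix and namespace are hypothetical):
+
+    myvoc	http://www.example.org/myvocabulary#
+    dct	http://purl.org/dc/terms/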
 
-Finally you will also need to copy your source files into the source directory <code>{indexing-working-dir}/indexing/resources/rdfdata</code>. All files within this directory will be indexed. THe indexing tool support most common RDF serialization. You can also directly index compressed RDF files.
+Finally you will also need to copy your source files into the source directory `{indexing-working-dir}/indexing/resources/rdfdata`. All files within this directory will be indexed. The indexing tool supports most common RDF serializations. You can also directly index compressed RDF files.
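+
+For example, assuming a gzipped SKOS export named `myvocabulary.skos.rdf.gz` (a made-up file name):
+
+    $ cp myvocabulary.skos.rdf.gz {indexing-working-dir}/indexing/resources/rdfdata/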
 
 For more details about possible configurations, please consult the [README](https://github.com/apache/stanbol/blob/trunk/entityhub/indexing/genericrdf/README.md).
 
@@ -116,26 +116,26 @@ Once all source files are in place, you 
     $ cd {indexing-working-dir}
     $ java -Xmx1024m -jar org.apache.stanbol.entityhub.indexing.genericrdf-*-jar-with-dependencies.jar index
 
-Depending on your hardware and on complexity and size of your sources, it may take several hours to built the index. As a result, you will get an archive of an [Apache Solr](http://lucene.apache.org/solr/) index together with an OSGI bundle to work with the index in Stanbol. Both files will be located within the <code>indexing/dist</code> folder.
+Depending on your hardware and on the complexity and size of your sources, it may take several hours to build the index. As a result, you will get an archive of an [Apache Solr](http://lucene.apache.org/solr/) index together with an OSGi bundle to work with the index in Stanbol. Both files will be located within the `indexing/dist` folder.
 
 _IMPORTANT NOTES:_ 
 
 * The import of the RDF files to the Jena TDB triple store - used as source for the indexing - takes a lot of time. Because of that, imported data are reused across multiple runs of the indexing tool. This has two important effects users need to be aware of:
 
-    1. Already imported RDF files should be removed from the <code>{indexing-working-dir}/indexing/resources/rdfdata</code> to avoid to re-import them on every run of the tool. NOTE: newer versions of the Entityhub indexing tool might automatically move successfully imported RDF files to a different folder.
-    2. If the RDF data change you will need to delete the Jena TDB store so that those changes are reflected in the created index. To do this delete the <code>{indexing-working-dir}/indexing/resources/tdb</code> folder
+    1. Already imported RDF files should be removed from the `{indexing-working-dir}/indexing/resources/rdfdata` folder to avoid re-importing them on every run of the tool. NOTE: newer versions of the Entityhub indexing tool might automatically move successfully imported RDF files to a different folder.
+    2. If the RDF data change you will need to delete the Jena TDB store so that those changes are reflected in the created index. To do this, delete the `{indexing-working-dir}/indexing/resources/tdb` folder (see the commands after these notes).
 
-* Also the destination folder <code>{indexing-working-dir}/indexing/destination</code> is NOT deleted between multiple calls to index. This has the effect that Entities indexed by previous indexing calls are not deleted. While this allows to index a dataset in multiple steps - or even to combine data of multiple datasets in a single index - this also means that you will need to delete the destination folder if the RDF data you index have changed - especially if some Entities where deleted. 
+* Also the destination folder `{indexing-working-dir}/indexing/destination` is NOT deleted between multiple calls to index. This has the effect that entities indexed by previous indexing calls are not deleted. While this allows indexing a dataset in multiple steps - or even combining data of multiple datasets in a single index - it also means that you will need to delete the destination folder if the RDF data you index have changed - especially if some entities were deleted.
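+
+To restart indexing from a clean state - e.g. after the source RDF changed - one can simply delete both folders mentioned above:
+
+    $ rm -rf {indexing-working-dir}/indexing/resources/tdb
+    $ rm -rf {indexing-working-dir}/indexing/destination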
 
 
 ### Step 3 : Initialize the index within Apache Stanbol
 
 We assume that you already have a running Apache Stanbol instance at http://{stanbol-host} and that {stanbol-working-dir} is the working directory of that instance on the local hard disk. To install the created index you need to 
 
-* copy the "{name}.solrindex.zip" file to the <code>{stanbol-working-dir}/stanbol/datafiles</code> directory (NOTE if you run the 0.9.0-incubating version the path is <code>{stanbol-working-dir}/sling/datafiles</code>).
-* install the <code>org.apache.stanbol.data.site.{name}-{version}.jar</code> to the OSGI environment of your Stanbol instance e.g. by using the Bundle tab of the Apache Felix web console at </code>http://{stanbol-host}/system/console/bundles</code>
+* copy the "{name}.solrindex.zip" file to the `{stanbol-working-dir}/stanbol/datafiles` directory (NOTE if you run the 0.9.0-incubating version the path is `{stanbol-working-dir}/sling/datafiles`).
+* install the `org.apache.stanbol.data.site.{name}-{version}.jar` to the OSGI environment of your Stanbol instance e.g. by using the Bundle tab of the Apache Felix web console at `http://{stanbol-host}/system/console/bundles`
 
-You find both files in the <code>{indexing-working-dir}/indexing/dist/</code> folder.
+You will find both files in the `{indexing-working-dir}/indexing/dist/` folder.
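+
+For example, copying the index archive into place could look like this (assuming the default directory layout used above):
+
+    $ cp {indexing-working-dir}/indexing/dist/{name}.solrindex.zip {stanbol-working-dir}/stanbol/datafiles/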
 
 After the installation your data will be available at
 
@@ -151,13 +151,13 @@ This section covers how to configure the
 Generally there are two possible ways to recognize entities of your vocabulary:
 
 1. __Named Entity Linking__: This first uses Named Entity Recognition (NER) to spot "named entities" in the text and then tries to link those named entities with entities defined in your vocabulary. This approach is limited to entities of the types person, organization and place. So if your vocabulary contains entities of other types, they will not be recognized. In addition it also requires the availability of NER support for the language(s) of the processed documents.
-2. __Keyword Linking__: This uses the labels of entities in your vocabulary for the recognition and linking process. Natural Language Processing (NLP) techniques such as part-of-speach (POS) detection can be used to improve performance and results but this works also without NLP support. As extraction and linking is based on labels mentioned in the analyzed content this method has no restrictions regarding the types of your entities.
+2. __Entity Linking__: This uses the labels of entities in your vocabulary for the recognition and linking process. Natural Language Processing (NLP) techniques such as part-of-speech (POS) detection can be used to improve performance and results, but this also works without NLP support. As extraction and linking are based on labels mentioned in the analyzed content, this method has no restrictions regarding the types of your entities.
 
 For more information about this you might also have a look at the introduction of the [multi lingual](multilingual) usage scenario.
 
 _TIP_: If you are unsure which option to use, you can also start by configuring both to give them a try.
 
-Depending on if you want to use named entity linking or keyword linking the configuration of the [enhancement chain](components/enhancer/chains) and the [enhancement engine](components/enhancer/engines) making use of your vocabulary will be different.
+Depending on whether you want to use _named entity linking_ or _entity linking_, the configuration of the [enhancement chain](components/enhancer/chains) and the [enhancement engine](components/enhancer/engines) making use of your vocabulary will differ. The following two sub-sections provide more information on that.
 
 ### Configuring Named Entity Linking
 
@@ -166,7 +166,7 @@ For the configuration of this engine you
 
 1. The "name" of the enhancement engine. It is recommended to use "{name}Linking" - where {name} is the name of the Entityhub Site (ReferenceSite or ManagedSite).
 2. The name of the referenced site holding your vocabulary. Here you have to configure the {name}.
-3. Enable/disable persons, organizations and places and if enabled configure the <code>rdf:type</code> used by your vocabulary for those type. If you do not want to restrict the type, you can also leave the type field empty.
+3. Enable/disable persons, organizations and places and, if enabled, configure the `rdf:type` used by your vocabulary for those types. If you do not want to restrict the type, you can also leave the type field empty.
 4. Define the property used to match against the named entities detected by the used NER engine(s).
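 
 A sketch of such a configuration for a site named "myvocabulary" - all values below are made up, and the labels paraphrase the form fields shown in the Apache Felix web console rather than literal configuration keys:
 
     Name:              myvocabularyLinking
     Referenced Site:   myvocabulary
     Person Type:       http://xmlns.com/foaf/0.1/Person
     Organization Type: http://xmlns.com/foaf/0.1/Organization
     Place Type:        http://schema.org/Place
     Match Property:    http://www.w3.org/2000/01/rdf-schema#label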
 
 For more detailed information please see the documentation of the [Named Entity Tagging Engine](components/enhancer/engines/namedentitytaggingengine.html).
@@ -198,7 +198,7 @@ To use _Entity Linking_ with a custom Vo
     * in case of the Entityhub Linking Engine the "Label Field" needs to be set to the URI of the property holding the labels. You can only use a single field. If you want to use values of several fields you need to adapt your indexing configuration to copy the values of those fields to a single one (e.g. by adding `skos:prefLabel > rdfs:label` and `skos:altLabel > rdfs:label` to the `{indexing-working-dir}/indexing/config/mappings.txt` config).
     * in case of the FST Linking engine you need to provide the [FST Tagging Configuration](components/enhancer/engines/lucenefstlinking#fst-tagging-configuration). If you store your labels in the `rdfs:label` field and you want to support all languages present in your vocabulary use `*;field=rdfs:label;generate=true`. _NOTE_ that `generate=true` is required to allow the engine to (re)create FST models at runtime.
 4. The "Link ProperNouns only": If the custom Vocabulary contains Proper Nouns (Named Entities) than this parameter should be activated. This options causes the Entity Linking process to not making queries for commons nouns and by that receding the number of queries agains the controlled vocabulary by ~70%. However this is not feasible if the vocabulary does contain Entities that are common nouns in the language. 
-5. The "Type Mappings" might be interesting for you if your vocabulary contains custom types as those mappings can be used to map 'rdf:type's of entities in your vocabulary to 'dc:type's used for 'fise:TextAnnotation's - created by the Apache Stanbol Enhancer to annotate occurrences of extracted entities in the parsed text. See the [type mapping syntax](components/enhancer/engines/keywordlinkingengine.html#type-mappings-syntax) and the [usage scenario for the Apache Stanbol Enhancement Structure](enhancementusage.html#entity-tagging-with-disambiguation-support) for details.
+5. The "Type Mappings" might be interesting for you if your vocabulary contains custom types as those mappings can be used to map 'rdf:type's of entities in your vocabulary to 'dc:type's used for 'fise:TextAnnotation's - created by the Apache Stanbol Enhancer to annotate occurrences of extracted entities in the parsed text. See the [type mapping syntax](components/enhancer/engines/entitylinking.html#type-mappings-syntax) and the [usage scenario for the Apache Stanbol Enhancement Structure](enhancementusage.html#entity-tagging-with-disambiguation-support) for details.
 
 The following example shows an [enhancement chain](components/enhancer/chains) using OpenNLP for NLP