Posted to commits@stanbol.apache.org by rw...@apache.org on 2012/06/18 11:49:59 UTC

svn commit: r1351251 - /incubator/stanbol/site/trunk/content/stanbol/docs/trunk/multilingual.mdtext

Author: rwesten
Date: Mon Jun 18 09:49:58 2012
New Revision: 1351251

URL: http://svn.apache.org/viewvc?rev=1351251&view=rev
Log:
updated the multi lingual extraction user scenario with recent additions to the Stanbol Enhancer

Modified:
    incubator/stanbol/site/trunk/content/stanbol/docs/trunk/multilingual.mdtext

Modified: incubator/stanbol/site/trunk/content/stanbol/docs/trunk/multilingual.mdtext
URL: http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/multilingual.mdtext?rev=1351251&r1=1351250&r2=1351251&view=diff
==============================================================================
--- incubator/stanbol/site/trunk/content/stanbol/docs/trunk/multilingual.mdtext (original)
+++ incubator/stanbol/site/trunk/content/stanbol/docs/trunk/multilingual.mdtext Mon Jun 18 09:49:58 2012
@@ -1,78 +1,136 @@
 Title: Configure Apache Stanbol to work with multiple languages
 
 
-The following languages are supported -
+To understand multilingual support in Apache Stanbol one needs to consider that Stanbol supports two different workflows for extracting Entities from parsed text:
 
-- English
-- German
-- Danish
-- Swedish
-- Dutch
-- Portuguese
+1. __Named Entity Linking__: This first uses Named Entity Recognition (NER) to spot Entities and then links the found Named Entities with Entities defined by the Controlled Vocabulary (e.g. DBpedia.org). For the NER step the [NamedEntityExtraction](enhancer/engines/namedentityextractionengine.html) engine, the CELI NER engine - using the [linguagrid.org](http://linguagrid.org) service - or the [OpenCalais](enhancer/engines/opencalaisengine.html) engine can be used. The linking functionality is implemented by the [NamedEntityTaggingEngine](enhancer/engines/namedentitytaggingengine.html). Multilingual support depends on the availability of NER models for a language. Note also that separate models are required for each Entity type. Typically supported types are Persons, Organizations and Places.
+2. __Keyword Linking__: Entity label based spotting and linking of Entities as implemented by the [KeywordLinkingEngine](enhancer/engines/keywordlinkingengine.html). Natural Language Processing (NLP) techniques such as Part-of-Speech (POS) tagging are used to improve the performance and results of the extraction process, but they are not an absolute requirement. Because extraction only requires a label, this method is also independent of the types of the Entities.
 
+The following Languages are supported for NER - and can therefore be used for Named Entity Linking:
 
-##Configuration steps
+* __English__ (via [NamedEntityExtraction](enhancer/engines/namedentityextractionengine.html), [OpenCalais](enhancer/engines/opencalaisengine.html))
+* __Spanish__ (via [NamedEntityExtraction](enhancer/engines/namedentityextractionengine.html))
+* __Dutch__ (via [NamedEntityExtraction](enhancer/engines/namedentityextractionengine.html))
+* __French__ (via CELI NER engine)
+* __Italian__ (via CELI NER engine)
 
-- Have language labels in your target data and install the index
-- Add language models to your Stanbol instance
-- Activate the LangIdEnhancementEngine and the KeywordLinkingEngine
-- Configure the KeywordLinkingEngine
+_NOTE:_ The CELI and OpenCalais engines require users to create an account with the respective services. In addition, the analyzed content will be sent to those services!
 
+For the following languages NLP support is available to improve results when using the Keyword Linking Engine:
 
-###Install your index
+* __English__
+* __German__
+* __Danish__
+* __Swedish__
+* __Dutch__
+* __Portuguese__
 
-In DBpedia, there exist language labels for many entities. In case you want to use an index of your custom vocabulary, first [create the index](customvocabulary.html) from it and  add the index to your stanbol instance. Simply paste the <code>{yourindex}.solr.zip</code> into your <code>{stanbol-root}/sling/datafiles</code> directory and install the respective OSGI bundle at your OSGI admin console.
 
-Make sure, that this index contains language labels in all languages you want to work with and that they are properly indexed.
+## Configuration steps
 
-###Build and add the necessary language bundles
+This section describes the typical configuration steps required for multilingual text processing with Apache Stanbol.
 
-To build the language bundles go to "{stanbol-root}/data/" and call
+1. Ensure that labels for the language(s) are available in the controlled vocabulary: By default, labels with the given language and labels with no defined language will be used for linking.
+2. Add language models to your Stanbol instance: This includes general NLP models, NER models and possibly the configuration of external services such as CELI or OpenCalais.
+3. Configure the Named Entity Linking / Keyword Linking chain(s):
+    * ensure language detection support (e.g. by using the [Language Identification Engine](enhancer/engines/langidengine.html))
+    * decide to use (1) Named Entity Linking or (2) Keyword Linking based on the supported/required languages and the supported/present types of Entities in the controlled vocabulary
+    * configure the required Enhancement Engines and one or more [Enhancement Chains](enhancer/chains) for processing parsed content.
 
-    mvn clean install -P opennlp
 
-This enables the profile to build the OpenNLP models for all languages.
+###Install your multilingual controlled vocabulary
 
-After this the bundles are available in the folder
+If you want to link Entities in a given language you MUST ensure that labels in those languages are present in the controlled vocabulary you want to link against. It is also possible to tell Stanbol that labels are valid regardless of the language by adding labels without a language tag.
 
-    {stanbol-root}/data/opennlp/lang/{language}/target
+In case you want to link against your own vocabulary you will need to [create your own index](customvocabulary.html) at this point. If you want to use an already indexed dataset you will need to install it to your Stanbol environment by:
 
-The naming of the bundles is "org.apache.stanbol.data.opennlp.lang.{language}-*.jar".
+* copying the <code>{dataset}.solr.zip</code> file to the <code>{stanbol-working-dir}/stanbol/datafiles</code> directory
+* installing the <code>org.apache.stanbol.data.site.{dataset}-1.0.0.jar</code> bundle (e.g. by using the [Bundle Tab](http://localhost:8080/system/console/bundles) of the Apache Felix Web Console - http://{host}:{port}/system/console/bundles).
 
-Add the bundles via the OSGI admin console in the bundles tab. The language bundles will fetch and install the according [OpenNLP](http://dev.iks-project.eu/downloads/opennlp/models-1.5/) models for the languages you want to use.
+_NOTES:_ 
 
+* Indexed datasets can be found at the download section of the [IKS Development Server](http://dev.iks-project.eu/downloads/stanbol-indices/)
+* In case of DBpedia the installation of the <code>org.apache.stanbol.data.site.dbpedia-1.0.0.jar</code> bundle is NOT required, because DBpedia is already included in the default configuration of the launcher.
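+
+A minimal sketch of the installation steps listed above from the command line. The dataset name and paths are the placeholders used above; the verification request assumes the default launcher running at localhost:8080, the default "dbpedia" referenced site and an Entity URI you know is part of the dataset:
+
+    # copy the precomputed Solr index into the Stanbol datafiles folder
+    cp {dataset}.solr.zip {stanbol-working-dir}/stanbol/datafiles/
+    # after installing the org.apache.stanbol.data.site.{dataset}-1.0.0.jar bundle via
+    # the Apache Felix Web Console, check the labels (and their languages) of a known Entity:
+    curl "http://localhost:8080/entityhub/site/dbpedia/entity?id=http://dbpedia.org/resource/Paris"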
 
+### Build and add the necessary language bundles
 
-###Activate LangID engine and KeywordLinkingEngine
+Users of the <code>full-war</code> or <code>full</code> launcher can skip this as all available language bundles are included by default. In case you use the <code>stable</code> launcher or a custom launcher build you will need to manually provide the required language models.
 
-Go to the admin console and deactivate some of the available engines. Especially the standard NER engine and the Entity Linking Engines should be deactivated, as they do not support multiple languages. At least two engines need to be activated:
+In principle there are two possibilities to add language processing and NER models to your Stanbol instance:
 
-- The [Language Identification Engine](enhancer/engines/langidengine.html) provides you with the language of the text you want to enhance, it creates a dc:terms languaage property. The 
-- The [Keyword Linking Engine](enhancer/engines/keywordlinkingengine.html) provides you with the TextAnnotations (selects potential parts of your text) as well as with EntitiyAnnotations (provides suggestions for links). Be aware, that the result (especially the recall) heavily depends on the amount of entities you have specified in your target data source.
+1. you can use the OSGI bundles: These use artifactIds like <code>org.apache.stanbol.data.opennlp.lang.{language}-*.jar</code> and <code>org.apache.stanbol.data.opennlp.ner.{language}-*.jar</code> and can be found under <code>{stanbol-root}/data/opennlp/[ner|lang]/{language}</code> in the Apache Stanbol source
+2. you can obtain the OpenNLP language models yourself and copy them to the <code>{stanbol-working-dir}/stanbol/datafiles</code> folder.
 
+While the latter provides more flexibility it also requires a basic understanding of the [OpenNLP](http://opennlp.apache.org/) models and the processing workflow of the [KeywordLinkingEngine](enhancer/engines/keywordlinkingengine.html).
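+
+For option (1), a rough sketch of building such a bundle from a checkout of the Apache Stanbol source could look as follows (module layout and bundle naming follow the pattern given above and may differ between Stanbol versions; the built jar is installed like any other bundle via the Apache Felix Web Console):
+
+    # build the OpenNLP language model bundle for one language from the Stanbol source
+    cd {stanbol-root}/data/opennlp/lang/{language}
+    mvn clean install
+    # the resulting org.apache.stanbol.data.opennlp.lang.{language}-*.jar is created
+    # in the target/ folder of that module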
 
+###Configuring Language Identification Support
 
-###Configure the KeywordLinkingEngine
+By default Apache Stanbol uses the [Language Identification Engine](enhancer/engines/langidengine.html) that is based on the language identification functionality provided by [Apache Tika](http://tika.apache.org/1.1/detection.html#Language_Detection). As an alternative there is also a language identification engine that uses [linguagrid.org](http://linguagrid.org).
 
-At the OSGI admin console, you can get the most relevant configuration options of the Keyword Linking Engine.
+If you configure your own [Enhancement Chain](enhancer/chains) it is important to use one of those Engines and to ensure that it processes the content before the other engines referenced in this document.
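+
+To quickly verify language detection you can send a short text to the enhancer and look for the detected 'dc:language' value in the returned enhancements. The example assumes the default launcher running at http://localhost:8080 and uses the default chain:
+
+    curl -X POST -H "Content-type: text/plain" \
+         -H "Accept: application/rdf+xml" \
+         --data "Paris ist die Hauptstadt von Frankreich." \
+         http://localhost:8080/enhancer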
 
-- **Referenced Site:** The ID of the Entityhub Referenced Site holding the Controlled Vocabulary (e.g. a taxonomy or just a set of named entities) 
-- **Label Field:** The field used to match Entities with a mentions within the parsed text.
-- **Type Field:** The field used to retrieve the types of matched Entities. Values of that field are expected to be URIs 
+###Configure Named Entity Linking
+
+To use Named Entity Linking, users need to add at least two Enhancement Engines to the current [Enhancement Chain](enhancer/chains):
+
+1. NER Engine: possibilities include
+    * [NamedEntityExtraction](enhancer/engines/namedentityextractionengine.html) engine - default name "<code>ner</code>"
+    * <code>CeliNamedEntityExtractionEnhancementEngine</code> - default name "<code>celiNer</code>": To use this Engine you need to configure a "License Key" or to activate the usage of the Test Account. After providing this configuration you will need to manually disable/enable this engine to bring it from "unsatisfied" to the "active" state.
+    * [OpenCalais](enhancer/engines/opencalaisengine.html) - default name "<code>opencalais</code>": To use this Engine you need to configure your OpenCalais license key. You should also activate the NER-only mode if you use it for this purpose. After providing this configuration you will need to manually disable/enable this engine to bring it from the "unsatisfied" to the "active" state.
+2. Entity Linking: possibilities include
+    * [Named Entity Tagging Engine](enhancer/engines/namedentitytaggingengine.html): This engine allows creating multiple instances for different controlled vocabularies. The default configuration of the Stanbol launchers includes an instance that is configured to link Entities from [DBpedia.org](http://dbpedia.org). To link to your own datasets you will need to create/configure your own instances of this engine by using the [Configuration Tab](http://localhost:8080/system/console/configMgr) of the Apache Felix Web Console - http://{host}:{port}/system/console/configMgr.
+    * [Geonames Enhancement Engine](enhancer/engines/geonamesengine.html): Uses the web services provided by [geonames.org](http://geonames.org) to link extracted Places. To use this Engine you need to configure your geonames.org "License Key" or to activate the anonymous geonames.org service. After providing this configuration you will need to manually disable/enable this engine to bring it from the "unsatisfied" to the "active" state.
+
+It is important to note that one can include multiple NER and Entity Linking Engines in a single [Enhancement Chain](enhancer/chains). A typical example would be:
+
+* "langid" - the required language identification (see previous section)
+* "ner" - for NER support in English, Spanish and Dutch)
+* "celiNer" - for NER support in French and Italien)
+* "dbpediaLinking" - default configuration of the [Named Entity Tagging Engine](namedentitytaggingengine.html) supporting linking with Entities defined by [DBpedia.org](http://dbpedia.org)
+* "{youLinking}" - one or several more [Named Entity Tagging Engine](namedentitytaggingengine.html) supporting linking against your Vocabularies (e.g. customers, employees, project partner, suppliers, competitors …)
+
+###Configure Keyword Linking
+
+To use Keyword Linking one needs only to create/configure an instance of the [KeywordLinkingEngine](enhancer/engines/keywordlinkingengine.html) and add it to the current [Enhancement Chain](enhancer/chains).
+
+The following describes the different options provided by the [KeywordLinkingEngine](enhancer/engines/keywordlinkingengine.html) when configured via the [Configuration Tab](http://localhost:8080/system/console/configMgr) of the Apache Felix Web Console - http://{host}:{port}/system/console/configMgr.
+
+- **Name**: The name of the Engine as referenced in the configuration of the [Enhancement Chain](enhancer/chains)
+- **Referenced Site:** The ID of the Entityhub Referenced Site holding the Controlled Vocabulary. The referenced site id is the name of the referenced site as included in the URL - http://{stanbol-instance}/entityhub/site/{referenced-site-id}
+- **Label Field:** The field used to match Entities with mentions within the parsed text. For well known namespaces you can use "{prefix}:{localName}" instead of the full URI.
+- **Case Sensitivity:** Allows enabling case sensitive matching of labels. This helps to work around problems such as suggesting the abbreviation "AND" for mentions of the English stop word "and".
+- **Type Field:** The field used to retrieve the types of matched Entities. The values of this field are added to the 'fise:entity-type' property of created 'fise:EntityAnnotation's. 
 - **Redirect Field:** Entities may define redirects to other Entities (e.g. "USA" (http://dbpedia.org/resource/USA) -> "United States" (http://dbpedia.org/resource/United_States)). Values of this field are expected to link to other Entities that are part of the controlled vocabulary.
 - **Redirect Mode:** Defines how to process redirects of Entities mentioned in the parsed content. Three modes to deal with such links are supported: Ignore redirects; Add values from redirected Entities to the extracted Entity; Follow redirects and suggest the redirected Entity instead of the extracted one.
-- **Min Token Length:**	The minimum length of Tokens used to lookup Entities within the Controlled Vocabulary. This parameter is ignored in case a POS (Part of Speech) tagger is available for the language of the parsed content.
-- **Suggestions:** The maximal number of suggestions returned for a single mention. (org.apache.stanbol.enhancer.engines.keywordextraction.maxSuggestions)
-Languages	
+- **Min Token Length:** The minimum length of Tokens used to look up Entities within the Controlled Vocabulary. This parameter is ignored in case a certain POS (Part of Speech) tag is available.
+- **Keyword Tokenizer:** Forces the use of a word tokenizer that is optimized for alphanumeric keys such as ISBN numbers, product codes ...
+- **Suggestions:** The maximal number of suggestions returned for a single mention. 
 - **Languages to process:** An empty text indicates that all languages are processed. Use ',' as separator for languages (e.g. 'en,de' to enhance only English and German texts). 
-- **Default Matching Language:** The language used in addition to the language detected for the analysed text to search for Entities. Typically this configuration is an empty string to search for labels without any language defined, but for some data sets (such as DBpedia.org) that add languages to any labels it might improve resuls to change this configuration (e.g. to 'en' in the case of DBpedia.org).
+- **Default Matching Language:** The language used in addition to the language detected for the analysed text to search for Entities. Typically this configuration is an empty string to search for labels without any language defined, but for some data sets (such as DBpedia.org) that add languages to any labels it might improve results to change this configuration (e.g. to 'en' in the case of DBpedia.org).
+- **Type Mappings:** This allows configuring additional mappings for 'dc:type' values added to created 'fise:TextAnnotation's.
+- **Dereference Entities:** This allows including typical properties of linked Entities within the enhancement results. The engine currently includes only specific properties, including the configured "Type Field", "Redirect Field" and "Label Field".
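+
+When deciding on the "Referenced Site" and "Label Field" values it can help to query the Entityhub directly and inspect the labels (and label languages) of your vocabulary. The request below is a sketch that assumes the default launcher and the "dbpedia" referenced site; the parameter name is an assumption based on the Entityhub find service - check the Entityhub documentation of your version:
+
+    # find Entities of the referenced site whose label matches "Paris"
+    curl -X POST -d "name=Paris" http://localhost:8080/entityhub/site/dbpedia/find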
 
 Read the technical description of this [Enhancement  Engine](enhancer/engines/keywordlinkingengine.html) to learn about more configuration options.
 
+Note that an [Enhancement Chain](enhancer/chains) may also contain multiple instances of the [KeywordLinkingEngine](enhancer/engines/keywordlinkingengine.html). It is also possible to mix Named Entity Linking and Keyword Linking in a single chain, e.g. to link Persons/Organizations and Places of DBpedia and any kind of Entities defined in your custom vocabulary. Such an Enhancement Chain could look like:
+
+* "langid" - the required language identification (see previous section)
+* "ner" - for NER support in English, Spanish and Dutch)
+* "celiNer" - for NER support in French and Italien)
+* "dbpediaLinking" - default configuration of the [Named Entity Tagging Engine](namedentitytaggingengine.html) supporting linking with Entities defined by [DBpedia.org](http://dbpedia.org)
+* "youVocKeyqord - custom configuration of the [KeywordLinkingEngine](enhancer/engines/keywordlinkingengine.html) configured to your controlled vocabulary.
 
 ##Results
 
-Depending on your linking target dataset - the engine provides you with enhancement suggestions using labels in your chosen language(s). Note: In the actual version of the DBpedia index, the link directs to the english version of the resource.
+Extracted Entities will be formally described in the RDF enhancement results of the Stanbol Enhancer by:
+
+* fise:TextAnnotation: The occurrence of the extracted Entity within the text, also providing the general nature of the Entity as value of the 'dc:type' property. In case of Named Entity Linking, TextAnnotations represent the Named Entities extracted by the used NER engine(s).
+* fise:EntityAnnotation: Entities of the configured controlled vocabulary suggested for one or more 'fise:TextAnnotation's - referenced as value(s) of the 'dc:relation' property.
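+
+A quick way to see both annotation types is to enhance a short text and filter the returned RDF for them. The sentence below picks up the Bob Marley example from the figure; the endpoint assumes a default launcher at localhost:8080 and the default chain:
+
+    curl -s -X POST -H "Content-type: text/plain" \
+         -H "Accept: application/rdf+xml" \
+         --data "Bob Marley was born in Jamaica." \
+         http://localhost:8080/enhancer | grep -E "TextAnnotation|EntityAnnotation"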
+
+The following figure provides an overview of the knowledge structure.
+
+![Linked Entity Representation](es_entitydisambiguation.png "Bob Marley as spotted in the Text with two suggested Entities part of DBpedia.org")
+
 
 ##Examples
 This [article](http://blog.iks-project.eu/apache-stanbol-now-with-multi-language-support/) from October 2011 describes how to deal with multilingual texts.
\ No newline at end of file