Posted to commits@stanbol.apache.org by "Rupert Westenthaler (JIRA)" <ji...@apache.org> on 2012/10/29 14:30:13 UTC

[jira] [Commented] (STANBOL-740) Adopt the KeywordLinkingEngine to use the AnalyzedText content part

    [ https://issues.apache.org/jira/browse/STANBOL-740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486009#comment-13486009 ] 

Rupert Westenthaler commented on STANBOL-740:
---------------------------------------------

With revision 1403242 [1] a first implementation of the KeywordLinkingEngine that is based on the Stanbol NLP processing module (STANBOL-733) is available in the stanbol-nlp-processing branch [2]. This comment is intended to be moved to the documentation on the Stanbol webpage as soon as this version is re-integrated into the trunk.


## Configuration

Only changes relative to the current version are listed.

### Removed Features

* Keyword Tokenizer (org.apache.stanbol.enhancer.engines.keywordextraction.keywordTokenizer): This allowed the use of a special Tokenizer for matching keywords and alphanumeric IDs. The feature is no longer available, as the KeywordLinkingEngine no longer tokenizes the parsed text and therefore has no influence on how the text is tokenized. To preserve this feature, a new engine specialised for this task would need to be implemented.

### New Features

* __Link ProperNouns only__ _(org.apache.stanbol.enhancer.engines.keywordextraction.properNounsState)_: This boolean switch makes it easy to switch between linking all nouns (state=false) and only proper nouns (state=true). "Noun linking" is equivalent to the current behavior of the KeywordLinkingEngine, while "ProperNoun linking" is more similar to using NER with the NamedEntityLinking engine. For linking against vocabularies that contain Entities typically mentioned in texts as ProperNouns, activating this option will greatly improve performance, as far fewer words need to be looked up in the vocabulary. When linking to a vocabulary that defines Entities that might be mentioned as common nouns, this option needs to be deactivated.

* __Processed Languages__ _(org.apache.stanbol.enhancer.engines.keywordextraction.processedLanguages)_: This feature allows (1) explicitly defining the languages processed by the engine and (2) providing language specific configurations. Language specific configurations will override/extend engine global configurations. See the next section for details.

* _org.apache.stanbol.enhancer.engines.keywordextraction.maxSearchTokenDistance_: The maximum distance of Tokens around linked Tokens that are included in searches within the linked vocabulary (default value is '3'). As an example, in the text section "at the University of Munich a new procedure to" only "Munich" would be looked up in the vocabulary in case "ProperNoun" linking is activated. However, for searching possible matches in the vocabulary it makes sense to use additional Tokens to reduce (and better rank) possible matches for "Munich". Because of that, "matchable" words surrounding looked-up Tokens are considered for inclusion in searches in the vocabulary. This parameter configures the maximum distance of words that are used for such searches. Note that this parameter will not cause words outside of a Chunk to be used for searches (unless the "Ignore Chunks" option is activated).

* _org.apache.stanbol.enhancer.engines.keywordextraction.maxSearchTokens_: The maximum number of Tokens used for searches in the Controlled Vocabulary (default value is '2'). This sets the maximum number of Tokens used in OR queries to the linked vocabulary.

* _org.apache.stanbol.enhancer.engines.keywordextraction.dereferenceFields_: Allows to define additional fields that are included for dereferenced Entities. Only applied if "Dereference Entities" is enabled.
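
The new options can be combined in a single OSGI ".config" fragment. The following sketch is illustrative only (the values are example defaults, not recommendations; the B/I type markers follow the Apache Felix ".config" syntax for Boolean and Integer values):

    org.apache.stanbol.enhancer.engines.keywordextraction.properNounsState=B"true"
    org.apache.stanbol.enhancer.engines.keywordextraction.maxSearchTokenDistance=I"3"
    org.apache.stanbol.enhancer.engines.keywordextraction.maxSearchTokens=I"2"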


### Processed Language Configuration

With the key _'org.apache.stanbol.enhancer.engines.keywordextraction.processedLanguages'_ the processed language(s) can be defined and language specific configurations can be applied.

For the configuration of the processed languages the following syntax is used:

    de
    en
    
This would configure the engine to only process German and English texts. It is also possible to explicitly exclude languages:

    !fr
    !it
    *

This specifies that all languages other than French and Italian are processed.
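
The include/exclude semantics can be sketched with a small helper class (a hypothetical illustration, not the actual Stanbol implementation; explicit exclusions win, and "*" acts as a catch-all for languages not mentioned):

```java
import java.util.Arrays;
import java.util.List;

/**
 * Hypothetical sketch of how an include/exclude language
 * configuration such as ["!fr","!it","de","en","*"] is evaluated.
 */
public class LanguageFilter {

    private final List<String> config;

    public LanguageFilter(String... config) {
        this.config = Arrays.asList(config);
    }

    /** true if texts in the given language should be processed */
    public boolean isProcessed(String lang) {
        if (config.contains("!" + lang)) {
            return false; // explicitly excluded
        }
        if (config.contains(lang)) {
            return true;  // explicitly included
        }
        return config.contains("*"); // wildcard covers all other languages
    }
}
```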

Values MUST BE provided as an Array or Vector. This is done by using the ["elem1","elem2",...] syntax as defined for OSGI ".config" files. The following example shows the two above examples combined into a single configuration:

    org.apache.stanbol.enhancer.engines.keywordextraction.processedLanguages=["!fr","!it","de","en","*"]

In addition to specifying the processed languages, this configuration can also be used to pass language specific parameters. The syntax for parameters is as follows:

    {language};{param-name}={param-value};{param-name}={param-value}

The following param-names are supported by the KeywordLinkingEngine:

* __lc__: This allows to specify the LexicalCategories of words that shall be looked up in the vocabulary. Valid values are the names of members of the LexicalCategory enumeration (e.g. "Noun", "Verb", "Adjective", "Adposition", ...).
* __pos__: This allows to specify the Pos types of words that shall be looked up in the vocabulary. Valid values are the names of members of the Pos enumeration (e.g. "ProperNoun", "CommonNoun", "Infinitive", "Gerund", "PresentParticiple" and ~150 others).
* __tag__: This allows to specify the string tags used by the POS tagger for a language. Words with those tags will be looked up in the vocabulary.
* __prob__: Allows a language specific setting of the _Min POS tag probability_ _(org.apache.stanbol.enhancer.engines.keywordextraction.minPosTagProbability)_. This value [0..1] is used to decide if a POS annotation is confident enough to be used for linking or for rejecting a word ('value/2' is sufficient for rejecting).

Note that a word is linked if either "lc", "pos" or "tag" matches. So setting "pos=ProperNoun" will not have any effect if "lc=Noun" is already defined.
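
How such a configuration entry could be parsed can be sketched as follows (a hypothetical illustration, not the actual Stanbol code; the first ";"-separated element is the language, all further elements are {param-name}={param-value} pairs):

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical sketch of parsing a single processed-language
 * configuration entry such as "nl;lc=Noun" or ";pos=ProperNoun".
 */
public class LangConfigEntry {

    public final String language; // empty for the default (no-language) entry
    public final Map<String, String> params = new LinkedHashMap<>();

    public LangConfigEntry(String entry) {
        String[] parts = entry.split(";");
        // the first element is the language (may be empty, e.g. ";pos=ProperNoun")
        this.language = parts[0].trim();
        for (int i = 1; i < parts.length; i++) {
            String[] kv = parts[i].split("=", 2);
            if (kv.length == 2) {
                params.put(kv[0].trim(), kv[1].trim());
            }
        }
    }
}
```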

The following shows a "Processed Language Configuration" using all of the above mentioned features:

    !fr
    !it
    nl;lc=Noun
    *;pos=ProperNoun

This would process all languages other than French and Italian, link all Nouns for Dutch texts and only ProperNouns for all others.

Users that want to define default parameters without using the "*" wildcard language can use an empty language for passing the parameters. Here is an example:

    nl;lc=Noun
    da
    en
    es
    pt
    sv
    de
    ;pos=ProperNoun

This explicitly includes the seven languages for which OpenNLP POS models are included in the Stanbol Full Launcher. In addition it sets "Noun" linking for Dutch, as the POS tagset for this language does not distinguish between ProperNouns and CommonNouns. For the other six languages only "ProperNouns" are linked.


Users that directly provide configurations as OSGI ".config" files need to properly escape configured parameters. The following example shows the above configuration in the syntax used by ".config" files:

    org.apache.stanbol.enhancer.engines.keywordextraction.processedLanguages=["nl;lc\=Noun","da","en","es","pt","sv","de",";pos\=ProperNoun"]

## Extension Points

This section describes the interfaces that are used as extension points by the KeywordLinkingEngine.

### EntitySearcher

The EntitySearcher interface is used by the KeywordLinkingEngine to search for Entities in the linked vocabulary. Currently the Stanbol Entityhub based implementations are instantiated based on the value of _'org.apache.stanbol.enhancer.engines.keywordextraction.referencedSiteId'_. Users that want a different implementation of this interface to be used for linking will need to extend the KeywordLinkingEngine and override #activateEntitySearcher(ComponentContext context, Dictionary<String,Object> configuration) and #deactivateEntitySearcher(). Those methods are called during activation/deactivation of the KeywordLinkingEngine and are expected to set/unset the #entitySearcher field.

### LabelTokenizer

The LabelTokenizer interface is used to tokenize labels of Entities from the linked vocabulary. As the matching process of the KeywordLinkingEngine is based on Tokens (words), multi-word labels (e.g. "University of Munich") need to be tokenized before they can be matched against the current context in the text.

LabelTokenizers are OSGI services. Their configuration can optionally define the _'enhancer.engines.keywordextraction.labeltokenizer.languages'_ property. Values are considered to be language configurations. Configurations can explicitly include/exclude languages; a wildcard is also supported (e.g. "en,de" would include English and German; "!it,!fr,*" would specify all languages except Italian and French). If no configuration is provided, "*" (all languages) is assumed.
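
As an illustration, a hypothetical LabelTokenizer restricted to English and German could be configured with (property name as described above, values illustrative):

    enhancer.engines.keywordextraction.labeltokenizer.languages=["en","de"]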

The KeywordLinkingEngine will always use the LabelTokenizer with the highest "service.ranking" for a given language to tokenize labels. By default it comes with an OpenNLP based Tokenizer implementation that registers itself for all languages with a "service.ranking" of "-1000".

Users that want to use a different Tokenizer need to register an implementation for the given language(s) with a higher "service.ranking". Users that want to provide their own LabelTokenizer and ignore the values provided by OSGI need to extend the KeywordLinkingEngine, set the #labelTokenizer field themselves AND override the #bindLabelTokenizer(LabelTokenizerManager ltm) and #unbindLabelTokenizer(LabelTokenizerManager ltm) methods in a way that they do NOT change the #labelTokenizer field.
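
The "highest service.ranking wins" selection can be sketched as follows (a simplified, hypothetical illustration; in reality the selection happens via the OSGI service registry):

```java
import java.util.List;

/** Simplified sketch of selecting the LabelTokenizer with the highest ranking. */
public class TokenizerSelector {

    /** Hypothetical pair of a tokenizer id and its "service.ranking" value. */
    public static class RankedTokenizer {
        public final String id;
        public final int ranking;
        public RankedTokenizer(String id, int ranking) {
            this.id = id;
            this.ranking = ranking;
        }
    }

    /** Returns the id of the tokenizer with the highest ranking (null if none). */
    public static String select(List<RankedTokenizer> candidates) {
        RankedTokenizer best = null;
        for (RankedTokenizer t : candidates) {
            if (best == null || t.ranking > best.ranking) {
                best = t;
            }
        }
        return best == null ? null : best.id;
    }
}
```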


[1] http://svn.apache.org/viewvc?rev=1403242&view=rev
[2] https://svn.apache.org/repos/asf/stanbol/branches/stanbol-nlp-processing
                
> Adopt the KeywordLinkingEngine to use the AnalyzedText content part
> -------------------------------------------------------------------
>
>                 Key: STANBOL-740
>                 URL: https://issues.apache.org/jira/browse/STANBOL-740
>             Project: Stanbol
>          Issue Type: Sub-task
>            Reporter: Rupert Westenthaler
>            Assignee: Rupert Westenthaler
>
> The KeywordLinkingEngine currently does both NLP processing AND linking against the target vocabulary. Up to now this was the only possibility as separating those two things was not feasible with the limitations of the RDF metadata.
> With the introduction of the AnalyzedText content part the NLP processing part no longer needs to be part of the KeywordLinkingEngine.
> This issue covers
> * removal of the NLP related functionality from the KeywordLinkingEngine
> * reimplementation of the linking part on top of the API provided by the AnalyzedText contentpart
> * add support for new features of the NLP chain
>     * use lemmas - if available - for entity lookup
>     * use POS tagset mappings to the OLIA ontology to decide what tokens to lookup
> After this change the KeywordLinkingEngine will also be able to work in combination with any NLP framework that is integrated with the Stanbol NLP components (writes its data to the AnalyzedText content part). 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira