Posted to commits@stanbol.apache.org by "Rupert Westenthaler (JIRA)" <ji...@apache.org> on 2013/06/03 14:31:19 UTC

[jira] [Commented] (STANBOL-1049) Add support for Upper Case Linking for Languages without NLP support

    [ https://issues.apache.org/jira/browse/STANBOL-1049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673067#comment-13673067 ] 

Rupert Westenthaler commented on STANBOL-1049:
----------------------------------------------

As noted by Joseph M'Bimbi-Bene in http://markmail.org/message/erubqmhwytp7mxoa, the property

    enhancer.engines.linking.linkOnlyUpperCaseTokensWithMissingPosTag 

interferes with the upper case parameter ('uc={NONE/MATCH/LINK}') supported by the Text Processing configuration.

To avoid this, it needs to be investigated whether the functionality described by this issue can also be implemented by using the 'enhancer.engines.linking.minSearchTokenLength' property in combination with the value of the 'uc' parameter of the Text Processing configuration.
                
> Add support for Upper Case Linking for Languages without NLP support
> --------------------------------------------------------------------
>
>                 Key: STANBOL-1049
>                 URL: https://issues.apache.org/jira/browse/STANBOL-1049
>             Project: Stanbol
>          Issue Type: Improvement
>          Components: Enhancement Engines
>            Reporter: Rupert Westenthaler
>            Assignee: Rupert Westenthaler
>
> This issue will allow the EntityLinkingEngine to use upper case token information when linking text in languages without NLP support.
> If TextProcessingConfig#linkOnlyUpperCaseTokensWithMissingPosTag is enabled AND the language of the processed text uses a bicameral script (an alphabet with upper case letters), only upper case tokens that are equal to or longer than TextProcessingConfig#minSearchTokenLength will be marked as 'linkable'. This avoids vocabulary lookups for lower case tokens and therefore dramatically improves performance when processing languages without POS tagging support.
> ---
> Definitions:
> -------
> The EntityLinking Engine distinguishes three Token Types [1]:
> * Linkable Token: A Word that triggers a lookup in the Controlled Vocabulary
> * Matchable Token: A Word that is used to search and match Entities, but does not trigger a lookup
> * Other Tokens: Not used for search and matching. Might be used for fine tuning confidence values.
> Language level information includes
> * isUnicaseScript [true,false]: If the processed language uses a unicase script, i.e. one that has no upper case letters
> Token level information includes
> * hasLinkablePos [true,null,false]: If a POS tag matches the linkable POS
> * hasMatchablePos [true,null,false]: If a POS tag matches the processable POS
> * isUpperCase [true,false]: If the first letter is an upper case one
> * hasAlphaNumeric [true,false]: If the word has an alphanumeric character
> * hasSearchableLength [true,false]: If the word is at least as long as the configured "Min Search Token Length"
> * isSubSentenceStart [true,false]: If the POS tag of a Token is Pos#Quote.
> Algorithm:
> ------
> This section describes the algorithm used to classify Tokens as linkable, matchable or other based on the above properties. Rules are applied in the given order. A summary of the result for Tokens with no POS tags is given in the next section.
> __1. Basic rules:__
> * all tokens without an alphanumeric character are neither linkable nor matchable
> * all tokens with hasLinkablePos are linkable
> * all linkable tokens and tokens with hasMatchablePos are matchable
> __2. Uppercase Processing Rules__
> These rules are applied to upper case tokens that are not at a sentence or sub-sentence start
> * if TextProcessingConfig#LinkUpperCaseTokens is enabled
>     * all tokens with hasMatchablePos == true are also marked as linkable
>     * all tokens with hasMatchablePos == false are marked as matchable
> * if TextProcessingConfig#MatchUpperCaseTokens is enabled
>     * all tokens with hasMatchablePos == false are marked as matchable
> __3. Unknown POS tag Rules__
> These rules only apply to tokens that have alphanumeric characters and where both hasLinkablePos == null and hasMatchablePos == null
> * if the processed language uses a unicase script or TextProcessingConfig#linkOnlyUpperCaseTokensWithMissingPosTag is disabled
>     * all tokens equal to or longer than TextProcessingConfig#minSearchTokenLength are marked as linkable
> * else - bicameral script and TextProcessingConfig#linkOnlyUpperCaseTokensWithMissingPosTag is enabled
>     * if upper case token and not sentence or sub-sentence start
>         * tokens equal to or longer than TextProcessingConfig#minSearchTokenLength are marked as linkable
>         * tokens shorter than TextProcessingConfig#minSearchTokenLength are marked as matchable
>     * else - lower case token or sentence or sub-sentence start
>         * tokens equal to or longer than TextProcessingConfig#minSearchTokenLength are marked as matchable
> Languages without NLP support
> -----
> For languages without NLP processing support - meaning that no POS tagging is available - the following configurations are important
> * linkOnlyUpperCaseTokensWithMissingPosTag: This indicates that only upper case tokens should be considered for linking. Note that this option is ignored for languages with a unicase script - scripts that do not use upper case characters.
> * minSearchTokenLength: This indicates that only words with at least the configured number of characters should be considered for linking
> By default 'linkOnlyUpperCaseTokensWithMissingPosTag' has the same value as the 'properNounsState' configuration. This means that if the "link only proper nouns" option is enabled, only upper case tokens will be linked for languages without POS support. The default for minSearchTokenLength is 3 characters.
> [1] http://stanbol.apache.org/docs/trunk/components/enhancer/engines/entitylinking#token-types
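
The classification rules in the quoted description can be sketched in Java roughly as follows. This is only an illustration of the rule order as described in the issue text, not the actual Stanbol implementation: the class TokenClassifierSketch, the Config holder, the TokenType enum and the classify method are hypothetical names, and the token properties are passed in as plain parameters instead of being read from an analysed text. The length check uses "equal or longer", as stated in the algorithm section.

```java
// Illustrative sketch only; all names here are hypothetical, not Stanbol API.
class TokenClassifierSketch {

    /** The three token types named in the issue description. */
    enum TokenType { LINKABLE, MATCHABLE, OTHER }

    /** Hypothetical holder for the configuration values the rules refer to. */
    static final class Config {
        boolean linkUpperCaseTokens;
        boolean matchUpperCaseTokens;
        boolean linkOnlyUpperCaseTokensWithMissingPosTag;
        int minSearchTokenLength = 3; // default stated in the issue
    }

    /** hasLinkablePos / hasMatchablePos are null when no POS tag is available. */
    static TokenType classify(Config cfg, boolean unicaseScript,
            boolean hasAlphaNumeric, Boolean hasLinkablePos, Boolean hasMatchablePos,
            boolean upperCase, boolean sentenceOrSubSentenceStart, int length) {
        // 1. Basic rules
        if (!hasAlphaNumeric) {
            return TokenType.OTHER;
        }
        boolean linkable = Boolean.TRUE.equals(hasLinkablePos);
        boolean matchable = linkable || Boolean.TRUE.equals(hasMatchablePos);

        // 2. Upper case rules: only for upper case tokens inside a sentence
        boolean upperCaseInSentence = upperCase && !sentenceOrSubSentenceStart;
        if (upperCaseInSentence) {
            if (cfg.linkUpperCaseTokens) {
                if (Boolean.TRUE.equals(hasMatchablePos)) {
                    linkable = true;
                } else if (Boolean.FALSE.equals(hasMatchablePos)) {
                    matchable = true;
                }
            }
            if (cfg.matchUpperCaseTokens && Boolean.FALSE.equals(hasMatchablePos)) {
                matchable = true;
            }
        }

        // 3. Unknown POS tag rules: both POS flags are null
        if (hasLinkablePos == null && hasMatchablePos == null) {
            boolean searchable = length >= cfg.minSearchTokenLength;
            if (unicaseScript || !cfg.linkOnlyUpperCaseTokensWithMissingPosTag) {
                if (searchable) {
                    linkable = true;
                }
            } else if (upperCaseInSentence) {
                if (searchable) {
                    linkable = true;
                } else {
                    matchable = true;
                }
            } else if (searchable) {
                matchable = true;
            }
        }

        // Every linkable token is also matchable, so report the stronger type.
        if (linkable) {
            return TokenType.LINKABLE;
        }
        return matchable ? TokenType.MATCHABLE : TokenType.OTHER;
    }

    public static void main(String[] args) {
        Config cfg = new Config();
        cfg.linkOnlyUpperCaseTokensWithMissingPosTag = true;
        // Upper case, mid-sentence, 5 characters, no POS tag
        System.out.println(classify(cfg, false, true, null, null, true, false, 5));
        // Lower case, 5 characters, no POS tag
        System.out.println(classify(cfg, false, true, null, null, false, false, 5));
    }
}
```

With the option enabled and a bicameral script, the first call falls into the "upper case token and not sentence start" branch of rule 3, while the second falls into the "lower case token" branch, so only the first token would trigger a vocabulary lookup.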
