Posted to commits@stanbol.apache.org by "Rupert Westenthaler (JIRA)" <ji...@apache.org> on 2012/07/11 17:06:33 UTC

[jira] [Created] (STANBOL-685) Improve POS tag handling of the KeywordLinkingEngine

Rupert Westenthaler created STANBOL-685:
-------------------------------------------

             Summary: Improve POS tag handling of the KeywordLinkingEngine
                 Key: STANBOL-685
                 URL: https://issues.apache.org/jira/browse/STANBOL-685
             Project: Stanbol
          Issue Type: Improvement
          Components: Engine - KeywordExtraction
            Reporter: Rupert Westenthaler
            Assignee: Rupert Westenthaler
            Priority: Minor


The KeywordLinkingEngine can make use of POS tags to decide if a Token (word) needs to be processed or can be skipped. If no POS tags are available or the POS tag probability is too low (currently the default is 0.8), then the minimum token length (default is 3) is used as a fall-back.
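
The current behaviour described above could be sketched roughly as follows. This is only an illustration, not the actual engine code: the class and method names are hypothetical, and it assumes that only nouns are processed, that the probability threshold is inclusive, and that a token of exactly the minimum length passes the fall-back.

```java
// Hypothetical sketch of the current token filter; not the real Stanbol code.
final class TokenFilterSketch {

    // Defaults taken from the issue description.
    static final double MIN_POS_TAG_PROBABILITY = 0.8;
    static final int MIN_TOKEN_LENGTH = 3;

    /**
     * @param token       the token text
     * @param isNoun      TRUE/FALSE if a POS tag is available, null if not
     * @param probability probability of the POS tag (ignored if isNoun is null)
     * @return true if the token should be processed
     */
    static boolean isProcessable(String token, Boolean isNoun, double probability) {
        if (isNoun != null && probability >= MIN_POS_TAG_PROBABILITY) {
            // The POS tag is trusted: process nouns, skip everything else (assumption).
            return isNoun;
        }
        // No tag or low probability: fall back to the minimum token length
        // (assumed inclusive here).
        return token.length() >= MIN_TOKEN_LENGTH;
    }
}
```

With these assumptions, a non-noun such as "running" tagged with probability 0.5 is still processed, because the fall-back only looks at its length.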

Analysis of POS tagging results has shown that tokens with non-noun tags were often below the 0.8 limit. For those, the fall-back was used, and in most cases this resulted in the KeywordLinkingEngine processing those tokens.

However, it can also be observed that while some of those POS tags were not correct, the incorrect tags were usually confusions between two non-noun tags. Because of that, decreasing the minimum probability for accepting a non-noun POS tag can improve both results and processing time.

Because of that, the algorithm will be adjusted as follows:

Introduce two tag probabilities:

1. "minPosTypeProb" for accepting POS tags that represent nouns, and
2. "minPosTypeProb/2" for rejecting POS tags that are not nouns

Assuming that <code>minPosTypeProb=0.667</code>, a

 * noun with probability 0.8 would return <code>true</code>
 * noun with probability 0.5 would return <code>null</code>
 * verb with probability 0.4 would return <code>false</code>
 * verb with probability 0.3 would return <code>null</code>

NOTE: <code>null</code> indicates that no POS tag is available or the POS tag has a low probability
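
The two thresholds and the three possible results could be sketched like this. It is a minimal illustration under assumptions, not the code of the methods named below: the class and method names are hypothetical, and both thresholds are assumed to be inclusive.

```java
// Hypothetical sketch of the proposed tri-state POS decision; not the real Stanbol code.
final class PosTagDecisionSketch {

    /**
     * @param isNoun         whether the POS tag represents a noun
     * @param probability    probability reported for the POS tag
     * @param minPosTypeProb minimum probability for accepting a noun tag
     * @return TRUE  if the token is accepted as a noun,
     *         FALSE if the token is rejected as a non-noun,
     *         null  if the tag probability is too low to decide
     */
    static Boolean classify(boolean isNoun, double probability, double minPosTypeProb) {
        if (isNoun) {
            // Nouns need the full minimum probability to be accepted.
            return probability >= minPosTypeProb ? Boolean.TRUE : null;
        }
        // Non-nouns are already rejected at half the minimum probability.
        return probability >= minPosTypeProb / 2 ? Boolean.FALSE : null;
    }
}
```

Called with <code>minPosTypeProb=0.667</code>, this sketch reproduces the four cases listed above. A caller would treat <code>null</code> the same way as a missing POS tag and apply the minimum token length fall-back.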

These changes will need to be applied to the "OpenNlpAnalysedContentFactory#processPOS(..)" and the "EntityLinker#isProcessableToken(..)" methods.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Resolved] (STANBOL-685) Improve POS tag handling of the KeywordLinkingEngine

Posted by "Rupert Westenthaler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/STANBOL-685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rupert Westenthaler resolved STANBOL-685.
-----------------------------------------

    Resolution: Fixed

Fixed with revision 1360296.
                
