Posted to commits@stanbol.apache.org by "Rupert Westenthaler (JIRA)" <ji...@apache.org> on 2012/05/22 07:49:40 UTC

[jira] [Created] (STANBOL-624) The NamedEntityTagging engine should use confidence values between [0..1]

Rupert Westenthaler created STANBOL-624:
-------------------------------------------

             Summary: The NamedEntityTagging engine should use confidence values between [0..1]
                 Key: STANBOL-624
                 URL: https://issues.apache.org/jira/browse/STANBOL-624
             Project: Stanbol
          Issue Type: Bug
          Components: Enhancer
    Affects Versions: 0.9.0-incubating
            Reporter: Rupert Westenthaler
            Assignee: Rupert Westenthaler
             Fix For: 0.10.0-incubating


Currently the Solr result scores are used as confidence. Only exact matches are sorted in front of partial matches. However, Solr result scores are not within the range [0..1], which makes it hard for clients to process confidence values.

The suggestion is to use the following algorithm to "normalize" the confidence values of this engine:

* score ... the Solr result score of the current entity
* maxScore ... the highest Solr result score
* maxExactScore ... the highest Solr result score of an Entity that exactly matches the fise:selected-text
* levenshteinSimilarity ... 1 - LevenshteinDistance(selectedText,label)/Math.max(selectedText.length(),label.length()), so that an exact match yields 1
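
As a rough illustration of the last definition, the similarity could be computed as in the sketch below. The class and method names (LabelSimilarity, distance, similarity) are made up for this example and are not taken from the Stanbol code base. The sketch assumes the similarity is meant as 1 minus the length-normalized edit distance, so that identical strings yield 1, and it treats two empty strings as identical.

```java
final class LabelSimilarity {

    private LabelSimilarity() {}

    /** Classic dynamic-programming Levenshtein distance, keeping two rows. */
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // the row just computed becomes the previous row
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in [0..1]; 1.0 means the two strings are identical. */
    public static double similarity(String selectedText, String label) {
        int maxLength = Math.max(selectedText.length(), label.length());
        if (maxLength == 0) {
            return 1.0; // assumption: two empty strings count as an exact match
        }
        return 1.0 - (double) distance(selectedText, label) / maxLength;
    }
}
```

For example, "Paris" and "Pariz" differ by one substitution over five characters, which gives a similarity of 0.8.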

The normalized score is calculated as follows:

    if (levenshteinSimilarity == 1) // exact match
        score = score / maxExactScore;
    else // partial match
        score = score * levenshteinSimilarity / maxScore;

This ensures that:

* If there is an exact match, the best one will have a confidence of 1.0
* If there are multiple exact matches, they will be rated based on their Solr result scores (normalized to 1 using the result score of the best exact match as base)
* All partial matches will have a score <= the levenshteinSimilarity
* Partial matches are normalized by using the max result score (regardless of whether the result with the max Solr result score is an exact match or not).
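
The normalization and the properties listed above can be tried out with a small sketch like the following. ConfidenceNormalizer and its parameter names are hypothetical and only mirror the variables defined in this issue; this is not the committed implementation. It assumes that all Solr scores are positive and that maxExactScore is only used when at least one exact match exists.

```java
final class ConfidenceNormalizer {

    private ConfidenceNormalizer() {}

    /**
     * Maps a raw Solr result score to a confidence in [0..1].
     *
     * @param score         raw Solr result score of the current entity
     * @param similarity    label similarity in [0..1]; 1.0 marks an exact match
     * @param maxScore      highest raw score over all results (assumed > 0)
     * @param maxExactScore highest raw score over the exact matches (assumed > 0)
     */
    public static double normalize(double score, double similarity,
                                   double maxScore, double maxExactScore) {
        if (similarity == 1.0) { // exact match: rank against the best exact match
            return score / maxExactScore;
        }
        // partial match: scale by similarity, rank against the best overall score
        return score * similarity / maxScore;
    }
}
```

With maxScore = 8.0 and maxExactScore = 4.0, an exact match scoring 4.0 gets confidence 1.0, a second exact match scoring 2.0 gets 0.5, and a partial match scoring 8.0 with similarity 0.8 gets 0.8, which does not exceed its similarity.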

Note: This resembles a disambiguation based on the label of the Entity as well as possible Document Boosts in the Solr index. This is NOT intended to be a real Entity Disambiguation algorithm.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (STANBOL-624) The NamedEntityTagging engine should use confidence values between [0..1]

Posted by "Fabian Christ (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/STANBOL-624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fabian Christ updated STANBOL-624:
----------------------------------

    Component/s: Enhancer
    
> The NamedEntityTagging engine should use confidence values between [0..1]
> -------------------------------------------------------------------------
>
>                 Key: STANBOL-624
>                 URL: https://issues.apache.org/jira/browse/STANBOL-624
>             Project: Stanbol
>          Issue Type: Bug
>          Components: Engine - EntityTagging, Enhancer
>    Affects Versions: 0.9.0-incubating
>            Reporter: Rupert Westenthaler
>            Assignee: Rupert Westenthaler


[jira] [Updated] (STANBOL-624) The NamedEntityTagging engine should use confidence values between [0..1]

Posted by "Fabian Christ (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/STANBOL-624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fabian Christ updated STANBOL-624:
----------------------------------

      Component/s:     (was: Enhancer)
                   Engine - EntityTagging
    Fix Version/s:     (was: enhancer-0.10.0-incubating)
    
> The NamedEntityTagging engine should use confidence values between [0..1]
> -------------------------------------------------------------------------
>
>                 Key: STANBOL-624
>                 URL: https://issues.apache.org/jira/browse/STANBOL-624
>             Project: Stanbol
>          Issue Type: Bug
>          Components: Engine - EntityTagging
>    Affects Versions: 0.9.0-incubating
>            Reporter: Rupert Westenthaler
>            Assignee: Rupert Westenthaler


[jira] [Resolved] (STANBOL-624) The NamedEntityTagging engine should use confidence values between [0..1]

Posted by "Rupert Westenthaler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/STANBOL-624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rupert Westenthaler resolved STANBOL-624.
-----------------------------------------

    Resolution: Fixed

implemented with 1341438
                
> The NamedEntityTagging engine should use confidence values between [0..1]
> -------------------------------------------------------------------------
>
>                 Key: STANBOL-624
>                 URL: https://issues.apache.org/jira/browse/STANBOL-624
>             Project: Stanbol
>          Issue Type: Bug
>          Components: Enhancer
>    Affects Versions: 0.9.0-incubating
>            Reporter: Rupert Westenthaler
>            Assignee: Rupert Westenthaler
>             Fix For: 0.10.0-incubating
