Posted to commits@stanbol.apache.org by "Rupert Westenthaler (JIRA)" <ji...@apache.org> on 2013/04/12 09:53:15 UTC
[jira] [Created] (STANBOL-1027) SolrYard should compensate for Solr lengthNorm on multi valued fields

Rupert Westenthaler created STANBOL-1027:
--------------------------------------------

             Summary: SolrYard should compensate for Solr lengthNorm on multi valued fields
                 Key: STANBOL-1027
                 URL: https://issues.apache.org/jira/browse/STANBOL-1027
             Project: Stanbol
          Issue Type: Improvement
          Components: Entityhub
    Affects Versions: entityhub-0.11.0
            Reporter: Rupert Westenthaler
            Assignee: Rupert Westenthaler


Currently Entities with multiple labels are ranked down for queries unless omitNorms is enabled in the Solr schema.xml. Typically the solution to this problem is to enable omitNorms for such fields. However, this also means that no index time boosts can be applied to such fields.

However for Entity lookups of the Stanbol Enhancer it is important to have both:

* index time boosts based on the ranking of the Entity within the knowledge base (e.g. the number of incoming links in dbpedia)
* compensation for the Solr lengthNorm in case of multiple labels

This is especially important because very prominent Entities often have many alternate labels defined and therefore suffer most from this.


from http://lucene.472066.n3.nabble.com/QueryNorm-and-FieldNorm-td1992964.html

> 1. lengthNorm = measure of the importance of a term according to the total number of terms in the field
>   1. Implementation: 1/sqrt(numTerms)
>   2. Implication: a term matched in fields with less terms have a higher score
>   3. Rationale: a term in a field with less terms is more important than one with more
> 2. boost (index) = boost of the field at index-time
>   1. Index time boost specified. The fieldNorm value in the score would include the same.
> 3. boost (query) = boost of the field at query-time

As an example we will use the Entity "Barack Obama" on freebase.com, which has 20 alternate names for the English language. All 21 labels (1 label + 20 alternate) sum up to 56 tokens, of which 19 are unique. We will compare it with another Entity also named "Barack Obama" that does not have any alternate labels.

When indexing all those labels to a single (multivalued) text field we end up with a lengthNorm of 1/sqrt(56) = 0.134 for the first Entity, while the 2nd Entity has a lengthNorm of 1/sqrt(2) = 0.707. This means that the 2nd Entity is boosted relative to the first by a factor of ~5 (500%).
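
The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the quoted formula 1/sqrt(numTerms); the function name `length_norm` is made up for illustration and is not part of Solr or Stanbol, and the sketch ignores any precision loss from the way Lucene stores norms.

```python
import math


def length_norm(num_terms: int) -> float:
    """lengthNorm as quoted above: 1/sqrt(numTerms). Hypothetical helper."""
    if num_terms <= 0:
        raise ValueError("num_terms must be positive")
    return 1.0 / math.sqrt(num_terms)


# Token counts taken from the example: 56 tokens over 21 labels
# versus a single label with two tokens ("Barack Obama").
norm_many_labels = length_norm(56)
norm_single_label = length_norm(2)

print(f"{norm_many_labels:.3f}")                      # 0.134
print(f"{norm_single_label:.3f}")                     # 0.707
print(f"{norm_single_label / norm_many_labels:.2f}")  # 5.29
```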

To compensate for this, the suggestion is to apply an index time boost of sqrt(numLabels). Doing so would change the situation as follows:

Entity 1 would get a combined norm of sqrt(20)/sqrt(56) = 0.598, while the 2nd Entity would stay at 0.707, resulting in a boost of the 2nd Entity by about 20%.
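
A small Python sketch of the proposed compensation, assuming the combined norm is simply the index time boost sqrt(numLabels) multiplied by the lengthNorm 1/sqrt(numTerms). The name `compensated_norm` is hypothetical and not an existing Solr or Stanbol API. The calculation uses 20 labels for the first Entity, as in the text above; counting all 21 labels instead would give about 0.612.

```python
import math


def compensated_norm(num_labels: int, num_terms: int) -> float:
    """Index time boost sqrt(num_labels) times lengthNorm 1/sqrt(num_terms).

    Hypothetical helper for illustration only.
    """
    if num_labels <= 0 or num_terms <= 0:
        raise ValueError("num_labels and num_terms must be positive")
    return math.sqrt(num_labels) / math.sqrt(num_terms)


entity_1 = compensated_norm(20, 56)  # label count as used in the text
entity_2 = compensated_norm(1, 2)    # single label, boost of sqrt(1) = 1

print(round(entity_1, 3))             # 0.598
print(round(entity_2, 3))             # 0.707
print(round(entity_2 / entity_1, 2))  # 1.18
```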

NOTE: based on my reading of the Lucene/Solr documentation, the fact that the term "Obama" is contained 16 times in the labels of the first Entity does not influence the scoring of term queries.
