Posted to general@lucene.apache.org by #MAGGY ANASTASIA SURYANTO# <MA...@ntu.edu.sg> on 2007/07/03 01:38:16 UTC

clarification on booleanScorer

Hi all,
 
I would like to clarify my understanding of the way Lucene scores boolean queries, in relation to the "+" / " " clause attributes (required and optional) as well as the OR and AND operators.
 
After looking at the BooleanScorer source code, the following is my understanding of the scoring:
1. OR is translated into " " (optional) and AND is translated into "+" (required) by QueryParser.
 So, is it true that
(t1 t2 t3) AND (t4 t5 t6)  OR  (t7 t8 t9)  is parsed by QueryParser into the following boolean query?
+(t1 t2 t3) +(t4 t5 t6) (t7 t8 t9)
 
2. Using the default similarity, the score of a document, score(q,d), is the sum of the tf-idf measures of the terms in q that appear in d.
 
3. The score of a document w.r.t. a BooleanClause BC, score(BC,d), is the sum of the scores of the document w.r.t. all sub-clauses of BC.
 
4. There is no difference between "+" clauses and " " clauses in scoring (i.e. their scorer.score() values are summed together to produce their parent's score). However, the addition of the scores of " " clauses is delayed until all "+" clauses are matched by the document. If not all "+" clauses are matched, the document is not retrieved.
      ----C1-----   ----C2-----   ----C3----      -----C4------
q = +{+(t1 t2 t3)   +(t4 t5 t6)   (t7 t8 t9)}     {t10 t11 t12}

Assuming a document d matches C1 and C2, then s(q,d) = sum(sum(s(C1,d), s(C2,d), s(C3,d)), s(C4,d))
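
To make point 1 concrete, here is a toy model of the AND/OR mapping I have in mind. It is only a sketch of my understanding, not Lucene code: flatten_clauses is a hypothetical helper, the default operator is assumed to be OR, and NOT and explicit +/- modifiers are ignored.

```python
def flatten_clauses(tokens):
    """Map a flat 'A AND B OR C' token list to '+'/' ' clause flags.

    tokens alternates operands and operators, for example
    ["(t1 t2 t3)", "AND", "(t4 t5 t6)", "OR", "(t7 t8 t9)"].
    """
    clauses = []   # each entry is [required, operand]
    conj = None    # operator seen just before the current operand
    for tok in tokens:
        if tok in ("AND", "OR"):
            conj = tok
            continue
        required = conj == "AND"
        if required and clauses:
            # AND also makes the clause on its left required.
            clauses[-1][0] = True
        clauses.append([required, tok])
        conj = None
    return " ".join(("+" if req else "") + op for req, op in clauses)


print(flatten_clauses(["(t1 t2 t3)", "AND", "(t4 t5 t6)", "OR", "(t7 t8 t9)"]))
# +(t1 t2 t3) +(t4 t5 t6) (t7 t8 t9)
```

Under these assumptions the operators are handled left to right as flags on neighbouring clauses, with no precedence between AND and OR.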
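
Points 3 and 4 can be sketched as a small recursive model. This is my own simplification and not the BooleanScorer implementation: Term, Bool, or_group and score are hypothetical names, per-term scores are passed in directly, and any other factor that Lucene applies is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union


@dataclass
class Term:
    name: str


@dataclass
class Bool:
    required: List["Node"] = field(default_factory=list)   # "+" clauses
    optional: List["Node"] = field(default_factory=list)   # " " clauses


Node = Union[Term, Bool]


def score(node: Node, term_scores: Dict[str, float]) -> Optional[float]:
    """Return the summed score, or None if the document does not match."""
    if isinstance(node, Term):
        return term_scores.get(node.name)
    total = 0.0
    for child in node.required:
        s = score(child, term_scores)
        if s is None:
            return None          # a "+" clause failed: not retrieved
        total += s
    matched_optional = 0
    for child in node.optional:
        s = score(child, term_scores)
        if s is not None:
            total += s
            matched_optional += 1
    if not node.required and matched_optional == 0:
        return None              # purely optional group with no match
    return total


def or_group(*names: str) -> Bool:
    return Bool(optional=[Term(n) for n in names])


c1, c2, c3 = or_group("t1", "t2", "t3"), or_group("t4", "t5", "t6"), or_group("t7", "t8", "t9")
c4 = or_group("t10", "t11", "t12")
q = Bool(required=[Bool(required=[c1, c2], optional=[c3])], optional=[c4])

print(score(q, {"t1": 1.0, "t5": 2.0, "t10": 0.5}))  # 3.5
print(score(q, {"t1": 1.0, "t7": 3.0}))              # None (C2 is not matched)
```

The first document matches C1, C2 and C4, so its score is the plain sum of those clause scores. The second misses the required C2 and is dropped even though it matches C3.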
 
Please let me know whether the above is correct. If there is anything I have misunderstood about the scoring in BooleanScorer, please let me know.
 
 
best regards
 
maggy
 
 
 

RE: clarification on booleanScorer

Posted by #MAGGY ANASTASIA SURYANTO# <MA...@ntu.edu.sg>.
Hi Paul,

 
Thanks a lot for the reply. Are there any published writings (a paper or journal article) on Lucene scoring? Did the Lucene developers/designers refer to any paper or tested method, or did they propose a new scoring formula that is specific to Lucene?
 
I am asking because I need to cite a reference in the report I am currently writing.
 
regards
 
maggy

________________________________

From: Paul Elschot [mailto:paul.elschot@xs4all.nl]
Sent: Tue 7/3/2007 2:56 PM
To: general@lucene.apache.org
Subject: Re: clarification on booleanScorer



Maggy,

On java-user@lucene.apache.org there is normally a higher chance of
getting a response.

You may have missed this:
http://lucene.apache.org/java/docs/scoring.html

Your analysis is correct; only a few points need to be added:
- the coordination factor, which favours more matching clauses
  (for prefix queries normally no coordination is used),
- your examples are nested boolean queries, so all this applies on
  each level, and
- the idf computation is a bit more involved, see the reference above.

Regards,
Paul Elschot





Re: clarification on booleanScorer

Posted by Paul Elschot <pa...@xs4all.nl>.
Maggy,

On java-user@lucene.apache.org there is normally a higher chance of
getting a response.

You may have missed this:
http://lucene.apache.org/java/docs/scoring.html

Your analysis is correct; only a few points need to be added:
- the coordination factor, which favours more matching clauses
  (for prefix queries normally no coordination is used),
- your examples are nested boolean queries, so all this applies on
  each level, and
- the idf computation is a bit more involved, see the reference above.
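
As a rough illustration of the first and third points, here is a minimal sketch. It assumes the DefaultSimilarity definitions as I understand them from the scoring page: a square-root tf, a logarithmic idf, and a coordination factor of matched clauses over total clauses. The function names are only for this example, and query normalisation, boosts and field norms are left out.

```python
import math


def tf(freq):
    """Term frequency factor: grows with the square root of the raw count."""
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    """Inverse document frequency: rarer terms get a larger weight."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def coord(overlap, max_overlap):
    """Fraction of the clauses of one boolean query that matched."""
    return overlap / max_overlap


def boolean_score(clause_scores):
    """clause_scores holds one entry per clause, None when it did not match."""
    matched = [s for s in clause_scores if s is not None]
    if not matched:
        return None
    return coord(len(matched), len(clause_scores)) * sum(matched)


print(boolean_score([1.0, None, 1.0, None]))  # 1.0 (2 of 4 clauses, sum 2.0)
print(boolean_score([1.0, 1.0]))              # 2.0 (all clauses matched)
```

With the same summed clause scores, the document that matches a larger share of the clauses ends up with the higher score. In a nested query the factor would be applied at each level.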

Regards,
Paul Elschot

