Posted to java-user@lucene.apache.org by Donna L Gresh <gr...@us.ibm.com> on 2007/03/29 17:03:27 UTC

normalized scores

Recent questions about whether/how scores are normalized got me wondering
how my application (happily) seems to be doing what I want. I have two
indexes, one which contains text fields which I want to use as queries
into text fields in a second index.

I create a Boolean query based on all the terms in a document in my first
index, with all terms added with a SHOULD condition. I then apply this
query to my second index. I get the hits, and starting from the best hits
I look at the score and (arbitrarily) only report those with a score
greater than 0.3. Otherwise I move on to the next document in my first
index.

For a given query (for a single input document), the highest score is
*not* always 1 (which is just how I want it). Is this because I am using a
Boolean query? Here is my code snippet.


        // For the ith document in my input index:
        TermFreqVector tfv = indexReaderOS.getTermFreqVector(i, "required skills");
        String inputtid = indexReaderOS.document(i).getField("inputid").stringValue();
        if (tfv != null) {
                // OR together every term of this document's "required skills" field
                BooleanQuery bq = new BooleanQuery();
                String[] terms = tfv.getTerms();
                for (int j = 0; j < terms.length; j++) {
                        String term = terms[j];
                        Query query = parser.parse(term);
                        bq.add(query, BooleanClause.Occur.SHOULD);
                }
                Hits hitsR = isearcherR.search(bq);

                // Hits come back best-first, so stop at the first score below the cutoff
                for (int ii = 0; ii < hitsR.length(); ii++) {
                        Document hitRDoc = hitsR.doc(ii);
                        String hitid = hitRDoc.get("empid");
                        float scoreR = hitsR.score(ii);

                        if (scoreR < 0.30) break;
                        outfile.println(inputtid + "," + hitid + "," + scoreR);
                }
        }

 

Donna Gresh

Re: normalized scores

Posted by Donna L Gresh <gr...@us.ibm.com>.

Thanks Erik, that works great--
Donna



>> It is unfortunate that some scores are being normalized and some may
>> not be. Is there a way to obtain the unnormalized score?

> Any IndexSearcher.search method that does not return Hits keeps the
> raw scores.  Try out the TopDocs returning ones or use a HitCollector.

Donna L. Gresh
Services Research, Mathematical Sciences Department
IBM T.J. Watson Research Center
(914) 945-2472
http://www.research.ibm.com/people/g/donnagresh
gresh@us.ibm.com

Re: normalized scores

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Mar 30, 2007, at 8:48 AM, Donna L Gresh wrote:
> It is unfortunate that some scores are being normalized and some may
> not be. Is there a way to obtain the unnormalized score?

Any IndexSearcher.search method that does not return Hits keeps the  
raw scores.  Try out the TopDocs returning ones or use a HitCollector.

	Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: normalized scores

Posted by Chris Hostetter <ho...@fucit.org>.

: I'm well aware that some queries will return no results due to my
: filtering by 0.3. That's the point. I expect that some of my input
: queries will not be a good match to *any* of the documents in my
: second index.

what i'm trying to make sure you understand is that picking 0.3 as an
arbitrary number might make sense for some queries, but not others ... the
scores are inherently not comparable between queries, if you can't
compare score(queryA) with score(queryB) then you also can't fairly
compare score(queryA) with a constant N which you also compare to
score(queryB).
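
To make the point concrete, here is a small Java sketch that applies the same fixed cutoff to two score lists. The class name FixedCutoffSketch, the method keepAbove, and all of the scores are invented for illustration; they are not taken from any real index.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: hypothetical scores, no Lucene dependency.
final class FixedCutoffSketch {

    /** Keeps leading scores >= cutoff; assumes scores are sorted best-first. */
    static List<Float> keepAbove(float[] sortedScores, float cutoff) {
        List<Float> kept = new ArrayList<Float>();
        for (float s : sortedScores) {
            if (s < cutoff) break;   // same early exit as the original loop
            kept.add(s);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Two queries whose scores happen to live on different scales.
        float[] queryA = {0.9f, 0.6f, 0.31f};
        float[] queryB = {0.29f, 0.2f, 0.1f};

        System.out.println("A kept: " + keepAbove(queryA, 0.3f));
        System.out.println("B kept: " + keepAbove(queryB, 0.3f));
    }
}
```

With these made-up numbers every hit for query A survives the cutoff, while the best hit for query B (0.29) is dropped, even though nothing says it is a worse match for B than A's 0.31 hit is for A.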

with so many similar threads, i get confused as to what's already been
said sometimes, it doesn't look like i ever pointed out the FAQ on this
(assuming you haven't already seen it)...

http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-912c1f237bb00259185353182948e5935f0c2f03
http://article.gmane.org/gmane.comp.jakarta.lucene.user/10810


-Hoss



Re: normalized scores

Posted by Donna L Gresh <gr...@us.ibm.com>.

I'm well aware that some queries will return no results due to my
filtering by 0.3. That's the point. I expect that some of my input queries
will not be a good match to *any* of the documents in my second index.

I'm really doing something much like the "Books Like This" example in
Chapter 5 of Lucene in Action (which I saw after I wrote this).
It is unfortunate that some scores are being normalized and some may not
be. Is there a way to obtain the unnormalized score?


Donna Gresh





Chris Hostetter <ho...@fucit.org>
03/29/2007 06:26 PM
Please respond to: java-user@lucene.apache.org
To: java-user@lucene.apache.org
Subject: Re: normalized scores

: For a given query (for a single input document), the highest score is
: *not* always 1 (which is just how
: I want it). Is this because I am using a Boolean query? Here is my code
: snippet.

the Hits class only normalizes scores if the highest score is greater than
1; if it's 1 or less, no normalization happens.

as to your more general question...

: Recent questions about whether/how scores are normalized got me wondering
: how my application (happily) seems to be doing what I want. I have two

it's all a question of what you want ... what you've got is throwing
things out with a score less than 0.3 ... but that's an arbitrary
decision -- there is no mathematical basis for assuming a
document which scores "0.31" against query A is a better match on A than a
doc which scores 0.29 against query B is for B ... they are apples and
oranges.

you can be as arbitrary as you want ... you could decide to ignore every
even numbered hit if you want -- it's entirely your choice, but it's not a
rational choice.


BTW: i hope you realize based on your comment about not all Hits having a
max score of 1, for some queries, the highest scoring doc might not even
have a score above 0.3, in which case you would be ignoring all matches.


-Hoss



Re: normalized scores

Posted by Chris Hostetter <ho...@fucit.org>.

: For a given query (for a single input document), the highest score is
: *not* always 1 (which is just how
: I want it). Is this because I am using a Boolean query? Here is my code
: snippet.

the Hits class only normalizes scores if the highest score is greater than
1; if it's 1 or less, no normalization happens.
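
A minimal Java sketch of that rule, written without Lucene so it can be run on its own. The class name HitsScoreSketch and the sample scores are made up for illustration, and the scaling by 1/maxScore is an assumption about how the Hits class applies the rule; check it against the source of the Lucene version in use.

```java
import java.util.Arrays;

// Sketch of the scaling rule described above; not the actual Lucene source.
final class HitsScoreSketch {

    /** Returns the scores as a Hits-style view would report them. */
    static float[] normalize(float[] rawScores) {
        float max = 0.0f;
        for (float s : rawScores) {
            max = Math.max(max, s);
        }
        // Scale only when the top raw score exceeds 1; otherwise leave scores untouched.
        float norm = (max > 1.0f) ? 1.0f / max : 1.0f;
        float[] reported = new float[rawScores.length];
        for (int i = 0; i < rawScores.length; i++) {
            reported[i] = rawScores[i] * norm;
        }
        return reported;
    }

    public static void main(String[] args) {
        float[] strong = {4.0f, 2.0f, 1.0f};  // top raw score > 1: gets scaled
        float[] weak = {0.5f, 0.25f};         // top raw score <= 1: reported as-is
        System.out.println(Arrays.toString(normalize(strong)));
        System.out.println(Arrays.toString(normalize(weak)));
    }
}
```

This is why the top hit is reported as exactly 1 for some queries and as something smaller for others: only the result sets whose best raw score exceeds 1 are rescaled.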

as to your more general question...

: Recent questions about whether/how scores are normalized got me wondering
: how my application (happily) seems to be doing what I want. I have two

it's all a question of what you want ... what you've got is throwing
things out with a score less than 0.3 ... but that's an arbitrary
decision -- there is no mathematical basis for assuming a
document which scores "0.31" against query A is a better match on A than a
doc which scores 0.29 against query B is for B ... they are apples and
oranges.

you can be as arbitrary as you want ... you could decide to ignore every
even numbered hit if you want -- it's entirely your choice, but it's not a
rational choice.


BTW: i hope you realize based on your comment about not all Hits having a
max score of 1, for some queries, the highest scoring doc might not even
have a score above 0.3, in which case you would be ignoring all matches.


-Hoss

