Posted to solr-user@lucene.apache.org by Andrew Nagy <an...@villanova.edu> on 2007/01/23 22:05:47 UTC

relevance ranking and scoring

I have 2 questions about the SOLR relevancy system.

1. Why is it that when I search for the exact title of a record I have,
it generally does not come up as the 1st record in the results?

ex: title:(gone with the wind), the record comes up 3rd.  A record with 
the term "wind" as the first word in the title comes up 1st.
ex: title:"gone with the wind", the record comes up 1st.

Is this because the word "wind" is the only noun?

2. The "score" that is associated with each record is quite odd; what
does it represent?  I generally get results with the top record being
somewhere around 3.0 or 2.0 and most records are below 1.


Thanks!
Andrew

Re: relevance ranking and scoring

Posted by Chris Hostetter <ho...@fucit.org>.

: > title:(gone with the wind)^3.0 OR title2:(gone with the wind)
: That did it!  Thanks for the Help!
: What value do the numbers carry in the ranking?  I arbitrarily choose
: the number 5 cause it's an easy number :)

query boosts are in fact pretty arbitrary ... what you should pick really
depends on what boosts you put on other clauses, and what kinds of values
the tf, idf, and coord functions of your Similarity are going to return.

: I am a bit nervous about the dismax query system as I have quite a bit
: of other content that could skew the results.

i'm really not sure what you mean by that ... dismax will only look at the
fields you tell it to, and the factors that contribute to the score of each
term/document pair in a dismax query will be the same as those from the
standard request handler -- the only difference is how those individual
TermQuery scores are combined.

: What's the difference between the dismax query handler and listing all of
: the fields in my search and separating them with an OR?

the best way to understand this is to look at the debug output you get
from each query, and read the "Explanation" section ... some of the deep
details may not make much sense, but the overall structure of the score
calculation should be helpful
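
One way to get the two debug outputs side by side is to build both request URLs with debugQuery=on and compare the explanations. This is only a sketch: the host, port, and the handler name "dismax" (selected with qt) are assumptions that depend on your solrconfig.xml, and the field names are the ones from the example in this message.

```python
from urllib.parse import urlencode

# Assumption: a local Solr instance; adjust host, port and path to your setup.
BASE = "http://localhost:8983/solr/select"

def debug_url(params):
    """Build a select URL with debugQuery=on so the response carries score explanations."""
    return BASE + "?" + urlencode({**params, "debugQuery": "on"})

# Standard request handler: every field:term clause is spelled out in q.
standard = debug_url({"q": "title:(foo bar) other:(foo bar)"})

# Dismax: plain words in q, fields listed in qf.
# Assumption: a handler registered under the name "dismax".
dismax = debug_url({"q": "foo bar", "qt": "dismax", "qf": "title other"})

print(standard)
print(dismax)
```

Fetching each URL and reading the explanation for the same document should show where the two handlers diverge.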

in a nutshell, when you ask the StandardRequestHandler for docs
matching...
     q = title:(foo bar) other:(foo bar)

if a document matches title:foo, other:foo, and other:bar then the
score for that document is (essentially) the sum of the scores from
matching the individual terms

with dismax, if you ask for

     q = foo bar  & qf = title other

then the score for the same document is different: the matches on
the word "foo" are considered together regardless of field, and only the
field that resulted in the highest score is used (with a small portion of
the matches on the other fields being included to help break ties).  the
score contribution from matching on other:bar is basically the same as
before.
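
As a rough illustration of the two ways of combining scores, here is a small sketch. The per-term scores and the tie value of 0.1 are made up for the example; real values come from the Similarity (tf, idf, norms), and factors such as coord and query normalization are ignored here.

```python
# Hypothetical per-term scores for one document, invented for illustration.
term_scores = {
    ("title", "foo"): 0.8,
    ("other", "foo"): 0.5,
    ("other", "bar"): 0.4,
}

def standard_score(scores):
    """All field:term clauses OR'ed together: roughly the sum of the matches."""
    return sum(scores.values())

def dismax_score(scores, tie=0.1):
    """Per word: the best field score plus tie * the scores of the other fields."""
    by_word = {}
    for (field, word), s in scores.items():
        by_word.setdefault(word, []).append(s)
    total = 0.0
    for field_scores in by_word.values():
        best = max(field_scores)
        total += best + tie * (sum(field_scores) - best)
    return total

print(standard_score(term_scores))
print(dismax_score(term_scores))
```

With these numbers the summed score is about 1.7 and the dismax score about 1.25: the title:foo match dominates the word "foo", other:foo adds only a tie-breaking fraction, and other:bar contributes the same 0.4 either way.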

The driving motivation for the DisjunctionMaxQuery was so that if you
wanted to search for the words "Java" or "Lucene" in 3 different fields
(title, description, and body), a document that matched Lucene once in the
body field, but matched Java dozens of times and at least once in each
field, wouldn't overshadow a document that matched both Lucene and Java
just once in each field.
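
To make that scenario concrete, here is a sketch with two hypothetical documents. All the scores are invented to show the effect and the tie value of 0.1 is an assumption; with other numbers the ordering may not flip, dismax only reduces how much the repeated Java matches can pile up.

```python
# Hypothetical scores, invented for illustration: doc -> word -> field -> score.
docs = {
    "A": {  # Java many times in every field, Lucene once in the body
        "java": {"title": 0.7, "description": 0.7, "body": 0.8},
        "lucene": {"body": 0.3},
    },
    "B": {  # Java and Lucene once in each field
        "java": {"title": 0.3, "description": 0.3, "body": 0.3},
        "lucene": {"title": 0.9, "description": 0.4, "body": 0.2},
    },
}

def sum_score(words):
    """Every field:term match simply adds to the total."""
    return sum(s for fields in words.values() for s in fields.values())

def dismax_score(words, tie=0.1):
    """Per word: best field score plus tie * the remaining field scores."""
    total = 0.0
    for fields in words.values():
        best = max(fields.values())
        total += best + tie * (sum(fields.values()) - best)
    return total

def ranking(score_fn):
    """Document ids ordered from highest to lowest score."""
    return sorted(docs, key=lambda d: score_fn(docs[d]), reverse=True)

print(ranking(sum_score))     # A first: its many Java matches add up
print(ranking(dismax_score))  # B first: only the best field per word counts fully
```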


-Hoss