Posted to solr-user@lucene.apache.org by Andrew Nagy <an...@villanova.edu> on 2007/01/23 22:05:47 UTC

relevance ranking and scoring

I have 2 questions about the SOLR relevancy system.

1. Why is it that when I search for the exact title of a record I have,
it generally does not come up as the 1st record in the results?

ex: title:(gone with the wind), the record comes up 3rd.  A record with 
the term "wind" as the first word in the title comes up 1st.
ex: title:"gone with the wind", the record comes up 1st.

Is this because the word "wind" is the only noun?

2. The "score" that is associated with each value is quite odd. What
does it represent?  I generally get results with the top record being
somewhere around 3.0 or 2.0 and most records are below 1.


Thanks!
Andrew



Re: relevance ranking and scoring

Posted by Chris Hostetter <ho...@fucit.org>.
: > title:(gone with the wind)^3.0 OR title2:(gone with the wind)
: That did it!  Thanks for the Help!
: What value do the numbers carry in the ranking?  I arbitrarily chose
: the number 5 because it's an easy number :)

query boosts are in fact pretty arbitrary ... what you should pick really
depends on what boosts you put on other clauses, and what kinds of values
the tf, idf, and coord functions of your Similarity are going to return.

: I am a bit nervous about the dismax query system as I have quite a bit
: of other content that could skew the results.

i'm really not sure what you mean by that ... dismax will only look at the
fields you tell it to, and the factors that contribute to the score of each
term/document pair in a dismax query will be the same as those from the
standard request handler -- the only difference is how those individual
TermQuery scores are combined.

: What's the difference between the dismax query handler and listing all of
: the fields in my search and separating them with an OR?

the best way to understand this is to look at the debug output you get
from each query, and read the "Explanation" section ... some of the deep
details may not make much sense, but the overall structure of the score
calculation should be helpful

in a nutshell, when you ask the StandardRequestHandler for docs
matching...
     q = title:(foo bar) other:(foo bar)

if a document matches title:foo, other:foo, and other:bar then the
score for that document is (essentially) the sum of the scores from
matching the individual terms

with dismax, if you ask for

     q = foo bar  & qf = title other

then the score for the same document is different: the matches on
the word "foo" are considered together regardless of field, and only the
field that resulted in the highest score is used (with a small portion of
the matches on the other fields being included to help break ties).  the
score contribution from matching on other:bar is basically the same as
before.
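
As a rough illustration of the two combination rules described above,
here is a toy Python sketch. It is not Lucene's scoring code: the
per-field scores, the function names, and the 0.1 tie-breaker value are
all made-up assumptions, chosen only to show how the same term matches
can be combined differently.

```python
# Toy model of score combination. The numbers are invented for
# illustration; this is not Lucene's real scoring formula.

def sum_score(term_field_scores):
    """Standard-handler style: add up every matching term/field score."""
    return sum(
        score
        for fields in term_field_scores.values()
        for score in fields.values()
    )

def dismax_score(term_field_scores, tie=0.1):
    """Dismax style: for each term keep the best field score plus a small
    fraction (tie) of the other fields' scores, then add across terms."""
    total = 0.0
    for fields in term_field_scores.values():
        scores = list(fields.values())
        if not scores:
            continue
        best = max(scores)
        total += best + tie * (sum(scores) - best)
    return total

# Hypothetical scores for a document matching title:foo, other:foo, other:bar.
doc = {
    "foo": {"title": 1.0, "other": 0.5},
    "bar": {"other": 0.25},
}

print("sum:   ", sum_score(doc))
print("dismax:", dismax_score(doc))
```

In this sketch the "bar" contribution is identical under both rules; only
the two "foo" matches are treated differently.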

The driving motivation for the DisjunctionMaxQuery was so that if you
wanted to search for the words "Java" or "Lucene" in 3 different fields
(title, description, and body), a document that matched Lucene once in the
body field, but matched Java dozens of times and at least once in each
field, wouldn't overshadow a document that matched both Lucene and Java
just once in each field.


-Hoss


Re: relevance ranking and scoring

Posted by Andrew Nagy <an...@villanova.edu>.
Yonik Seeley wrote:
>>
>> What about term ranking, could I rank the phrases searched in title
>> higher than title2?
>
> Absolutely... standard lucene syntax for boosting will give you that
> in the standard query handler.
>
> title:(gone with the wind)^3.0 OR title2:(gone with the wind)
That did it!  Thanks for the Help!
What value do the numbers carry in the ranking?  I arbitrarily chose
the number 5 because it's an easy number :)

I am a bit nervous about the dismax query system as I have quite a bit 
of other content that could skew the results.
What's the difference between the dismax query handler and listing all of
the fields in my search and separating them with an OR?

Thanks!
Andrew



Re: relevance ranking and scoring

Posted by Yonik Seeley <yo...@apache.org>.
On 1/24/07, Andrew Nagy <an...@villanova.edu> wrote:
> Yonik Seeley wrote:
> > Ok, here is your query:
> > <str name="rawquerystring">title:(gone with the wind) OR title2:(gone
> > with the wind)</str>
> > And here it is parsed:
> > <str name="parsedquery">(title:gone title:wind) (title2:gone
> > title2:wind)</str>
> >
> > First, notice how stopwords were removed, so "with" and "the" will not
> > count in the results.
> >
> > You are querying across two different fields.
> > Notice how the first two documents both have "wind" in both title and
> > title2,
> > while the third document "gone with the wind" has no title2 field (and
> > hence can't match on it).
> >
> > In the first documents, the scores for the matches on title and title2
> > both contribute to the score.  For the third document, it's penalized
> > by not matching in both the title and title2 field.
> >
> > You could look at the dismax handler... it helps construct queries, a
> > component of which are DisjunctionMaxQueries (they don't add together
> > scores from different fields, but just take the highest score from any
> > matching field for a term).
> >
> > You could also see how changing or removing the stopword list affects
> > relevance.
> Wow, thanks for the verbose response.  This gives me a lot to go on!
>
> What about term ranking, could I rank the phrases searched in title
> higher than title2?

Absolutely... standard lucene syntax for boosting will give you that
in the standard query handler.

title:(gone with the wind)^3.0 OR title2:(gone with the wind)

For dismax, you give the query separate from the fields, and you can
express different weights on the fields via qf=title^3.0 title2
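
A minimal sketch of building such a dismax request from Python follows.
The host, port, and path are hypothetical, the helper name is invented,
and qt=dismax assumes a request handler registered under the name
"dismax" in solrconfig.xml.

```python
from urllib.parse import urlencode

# Hypothetical Solr location; adjust host, port, and path for your install.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"

def dismax_url(user_query, field_boosts):
    """Build a dismax request URL with per-field boosts in qf.

    field_boosts is a list of (field, boost) pairs; a boost of None
    leaves the field at its default weight.
    """
    qf = " ".join(
        field if boost is None else f"{field}^{boost}"
        for field, boost in field_boosts
    )
    # Assumes a handler named "dismax" is configured (selected via qt).
    params = {"qt": "dismax", "q": user_query, "qf": qf}
    return SOLR_SELECT_URL + "?" + urlencode(params)

url = dismax_url("gone with the wind", [("title", 3.0), ("title2", None)])
print(url)
```

The query text and the field weights travel as separate parameters, so
the weights can be changed without touching what the user typed.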

-Yonik

Re: relevance ranking and scoring

Posted by Andrew Nagy <an...@villanova.edu>.
Yonik Seeley wrote:
> Ok, here is your query:
> <str name="rawquerystring">title:(gone with the wind) OR title2:(gone
> with the wind)</str>
> And here it is parsed:
> <str name="parsedquery">(title:gone title:wind) (title2:gone 
> title2:wind)</str>
>
> First, notice how stopwords were removed, so "with" and "the" will not
> count in the results.
>
> You are querying across two different fields.
> Notice how the first two documents both have "wind" in both title and 
> title2,
> while the third document "gone with the wind" has no title2 field (and
> hence can't match on it).
>
> In the first documents, the scores for the matches on title and title2
> both contribute to the score.  For the third document, it's penalized
> by not matching in both the title and title2 field.
>
> You could look at the dismax handler... it helps construct queries, a
> component of which are DisjunctionMaxQueries (they don't add together
> scores from different fields, but just take the highest score from any
> matching field for a term).
>
> You could also see how changing or removing the stopword list affects 
> relevance.
Wow, thanks for the verbose response.  This gives me a lot to go on!

What about term ranking, could I rank the phrases searched in title 
higher than title2?

Thanks!
Andrew

Re: relevance ranking and scoring

Posted by Yonik Seeley <yo...@apache.org>.
On 1/24/07, Andrew Nagy <an...@villanova.edu> wrote:
> > Let's start with the first... add a debugQuery=on
> > parameter to your request and post the full result here.
> > You can get the same effect through the
> > query form on the solr admin pages by checking the "Debug: explain"
> > checkbox.
> I attached the results to my last email, are you not able to see them?

Ahh, I missed it.

Ok, here is your query:
 <str name="rawquerystring">title:(gone with the wind) OR title2:(gone
with the wind)</str>
And here it is parsed:
 <str name="parsedquery">(title:gone title:wind) (title2:gone title2:wind)</str>

First, notice how stopwords were removed, so "with" and "the" will not
count in the results.
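
The effect of stopword removal on the parsed query can be mimicked with
a few lines of Python. This is only a sketch: the stopword set below is
an assumed subset, the function name is invented, and the real list
comes from whatever stopword file the field's analyzer is configured
with.

```python
# Assumed stopword subset for illustration; the real list comes from the
# stopwords file configured for the field's analyzer.
STOPWORDS = {"with", "the", "a", "an", "of", "and"}

def parse_field_clause(field, text):
    """Mimic the parsed form of field:(w1 w2 ...) after stopword removal."""
    terms = [w for w in text.lower().split() if w not in STOPWORDS]
    return "(" + " ".join(f"{field}:{t}" for t in terms) + ")"

parsed = " ".join(
    parse_field_clause(field, "gone with the wind")
    for field in ("title", "title2")
)
print(parsed)
```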

You are querying across two different fields.
Notice how the first two documents both have "wind" in both title and title2,
while the third document "gone with the wind" has no title2 field (and
hence can't match on it).

In the first two documents, the scores for the matches on title and title2
both contribute to the score.  The third document is penalized because it
cannot match in both the title and title2 fields.

You could look at the dismax handler... it helps construct queries, a
component of which is the DisjunctionMaxQuery (it doesn't add together
scores from different fields, but just takes the highest score from any
matching field for a term).

You could also see how changing or removing the stopword list affects relevance.

-Yonik

Re: relevance ranking and scoring

Posted by Andrew Nagy <an...@villanova.edu>.
Yonik Seeley wrote:
> On 1/23/07, Andrew Nagy <an...@villanova.edu> wrote:
>> Yonik Seeley wrote:
>> > Things you can try:
>> > - post the debugging output (including score explain) for the query
>> I have attached the output.
>> > - try disabling length normalization for the title field, then remove
>> > the entire index and re-index.
>> > - try the dismax handler, which can generate sloppy phrase queries to
>> > boost results containing all terms.
>> > - try a different similarity implementation
>> > (org.apache.lucene.misc.SweetSpotSimilarity from lucene)
>> Can you explain what these 3 options mean?  I would like to get a better
>> understanding of the guts of SOLR/Lucene but I am too busy working on my
>> application that uses it to spend time with the internals.
>
> Let's start with the first... add a debugQuery=on
> parameter to your request and post the full result here.
> You can get the same effect through the
> query form on the solr admin pages by checking the "Debug: explain" 
> checkbox.
I attached the results to my last email, are you not able to see them?

Andrew

Re: relevance ranking and scoring

Posted by Yonik Seeley <yo...@apache.org>.
On 1/23/07, Andrew Nagy <an...@villanova.edu> wrote:
> Yonik Seeley wrote:
> > Things you can try:
> > - post the debugging output (including score explain) for the query
> I have attached the output.
> > - try disabling length normalization for the title field, then remove
> > the entire index and re-index.
> > - try the dismax handler, which can generate sloppy phrase queries to
> > boost results containing all terms.
> > - try a different similarity implementation
> > (org.apache.lucene.misc.SweetSpotSimilarity from lucene)
> Can you explain what these 3 options mean?  I would like to get a better
> understanding of the guts of SOLR/Lucene but I am too busy working on my
> application that uses it to spend time with the internals.

Let's start with the first... add a debugQuery=on
parameter to your request and post the full result here.
You can get the same effect through the
query form on the solr admin pages by checking the "Debug: explain" checkbox.
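
For anyone scripting this rather than using the admin form, a minimal
sketch of adding the parameter from Python is below. The host, port, and
path are hypothetical; only the debugQuery=on parameter comes from this
thread.

```python
from urllib.parse import urlencode

# Hypothetical Solr location; adjust host, port, and path for your install.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"

params = {
    "q": "title:(gone with the wind) OR title2:(gone with the wind)",
    "debugQuery": "on",  # asks Solr to include the score explanation
}
debug_url = SOLR_SELECT_URL + "?" + urlencode(params)
print(debug_url)
```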

-Yonik

Re: relevance ranking and scoring

Posted by Andrew Nagy <an...@villanova.edu>.
Yonik Seeley wrote:
> Things you can try:
> - post the debugging output (including score explain) for the query
I have attached the output.
> - try disabling length normalization for the title field, then remove
> the entire index and re-index.
> - try the dismax handler, which can generate sloppy phrase queries to
> boost results containing all terms.
> - try a different similarity implementation
> (org.apache.lucene.misc.SweetSpotSimilarity from lucene)
Can you explain what these 3 options mean?  I would like to get a better 
understanding of the guts of SOLR/Lucene but I am too busy working on my 
application that uses it to spend time with the internals.

Thanks
Andrew

Re: relevance ranking and scoring

Posted by Yonik Seeley <yo...@apache.org>.
On 1/23/07, Yonik Seeley <yo...@apache.org> wrote:
> - try disabling length normalization for the title field, then remove
> the entire index and re-index.

Forgot to tell you how to disable length normalization:
set omitNorms="true" on the field in schema.xml

-Yonik

Re: relevance ranking and scoring

Posted by Yonik Seeley <yo...@apache.org>.
On 1/23/07, Andrew Nagy <an...@villanova.edu> wrote:
> I have 2 questions about the SOLR relevancy system.

As far as scoring, it's pretty much stock lucene with some other stuff
added on (like function query).
http://lucene.apache.org/java/docs/scoring.html

> 1. Why is it when I search for an exact phrase of a title of a record I
> have it generally does not come up as the 1st record in the results?
>
> ex: title:(gone with the wind), the record comes up 3rd.  A record with
> the term "wind" as the first word in the title comes up 1st.
> ex: title:"gone with the wind", the record comes up 1st.

Well, you could do an exact or sloppy phrase match
title:"gone with the wind"
But I get your point... if you want to also match records with just "wind".

> Is this because the word "wind" is the only noun?

Not because it is a noun (Lucene knows nothing about parts of speech).
This probably came about because of lucene's length normalization
in the default similarity.  It's 1/sqrt(num_terms_in_field)

So a document with a title of "wind" has a "norm" of 1.0, while a
document with 4 terms has a "norm" of 0.5 (or about .7 if the two
stopwords are dropped at index time, leaving 2 terms).
Still, it seems like the coord factor (number of terms matching)
should have been more than enough to overcome the length
normalization.  What were the exact titles?  I assume you were not
using any type of index-time boosting?
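
Evaluating the 1/sqrt(num_terms_in_field) formula for a few field
lengths shows how quickly short titles gain an advantage. This is a
minimal sketch with an invented function name; the values Lucene
actually stores may be rounded, so treat the output as approximate. One
indexed term gives 1.0, two terms (for example "gone with the wind"
after stopword removal) give about 0.71, and four terms give 0.5.

```python
import math

def length_norm(num_terms):
    """Length norm as described above: 1/sqrt(num_terms_in_field)."""
    return 1.0 / math.sqrt(num_terms)

for n in (1, 2, 4):
    print(n, round(length_norm(n), 3))
```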

Things you can try:
- post the debugging output (including score explain) for the query
- try disabling length normalization for the title field, then remove
the entire index and re-index.
- try the dismax handler, which can generate sloppy phrase queries to
boost results containing all terms.
- try a different similarity implementation
(org.apache.lucene.misc.SweetSpotSimilarity from lucene)


> 2. The "score" that is associated with each value is quite odd, what
> does it represent.  I generally get results with the top record being
> somewhere around 3.0 or 2.0 and most records are below 1.

Scores aren't too comparable across different queries... the scores
are only meant to rank documents with respect to a single query.

-Yonik