Posted to solr-user@lucene.apache.org by Chris Hostetter <ho...@fucit.org> on 2011/11/03 20:21:10 UTC

Re: score based on unique words matching

: > q=david bowie changes
: > 
: > Problem : If a record mentions david bowie a lot, it beats out something
: > more relevant (more unique matches) ...
: > 
: > A. (now appearing david bowie at the cineplex 7pm david bowie goes on stage,
: > then mr. bowie will sign autographs)
: > B. song :david bowie - changes
: > 
: > (A) ends up more relevant because of the frequency or number of words in
: > it.. not cool...
: > I want it so the number of words matching will trump density/weight....

debugQuery=true is your friend .. it will show you exactly how the scores 
are being computed.
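
For illustration, here is a minimal sketch of building such a request. The
host, port, and handler path are assumptions about a default local install,
and the "fl" value is only there to make the score visible in the results.

```python
from urllib.parse import urlencode

# Assumed local Solr endpoint; adjust host, port, and path for your install.
SOLR_SELECT = "http://localhost:8983/solr/select"

def debug_url(query):
    """Return a select URL that asks Solr to explain how each score was computed."""
    params = {
        "q": query,
        "debugQuery": "true",   # adds the per-document score explanation
        "fl": "*,score",        # return stored fields plus the score
    }
    return SOLR_SELECT + "?" + urlencode(params)

print(debug_url("david bowie changes"))
```

The explanation section of the response breaks each score down into the
factors discussed next.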

the key factors in something like this are fieldNorm, tf, and the coord 
factor.

The fieldNorm includes the length of the field as a factor, so as long as 
you have omitNorms="false" configured for this field, doc #A should be 
penalized relative to doc #B for being longer -- but if you omit norms then 
that won't help you -- so start by checking that.
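
To make the length penalty concrete, here is a rough sketch using the
1/sqrt(numTerms) length norm of the classic DefaultSimilarity. Whitespace
tokenization is an assumption (your analyzer may produce different token
counts), boosts are ignored, and the real norm is stored with reduced
precision, so treat the numbers as approximate.

```python
import math

def length_norm(num_terms):
    # Classic DefaultSimilarity length norm: shorter fields get a larger value.
    return 1.0 / math.sqrt(num_terms)

# The two example docs from the question, punctuation dropped,
# split on whitespace (an assumption about the analyzer).
doc_a = ("now appearing david bowie at the cineplex 7pm david bowie "
         "goes on stage then mr bowie will sign autographs").split()
doc_b = "song david bowie changes".split()

print(len(doc_a), length_norm(len(doc_a)))
print(len(doc_b), length_norm(len(doc_b)))
```

With norms enabled, the short doc #B gets the larger multiplier; with norms
omitted, both docs are treated as if they were the same length.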

The coord factor will penalize documents that don't match all of the 
clauses of a boolean query (ie: doc #A only matches 2/3 clauses because it 
doesn't match the word "changes") so you could customize your Similarity 
implementation to make that coord penalty higher, but that requires some 
custom Java code.
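
As a sketch of the arithmetic only (the real change would be a Java
Similarity subclass), the default coord is overlap/maxOverlap. The
steeper_coord function below is a hypothetical variant, not anything Lucene
ships, showing one way the penalty could be made harsher.

```python
def coord(overlap, max_overlap):
    # Default behaviour: fraction of query clauses the document matched.
    return overlap / max_overlap

def steeper_coord(overlap, max_overlap, power=2):
    # Hypothetical harsher penalty: raise the fraction to a power > 1.
    return (overlap / max_overlap) ** power

# Query "david bowie changes" has 3 clauses.
# doc #A matches 2 of them, doc #B matches all 3.
for name, overlap in (("A", 2), ("B", 3)):
    print(name, coord(overlap, 3), steeper_coord(overlap, 3))
```

A doc matching every clause still gets 1.0 either way, so only partial
matches like doc #A are pushed down further.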

As an extreme option, you could use omitTermFreqAndPositions="true" on the 
field to completely eliminate the term frequency from being a factor in 
scoring (so the number of times "bowie" appears won't affect the score, 
just that it appears at least once) but that probably isn't what you want: 
"david bowie changes some stuff" would get the same score as 
"david bowie changes david bowie"
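
A simplified sketch of that effect, assuming the classic sqrt(freq) term
frequency factor and ignoring idf, norms, and coord entirely. The tf_sum
helper is made up for this illustration.

```python
import math

QUERY_TERMS = ["david", "bowie", "changes"]

def tf(freq, omit_tf=False):
    # Classic tf factor is sqrt(freq); with term frequencies omitted,
    # any matching term contributes as if it appeared exactly once.
    if freq == 0:
        return 0.0
    return 1.0 if omit_tf else math.sqrt(freq)

def tf_sum(text, omit_tf=False):
    # Illustration only: sum of tf factors over the query terms.
    tokens = text.lower().split()
    return sum(tf(tokens.count(term), omit_tf) for term in QUERY_TERMS)

for doc in ("david bowie changes some stuff",
            "david bowie changes david bowie"):
    print(doc, tf_sum(doc), tf_sum(doc, omit_tf=True))
```

With term frequencies kept, the second doc scores higher on this factor;
with them omitted, the two docs tie.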

in general the simplest way to deal with a lot of this type of thing is to 
think about how you are structuring your query.  something as simple as 
using the dismax parser with your field in both the "qf" and "pf" params 
(and a little bit of slop in the "ps" param) may give you exactly what you 
want (since it will reward docs where the whole query string appears in 
the field).

https://wiki.apache.org/solr/DisMaxQParserPlugin
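
A minimal sketch of what those request params might look like. The field
name "text" and the slop value of 2 are placeholders, not recommendations;
use your own field and tune "ps" against your data.

```python
from urllib.parse import urlencode

def dismax_params(user_query, field, slop=2):
    """Build dismax params that search `field` and also boost phrase matches in it."""
    return {
        "defType": "dismax",
        "q": user_query,
        "qf": field,       # field(s) the individual terms are matched against
        "pf": field,       # field(s) where the whole query is tried as a phrase
        "ps": str(slop),   # how much slop the phrase boost tolerates
    }

# "text" is a hypothetical field name.
params = dismax_params("david bowie changes", "text")
print(urlencode(params))
```

With this setup doc #B, which contains the whole query nearly as a phrase,
picks up the "pf" boost while doc #A does not.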


-Hoss