Posted to dev@lucene.apache.org by ka...@nokia.com on 2011/01/26 18:17:20 UTC

Scoring woes?

I have an interesting scoring problem, which I can't seem to get around.

The problem is best stated as follows:

(1)    My schema has several independent fields, e.g. "value_0", "value_1", ... "value_6".

(2)    Every document has all of these fields set, with a-priori field norm values.  Where a record has no field value, the document is indexed with a placeholder value ("_empty_"), whose field norm is the numerical average of all the a-priori field norms for that field.

(3)    My query takes a set of terms, builds a list of combinations of these across the fields, and ORs these combinations together.  For example:

Q=Lexington Massachusetts

Query:
(+value_0:Lexington +value_0:Massachusetts)
(+value_0:Lexington +value_1:Massachusetts)
(+value_1:Lexington +value_0:Massachusetts)
...
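
As a rough illustration of the combination step in (3), here is a sketch in plain Java with no Lucene dependency. It only builds the query strings. The class and method names are hypothetical, and it assumes every term may land in every field, which the "..." above suggests but does not confirm.

```java
import java.util.ArrayList;
import java.util.List;

public class FieldCombinations {
    // Hypothetical helper: returns one required-clause group per assignment
    // of terms to fields value_0 .. value_(fieldCount-1).
    static List<String> build(String[] terms, int fieldCount) {
        List<String> groups = new ArrayList<>();
        expand(terms, fieldCount, 0, "", groups);
        return groups;
    }

    // Recursively picks a field for the term at position pos.
    private static void expand(String[] terms, int fieldCount, int pos,
                               String prefix, List<String> out) {
        if (pos == terms.length) {
            out.add("(" + prefix.trim() + ")");
            return;
        }
        for (int f = 0; f < fieldCount; f++) {
            expand(terms, fieldCount, pos + 1,
                   prefix + " +value_" + f + ":" + terms[pos], out);
        }
    }

    public static void main(String[] args) {
        // Two fields keep the output short; the schema above has seven.
        for (String g : build(new String[] {"Lexington", "Massachusetts"}, 2)) {
            System.out.println(g);
        }
    }
}
```

With n terms and k fields this produces k^n groups, so the query grows quickly as terms are added.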

The tricky part comes when I try to explicitly add the "_empty_" matches.  I need to do this because I am trying to ensure that when, say, two values are matched, the record that has only those two values scores highest, above all the records that have those two values and also a third one.  So, I tried this:

Query:
(+value_0:Lexington +value_0:Massachusetts +value_1:_empty_ +value_2:_empty_ +value_3:_empty_ +value_4:_empty_ etc.)
(+value_0:Lexington +value_1:Massachusetts +value_2:_empty_ etc.)
(+value_1:Lexington +value_0:Massachusetts +value_2:_empty_ etc.)
...
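
The padding step can be sketched the same way: for one assignment of terms to fields, add a required "_empty_" clause for every field that no term was assigned to. This is plain Java that builds a query string only. The class name, method name, and the int[] encoding of the assignment are hypothetical, and it assumes at least one term.

```java
public class EmptyPadding {
    static final String EMPTY = "_empty_";

    // fieldOfTerm[t] is the field index that terms[t] is required to match.
    static String pad(String[] terms, int[] fieldOfTerm, int fieldCount) {
        boolean[] used = new boolean[fieldCount];
        StringBuilder sb = new StringBuilder("(");
        for (int t = 0; t < terms.length; t++) {
            if (t > 0) sb.append(' ');
            sb.append("+value_").append(fieldOfTerm[t]).append(':').append(terms[t]);
            used[fieldOfTerm[t]] = true;
        }
        // Every field without a term must hold the placeholder token.
        for (int f = 0; f < fieldCount; f++) {
            if (!used[f]) sb.append(" +value_").append(f).append(':').append(EMPTY);
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        String[] terms = {"Lexington", "Massachusetts"};
        System.out.println(pad(terms, new int[] {0, 0}, 5));
        System.out.println(pad(terms, new int[] {0, 1}, 5));
    }
}
```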

I also needed it to be possible to match any value, rather than only _empty_, in each of the places where that occurred.  Including no clause at all for these fields clearly messed up the queryNorm, so I fixed that by including a MatchAllDocsQuery() for each missing field, thus ensuring that the number of query clauses was identical from one combination to the next.

Nevertheless, I was still not seeing the shortest-match records scored to the top.  So I tried to boost the _empty_ matches, like this:

(+value_0:Lexington +value_0:Massachusetts +value_1:_empty_^1000.0 +value_2:_empty_^1000.0 +value_3:_empty_^1000.0 +value_4:_empty_^1000.0 etc.)

That, surprisingly, did not change anything.  I suppose it must be because the boost is also figured into the query norm?  I'm trying another experiment now, reindexing with a pre-boosted field norm for _empty_ tokens.  But what I'd like to ask is, how exactly are you supposed to fix this problem in Lucene?  All I want is for the minimal complete match to be scored to the top.
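
For reference, the arithmetic behind the queryNorm question can be sketched in plain Java. It assumes the classic DefaultSimilarity behaviour, where each term clause contributes (idf * boost)^2 to the sum of squared weights and queryNorm is 1/sqrt of that sum. The class name, method name, and idf values are made up for illustration, and nested boolean structure, coord, and field norms are all ignored.

```java
public class QueryNormSketch {
    // Returns each clause's query weight after normalization:
    // idf[i] * boost[i] * queryNorm, with queryNorm = 1 / sqrt(sum of squares).
    static double[] normalizedWeights(double[] idf, double[] boost) {
        double sumOfSquares = 0.0;
        for (int i = 0; i < idf.length; i++) {
            double w = idf[i] * boost[i];
            sumOfSquares += w * w;
        }
        double queryNorm = 1.0 / Math.sqrt(sumOfSquares);
        double[] out = new double[idf.length];
        for (int i = 0; i < idf.length; i++) {
            out[i] = idf[i] * boost[i] * queryNorm;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] idf = {1.0, 1.0, 1.0, 1.0};   // illustrative values only
        double[] plain = normalizedWeights(idf, new double[] {1, 1, 1, 1});
        double[] allBoosted = normalizedWeights(idf, new double[] {1000, 1000, 1000, 1000});
        double[] someBoosted = normalizedWeights(idf, new double[] {1, 1, 1000, 1000});
        System.out.println(plain[0] + " " + allBoosted[0]);
        System.out.println(someBoosted[0] + " " + someBoosted[2]);
    }
}
```

Under these assumptions, a boost applied uniformly to every clause cancels out of the normalized weights, while a boost applied to only some clauses shifts the relative weights and caps the boosted ones below 1. Whether that accounts for the behaviour described above depends on the rest of the scoring formula, which this sketch does not model.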

Karl