Posted to dev@lucene.apache.org by ka...@nokia.com on 2011/01/26 18:17:20 UTC

Scoring woes?

I have an interesting scoring problem, which I can't seem to get around.

The problem is best stated as follows:

(1) My schema has several independent fields, e.g. "value_0", "value_1", ... "value_6".

(2) Every document has all of these fields set, with a-priori field norm values. Where a record has no field value, the document is indexed with a placeholder value ("_empty_"), whose field norm is the numerical average of all the a-priori field norms for that field.

(3) My query takes a set of terms, builds a list of combinations of them, and ORs these combinations together. For example:

Q=Lexington Massachusetts

Query:
(+value_0:Lexington +value_0:Massachusetts)
(+value_0:Lexington +value_1:Massachusetts)
(+value_1:Lexington +value_0:Massachusetts)
...
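For illustration, here is a minimal sketch of how such a list of combinations could be enumerated. It assumes every term may be assigned to every field (so two terms over seven fields give 49 clauses); the class name CombinationQueries and the method buildClauses are hypothetical, and the output is plain query-syntax strings rather than Lucene Query objects.

```java
import java.util.ArrayList;
import java.util.List;

public class CombinationQueries {

    // Hypothetical helper: one "(+field:term ...)" clause per assignment of
    // terms to fields value_0 .. value_(fieldCount-1).
    static List<String> buildClauses(String[] terms, int fieldCount) {
        List<String> clauses = new ArrayList<>();
        int[] assignment = new int[terms.length]; // assignment[i] = field index for terms[i]
        while (true) {
            List<String> parts = new ArrayList<>();
            for (int i = 0; i < terms.length; i++) {
                parts.add("+value_" + assignment[i] + ":" + terms[i]);
            }
            clauses.add("(" + String.join(" ", parts) + ")");

            // Advance like an odometer; the last term's field varies fastest.
            int pos = terms.length - 1;
            while (pos >= 0 && ++assignment[pos] == fieldCount) {
                assignment[pos] = 0;
                pos--;
            }
            if (pos < 0) {
                break; // every assignment has been emitted
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        String[] terms = {"Lexington", "Massachusetts"};
        for (String clause : buildClauses(terms, 7)) {
            System.out.println(clause);
        }
    }
}
```

The clauses are then ORed together as optional clauses of one top-level boolean query.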

The tricky part comes when I try to explicitly add the "_empty_" matches. I need to do this because I am trying to ensure that when, say, two values are matched, the record that has only those two values scores highest, ahead of all the records that have those two values and also a third one. So, I tried this:

Query:
(+value_0:Lexington +value_0:Massachusetts +value_1:_empty_ +value_2:_empty_ +value_3:_empty_ +value_4:_empty_ etc.)
(+value_0:Lexington +value_1:Massachusetts +value_2:_empty_ etc.)
(+value_1:Lexington +value_0:Massachusetts +value_2:_empty_ etc.)
...
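As a sketch of the padding step, the helper below takes one assignment of terms to fields and appends a required "_empty_" clause for every field the assignment leaves unused. The names EmptyPaddedClause and buildClause are hypothetical, and the result is again a query-syntax string, not a Lucene Query.

```java
import java.util.ArrayList;
import java.util.List;

public class EmptyPaddedClause {

    static final String EMPTY = "_empty_"; // placeholder token from the index

    // Hypothetical helper: fields[i] is the field index that terms[i] must match.
    static String buildClause(String[] terms, int[] fields, int fieldCount) {
        boolean[] used = new boolean[fieldCount];
        List<String> parts = new ArrayList<>();

        // Required clauses for the real terms.
        for (int i = 0; i < terms.length; i++) {
            used[fields[i]] = true;
            parts.add("+value_" + fields[i] + ":" + terms[i]);
        }

        // Required placeholder clauses for every field no term was assigned to.
        for (int f = 0; f < fieldCount; f++) {
            if (!used[f]) {
                parts.add("+value_" + f + ":" + EMPTY);
            }
        }
        return "(" + String.join(" ", parts) + ")";
    }

    public static void main(String[] args) {
        String[] terms = {"Lexington", "Massachusetts"};
        System.out.println(buildClause(terms, new int[]{0, 0}, 7));
        System.out.println(buildClause(terms, new int[]{0, 1}, 7));
        System.out.println(buildClause(terms, new int[]{1, 0}, 7));
    }
}
```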

I also needed to be able to match any value, rather than _empty_, in each of the places where _empty_ occurred. Including no clause at all for these fields clearly messed up the queryNorm, so I fixed that by including a MatchAllDocsQuery() for each missing field, thus ensuring that the number of query clauses was identical from one combination to the next.

Nevertheless, I was still not seeing the shortest-match records being scored to the top. So I tried to boost the _empty_ matches, like this:

(+value_0:Lexington +value_0:Massachusetts +value_1:_empty_^1000.0 +value_2:_empty_^1000.0 +value_3:_empty_^1000.0 +value_4:_empty_^1000.0 etc.)

That, surprisingly, did not change anything. I suppose it must be because the boost is also figured into the query norm? I'm trying another experiment now, reindexing with a pre-boosted field norm for _empty_ tokens. But what I'd like to ask is, how exactly are you supposed to fix this problem in Lucene? All I want is for the minimal complete match to be scored to the top.

Karl