Posted to java-user@lucene.apache.org by Ronan Cummins <ro...@cl.cam.ac.uk> on 2015/04/11 14:29:02 UTC

LMDirichletSimilarity Scoring Function

I am implementing a language modelling (type) similarity function, and am
using the LMDirichletSimilarity class (and its helper classes) as a
template. However, it seems the LMDirichletSimilarity.class implementation
is not the same as that presented in "A Study of Smoothing Methods for
Language Models Applied to Information Retrieval" by Zhai and Lafferty.

The score method in LMDirichletSimilarity.class for matching terms is
implemented as follows:

score = (float) (Math.log(1 + freq / (mu * ((LMStats) stats).getCollectionProbability()))
        + Math.log(mu / (docLen + mu)));

In particular, the score method in that class applies the normalisation
factor (i.e. the Math.log(mu / (docLen + mu)) term) only once per matching
term. In the rank-equivalent form given by Zhai and Lafferty, this
normalisation is applied once for every term in the query, regardless of
whether the term occurs in the document. Unless I am missing some
background scoring that goes on in Lucene, the Math.log(mu / (docLen + mu))
term should be removed from the per-term score, and the following
document-specific factor should be added to the document score after the
term-scoring part:

+ queryLen * Math.log(mu / (docLen + mu))
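
To make the difference concrete, here is a minimal sketch of the
rank-equivalent Dirichlet score in plain Java. It is not Lucene code: the
class DirichletSketch, its method names, and the use of maps for term
frequencies and collection probabilities are all assumptions made for
illustration. It also assumes every query term has a collection probability
available.

```java
import java.util.Map;

/** Sketch only: plain Java, not Lucene code. All names here are hypothetical. */
public class DirichletSketch {

    /** Per-term part for a matching term: log(1 + freq / (mu * p(w|C))). */
    public static double termScore(double freq, double mu, double collectionProb) {
        return Math.log(1.0 + freq / (mu * collectionProb));
    }

    /** Document-level part, added once per document: queryLen * log(mu / (docLen + mu)). */
    public static double lengthNorm(int queryLen, double docLen, double mu) {
        return queryLen * Math.log(mu / (docLen + mu));
    }

    /**
     * Sum of the per-term parts over the query terms that occur in the document,
     * plus the document-level part counted for every query term.
     * Assumes collectionProbs has an entry for each query term.
     */
    public static double score(String[] queryTerms, Map<String, Integer> docTermFreqs,
                               Map<String, Double> collectionProbs, double docLen, double mu) {
        double sum = 0.0;
        for (String term : queryTerms) {
            int freq = docTermFreqs.getOrDefault(term, 0);
            if (freq > 0) {
                // Only matching terms contribute a per-term score.
                sum += termScore(freq, mu, collectionProbs.get(term));
            }
        }
        // The normalisation depends on the full query length, not on the number of matches.
        return sum + lengthNorm(queryTerms.length, docLen, mu);
    }
}
```

With a two-term query where only one term matches, the document-level part
is still counted twice here, whereas the score method quoted above would
count it once.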

Therefore, my question is as follows:

Where in Lucene can I add a document-specific factor to each document's
score just before the final scores are sorted? I want this factor to be
calculated and tuneable at query time (not index time).

The boosting features of Lucene seem to be inflexible for this purpose, as
they assume that the boosting factor is multiplied into the score, whereas
I need to add a term to it.
I could run the initial query and then re-score the documents in the
TopDocs by adding the factor, but it seems like there has to be a more
efficient way to do this.
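
The re-scoring fallback could look roughly like the sketch below. It is
plain Java with no Lucene dependency: the Hit type is a hypothetical
stand-in for a search hit and is not a Lucene class. It assumes each
first-pass score contains only the per-term part, and that the document
length is available for each hit at query time.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Sketch only: Hit is a hypothetical stand-in for a search hit, not a Lucene class. */
public class RescoreSketch {

    public static final class Hit {
        public final int doc;        // document id
        public final double docLen;  // document length, assumed available at query time
        public double score;         // first-pass score (per-term part only, by assumption)

        public Hit(int doc, double docLen, double score) {
            this.doc = doc;
            this.docLen = docLen;
            this.score = score;
        }
    }

    /** Adds queryLen * log(mu / (docLen + mu)) to every hit, then re-sorts by descending score. */
    public static void rescore(Hit[] hits, int queryLen, double mu) {
        for (Hit h : hits) {
            h.score += queryLen * Math.log(mu / (h.docLen + mu));
        }
        Arrays.sort(hits, Comparator.comparingDouble((Hit h) -> h.score).reversed());
    }
}
```

One limitation of this approach is that only the retrieved hits are
re-ranked, so a document just outside the first-pass cut-off cannot move
into the final list even if the added factor would favour it.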

As this is one of the main formulas in information retrieval, it would be
nice if it were implemented correctly.
Any help appreciated...