Posted to java-user@lucene.apache.org by Stephen Wu <st...@trapit.com> on 2015/05/01 20:16:08 UTC

Stats in CustomScoreProvider + (in)correctness of LMDirichletSimilarity

I am having trouble getting collection probabilities for a term to show up
in a CustomScoreQuery/CustomScoreProvider.  Basically, I am trying to add a
per-document weight that amounts to the sum (for each term in the query) of
Math.log(collectionProbability).  Can anyone help with this?

Or feel free to suggest a better way to do this.  Here's a description...

-----
LMDirichletSimilarity is not consistent with the original equations, as
many have noted.  Here are two possible ways to address that:

1. *Swap in LMDirichletSimilarity* in place of some other similarity, but
modify the scoring function.  Ignoring the boost, it is currently
implemented as:
    term_score_current = Math.log(1 + freq /
        (mu * collectionProbability)) +
        Math.log(mu / (docLen + mu))

If you do this, there are two problems.  The first problem is that the
score is off by an additive term of Math.log(collectionProbability).  Do the
math <http://en.wikipedia.org/wiki/List_of_logarithmic_identities>: if you
add that term in, you get the original formulation (e.g., in Zhai and
Lafferty 2001).  For reference, that looks like:
    term_score_official = Math.log((freq + mu * collectionProbability) /
        (docLen + mu))
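
The log identity above can be checked numerically.  This is only a minimal
sketch: the class name, method names, and input values are made up for
illustration and are not part of Lucene.  It evaluates both formulas exactly
as written in this message (boost ignored) and compares them after adding
Math.log(collectionProbability) to the current score.

```java
// Hypothetical helper; names and values are illustrative, not Lucene API.
final class DirichletIdentity {
    // Term score as this message describes the current implementation.
    static double currentScore(double freq, double docLen, double mu, double p) {
        return Math.log(1 + freq / (mu * p)) + Math.log(mu / (docLen + mu));
    }

    // Dirichlet-smoothed log probability, as in the original formulation.
    static double officialScore(double freq, double docLen, double mu, double p) {
        return Math.log((freq + mu * p) / (docLen + mu));
    }

    public static void main(String[] args) {
        double freq = 3, docLen = 120, mu = 2000, p = 1e-4; // made-up inputs
        double adjusted = currentScore(freq, docLen, mu, p) + Math.log(p);
        double official = officialScore(freq, docLen, mu, p);
        // The two values should agree up to floating-point rounding.
        System.out.println("adjusted=" + adjusted + " official=" + official);
    }
}
```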

If you add that term, though, the second problem arises.  The
Math.log(collectionProbability) term does not get added for query terms that
don't MATCH a document, because .score() doesn't get called if there's
no MATCH.  This is basically the problem that Ronan Cummins wrote about a
few weeks ago.

2. *Leave LMDirichletSimilarity as it is* but *add a term* to every final
score that is returned.  (Note: you'd also need to remove the
non-negative score restriction in LMDirichletSimilarity.)  This would be
the sum of the log collection probabilities for each term:
    query_score = sum(term_score_current) +
        sum(Math.log(collectionProbability))

As some have mentioned, this is basically an additive version of a
queryNorm.  It seems like the right way to do this is to wrap each query in
a modified CustomScoreQuery accessing a CustomScoreProvider, which would
then add that "constant" factor across all documents.  However, this
"constant" factor needs to be computed from statistics; how can this be
done?  Those statistics are available in LMDirichletSimilarity, but it is
less clear how to find those statistics directly from a Query object.
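
One possible way to compute that term, sketched under assumptions: given each
query term's total frequency in the collection and the total number of tokens
in the field, the sum of log collection probabilities is plain arithmetic.
In Lucene these counts should be obtainable from
IndexReader.totalTermFreq(Term) and IndexReader.getSumTotalTermFreq(String);
the sketch below takes them as plain numbers so that it runs without Lucene.
The class and method names are hypothetical, and the add-one smoothing is an
assumption modelled on Lucene's default collection model.

```java
// Hypothetical helper; not part of Lucene. The counts are passed in as
// plain numbers, so how they are read from the index is left open.
final class CollectionLogProb {
    // Collection probability of one term. The add-one smoothing is an
    // assumption modelled on Lucene's default collection model.
    static double collectionProbability(long totalTermFreq, long sumTotalTermFreq) {
        return (totalTermFreq + 1.0) / (sumTotalTermFreq + 1.0);
    }

    // Sum of Math.log(collectionProbability) over all query terms.
    static double sumLogCollectionProbability(long[] totalTermFreqs, long sumTotalTermFreq) {
        double sum = 0.0;
        for (long ttf : totalTermFreqs) {
            sum += Math.log(collectionProbability(ttf, sumTotalTermFreq));
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] queryTermFreqs = {99L, 999L}; // made-up per-term counts
        long fieldTokens = 999_999L;         // made-up field token count
        System.out.println(sumLogCollectionProbability(queryTermFreqs, fieldTokens));
    }
}
```

Because this value depends only on the query terms and the collection, it
could be computed once per query and reused for every document.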

stephen

Re: Stats in CustomScoreProvider + (in)correctness of LMDirichletSimilarity

Posted by Stephen Wu <st...@trapit.com>.

Sorry, I was wrong about my solution for #2 -- linking some equations here
<http://mathb.in/34502?key=b2b24cfc50ee4983d8a2a0da09bab2686e8f2592> that
should explain a consistent approach.  Leaving LMDirichletSimilarity as-is
skews the "additive queryNorm" factor.  LMDirichletSimilarity should have
only the following in its .score() function:
    term_score_proposed = Math.log(1 + freq /
        (mu * collectionProbability))
At this point, the score is rank-equivalent with the correct score.
However, to get correct probabilities for other purposes (e.g., weighting
pseudo-relevance query expansion), the final score would need to add in:
    query_score = sum_matched(term_score_proposed) +
        sum_all(Math.log(mu * collectionProbability / (docLen + mu)))
where sum_matched runs over the query terms that occur in the document and
sum_all runs over all query terms.
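
The decomposition above can be sanity-checked against the original
formulation summed over all query terms, with freq = 0 for terms that do not
match.  This is a minimal sketch with hypothetical names and made-up inputs,
not Lucene code.  Note that the second sum depends on docLen, so the sketch
computes it per document.

```java
// Hypothetical helper; names and values are illustrative, not Lucene API.
final class ProposedDecomposition {
    // Proposed per-term score; only evaluated for matching terms.
    static double proposedTerm(double freq, double mu, double p) {
        return Math.log(1 + freq / (mu * p));
    }

    // Background term, added for every query term whether it matches or not.
    static double backgroundTerm(double docLen, double mu, double p) {
        return Math.log(mu * p / (docLen + mu));
    }

    // sum_matched(term_score_proposed) + sum_all(background term).
    static double queryScore(double[] freqs, double[] probs, double docLen, double mu) {
        double score = 0.0;
        for (int i = 0; i < freqs.length; i++) {
            if (freqs[i] > 0) {
                score += proposedTerm(freqs[i], mu, probs[i]);
            }
            score += backgroundTerm(docLen, mu, probs[i]);
        }
        return score;
    }

    // Original formulation summed over all query terms (freq may be 0).
    static double officialQueryScore(double[] freqs, double[] probs, double docLen, double mu) {
        double score = 0.0;
        for (int i = 0; i < freqs.length; i++) {
            score += Math.log((freqs[i] + mu * probs[i]) / (docLen + mu));
        }
        return score;
    }

    public static void main(String[] args) {
        double[] freqs = {3, 0};       // second query term does not match
        double[] probs = {1e-4, 1e-3}; // made-up collection probabilities
        System.out.println(queryScore(freqs, probs, 120, 2000));
        System.out.println(officialQueryScore(freqs, probs, 120, 2000));
    }
}
```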

Any help on how to implement this, especially getting the
collectionProbabilities into CustomScoreProvider, would be appreciated.

stephen
