Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2016/10/07 14:44:20 UTC

[jira] [Commented] (LUCENE-7480) Wrong Formula in LMDirichletSimilarity

    [ https://issues.apache.org/jira/browse/LUCENE-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15555263#comment-15555263 ] 

Michael McCandless commented on LUCENE-7480:
--------------------------------------------

I am not familiar with {{LMDirichletSimilarity}} in particular, but in general a similarity works in two phases.

Phase 1 is done up front, in {{LMSimilarity.fillBasicStats}}, by gathering the term statistics across the entire index.

Phase 2 is done per segment; this is the code you are pointing to in {{BooleanWeight}}. When {{subScorer}} is {{null}}, the requested term (or sub-query) never appears at all in the current segment. But this is not supposed to alter how scoring works, since Phase 1 should already have computed stats for all terms in the query.

Maybe you can write a test case showing that the score from Lucene's implementation differs from the one given by the original formula?
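
As a starting point for such a test, here is a minimal sketch in plain Java, independent of Lucene, that computes a Dirichlet-smoothed score in two ways: once with "log a_d" added for all n query terms, and once with it added only for the query terms that occur in the document, which is the behavior the issue describes. The class and method names are hypothetical, the numbers in {{main}} are made up, and this does not reproduce Lucene's actual code path (for example, it ignores boosts). The query-only term of the formula is left out because it is the same for every document.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, Lucene-independent sketch; all names here are illustrative assumptions. */
final class DirichletScoreSketch {

    /** log(1 + tf / (mu * p(w|C))): contribution of a query term that occurs in the document. */
    static double matchedTermPart(int tf, double mu, double collectionProb) {
        return Math.log(1.0 + tf / (mu * collectionProb));
    }

    /** log a_d = log(mu / (|d| + mu)): the document-length normalization. */
    static double logAlpha(int docLen, double mu) {
        return Math.log(mu / (docLen + mu));
    }

    /** Sum over matching terms, plus n * log a_d for all n query terms. */
    static double fullFormula(String[] query, Map<String, Integer> docTf, int docLen,
                              Map<String, Double> collectionProb, double mu) {
        double score = 0.0;
        for (String term : query) {
            int tf = docTf.getOrDefault(term, 0);
            if (tf > 0) {
                score += matchedTermPart(tf, mu, collectionProb.get(term));
            }
        }
        return score + query.length * logAlpha(docLen, mu);
    }

    /** Variant described in the issue: log a_d is added only for terms present in the document. */
    static double matchedOnly(String[] query, Map<String, Integer> docTf, int docLen,
                              Map<String, Double> collectionProb, double mu) {
        double score = 0.0;
        for (String term : query) {
            int tf = docTf.getOrDefault(term, 0);
            if (tf > 0) {
                score += matchedTermPart(tf, mu, collectionProb.get(term)) + logAlpha(docLen, mu);
            }
        }
        return score;
    }

    public static void main(String[] args) {
        String[] query = {"lucene", "similarity"};

        // Made-up document: contains "lucene" three times, never "similarity".
        Map<String, Integer> docTf = new HashMap<>();
        docTf.put("lucene", 3);

        // Made-up collection probabilities p(w|C).
        Map<String, Double> collectionProb = new HashMap<>();
        collectionProb.put("lucene", 0.001);
        collectionProb.put("similarity", 0.0005);

        int docLen = 100;
        double mu = 2000.0;

        double full = fullFormula(query, docTf, docLen, collectionProb, mu);
        double partial = matchedOnly(query, docTf, docLen, collectionProb, mu);
        System.out.println("full formula:       " + full);
        System.out.println("matched terms only: " + partial);
        System.out.println("difference:         " + (full - partial));
    }
}
```

The two variants differ by (number of non-matching query terms) * log a_d, which depends on document length. A real test case would compare the first variant against the scores an {{IndexSearcher}} returns for the same small index.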

> Wrong Formula in LMDirichletSimilarity
> --------------------------------------
>
>                 Key: LUCENE-7480
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7480
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Shayan Tabrizi
>
> It seems that LMDirichletSimilarity only calls the "score" method if the term occurs in the document. Otherwise, at line 389 of BooleanWeight (Lucene 6.2.0), subScorer is null, so the clause is not added to the optional list and is never scored.
> However, in the original formula of LM (http://www.stat.uchicago.edu/~lafferty/pdf/smooth-tois.pdf, formula 6), we have "n log a_d" (n is the number of query terms). Therefore, even for the query terms not present in the document, a "log a_d" must be added to the final score.
> But the implementation of LMDirichletSimilarity adds "log a_d" to the score in the "score" method, and therefore it is only added to the final score for the query terms present in the document.
> This can worsen the retrieval results compared to the correct formula. I tried to correct this myself but, because of the many "final" methods and classes, I was not successful. Please check the problem and fix it if approved, and please also tell me how I can correct it before a new release is published.
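
For reference, the formula cited in the description can be written out as below. This is a reconstruction from the linked paper rather than a quotation, so the notation is an assumption: c(w; d) is the count of word w in document d, p(w | C) is the collection language model, p_s is the smoothed model for seen words, and mu is the Dirichlet parameter.

```latex
\log p(q \mid d)
  = \sum_{i:\, c(q_i; d) > 0} \log \frac{p_s(q_i \mid d)}{\alpha_d \, p(q_i \mid C)}
  + n \log \alpha_d
  + \sum_{i=1}^{n} \log p(q_i \mid C)

% Dirichlet smoothing:
p_s(w \mid d) = \frac{c(w; d) + \mu \, p(w \mid C)}{|d| + \mu},
\qquad
\alpha_d = \frac{\mu}{|d| + \mu},
\qquad
\frac{p_s(w \mid d)}{\alpha_d \, p(w \mid C)} = 1 + \frac{c(w; d)}{\mu \, p(w \mid C)}
```

The middle term, n log alpha_d, covers all n query terms whether or not they occur in the document, which is the point raised in the description. The last sum does not depend on the document and so does not affect ranking.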



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
