Posted to dev@lucene.apache.org by "Shayan Tabrizi (JIRA)" <ji...@apache.org> on 2016/10/06 13:42:20 UTC

[jira] [Created] (LUCENE-7478) Wrong Formula in LMDirichletSimilarity

Shayan Tabrizi created LUCENE-7478:
--------------------------------------

             Summary: Wrong Formula in LMDirichletSimilarity
                 Key: LUCENE-7478
                 URL: https://issues.apache.org/jira/browse/LUCENE-7478
             Project: Lucene - Core
          Issue Type: Bug
            Reporter: Shayan Tabrizi
            Priority: Critical


It seems that the formula in LMDirichletSimilarity is wrong, or at least is not the formula given in the C.X. Zhai paper that the class mentions.

The main part of the formula in LMDirichletSimilarity is:
Math.log(1 + freq /
        (mu * ((LMStats)stats).getCollectionProbability())) +
        Math.log(mu / (docLen + mu))

which, once the two logarithms are combined, is the logarithm of:
(mu*p(w|C)+c(w,d))/(p(w|C)*(|d| + mu))

while the formula in the paper is:
(mu*p(w|C)+c(w,d))/(|d| + mu)

So an extra p(w|C) factor is effectively added to the denominator of the formula.
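
The algebra above can be checked numerically. The following is a minimal sketch, not Lucene code: the class name DirichletCheck, its method names, and the sample values are all hypothetical, and only the expression in codeForm is copied from the snippet quoted in this report. It compares the quoted expression with the logarithm of each of the two ratios.

```java
// Hypothetical helper for illustration; not part of Lucene.
class DirichletCheck {

    // Expression as quoted from LMDirichletSimilarity in this report.
    // pwc stands for the collection probability p(w|C).
    static double codeForm(double freq, double docLen, double mu, double pwc) {
        return Math.log(1 + freq / (mu * pwc)) + Math.log(mu / (docLen + mu));
    }

    // Log of the ratio derived from the code: p(w|C) appears in the denominator.
    static double derivedForm(double freq, double docLen, double mu, double pwc) {
        return Math.log((mu * pwc + freq) / (pwc * (docLen + mu)));
    }

    // Log of the ratio quoted as the paper's formula.
    static double paperForm(double freq, double docLen, double mu, double pwc) {
        return Math.log((mu * pwc + freq) / (docLen + mu));
    }

    public static void main(String[] args) {
        // Sample values chosen only for illustration.
        double freq = 3, docLen = 100, mu = 2000, pwc = 0.001;

        double code = codeForm(freq, docLen, mu, pwc);
        double derived = derivedForm(freq, docLen, mu, pwc);
        double paper = paperForm(freq, docLen, mu, pwc);

        System.out.println("code form:    " + code);
        System.out.println("derived form: " + derived);
        System.out.println("paper form:   " + paper);
        // The gap between the code and the paper's form should equal -log p(w|C).
        System.out.println("code - paper: " + (code - paper));
        System.out.println("-log p(w|C):  " + (-Math.log(pwc)));
    }
}
```

With these inputs the code form and the derived form should agree up to floating-point rounding, and the code form should differ from the paper's form by -log p(w|C), which is the extra factor described above.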



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
