Posted to dev@lucene.apache.org by "Shayan Tabrizi (JIRA)" <ji...@apache.org> on 2016/10/06 13:45:21 UTC

[jira] [Updated] (LUCENE-7478) Wrong Formula in LMDirichletSimilarity

     [ https://issues.apache.org/jira/browse/LUCENE-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shayan Tabrizi updated LUCENE-7478:
-----------------------------------
    Description: 
It seems that the formula in LMDirichletSimilarity is wrong, or at least it is not the formula in the C.X. Zhai paper that the class mentions.

The main part of the formula in LMDirichletSimilarity is:
Math.log(1 + freq /
        (mu * ((LMStats)stats).getCollectionProbability())) +
        Math.log(mu / (docLen + mu))

which is in fact the logarithm of:
(mu*p(w|C)+c(w,d))/(p(w|C)*(|d| + mu))

while the formula in the paper is:
(mu*p(w|C)+c(w,d))/(|d| + mu)

So an extra p(w|C) has effectively been added to the denominator of the formula.
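
The algebra can be checked numerically. The following is a minimal standalone sketch, not Lucene code: the class name DirichletFormulaCheck, its method names, and the sample values are all made up for illustration, and plain doubles stand in for freq, docLen, mu and getCollectionProbability(). It compares the quoted expression with the log of the combined fraction and with the log of the fraction without p(w|C) in the denominator.

```java
// Illustrative sketch only: names and sample values are hypothetical, not from Lucene.
final class DirichletFormulaCheck {

    // The expression quoted in this issue, with pwC standing in for
    // ((LMStats)stats).getCollectionProbability().
    static double quotedScore(double freq, double docLen, double mu, double pwC) {
        return Math.log(1 + freq / (mu * pwC)) + Math.log(mu / (docLen + mu));
    }

    // log of (mu*p(w|C) + c(w,d)) / (p(w|C) * (|d| + mu)).
    static double derivedForm(double freq, double docLen, double mu, double pwC) {
        return Math.log((mu * pwC + freq) / (pwC * (docLen + mu)));
    }

    // log of (mu*p(w|C) + c(w,d)) / (|d| + mu), i.e. without p(w|C) in the denominator.
    static double paperForm(double freq, double docLen, double mu, double pwC) {
        return Math.log((mu * pwC + freq) / (docLen + mu));
    }

    public static void main(String[] args) {
        // Arbitrary sample values, chosen only to exercise the formulas.
        double freq = 3, docLen = 120, mu = 2000, pwC = 1e-4;

        double quoted = quotedScore(freq, docLen, mu, pwC);
        double derived = derivedForm(freq, docLen, mu, pwC);
        double paper = paperForm(freq, docLen, mu, pwC);

        // The first difference should be zero up to rounding.
        System.out.println("quoted - derived = " + (quoted - derived));
        // The second difference should match log p(w|C) up to rounding.
        System.out.println("paper - quoted   = " + (paper - quoted));
        System.out.println("log p(w|C)       = " + Math.log(pwC));
    }
}
```

If the two differences behave as described in the comments, the quoted expression and the fraction without p(w|C) in the denominator differ by exactly log p(w|C) for each matching term.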

  was:
It seems that the formula in LMDirichletSimilarity is wrong or at least is not the formula in the mentioned C.X. Zhai paper. 

The main part of formula in LMDirichletSimilarity is:
Math.log(1 + freq /
        (mu * ((LMStats)stats).getCollectionProbability())) +
        Math.log(mu / (docLen + mu))

which is in fact:
(mu*p(w|C)+c(w,d))/(p(w)*(|d| + mu))

while the main formula is:
(mu*p(w|C)+c(w,d))/(|d| + mu)

So a p(w) is practically added to the formula.


> Wrong Formula in LMDirichletSimilarity
> --------------------------------------
>
>                 Key: LUCENE-7478
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7478
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Shayan Tabrizi
>            Priority: Critical
>
> It seems that the formula in LMDirichletSimilarity is wrong, or at least it is not the formula in the C.X. Zhai paper that the class mentions.
> The main part of the formula in LMDirichletSimilarity is:
> Math.log(1 + freq /
>         (mu * ((LMStats)stats).getCollectionProbability())) +
>         Math.log(mu / (docLen + mu))
> which is in fact the logarithm of:
> (mu*p(w|C)+c(w,d))/(p(w|C)*(|d| + mu))
> while the formula in the paper is:
> (mu*p(w|C)+c(w,d))/(|d| + mu)
> So an extra p(w|C) has effectively been added to the denominator of the formula.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
