Posted to dev@lucene.apache.org by "Shayan Tabrizi (JIRA)" <ji...@apache.org> on 2016/10/06 13:45:21 UTC
[jira] [Updated] (LUCENE-7478) Wrong Formula in LMDirichletSimilarity
[ https://issues.apache.org/jira/browse/LUCENE-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shayan Tabrizi updated LUCENE-7478:
-----------------------------------
Description:
It seems that the formula in LMDirichletSimilarity is wrong, or at least is not the formula given in the cited C.X. Zhai paper.
The main part of the formula in LMDirichletSimilarity is:
Math.log(1 + freq /
(mu * ((LMStats)stats).getCollectionProbability())) +
Math.log(mu / (docLen + mu))
which is in fact the logarithm of:
(mu*p(w|C)+c(w,d))/(p(w|C)*(|d| + mu))
while the main formula in the paper is:
(mu*p(w|C)+c(w,d))/(|d| + mu)
So the expression is effectively divided by an extra p(w|C).
was:
It seems that the formula in LMDirichletSimilarity is wrong, or at least is not the formula given in the cited C.X. Zhai paper.
The main part of the formula in LMDirichletSimilarity is:
Math.log(1 + freq /
(mu * ((LMStats)stats).getCollectionProbability())) +
Math.log(mu / (docLen + mu))
which is in fact the logarithm of:
(mu*p(w|C)+c(w,d))/(p(w)*(|d| + mu))
while the main formula in the paper is:
(mu*p(w|C)+c(w,d))/(|d| + mu)
So the expression is effectively divided by an extra p(w).
> Wrong Formula in LMDirichletSimilarity
> --------------------------------------
>
> Key: LUCENE-7478
> URL: https://issues.apache.org/jira/browse/LUCENE-7478
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Shayan Tabrizi
> Priority: Critical
>
> It seems that the formula in LMDirichletSimilarity is wrong, or at least is not the formula given in the cited C.X. Zhai paper.
> The main part of the formula in LMDirichletSimilarity is:
> Math.log(1 + freq /
> (mu * ((LMStats)stats).getCollectionProbability())) +
> Math.log(mu / (docLen + mu))
> which is in fact the logarithm of:
> (mu*p(w|C)+c(w,d))/(p(w|C)*(|d| + mu))
> while the main formula in the paper is:
> (mu*p(w|C)+c(w,d))/(|d| + mu)
> So the expression is effectively divided by an extra p(w|C).
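
The algebra in the description can be checked numerically. The following is a minimal standalone sketch, not Lucene code: the class name, method names, and sample values are hypothetical and chosen only for illustration. It evaluates the quoted expression, the combined fraction, and the smoothed probability (mu*p(w|C)+c(w,d))/(|d| + mu), all in log space, so the size of the discrepancy can be seen directly.

```java
public class DirichletFormulaCheck {

    // The expression quoted in the issue:
    // log(1 + freq / (mu * p)) + log(mu / (docLen + mu)), with p = p(w|C).
    public static double quotedExpression(double freq, double docLen, double mu, double p) {
        return Math.log(1 + freq / (mu * p)) + Math.log(mu / (docLen + mu));
    }

    // Log of the combined fraction from the issue:
    // (mu*p + freq) / (p * (docLen + mu)).
    public static double combinedForm(double freq, double docLen, double mu, double p) {
        return Math.log((mu * p + freq) / (p * (docLen + mu)));
    }

    // Log of the smoothed probability that the issue calls the main formula:
    // (mu*p + freq) / (docLen + mu).
    public static double smoothedLogProbability(double freq, double docLen, double mu, double p) {
        return Math.log((mu * p + freq) / (docLen + mu));
    }

    public static void main(String[] args) {
        // Hypothetical sample values, not taken from the issue.
        double freq = 3, docLen = 100, mu = 2000, p = 0.001;

        double quoted = quotedExpression(freq, docLen, mu, p);
        double combined = combinedForm(freq, docLen, mu, p);
        double smoothed = smoothedLogProbability(freq, docLen, mu, p);

        System.out.println("quoted expression    = " + quoted);
        System.out.println("combined form        = " + combined);
        System.out.println("smoothed log-prob    = " + smoothed);
        // The gap between the quoted expression and the smoothed log-probability
        // should equal -log(p), i.e. the extra division by p(w|C).
        System.out.println("quoted - smoothed    = " + (quoted - smoothed));
        System.out.println("-log(p)              = " + (-Math.log(p)));
    }
}
```

Under these assumptions the first two values agree up to floating-point error, and the quoted expression differs from the smoothed log-probability by -log(p(w|C)), which is the extra factor the description points out.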