Posted to dev@mahout.apache.org by Gangadhar Nittala <np...@gmail.com> on 2010/09/10 07:06:41 UTC

Beginner question about entropy calculation

All,
I am a first-time user of Mahout. I checked out the code and was able
to get the build going. I was checking the Tasks list (I use Eclipse)
and saw one in LogLikelihoodTest.java about checking the epsilons.

While reading the code in LogLikelihood.java
(org.apache.mahout.math.stats.LogLikelihood), I saw that the Shannon
entropy calculation seemed different from the one defined on Wikipedia
[http://en.wikipedia.org/wiki/Entropy_(information_theory)].

I wrote a small Python script (attached: llr.py) to compare the
calculation in Mahout (org.apache.mahout.math.stats.LogLikelihood) with
the one defined by Ted Dunning in
http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html.
Even though the entropy and LLR calculations look different, the final
LLR output is the same with both methods.

I am trying to understand why the two methods are equivalent. Could you
let me know why this is the case, or point me to a reference I can
check? If this is not the right list for this question, I am sorry; I
shall try the mahout-users list.

Thank you
Gangadhar

Re: Beginner question about entropy calculation

Posted by Ted Dunning <te...@gmail.com>.
No problem.

It is always good to review these things to make sure that they really are
correct.  I will add a comment
to the Mahout code to clarify the issue.

On Fri, Sep 10, 2010 at 6:15 PM, Gangadhar Nittala
<np...@gmail.com> wrote:

> Thank you for taking the time to respond to my question.
>

Re: Beginner question about entropy calculation

Posted by Gangadhar Nittala <np...@gmail.com>.
Thank you for the reply Ted. I see that they are equivalent now.

From Wikipedia:
H(K) = -sum_i ( (k_i/N) * log(k_i/N) ) where N = sum_i(k_i)

From the code in org.apache.mahout.math.stats.LogLikelihood:
H(K) = -( sum_i( k_i*log(k_i) ) - N*log(N) ) where N = sum_i(k_i)

From your blog entry:
H(K) = sum_i( (k_i/N)*log(k_i) ) - log(N) where N = sum_i(k_i)

During the LLR calculation though, in LogLikelihood.java the LLR is
calculated as
2 * ( H(matrix) - H(rowSum) - H(colSum) )

whereas in your blog entry it is mentioned as
2 * N * ( H(matrix) - H(rowSums(k)) - H(colSums(k)) )

The extra N in the second calculation is needed because the entropy in
your blog entry is normalized by N, whereas the entropy in the code
absorbs that factor. I am still trying to understand the different ways
in which the rowSum and columnSum entropies are calculated in the LLR
function. I think they are also equivalent to each other; I just need
to work that out.

Nevertheless, the LLR value resulting from both methods is the same,
which is why I was trying to work out how they are equivalent.
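
For the record, here is a quick numerical check (a standalone Python 3
sketch with made-up counts; this is not the attached llr.py) of the
factor-of-N relationship and of the two LLR forms, written with the
rowSums + colSums - matrix grouping so that both values come out
non-negative:

import math

def entropy_wikipedia(counts):
    # normalized Shannon entropy: H(K) = -sum_i (k_i/N) * log(k_i/N)
    n = sum(counts)
    return -sum((k / n) * math.log(k / n) for k in counts if k > 0)

def entropy_unnormalized(counts):
    # un-normalized form described above: H(K) = N*log(N) - sum_i k_i*log(k_i)
    n = sum(counts)
    return n * math.log(n) - sum(k * math.log(k) for k in counts if k > 0)

# made-up 2x2 contingency table [[k11, k12], [k21, k22]]
k11, k12, k21, k22 = 13, 1000, 1000, 100000
n = k11 + k12 + k21 + k22
cells = [k11, k12, k21, k22]
row_sums = [k11 + k12, k21 + k22]
col_sums = [k11 + k21, k12 + k22]

# the un-normalized entropy is exactly N times the normalized one
assert math.isclose(entropy_unnormalized(cells), n * entropy_wikipedia(cells))

# so the factor of N can either sit outside (normalized H) or be absorbed
# into each entropy term (un-normalized H)
llr_normalized = 2 * n * (entropy_wikipedia(row_sums)
                          + entropy_wikipedia(col_sums)
                          - entropy_wikipedia(cells))
llr_unnormalized = 2 * (entropy_unnormalized(row_sums)
                        + entropy_unnormalized(col_sums)
                        - entropy_unnormalized(cells))
print(llr_normalized, llr_unnormalized)  # agree up to floating point rounding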

Thank you for taking the time to respond to my question.
Gangadhar


On Fri, Sep 10, 2010 at 2:14 AM, Ted Dunning <te...@gmail.com> wrote:
> What difference are you seeing?
>
> The entropy calculation in LogLikelihood is what I would call un-normalized
> entropy
>
>   H(K) = - \sum_i k_i log (k_i / n)
>
> This makes the expression for the log-likelihood ratio slightly simpler.
> The log(k_i / n) is also split into two parts to avoid doing lots of
> divisions.  This gives:
>
>    H(K) = -\sum_i k_i (log(k_i) - log(n))
>         = -\sum_i k_i log(k_i) + \sum_i k_i log(n)
>         =  n log(n) - \sum_i k_i log(k_i)
>
>
>
> On Thu, Sep 9, 2010 at 10:06 PM, Gangadhar Nittala
> <np...@gmail.com> wrote:
>
>> All,
>> I am a first-time user of Mahout. I checked out the code and was able
>> to get the build going. I was checking the Tasks list (I use Eclipse)
>> and saw one in LogLikelihoodTest.java about checking the epsilons.
>>
>> While reading the code in LogLikelihood.java
>> (org.apache.mahout.math.stats.LogLikelihood), I saw that the Shannon
>> entropy calculation seemed different from the one defined on Wikipedia
>> [http://en.wikipedia.org/wiki/Entropy_(information_theory)].
>>
>> I wrote a small Python script (attached: llr.py) to compare the
>> calculation in Mahout (org.apache.mahout.math.stats.LogLikelihood) with
>> the one defined by Ted Dunning in
>> http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html.
>> Even though the entropy and LLR calculations look different, the final
>> LLR output is the same with both methods.
>>
>> I am trying to understand why the two methods are equivalent. Could you
>> let me know why this is the case, or point me to a reference I can
>> check? If this is not the right list for this question, I am sorry; I
>> shall try the mahout-users list.
>>
>> Thank you
>> Gangadhar
>>
>

Re: Beginner question about entropy calculation

Posted by Ted Dunning <te...@gmail.com>.
What difference are you seeing?

The entropy calculation in LogLikelihood is what I would call un-normalized
entropy

   H(K) = - \sum_i k_i log (k_i / n)

This makes the expression for the log-likelihood ratio slightly simpler.
The log(k_i / n) is also split into two parts to avoid doing lots of
divisions.  This gives:

    H(K) = -\sum_i k_i (log(k_i) - log(n))
         = -\sum_i k_i log(k_i) + \sum_i k_i log(n)
         =  n log(n) - \sum_i k_i log(k_i)
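
In code, the two algebraically equal forms look like this (a small
Python 3 sketch with arbitrary example counts, not the actual Mahout
source):

import math

def entropy_with_divisions(counts):
    # direct form: H(K) = -sum_i k_i * log(k_i / n), one division per term
    n = sum(counts)
    return -sum(k * math.log(k / n) for k in counts if k > 0)

def entropy_without_divisions(counts):
    # rearranged form: H(K) = n*log(n) - sum_i k_i*log(k_i), no divisions
    n = sum(counts)
    return n * math.log(n) - sum(k * math.log(k) for k in counts if k > 0)

counts = [13, 1000, 1000, 100000]   # arbitrary example counts
print(entropy_with_divisions(counts))      # both print the same value,
print(entropy_without_divisions(counts))   # up to floating point rounding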



On Thu, Sep 9, 2010 at 10:06 PM, Gangadhar Nittala
<np...@gmail.com> wrote:

> All,
> I am a first-time user of Mahout. I checked out the code and was able
> to get the build going. I was checking the Tasks list (I use Eclipse)
> and saw one in LogLikelihoodTest.java about checking the epsilons.
>
> While reading the code in LogLikelihood.java
> (org.apache.mahout.math.stats.LogLikelihood), I saw that the Shannon
> entropy calculation seemed different from the one defined on Wikipedia
> [http://en.wikipedia.org/wiki/Entropy_(information_theory)].
>
> I wrote a small Python script (attached: llr.py) to compare the
> calculation in Mahout (org.apache.mahout.math.stats.LogLikelihood) with
> the one defined by Ted Dunning in
> http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html.
> Even though the entropy and LLR calculations look different, the final
> LLR output is the same with both methods.
>
> I am trying to understand why the two methods are equivalent. Could you
> let me know why this is the case, or point me to a reference I can
> check? If this is not the right list for this question, I am sorry; I
> shall try the mahout-users list.
>
> Thank you
> Gangadhar
>