Posted to user@mahout.apache.org by Ted Dunning <te...@gmail.com> on 2014/05/23 19:25:45 UTC

Re: Question Regarding Entropy calculation in Mahout

I am sorry, but I don't understand your questions or needs sufficiently to
answer.




On Wed, Apr 23, 2014 at 12:21 PM, Darshan Sonagara <
darshan.sonagara@gmail.com> wrote:

> Sir, please reply to me as soon as possible.
> Thanks in advance.
>
>
> On Tue, Apr 22, 2014 at 11:50 PM, Darshan Sonagara <
> darshan.sonagara@gmail.com> wrote:
>
> > Waiting for the reply, sir.
> >
> >
> > On Tue, Apr 22, 2014 at 7:13 PM, Darshan Sonagara <
> > darshan.sonagara@gmail.com> wrote:
> >
> >> Thanks for the reply, sir.
> >>
> >> Actually, I am clustering documents so that similar kinds of documents
> >> end up in the same cluster as much as possible.
> >> I can inspect the result in the cluster dump output by looking at the
> >> top terms.
> >> I also found that the result differs when I vary the distance measure.
> >> But I want some mathematical proof that one measure is better than
> >> another, so I need to calculate the entropy and purity of the clusters.
> >> I am not able to find any command in Mahout that gives me entropy as a
> >> result.
> >> I found Entropy.java under the Mahout common math statistic package, but
> >> I don't know what I should give it as input to get the entropy or other
> >> parameters, so that I can tell how good or bad a cluster is.
> >>
> >>
> >>
> >> On Tue, Apr 22, 2014 at 7:01 PM, Ted Dunning <ted.dunning@gmail.com>
> >> wrote:
> >>
> >>> On Tue, Apr 22, 2014 at 12:11 AM, Darshan Sonagara <
> >>> darshan.sonagara@gmail.com> wrote:
> >>>
> >>> > But the problem is that I want to check whether my clustering is
> >>> > good or bad, so I need to calculate the entropy value. I have no
> >>> > idea how to calculate entropy in Mahout or by any other technique.
> >>> > By finding the entropy I can reach a good conclusion.
> >>> > So please, can anyone help me with this?
> >>> >
> >>>
> >>> Actually, the way to tell whether your clustering is good is to see if
> >>> it works for its intended use.
> >>>
> >>> What do you want to use clustering for?
> >>>
> >>
> >>
> >>
> >> --
> >>
> >> *Regards From:*
> >>
> >> *Darshan  Sonagara*
> >> *Collaborative Platform lead,** SSN Team | Gujarat Section.*
> >>
> >> *Vice-Chairperson | **GCET IEEE SB.*
> >>
> >> (: +*91* 9408002452
> >>
> >>
> >>
> >>  : Darshan Sonagara<
> http://www.linkedin.com/pub/darshan-sonagara/64/11a/b54>
> >>   : Darshan Sonagara <http://www.facebook.com/darshansonagara>
> >>
> >>

Re: Question Regarding Entropy calculation in Mahout

Posted by Yash Sharma <ya...@gmail.com>.
Well, I was not aware of the perplexity calculation. Your point makes perfect
sense.
Entropies calculated independently for each cluster would not serve any
purpose.

So the question moves back to the questioner, and I'd move back to the
textbooks :)

Peace,
Yash


On Sat, May 24, 2014 at 12:01 AM, Ted Dunning <te...@gmail.com> wrote:

> Yash,
>
> I am not sure how your suggestion will work.
>
> The problem is that clustering algorithms tend to make hard assignments.
> Thus, if you try to compute entropy relative to some reference probability
> distribution (aka perplexity [1]), a reference clustering will provide
> 1 or 0 as the probability.  Any item that gets classified into a different
> cluster will cause the entropy to include a term -1 log 0, which is
> infinite.
>
> One way to deal with this is to assign probability 1-\epsilon to the
> cluster an item is in and \epsilon/(k-1) for all the other clusters.  You
> then have issues finding a good value of \epsilon which seem to me to be
> out of scope for the original question.
>
> Entropy relative to the fraction of documents in each cluster is
> easier to compute, but much harder to understand.  Computing mutual
> information (not entropy) on the confusion matrix between two clusterings
> can also be done, but that also seems beyond the original question.
>
> As such, I think that the burden is on the original questioner to describe
> the problem more accurately.

Re: Question Regarding Entropy calculation in Mahout

Posted by Ted Dunning <te...@gmail.com>.
Yash,

I am not sure how your suggestion will work.

The problem is that clustering algorithms tend to make hard assignments.
Thus, if you try to compute entropy relative to some reference probability
distribution (aka perplexity [1]), a reference clustering will provide
1 or 0 as the probability.  Any item that gets classified into a different
cluster will cause the entropy to include a term -1 log 0, which is
infinite.

One way to deal with this is to assign probability 1-\epsilon to the
cluster an item is in and \epsilon/(k-1) for all the other clusters.  You
then have issues finding a good value of \epsilon which seem to me to be
out of scope for the original question.
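
A minimal sketch of the smoothing idea above, in Python rather than Mahout's
Java. It is only an illustration under stated assumptions: the function name
and the sample data are hypothetical, the cluster ids of the two clusterings
are assumed to have been matched up already, and natural logarithms are used.
It shows how strongly the result depends on the choice of \epsilon.

```python
import math

def smoothed_cross_entropy(reference, assigned, k, epsilon):
    """Mean of -log q(assigned cluster) over all items, where q gives
    1 - epsilon to an item's reference cluster and epsilon / (k - 1)
    to each of the other k - 1 clusters."""
    if k < 2:
        raise ValueError("need at least two clusters")
    if not 0.0 < epsilon < 1.0:
        raise ValueError("epsilon must be strictly between 0 and 1")
    if len(reference) != len(assigned) or not reference:
        raise ValueError("need two non-empty clusterings of the same items")
    total = 0.0
    for ref, got in zip(reference, assigned):
        # Without smoothing, a disagreement would contribute -log 0.
        q = 1.0 - epsilon if ref == got else epsilon / (k - 1)
        total += -math.log(q)
    return total / len(reference)

# Hypothetical data: four items, three clusters, one disagreement.
reference = [0, 0, 1, 2]
assigned = [0, 0, 1, 1]

for eps in (0.1, 0.01, 0.001):
    value = smoothed_cross_entropy(reference, assigned, k=3, epsilon=eps)
    print(f"epsilon={eps}: {value:.3f} nats")
```

The single disagreement dominates the total more and more as \epsilon
shrinks, which is why picking \epsilon is a problem of its own.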

Entropy relative to the fraction of documents in each cluster is
easier to compute, but much harder to understand.  Computing mutual
information (not entropy) on the confusion matrix between two clusterings
can also be done, but that also seems beyond the original question.
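
For illustration, mutual information between two hard clusterings can be
sketched as below. This is plain Python, not a Mahout API; the function name
and the sample labels are hypothetical. The confusion matrix is held as
sparse counts of (cluster in A, cluster in B) pairs, so empty cells are
simply skipped and no log 0 term arises.

```python
import math
from collections import Counter

def mutual_information(labels_a, labels_b):
    """Mutual information (in bits) between two hard clusterings
    of the same items, given as parallel sequences of cluster ids."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must cover the same items")
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))  # sparse confusion matrix
    count_a = Counter(labels_a)               # row totals
    count_b = Counter(labels_b)               # column totals
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a, b) * log2( p(a, b) / (p(a) * p(b)) )
        mi += (c / n) * math.log2(c * n / (count_a[a] * count_b[b]))
    return mi

# The same partition with the cluster ids swapped still scores 1 bit,
# so the ids of the two clusterings do not need to be matched up.
same = mutual_information([0, 0, 1, 1], [1, 1, 0, 0])
# Two clusterings that carry no information about each other score 0.
unrelated = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
print(same, unrelated)
```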

As such, I think that the burden is on the original questioner to describe
the problem more accurately.



On Fri, May 23, 2014 at 11:21 AM, Yash Sharma <ya...@gmail.com> wrote:

> Hi Darshan,
> What I understand from your problem is that:
> - You have clustered a few documents.
> - You want to verify the accuracy of your clustering, and you want to use
> entropy for that.
> - You are not sure what the input for the entropy calculation should be.
>
> Possible solution:
> The entropy calculation would expect a String[] from which to calculate the
> information contained in the data/sequence.
> The simplest way is to keep all the documents labelled with categories.
> - Cluster the docs as you usually do.
> - For the entropy calculation, create a String[] for every cluster, each array
> containing the labels of all the docs in that cluster.
> cluster1 = {"sports", "tech", "tech", "tech", "book", ...}
> cluster2 = {"sports", "drama", "sports", "sports", ...}
> etc.
>
> - Calculate the entropy of each cluster.
> Entropy measures the degree of randomness of a system. High entropy
> means there is a high degree of randomness in the system.
> Lower entropy is desirable when validating the accuracy of your clustering
> technique.
>
> P.S. You can use the Entropy.java class for your validation, but
> it's deprecated now.
>
> Having said that, kindly be patient while asking questions and provide
> more info on what work you have done so far, with your findings. It would
> enable all of us to answer quickly and correctly :)
>
> Hope this was helpful. Other approaches are welcome!
>
> Peace,
> Yash

Re: Question Regarding Entropy calculation in Mahout

Posted by Yash Sharma <ya...@gmail.com>.
Hi Darshan,
What I understand from your problem is that:
- You have clustered a few documents.
- You want to verify the accuracy of your clustering, and you want to use
entropy for that.
- You are not sure what the input for the entropy calculation should be.

Possible solution:
The entropy calculation would expect a String[] from which to calculate the
information contained in the data/sequence.
The simplest way is to keep all the documents labelled with categories.
- Cluster the docs as you usually do.
- For the entropy calculation, create a String[] for every cluster, each array
containing the labels of all the docs in that cluster.
cluster1 = {"sports", "tech", "tech", "tech", "book", ...}
cluster2 = {"sports", "drama", "sports", "sports", ...}
etc.

- Calculate the entropy of each cluster.
Entropy measures the degree of randomness of a system. High entropy
means there is a high degree of randomness in the system.
Lower entropy is desirable when validating the accuracy of your clustering
technique.
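
The steps above can be sketched as follows. This is plain Python, not the
Entropy.java class, and it assumes every document already carries a category
label; the function name and the two sample clusters (the elided entries are
left out) are only illustrative. The size-weighted average at the end is an
added convenience, not part of the steps above.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy (in bits) of the label distribution in one cluster."""
    total = len(labels)
    counts = Counter(labels)
    # Sum of p * log2(1 / p) over the labels that occur in the cluster.
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Hypothetical labelled clusters, following the example above.
clusters = {
    "cluster1": ["sports", "tech", "tech", "tech", "book"],
    "cluster2": ["sports", "drama", "sports", "sports"],
}

per_cluster = {name: label_entropy(labels) for name, labels in clusters.items()}

# Size-weighted average, so that large mixed clusters count for more.
n = sum(len(labels) for labels in clusters.values())
weighted = sum(len(labels) / n * per_cluster[name]
               for name, labels in clusters.items())

for name, h in per_cluster.items():
    print(f"{name}: {h:.3f} bits")
print(f"weighted average: {weighted:.3f} bits")
```

A cluster whose documents all share one label scores 0 bits; a cluster split
evenly between two labels scores 1 bit.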

P.S. You can use the Entropy.java class for your validation, but
it's deprecated now.

Having said that, kindly be patient while asking questions and provide
more info on what work you have done so far, with your findings. It would
enable all of us to answer quickly and correctly :)

Hope this was helpful. Other approaches are welcome!

Peace,
Yash


On Fri, May 23, 2014 at 10:55 PM, Ted Dunning <te...@gmail.com> wrote:

> I am sorry, but I don't understand your questions or needs sufficiently to
> answer.