Posted to user@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2010/01/09 18:18:43 UTC

Re: Validating clustering output

I opened https://issues.apache.org/jira/browse/MAHOUT-236 for this.

Some questions inline.

On Jun 17, 2009, at 2:51 AM, Ted Dunning wrote:

> A principled approach to cluster evaluation is to measure how well the
> cluster membership captures the structure of unseen data.  A natural way to
> do this is to measure how much of the entropy of the data is captured by
> cluster membership.  For k-means and its natural L_2 metric, the natural
> cluster quality metric is the squared distance from the nearest centroid
> adjusted by the log_2 of the number of clusters.

Makes sense.
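
To make "adjusted by the log_2 of the number of clusters" concrete, here is
one possible reading (a sketch only; the Gaussian coding model and its
variance sigma^2 are assumptions, not something stated in the thread).  With
centroids c_1..c_k and a(i) the index of the centroid nearest x_i, describing
a held-out point via its cluster costs roughly

  log_2(k) + ||x_i - c_{a(i)}||^2 / (2 sigma^2 ln 2) + const   bits,

so the quality figure is the average squared residual over held-out points,
penalized by log_2(k) bits per point for the cluster id; smaller is better.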

>  This can be compared to
> the squared magnitude of the original data
> or the squared deviation from the
> centroid for all of the data.  

Not sure I am following these two ideas enough to code it up.  

For magnitude, do you mean compare the above against the magnitude of each vector or do you mean something else?

And for the second part, do you mean to calculate the centroid for all the data and then compare the sq. deviation of the vector under scrutiny to that centroid?

Code and/or math here would be helpful.
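
A rough sketch of how those comparisons might be coded (illustration only,
not Mahout API; the log_2(k) penalty and the two baselines follow the reading
sketched above, and the data is made up):

public class ClusterQualitySketch {

  static double squaredDistance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  // Average squared distance from each point to its nearest centroid.
  static double avgResidual(double[][] points, double[][] centroids) {
    double total = 0.0;
    for (double[] p : points) {
      double best = Double.POSITIVE_INFINITY;
      for (double[] c : centroids) {
        best = Math.min(best, squaredDistance(p, c));
      }
      total += best;
    }
    return total / points.length;
  }

  public static void main(String[] args) {
    // Toy data: two obvious clusters, two centroids.
    double[][] points = {{1.0, 2.0}, {1.1, 1.9}, {8.0, 9.0}, {7.9, 9.2}};
    double[][] centroids = {{1.05, 1.95}, {7.95, 9.1}};
    int n = points.length;
    int k = centroids.length;

    // Residual plus a log_2(k) "cluster id" charge per point (one reading of
    // "adjusted by the log_2 of the number of clusters").
    double penalized = avgResidual(points, centroids) + Math.log(k) / Math.log(2);

    // Baseline 1: average squared magnitude of the original vectors.
    double avgMagnitude = 0.0;
    for (double[] p : points) {
      avgMagnitude += squaredDistance(p, new double[p.length]);
    }
    avgMagnitude /= n;

    // Baseline 2: average squared deviation from the grand centroid.
    double[] mean = new double[points[0].length];
    for (double[] p : points) {
      for (int i = 0; i < mean.length; i++) {
        mean[i] += p[i] / n;
      }
    }
    double avgDeviation = 0.0;
    for (double[] p : points) {
      avgDeviation += squaredDistance(p, mean);
    }
    avgDeviation /= n;

    System.out.printf("penalized residual=%.3f  |x|^2 baseline=%.3f  deviation baseline=%.3f%n",
        penalized, avgMagnitude, avgDeviation);
  }
}

The held-out variant would feed test points through the same three
computations, using the centroids fitted on the training data.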

Also, what about other clustering algs besides k-Means and the L_2 metric?

> The idea is that you are changing the
> representation of the data by allocating some of the bits in your original
> representation to represent which cluster each point is in.  If those bits
> aren't made up by the residue being small then your clustering is making a
> bad trade-off.
> 
> In the past, I have used other more heuristic measures as well.  One of the
> key characteristics that I would like to see out of a clustering is a degree
> of stability.  Thus, I look at the fractions of points that are assigned to
> each cluster

So if we just spit out the percentage of points for each cluster, then someone could re-run with the original data plus the held-out data, and those percentages should still be about the same, assuming the held-out data is randomly distributed in the space.

> or the distribution of distances from the cluster centroid.

Likewise, we'd expect the held-out points to be randomly distributed distance-wise as well, right?

> These values should be relatively stable when applied to held-out data.
> 
> For text, you can actually compute perplexity which measures how well
> cluster membership predicts what words are used.  This is nice because you
> don't have to worry about the entropy of real valued numbers.

Do you have a good ref. on perplexity and/or some R code (or other)?

Thanks,
Grant

Re: Validating clustering output

Posted by Grant Ingersoll <gs...@apache.org>.

On Jan 9, 2010, at 12:18 PM, Grant Ingersoll wrote:
>> 
>> For text, you can actually compute perplexity which measures how well
>> cluster membership predicts what words are used.  This is nice because you
>> don't have to worry about the entropy of real valued numbers.
> 
> Do you have a good ref. on perplexity and/or some R code (or other)?

In looking a little more at this (via http://en.wikipedia.org/wiki/Perplexity), it seems we may already have most of this, given that
o.a.m.math.stats.LogLikelihood has the entropy calculation and perplexity
is just b^entropy(), right?  Or am I misreading?

-Grant
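
A minimal sketch of that relationship (assuming base-2 entropy over a
cluster's word counts; whether LogLikelihood's entropy uses the same base
would need checking, so the arithmetic is spelled out here):

import java.util.HashMap;
import java.util.Map;

public class PerplexitySketch {

  // Shannon entropy, in bits, of the distribution given by raw word counts.
  static double entropyBits(Map<String, Integer> counts) {
    double total = 0.0;
    for (int c : counts.values()) {
      total += c;
    }
    double h = 0.0;
    for (int c : counts.values()) {
      if (c > 0) {
        double p = c / total;
        h -= p * (Math.log(p) / Math.log(2));
      }
    }
    return h;
  }

  public static void main(String[] args) {
    // Toy word counts for one cluster.
    Map<String, Integer> clusterWordCounts = new HashMap<String, Integer>();
    clusterWordCounts.put("hadoop", 40);
    clusterWordCounts.put("mahout", 35);
    clusterWordCounts.put("cluster", 25);

    double h = entropyBits(clusterWordCounts);
    double perplexity = Math.pow(2.0, h);  // b^entropy with b = 2

    System.out.printf("entropy=%.3f bits, perplexity=%.3f%n", h, perplexity);
  }
}

For held-out evaluation you would usually exponentiate the cross-entropy of
the test words under the cluster's (smoothed) word distribution rather than
the training entropy itself.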

Re: Validating clustering output

Posted by Ted Dunning <te...@gmail.com>.
What I meant was that the distribution of distances for the members of a
cluster in the training data should be similar to the distribution of the
distances for the members of the cluster in the test data.  The distribution
of the number of members in each cluster should also be similar for training
and test.

This is not a panacea in terms of evaluating clustering, but it will
definitely highlight cases where the original clustering had too many
clusters and thus was over-fitting the data.  For instance, if a cluster has
a single member, the centroid will be right on top of that point and the
distance will be zero.  In the test data it is very unlikely that any point
will land in exactly the same place, so you will get a different value.
Typically what you want is not so much a traditional reject-the-null kind of
test (like a Kolmogorov-Smirnov test, for example), but rather a way to tell
whether the test and training data definitively show similar distributions
(i.e. grouping not-similar and can't-tell into the reject pile).  One handy
way to do this is to use the Kolmogorov-Smirnov statistic, but without
normalizing for sample size.
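
A sketch of that unnormalized statistic (just the largest gap between the two
empirical CDFs of the per-point distances, with the usual sqrt(n*m/(n+m))
scaling left out; the distance arrays here are toy data):

import java.util.Arrays;

public class KsSketch {

  // Maximum absolute difference between the two empirical CDFs.
  static double ksStatistic(double[] train, double[] test) {
    double[] a = train.clone();
    double[] b = test.clone();
    Arrays.sort(a);
    Arrays.sort(b);
    int i = 0;
    int j = 0;
    double maxGap = 0.0;
    while (i < a.length && j < b.length) {
      double x = Math.min(a[i], b[j]);
      while (i < a.length && a[i] <= x) {
        i++;
      }
      while (j < b.length && b[j] <= x) {
        j++;
      }
      double gap = Math.abs((double) i / a.length - (double) j / b.length);
      maxGap = Math.max(maxGap, gap);
    }
    return maxGap;
  }

  public static void main(String[] args) {
    // Distances from each point to its cluster centroid, training vs. test.
    double[] trainDistances = {0.1, 0.2, 0.2, 0.3, 0.5, 0.6};
    double[] testDistances = {0.15, 0.25, 0.4, 0.55};
    System.out.println("unnormalized KS = " + ksStatistic(trainDistances, testDistances));
  }
}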

For testing the number of members in a cluster, you can use LLR, but I would
tend to prefer computing binomial mutual information for each cluster,
comparing the number of members of the cluster versus the number of
non-members, and training versus test.  This will give you the same
affirmative properties.  Mutual information is the same as LLR divided by
twice the number of samples.
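
A sketch of that per-cluster check (the 2x2 layout, cluster members vs.
non-members by training vs. test, follows the description above; the entropy
form of the LLR is the standard one, spelled out here rather than calling
into Mahout, and the counts are toy data):

public class ClusterMiSketch {

  static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  // N*ln(N) - sum(k*ln(k)): N times the Shannon entropy (in nats) of the counts.
  static double entropy(long... counts) {
    long n = 0;
    double sum = 0.0;
    for (long c : counts) {
      sum += xLogX(c);
      n += c;
    }
    return xLogX(n) - sum;
  }

  // G^2 statistic (log-likelihood ratio) for a 2x2 contingency table.
  static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double colEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    return 2.0 * Math.max(0.0, rowEntropy + colEntropy - matrixEntropy);
  }

  public static void main(String[] args) {
    // Toy counts for one cluster: members and non-members, training and test.
    long trainIn = 120, trainOut = 880;  // 12% of training points in the cluster
    long testIn = 9, testOut = 91;       // 9% of test points in the cluster

    long n = trainIn + trainOut + testIn + testOut;
    double llr = logLikelihoodRatio(trainIn, trainOut, testIn, testOut);
    double mi = llr / (2.0 * n);  // LLR divided by twice the number of samples

    System.out.printf("LLR=%.4f, mutual information=%.6f nats%n", llr, mi);
  }
}

Small values for every cluster mean training and test agree on each cluster's
share of the data.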

Again, these tests will help you avoid over-fitting, but they won't help you
determine whether one cluster or two is better (if you have lots of data, the
two-cluster case should always be better).  One heuristic is to use the
largest number of clusters such that none of them fail the over-fitting test.

On Sat, Jan 9, 2010 at 9:18 AM, Grant Ingersoll <gs...@apache.org> wrote:

> >  This can be compared to
> > the squared magnitude of the original data
> > or the squared deviation from the
> > centroid for all of the data.
>
> Not sure I am following these two ideas enough to code it up.
>
> For magnitude, do you mean compare the above against the magnitude of each
> vector or do you mean something else?
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Validating clustering output

Posted by Ted Dunning <te...@gmail.com>.
On Sat, Jan 9, 2010 at 9:18 AM, Grant Ingersoll <gs...@apache.org> wrote:

> > In the past, I have used other more heuristic measures as well.  One of
> the
> > key characteristics that I would like to see out of a clustering is a
> degree
> > of stability.  Thus, I look at the fractions of points that are assigned
> to
> > each cluster
>
> So if we just spit out the percentage of points for each cluster, then
> someone could re-run with the original data plus the held-out data, and
> those percentages should still be about the same, assuming the held-out data
> is randomly distributed in the space.
>

Spit out counts, of course, rather than percentages, so that you can
distinguish small-count anomalies.


>
> > or the distribution of distances from the cluster centroid.
>
> Likewise, we'd expect the held-out points to be randomly distributed
> distance-wise as well, right?
>

Not so much randomly as distributed randomly *the*same*way* for both test
and training.


-- 
Ted Dunning, CTO
DeepDyve