Posted to user@mahout.apache.org by Aysu Ezen <ae...@ncsu.edu> on 2013/02/05 17:57:53 UTC

Clustering accuracy

Hello,

As I understand it from the book, the ClusterDumper tool can be used to get
the top features of each cluster and the centroid vector. However, I have a
dataset with manual labels on it. I would like to evaluate the clusters
against the manual labels to calculate the accuracy of the clustering (set
the majority label of each cluster as its class and calculate TP/FP rates).
Is there a way to find out which instance belongs to which cluster so that I
can compute the accuracy?

Thanks

Re: Clustering accuracy

Posted by Ted Dunning <te...@gmail.com>.
Estimating accuracy this way will almost always give you very poor results.
The reason is that unsupervised clustering will draw its own boundaries,
which are very unlikely to match your own.
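
For reference, the majority-label estimate described in the question can be
sketched roughly as below. This is only an illustration in plain Python, not
Mahout code: it assumes the cluster assignments and manual labels have already
been exported as two parallel lists, and the function name and toy data are
made up for the example.

```python
from collections import Counter, defaultdict


def majority_label_accuracy(cluster_ids, labels):
    """Fraction of points whose label matches their cluster's majority label."""
    # Count how often each manual label occurs inside each cluster.
    per_cluster = defaultdict(Counter)
    for cluster_id, label in zip(cluster_ids, labels):
        per_cluster[cluster_id][label] += 1
    # A point counts as correct if it carries the majority label of its cluster.
    correct = sum(counts.most_common(1)[0][1] for counts in per_cluster.values())
    return correct / len(labels)


# Hypothetical toy data: 8 points, 3 clusters, 3 manual labels.
cluster_ids = [0, 0, 0, 1, 1, 1, 2, 2]
labels = ["a", "a", "b", "b", "b", "a", "c", "c"]
accuracy = majority_label_accuracy(cluster_ids, labels)
print(accuracy)
```

Note that this number can only go up as clusters are added (one cluster per
point scores 1.0), so it is best compared across runs with the same number of
clusters.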

If you want to make this work you can do a few different things:

a) semi-supervised clustering.  Include your target variables during
training and don't include them in testing.  This helps force the cluster
boundaries to align with your definitions.

b) use lots of clusters as features to a classifier.  Consider using 3-20
times as many clusters as you have categories.  Then use proximity or
distance to those clusters as features for classifiers.
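
As a rough illustration of option (b), the sketch below turns a point into a
feature vector of distances to each cluster centroid. It is plain Python rather
than Mahout code, it assumes Euclidean distance and centroids that have already
been extracted, and the function name and toy values are hypothetical. The
classifier that would consume these features is not shown.

```python
import math


def centroid_distance_features(point, centroids):
    """Return one feature per cluster: the Euclidean distance to its centroid."""
    return [math.dist(point, centroid) for centroid in centroids]


# Hypothetical centroids from a clustering run (2-D for readability).
centroids = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]

# Each instance becomes a vector with len(centroids) entries, which can then
# be passed to any supervised classifier together with its manual label.
features = centroid_distance_features((0.0, 0.0), centroids)
print(features)
```

With many more clusters than categories, each category can be covered by
several clusters, and the classifier learns which clusters go together.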


A quick diagnostic for any technique like this is to see whether a table
containing one row per cluster and one column per target category has
higher than expected mutual information.  If so, then the clusters are
encoding your categories in some way.
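
The diagnostic above can be sketched as follows, assuming the cluster-by-category
counts are already available as a list of rows. This is plain Python, not Mahout
code; the function name and the toy tables are made up for the example, and the
result is in bits.

```python
import math


def mutual_information(table):
    """Mutual information (bits) of a table: rows = clusters, columns = categories."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            if count > 0:  # empty cells contribute nothing
                joint = count / total
                # joint / (p_row * p_col), written with raw counts
                ratio = count * total / (row_sums[i] * col_sums[j])
                mi += joint * math.log2(ratio)
    return mi


# Hypothetical tables: clusters that match categories vs. clusters that do not.
aligned = mutual_information([[5, 0], [0, 5]])
unrelated = mutual_information([[5, 5], [5, 5]])
print(aligned, unrelated)
```

On real data the estimate is rarely exactly zero even for unrelated clusters,
so one way to judge "higher than expected" is to compare against the values
obtained after shuffling the category labels a number of times.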
