Posted to dev@mahout.apache.org by Dan Filimon <da...@gmail.com> on 2013/04/05 13:47:30 UTC

Generalizing ConfusionMatrix and using the Adjusted Rand Index

As part of evaluating cluster quality, I'd like to implement a bunch of
quality measures, especially external ones.

The one that I think would be particularly useful is the Adjusted Rand
Index [1].
Using a contingency table built from the partitions of 2 clusterings, it
returns a score measuring the similarity of the partitions: at most 1
(higher being better), around 0 for unrelated partitions, and occasionally
slightly negative.
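To make that concrete, here is a rough sketch of the computation in Python
(purely illustrative; the names are mine, and a real implementation would be
a method on Mahout's ConfusionMatrix in Java):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index of two flat cluster assignments.

    Builds the contingency table of co-occurrence counts and applies the
    standard ARI formula: (index - expected index) / (max index - expected).
    """
    n = len(labels_a)
    contingency = Counter(zip(labels_a, labels_b))  # cell counts n_ij
    row_sums = Counter(labels_a)                    # cluster sizes, clustering A
    col_sums = Counter(labels_b)                    # cluster sizes, clustering B
    index = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in row_sums.values())
    sum_cols = sum(comb(c, 2) for c in col_sums.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    return (index - expected) / (max_index - expected)
```

Identical partitions score 1.0 even under an arbitrary relabeling, which is
why an unlabeled contingency table is enough; unrelated partitions score
near 0, and the adjusted index can dip slightly below 0.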

First of all, I'd like to know your thoughts on using ARI as a metric.

Second, there's an implementation of ConfusionMatrix that is NxN. I'd like
to extend this class to support unlabeled partitions of different sizes and
add a method that computes the ARI.

What are your thoughts?

[1] http://en.wikipedia.org/wiki/Rand_index

Re: Generalizing ConfusionMatrix and using the Adjusted Rand Index

Posted by Dan Filimon <da...@gmail.com>.
Does nobody have any thoughts on this?

Ted, please? :)



Re: Generalizing ConfusionMatrix and using the Adjusted Rand Index

Posted by Ted Dunning <te...@gmail.com>.
Yeah... Dan convinced me that the measure is invariant over cluster
permutations, so it is fine.



Re: Generalizing ConfusionMatrix and using the Adjusted Rand Index

Posted by Dan Filimon <da...@gmail.com>.
I think I follow what you said about the entropy-based measure.

But why wouldn't we know the assignment between groups?
Here's how I implemented it [1].

I first got the lists of centroids for both clusterings, then, for each data
point, found the closest centroid in each group and incremented the
corresponding element of the confusion matrix.
So the assignment is given by the centroids and the distance measure.

It looks valid to me. Doesn't it?

[1]
https://github.com/dfilimon/mahout/blob/skm/core/src/main/java/org/apache/mahout/clustering/streaming/cluster/ClusteringUtils.java#L198
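As a sketch of what [1] does (Python here just for brevity; the helper names
are mine, not the actual Java API):

```python
import math

def closest(point, centroids):
    """Index of the centroid nearest to the point under Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def confusion_matrix(points, centroids_a, centroids_b):
    """Rows index the nearest centroid from the first clustering,
    columns the nearest centroid from the second."""
    matrix = [[0] * len(centroids_b) for _ in range(len(centroids_a))]
    for point in points:
        matrix[closest(point, centroids_a)][closest(point, centroids_b)] += 1
    return matrix
```

So the confusion matrix ends up unlabeled: rows and columns are just
centroid indices from the two clusterings, matched per point by the
distance measure.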



Re: Generalizing ConfusionMatrix and using the Adjusted Rand Index

Posted by Ted Dunning <te...@gmail.com>.
Based on spending 2 minutes reading Wikipedia, I think that the Rand Index
assumes that the assignment between groups in the two clusterings is known.

In fact, I think that this is typically not known.

As an example, I created a dataset with 10,000 data points in 10
dimensions.  I created 10 highly separated clusters by putting Gaussian
distributions at randomly selected corners of the unit hyper-cube.

    c = matrix(0, nrow=10, ncol=10)   # pre-allocate the matrix of corners
    for (i in 1:10) {c[i,] = (rnorm(10)>0)+0}

I repeated the generation of the corners until I got 10 different
centroids.  I checked this by looking at the rank of the matrix of centers:

   svd(c)$d

The last value here will be non-zero if all the rows of c are independent.

Generating the data is pretty easy:

    data = matrix(0, nrow=10000, ncol=10)   # pre-allocate the data matrix
    for (i in 1:10) {
        for (j in 1:1000) {
            data[(i-1)*1000+j,] = matrix(rnorm(10,sd=0.1), ncol=10) + c[i,]
        }
    }

Normal k-means clustering will very commonly produce a poor clustering of
these data even though they are very separated.  If you do 10 restarts, you
will likely get a good result.  I ran several clusterings and retained the
cluster assignments:

    k1=kmeans(data, centers=10, nstart=10)
    k2=kmeans(data, centers=10, nstart=1)
    k3=kmeans(data, centers=10, nstart=1)

We can now produce plots of these clusterings by projecting the data using
SVD onto 2 dimensions and coloring with the cluster assignment  (I have put
a sample plot at https://dl.dropboxusercontent.com/u/36863361/clusters.png).

   ss=svd(data)
   plot(data %*% ss$v[,1:2], col=rainbow(10)[k1$cluster])

We can abuse sparse matrices to get a contingency table of cluster
assignments

    require('Matrix')
    sparseMatrix(i=k1$cluster, j=k2$cluster, x=1)

The results look something like this

    10 x 10 sparse Matrix of class "dgCMatrix"
     [1,]    .   . 1000   .    .    .   .    .    .   .
     [2,]    . 541    .   .    .    .   .    .    . 459
     [3,]    .   .    . 499    .    . 501    .    .   .
     [4,]    .   .    .   .    .    .   .    . 1000   .
     [5,]    .   . 1000   .    .    .   .    .    .   .
     [6,]    .   .    .   .    . 1000   .    .    .   .
     [7,]    .   .    .   . 1000    .   .    .    .   .
     [8,]    .   .    .   .    .    .   . 1000    .   .
     [9,] 1000   .    .   .    .    .   .    .    .   .
    [10,]    .   .    .   . 1000    .   .    .    .   .

Each of the 1000 values here represents a cluster that is recognized by both
runs.  The smaller numbers represent split clusters.

So the problem is how to tell that an arbitrarily permuted diagonal matrix
is perfect while this matrix isn't so good.

My first inclination would be to look at mutual information based measures
like the multinomial log-likelihood ratio (AKA LLR in Mahout land).

This leads to these kinds of numbers:

> mi
function(m) {entropy(m)-entropy(colSums(m))-entropy(rowSums(m))}
> entropy
function(m) {m = m/sum(m);sum(m * log (m+(m==0)))}
> mi(sparseMatrix(i=k6$cluster, j=k1$cluster, x=1))
[1] 2.025326
> mi(sparseMatrix(i=k2$cluster, j=k1$cluster, x=1))
[1] 2.163956
> mi(sparseMatrix(i=k1$cluster, j=k1$cluster, x=1))
[1] 2.302585
> mi(sparseMatrix(i=k1$cluster, j=k6$cluster, x=1))
[1] 2.025326
> mi(sparseMatrix(i=k6$cluster, j=k6$cluster, x=1))
[1] 2.163619
> mi(matrix(rmultinom(1, 10000, prob=rep(0.01, 100)), nrow=10))
[1] 0.005527828
>

The characteristics here are:

Better clustering agreement produces higher mutual information.  Equivalent
clusterings that differ only in their labels score identically against each
other as against themselves.  Nearly identical clusterings with small
problems produce slightly lower scores.  Unrelated assignments produce
near-zero values.
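
The label-permutation invariance is easy to check numerically.  Here is a
self-contained Python translation of the mi function above (my own sketch,
reporting nats):

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Mutual information (nats) between two labelings, given as
    (label_a, label_b) pairs, estimated from co-occurrence counts."""
    n = len(pairs)
    joint = Counter(pairs)                    # contingency-table cells
    left = Counter(a for a, _ in pairs)       # row sums
    right = Counter(b for _, b in pairs)      # column sums
    return sum((c / n) * math.log(c * n / (left[a] * right[b]))
               for (a, b), c in joint.items())

labels = [i for i in range(10) for _ in range(1000)]   # 10 clusters of 1000
shifted = [(x + 3) % 10 for x in labels]               # same clusters, relabeled

print(mutual_information(list(zip(labels, labels))))   # log(10) ~ 2.302585
print(mutual_information(list(zip(labels, shifted))))  # identical score
```

A perfect 10-way agreement gives log(10) ~ 2.302585 nats, which matches the
mi(k1, k1) value in the transcript, and the relabeled copy gets exactly the
same score.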


