Posted to user@mahout.apache.org by Immo Micus <im...@googlemail.com> on 2011/07/26 15:05:56 UTC

Cluster-center and cluster-radius

Hello,

This is my first email to the mahout-user list.
I am trying to do some clustering with Mahout and I have a question concerning the cluster-center and cluster-radius.

For testing, I clustered 10 points using the KMeansClusterer:

points:
 [13.000, 4455.000] 
 [13.000, 5101.000] 
 [13.000,   333.000] 
 [13.000, 3412.000] 
 [13.000,   823.000] 
 [13.000,   238.000]
 [13.000,   951.000] 
 [  9.000,   311.000] 
 [  9.000,   970.000] 
 [10.000, 2885.000]

This is the method I am using:

clusters = KMeansClusterer.clusterPoints(points, initial_clusters, measure, 10, 0.001);

initial_clusters are 2 points picked at random from the points above; measure is EuclideanDistanceMeasure.


And this is the result of the converged clusters VL-0 and VL-1:

VL-0{n=6 c=[11.667, 604.333] r=[1.886, 315.059]}
VL-1{n=4 c=[12.250, 3963.250] r=[1.299, 866.428]}

If I understand this output correctly, n is the number of points assigned to the cluster, c is the cluster-center and r is the cluster-radius.
So every point belongs to either cluster 0 or cluster 1. You can even guess which points belong to which cluster, but I am confused by the calculated cluster-center and cluster-radius:
For example, [9.000, 970.000] should belong to cluster 0, but 9.000 < 9.781 (= 11.667 - 1.886) and 970.000 > 919.392 (= 604.333 + 315.059). The point is not within range of cluster 0 and it obviously does not belong to cluster 1, yet all 10 points are assigned to clusters. Can someone please tell me where the mistake is?
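
The numbers in the question can be checked directly. What follows is a minimal sketch in plain Java, not Mahout code; the class and method names are hypothetical, and it assumes that cluster 0 consists of the six points with y below 1000, which is a guess from the data. It computes the per-coordinate mean and the per-coordinate population standard deviation of those six points so they can be compared with the reported c and r.

```java
import java.util.Locale;

// Hypothetical helper, not part of Mahout: compares the reported c and r
// with the mean and population standard deviation of six input points.
class ClusterStats {
    // Assumed members of VL-0: the six points with y < 1000.
    static final double[][] CLUSTER0 = {
        {13, 333}, {13, 823}, {13, 238}, {13, 951}, {9, 311}, {9, 970}
    };

    // Per-coordinate mean of a set of points.
    static double[] mean(double[][] pts) {
        double[] m = new double[pts[0].length];
        for (double[] p : pts)
            for (int d = 0; d < m.length; d++) m[d] += p[d] / pts.length;
        return m;
    }

    // Per-coordinate population standard deviation (divides by n, not n - 1).
    static double[] std(double[][] pts) {
        double[] m = mean(pts);
        double[] s = new double[m.length];
        for (double[] p : pts)
            for (int d = 0; d < m.length; d++) s[d] += (p[d] - m[d]) * (p[d] - m[d]);
        for (int d = 0; d < s.length; d++) s[d] = Math.sqrt(s[d] / pts.length);
        return s;
    }

    public static void main(String[] args) {
        double[] c = mean(CLUSTER0);
        double[] r = std(CLUSTER0);
        System.out.printf(Locale.ROOT, "c=[%.3f, %.3f] r=[%.3f, %.3f]%n", c[0], c[1], r[0], r[1]);
    }
}
```

Under that assumption the computed values agree with the reported c and r of VL-0 to three decimals. That suggests r is better read as a per-coordinate spread of one standard deviation than as a bounding radius, in which case a point such as [9.000, 970.000] can lie outside c ± r and still belong to the cluster. This reading is inferred from the numbers in this post only.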


greetings, Immo








Re: Cluster-center and cluster-radius

Posted by Immo Micus <im...@googlemail.com>.
Hi Christoph,

thanks for your reply!

The cluster assignment is pretty much what I want to do:

I have some points that I want to cluster. That is what I use KMeansClusterer.clusterPoints(...) for. Unfortunately this method does not give me an item-to-cluster map. The only result I get is a List<List<Cluster>> object (this should be all clusters for all iterations). And the Cluster object only includes the n, c(enter) and r(adius) values, as stated below.

That is why I want to calculate on my own which points belong to which cluster. Now I have the problem that not all points are within the calculated range of the clusters.

These are the converged clusters from the List<List<Cluster>> object:
>> VL-0{n=6 c=[11.667, 604.333] r=[1.886, 315.059]}
>> VL-1{n=4 c=[12.250, 3963.250] r=[1.299, 866.428]}

The point
>> [  9.000,   970.000] 
should fit into one of the clusters, but it does not. I wonder how this can happen, or whether I have misunderstood something completely.
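
One way to build the missing item-to-cluster map by hand is to assign each point to the nearest converged center instead of testing it against c ± r. The sketch below is plain Java with hypothetical names; it does not call Mahout and assumes the Euclidean distance and the two centers exactly as printed above.

```java
// Hypothetical helper, not part of Mahout: derives a point-to-cluster
// assignment from the converged centers by picking the nearest center.
class NearestCenter {
    static final double[][] POINTS = {
        {13, 4455}, {13, 5101}, {13, 333}, {13, 3412}, {13, 823},
        {13, 238}, {13, 951}, {9, 311}, {9, 970}, {10, 2885}
    };
    // Centers c of VL-0 and VL-1 as printed in the clustering output.
    static final double[][] CENTERS = {{11.667, 604.333}, {12.250, 3963.250}};

    // Plain Euclidean distance between two points of equal dimension.
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int d = 0; d < a.length; d++) sum += (a[d] - b[d]) * (a[d] - b[d]);
        return Math.sqrt(sum);
    }

    // Index of the center closest to p.
    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        for (int k = 1; k < centers.length; k++)
            if (distance(p, centers[k]) < distance(p, centers[best])) best = k;
        return best;
    }

    public static void main(String[] args) {
        for (double[] p : POINTS)
            System.out.println("[" + p[0] + ", " + p[1] + "] -> VL-" + nearest(p, CENTERS));
    }
}
```

With these centers the point [9.000, 970.000] is much closer to the center of VL-0 than to that of VL-1, so it lands in VL-0 even though it lies outside c ± r.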

Could you please tell me what you mean by "use the clustering flag"?

This is the method in detail from the KMeansClusterer class. I don't see how to set any flags.

public static List<List<Cluster>> clusterPoints(Iterable<org.apache.mahout.math.Vector> points,
                                                List<Cluster> clusters,
                                                DistanceMeasure measure,
                                                int maxIter,
                                                double distanceThreshold) {
    // compiled code
    throw new RuntimeException("Compiled Code");
}

thanks for your help, 
Immo

On Jul 26, 2011, at 3:57 PM, Christoph Brücke wrote:

> Hi Immo,
> 
> Did you run an extra cluster assignment at the end? K-means works in two phases: in the first, every point is assigned to a cluster; in the second, the cluster centroids are recalculated from that assignment. So my idea is that you could use the clustering flag to get a final cluster assignment.
> I didn't do the math though, this is just an educated guess.
> 
> Hope this helps,
> Christoph
> 
> 


Re: Cluster-center and cluster-radius

Posted by Christoph Brücke <ch...@campus.tu-berlin.de>.
Hi Immo,

Did you run an extra cluster assignment at the end? K-means works in two phases: in the first, every point is assigned to a cluster; in the second, the cluster centroids are recalculated from that assignment. So my idea is that you could use the clustering flag to get a final cluster assignment.
I didn't do the math though, this is just an educated guess.
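
The two phases can be sketched as follows. This is a plain Java illustration with hypothetical names, not the Mahout implementation; it assumes the first and third input points as starting centers and uses squared Euclidean distance. The labels it returns always come from an assignment made after the last centroid update, which is the "final cluster assignment" idea.

```java
import java.util.Arrays;

// Hypothetical sketch, not Mahout code: k-means as two alternating phases.
class TwoPhaseKMeans {
    static final double[][] POINTS = {
        {13, 4455}, {13, 5101}, {13, 333}, {13, 3412}, {13, 823},
        {13, 238}, {13, 951}, {9, 311}, {9, 970}, {10, 2885}
    };

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int d = 0; d < a.length; d++) sum += (a[d] - b[d]) * (a[d] - b[d]);
        return sum;
    }

    // Phase 1: label each point with the index of its nearest center.
    static int[] assign(double[][] pts, double[][] centers) {
        int[] labels = new int[pts.length];
        for (int i = 0; i < pts.length; i++)
            for (int k = 1; k < centers.length; k++)
                if (squaredDistance(pts[i], centers[k]) < squaredDistance(pts[i], centers[labels[i]]))
                    labels[i] = k;
        return labels;
    }

    // Phase 2: move each center to the mean of the points labelled with it.
    // Simplification: a cluster that ends up empty gets a center of all zeros.
    static double[][] update(double[][] pts, int[] labels, int k) {
        double[][] centers = new double[k][pts[0].length];
        int[] counts = new int[k];
        for (int i = 0; i < pts.length; i++) {
            counts[labels[i]]++;
            for (int d = 0; d < pts[i].length; d++) centers[labels[i]][d] += pts[i][d];
        }
        for (int j = 0; j < k; j++)
            for (int d = 0; d < centers[j].length; d++)
                if (counts[j] > 0) centers[j][d] /= counts[j];
        return centers;
    }

    // Alternates the two phases until the labels stop changing or maxIter is hit.
    static int[] run(double[][] pts, double[][] centers, int maxIter) {
        int[] labels = assign(pts, centers);
        for (int it = 0; it < maxIter; it++) {
            centers = update(pts, labels, centers.length);
            int[] next = assign(pts, centers);
            boolean stable = Arrays.equals(next, labels);
            labels = next;
            if (stable) break;
        }
        return labels;
    }

    public static void main(String[] args) {
        // Assumed starting centers: the first and third input points.
        double[][] start = {POINTS[0], POINTS[2]};
        System.out.println(Arrays.toString(run(POINTS, start, 10)));
    }
}
```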

Hope this helps,
Christoph



Christoph Brücke
christoph.bruecke@campus.tu-berlin.de




Re: Cluster-center and cluster-radius

Posted by Ted Dunning <te...@gmail.com>.
The first problem is that the two coordinates don't have comparable
variability. This means that the distance is going to be pretty much just
the y-distance.

One way to improve this is to rescale each coordinate by dividing it by the
standard deviation of that coordinate.
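
The rescaling can be sketched as below, in plain Java with hypothetical names. It assumes the population standard deviation (dividing by n); the sample version would serve equally well, and a coordinate with zero spread would need special handling that is left out here.

```java
// Hypothetical sketch, not Mahout code: rescales each coordinate by its spread.
class Standardize {
    static final double[][] POINTS = {
        {13, 4455}, {13, 5101}, {13, 333}, {13, 3412}, {13, 823},
        {13, 238}, {13, 951}, {9, 311}, {9, 970}, {10, 2885}
    };

    // Population standard deviation of coordinate d across all points.
    static double std(double[][] pts, int d) {
        double mean = 0;
        for (double[] p : pts) mean += p[d] / pts.length;
        double var = 0;
        for (double[] p : pts) var += (p[d] - mean) * (p[d] - mean) / pts.length;
        return Math.sqrt(var);
    }

    // Returns a copy of pts with every coordinate divided by its standard deviation.
    static double[][] scale(double[][] pts) {
        double[][] out = new double[pts.length][pts[0].length];
        for (int d = 0; d < pts[0].length; d++) {
            double s = std(pts, d);
            for (int i = 0; i < pts.length; i++) out[i][d] = pts[i][d] / s;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println("std before: x=" + std(POINTS, 0) + " y=" + std(POINTS, 1));
        double[][] scaled = scale(POINTS);
        System.out.println("std after:  x=" + std(scaled, 0) + " y=" + std(scaled, 1));
    }
}
```

After scaling, both coordinates have unit spread, so neither dominates the Euclidean distance. The scaled points would then be passed to the clusterer in place of the raw ones.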

Depending on what your y coordinate is intended to mean, you might consider
a log transform of y.

I pulled these points into R and used k-means on the data.  The centroids
that I got were exactly the same as yours so the Mahout clustering appears
to be working well.

The cluster results are:

K-means clustering with 2 clusters of sizes 6, 4

Cluster means:
         x         y
1 11.66667  604.3333
2 12.25000 3963.2500

Clustering vector:
 [1] 2 2 1 2 1 1 1 1 1 2


If you use a log-transform of y, then these are the cluster results:

K-means clustering with 2 clusters of sizes 3, 7

Cluster means:
          x     logy
1  9.333333 6.861456
2 13.000000 7.132130

Clustering vector:
 [1] 2 2 2 2 2 2 2 1 1 1


Neither of these is very satisfying.  The untransformed version separates
only on y while the transformed version separates only on x.
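
Why each version separates on a single coordinate shows up in the per-coordinate spread. The sketch below is plain Java with hypothetical names; it assumes the natural logarithm (the logy means reported above are consistent with it) and the population standard deviation.

```java
// Hypothetical sketch: compares the spread of x, y and log(y) for the ten points.
class LogSpread {
    static final double[][] POINTS = {
        {13, 4455}, {13, 5101}, {13, 333}, {13, 3412}, {13, 823},
        {13, 238}, {13, 951}, {9, 311}, {9, 970}, {10, 2885}
    };

    // Returns a copy of pts with y replaced by its natural logarithm.
    static double[][] logY(double[][] pts) {
        double[][] out = new double[pts.length][2];
        for (int i = 0; i < pts.length; i++) {
            out[i][0] = pts[i][0];
            out[i][1] = Math.log(pts[i][1]);
        }
        return out;
    }

    // Population standard deviation of coordinate d across all points.
    static double std(double[][] pts, int d) {
        double mean = 0;
        for (double[] p : pts) mean += p[d] / pts.length;
        double var = 0;
        for (double[] p : pts) var += (p[d] - mean) * (p[d] - mean) / pts.length;
        return Math.sqrt(var);
    }

    public static void main(String[] args) {
        double[][] logged = logY(POINTS);
        System.out.println("std x      = " + std(POINTS, 0));
        System.out.println("std y      = " + std(POINTS, 1));
        System.out.println("std log(y) = " + std(logged, 1));
    }
}
```

On the raw data the spread of y is roughly a thousand times that of x, so the Euclidean distance is dominated by y. After the log transform the spread of x is the larger of the two, which fits the transformed clustering splitting on x.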
