Posted to user@mahout.apache.org by Immo Micus <im...@googlemail.com> on 2011/07/26 15:05:56 UTC
Cluster-center and cluster-radius
Hello,
this is my first email to the Mahout user list.
I am trying to do some clustering with Mahout and I have a question concerning the cluster center and cluster radius.
For testing, I clustered 10 points using the KMeansClusterer:
points:
[13.000, 4455.000]
[13.000, 5101.000]
[13.000, 333.000]
[13.000, 3412.000]
[13.000, 823.000]
[13.000, 238.000]
[13.000, 951.000]
[ 9.000, 311.000]
[ 9.000, 970.000]
[10.000, 2885.000]
This is the method I am using:
clusters = KMeansClusterer.clusterPoints(points, initial_clusters, measure, 10, 0.001);
initial_clusters holds 2 randomly chosen points from the list above; measure is an EuclideanDistanceMeasure.
And this is the result of the converged clusters VL-0 and VL-1:
VL-0{n=6 c=[11.667, 604.333] r=[1.886, 315.059]}
VL-1{n=4 c=[12.250, 3963.250] r=[1.299, 866.428]}
If I understand this output correctly, n is the number of points assigned to the cluster, c is the cluster center, and r is the radius of the cluster.
So every point belongs to either cluster 0 or cluster 1. You can even guess which points belong to which cluster, but I am confused by the calculated cluster center and cluster radius:
For example, [ 9.000, 970.000] should belong to cluster 0, but 9.000 < 9.781 (11.667 - 1.886) and 970.000 > 919.392 (604.333 + 315.059). The point is not within the range of cluster 0 and it obviously does not belong to cluster 1, yet all 10 points are assigned to clusters. Can someone please tell me where the mistake is?
greetings, Immo
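A quick check of the numbers in this message hints at why the point falls outside c ± r. Recomputing the per-coordinate mean and population standard deviation of the six low-y points reproduces the c and r printed for VL-0 to three decimals, so r behaves like a standard deviation rather than a bounding radius, and points can legitimately lie more than one r from the center. The sketch below is plain Java with no Mahout dependency; the names RadiusCheck, mean and popStd are illustrative, and the idea that Mahout computes r this way is inferred from the numbers, not confirmed anywhere in this thread.

```java
import java.util.Locale;

final class RadiusCheck {
    // Per-coordinate mean of the given points.
    static double[] mean(double[][] pts) {
        double[] m = new double[pts[0].length];
        for (double[] p : pts)
            for (int d = 0; d < m.length; d++) m[d] += p[d] / pts.length;
        return m;
    }

    // Per-coordinate population standard deviation (divides by n, not n - 1).
    static double[] popStd(double[][] pts) {
        double[] m = mean(pts);
        double[] s = new double[m.length];
        for (double[] p : pts)
            for (int d = 0; d < m.length; d++) s[d] += (p[d] - m[d]) * (p[d] - m[d]);
        for (int d = 0; d < s.length; d++) s[d] = Math.sqrt(s[d] / pts.length);
        return s;
    }

    public static void main(String[] args) {
        // The six points with small y, assumed here to be the members of VL-0.
        double[][] vl0 = {{13, 333}, {13, 823}, {13, 238}, {13, 951}, {9, 311}, {9, 970}};
        double[] c = mean(vl0), r = popStd(vl0);
        System.out.printf(Locale.ROOT, "c=[%.3f, %.3f] r=[%.3f, %.3f]%n", c[0], c[1], r[0], r[1]);
    }
}
```

Under this reading, [9.000, 970.000] sits a little more than one standard deviation from the center in both coordinates, which is unremarkable for a member of the cluster.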
Re: Cluster-center and cluster-radius
Posted by Immo Micus <im...@googlemail.com>.
Hi Christoph,
thanks for your reply!
The cluster assignment is pretty much what I want to do:
I have some points that I want to cluster. That's what I use KMeansClusterer.clusterPoints(...) for. Unfortunately, this method does not give me an item-to-cluster map. The only result I get is a List<List<Cluster>> object (this should be all clusters for all iterations), and a Cluster object only includes n, c (center) and r (radius), as shown below.
That is why I want to calculate on my own which points belong to which cluster. My problem now is that not all points are within the calculated range of the clusters.
These are the converged clusters from the List<List<Cluster>> object:
>> VL-0{n=6 c=[11.667, 604.333] r=[1.886, 315.059]}
>> VL-1{n=4 c=[12.250, 3963.250] r=[1.299, 866.428]}
The point
>> [ 9.000, 970.000]
should fit into one of the clusters, but it does not. I wonder how this can happen, or whether I have misunderstood something completely.
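One way to recover the point-to-cluster map by hand is to assign each point to the center with the smallest Euclidean distance, instead of testing whether it lies within c ± r. The sketch below is plain Java and does not call Mahout; NearestCenter, sqDist and nearest are illustrative names, and the centers are the rounded values printed above.

```java
import java.util.Arrays;

final class NearestCenter {
    // Squared Euclidean distance; the square root is not needed for comparisons.
    static double sqDist(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return s;
    }

    // Index of the center closest to the point (ties go to the lower index).
    static int nearest(double[] point, double[][] centers) {
        int best = 0;
        for (int k = 1; k < centers.length; k++)
            if (sqDist(point, centers[k]) < sqDist(point, centers[best])) best = k;
        return best;
    }

    public static void main(String[] args) {
        double[][] centers = {{11.667, 604.333}, {12.250, 3963.250}};
        double[][] points = {
            {13, 4455}, {13, 5101}, {13, 333}, {13, 3412}, {13, 823},
            {13, 238}, {13, 951}, {9, 311}, {9, 970}, {10, 2885}};
        for (double[] p : points)
            System.out.println(Arrays.toString(p) + " -> VL-" + nearest(p, centers));
    }
}
```

With this rule every point gets exactly one cluster, which matches the observation that all 10 points are assigned even though some lie outside c ± r.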
Could you please tell me what you mean by "use the clustering flag"?
This is the method in detail from the KMeansClusterer class. I don't see how to set any flags.
public static List<List<Cluster>> clusterPoints(Iterable<org.apache.mahout.math.Vector> points, List<Cluster> clusters, DistanceMeasure measure, int maxIter, double distanceThreshold) {
//compiled code
throw new RuntimeException("Compiled Code");
}
thanks for your help,
Immo
On Jul 26, 2011, at 3:57 PM, Christoph Brücke wrote:
> Hi Immo,
>
> did you run an extra cluster assignment at the end? K-means works in two phases: in the first, every point is assigned to a cluster; in the second, the cluster centroids are recalculated from that assignment. So my idea is that you could use the clustering flag to get a final cluster assignment.
> I didn't do the math though, this is just an educated guess.
>
> Hope this helps,
> Christoph
Re: Cluster-center and cluster-radius
Posted by Christoph Brücke <ch...@campus.tu-berlin.de>.
Hi Immo,
did you run an extra cluster assignment at the end? K-means works in two phases: in the first, every point is assigned to a cluster; in the second, the cluster centroids are recalculated from that assignment. So my idea is that you could use the clustering flag to get a final cluster assignment.
I didn't do the math though, this is just an educated guess.
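The two phases described here can be sketched as a single iteration in plain Java. This is a generic k-means step under stated assumptions, not Mahout's implementation; KMeansStep, nearest and step are hypothetical names, and an empty cluster simply keeps its previous center.

```java
import java.util.Arrays;

final class KMeansStep {
    // Index of the center with the smallest squared Euclidean distance to p.
    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int k = 0; k < centers.length; k++) {
            double s = 0;
            for (int d = 0; d < p.length; d++) s += (p[d] - centers[k][d]) * (p[d] - centers[k][d]);
            if (s < bestDist) { bestDist = s; best = k; }
        }
        return best;
    }

    // One iteration: phase 1 assigns points, phase 2 recomputes the centers.
    static double[][] step(double[][] points, double[][] centers) {
        int k = centers.length, dim = centers[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];
        for (double[] p : points) {                 // phase 1: assignment
            int c = nearest(p, centers);
            counts[c]++;
            for (int d = 0; d < dim; d++) sums[c][d] += p[d];
        }
        double[][] next = new double[k][];
        for (int c = 0; c < k; c++) {               // phase 2: centroid update
            next[c] = centers[c].clone();           // an empty cluster keeps its old center
            if (counts[c] > 0)
                for (int d = 0; d < dim; d++) next[c][d] = sums[c][d] / counts[c];
        }
        return next;
    }

    public static void main(String[] args) {
        double[][] points = {
            {13, 4455}, {13, 5101}, {13, 333}, {13, 3412}, {13, 823},
            {13, 238}, {13, 951}, {9, 311}, {9, 970}, {10, 2885}};
        double[][] centers = {points[2], points[0]}; // two of the points as initial centers
        for (int i = 0; i < 10; i++) centers = step(points, centers);
        System.out.println(Arrays.deepToString(centers));
    }
}
```

Because the loop ends with a centroid update, the membership of each point is only known after calling nearest once more per point, which is the extra final assignment suggested in this reply.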
Hope this helps,
Christoph
Christoph Brücke
christoph.bruecke@campus.tu-berlin.de
Re: Cluster-center and cluster-radius
Posted by Ted Dunning <te...@gmail.com>.
The first problem is that the two coordinates don't have comparable
variability: x ranges from 9 to 13 while y ranges from 238 to 5101.
This means that distance is going to be pretty much just y-distance.
One way to improve this is to rescale each coordinate by dividing by the
standard deviation of that coordinate.
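A minimal sketch of that rescaling in plain Java follows. The names Standardize, std and scale are illustrative, and the population form of the standard deviation (dividing by n) is an assumption; the sample form divides by n - 1 and gives slightly different scale factors.

```java
final class Standardize {
    // Population standard deviation of column d (assumption: divide by n).
    static double std(double[][] pts, int d) {
        double mean = 0;
        for (double[] p : pts) mean += p[d] / pts.length;
        double var = 0;
        for (double[] p : pts) var += (p[d] - mean) * (p[d] - mean) / pts.length;
        return Math.sqrt(var);
    }

    // Returns a copy with every coordinate divided by its column's standard deviation.
    // Assumes every column actually varies; a constant column would divide by zero.
    static double[][] scale(double[][] pts) {
        int dim = pts[0].length;
        double[] s = new double[dim];
        for (int d = 0; d < dim; d++) s[d] = std(pts, d);
        double[][] out = new double[pts.length][dim];
        for (int i = 0; i < pts.length; i++)
            for (int d = 0; d < dim; d++) out[i][d] = pts[i][d] / s[d];
        return out;
    }
}
```

After scaling, both columns have unit standard deviation, so a difference in x counts as much toward the Euclidean distance as a comparable difference in y.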
Depending on what your y coordinate is intended to mean, you might consider
a log transform of y.
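As a small illustration, the transform can be applied to the points before clustering. The sketch below assumes two-dimensional points and the natural logarithm, which appears consistent with the log-transformed cluster means reported in this reply; LogY and logY are hypothetical names.

```java
final class LogY {
    // Returns a copy of the 2-D points with y replaced by ln(y); every y must be > 0.
    static double[][] logY(double[][] pts) {
        double[][] out = new double[pts.length][];
        for (int i = 0; i < pts.length; i++)
            out[i] = new double[] {pts[i][0], Math.log(pts[i][1])};
        return out;
    }
}
```

The transformed points can then be passed to whatever clustering call is in use, in place of the raw ones.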
I pulled these points into R and used k-means on the data. The centroids
that I got were exactly the same as yours so the Mahout clustering appears
to be working well.
The cluster results are:
K-means clustering with 2 clusters of sizes 6, 4
Cluster means:
         x         y
1 11.66667  604.3333
2 12.25000 3963.2500
Clustering vector:
[1] 2 2 1 2 1 1 1 1 1 2
If you use a log-transform of y, then these are the cluster results:
K-means clustering with 2 clusters of sizes 3, 7
Cluster means:
          x     logy
1  9.333333 6.861456
2 13.000000 7.132130
Clustering vector:
[1] 2 2 2 2 2 2 2 1 1 1
Neither of these is very satisfying. The untransformed version separates
only on y while the transformed version separates only on x.