Posted to user@mahout.apache.org by Colum Foley <co...@gmail.com> on 2013/03/05 19:33:18 UTC

KMeans Results: Finding Cluster Members

Hi,

I have a simple enough question: having run K-Means clustering
(generated the clustered points, and clusters-x, clusters-x-final
directories), how do you identify which items were clustered together?
Apologies if this is trivial but I could not see an obvious answer in
the documentation.

Clusterdump seems to be the tool to use, but when I run it I only see
cluster ids, centroid values, radius, etc., and it is not obvious to me
how to resolve individual item names. I am looking for something of
the following form:

cluster_id = (keys)*

for example:

cluster_1 = {"user104x","user89dc","user22da",...}
cluster_2 = {"user19c","user11c",....}


Thanks,
Colum

Re: KMeans Results: Finding Cluster Members

Posted by Matt Molek <mp...@gmail.com>.

I'm still a bit of a beginner with Mahout, but as far as I know,
NamedVectors are the only way to preserve that sort of data through the
clustering algorithm. They're easy to construct if you need to do it
yourself. Just pass your current vector object and whatever name you want
into the constructor, NamedVector(Vector delegate, String name).

From your original email, you're looking to get something along the lines
of: cluster_1 = {"user104x","user89dc","user22da"...}

Is that name/id data coming out of Elephant-Bird somewhere along with the
vectors? As the key of a seqfile, with the vector as the value, maybe? I
haven't used Elephant-Bird before. If that's the case, it would be a simple
map-only Hadoop job to convert all your vectors to NamedVectors. The map
method would be something along the lines of:

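// Assumes the enclosing class extends Mapper<Text, VectorWritable, Text, VectorWritable>.
// Wraps each vector in a NamedVector that carries the seqfile key as its name.
public void map(Text key, VectorWritable value, Context context)
        throws IOException, InterruptedException
{
    value.set(new NamedVector(value.get(), key.toString()));
    context.write(key, value);
}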




Re: KMeans Results: Finding Cluster Members

Posted by Colum Foley <co...@gmail.com>.

Hi Matt,

Thanks for the reply. I am using the Elephant-Bird package to generate
my vectors, so I am not sure whether I can tell it to produce
NamedVectors. I have asked this in a separate thread.

In the absence of NamedVectors, is there another way I can resolve the
name of the items that you know of?

Currently when I print out the contents of the clusteredPoints file I
see the following output:

1.0: [3887:3.000, 9441:1.000] is in 1205002
1.0: [6773:1.000] is in 1205002
1.0: [8987:2.000] is in 1205002
1.0: [2956:1.000] is in 1205002


Thanks again,
Colum


Re: KMeans Results: Finding Cluster Members

Posted by Matt Molek <mp...@gmail.com>.

If you run kmeans with the "-cl" option (or set the runClustering option to
true if you're calling the driver from Java code), you'll get a sequence
file in the directory $KMEANS_OUT/clusteredPoints with an IntWritable key
identifying the cluster, and a WeightedVectorWritable with a pdf weight
(always 1.0 in kmeans) and your original vector. If you want to recover the
name/id/whatever of the original input, you need to use NamedVector as
input to kmeans. The name will be preserved in the vector.

Here's one abbreviated line of output. My vector with name "0" was
classified into cluster 4398:
    Key: 4398: Value: 1.0: 0 = [0.007, 0.002, -0.016, -0.003,...]

Clusterdump might include this information as well. I can't remember. You'd
still need to run kmeans with the -cl option.
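
To get from dumped lines like the one above to the "cluster_id = (keys)*"
form asked for in the original question, something along these lines might
work. This is only a rough sketch in plain Java: it assumes every dumped
line has exactly the "Key: <cluster>: Value: <weight>: <name> = [...]"
shape shown in the sample, and the class name ClusterMembers, the method
group, and the extra sample lines (with made-up cluster ids) are
hypothetical, not part of Mahout.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Mahout class.
final class ClusterMembers {
    // Assumed line shape, taken from the single sample line in this message:
    //   Key: <clusterId>: Value: <weight>: <name> = [<vector contents>]
    private static final Pattern LINE = Pattern.compile(
        "^\\s*Key:\\s*(\\d+):\\s*Value:\\s*[^:]+:\\s*(.+?)\\s*=\\s*\\[.*$");

    /** Groups vector names by cluster id; lines that do not match are skipped. */
    static Map<String, List<String>> group(Iterable<String> lines) {
        Map<String, List<String>> members = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (m.matches()) {
                members.computeIfAbsent(m.group(1), k -> new ArrayList<>())
                       .add(m.group(2));
            }
        }
        return members;
    }

    public static void main(String[] args) {
        // First line is the sample from this message; the rest are invented
        // for illustration, reusing names from the original question.
        List<String> sample = List.of(
            "Key: 4398: Value: 1.0: 0 = [0.007, 0.002, -0.016, -0.003,...]",
            "Key: 4398: Value: 1.0: user104x = [0.100, 0.200]",
            "Key: 12: Value: 1.0: user19c = [0.300]");

        // Print one "cluster_<id> = {...}" line per cluster.
        group(sample).forEach((id, names) ->
            System.out.println("cluster_" + id + " = " + names));
    }
}
```

This only works if the input vectors were NamedVectors; without names there
is nothing between the weight and the "=" to capture, so those lines are
skipped.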
