Posted to user@mahout.apache.org by gabeweb <ga...@htc.com> on 2010/10/13 08:09:30 UTC

Format of K-means clusters from Hadoop

The format of the output clusters in K-means clustering using KMeansDriver
appears to have changed from 0.3 to 0.4.  In 0.3, once Hadoop is done
running, each call to SequenceFile.Reader.next will return a pair of
(userID, clusterID) mapping a user to the cluster to which that user
belongs.  But in 0.4, the format is different.  The name of the Hadoop
directory changes to "clusteredPoints" (I assume that running
KMeansDriver.run() with runClustering=true is the right thing to do here),
and within that directory, using SequenceFile.Reader.next, the "key" values
are almost certainly cluster IDs, but what are the values?  I would think
they are some sort of cluster representation, but they are of type
WeightedVectorWritable and contain sparse vectors with values like "1.142". 
Furthermore, there is more than one of these for each key.  So basically, I
don't understand this output.  So what is in fact the meaning of these
(IntWritable, WeightedVectorWritable) output pairs?

BTW, am I missing some documentation somewhere that explains this?  I don't
really mind sorting it out myself as it is educational, but if it exists
then I will use it.
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Format-of-K-means-clusters-from-Hadoop-tp1692414p1692414.html
Sent from the Mahout User List mailing list archive at Nabble.com.

Re: Format of K-means clusters from Hadoop

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

  Naturally, it happens to me all the time. Here's a link to the k-Means 
algorithm page in the wiki 
(https://cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering) 
where, down in the middle, before Examples, it says:

After running the algorithm, the output directory will contain:

   1. clusters-N: directories containing SequenceFiles(Text, Cluster)
      produced by the algorithm for each iteration. The Text /key/ is a
      cluster identifier string.
   2. clusteredPoints: (if --clustering enabled) a directory containing
      SequenceFile(IntWritable, WeightedVectorWritable). The IntWritable
      /key/ is the clusterId. The WeightedVectorWritable /value/ is a
      bean containing a double /weight/ and a VectorWritable /vector/
      where the weight indicates the probability that the vector is a
      member of the cluster. For k-Means clustering, the weights are all
      1.0 since the algorithm selects only a single, most likely cluster
      for each point.
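
To illustrate why every k-Means weight comes out as 1.0, here is a minimal sketch of hard assignment in plain Java. It is not Mahout code: the Assignment class and the assign method are hypothetical stand-ins for the (IntWritable, WeightedVectorWritable) pair, and squared Euclidean distance is an assumption (Mahout lets you choose the distance measure).

```java
import java.util.Arrays;

public class HardAssignmentSketch {

    /** Hypothetical stand-in for one clusteredPoints record; not a Mahout class. */
    static final class Assignment {
        final int clusterId;    // plays the role of the IntWritable key
        final double weight;    // plays the role of the weight in the value
        final double[] vector;  // plays the role of the vector in the value

        Assignment(int clusterId, double weight, double[] vector) {
            this.clusterId = clusterId;
            this.weight = weight;
            this.vector = vector;
        }
    }

    /** Squared Euclidean distance; the choice of measure is an assumption. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Picks the single nearest centroid, so the membership weight is always 1.0. */
    static Assignment assign(double[][] centroids, double[] point) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double dist = squaredDistance(centroids[c], point);
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return new Assignment(best, 1.0, point);
    }

    public static void main(String[] args) {
        // Made-up centroids and point, for illustration only.
        double[][] centroids = { {0.0, 0.0}, {10.0, 10.0} };
        Assignment a = assign(centroids, new double[] {9.0, 8.5});
        System.out.println(a.clusterId + " " + a.weight + " " + Arrays.toString(a.vector));
    }
}
```

A soft clustering algorithm would instead emit one record per (point, cluster) pair with a fractional weight, which is presumably why the value type carries a weight at all.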

But these things have changed from 0.3 as you observed. We did this to 
improve usability and uniformity between the clustering algorithms.


On 10/12/10 11:28 PM, gabeweb wrote:
> As per the First Law of Email, as soon as I sent the previous post I figured
> it out -- I think.  The index of the pair is the index of the point (I was
> saying "user" below, but that's just my use case) being clustered, the key
> is the output cluster index, and the value is the original vector associated
> with that point (that should have been obvious).  Is that right?

Re: Format of K-means clusters from Hadoop

Posted by gabeweb <ga...@htc.com>.

As per the First Law of Email, as soon as I sent the previous post I figured
it out -- I think.  The index of the pair is the index of the point (I was
saying "user" below, but that's just my use case) being clustered, the key
is the output cluster index, and the value is the original vector associated
with that point (that should have been obvious).  Is that right?
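
If that reading is right, each record is one point tagged with its cluster, which would explain why a key appears more than once. A minimal sketch of collecting such records into per-cluster lists follows. It uses plain Java rather than the Hadoop or Mahout classes; PointRecord and groupByCluster are hypothetical names, and the sample values are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ClusteredPointsSketch {

    /** Hypothetical stand-in for one (clusterId, vector) output pair. */
    static final class PointRecord {
        final int clusterId;
        final double[] vector;

        PointRecord(int clusterId, double[] vector) {
            this.clusterId = clusterId;
            this.vector = vector;
        }
    }

    /** Groups the points by cluster ID, keeping the records in input order. */
    static Map<Integer, List<double[]>> groupByCluster(List<PointRecord> records) {
        Map<Integer, List<double[]>> groups = new TreeMap<>();
        for (PointRecord r : records) {
            List<double[]> members = groups.get(r.clusterId);
            if (members == null) {
                members = new ArrayList<>();
                groups.put(r.clusterId, members);
            }
            members.add(r.vector);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Made-up records: two points in cluster 7, one point in cluster 3.
        List<PointRecord> records = Arrays.asList(
            new PointRecord(7, new double[] {1.142, 0.0}),
            new PointRecord(3, new double[] {0.0, 2.5}),
            new PointRecord(7, new double[] {1.3, 0.1}));

        for (Map.Entry<Integer, List<double[]>> e : groupByCluster(records).entrySet()) {
            System.out.println("cluster " + e.getKey() + ": " + e.getValue().size() + " point(s)");
        }
    }
}
```

With real output, the loop over records would be replaced by repeated calls to SequenceFile.Reader.next on the files in clusteredPoints.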