Posted to user@mahout.apache.org by Aniruddha Basak <t-...@expedia.com> on 2012/08/15 19:08:04 UTC

Understanding Mahout KMeans

Hi,
I am trying to understand the Kmeans implementation in Mahout.
A few questions come to mind:

 1.  In ClusterIterator.iterateMR(), no combiner class is declared. Looking at CIMapper and CIReducer, I could not find where the new centroids are computed at the end of each iteration.
    *   I expected that at some point the "SUM" (as in Cluster.S1) of the points assigned to a cluster would be divided by the number of points (Cluster.S0). The computeCentroid() method in the AbstractCluster class does that, but I could not find whether it is called.
 2.  While generating the cluster centroids as an initial guess, i.e. in RandomSeedGenerator.buildRandom(), why is the observe() method called for each cluster? I noticed that observe() records the sum of the points assigned to that cluster. Is the point that was chosen as the cluster center then not counted twice?

Could someone please help me answer these questions?

Regards,
Aniruddha

Re: Understanding Mahout KMeans

Posted by Paritosh Ranjan <pr...@xebia.com>.

"part-randomSeed" is in the "clusters" dir, whereas the other clusters are in "clusters-0".

The call

KMeansUtil.configureWithClusterInfo(conf, clustersIn, clusters);

in KMeansDriver.buildClusters() reads the initial clusters.

On 16-08-2012 06:58, Aniruddha Basak wrote:
> Hi Jeff,
> Thanks a lot for your explanations. I completely understood the logic.
>
> In the meantime, I found one issue in KMeansDriver.buildClusters():
> The RandomSeedGenerator generates a "part-randomSeed" file containing the initial clusters.
> In the above-mentioned method, sequence files for each cluster (and also the policy) are created by
> prior.writeToSeqFiles(priorClustersPath);
> After that, ClusterIterator is called. Now "part-randomSeed" is still in the "clusters-0" directory with the
> individual cluster files. This causes ClusterClassifier.readFromSeqFiles() to read the clusters TWICE:
> once in the complete file and once more in the part files. I don’t know whether this is correct or not.
>
> I think "part-randomSeed" should be deleted before calling the ClusterIterator.
> Please let me know what you think about this issue.
>
> Thanks,
> Aniruddha

RE: Understanding Mahout KMeans

Posted by Aniruddha Basak <t-...@expedia.com>.

Hi Jeff,
Thanks a lot for your explanations. I completely understood the logic.

In the meantime, I found one issue in KMeansDriver.buildClusters():
The RandomSeedGenerator generates a "part-randomSeed" file containing the initial clusters.
In the above-mentioned method, sequence files for each cluster (and also the policy) are created by
prior.writeToSeqFiles(priorClustersPath);
After that, ClusterIterator is called. Now "part-randomSeed" is still in the "clusters-0" directory with the
individual cluster files. This causes ClusterClassifier.readFromSeqFiles() to read the clusters TWICE:
once in the complete file and once more in the part files. I don’t know whether this is correct or not.

I think "part-randomSeed" should be deleted before calling the ClusterIterator.
Please let me know what you think about this issue.

Thanks,
Aniruddha

-----Original Message-----
From: Jeff Eastman [mailto:jdog@windwardsolutions.com] 
Sent: Wednesday, August 15, 2012 6:15 PM
To: user@mahout.apache.org
Subject: Re: Understanding Mahout KMeans

1. True, the KMeansCombiner was removed and the new clustering implementations don't use combiners. Instead, all of the points assigned to a cluster by the mapper are observed() by that cluster and the clusters with their raw observation statistics are passed through to each reducer. The number of clusters has to fit in memory in each mapper anyway and counting the observations there is a lot less plumbing than with a combiner (which might or might not be run at all). All the clusters are output (k records) at the end of each mapper's cleanup() method, keyed by the clusterId.

1*. Each reducer then receives #mappers Clusters. It takes the first one, with its observation statistics, and then observes all of the remaining Clusters with that distinguished Cluster. That
observe(Cluster) method does the summing of the observation statistics. At the end of processing each key, a new ClusterClassifier is created on the one distinguished cluster and its close() method calls
computeParameters() before it is output.

2. No, I don't think so. Observing a vector with an empty cluster will add its observation statistics and then computeParameters() will properly set its centroid before it is output.


Re: Understanding Mahout KMeans

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

1. True, the KMeansCombiner was removed and the new clustering 
implementations don't use combiners. Instead, all of the points assigned 
to a cluster by the mapper are observed() by that cluster and the 
clusters with their raw observation statistics are passed through to 
each reducer. The number of clusters has to fit in memory in each mapper 
anyway and counting the observations there is a lot less plumbing than 
with a combiner (which might or might not be run at all). All the 
clusters are output (k records) at the end of each mapper's cleanup() 
method, keyed by the clusterId.

1*. Each reducer then receives #mappers Clusters. It takes the first 
one, with its observation statistics, and then observes all of the 
remaining Clusters with that distinguished Cluster. That 
observe(Cluster) method does the summing of the observation statistics. At
the end of processing each key, a new ClusterClassifier is created on 
the one distinguished cluster and its close() method calls 
computeParameters() before it is output.

2. No, I don't think so. Observing a vector with an empty cluster will 
add its observation statistics and then computeParameters() will 
properly set its centroid before it is output.
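
To make the flow above concrete, here is a minimal sketch in plain Java. It is
not Mahout code: SketchCluster and KMeansFlowSketch are hypothetical names, and
the fields s0 and s1 only mirror the count and sum statistics discussed in this
thread. Clearing the statistics inside computeParameters() is an assumption of
the sketch; check AbstractCluster for the actual behaviour.

```java
import java.util.Arrays;

/** Simplified stand-in for a k-means cluster; NOT Mahout's AbstractCluster. */
class SketchCluster {
    private double s0;          // number of observed points
    private final double[] s1;  // component-wise sum of observed points
    private final double[] center;

    SketchCluster(double[] center) {
        this.center = center.clone();
        this.s1 = new double[center.length];
    }

    /** Mapper side: record one point assigned to this cluster. */
    void observe(double[] point) {
        s0 += 1;
        for (int i = 0; i < s1.length; i++) {
            s1[i] += point[i];
        }
    }

    /** Reducer side: fold another partial cluster's statistics into this one. */
    void observe(SketchCluster other) {
        s0 += other.s0;
        for (int i = 0; i < s1.length; i++) {
            s1[i] += other.s1[i];
        }
    }

    /** Centroid = s1 / s0; the statistics are then cleared (sketch assumption). */
    void computeParameters() {
        if (s0 == 0) {
            return; // nothing observed, keep the old center
        }
        for (int i = 0; i < s1.length; i++) {
            center[i] = s1[i] / s0;
        }
        s0 = 0;
        Arrays.fill(s1, 0.0);
    }

    double[] getCenter() { return center.clone(); }
    double getS0() { return s0; }
}

class KMeansFlowSketch {
    /** Two mapper-side partial clusters for one clusterId, merged as a reducer would. */
    static double[] reduceExample() {
        double[] prior = {0.0, 0.0};
        SketchCluster fromMapper1 = new SketchCluster(prior);
        fromMapper1.observe(new double[] {1.0, 1.0});
        fromMapper1.observe(new double[] {3.0, 1.0});
        SketchCluster fromMapper2 = new SketchCluster(prior);
        fromMapper2.observe(new double[] {2.0, 4.0});

        // The first cluster is the "distinguished" one; it observes the rest.
        fromMapper1.observe(fromMapper2);
        fromMapper1.computeParameters();
        return fromMapper1.getCenter();
    }

    /** Seeding: an empty cluster observes its seed point exactly once. */
    static SketchCluster seedExample() {
        double[] seed = {5.0, 7.0};
        SketchCluster c = new SketchCluster(seed);
        c.observe(seed);
        c.computeParameters();
        return c;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(reduceExample()));
        System.out.println(Arrays.toString(seedExample().getCenter()));
    }
}
```

In this sketch the seeded cluster ends up with its center equal to the seed
point and with empty statistics, so the seed contributes nothing further to the
counts of the first iteration.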

On 8/15/12 8:50 PM, Lance Norskog wrote:
> It is possible to run the M/R jobs inside Eclipse or another IDE with
> small datasets. I learned a lot from single-stepping through some of
> the more complex code.

Re: Understanding Mahout KMeans

Posted by Lance Norskog <go...@gmail.com>.

It is possible to run the M/R jobs inside Eclipse or another IDE with
small datasets. I learned a lot from single-stepping through some of
the more complex code.


-- 
Lance Norskog
goksron@gmail.com