Posted to dev@mahout.apache.org by "new user (JIRA)" <ji...@apache.org> on 2010/04/01 23:18:27 UTC

[jira] Created: (MAHOUT-357) Implement a clustering algorithm on mapreduce

Implement a clustering algorithm on mapreduce
---------------------------------------------

                 Key: MAHOUT-357
                 URL: https://issues.apache.org/jira/browse/MAHOUT-357
             Project: Mahout
          Issue Type: New Feature
            Reporter: new user


As I mentioned in my previous posts, I am interested in implementing a clustering algorithm on MapReduce, so I will now describe what I have in mind. Thinking of the k-means algorithm for clustering, it appears that the whole data set would have to be copied to each node of the Hadoop cluster in order to process the data in every iteration of k-means. However, this can be done without needless replication of the data across the cluster. First of all, we select a set of k elements as the initial cluster centers; this can be done purely at random or on the basis of some criterion. We maintain a file that stores the id of each cluster, the number of elements in each cluster, and the exact position of the cluster center in terms of its coordinates. This file has to be shared with each of the nodes. During each iteration of the algorithm, the following steps are performed:
1. As each node holds a part of the input data, during the map phase it calculates the distance of each of its elements from the k cluster centroids. For each row, the smallest distance is chosen, and the id of the nearest cluster is emitted together with the position of that element.
2. During the combine phase, for each cluster, the average of the coordinates of all its local elements is calculated, along with the number of elements in that cluster. So the combiner function outputs the cluster id, the average coordinates of the elements, and their count.
3. During the reduce phase, each cluster centroid is recalculated as the weighted average of the partial averages, using the element counts as weights.
Thus, after these three steps, the new centroid of each cluster and the number of elements in each cluster are updated.

The above three steps can be performed iteratively, applying the map-combine-reduce phases again, for as long as the convergence condition is not satisfied. A sketch of one such iteration appears below.
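To make the proposal concrete, here is a minimal, hypothetical sketch of one such iteration against the plain Hadoop MapReduce API. The class names, the "kmeans.centroids" configuration property, and the "count,x1,...,xd" record format are illustrative assumptions of mine, not Mahout's actual classes. For simplicity the combiner passes along coordinate sums and counts rather than averages; the reducer's final division then gives exactly the weighted average described above.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// One k-means iteration. A driver would run this job repeatedly, feeding the
// reducer output back into the shared centroid list between iterations, until
// no centroid moves by more than a chosen threshold.
public class KMeansIteration {

  // Parse "v1,v2,...,vn" into a double array.
  static double[] parseVector(String line) {
    String[] parts = line.split(",");
    double[] v = new double[parts.length];
    for (int i = 0; i < parts.length; i++) v[i] = Double.parseDouble(parts[i].trim());
    return v;
  }

  static double squaredDistance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
    return sum;
  }

  // Sum "count,x1,...,xd" records component-wise into {count, sum1, ..., sumd}.
  static double[] accumulate(Iterable<Text> values) {
    double[] acc = null;
    for (Text v : values) {
      double[] rec = parseVector(v.toString());
      if (acc == null) acc = new double[rec.length];
      for (int i = 0; i < rec.length; i++) acc[i] += rec[i];
    }
    return acc;
  }

  static String format(double[] v) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < v.length; i++) sb.append(i > 0 ? "," : "").append(v[i]);
    return sb.toString();
  }

  // Map phase: each node reads only its local split of the data and assigns
  // every point to the nearest of the k shared centroids.
  public static class KMeansMapper extends Mapper<Object, Text, IntWritable, Text> {
    private double[][] centroids;

    @Override
    protected void setup(Context ctx) {
      // Assumption: the shared centroid file has been serialized into the job
      // configuration as "x1,y1;x2,y2;..."; any shared medium would do.
      String[] rows = ctx.getConfiguration().get("kmeans.centroids").split(";");
      centroids = new double[rows.length][];
      for (int i = 0; i < rows.length; i++) centroids[i] = parseVector(rows[i]);
    }

    @Override
    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      double[] point = parseVector(value.toString());
      int nearest = 0;
      double best = Double.MAX_VALUE;
      for (int c = 0; c < centroids.length; c++) {
        double d = squaredDistance(point, centroids[c]);
        if (d < best) { best = d; nearest = c; }
      }
      // Emit the point with a count of 1 under the id of its nearest cluster.
      ctx.write(new IntWritable(nearest), new Text("1," + value));
    }
  }

  // Combine phase: per cluster, sum the local counts and coordinate vectors,
  // so only one record per cluster per node crosses the network.
  public static class PartialSumCombiner
      extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable id, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(id, new Text(format(accumulate(values))));
    }
  }

  // Reduce phase: merge the partial sums from all nodes and divide by the
  // total count, yielding the element count and the new centroid.
  public static class CentroidReducer
      extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable id, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      double[] acc = accumulate(values);
      for (int i = 1; i < acc.length; i++) acc[i] /= acc[0];
      ctx.write(id, new Text(format(acc)));
    }
  }
}

Sums plus counts are used instead of averages because sums are safe for Hadoop combiners, which may run zero or more times per key; the result is unchanged either way.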
I have proposed this based on my understanding of the problem and my knowledge. If anybody has any doubts, or wants to add or suggest anything, please respond as soon as possible. And if you consider it a good idea, then please suggest how to proceed further in the GSoC process.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-357) Implement a clustering algorithm on mapreduce

Posted by "new user (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852732#action_12852732 ] 

new user commented on MAHOUT-357:
---------------------------------

If that is so, then can you elaborate on exactly what you want to achieve with a new implementation? I mean, do I need to optimize this algorithm, or think about an entirely new way of implementing it? Please be a little more descriptive.



[jira] Commented: (MAHOUT-357) Implement a clustering algorithm on mapreduce

Posted by "new user (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852716#action_12852716 ] 

new user commented on MAHOUT-357:
---------------------------------

Yeah, it is the k-means algorithm. But the difference lies in an efficient implementation of k-means on a very large data set on the MapReduce framework. By efficient, I mean that you don't have to copy the whole data set to each of the nodes. In my implementation, the data is distributed equally among the nodes, and each of them performs the processing only on its own part of the data. Yet the result is the same as if the whole data set were processed on a single machine, for the reason sketched below. So the strength of the algorithm lies in this local processing without copying the whole data set.
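As an aside (my own restatement, not something stated in the thread): the reason local processing loses nothing is that the mean decomposes exactly over partitions. If node $p$ holds $n_p$ of a cluster's points $P_p$ with local mean $\mu_p$, the global centroid is

\[ \mu \;=\; \frac{\sum_{p} \sum_{x \in P_p} x}{\sum_{p} n_p} \;=\; \frac{\sum_{p} n_p\,\mu_p}{\sum_{p} n_p}, \]

so combining the per-node pairs $(n_p, n_p \mu_p)$ in any order reproduces the single-machine result exactly; only $k$ such pairs, not the data points, ever have to leave a node.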
I hope I have answered your doubt. In case you meant something else, please elaborate.



[jira] Commented: (MAHOUT-357) Implement a clustering algorithm on mapreduce

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852749#action_12852749 ] 

Ted Dunning commented on MAHOUT-357:
------------------------------------


The idea is for you to come up with something new that Mahout needs.  Doing a Mahout GSoC project is not about us telling you what to do.  It is about you coming up with ideas for projects, interacting with the community and then making a concrete proposal about what you want to do.

The mentors from Mahout will score your proposal relative to other proposals to help the Apache GSoC administrators prioritize which projects should be approved. In the past, we have scored highly those proposals that had very good and specific project plans, but which still described very useful new capabilities. One key aspect of scoring is to gauge how well the student proposing the project is able to work with the community, and how well they figure out how things work and what is useful to the project. The goal is to support students who are likely to become useful and productive members of the community.





[jira] Commented: (MAHOUT-357) Implement a clustering algorithm on mapreduce

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852575#action_12852575 ] 

Ted Dunning commented on MAHOUT-357:
------------------------------------


The algorithm you describe is pretty much how k-means works now.





[jira] Commented: (MAHOUT-357) Implement a clustering algorithm on mapreduce

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852722#action_12852722 ] 

Ted Dunning commented on MAHOUT-357:
------------------------------------


I think that you misunderstood me. Probably I was not clear enough in what I wrote.

What you describe is how the Mahout implementation of k-means works now.  You are not describing anything new.



