Posted to dev@mahout.apache.org by "Jeff Eastman (Created) (JIRA)" <ji...@apache.org> on 2012/03/09 21:04:56 UTC

[jira] [Created] (MAHOUT-988) Convert K-means buildClusters to use new ClusterIterator

Convert K-means buildClusters to use new ClusterIterator
--------------------------------------------------------

                 Key: MAHOUT-988
                 URL: https://issues.apache.org/jira/browse/MAHOUT-988
             Project: Mahout
          Issue Type: Sub-task
          Components: Clustering
    Affects Versions: 0.6
            Reporter: Jeff Eastman
            Assignee: Jeff Eastman
             Fix For: 0.7


Refactor the current K-means implementation to use the ClusterIterator/Classifier implementation. This will replace the mapper, combiner, reducer, clusterer and many unit tests but will not modify the other driver APIs, thus retaining compatibility with the existing CLI.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Re: [jira] [Commented] (MAHOUT-988) Convert K-means buildClusters to use new ClusterIterator

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
+1 Paritosh, this is exactly what I envisioned. And I also like your 
idea of first converting them all to use ClusterWritable. Go for it!

On 3/15/12 10:42 AM, Paritosh Ranjan wrote:
> I saw the code and my understanding of the new implementation is:
> a) K-Means, Fuzzy K-Means and Dirichlet will use ClusterIterator and
> write IntWritable, ClusterWritable in the buildClusters phase ( instead
> of Kluster, SoftCluster and DirichletCluster )
> b) Canopy and MeanShift will NOT use ClusterIterator but will emit
> IntWritable, ClusterWritable ( instead of Canopy and MeanShiftCanopy )
>
> There are tools ( ClusterDumper and ClusterEvaluator ) which expect
> <Cluster> when they read from the output file after clustering ( ~
> buildClusters phase ).
>
> KMeans is expecting Canopy and Kluster, but will get ClusterWritable.
>
> So, everything needs to be in sync ( i.e. ClusterWritable )
>
> I propose to wrap everything in ClusterWritable first, as everything 
> is a Cluster ( eg. DirichletCluster, SoftCluster, Kluster, Canopy and 
> MeanShiftCanopy ). This will remove the inconsistency without much 
> chaos. Once ClusterWritable is uniformly used, then refactor all 
> algorithms.
>
> I am also not against making ClusterDumper unavailable for a week or 
> so since we have ClusterOutputPostProcessor now.
>
> Is my understanding correct? If not, please help me understand it.
> If yes, which way do you propose to refactor?
>


Re: [jira] [Commented] (MAHOUT-988) Convert K-means buildClusters to use new ClusterIterator

Posted by Paritosh Ranjan <pr...@xebia.com>.
I saw the code and my understanding of the new implementation is:
a) K-Means, Fuzzy K-Means and Dirichlet will use ClusterIterator and write
IntWritable, ClusterWritable in the buildClusters phase ( instead of
Kluster, SoftCluster and DirichletCluster )
b) Canopy and MeanShift will NOT use ClusterIterator but will emit
IntWritable, ClusterWritable ( instead of Canopy and MeanShiftCanopy )

There are tools ( ClusterDumper and ClusterEvaluator ) which expect
<Cluster> when they read from the output file after clustering ( ~
buildClusters phase ).

KMeans is expecting Canopy and Kluster, but will get ClusterWritable.

So, everything needs to be in sync ( i.e. ClusterWritable )

I propose to wrap everything in ClusterWritable first, as everything is 
a Cluster ( eg. DirichletCluster, SoftCluster, Kluster, Canopy and 
MeanShiftCanopy ). This will remove the inconsistency without much 
chaos. Once ClusterWritable is uniformly used, then refactor all algorithms.

I am also not against making ClusterDumper unavailable for a week or so 
since we have ClusterOutputPostProcessor now.

Is my understanding correct? If not, please help me understand it.
If yes, which way do you propose to refactor?
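
The "wrap everything in ClusterWritable first" idea can be illustrated with a minimal, self-contained Java sketch. It is not Mahout code: SimpleCluster, CenterCluster, CountingCluster and Envelope are hypothetical stand-ins invented for illustration, and the wire format (class name followed by the cluster's own fields) is an assumption. The point it shows is only that readers such as a dumper or evaluator can consume one envelope type regardless of which algorithm wrote the file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class ClusterEnvelopeSketch {

    /** Hypothetical, simplified cluster contract (not Mahout's Cluster interface). */
    interface SimpleCluster {
        int getId();
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    /** Stand-in for a k-means style cluster: an id and a one-dimensional center. */
    static class CenterCluster implements SimpleCluster {
        int id;
        double center;

        public int getId() { return id; }

        public void write(DataOutput out) throws IOException {
            out.writeInt(id);
            out.writeDouble(center);
        }

        public void readFields(DataInput in) throws IOException {
            id = in.readInt();
            center = in.readDouble();
        }
    }

    /** Stand-in for a canopy style cluster: adds a point count. */
    static class CountingCluster extends CenterCluster {
        long numPoints;

        @Override
        public void write(DataOutput out) throws IOException {
            super.write(out);
            out.writeLong(numPoints);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            super.readFields(in);
            numPoints = in.readLong();
        }
    }

    /** Uniform envelope: writes the concrete class name, then delegates. */
    static class Envelope {
        SimpleCluster value;

        void write(DataOutput out) throws IOException {
            out.writeUTF(value.getClass().getName());
            value.write(out);
        }

        void readFields(DataInput in) throws IOException {
            String className = in.readUTF();
            try {
                value = (SimpleCluster) Class.forName(className)
                    .getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IOException("Cannot instantiate " + className, e);
            }
            value.readFields(in);
        }
    }

    /** Serializes a cluster through the envelope and reads it back. */
    static SimpleCluster roundTrip(SimpleCluster cluster) {
        try {
            Envelope written = new Envelope();
            written.value = cluster;
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            written.write(new DataOutputStream(bytes));
            Envelope read = new Envelope();
            read.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return read.value;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        CountingCluster canopy = new CountingCluster();
        canopy.id = 2;
        canopy.numPoints = 42;
        SimpleCluster back = roundTrip(canopy);
        System.out.println(back.getClass().getSimpleName() + " id=" + back.getId());
    }
}
```

In this sketch a consumer only ever deserializes an Envelope, so switching an algorithm's output type would not require touching the consumer; whether the real ClusterWritable works this way is not stated in this thread.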

On 15-03-2012 19:24, Jeff Eastman wrote:
> Yes, that was my point below. It may, in fact, be impossible to 
> implement and commit them independently since so much of Mahout 
> clustering depends upon the Cluster sequenceFile. You may be able to 
> get part way by moving the Canopy mods into the kmeans issue, but then 
> the cluster dumper and evaluator will not work with kmeans.
>
> Ideas?


Re: [jira] [Commented] (MAHOUT-988) Convert K-means buildClusters to use new ClusterIterator

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Yes, that was my point below. It may, in fact, be impossible to 
implement and commit them independently since so much of Mahout 
clustering depends upon the Cluster sequenceFile. You may be able to get 
part way by moving the Canopy mods into the kmeans issue, but then the 
cluster dumper and evaluator will not work with kmeans.

Ideas?

On 3/14/12 10:15 PM, Paritosh Ranjan wrote:
> Thanks Jeff. One question: are the "Use ClusterIterator" tasks dependent
> on the "Modify Canopy etc. to use ClusterWritable" task?
> I am assuming that all subtasks in MAHOUT-933 
> <https://issues.apache.org/jira/browse/MAHOUT-933> are independent of 
> each other and the order to pick them does not matter. Am I correct?


Re: [jira] [Commented] (MAHOUT-988) Convert K-means buildClusters to use new ClusterIterator

Posted by Paritosh Ranjan <pr...@xebia.com>.
Thanks Jeff. One question: are the "Use ClusterIterator" tasks dependent on
the "Modify Canopy etc. to use ClusterWritable" task?
I am assuming that all subtasks in MAHOUT-933 
<https://issues.apache.org/jira/browse/MAHOUT-933> are independent of 
each other and the order to pick them does not matter. Am I correct?

On 15-03-2012 09:23, Jeff Eastman wrote:
> Sure Paritosh, go ahead and take a crack at it. I am moving from CO to 
> PA for the next few weeks and won't be able to do much coding during 
> that period. I suspect you will also need to modify Canopy to emit 
> ClusterWritable and also the RandomSeedGenerator.
>
> Smooth sailing,
> Jeff


Re: [jira] [Commented] (MAHOUT-988) Convert K-means buildClusters to use new ClusterIterator

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Sure Paritosh, go ahead and take a crack at it. I am moving from CO to 
PA for the next few weeks and won't be able to do much coding during 
that period. I suspect you will also need to modify Canopy to emit 
ClusterWritable and also the RandomSeedGenerator.

Smooth sailing,
Jeff

On 3/14/12 8:28 PM, Paritosh Ranjan (Commented) (JIRA) wrote:
>      [ https://issues.apache.org/jira/browse/MAHOUT-988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229840#comment-13229840 ]
>
> Paritosh Ranjan commented on MAHOUT-988:
> ----------------------------------------
>
> Jeff, I would like to work on this issue (or MAHOUT-989, or MAHOUT-990). Can I? I might also need some help ( at least the first patch review ).


[jira] [Commented] (MAHOUT-988) Convert K-means buildClusters to use new ClusterIterator

Posted by "Paritosh Ranjan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229840#comment-13229840 ] 

Paritosh Ranjan commented on MAHOUT-988:
----------------------------------------

Jeff, I would like to work on this issue (or MAHOUT-989, or MAHOUT-990). Can I? I might also need some help ( at least the first patch review ).

                

        

[jira] [Commented] (MAHOUT-988) Convert K-means buildClusters to use new ClusterIterator

Posted by "Jeff Eastman (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243277#comment-13243277 ] 

Jeff Eastman commented on MAHOUT-988:
-------------------------------------

+1 Huge code reduction, eh? This is just what I was hoping for. Just moved into my new Erie PA home and my office is not yet set up. Have not installed and run the tests, but nice job. 
                

        

[jira] [Assigned] (MAHOUT-988) Convert K-means buildClusters to use new ClusterIterator

Posted by "Paritosh Ranjan (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paritosh Ranjan reassigned MAHOUT-988:
--------------------------------------

    Assignee: Paritosh Ranjan  (was: Jeff Eastman)
    

        

[jira] [Work started] (MAHOUT-988) Convert K-means buildClusters to use new ClusterIterator

Posted by "Paritosh Ranjan (Work started) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on MAHOUT-988 started by Paritosh Ranjan.


        

[jira] [Commented] (MAHOUT-988) Convert K-means buildClusters to use new ClusterIterator

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243683#comment-13243683 ] 

Hudson commented on MAHOUT-988:
-------------------------------

Integrated in Mahout-Quality #1420 (See [https://builds.apache.org/job/Mahout-Quality/1420/])
    MAHOUT-988, MAHOUT-989. Using ClusterIterator and ClusteringPolicy to buildClusters. Removed Mapper, Reducer, Clusterer and their Junit Tests. (Revision 1308019)

     Result = SUCCESS
pranjan : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1308019
Files : 
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/fuzzykmeans/FuzzyKMeansClusterer.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/fuzzykmeans/FuzzyKMeansCombiner.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/fuzzykmeans/FuzzyKMeansDriver.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/fuzzykmeans/FuzzyKMeansMapper.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/fuzzykmeans/FuzzyKMeansReducer.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/fuzzykmeans/FuzzyKMeansUtil.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/KMeansClusterer.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/KMeansCombiner.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/KMeansDriver.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/KMeansMapper.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/KMeansReducer.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/KMeansUtil.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/fuzzykmeans/TestFuzzyKmeansClustering.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/kmeans/TestKmeansClustering.java

                

        

[jira] [Resolved] (MAHOUT-988) Convert K-means buildClusters to use new ClusterIterator

Posted by "Paritosh Ranjan (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paritosh Ranjan resolved MAHOUT-988.
------------------------------------

    Resolution: Fixed

KMeans is now using ClusterIterator to buildClusters.
All the code is committed.

Resolving the issue.
                

        

[jira] [Commented] (MAHOUT-988) Convert K-means buildClusters to use new ClusterIterator

Posted by "Jeff Eastman (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243282#comment-13243282 ] 

Jeff Eastman commented on MAHOUT-988:
-------------------------------------

It will be very interesting to compare the performance of the new k-means to the old version. The ClusterIterator solution does not utilize a combiner like the old implementation did, but does all the aggregation that the combiner used to do in the mapper, outputting all the trained clusters once at the end of mapper execution. This means that each CIMapper will only write k records, one for each cluster in the prior, and thus the copy-merge step should be very quick. Since each reducer (if numReducers == k) will only see numMappers input records, the reduce step should be pretty quick too. At least that's the expectation...
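
The record counts described above can be checked with a small, self-contained Java simulation. It is only a sketch under stated assumptions: it uses plain arrays instead of Hadoop or Mahout types, Partial, runMapper and reduce are hypothetical names, and nearest-center assignment with squared Euclidean distance stands in for whatever the real classifier does. Each simulated mapper aggregates its whole split in memory and emits exactly k records; each per-cluster reduce call then receives one record per mapper.

```java
import java.util.ArrayList;
import java.util.List;

public class InMapperAggregationSketch {

    /** Hypothetical map output record: running totals for one cluster. */
    static final class Partial {
        final int clusterId;
        final double[] sum;
        long count;

        Partial(int clusterId, int dims) {
            this.clusterId = clusterId;
            this.sum = new double[dims];
        }
    }

    /** Index of the center closest to p (squared Euclidean distance). */
    static int nearest(double[][] centers, double[] p) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
            double dist = 0.0;
            for (int d = 0; d < p.length; d++) {
                double diff = centers[c][d] - p[d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    /** One map task: aggregate the whole split in memory, then emit k records. */
    static List<Partial> runMapper(double[][] centers, double[][] split) {
        List<Partial> partials = new ArrayList<>();
        for (int c = 0; c < centers.length; c++) {
            partials.add(new Partial(c, centers[c].length));
        }
        for (double[] p : split) {
            Partial target = partials.get(nearest(centers, p));
            target.count++;
            for (int d = 0; d < p.length; d++) {
                target.sum[d] += p[d];
            }
        }
        return partials;
    }

    /** Reduce for one cluster: merge one partial per mapper into a new center. */
    static double[] reduce(List<Partial> partialsForCluster, double[] priorCenter) {
        double[] total = new double[priorCenter.length];
        long n = 0;
        for (Partial p : partialsForCluster) {
            n += p.count;
            for (int d = 0; d < total.length; d++) {
                total[d] += p.sum[d];
            }
        }
        if (n == 0) {
            return priorCenter.clone(); // no points assigned; keep the prior
        }
        for (int d = 0; d < total.length; d++) {
            total[d] /= n;
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] centers = {{0, 0}, {10, 10}};
        double[][][] splits = {
            {{1, 1}, {0, 1}, {9, 9}},
            {{11, 10}, {10, 12}, {-1, 0}}
        };
        List<List<Partial>> mapOutputs = new ArrayList<>();
        for (double[][] split : splits) {
            mapOutputs.add(runMapper(centers, split));
        }
        for (int c = 0; c < centers.length; c++) {
            List<Partial> forCluster = new ArrayList<>();
            for (List<Partial> output : mapOutputs) {
                forCluster.add(output.get(c));
            }
            double[] updated = reduce(forCluster, centers[c]);
            System.out.println("cluster " + c + ": " + forCluster.size()
                + " reducer inputs, new center " + java.util.Arrays.toString(updated));
        }
    }
}
```

The trade-off in this style is that each mapper holds k partial aggregates in memory for its whole lifetime, in exchange for shuffling k records per mapper instead of one record per input point.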
                

        

[jira] [Commented] (MAHOUT-988) Convert K-means buildClusters to use new ClusterIterator

Posted by "Paritosh Ranjan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242569#comment-13242569 ] 

Paritosh Ranjan commented on MAHOUT-988:
----------------------------------------

Jeff, since this is the first case of using ClusterIterator for clustering using ClusteringPolicy, can you please review it?
                

        

[jira] [Commented] (MAHOUT-988) Convert K-means buildClusters to use new ClusterIterator

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242567#comment-13242567 ] 

jiraposter@reviews.apache.org commented on MAHOUT-988:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4573/
-----------------------------------------------------------

Review request for mahout.


Summary
-------

Used ClusterIterator and ClusteringPolicy to buildClusters for KMeans. Removed KMeansClusterer, KMeansReducer, KMeansMapper and KMeansCombiner, along with their unit tests.
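
For readers unfamiliar with the pattern, the following self-contained Java sketch shows the general shape of a policy-driven iteration loop replacing an algorithm-specific clusterer. It is not the committed code and does not use Mahout's API: Policy, HardAssignment, score and iterate are hypothetical names, points are plain double arrays, and the inverse-distance score is an assumption chosen for brevity.

```java
import java.util.Arrays;

public class PolicyIterationSketch {

    /** Hypothetical policy contract: turns per-cluster scores into training weights. */
    interface Policy {
        double[] select(double[] scores);
    }

    /** Hard assignment, as in k-means: weight 1 for the best score, 0 elsewhere. */
    static class HardAssignment implements Policy {
        public double[] select(double[] scores) {
            int best = 0;
            for (int c = 1; c < scores.length; c++) {
                if (scores[c] > scores[best]) {
                    best = c;
                }
            }
            double[] weights = new double[scores.length];
            weights[best] = 1.0;
            return weights;
        }
    }

    /** Score that grows as the point gets closer to the center. */
    static double score(double[] center, double[] point) {
        double squared = 0.0;
        for (int d = 0; d < point.length; d++) {
            double diff = center[d] - point[d];
            squared += diff * diff;
        }
        return 1.0 / (1.0 + Math.sqrt(squared));
    }

    /** Generic loop: score, let the policy pick weights, accumulate, recompute centers. */
    static double[][] iterate(double[][] points, double[][] prior, Policy policy, int numIterations) {
        int k = prior.length;
        int dims = prior[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = prior[c].clone();
        }
        for (int iteration = 0; iteration < numIterations; iteration++) {
            double[][] sums = new double[k][dims];
            double[] totals = new double[k];
            for (double[] point : points) {
                double[] scores = new double[k];
                for (int c = 0; c < k; c++) {
                    scores[c] = score(centers[c], point);
                }
                double[] weights = policy.select(scores);
                for (int c = 0; c < k; c++) {
                    totals[c] += weights[c];
                    for (int d = 0; d < dims; d++) {
                        sums[c][d] += weights[c] * point[d];
                    }
                }
            }
            for (int c = 0; c < k; c++) {
                if (totals[c] > 0) { // leave a center untouched if it attracted no weight
                    for (int d = 0; d < dims; d++) {
                        centers[c][d] = sums[c][d] / totals[c];
                    }
                }
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 2}, {10, 10}, {10, 12}};
        double[][] prior = {{1, 1}, {9, 9}};
        double[][] result = iterate(points, prior, new HardAssignment(), 2);
        System.out.println(Arrays.deepToString(result));
    }
}
```

In this sketch the loop itself is algorithm-agnostic; a different Policy returning fractional weights would give a soft assignment without changing iterate, which is one plausible reading of why the per-algorithm mapper, combiner, reducer and clusterer classes could be removed.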


This addresses bug MAHOUT-988.
    https://issues.apache.org/jira/browse/MAHOUT-988


Diffs
-----

  trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/KMeansClusterer.java 1307457 
  trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/KMeansCombiner.java 1307457 
  trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/KMeansDriver.java 1307457 
  trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/KMeansMapper.java 1307457 
  trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/KMeansReducer.java 1307457 
  trunk/core/src/main/java/org/apache/mahout/clustering/kmeans/KMeansUtil.java 1307457 
  trunk/core/src/test/java/org/apache/mahout/clustering/kmeans/TestKmeansClustering.java 1307457 

Diff: https://reviews.apache.org/r/4573/diff


Testing
-------

All junit tests pass.


Thanks,

Paritosh


                