Posted to dev@mahout.apache.org by "Jeff Eastman (Created) (JIRA)" <ji...@apache.org> on 2011/12/16 20:42:30 UTC

[jira] [Created] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning
--------------------------------------------------------------------------------------------

                 Key: MAHOUT-929
                 URL: https://issues.apache.org/jira/browse/MAHOUT-929
             Project: Mahout
          Issue Type: Improvement
          Components: Classification, Clustering
    Affects Versions: 0.6
            Reporter: Jeff Eastman
             Fix For: 0.7


The current clustering drivers have a -cp option to produce a clusteredPoints directory containing the input vectors classified by the final clusters produced by the algorithm. This option is redundantly implemented in each of those drivers.

- Factor out & implement an independent post processor to perform the classification step independently of the various clustering implementations.

- Implement a pluggable outlier removal capability for this classifier. 

- Consider building off of the ClusterClassifier & ClusterIterator ideas.
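
One way to picture the "pluggable outlier removal" bullet is a small strategy interface that the classification step consults for each vector. The sketch below is plain Java with no Mahout dependency. OutlierPolicy, KEEP_ALL, threshold and classify are hypothetical names chosen for illustration, not Mahout APIs, and the strict "greater than" comparison is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

public class PluggableOutlierSketch {

    /** Hypothetical plug-in point: decides whether a vector should be kept. */
    interface OutlierPolicy {
        boolean keep(double[] pdfPerCluster);
    }

    /** Keeps every vector, i.e. no pruning at all. */
    static final OutlierPolicy KEEP_ALL = pdfs -> true;

    /** Keeps a vector only if its best cluster pdf is greater than minPdf (assumed rule). */
    static OutlierPolicy threshold(double minPdf) {
        return pdfs -> max(pdfs) > minPdf;
    }

    /** Largest value in the array, or negative infinity for an empty array. */
    static double max(double[] values) {
        double best = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            best = Math.max(best, v);
        }
        return best;
    }

    /** Returns the indices of the input rows that the policy keeps. */
    static List<Integer> classify(double[][] pdfsPerVector, OutlierPolicy policy) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < pdfsPerVector.length; i++) {
            if (policy.keep(pdfsPerVector[i])) {
                kept.add(i);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Each row holds one vector's pdf for each of two clusters (made-up values).
        double[][] pdfs = { {0.9, 0.1}, {0.2, 0.3}, {0.05, 0.6} };
        System.out.println(classify(pdfs, KEEP_ALL));        // [0, 1, 2]
        System.out.println(classify(pdfs, threshold(0.5)));  // [0, 2]
    }
}
```

Because the policy is passed in, the same classification loop could serve every clustering algorithm, and a different pruning rule would only need a new OutlierPolicy.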

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

Posted by "Paritosh Ranjan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217088#comment-13217088 ] 

Paritosh Ranjan commented on MAHOUT-929:
----------------------------------------

1) Do not worry about outlier removal for the first cut. Use emitMostLikely = true and clusterClassificationThreshold = 0.0.
2) I don't think there is any need to run a Hadoop job to test the mapper. Just test the logic inside the mapper. You will need EasyMock or some other mocking framework to do it. The dev mailing list and the existing tests can suggest other ways to write tests. There is no reducer defined for the job.
3) I don't think there is any need to pull in the code from ClusterClassificationDriver. The point is to test the cluster classification logic inside the mapper, not the driver.
4) It does not matter how many clusters you use. What matters is the clarity of the test cases. It really helps if the functionality being tested is understandable from the test cases.
The sequential and mapreduce versions should produce the same result, so you can also reuse the assertions and data from ClusterClassificationDriverTest, which covers the sequential cluster classification.
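
As a rough illustration of point 2, the per-record logic can be exercised without a Hadoop job by handing it a stand-in for the output context. The sketch below is plain Java and swaps in a hand-written recording fake where the comment suggests EasyMock, so that it runs with no extra libraries. Sink, RecordingSink and map are hypothetical names, and the nearest-center rule is only a placeholder for the real classification logic.

```java
import java.util.ArrayList;
import java.util.List;

public class MapperLogicTestSketch {

    /** Hypothetical stand-in for the Hadoop context the mapper writes to. */
    interface Sink {
        void write(int clusterId, double[] vector);
    }

    /** Hand-written fake that records every write, used in place of a mock. */
    static class RecordingSink implements Sink {
        final List<Integer> clusterIds = new ArrayList<>();

        @Override
        public void write(int clusterId, double[] vector) {
            clusterIds.add(clusterId);
        }
    }

    /** Placeholder per-record logic: assign the vector to the nearest center. */
    static void map(double[] vector, double[][] centers, Sink sink) {
        if (centers.length == 0) {
            return; // nothing to classify against
        }
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
            double dist = 0.0;
            for (int i = 0; i < vector.length; i++) {
                double diff = vector[i] - centers[c][i];
                dist += diff * diff; // squared Euclidean distance
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        sink.write(best, vector);
    }

    public static void main(String[] args) {
        double[][] centers = { {1, 1}, {8, 8} };
        double[][] points = { {1, 2}, {9, 8}, {2, 1} };
        RecordingSink sink = new RecordingSink();
        for (double[] p : points) {
            map(p, centers, sink);
        }
        System.out.println(sink.clusterIds); // [0, 1, 0]
    }
}
```

Two well-separated centers and three points keep the expected assignments obvious from the test data, which is the kind of clarity point 4 asks for.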
                

        

[jira] [Commented] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13216844#comment-13216844 ] 

Hudson commented on MAHOUT-929:
-------------------------------

Integrated in Mahout-Quality #1368 (See [https://builds.apache.org/job/Mahout-Quality/1368/])
    MAHOUT-931, MAHOUT-929. Added emitMostLikely and threshold based outlier removal capability in ClusterClassificationDriver. (Revision 1293874)

     Result = SUCCESS
pranjan : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1293874
Files : 
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/classify/ClusterClassificationDriver.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/classify/ClusterClassificationDriverTest.java

                

        

[jira] [Commented] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

Posted by "Saikat Kanjilal (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217918#comment-13217918 ] 

Saikat Kanjilal commented on MAHOUT-929:
----------------------------------------

Ha Paritosh, you beat me to the punch, pardon my newbieness. I was just reading through the code in more detail. I had just created ClusterClassificationMapperTest and was starting to add code to it. Should I move your map-reduce test case into this class? I will first try to add some more test cases.


                

        

[jira] [Commented] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

Posted by "Jeff Eastman (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208080#comment-13208080 ] 

Jeff Eastman commented on MAHOUT-929:
-------------------------------------

I committed your patch today. Keep it going!
                

        

Re: [jira] [Commented] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Just +1 <grin>

On 2/22/12 10:35 PM, Paritosh Ranjan (Commented) (JIRA) wrote:
>      [ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13214329#comment-13214329 ]
>
> Paritosh Ranjan commented on MAHOUT-929:
> ----------------------------------------
>
> Assigned to myself.
>
> I think the cluster classification driver is developed now. I will wait a while for the ClusterClassificationMapper test case (patch), as we asked on dev.
>
> Otherwise I will write it and commit it myself. I might need help committing for the first time.
>
> Considering that ClusterClassificationDriver development is done, we need to refactor the KMeans, FuzzyK, Dirichlet, and Canopy drivers.
> I will create separate child issues for refactoring these algorithms so that different people can pick them up in parallel if they want. This will help avoid duplicated effort.
>
> Jeff, any comments/suggestions?
>


[jira] [Commented] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

Posted by "Paritosh Ranjan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13214329#comment-13214329 ] 

Paritosh Ranjan commented on MAHOUT-929:
----------------------------------------

Assigned to myself.

I think the cluster classification driver is developed now. I will wait a while for the ClusterClassificationMapper test case (patch), as we asked on dev.

Otherwise I will write it and commit it myself. I might need help committing for the first time.

Considering that ClusterClassificationDriver development is done, we need to refactor the KMeans, FuzzyK, Dirichlet, and Canopy drivers.
I will create separate child issues for refactoring these algorithms so that different people can pick them up in parallel if they want. This will help avoid duplicated effort.

Jeff, any comments/suggestions?
                

        

[jira] [Commented] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217941#comment-13217941 ] 

Hudson commented on MAHOUT-929:
-------------------------------

Integrated in Mahout-Quality #1371 (See [https://builds.apache.org/job/Mahout-Quality/1371/])
    MAHOUT-929, MAHOUT-931. Implemented mapreduce version of ClusterClassificationDriver with outlier removal capability.
Changed output of sequential to WeightedVectorWritable. Fixed and added test cases. (Revision 1294454)

     Result = SUCCESS
pranjan : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1294454
Files : 
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/classify/ClusterClassificationConfigKeys.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/classify/ClusterClassificationDriver.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/classify/ClusterClassificationMapper.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/classify/ClusterClassificationDriverTest.java

                

        

[jira] [Assigned] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

Posted by "Jeff Eastman (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman reassigned MAHOUT-929:
-----------------------------------

    Assignee: Jeff Eastman
    

        

[jira] [Commented] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

Posted by "Paritosh Ranjan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205226#comment-13205226 ] 

Paritosh Ranjan commented on MAHOUT-929:
----------------------------------------

I would prefer committing the code, because then I can make local changes more easily.

Future actions (for me):

a) Implement (plug in) classification for Dirichlet and FuzzyK (similar to the classification threshold).
b) Add a test case for the MR version (at least the mapper).

If anything else is needed, please point it out.

                

        

[jira] [Updated] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman updated MAHOUT-929:
--------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Resolving, as all subtasks have been completed.
                

        

[jira] [Updated] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

Posted by "Paritosh Ranjan (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paritosh Ranjan updated MAHOUT-929:
-----------------------------------

    Attachment: Mahout-929

Added License Files.
                
        

[jira] [Commented] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

Posted by "Saikat Kanjilal (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217952#comment-13217952 ] 

Saikat Kanjilal commented on MAHOUT-929:
----------------------------------------

Paritosh,
Some more questions after reading through your code in ClusterClassificationDriverTest:
1) It seems that the map-reduce test method you added, testVectorClassificationWithOutlierRemovalMR, differs from testVectorClassificationWithOutlierRemoval only by the following line: HadoopUtil.delete(conf, classifiedOutputPath);

2) I was going to add the following test cases to ClusterClassificationDriverTest (I chose not to add ClusterClassificationMapperTest):
- testVectorClassificationWithoutOutlierRemovalMR
- testVectorClassificationWithoutOutlierRemovalChangeThresholdMR
- testVectorClassificationWithoutOutlierRemovalChangeThreshold (pass in a custom threshold here and mock out the expectations)

Finally, I may add some error edge cases around the above.

Thoughts? Let me know if you think of other cases to add.

I want to spend some time learning this in more detail before diving into the kmeans driver rework.


                

        

[jira] [Commented] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

Posted by "Jeff Eastman (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171700#comment-13171700 ] 

Jeff Eastman commented on MAHOUT-929:
-------------------------------------

Sure, the first two at least are pretty significant stories. The last is more of a design constraint on the first story. Go ahead and subdivide if you wish.
                

        

[jira] [Issue Comment Edited] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

Posted by "Paritosh Ranjan (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210978#comment-13210978 ] 

Paritosh Ranjan edited comment on MAHOUT-929 at 2/18/12 3:17 PM:
-----------------------------------------------------------------

I have added an emitMostLikely feature to vector classification. If it is set to true, each vector is classified only into the cluster for which it has the maximum pdf.

However, if clusterClassificationThreshold is present, then only vectors whose pdfs are greater than clusterClassificationThreshold will be classified. It's a bit different from the previous implementation, but it makes more sense if you think in terms of outlier removal.

So, even Dirichlet and FuzzyKMeans can be classified now.

For now, the patch contains changes and test cases for the sequential version only. I will make the changes to the MapReduce version, with test cases, and submit them soon.
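
To make the interaction between the two options concrete, here is a minimal sketch. It is not the code from the patch: the class name, the method name, and the plain double[] of per-cluster pdfs are hypothetical, and the behaviour when emitMostLikely is false (emit every cluster whose pdf exceeds the threshold) is an assumption based on the multiple-classification behaviour of FuzzyKMeans and Dirichlet discussed in this issue.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the Mahout ClusterClassificationDriver code. */
public class ThresholdClassificationSketch {

  /**
   * Returns the indices of the clusters a vector is assigned to.
   * pdfs[i] is the vector's pdf under cluster i. An empty result means
   * the vector is treated as an outlier and not classified at all.
   */
  public static List<Integer> classify(double[] pdfs, double threshold, boolean emitMostLikely) {
    List<Integer> assigned = new ArrayList<>();
    if (emitMostLikely) {
      // Find the single most likely cluster.
      int best = -1;
      for (int i = 0; i < pdfs.length; i++) {
        if (best < 0 || pdfs[i] > pdfs[best]) {
          best = i;
        }
      }
      // Emit it only if it clears the outlier threshold.
      if (best >= 0 && pdfs[best] > threshold) {
        assigned.add(best);
      }
    } else {
      // Assumed behaviour: emit every cluster whose pdf clears the threshold.
      for (int i = 0; i < pdfs.length; i++) {
        if (pdfs[i] > threshold) {
          assigned.add(i);
        }
      }
    }
    return assigned;
  }

  public static void main(String[] args) {
    double[] pdfs = {0.1, 0.6, 0.3};
    System.out.println(classify(pdfs, 0.2, true));  // [1]
    System.out.println(classify(pdfs, 0.2, false)); // [1, 2]
    System.out.println(classify(pdfs, 0.7, true));  // [] (outlier)
  }
}
```

In this sketch the threshold is applied after the most likely cluster is chosen, so a vector whose best pdf is still below the threshold is dropped rather than forced into a cluster.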
                
      was (Author: paritoshranjan):
    I have added emitMostLikely feature to vector classification. If clusterClassificationThreshold is present, then only vectors whose pdfs are greater than clusterClassificationThreshold would be classified. It's a bit different than the previous implementation, but makes more sense if you think in terms of outlier removal.

So, even Dirichlet and FuzzyKMeans can be classified now.

The patch only contains changes and test cases for the sequential version for now. I will make changes to mapreduce version with test cases and submit soon.
                  
> Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning
> --------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-929
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-929
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.6
>            Reporter: Jeff Eastman
>            Assignee: Jeff Eastman
>             Fix For: 0.7
>
>         Attachments: Mahout-929, Mahout-929, Mahout-929, Mahout-929
>
>
> The current clustering drivers have a -cp option to produce clusteredPoints directory containing the input vectors classified by the final clusters produced by the algorithm. These options are redundantly implemented in those drivers.
> - Factor out & implement an independent post processor to perform the classification step independently of the various clustering implementations.
> - Implement a pluggable outlier removal capability for this classifier. 
> - Consider building off of the ClusterClassifier & ClusterIterator ideas.


        

[jira] [Commented] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

Posted by "Jeff Eastman (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204936#comment-13204936 ] 

Jeff Eastman commented on MAHOUT-929:
-------------------------------------

The sequential version looks good, but the patch lacks tests of the MR implementation, or at least of the mapper.

What I get from reading the code is that all points with a pdf > clusterClassificationThreshold will be clustered (the rest are ignored as outliers) and that the most likely cluster will be chosen for each vector. To replace the current FuzzyK and Dirichlet capabilities, it will also need another classification threshold, so that it can produce the multiple classifications per vector that the current implementations support.

As this code is not used yet, it could be committed as-is if you are comfortable but it would still be a WIP. How would you like to proceed?



                
> Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning
> --------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-929
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-929
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.6
>            Reporter: Jeff Eastman
>             Fix For: 0.7
>
>         Attachments: Mahout-929, Mahout-929, Mahout-929
>
>
> The current clustering drivers have a -cp option to produce clusteredPoints directory containing the input vectors classified by the final clusters produced by the algorithm. These options are redundantly implemented in those drivers.
> - Factor out & implement an independent post processor to perform the classification step independently of the various clustering implementations.
> - Implement a pluggable outlier removal capability for this classifier. 
> - Consider building off of the ClusterClassifier & ClusterIterator ideas.


        

[jira] [Commented] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

Posted by "Paritosh Ranjan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217913#comment-13217913 ] 

Paritosh Ranjan commented on MAHOUT-929:
----------------------------------------

I have added the mapreduce version of the ClusterClassificationDriver with outlier removal capability.

ClusterClassificationDriver is implemented now (only some refactoring and CLI development are left), so the clustering refactorings can start.

Saikat, if you want, you can look into ClusterClassificationDriverTest. I have added a MapReduce test case. You can try to add some more test scenarios there. This will help in getting a better understanding of ClusterClassification. Once you understand it, you can try to use it in KMeansDriver.

                
> Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning
> --------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-929
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-929
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.6
>            Reporter: Jeff Eastman
>            Assignee: Paritosh Ranjan
>             Fix For: 0.7
>
>         Attachments: Mahout-929, Mahout-929, Mahout-929, Mahout-929
>
>
> The current clustering drivers have a -cp option to produce clusteredPoints directory containing the input vectors classified by the final clusters produced by the algorithm. These options are redundantly implemented in those drivers.
> - Factor out & implement an independent post processor to perform the classification step independently of the various clustering implementations.
> - Implement a pluggable outlier removal capability for this classifier. 
> - Consider building off of the ClusterClassifier & ClusterIterator ideas.


        

[jira] [Issue Comment Edited] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

Posted by "Paritosh Ranjan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13267588#comment-13267588 ] 

Paritosh Ranjan edited comment on MAHOUT-929 at 5/3/12 5:08 PM:
----------------------------------------------------------------

All issues other than MAHOUT-990 are fixed. There is no other patch to review. 

I have not closed this issue since MAHOUT-990 is a subtask of MAHOUT-933, which is linked to this issue. Once we close MAHOUT-990, we will be done with all issues related to this refactoring.
                
      was (Author: paritoshranjan):
    All issues other than MAHOUT-990 are fixed. There is no other patch to review. 

I have not closed this issue since MAHOUT-930 is a subtask of MAHOUT-933, which is linked to this issue. Once we close MAHOUT-930, we will be done with all issues related to this refactoring.
                  
> Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning
> --------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-929
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-929
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.6
>            Reporter: Jeff Eastman
>            Assignee: Paritosh Ranjan
>             Fix For: 0.7
>
>         Attachments: Mahout-929, Mahout-929, Mahout-929, Mahout-929
>
>
> The current clustering drivers have a -cp option to produce clusteredPoints directory containing the input vectors classified by the final clusters produced by the algorithm. These options are redundantly implemented in those drivers.
> - Factor out & implement an independent post processor to perform the classification step independently of the various clustering implementations.
> - Implement a pluggable outlier removal capability for this classifier. 
> - Consider building off of the ClusterClassifier & ClusterIterator ideas.


        

[jira] [Commented] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

Posted by "Jeff Eastman (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13214091#comment-13214091 ] 

Jeff Eastman commented on MAHOUT-929:
-------------------------------------

Hey Paritosh, why don't you take over this issue since you now have committer karma :)
                
> Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning
> --------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-929
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-929
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.6
>            Reporter: Jeff Eastman
>            Assignee: Jeff Eastman
>             Fix For: 0.7
>
>         Attachments: Mahout-929, Mahout-929, Mahout-929, Mahout-929
>
>
> The current clustering drivers have a -cp option to produce clusteredPoints directory containing the input vectors classified by the final clusters produced by the algorithm. These options are redundantly implemented in those drivers.
> - Factor out & implement an independent post processor to perform the classification step independently of the various clustering implementations.
> - Implement a pluggable outlier removal capability for this classifier. 
> - Consider building off of the ClusterClassifier & ClusterIterator ideas.


        

[jira] [Assigned] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

Posted by "Paritosh Ranjan (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paritosh Ranjan reassigned MAHOUT-929:
--------------------------------------

    Assignee: Paritosh Ranjan  (was: Jeff Eastman)
    
> Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning
> --------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-929
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-929
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.6
>            Reporter: Jeff Eastman
>            Assignee: Paritosh Ranjan
>             Fix For: 0.7
>
>         Attachments: Mahout-929, Mahout-929, Mahout-929, Mahout-929
>
>
> The current clustering drivers have a -cp option to produce clusteredPoints directory containing the input vectors classified by the final clusters produced by the algorithm. These options are redundantly implemented in those drivers.
> - Factor out & implement an independent post processor to perform the classification step independently of the various clustering implementations.
> - Implement a pluggable outlier removal capability for this classifier. 
> - Consider building off of the ClusterClassifier & ClusterIterator ideas.


        

[jira] [Commented] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13267473#comment-13267473 ] 

Jeff Eastman commented on MAHOUT-929:
-------------------------------------

Paritosh, can you take a look at this patch? If it needs work and you need help closing the issue let me know.
                
> Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning
> --------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-929
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-929
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.6
>            Reporter: Jeff Eastman
>            Assignee: Paritosh Ranjan
>             Fix For: 0.7
>
>         Attachments: Mahout-929, Mahout-929, Mahout-929, Mahout-929
>
>
> The current clustering drivers have a -cp option to produce clusteredPoints directory containing the input vectors classified by the final clusters produced by the algorithm. These options are redundantly implemented in those drivers.
> - Factor out & implement an independent post processor to perform the classification step independently of the various clustering implementations.
> - Implement a pluggable outlier removal capability for this classifier. 
> - Consider building off of the ClusterClassifier & ClusterIterator ideas.


        

[jira] [Commented] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

Posted by "Paritosh Ranjan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223162#comment-13223162 ] 

Paritosh Ranjan commented on MAHOUT-929:
----------------------------------------

You can create a patch and attach it to the JIRA issue. The How to Contribute page has more details: https://cwiki.apache.org/MAHOUT/how-to-contribute.html
                
> Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning
> --------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-929
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-929
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.6
>            Reporter: Jeff Eastman
>            Assignee: Paritosh Ranjan
>             Fix For: 0.7
>
>         Attachments: Mahout-929, Mahout-929, Mahout-929, Mahout-929
>
>
> The current clustering drivers have a -cp option to produce clusteredPoints directory containing the input vectors classified by the final clusters produced by the algorithm. These options are redundantly implemented in those drivers.
> - Factor out & implement an independent post processor to perform the classification step independently of the various clustering implementations.
> - Implement a pluggable outlier removal capability for this classifier. 
> - Consider building off of the ClusterClassifier & ClusterIterator ideas.


        

[jira] [Commented] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

Posted by "Saikat Kanjilal (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217028#comment-13217028 ] 

Saikat Kanjilal commented on MAHOUT-929:
----------------------------------------

While reading through the ClusterClassificationMapper, I came up with some questions:
1) Do we need to worry about outlier removal when providing unit tests for the MapReduce version?
2) Is there a sample class I can look at to see how many mappers and reducers to specify, or is this already baked into the unit tests from the mahouttest?
3) I am going to start with the simple test that Paritosh specified, one that checks whether or not the vectors were classified correctly. To do this, I plan to take most of the code inside ClusterClassificationDriver and change it so that the logic works in MapReduce. Let me know if there are issues with this approach.
4) In ClusterClassificationDriverTest I noticed we are using 3 clusters. Does it matter how many clusters we create? I was wondering what relationship (if any) the number of clusters has to the actual MapReduce classification operation.
                
> Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning
> --------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-929
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-929
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.6
>            Reporter: Jeff Eastman
>            Assignee: Paritosh Ranjan
>             Fix For: 0.7
>
>         Attachments: Mahout-929, Mahout-929, Mahout-929, Mahout-929
>
>
> The current clustering drivers have a -cp option to produce clusteredPoints directory containing the input vectors classified by the final clusters produced by the algorithm. These options are redundantly implemented in those drivers.
> - Factor out & implement an independent post processor to perform the classification step independently of the various clustering implementations.
> - Implement a pluggable outlier removal capability for this classifier. 
> - Consider building off of the ClusterClassifier & ClusterIterator ideas.


        

[jira] [Updated] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

Posted by "Paritosh Ranjan (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paritosh Ranjan updated MAHOUT-929:
-----------------------------------

    Attachment: Mahout-929

Added the MapReduce version of ClusterClassificationDriver.

Also added outlier removal functionality.

Added test cases which demonstrate outlier removal and cluster classification.

Added JavaDocs.
                
> Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning
> --------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-929
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-929
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.6
>            Reporter: Jeff Eastman
>             Fix For: 0.7
>
>         Attachments: Mahout-929, Mahout-929
>
>
> The current clustering drivers have a -cp option to produce clusteredPoints directory containing the input vectors classified by the final clusters produced by the algorithm. These options are redundantly implemented in those drivers.
> - Factor out & implement an independent post processor to perform the classification step independently of the various clustering implementations.
> - Implement a pluggable outlier removal capability for this classifier. 
> - Consider building off of the ClusterClassifier & ClusterIterator ideas.


        

[jira] [Updated] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

Posted by "Paritosh Ranjan (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paritosh Ranjan updated MAHOUT-929:
-----------------------------------

    Attachment: Mahout-929

Created a sequential version of ClusterClassifier. A test case is also present.

In the next patch I will also add the MapReduce version. It will be implemented in more or less the same fashion.

Please review the patch to catch any problems early. The code is in a working state.

And sorry for the delay; I was very busy with my office work. I did use some of the time to read about recommendation, classification, and probability, which I am sure I will be able to use in the future.
                
> Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning
> --------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-929
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-929
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.6
>            Reporter: Jeff Eastman
>             Fix For: 0.7
>
>         Attachments: Mahout-929
>
>
> The current clustering drivers have a -cp option to produce clusteredPoints directory containing the input vectors classified by the final clusters produced by the algorithm. These options are redundantly implemented in those drivers.
> - Factor out & implement an independent post processor to perform the classification step independently of the various clustering implementations.
> - Implement a pluggable outlier removal capability for this classifier. 
> - Consider building off of the ClusterClassifier & ClusterIterator ideas.


        

[jira] [Commented] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

Posted by "Paritosh Ranjan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217953#comment-13217953 ] 

Paritosh Ranjan commented on MAHOUT-929:
----------------------------------------

The MR test differs in the value of the runSequential argument. For MR it's false; for the others it's true.

runClustering(pointsPath, conf, false);
runClassificationWithOutlierRemoval(conf, false);
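
As a rough illustration of that pattern, the sketch below shows one entry point that selects the execution path from a boolean flag, so that a sequential test and an MR test can share the same scenario and differ only in the argument. The class and method names are hypothetical stand-ins, not the helpers from ClusterClassificationDriverTest, and no Hadoop job is actually run.

```java
/** Hypothetical sketch of a runSequential-style switch; not Mahout code. */
public class RunSequentialFlagSketch {

  /** Stand-in for a driver entry point; reports which path it took. */
  public static String runClassification(boolean runSequential) {
    return runSequential ? classifySequential() : classifyMapReduce();
  }

  // In a real driver this would classify the vectors in-process.
  private static String classifySequential() {
    return "sequential";
  }

  // In a real driver this would configure and submit a MapReduce job.
  private static String classifyMapReduce() {
    return "mapreduce";
  }

  public static void main(String[] args) {
    System.out.println(runClassification(true));  // sequential
    System.out.println(runClassification(false)); // mapreduce
  }
}
```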
                
> Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning
> --------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-929
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-929
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.6
>            Reporter: Jeff Eastman
>            Assignee: Paritosh Ranjan
>             Fix For: 0.7
>
>         Attachments: Mahout-929, Mahout-929, Mahout-929, Mahout-929
>
>
> The current clustering drivers have a -cp option to produce clusteredPoints directory containing the input vectors classified by the final clusters produced by the algorithm. These options are redundantly implemented in those drivers.
> - Factor out & implement an independent post processor to perform the classification step independently of the various clustering implementations.
> - Implement a pluggable outlier removal capability for this classifier. 
> - Consider building off of the ClusterClassifier & ClusterIterator ideas.


        

[jira] [Updated] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

Posted by "Paritosh Ranjan (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paritosh Ranjan updated MAHOUT-929:
-----------------------------------

    Status: Patch Available  (was: Open)
    
> Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning
> --------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-929
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-929
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.6
>            Reporter: Jeff Eastman
>             Fix For: 0.7
>
>         Attachments: Mahout-929
>
>
> The current clustering drivers have a -cp option to produce clusteredPoints directory containing the input vectors classified by the final clusters produced by the algorithm. These options are redundantly implemented in those drivers.
> - Factor out & implement an independent post processor to perform the classification step independently of the various clustering implementations.
> - Implement a pluggable outlier removal capability for this classifier. 
> - Consider building off of the ClusterClassifier & ClusterIterator ideas.


        

[jira] [Updated] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

Posted by "Paritosh Ranjan (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paritosh Ranjan updated MAHOUT-929:
-----------------------------------

    Attachment: Mahout-929

I have added an emitMostLikely feature to vector classification. If clusterClassificationThreshold is present, then only vectors whose pdfs are greater than clusterClassificationThreshold will be classified. It's a bit different from the previous implementation, but it makes more sense if you think in terms of outlier removal.

So, even Dirichlet and FuzzyKMeans can be classified now.

For now, the patch contains changes and test cases for the sequential version only. I will make the changes to the MapReduce version, with test cases, and submit them soon.
                
> Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning
> --------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-929
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-929
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.6
>            Reporter: Jeff Eastman
>            Assignee: Jeff Eastman
>             Fix For: 0.7
>
>         Attachments: Mahout-929, Mahout-929, Mahout-929, Mahout-929
>
>
> The current clustering drivers have a -cp option to produce clusteredPoints directory containing the input vectors classified by the final clusters produced by the algorithm. These options are redundantly implemented in those drivers.
> - Factor out & implement an independent post processor to perform the classification step independently of the various clustering implementations.
> - Implement a pluggable outlier removal capability for this classifier. 
> - Consider building off of the ClusterClassifier & ClusterIterator ideas.


        

[jira] [Commented] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

Posted by "Paritosh Ranjan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171593#comment-13171593 ] 

Paritosh Ranjan commented on MAHOUT-929:
----------------------------------------

I think it would be difficult to manage discussions and patches for all three points mentioned in this single JIRA issue.

In agile terms, too, this user story is large and tries to do too many things.

Would it be good to create three sub-issues for the three points mentioned, since they are related? I think there is also an order in which they should be developed, so it would be good to make the sub-issues depend on each other in that order. If you agree, we can create them.
                
> Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning
> --------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-929
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-929
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.6
>            Reporter: Jeff Eastman
>             Fix For: 0.7
>
>
> The current clustering drivers have a -cp option to produce clusteredPoints directory containing the input vectors classified by the final clusters produced by the algorithm. These options are redundantly implemented in those drivers.
> - Factor out & implement an independent post processor to perform the classification step independently of the various clustering implementations.
> - Implement a pluggable outlier removal capability for this classifier. 
> - Consider building off of the ClusterClassifier & ClusterIterator ideas.


[jira] [Commented] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

Posted by "Paritosh Ranjan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13267588#comment-13267588 ] 

Paritosh Ranjan commented on MAHOUT-929:
----------------------------------------

All issues other than MAHOUT-990 are fixed. There is no other patch to review. 

I have not closed this issue because MAHOUT-930 is a subtask of MAHOUT-933, which is linked to this issue. Once we close MAHOUT-930, we will be done with all issues related to this refactoring.
                
> Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning
> --------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-929
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-929
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.6
>            Reporter: Jeff Eastman
>            Assignee: Paritosh Ranjan
>             Fix For: 0.7
>
>         Attachments: Mahout-929, Mahout-929, Mahout-929, Mahout-929
>
>
> The current clustering drivers have a -cp option to produce clusteredPoints directory containing the input vectors classified by the final clusters produced by the algorithm. These options are redundantly implemented in those drivers.
> - Factor out & implement an independent post processor to perform the classification step independently of the various clustering implementations.
> - Implement a pluggable outlier removal capability for this classifier. 
> - Consider building off of the ClusterClassifier & ClusterIterator ideas.


[jira] [Commented] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

Posted by "Paritosh Ranjan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217920#comment-13217920 ] 

Paritosh Ranjan commented on MAHOUT-929:
----------------------------------------

Adding a few test cases for ClusterClassificationDriver will help you understand its functionality, which in turn will help with the clustering refactorings. Whether to add or skip the mapper test is up to you. To reiterate: once you understand ClusterClassificationDriver, you can try to use it in KMeansDriver. ClusterClassificationDriver will replace the clusterData phase of KMeansDriver. Feel free to ask questions on MAHOUT-981 regarding the KMeansDriver refactoring.
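
For anyone new to the idea, here is a minimal sketch of what a classification post-process with outlier pruning could look like once clustering has finished. It is only an illustration under stated assumptions: the class, the method names, the distance-based affinity, and the threshold parameter are all hypothetical and are not the ClusterClassificationDriver API.

```java
/** Hypothetical sketch; none of these names are Mahout APIs. */
public class OutlierPruningSketch {

  /** Distance-based affinity in (0, 1]: 1 / (1 + Euclidean distance). */
  static double affinity(double[] point, double[] center) {
    double sum = 0.0;
    for (int i = 0; i < point.length; i++) {
      double d = point[i] - center[i];
      sum += d * d;
    }
    return 1.0 / (1.0 + Math.sqrt(sum));
  }

  /**
   * Returns the index of the best-matching cluster, or -1 when even the best
   * affinity falls below the threshold (the point is treated as an outlier).
   */
  static int classify(double[] point, double[][] centers, double threshold) {
    int best = -1;
    double bestAffinity = -1.0;
    for (int c = 0; c < centers.length; c++) {
      double a = affinity(point, centers[c]);
      if (a > bestAffinity) {
        bestAffinity = a;
        best = c;
      }
    }
    return bestAffinity >= threshold ? best : -1;
  }

  public static void main(String[] args) {
    // Assumed final cluster centers, as if produced by an earlier clustering run.
    double[][] centers = {{0.0, 0.0}, {10.0, 10.0}};
    double[][] points = {{0.5, 0.0}, {9.0, 10.0}, {50.0, 50.0}};
    for (double[] p : points) {
      // The far-away third point falls below the threshold and is pruned.
      System.out.println(classify(p, centers, 0.3));
    }
  }
}
```

The point of the sketch is that the classification step needs only the final clusters and the input vectors, so it can run independently of whichever algorithm produced the clusters.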

                
> Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning
> --------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-929
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-929
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.6
>            Reporter: Jeff Eastman
>            Assignee: Paritosh Ranjan
>             Fix For: 0.7
>
>         Attachments: Mahout-929, Mahout-929, Mahout-929, Mahout-929
>
>
> The current clustering drivers have a -cp option to produce clusteredPoints directory containing the input vectors classified by the final clusters produced by the algorithm. These options are redundantly implemented in those drivers.
> - Factor out & implement an independent post processor to perform the classification step independently of the various clustering implementations.
> - Implement a pluggable outlier removal capability for this classifier. 
> - Consider building off of the ClusterClassifier & ClusterIterator ideas.


[jira] [Commented] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

Posted by "Saikat Kanjilal (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13222987#comment-13222987 ] 

Saikat Kanjilal commented on MAHOUT-929:
----------------------------------------

Paritosh, sorry about my disappearance; I was out for a few days. Anyway, I have added a few tests to the ClusterClassificationDriver. Since I am not a committer, what is the process for submitting my change? Can I submit a patch through the usual means?
                
> Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning
> --------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-929
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-929
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.6
>            Reporter: Jeff Eastman
>            Assignee: Paritosh Ranjan
>             Fix For: 0.7
>
>         Attachments: Mahout-929, Mahout-929, Mahout-929, Mahout-929
>
>
> The current clustering drivers have a -cp option to produce clusteredPoints directory containing the input vectors classified by the final clusters produced by the algorithm. These options are redundantly implemented in those drivers.
> - Factor out & implement an independent post processor to perform the classification step independently of the various clustering implementations.
> - Implement a pluggable outlier removal capability for this classifier. 
> - Consider building off of the ClusterClassifier & ClusterIterator ideas.


[jira] [Issue Comment Edited] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

Posted by "Paritosh Ranjan (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217953#comment-13217953 ] 

Paritosh Ranjan edited comment on MAHOUT-929 at 2/28/12 6:35 AM:
-----------------------------------------------------------------

The MR test differs in the value of the runSequential argument: for MR it is false, and for sequential it is true.

runClustering(pointsPath, conf, false);
runClassificationWithOutlierRemoval(conf, false);
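
To make the role of the flag concrete, here is a rough, self-contained sketch of a single entry point whose boolean argument only selects the execution path. Everything in it is assumed for illustration: the class and method names are hypothetical, and a parallel stream merely stands in for the MapReduce path.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

/** Hypothetical sketch; these names are illustrative, not Mahout APIs. */
public class RunSequentialSketch {

  /** Index of the nearest 1-D center to x. */
  static int nearest(double x, double[] centers) {
    int best = 0;
    for (int c = 1; c < centers.length; c++) {
      if (Math.abs(x - centers[c]) < Math.abs(x - centers[best])) {
        best = c;
      }
    }
    return best;
  }

  /** One entry point; the flag only selects the execution path. */
  static int[] classify(double[] points, double[] centers, boolean runSequential) {
    IntStream indices = IntStream.range(0, points.length);
    if (!runSequential) {
      indices = indices.parallel(); // stand-in for the MapReduce path
    }
    return indices.map(i -> nearest(points[i], centers)).toArray();
  }

  public static void main(String[] args) {
    double[] points = {1.0, 2.0, 9.0, 11.0};
    double[] centers = {0.0, 10.0};
    int[] seq = classify(points, centers, true);
    int[] par = classify(points, centers, false);
    // Both paths should assign every point to the same cluster.
    System.out.println(Arrays.equals(seq, par));
  }
}
```

With this shape, a test for one mode can be reused for the other by flipping the flag and checking the same expectations.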
                
      was (Author: paritoshranjan):
    The MR test differs where the runSequential argument is used. For MR, its false, for others its true.

runClustering(pointsPath, conf, false);
runClassificationWithOutlierRemoval(conf, false);
                  
> Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning
> --------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-929
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-929
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.6
>            Reporter: Jeff Eastman
>            Assignee: Paritosh Ranjan
>             Fix For: 0.7
>
>         Attachments: Mahout-929, Mahout-929, Mahout-929, Mahout-929
>
>
> The current clustering drivers have a -cp option to produce clusteredPoints directory containing the input vectors classified by the final clusters produced by the algorithm. These options are redundantly implemented in those drivers.
> - Factor out & implement an independent post processor to perform the classification step independently of the various clustering implementations.
> - Implement a pluggable outlier removal capability for this classifier. 
> - Consider building off of the ClusterClassifier & ClusterIterator ideas.


[jira] [Issue Comment Edited] (MAHOUT-929) Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning

Posted by "Paritosh Ranjan (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13214329#comment-13214329 ] 

Paritosh Ranjan edited comment on MAHOUT-929 at 2/23/12 6:33 AM:
-----------------------------------------------------------------

Assigned to myself.

I think the cluster classification driver is developed now. I will wait a while for the ClusterClassificationMapper test case (patch), as we asked on dev.

Otherwise, I will write it and commit it myself. I might need help committing for the first time.

Since ClusterClassificationDriver development is done, we need to refactor the KMeans, FuzzyK, Dirichlet, and Canopy drivers.
I will create separate child issues for refactoring these algorithms (the respective driver classes), so that different people can pick them up in parallel if they want. This will help avoid duplicated effort.

Jeff, any comments/suggestions?
                
      was (Author: paritoshranjan):
    Assigned to myself.

I think cluster classification driver is developed now. Would wait for some time for the ClusterClassificationMapper's Test case ( patch ) as we asked on dev.

Else I will write it and commit it. Might need help while committing for the first time. 

Considering, ClusterClassificationDriver development is done, we need to refactor the KMeans, FuzzyK, Dirichlet, Canopy Drivers.
I will create separate child issues for refactoring these algos, so that different people can pick it in parallel, if they want. It will help in avoiding duplicate efforts.

Jeff, any comments/suggestions?
                  
> Refactor Clustering (Vector Classification) into a Separate Postprocess with Outlier Pruning
> --------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-929
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-929
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.6
>            Reporter: Jeff Eastman
>            Assignee: Paritosh Ranjan
>             Fix For: 0.7
>
>         Attachments: Mahout-929, Mahout-929, Mahout-929, Mahout-929
>
>
> The current clustering drivers have a -cp option to produce clusteredPoints directory containing the input vectors classified by the final clusters produced by the algorithm. These options are redundantly implemented in those drivers.
> - Factor out & implement an independent post processor to perform the classification step independently of the various clustering implementations.
> - Implement a pluggable outlier removal capability for this classifier. 
> - Consider building off of the ClusterClassifier & ClusterIterator ideas.
