Posted to dev@mahout.apache.org by "Paritosh Ranjan (Created) (JIRA)" <ji...@apache.org> on 2011/12/18 05:42:30 UTC

[jira] [Created] (MAHOUT-931) Implement a pluggable outlier removal capability for cluster classifiers

Implement a pluggable outlier removal capability for cluster classifiers
------------------------------------------------------------------------

                 Key: MAHOUT-931
                 URL: https://issues.apache.org/jira/browse/MAHOUT-931
             Project: Mahout
          Issue Type: Improvement
          Components: Classification, Clustering
    Affects Versions: 0.6
            Reporter: Paritosh Ranjan
             Fix For: 0.7


A pluggable outlier removal capability is needed for cluster classification. The classification and outlier removal implementations should be completely separate entities, for better abstraction.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-931) Implement a pluggable outlier removal capability for cluster classifiers

Posted by "Paritosh Ranjan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172480#comment-13172480 ] 

Paritosh Ranjan commented on MAHOUT-931:
----------------------------------------

This story depends on the implementation/design of MAHOUT-930. I think MAHOUT-930's design of vector classification is laid out pretty nicely. We can start implementing all the policies and the other improvements.

But before fully implementing the cluster classification, I think it would be good to at least finalize the interface for outlier removal. I also think that binding it only to outlier removal is not going to serve us forever.

So, following the open/closed principle, let's close it for further modification by plugging a Collection<Strategy> into the Policy. A Strategy can be outlier removal or any other feature that can be developed by implementing the Strategy interface, so this also keeps it open for extension. "Strategy" is just a thought; it can be any other name.

I will try to submit a patch for some mock/Canopy outlier removal first, by implementing "Strategy". If the design works and looks good, then the design part will be over.
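
To make the Collection<Strategy> idea concrete, here is a rough sketch of how it might look. All names (Strategy, ThresholdOutlierRemoval, StrategyPolicy) are hypothetical and chosen only for illustration; none of them is taken from Mahout or from the attached patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout: Strategy, ThresholdOutlierRemoval and
// StrategyPolicy are illustrative only, not classes from Mahout.

/** A pluggable post-classification step; returns the (possibly adjusted) probabilities. */
interface Strategy {
  double[] apply(double[] probabilities);
}

/** One possible Strategy: zero out every cluster whose probability is below a threshold. */
final class ThresholdOutlierRemoval implements Strategy {
  private final double threshold;

  ThresholdOutlierRemoval(double threshold) {
    this.threshold = threshold;
  }

  @Override
  public double[] apply(double[] probabilities) {
    double[] result = new double[probabilities.length];
    for (int i = 0; i < probabilities.length; i++) {
      result[i] = probabilities[i] >= threshold ? probabilities[i] : 0.0;
    }
    return result;
  }
}

/** A policy that stays closed for modification but open for extension via its strategies. */
final class StrategyPolicy {
  private final List<Strategy> strategies;

  StrategyPolicy(List<Strategy> strategies) {
    this.strategies = new ArrayList<>(strategies);
  }

  /** Runs every plugged-in strategy, in order, over the classifier's output. */
  double[] select(double[] probabilities) {
    double[] current = probabilities.clone();
    for (Strategy strategy : strategies) {
      current = strategy.apply(current);
    }
    return current;
  }
}
```

A new feature would then be added by writing another Strategy implementation and passing it in, without touching StrategyPolicy itself.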

Does it look like a good way to proceed? Any suggestions?
                

        

[jira] [Commented] (MAHOUT-931) Implement a pluggable outlier removal capability for cluster classifiers

Posted by "Paritosh Ranjan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176227#comment-13176227 ] 

Paritosh Ranjan commented on MAHOUT-931:
----------------------------------------

I am a bit confused.

Are we planning to get rid of the way clustering is currently done, which is algorithm-specific (i.e. the code in CanopyClusterer)?
Will the new clustering strategy be "only" what is implemented in ClusterClassifier, i.e. calculating the probabilities of vectors belonging to different models (clusters) and choosing the model with the highest probability?

If yes, then implementing a clustering policy for each clustering algorithm is all that is needed. And for outlier removal, just a threshold probability will be needed: vectors whose probability falls below it won't be clustered. Am I correct?
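
As a minimal sketch of what I mean by the threshold, assuming the classifier hands back one probability per cluster: pick the most probable cluster and reject the vector if even that probability is too low. ThresholdClassifier and its OUTLIER marker are hypothetical names, not part of the ClusterClassifier API.

```java
// Hypothetical helper, not Mahout's ClusterClassifier API.
final class ThresholdClassifier {
  /** Marker returned when no cluster is likely enough. */
  static final int OUTLIER = -1;

  private ThresholdClassifier() {}

  /**
   * Returns the index of the most probable cluster, or OUTLIER when even
   * that probability falls below the threshold.
   */
  static int classify(double[] probabilities, double threshold) {
    int best = OUTLIER;
    double bestProbability = Double.NEGATIVE_INFINITY;
    for (int i = 0; i < probabilities.length; i++) {
      if (probabilities[i] > bestProbability) {
        bestProbability = probabilities[i];
        best = i;
      }
    }
    return bestProbability >= threshold ? best : OUTLIER;
  }
}
```

With a threshold of 0.5, probabilities {0.2, 0.7, 0.1} would go to cluster 1, while {0.4, 0.35, 0.25} would be treated as an outlier.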

Until now, I have been thinking that the clustering code just needs to be refactored out (without changing the implementation). If that is the case, then I think I have been proceeding in the correct direction (in terms of design).

However, I suspect that we are not in sync regarding the implementation approach. I think you want to change the clustering implementation to a cluster classification implementation with outlier removal (and completely get rid of the algorithm-specific implementation, which makes sense).

So, it would be really helpful if you could clarify my doubts.



                

        

[jira] [Commented] (MAHOUT-931) Implement a pluggable outlier removal capability for cluster classifiers

Posted by "Paritosh Ranjan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176423#comment-13176423 ] 

Paritosh Ranjan commented on MAHOUT-931:
----------------------------------------

Ok, I will start working in the following order then. I have a few more doubts, which I have written inline.

 - 929 implement a new post processor that does only classification as required by the various clusterPoints steps.

The new post processor for clusterPoints() would use the ClusterClassifier to identify which vector belongs to which cluster, at least for K-means, Canopy, and Dirichlet (i.e. similar policies exist for them). I need to create a mapreduce and a sequential version of it. Am I correct?

The current ClusterIterator is for the buildClusters phase, since it also trains the models along the way?

 - 930 modify the existing drivers to use this post processor rather than their current, custom implementations.

Currently, buildClusters and clusterPoints run in the same method call for each vector. The new implementation would let buildClusters run for all input vectors first and, only after buildClusters has completely finished, start a new call to clusterPoints (for all input vectors, using the new post processor).

 - 931 modify the post processor to support pluggable outlier removal.

This would be a probability-threshold-based implementation?
                

        

[jira] [Commented] (MAHOUT-931) Implement a pluggable outlier removal capability for cluster classifiers

Posted by "Paritosh Ranjan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175980#comment-13175980 ] 

Paritosh Ranjan commented on MAHOUT-931:
----------------------------------------

The ClusterConfig class in the patch can also be used to group all the clustering parameters of the different clustering algorithms in one class. This will help get rid of the long parameter lists in the run() methods of the clustering drivers.
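
Roughly, the idea could look like the sketch below. The patch's actual ClusterConfig is not reproduced in this thread, so every field, subclass, and method here (CanopyConfig, CanopyDriverSketch, describe) is an assumption made for illustration.

```java
// Hypothetical sketch: ClusterConfig's real shape in the patch is not shown in
// this thread, so every field and class below is an assumption.

/** Parameters shared by all clustering drivers. */
abstract class ClusterConfig {
  final String inputPath;
  final String outputPath;

  ClusterConfig(String inputPath, String outputPath) {
    this.inputPath = inputPath;
    this.outputPath = outputPath;
  }
}

/** Canopy-specific parameters (the T1 and T2 distance thresholds). */
final class CanopyConfig extends ClusterConfig {
  final double t1;
  final double t2;

  CanopyConfig(String inputPath, String outputPath, double t1, double t2) {
    super(inputPath, outputPath);
    this.t1 = t1;
    this.t2 = t2;
  }
}

/** Stand-in for a driver whose entry point takes one config object instead of many arguments. */
final class CanopyDriverSketch {
  private CanopyDriverSketch() {}

  static String describe(CanopyConfig config) {
    return "canopy " + config.inputPath + " -> " + config.outputPath
        + " (t1=" + config.t1 + ", t2=" + config.t2 + ")";
  }
}
```

A driver method would then accept a single CanopyConfig rather than a long list of positional arguments, and new parameters could be added to the config class without changing the method signature.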
                

        

[jira] [Assigned] (MAHOUT-931) Implement a pluggable outlier removal capability for cluster classifiers

Posted by "Paritosh Ranjan (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paritosh Ranjan reassigned MAHOUT-931:
--------------------------------------

    Assignee: Paritosh Ranjan
    

        

[jira] [Commented] (MAHOUT-931) Implement a pluggable outlier removal capability for cluster classifiers

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13216845#comment-13216845 ] 

Hudson commented on MAHOUT-931:
-------------------------------

Integrated in Mahout-Quality #1368 (See [https://builds.apache.org/job/Mahout-Quality/1368/])
    MAHOUT-931, MAHOUT-929. Added emitMostLikely and threshold based outlier removal capability in ClusterClassificationDriver. (Revision 1293874)

     Result = SUCCESS
pranjan : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1293874
Files : 
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/classify/ClusterClassificationDriver.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/classify/ClusterClassificationDriverTest.java

                

        

[jira] [Commented] (MAHOUT-931) Implement a pluggable outlier removal capability for cluster classifiers

Posted by "Paritosh Ranjan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176266#comment-13176266 ] 

Paritosh Ranjan commented on MAHOUT-931:
----------------------------------------

Ok. 

Should I proceed like this:

Step 1) Encapsulate cluster-specific CLI arguments (ClusterConfig and its cluster-specific implementations)

Step 2) Implement all clustering policies

Step 3) Implement outlier removal in policies.
Step 3a) First cut: use a probability-threshold-based outlier removal (as described in the previous comment)
Step 3b) Final cut: use cluster-specific arguments for outlier removal.

Step 4) Replace the clustering algorithms with Classifier/Iterator (for algorithms which can be done this way)

Regarding naming, I would say that readability should always be given importance. I consider naming an important part of software development, whether working alone or in a team. I prefer readable code to JavaDocs. The current code does not have ample JavaDocs, so at least the naming should be appropriate. I am not pushing for a name change, just expressing my thoughts.

If you agree with implementing things in the order (steps) I mentioned, then I can start implementing them. If you have any suggestions to improve them, please share them.

                

        

[jira] [Commented] (MAHOUT-931) Implement a pluggable outlier removal capability for cluster classifiers

Posted by "Jeff Eastman (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176331#comment-13176331 ] 

Jeff Eastman commented on MAHOUT-931:
-------------------------------------

1. I don't see a reason to introduce ClusterConfigs yet. I believe the various CLI arguments can be carried in the appropriate ClusteringPolicy implementations.

2. Other than augmenting what exists already with some more CLI arguments, I think this is done.

3. Outlier removal is not a part of the buildClusters step, rather the clusterPoints step. I thought you were going to work on those stories while I finish up the mapreduce implementation of buildClusters using ClusterIterator/Classifier/Policies (MAHOUT-933)? This story (MAHOUT-931) should follow after -929 & -930, IMHO, for example:
 - 929 implement a new post processor that does only classification as required by the various clusterPoints steps.
 - 930 modify the existing drivers to use this post processor rather than their current, custom implementations.
 - 931 modify the post processor to support pluggable outlier removal.

4. This can be done once -933 is complete.

In any case, this is all post-0.6 stuff. Let's leave trunk where it is with the renaming for now.
                

        

[jira] [Commented] (MAHOUT-931) Implement a pluggable outlier removal capability for cluster classifiers

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217942#comment-13217942 ] 

Hudson commented on MAHOUT-931:
-------------------------------

Integrated in Mahout-Quality #1371 (See [https://builds.apache.org/job/Mahout-Quality/1371/])
    MAHOUT-929, MAHOUT-931. Implemented mapreduce version of ClusterClassificationDriver with outlier removal capability.
Changed output of sequential to WeightedVectorWritable. Fixed and added test cases. (Revision 1294454)

     Result = SUCCESS
pranjan : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1294454
Files : 
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/classify/ClusterClassificationConfigKeys.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/classify/ClusterClassificationDriver.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/classify/ClusterClassificationMapper.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/clustering/classify/ClusterClassificationDriverTest.java

                

        

[jira] [Updated] (MAHOUT-931) Implement a pluggable outlier removal capability for cluster classifiers

Posted by "Paritosh Ranjan (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paritosh Ranjan updated MAHOUT-931:
-----------------------------------

    Attachment: MAHOUT-931

I was thinking about the implementation and the interface design, and thought it could best be described using some code.
I think this interface design will be able to handle almost all future implementation changes in the classification of clusters.
If you have suggestions to improve it, I can work on them. Otherwise, I think we can commit it and build on it.
                

        

[jira] [Resolved] (MAHOUT-931) Implement a pluggable outlier removal capability for cluster classifiers

Posted by "Paritosh Ranjan (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paritosh Ranjan resolved MAHOUT-931.
------------------------------------

    Resolution: Fixed

This issue got resolved with MAHOUT-929.
                

        

[jira] [Updated] (MAHOUT-931) Implement a pluggable outlier removal capability for cluster classifiers

Posted by "Paritosh Ranjan (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paritosh Ranjan updated MAHOUT-931:
-----------------------------------

    Comment: was deleted

(was: Added emitMostLikely and threshold-based outlier removal capability to ClusterClassificationDriver. The patch is attached.

I want to commit it (as a test of my first commit at Apache), as it's a low-risk patch.

However, I am not sure of the formatting. I have used Eclipse Helios and the formatter inside buildtools.

Can someone help in verifying whether the code is formatted properly or not?)
    

        

[jira] [Updated] (MAHOUT-931) Implement a pluggable outlier removal capability for cluster classifiers

Posted by "Paritosh Ranjan (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paritosh Ranjan updated MAHOUT-931:
-----------------------------------

    Comment: was deleted

(was: Added emitMostLikely and threshold-based outlier removal capability to ClusterClassificationDriver.

I want to commit it (as a test of my first commit at Apache), as it's a low-risk patch.

However, I am not sure of the formatting. I have used Eclipse Helios and the formatter inside buildtools.

Can someone help in verifying whether the code is formatted properly or not?)
    

        

Re: [jira] [Commented] (MAHOUT-931) Implement a pluggable outlier removal capability for cluster classifiers

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
That's the same formatter I use. I had to change the page width to 120 
which, I believe, is our standard but it is good enough. Other 
committers are using IJ formatting and I doubt they will ever be 100% 
compatible with Eclipse. Not that big a deal either IMHO. Go ahead, push 
the button.

On 2/26/12 9:00 AM, Paritosh Ranjan (Commented) (JIRA) wrote:
>      [ https://issues.apache.org/jira/browse/MAHOUT-931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13216772#comment-13216772 ]
>
> Paritosh Ranjan commented on MAHOUT-931:
> ----------------------------------------
>
> Added emitMostLikely and threshold-based outlier removal capability to ClusterClassificationDriver. The patch is attached.
>
> I want to commit it (as a test of my first commit at Apache), as it's a low-risk patch.
>
> However, I am not sure of the formatting. I have used Eclipse Helios and the formatter inside buildtools.
>
> Can someone help in verifying whether the code is formatted properly or not?


[jira] [Commented] (MAHOUT-931) Implement a pluggable outlier removal capability for cluster classifiers

Posted by "Paritosh Ranjan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13216772#comment-13216772 ] 

Paritosh Ranjan commented on MAHOUT-931:
----------------------------------------

Added emitMostLikely and threshold-based outlier removal capability to ClusterClassificationDriver. The patch is attached.

I want to commit it (as a test of my first commit at Apache), as it's a low-risk patch.

However, I am not sure of the formatting. I have used Eclipse Helios and the formatter inside buildtools.

Can someone help in verifying whether the code is formatted properly or not?
                
> Implement a pluggable outlier removal capability for cluster classifiers
> ------------------------------------------------------------------------
>
>                 Key: MAHOUT-931
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-931
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.6
>            Reporter: Paritosh Ranjan
>            Assignee: Paritosh Ranjan
>             Fix For: 0.7
>
>         Attachments: MAHOUT-931, MAHOUT-931
>
>
> A pluggable outlier removal capability while classifying the clusters is needed. The classification and outlier removal implementations, both should be completely separate entities for better abstraction. 


        

[jira] [Commented] (MAHOUT-931) Implement a pluggable outlier removal capability for cluster classifiers

Posted by "Jeff Eastman (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176707#comment-13176707 ] 

Jeff Eastman commented on MAHOUT-931:
-------------------------------------

- 929: Yes, use the existing ClusterClassifier to write sequential and MapReduce versions of a post processor to do vector classification. You should not need the ClusterIterator as that is used for the buildClusters phase.
- 930: No, buildClusters runs to completion on all vectors before clusterPoints is called on them. Currently, it is not possible to run the clusterPoints without first running buildClusters. With the post processor, they will be completely independent jobs (the existing CLI drivers may still bundle them for compatibility).
- 931: Yes, a probability-based threshold would work with the current ClusterClassifier API. A distance-based threshold (like Canopy T1 pruning) would need a different mechanism.
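
To make the distinction in the last point concrete: a probability-based threshold only needs the classifier's output, whereas a distance-based one needs the cluster centers and a distance measure as well. A minimal sketch of the latter, assuming Euclidean distance and plain arrays (DistanceThresholdSketch and nearestWithin are hypothetical names, and this is not Canopy's actual T1 logic):

```java
// Hypothetical illustration of a distance-based cutoff; not Mahout code.
final class DistanceThresholdSketch {

  /** Plain Euclidean distance between two points of equal dimension. */
  static double euclidean(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  /**
   * Returns the index of the nearest center whose distance to the point is
   * at most t1, or -1 if no center is that close (the point is an outlier).
   */
  static int nearestWithin(double[] point, double[][] centers, double t1) {
    int best = -1;
    double bestDistance = Double.POSITIVE_INFINITY;
    for (int i = 0; i < centers.length; i++) {
      double distance = euclidean(point, centers[i]);
      if (distance <= t1 && distance < bestDistance) {
        best = i;
        bestDistance = distance;
      }
    }
    return best;
  }
}
```

Note that the cutoff here is applied to raw distances, which is information a classifier returning only probabilities does not expose.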
                
> Implement a pluggable outlier removal capability for cluster classifiers
> ------------------------------------------------------------------------
>
>                 Key: MAHOUT-931
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-931
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.6
>            Reporter: Paritosh Ranjan
>             Fix For: 0.7
>
>         Attachments: MAHOUT-931
>
>
> A pluggable outlier removal capability while classifying the clusters is needed. The classification and outlier removal implementations, both should be completely separate entities for better abstraction. 


        

Re: Multiple outputs

Posted by Jake Mannix <ja...@gmail.com>.
Funny that you ask about this, as I was just writing code which required
hacking back to
the old API because it needed to use MultipleOutputs.

Short answer: as far as I can tell, the only way to get
MultipleOutputs (or map-side
join via CompositeInputFormat) is to go back to the old and ugly JobConf /
MapReduceBase
API.

Sad, but necessary, unless I'm mistaken.

  -jake

On Mon, Dec 19, 2011 at 7:52 PM, Raphael Cendrillon <
cendrillon1978@gmail.com> wrote:

> A question, is it possible to use multiple outputs with the new Hadoop API?
>
> It seems that multiple outputs were only fully ported in Hadoop 0.21, but I
> think Mahout uses 0.20.
>
> Does that mean I need to stick with the old API (JobConf etc.)?
>
> Thanks!
>
> Raphael.

Re: Multiple outputs

Posted by Raphael Cendrillon <ce...@gmail.com>.
Fantastic. Thanks!

On Dec 19, 2011, at 8:47 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> in particular, i think B' job uses new api for Job etc. but yet
> produces old api multiple outputs (and i think it may even do it in
> both mapper and reducer).
> 
> On Mon, Dec 19, 2011 at 8:45 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> i hacked it. i use multiple outputs from old api which i pull on the
>> new api context (see code). Surprisingly, it works (most likely, new
>> api just delegates to it in 0.21)
>> 
>> On Mon, Dec 19, 2011 at 7:52 PM, Raphael Cendrillon
>> <ce...@gmail.com> wrote:
>>> A question, is it possible to use multiple outputs with the new Hadoop API?
>>> 
>>> It seems that multiple outputs were only fully ported in Hadoop 0.21, but I think Mahout uses 0.20.
>>> 
>>> Does that mean I need to stick with the old API (JobConf etc.)?
>>> 
>>> Thanks!
>>> 
>>> Raphael.

Re: Multiple outputs

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
in particular, i think B' job uses new api for Job etc. but yet
produces old api multiple outputs (and i think it may even do it in
both mapper and reducer).

On Mon, Dec 19, 2011 at 8:45 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> i hacked it. i use multiple outputs from old api which i pull on the
> new api context (see code). Surprisingly, it works (most likely, new
> api just delegates to it in 0.21)
>
> On Mon, Dec 19, 2011 at 7:52 PM, Raphael Cendrillon
> <ce...@gmail.com> wrote:
>> A question, is it possible to use multiple outputs with the new Hadoop API?
>>
>> It seems that multiple outputs were only fully ported in Hadoop 0.21, but I think Mahout uses 0.20.
>>
>> Does that mean I need to stick with the old API (JobConf etc.)?
>>
>> Thanks!
>>
>> Raphael.

Re: Multiple outputs

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Haven't seen where it wouldn't work so far. After all it is all just
property interpretation helpers, so I suppose for as long as legacy classes
are still around, there's no compelling reason for it not to work.

apologies for brevity.

Sent from my android.
-Dmitriy
On Dec 19, 2011 9:37 PM, "Jake Mannix" <ja...@gmail.com> wrote:

> Ah, this is nice!  I had not realized this works.  Do you know in which
> hadoop
> versions it works for?
>
>  -jake
>
> On Mon, Dec 19, 2011 at 8:45 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > i hacked it. i use multiple outputs from old api which i pull on the
> > new api context (see code). Surprisingly, it works (most likely, new
> > api just delegates to it in 0.21)
> >
> > On Mon, Dec 19, 2011 at 7:52 PM, Raphael Cendrillon
> > <ce...@gmail.com> wrote:
> > > A question, is it possible to use multiple outputs with the new Hadoop API?
> > >
> > > It seems that multiple outputs were only fully ported in Hadoop 0.21, but I think Mahout uses 0.20.
> > >
> > > Does that mean I need to stick with the old API (JobConf etc.)?
> > >
> > > Thanks!
> > >
> > > Raphael.
> >
>

Re: Multiple outputs

Posted by Jake Mannix <ja...@gmail.com>.
Ah, this is nice!  I had not realized this works.  Do you know which
Hadoop versions it works for?

  -jake

On Mon, Dec 19, 2011 at 8:45 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> i hacked it. i use multiple outputs from old api which i pull on the
> new api context (see code). Surprisingly, it works (most likely, new
> api just delegates to it in 0.21)
>
> On Mon, Dec 19, 2011 at 7:52 PM, Raphael Cendrillon
> <ce...@gmail.com> wrote:
> > A question, is it possible to use multiple outputs with the new Hadoop API?
> >
> > It seems that multiple outputs were only fully ported in Hadoop 0.21, but I think Mahout uses 0.20.
> >
> > Does that mean I need to stick with the old API (JobConf etc.)?
> >
> > Thanks!
> >
> > Raphael.
>

Re: Multiple outputs

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
i hacked it. i use multiple outputs from old api which i pull on the
new api context (see code). Surprisingly, it works (most likely, new
api just delegates to it in 0.21)

On Mon, Dec 19, 2011 at 7:52 PM, Raphael Cendrillon
<ce...@gmail.com> wrote:
> A question, is it possible to use multiple outputs with the new Hadoop API?
>
> It seems that multiple outputs were only fully ported in Hadoop 0.21, but I think Mahout uses 0.20.
>
> Does that mean I need to stick with the old API (JobConf etc.)?
>
> Thanks!
>
> Raphael.

Multiple outputs

Posted by Raphael Cendrillon <ce...@gmail.com>.
A question: is it possible to use multiple outputs with the new Hadoop API?

It seems that multiple outputs were only fully ported in Hadoop 0.21, but I think Mahout uses 0.20.

Does that mean I need to stick with the old API (JobConf etc.)?

Thanks!

Raphael. 

[jira] [Commented] (MAHOUT-931) Implement a pluggable outlier removal capability for cluster classifiers

Posted by "Jeff Eastman (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172873#comment-13172873 ] 

Jeff Eastman commented on MAHOUT-931:
-------------------------------------

I agree that defining the interfaces for cluster classification and outlier removal is a good place to start. Why don't you take a stab at it since you seem to have some ideas in mind?
                
> Implement a pluggable outlier removal capability for cluster classifiers
> ------------------------------------------------------------------------
>
>                 Key: MAHOUT-931
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-931
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.6
>            Reporter: Paritosh Ranjan
>             Fix For: 0.7
>
>
> A pluggable outlier removal capability while classifying the clusters is needed. The classification and outlier removal implementations, both should be completely separate entities for better abstraction. 


        

[jira] [Commented] (MAHOUT-931) Implement a pluggable outlier removal capability for cluster classifiers

Posted by "Jeff Eastman (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175983#comment-13175983 ] 

Jeff Eastman commented on MAHOUT-931:
-------------------------------------

This patch looks to be mostly a rename of existing classes. I'm not one to be hung up on names, but I don't understand why the first thing you are proposing is to rename everything?
                
> Implement a pluggable outlier removal capability for cluster classifiers
> ------------------------------------------------------------------------
>
>                 Key: MAHOUT-931
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-931
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.6
>            Reporter: Paritosh Ranjan
>             Fix For: 0.7
>
>         Attachments: MAHOUT-931
>
>
> A pluggable outlier removal capability while classifying the clusters is needed. The classification and outlier removal implementations, both should be completely separate entities for better abstraction. 


        

[jira] [Commented] (MAHOUT-931) Implement a pluggable outlier removal capability for cluster classifiers

Posted by "Jeff Eastman (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176239#comment-13176239 ] 

Jeff Eastman commented on MAHOUT-931:
-------------------------------------

Renaming existing entities may be appropriate, but that ought to be done as a separate, independent and agreed-upon change. Otherwise we do not have a consistent vocabulary to discuss the functionality issues. Can we hold off on renaming until we get a bit more of the semantics defined? 

I tend to agree that implementing a set of algorithm-specific clustering policy objects will enable many (not all) of the current implementations to be re-implemented with the ClusterClassifier/Iterator. I think we will need to preserve the existing driver classes which support CLI argument selection in their run() methods but that the buildClusters methods would be revamped to use the new implementation. It does seem like these policy objects need to encapsulate the relevant CLI arguments so we are in sync there.

The clusterPoints methods can also be re-implemented using the new clustering postprocessor in MAHOUT-929.


                
> Implement a pluggable outlier removal capability for cluster classifiers
> ------------------------------------------------------------------------
>
>                 Key: MAHOUT-931
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-931
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.6
>            Reporter: Paritosh Ranjan
>             Fix For: 0.7
>
>         Attachments: MAHOUT-931
>
>
> A pluggable outlier removal capability while classifying the clusters is needed. The classification and outlier removal implementations, both should be completely separate entities for better abstraction. 


        

[jira] [Commented] (MAHOUT-931) Implement a pluggable outlier removal capability for cluster classifiers

Posted by "Paritosh Ranjan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175989#comment-13175989 ] 

Paritosh Ranjan commented on MAHOUT-931:
----------------------------------------

I think the Clustering Policy is all that is needed for extensibility. The design changes I made are:

a) Passing the vector rather than the probability to the clustering policy. I think this might be needed for clustering/outlier removal. It might also help in transforming the vector or adding weights before classification (thinking of some future functionality).
b) Added ClusterConfig objects to the policies. Now the clustering policy will know all the clustering parameters used, so it will be able to classify accordingly.
c) ClusterConfig objects will emerge as generic cluster configuration objects, which can be used anywhere in clustering algorithms. Right now, there are a bunch of clustering parameters scattered through method calls.
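
Points (a) to (c) could look roughly like the sketch below. It is only an illustration of the idea under assumptions, not the code in the patch: the members of ClusterConfig, the VectorClusteringPolicy interface, the NearestCenterPolicy class and the "maxDistance" parameter name are all hypothetical, and vectors are plain double[] arrays.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical holder for clustering parameters that are otherwise passed around as method arguments. */
final class ClusterConfig {
  private final Map<String, Double> params = new HashMap<>();

  ClusterConfig set(String name, double value) {
    params.put(name, value);
    return this;
  }

  double get(String name, double defaultValue) {
    return params.getOrDefault(name, defaultValue);
  }
}

/** Hypothetical policy that receives the vector itself instead of precomputed probabilities. */
interface VectorClusteringPolicy {
  /** Returns the index of the chosen cluster, or -1 if the vector is rejected. */
  int select(double[] vector, double[][] centers, ClusterConfig config);
}

/** Example policy: nearest center wins, subject to a configured maximum distance. */
final class NearestCenterPolicy implements VectorClusteringPolicy {
  @Override
  public int select(double[] vector, double[][] centers, ClusterConfig config) {
    // The policy reads its own parameter from the shared config object.
    double maxDistance = config.get("maxDistance", Double.POSITIVE_INFINITY);
    int best = -1;
    double bestDistance = Double.POSITIVE_INFINITY;
    for (int i = 0; i < centers.length; i++) {
      double sum = 0.0;
      for (int j = 0; j < vector.length; j++) {
        double d = vector[j] - centers[i][j];
        sum += d * d;
      }
      double distance = Math.sqrt(sum);
      if (distance <= maxDistance && distance < bestDistance) {
        best = i;
        bestDistance = distance;
      }
    }
    return best;
  }
}
```

Because the policy sees the raw vector and the config together, a different policy could transform or weight the vector first without any change to the caller.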

I am in the habit of renaming/cleaning things while coding. So, it just happened.
                
> Implement a pluggable outlier removal capability for cluster classifiers
> ------------------------------------------------------------------------
>
>                 Key: MAHOUT-931
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-931
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.6
>            Reporter: Paritosh Ranjan
>             Fix For: 0.7
>
>         Attachments: MAHOUT-931
>
>
> A pluggable outlier removal capability while classifying the clusters is needed. The classification and outlier removal implementations, both should be completely separate entities for better abstraction. 


        

[jira] [Updated] (MAHOUT-931) Implement a pluggable outlier removal capability for cluster classifiers

Posted by "Paritosh Ranjan (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paritosh Ranjan updated MAHOUT-931:
-----------------------------------

    Attachment: MAHOUT-931

Added emitMostLikely and threshold-based outlier removal capability to ClusterClassificationDriver.

I want to commit it (to test my first commit in Apache), as it's a non-risky patch.

However, I am not sure of the formatting. I have used Eclipse Helios and the formatter inside buildtools.

Can someone help in verifying whether the code is formatted properly or not?
                
> Implement a pluggable outlier removal capability for cluster classifiers
> ------------------------------------------------------------------------
>
>                 Key: MAHOUT-931
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-931
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.6
>            Reporter: Paritosh Ranjan
>            Assignee: Paritosh Ranjan
>             Fix For: 0.7
>
>         Attachments: MAHOUT-931, MAHOUT-931
>
>
> A pluggable outlier removal capability while classifying the clusters is needed. The classification and outlier removal implementations, both should be completely separate entities for better abstraction. 


        

[jira] [Commented] (MAHOUT-931) Implement a pluggable outlier removal capability for cluster classifiers

Posted by "Paritosh Ranjan (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172897#comment-13172897 ] 

Paritosh Ranjan commented on MAHOUT-931:
----------------------------------------

Ok, I will try to submit a patch for it soon.
                
> Implement a pluggable outlier removal capability for cluster classifiers
> ------------------------------------------------------------------------
>
>                 Key: MAHOUT-931
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-931
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.6
>            Reporter: Paritosh Ranjan
>             Fix For: 0.7
>
>
> A pluggable outlier removal capability while classifying the clusters is needed. The classification and outlier removal implementations, both should be completely separate entities for better abstraction. 
