Posted to dev@mahout.apache.org by "Jeff Eastman (Commented) (JIRA)" <ji...@apache.org> on 2011/11/03 18:39:33 UTC
[jira] [Commented] (MAHOUT-843) Top Down Clustering

    [ https://issues.apache.org/jira/browse/MAHOUT-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13143362#comment-13143362 ] 

Jeff Eastman commented on MAHOUT-843:
-------------------------------------

This patch looks like a refinement of the earlier patch. Writing a Java driver to orchestrate top-down clustering given the Config and Postprocessor instances seems a useful experiment. What is needed to move this patch closer to trunk is: 1) some unit tests of the Java classes, 2) a command line interface. This last requirement is where I get back to my earlier question above: "how is this better than using the existing [CLI] jobs [in a shell script]?"

To use the Java classes for top clusterer A and bottom clusterer B one needs to provide all of the arguments for A and B. Given all the different flavors of A and B which could be chosen, it still seems really complicated to define a single CLI which can provide all the permutations. Do you have a strategy for this?

I do think the postprocessor that splits the clusteredPointsA into directories, so that multiple invocations of B can proceed, is useful, and I would suggest focusing on that as a stand-alone CLI method first. This would be a minimal first step and would avoid the combinatorial explosion of A,B CLI arguments needed to encapsulate the whole process. With some unit tests and an example script or two, I could see that in trunk very soon.
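
As a rough illustration of the splitting step described above, the following sketch groups clustered points by cluster id and writes each group into its own directory, so that a second clusterer could be run once per directory. It is not the code from the patch: the class name ClusterSplitSketch, the ClusteredPoint type, the "cluster-<id>" directory naming, and the "part-00000" file name are all hypothetical, and it uses plain text files on the local filesystem rather than Hadoop or Mahout APIs.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ClusterSplitSketch {

  /** Hypothetical stand-in for one clustered point: a cluster id plus the point as text. */
  static final class ClusteredPoint {
    final int clusterId;
    final String point;

    ClusteredPoint(int clusterId, String point) {
      this.clusterId = clusterId;
      this.point = point;
    }
  }

  /**
   * Groups points by cluster id and writes each group to outputRoot/cluster-ID/part-00000.
   * Returns the directory created for each cluster id (assumed layout, not Mahout's).
   */
  static Map<Integer, Path> splitByCluster(List<ClusteredPoint> points, Path outputRoot)
      throws IOException {
    // Collect the points that belong to each top-level cluster.
    Map<Integer, List<String>> grouped = new TreeMap<>();
    for (ClusteredPoint p : points) {
      grouped.computeIfAbsent(p.clusterId, k -> new ArrayList<>()).add(p.point);
    }

    // Write one directory per cluster; each can then be the input of a bottom-level run.
    Map<Integer, Path> dirs = new TreeMap<>();
    for (Map.Entry<Integer, List<String>> e : grouped.entrySet()) {
      Path dir = outputRoot.resolve("cluster-" + e.getKey());
      Files.createDirectories(dir);
      Files.write(dir.resolve("part-00000"), e.getValue(), StandardCharsets.UTF_8);
      dirs.put(e.getKey(), dir);
    }
    return dirs;
  }

  public static void main(String[] args) throws IOException {
    List<ClusteredPoint> points = Arrays.asList(
        new ClusteredPoint(0, "1.0,1.0"),
        new ClusteredPoint(1, "9.0,9.0"),
        new ClusteredPoint(0, "1.2,0.8"));
    Path root = Files.createTempDirectory("top-down-split");
    for (Map.Entry<Integer, Path> e : splitByCluster(points, root).entrySet()) {
      System.out.println("cluster " + e.getKey() + " -> " + e.getValue());
    }
  }
}
```

Keeping the split as its own step means clusterer B needs no knowledge of clusterer A's arguments: it only sees a directory of points, which is what makes a stand-alone CLI for this step small.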
                
> Top Down Clustering
> -------------------
>
>                 Key: MAHOUT-843
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-843
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>    Affects Versions: 0.6
>            Reporter: Paritosh Ranjan
>              Labels: clustering, patch
>             Fix For: 0.6
>
>         Attachments: MAHOUT-843-patch, Top-Down-Clustering-patch
>
>
> Top Down Clustering works in multiple steps. The first step is to find comparatively bigger clusters. The second step is to cluster those bigger chunks into meaningful clusters. This can improve performance when clustering large amounts of data. It also removes the dependency on providing input clusters/numbers to the clustering algorithm.
> "Big" is a relative term, as is the smaller "meaningful". So the size of these "bigger" and "smaller/meaningful" clusters will be controlled by the user.
> The user can also select which clustering algorithm to use at the top level and which to use at the bottom level. Initially, this can be done for only one or a few clustering algorithms; later, an option can be provided to use any algorithm that suits the case.
