Posted to dev@mahout.apache.org by "Paritosh Ranjan (Issue Comment Edited) (JIRA)" <ji...@apache.org> on 2011/11/03 18:09:32 UTC

[jira] [Issue Comment Edited] (MAHOUT-843) Top Down Clustering

    [ https://issues.apache.org/jira/browse/MAHOUT-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13143296#comment-13143296 ] 

Paritosh Ranjan edited comment on MAHOUT-843 at 11/3/11 5:08 PM:
-----------------------------------------------------------------

This patch implements TopDownClustering. The entry point is @TopDownClusteringDriver.

Top level clustering can be done by any implementation of @TopLevelClusterConfig, and bottom level clustering by any implementation of @BottomLevelClusterConfig. Both are marker interfaces.

The idea is to use different implementations of @ClusterConfig to specify the parameters of different clustering algorithms. These @ClusterConfig implementations are passed as parameters that specify the top level and the bottom level clustering configuration.

The top level clustering output is post-processed by @TopLevelClusterOutputPostProcessor, which groups the vectors of similar clusters together. Each of these clusters is then processed further by bottom level clustering.
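
As a rough illustration of the post-processing step, here is a minimal sketch that groups vectors by the top level cluster they were assigned to. It is not taken from the patch: ClusteredVector, PostProcessSketch and groupByCluster are hypothetical names, plain double[] arrays stand in for Mahout vectors, and it assumes the grouping key is the top level clusterId.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not the patch's code: all names here are assumptions.
final class PostProcessSketch {
    // Groups vectors by the id of the top level cluster they were assigned to.
    // Each group would then be the input of one bottom level clustering run.
    static Map<Integer, List<double[]>> groupByCluster(List<ClusteredVector> points) {
        Map<Integer, List<double[]>> groups = new TreeMap<>();
        for (ClusteredVector p : points) {
            groups.computeIfAbsent(p.clusterId, id -> new ArrayList<>()).add(p.vector);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<ClusteredVector> points = new ArrayList<>();
        points.add(new ClusteredVector(7, new double[] {1.0, 1.1}));
        points.add(new ClusteredVector(3, new double[] {9.0, 8.5}));
        points.add(new ClusteredVector(7, new double[] {1.2, 0.9}));
        for (Map.Entry<Integer, List<double[]>> e : groupByCluster(points).entrySet()) {
            System.out.println("clusterId " + e.getKey() + ": " + e.getValue().size() + " vectors");
        }
    }
}

// Hypothetical stand-in for one clustered point: the top level cluster id
// plus the vector itself (a plain array instead of a Mahout vector).
final class ClusteredVector {
    final int clusterId;
    final double[] vector;

    ClusteredVector(int clusterId, double[] vector) {
        this.clusterId = clusterId;
        this.vector = vector;
    }
}
```

In a real job the grouping would happen over files on the cluster rather than in memory; the sketch only shows the shape of the step.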

Each implementation of @ClusterConfig has a specific implementation of @ClusterExecutor associated with it, which uses the cluster config parameters to execute the corresponding algorithm.
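
To make the config/executor pairing concrete, here is a minimal sketch of how the pieces might fit together. The type names come from the comment above, but every method, field and signature is an assumption made for illustration (including the marker interfaces extending @ClusterConfig, and CanopyClusterExecutor, which is a hypothetical name); the patch itself may be organized differently. The t1/t2 fields stand for Canopy's two distance thresholds.

```java
// Hypothetical sketch: type names mirror the comment, all signatures are assumed.
final class ConfigSketch {
    public static void main(String[] args) {
        // The same config type can serve at either level because it carries both markers.
        CanopyClusterConfig top = new CanopyClusterConfig(3.0, 1.5);
        ClusterExecutor<CanopyClusterConfig> executor = new CanopyClusterExecutor();
        System.out.println(executor.execute(top, "input", "output/topLevelCluster"));
    }
}

// Base type carrying the parameters of one clustering algorithm.
interface ClusterConfig {
    String algorithmName();
}

// Marker interfaces: no methods of their own, they only tag where a config may be used.
interface TopLevelClusterConfig extends ClusterConfig {}

interface BottomLevelClusterConfig extends ClusterConfig {}

// Canopy parameters, usable as both top and bottom level config.
final class CanopyClusterConfig implements TopLevelClusterConfig, BottomLevelClusterConfig {
    final double t1;
    final double t2;

    CanopyClusterConfig(double t1, double t2) {
        this.t1 = t1;
        this.t2 = t2;
    }

    @Override
    public String algorithmName() {
        return "canopy";
    }
}

// One executor per config type; it reads the config and runs the algorithm.
interface ClusterExecutor<C extends ClusterConfig> {
    String execute(C config, String inputPath, String outputPath);
}

final class CanopyClusterExecutor implements ClusterExecutor<CanopyClusterConfig> {
    @Override
    public String execute(CanopyClusterConfig config, String inputPath, String outputPath) {
        // A real executor would launch the Canopy job here; this sketch only describes the run.
        return config.algorithmName() + "(t1=" + config.t1 + ", t2=" + config.t2 + "): "
            + inputPath + " -> " + outputPath;
    }
}
```

Binding the executor to its config type through a type parameter is one way to keep an executor from being handed the wrong config at compile time; whether the patch does this is not stated here.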

The output of top level clustering is kept in <output path>/topLevelCluster and the output of bottom level clustering is kept in <output path>/bottomLevelCluster.

The post-processed output of top level clustering is kept in <output path>/topLevelCluster/topLevelClusterPostProcessed/clusterId.

Both top and bottom level clustering use the clusterId as the name of the clusters produced.
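
The directory layout described above can be summarized in a small helper. This is only a sketch: the class and method names are hypothetical, and it uses plain strings with "/" separators, whereas real code would more likely build Hadoop paths. Only the three directory names are taken from this comment.

```java
// Hypothetical helper; only the directory names come from the comment above.
final class TopDownOutputLayout {
    static final String TOP_LEVEL_DIR = "topLevelCluster";
    static final String BOTTOM_LEVEL_DIR = "bottomLevelCluster";
    static final String POST_PROCESSED_DIR = "topLevelClusterPostProcessed";

    // <output path>/topLevelCluster
    static String topLevel(String outputPath) {
        return outputPath + "/" + TOP_LEVEL_DIR;
    }

    // <output path>/bottomLevelCluster
    static String bottomLevel(String outputPath) {
        return outputPath + "/" + BOTTOM_LEVEL_DIR;
    }

    // <output path>/topLevelCluster/topLevelClusterPostProcessed/<clusterId>
    static String postProcessed(String outputPath, String clusterId) {
        return topLevel(outputPath) + "/" + POST_PROCESSED_DIR + "/" + clusterId;
    }

    public static void main(String[] args) {
        System.out.println(topLevel("output"));
        System.out.println(postProcessed("output", "42"));
        System.out.println(bottomLevel("output"));
    }
}
```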

I have added javadocs wherever it felt necessary, so they should also help guide you through the code. I have tested using @CanopyClusterConfig as both top and bottom level cluster config, and it works. The other configs should work out of the box.
                
      was (Author: paritoshranjan):
    This patch implements TopDownClustering. The entry point is @TopDownClusteringDriver.

Top level clustering can be done by any implementation of @TopLevelClusterConfig, and bottom level clustering by any implementation of @BottomLevelClusterConfig. Both are marker interfaces.

The idea is to use different implementations of @ClusterConfig to specify the parameters of different clustering algorithms. These @ClusterConfig implementations are passed as parameters that specify the top level and the bottom level clustering configuration.

The top level clustering output is post-processed by @TopLevelClusterOutputPostProcessor, which groups the vectors of similar clusters together. Each of these clusters is then processed further by bottom level clustering.

Each implementation of @ClusterConfig has a specific implementation of @ClusterExecutor associated with it, which uses the cluster config parameters to execute the corresponding algorithm.

The output of top level clustering is kept in <output path>/topLevelCluster and the output of bottom level clustering is kept in <output path>/bottomLevelCluster.

The post-processed output of top level clustering is kept in <output path>/topLevelCluster/topLevelClusterPostProcessed/clusterId.

Both top and bottom level clustering use the clusterId as the name of the clusters produced.

I have added javadocs wherever it felt necessary, so they should also help guide you through the code. I have done clustering using Canopy as both top and bottom level cluster config, and the other configs should work out of the box.
                  
> Top Down Clustering
> -------------------
>
>                 Key: MAHOUT-843
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-843
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>    Affects Versions: 0.6
>            Reporter: Paritosh Ranjan
>              Labels: clustering, patch
>             Fix For: 0.6
>
>         Attachments: MAHOUT-843-patch, Top-Down-Clustering-patch
>
>
> Top Down Clustering works in multiple steps. The first step is to find comparatively bigger clusters. The second step is to cluster those bigger chunks into meaningful clusters. This can improve performance when clustering large amounts of data. It also removes the dependency on providing input clusters/numbers to the clustering algorithm.
> "Big" is a relative term, and so is the smaller "meaningful". So the size of the "bigger" and the "smaller/meaningful" clusters will be controlled by the user.
> The user can also select which clustering algorithm to use at the top level and which to use at the bottom level. Initially, this can be done for only one or a few clustering algorithms; later, an option can be provided to use any of the algorithms that suit the case.
