Posted to dev@mahout.apache.org by "Paritosh Ranjan (Updated) (JIRA)" <ji...@apache.org> on 2011/12/18 13:20:30 UTC
[jira] [Updated] (MAHOUT-930) Refactor Vector Classification out of Clustering - Make Classification abstract

     [ https://issues.apache.org/jira/browse/MAHOUT-930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paritosh Ranjan updated MAHOUT-930:
-----------------------------------

    Description: 
Right now, each clustering algorithm has its own runClustering (-cp) implementation, which produces clusteredPoints. The current design lacks:
1) Extensibility - there is no place to plug in new features, such as outlier removal, during classification.
2) Uniformity in design - new algorithms have no pattern to follow.
3) Abstraction - clusterData should only be concerned with classifying vectors, i.e. assigning vectors to clusters. It should not care about how the classification is done. That should be the work of a separate entity, which can offer features like outlier removal.

The new implementation factors the classification step out into an independent entity that performs it independently of the various clustering implementations. The new design would start with ClusterClassifier, ClusteringPolicy and ClusterIterator, experimental versions of which are already committed. The currently committed version seems to work for all the iterative clustering algorithms.

The ClusterClassifier provides the probability of a vector belonging to each of the available clusters. These probabilities are converted into weights by ClusteringPolicy implementations, one for each clustering algorithm. This is the place where an outlier removal implementation can be plugged in. In the future, different ClusteringPolicy implementations can be provided (configured) for different types of classification.
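
To illustrate the probabilities-to-weights step, here is a minimal sketch in plain Java. It is not the committed Mahout code: the names PolicySketch, WeightPolicy, MaxLikelihoodPolicy and OutlierFilteringPolicy, and the use of double[] instead of Mahout vectors, are assumptions made only for this example. It shows one policy that makes a hard assignment, and a wrapper that shows where outlier removal could be plugged in.

```java
import java.util.Arrays;

// Hypothetical sketch: these names are illustrative, not Mahout's actual API.
class PolicySketch {

    /** Converts per-cluster probabilities into per-cluster weights. */
    interface WeightPolicy {
        double[] select(double[] probabilities);
    }

    /** Hard assignment: weight 1.0 for the most probable cluster, 0.0 elsewhere. */
    static final class MaxLikelihoodPolicy implements WeightPolicy {
        @Override
        public double[] select(double[] probabilities) {
            double[] weights = new double[probabilities.length];
            if (probabilities.length == 0) {
                return weights;
            }
            int best = 0;
            for (int i = 1; i < probabilities.length; i++) {
                if (probabilities[i] > probabilities[best]) {
                    best = i;
                }
            }
            weights[best] = 1.0;
            return weights;
        }
    }

    /**
     * Plug-in point for outlier removal: if no cluster reaches the threshold,
     * the vector gets all-zero weights; otherwise the wrapped policy decides.
     */
    static final class OutlierFilteringPolicy implements WeightPolicy {
        private final WeightPolicy delegate;
        private final double threshold;

        OutlierFilteringPolicy(WeightPolicy delegate, double threshold) {
            this.delegate = delegate;
            this.threshold = threshold;
        }

        @Override
        public double[] select(double[] probabilities) {
            for (double p : probabilities) {
                if (p >= threshold) {
                    return delegate.select(probabilities);
                }
            }
            return new double[probabilities.length]; // treated as an outlier
        }
    }

    public static void main(String[] args) {
        WeightPolicy hard = new MaxLikelihoodPolicy();
        WeightPolicy filtered = new OutlierFilteringPolicy(hard, 0.1);
        System.out.println(Arrays.toString(hard.select(new double[] {0.2, 0.7, 0.1})));
        System.out.println(Arrays.toString(filtered.select(new double[] {0.05, 0.04, 0.03})));
    }
}
```

Because the filter wraps another policy rather than replacing it, the same outlier rule could be combined with whichever policy a given clustering algorithm needs.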

The ClusterClassifier can also train the existing classifiers (clusters) on the input. This is the place where clustering and classification will converge.

For now, the execution is done by a ClusterIterator, which runs a clustering policy on the input and tries to classify the vectors into clusters. It can train the classifiers at the same time, since it runs for a given number of iterations and each iteration should improve the quality of the classifiers.
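
The classify-and-train loop described above can be sketched roughly as follows. This is only an illustration, not the committed ClusterIterator: the class IterationSketch, its methods, the one-dimensional points and the k-means style nearest-center rule are all assumptions chosen to keep the example short.

```java
import java.util.Arrays;

// Hypothetical sketch: a k-means style loop over 1-D points, not Mahout's ClusterIterator.
class IterationSketch {

    /** Classifies a point by returning the index of the nearest center. */
    static int classify(double point, double[] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (Math.abs(point - centers[c]) < Math.abs(point - centers[best])) {
                best = c;
            }
        }
        return best;
    }

    /** Runs a fixed number of iterations; each one classifies all points, then retrains the centers. */
    static double[] iterate(double[] points, double[] initialCenters, int numIterations) {
        double[] centers = initialCenters.clone();
        for (int iter = 0; iter < numIterations; iter++) {
            double[] sum = new double[centers.length];
            int[] count = new int[centers.length];
            for (double p : points) {
                int c = classify(p, centers); // classification step
                sum[c] += p;                  // training observation for cluster c
                count[c]++;
            }
            for (int c = 0; c < centers.length; c++) {
                if (count[c] > 0) {
                    centers[c] = sum[c] / count[c]; // update the classifier for cluster c
                }
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[] points = {1, 2, 3, 10, 11, 12};
        double[] centers = iterate(points, new double[] {0, 5}, 5);
        System.out.println(Arrays.toString(centers));
    }
}
```

In this sketch the same pass both assigns points and gathers the statistics used to update the clusters, which is the sense in which classification and training happen together.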



  was:Factor out & implement an independent post processor to perform the classification step independently of the various clustering implementations.

    
> Refactor Vector Classification out of Clustering - Make Classification abstract
> -------------------------------------------------------------------------------
>
>                 Key: MAHOUT-930
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-930
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.6
>            Reporter: Paritosh Ranjan
>             Fix For: 0.7
>
>
