Posted to dev@mahout.apache.org by "Gaurav Redkar (Commented) (JIRA)" <ji...@apache.org> on 2012/02/06 13:55:59 UTC

[jira] [Commented] (MAHOUT-966) Mismatch in the number of points given by the clusterDumper and ClusterOutputPostProcessor

    [ https://issues.apache.org/jira/browse/MAHOUT-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13201252#comment-13201252 ] 

Gaurav Redkar commented on MAHOUT-966:
--------------------------------------

Hello,

As Paritosh suggested, I tried specifying the -cl option while clustering, but I am still experiencing the same problem. The number of members printed by the clusterdumper code matches the number of points generated by the ClusterOutputPostProcessor for each cluster. However, this number does not match the value 'n' reported for that cluster by clusterdumper.

Also, while running the algorithm on a different dataset, the clustering resulted in two clusters with the same cluster identifier, and that cluster contained some of the points twice. Any idea why this is happening?
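
For anyone trying to reproduce these checks, here is a minimal sketch in Python. It assumes the clusterdumper header lines look like the `MSV-287{ n=90 c=[...` lines quoted in the issue description, and that the members of each cluster have already been collected by some other means. The `members` mapping and the function names are hypothetical helpers for illustration, not Mahout APIs.

```python
import re
from collections import Counter

# Matches header lines such as "MSV-287{ n=90 c=[0.05195, ...".
# The format is assumed from the issue description; other dump layouts are not handled.
HEADER = re.compile(r"^\s*(?P<id>[A-Za-z]+-\d+)\{\s*n=(?P<n>\d+)\b")


def parse_headers(lines):
    """Return (cluster_id, n) pairs in the order they appear; other lines are skipped."""
    headers = []
    for line in lines:
        m = HEADER.match(line)
        if m:
            headers.append((m.group("id"), int(m.group("n"))))
    return headers


def check(headers, members):
    """Compare the reported n with the actual member lists.

    members: hypothetical dict mapping cluster id -> list of point names.
    Returns a list of human-readable problem descriptions (empty if none).
    """
    problems = []

    # The same cluster identifier reported more than once.
    id_counts = Counter(cid for cid, _ in headers)
    for cid, count in id_counts.items():
        if count > 1:
            problems.append(f"cluster id {cid} appears {count} times")

    for cid, n in headers:
        points = members.get(cid, [])
        # Reported n differs from the number of listed members.
        if len(points) != n:
            problems.append(f"{cid}: n={n} but {len(points)} members listed")
        # The same point listed more than once within a cluster.
        dupes = sorted(p for p, c in Counter(points).items() if c > 1)
        if dupes:
            problems.append(f"{cid}: duplicate points {dupes}")
    return problems
```

Calling `check(parse_headers(dump_lines), members)` returns an empty list when every reported n agrees with its member list and no identifier or point is repeated.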

The command used to run the clustering job is:

bin/mahout org.apache.mahout.clustering.syntheticcontrol.meanshift.Job  -x 15  -cd 5 -t1 100  -t2 30 -cl  -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -i testdata -ow -o output

I am attaching the dataset on which I tried the clustering. Kindly give your suggestions on it.

                
> Mismatch in the number of points given by the clusterDumper and ClusterOutputPostProcessor
> ------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-966
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-966
>             Project: Mahout
>          Issue Type: Bug
>          Components: Integration
>    Affects Versions: 0.6
>         Environment: hadoop 0.20.2 mahout 0.6 
>            Reporter: Gaurav Redkar
>            Priority: Minor
>         Attachments: points100dCCNorm.txt
>
>
>  After running the post processor, the number of points that each cluster contains does not match the number of points each cluster should contain as stated by clusterdumper.
>  
> MSV-287{ n=90 c=[0.05195, 0.05675, 0.07151, 0.05713, 0.06946,...}
> MSV-145{ n=90 c=[0.93685, 0.93071, 0.93641, 0.94629, 0.94409,..}
> The n mentioned in clusters-n-final against each cluster is different from the number of points actually contained in the directory for each cluster. Any idea why this is happening?
