Posted to dev@mahout.apache.org by "Matt Molek (JIRA)" <ji...@apache.org> on 2012/10/22 16:20:11 UTC

[jira] [Updated] (MAHOUT-1103) clusterpp is not writing directories for all clusters

     [ https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Molek updated MAHOUT-1103:
-------------------------------

    Description: 
After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.

I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2

Even with k=2, only one cluster directory was created. For each reducer that fails to produce directories, there is an empty part-r-* file in the output directory.

Here is my command sequence for the k=2 run:
{{bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl}}

{{bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt}}

{{bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom}}


The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.

Discussion on the user mailing list suggested that this might be caused by the default Hadoop hash partitioner.

  was:
After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.

I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2

Even with k=2, only one cluster directory was created. For each reducer that fails to produce directories, there is an empty part-r-* file in the output directory.

Here is my command sequence for the k=2 run:
bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl

bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt

bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom


The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.

Discussion on the user mailing list suggested that this might be caused by the default Hadoop hash partitioner.

    
> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
>                 Key: MAHOUT-1103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Matt Molek
>              Labels: clusterpp
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2, only one cluster directory was created. For each reducer that fails to produce directories, there is an empty part-r-* file in the output directory.
> Here is my command sequence for the k=2 run:
> {{bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl}}
> {{bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt}}
> {{bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom}}
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by the default Hadoop hash partitioner.
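
To illustrate the hash-partitioner hypothesis mentioned above, here is a minimal sketch of how two distinct cluster keys could land on the same reducer, leaving the other reducer with an empty part-r-* file. It reproduces only the arithmetic of Hadoop's default HashPartitioner (hash with the sign bit cleared, modulo the number of reduce tasks). Everything else is an assumption made for illustration: the class name HashPartitionSketch is hypothetical, the report does not state which key type clusterpp emits, and the sketch assumes an integer key that hashes to its own value (as IntWritable does) and one reducer per cluster. It is not the clusterpp implementation and does not confirm the root cause.

```java
// Illustrative sketch only: this is not Mahout or Hadoop source code.
final class HashPartitionSketch {

    // Same arithmetic as Hadoop's default HashPartitioner.getPartition:
    // clear the sign bit of the key's hash, then take it modulo the reducer count.
    static int partitionFor(int keyHash, int numReduceTasks) {
        return (keyHash & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Cluster IDs taken from the clusterdump output quoted in the report.
        int[] clusterIds = {3742464, 3742466};

        // ASSUMPTION: one reducer per cluster (k = 2).
        int numReduceTasks = 2;

        for (int id : clusterIds) {
            // ASSUMPTION: the key hashes to its integer value, as IntWritable does.
            int hash = Integer.hashCode(id);
            System.out.println("cluster " + id + " -> reducer "
                    + partitionFor(hash, numReduceTasks));
        }
    }
}
```

Under these assumptions both IDs are even, so both map to reducer 0 and reducer 1 receives no records, which would be consistent with the single populated directory and the empty part file described for the k=2 run.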
