Posted to dev@mahout.apache.org by "Matt Molek (JIRA)" <ji...@apache.org> on 2012/10/22 17:24:12 UTC

[jira] [Comment Edited] (MAHOUT-1103) clusterpp is not writing directories for all clusters

    [ https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481433#comment-13481433 ] 

Matt Molek edited comment on MAHOUT-1103 at 10/22/12 3:23 PM:
--------------------------------------------------------------

Yes, I am clustering on ssvd output. I will try again with the vectors directly from seq2sparse and update once I'm done.

I was just reading up on the way the HashPartitioner works though, and I do think it is part of the issue. HashPartitioner uses the following logic to determine what partition a key belongs to: int partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;

That yields a partition of 0 for both VL-3742464 and VL-3742466. If, however, they were named VL-0 and VL-1, they would be properly split up by the HashPartitioner. I think if clusters were always named VL-i where 0<=i<k, there would be no issue. The real problem seems to be dealing with this naming scheme (whose origin I don't know, since I'm not familiar with the inner workings of kmeans).
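As a sanity check outside Hadoop, the logic above can be sketched in a standalone class. This reimplements the two pieces involved: Text.hashCode(), which hashes the UTF-8 bytes with seed 1 and multiplier 31 (the WritableComparator.hashBytes algorithm), and the HashPartitioner mask-and-modulo. The class and method names here are illustrative, not actual Mahout/Hadoop code:

```java
import java.nio.charset.StandardCharsets;

public class PartitionDemo {
    // Same algorithm as Hadoop's WritableComparator.hashBytes: seed 1, multiplier 31 per byte.
    static int textHashCode(String s) {
        int hash = 1;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hash = (31 * hash) + (int) b;
        }
        return hash;
    }

    // Same logic as HashPartitioner: clear the sign bit, then mod by the reducer count.
    static int partition(String key, int numReduceTasks) {
        return (textHashCode(key) & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        System.out.println(textHashCode("VL-3742464")); // -685560454
        System.out.println(textHashCode("VL-3742466")); // -685560452
        System.out.println(partition("VL-3742464", 2)); // 0
        System.out.println(partition("VL-3742466", 2)); // 0 -- collision: both go to reducer 0
        System.out.println(partition("VL-0", 2));       // 0
        System.out.println(partition("VL-1", 2));       // 1 -- sequential names split cleanly
    }
}
```

With the k=2 names from this run, both keys land on reducer 0 and the other reducer emits an empty part-r-* file, matching the reported behavior.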
                
      was (Author: mmolek):
    Yes, I am clustering on ssvd output. I will try again with the vectors directly from seq2sparse and update once I'm done.

I was just reading up on the way the HashPartitioner works though, and I do think it is part of the issue. HashPartitioner uses the following logic to determine what partition a key belongs to: int partition = (key.hashCode() & Integer.MAX_VALUE) % 2;

That yields a partition of 0 for both VL-3742464 and VL-3742466. If however, they were named VL-0 and VL-1, they would be properly split up by the HashPartitioner. I think if clusters were always named VL-i where 0<=i<k, then there would not be an issue. Dealing with this weird naming scheme (which I don't know the origin of since I'm not familiar with the inner workings of kmeans) seems to be the issue.
                  
> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
>                 Key: MAHOUT-1103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Matt Molek
>            Assignee: Paritosh Ranjan
>              Labels: clusterpp
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by the default Hadoop hash partitioner. The hashes of these two clusters aren't identical, but they are close. Putting both cluster names into a Text and calling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.
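Given the two hashCode values quoted in the description, the collision can be checked with the partitioner's arithmetic alone; the & Integer.MAX_VALUE mask clears the sign bit before the modulo (a minimal sketch, assuming the 2 reducers of the k=2 run):

```java
public class HashCheck {
    public static void main(String[] args) {
        // HashPartitioner: mask the sign bit of the (negative) hashCode, then mod numReduceTasks.
        System.out.println((-685560454 & Integer.MAX_VALUE) % 2); // 0
        System.out.println((-685560452 & Integer.MAX_VALUE) % 2); // 0 -- same partition, so one reducer gets both clusters
    }
}
```

Because the two hashCodes differ by 2, any even reducer count sends both keys to the same partition.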

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira