You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@mahout.apache.org by "Matt Molek (JIRA)" <ji...@apache.org> on 2012/10/22 16:18:11 UTC

[jira] [Created] (MAHOUT-1103) clusterpp is not writing directories for all clusters

Matt Molek created MAHOUT-1103:
----------------------------------

             Summary: clusterpp is not writing directories for all clusters
                 Key: MAHOUT-1103
                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
             Project: Mahout
          Issue Type: Bug
          Components: Clustering
    Affects Versions: 0.8
            Reporter: Matt Molek


After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.

I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2

Even with k=2 only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory.

Here is my command sequence for the k=2 run:
bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl

bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt

bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom


The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.

Discussion on the user mailing list suggested that this might be caused by the default hadoop hash partitioner. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAHOUT-1103) clusterpp is not writing directories for all clusters

Posted by "Matt Molek (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Molek updated MAHOUT-1103:
-------------------------------

    Description: 
After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.

I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2

Even with k=2 only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory.

Here is my command sequence for the k=2 run:
{noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl

bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt

bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 

The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.

Discussion on the user mailing list suggested that this might be caused by the default hadoop hash partitioner. The hashes of these two clusters aren't identical, but they are close. Putting both cluster names into a Text and caling hashCode() gives:
VL-3742464 -> -685560454
VL-3742466 -> -685560452

Finally, when running with "-xm sequential", everything performs as expected.

  was:
After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.

I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2

Even with k=2 only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory.

Here is my command sequence for the k=2 run:
{noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl

bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt

bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 

The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.

Discussion on the user mailing list suggested that this might be caused by the default hadoop hash partitioner. The hashes of these two clusters aren't identical, but they are close. Putting both cluster names into a Text and caling hashCode() gives:
VL-3742464 -> -685560454
VL-3742466 -> -685560452

    
> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
>                 Key: MAHOUT-1103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Matt Molek
>              Labels: clusterpp
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by the default hadoop hash partitioner. The hashes of these two clusters aren't identical, but they are close. Putting both cluster names into a Text and caling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Comment Edited] (MAHOUT-1103) clusterpp is not writing directories for all clusters

Posted by "Paritosh Ranjan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489762#comment-13489762 ] 

Paritosh Ranjan edited comment on MAHOUT-1103 at 11/2/12 9:23 PM:
------------------------------------------------------------------

I tried to test cluster post processing with Junit but the problem is while running through Junits ( standalone mode ), the numpartitions is always 1. So, the partitioner does not behave in the way it should. 

I will try running it on a cluster to test the results. 

I think the current approach to post process is highly dependent on cluster ids + partitioner which is somehow not correct. Any change in clustering approach will have an impact on cluster post processing. I will also think of another ways to solve it.
                
      was (Author: paritoshranjan):
    I tried test cluster post processing with Junit but the problem is while running through Junits ( standalone mode ), the numpartitions is always 1. So, the partitioner does not behave in the way it should. 

I will try running it on a cluster to test the results. 

I think the current approach to post process is highly dependent on cluster ids + partitioner which is somehow not correct. Any change in clustering approach will have an impact on cluster post processing. I will also think of another ways to solve it.
                  
> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
>                 Key: MAHOUT-1103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Matt Molek
>            Assignee: Paritosh Ranjan
>              Labels: clusterpp
>         Attachments: MAHOUT-1103.patch
>
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by the default hadoop hash partitioner. The hashes of these two clusters aren't identical, but they are close. Putting both cluster names into a Text and caling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-1103) clusterpp is not writing directories for all clusters

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481552#comment-13481552 ] 

Dmitriy Lyubimov commented on MAHOUT-1103:
------------------------------------------

btw Tanimoto distance suggests he actually wants LSA topics, not pure PCA (euclidean). Further on, usually output U is taken for this purpose. 

I suggest the author to prototype with R on a document subset first to make sure his pipeline yields result he expects.
                
> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
>                 Key: MAHOUT-1103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Matt Molek
>            Assignee: Paritosh Ranjan
>              Labels: clusterpp
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by the default hadoop hash partitioner. The hashes of these two clusters aren't identical, but they are close. Putting both cluster names into a Text and caling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-1103) clusterpp is not writing directories for all clusters

Posted by "Paritosh Ranjan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481680#comment-13481680 ] 

Paritosh Ranjan commented on MAHOUT-1103:
-----------------------------------------

Dmitriy, I was talking about the issue where VectorIds were being lost. That was not a ssvd problem, but the way ssvd + kmeans was used. I faintly remember that, so I just wanted to make sure that the same problem is not there.
                
> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
>                 Key: MAHOUT-1103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Matt Molek
>            Assignee: Paritosh Ranjan
>              Labels: clusterpp
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by the default hadoop hash partitioner. The hashes of these two clusters aren't identical, but they are close. Putting both cluster names into a Text and caling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-1103) clusterpp is not writing directories for all clusters

Posted by "Paritosh Ranjan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481544#comment-13481544 ] 

Paritosh Ranjan commented on MAHOUT-1103:
-----------------------------------------

Instead of changing the nomenclature of clusters, we should create a partitioner which solves the problem ( if the problem is in partitioner ). 

By changing the nomenclature of the clusters, we will be fixing the bug just for this particular case. In case, someone creates as many as 3742464 clusters ( assume ), or more, then again the same problem will rise. So, we should fix it at a generic level.

Looking forward for your tests without SVD. As I said earlier, also try to use a better Partitioner. The partitioner can be changed in postProcessMR method of ClusterOutputPostProcessorDriver class. 

You can read this page https://cwiki.apache.org/MAHOUT/how-to-contribute.html to see how to get code and generate patches.
                
> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
>                 Key: MAHOUT-1103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Matt Molek
>            Assignee: Paritosh Ranjan
>              Labels: clusterpp
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by the default hadoop hash partitioner. The hashes of these two clusters aren't identical, but they are close. Putting both cluster names into a Text and caling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Comment Edited] (MAHOUT-1103) clusterpp is not writing directories for all clusters

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481546#comment-13481546 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1103 at 10/22/12 5:52 PM:
--------------------------------------------------------------------

bq. Since its not working for even two clusters, I don't see any problem due to the Partitioner. The input here looks like the output of SSVD. There has been problems reported earlier also, where SSVD output was creating problems in clustering.


What issues? Can you be more specific? 

The only discussion i am aware of was with Pat and he was having problem using embedded style and with the fact that there were no --USigma output available at that time. 

I am not aware of any issues in the HEAD. he should be able to use --pca true option and if it retains enough variance (i.e. fairly rapid spectrum decay in the first 100-ish values) he should be fine at least with euclidean coordinates clustering. 

if he looks at cosine similarities for topical clustering (aka LSA) he doesn't need --pca option.

Either way it is not a problem of SSVD but a problem of the approach.

                
      was (Author: dlyubimov):
    bq. Since its not working for even two clusters, I don't see any problem due to the Partitioner. The input here looks like the output of SSVD. There has been problems reported earlier also, where SSVD output was creating problems in clustering.


What issues? Can you be more specific? 

The only discussion i am aware of was with Pat and he was having problem using embedded style and with the fact that there were no --USigma output available at that time. 

I am not aware of any issues in the HEAD. he should be able to use --pca true option and if it retains enough variance (i.e. fairly rapid spectrum decay in the first 100-ish values) he should be fine at least with euclidean coordinates clustering. 

if he looks at cosine similarities for topical clustering (aka LSA) he doesn't need --pca option.

                  
> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
>                 Key: MAHOUT-1103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Matt Molek
>            Assignee: Paritosh Ranjan
>              Labels: clusterpp
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by the default hadoop hash partitioner. The hashes of these two clusters aren't identical, but they are close. Putting both cluster names into a Text and caling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Comment Edited] (MAHOUT-1103) clusterpp is not writing directories for all clusters

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481552#comment-13481552 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1103 at 10/22/12 5:57 PM:
--------------------------------------------------------------------

btw Tanimoto distance suggests he actually wants LSA topics, not pure PCA (euclidean). Further on, usually output U is taken for this purpose. 

I would suggest the author to prototype with R on a document subset first to make sure his pipeline yields result he expects. Unfortunately we don't have an easy way to convert DRM to R data sets (yet), hopefully some work will be done at some point to bridge that gap.
                
      was (Author: dlyubimov):
    btw Tanimoto distance suggests he actually wants LSA topics, not pure PCA (euclidean). Further on, usually output U is taken for this purpose. 

I suggest the author to prototype with R on a document subset first to make sure his pipeline yields result he expects.
                  
> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
>                 Key: MAHOUT-1103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Matt Molek
>            Assignee: Paritosh Ranjan
>              Labels: clusterpp
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by the default hadoop hash partitioner. The hashes of these two clusters aren't identical, but they are close. Putting both cluster names into a Text and caling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAHOUT-1103) clusterpp is not writing directories for all clusters

Posted by "Matt Molek (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Molek updated MAHOUT-1103:
-------------------------------

    Description: 
After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.

I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2

Even with k=2 only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory.

Here is my command sequence for the k=2 run:
{{bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl}}

{{bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt}}

{{bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom}}


The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.

Discussion on the user mailing list suggested that this might be caused by the default hadoop hash partitioner. 

  was:
After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.

I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2

Even with k=2 only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory.

Here is my command sequence for the k=2 run:
bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl

bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt

bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom


The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.

Discussion on the user mailing list suggested that this might be caused by the default hadoop hash partitioner. 

    
> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
>                 Key: MAHOUT-1103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Matt Molek
>              Labels: clusterpp
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory.
> Here is my command sequence for the k=2 run:
> {{bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl}}
> {{bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt}}
> {{bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom}}
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by the default hadoop hash partitioner. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAHOUT-1103) clusterpp is not writing directories for all clusters

Posted by "Matt Molek (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Molek updated MAHOUT-1103:
-------------------------------

    Description: 
After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.

I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2

Even with k=2 only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory.

Here is my command sequence for the k=2 run:
{noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl

bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt

bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 

The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.

Discussion on the user mailing list suggested that this might be caused by the default hadoop hash partitioner. The hashes of these two clusters aren't identical, but they are close. Putting both cluster names into a Text and caling hashCode() gives:
VL-3742464 -> -685560454
VL-3742466 -> -685560452

  was:
After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.

I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2

Even with k=2 only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory.

Here is my command sequence for the k=2 run:
{noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl

bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt

bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 

The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.

Discussion on the user mailing list suggested that this might be caused by the default hadoop hash partitioner. 

    
> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
>                 Key: MAHOUT-1103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Matt Molek
>              Labels: clusterpp
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by the default hadoop hash partitioner. The hashes of these two clusters aren't identical, but they are close. Putting both cluster names into a Text and caling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Comment Edited] (MAHOUT-1103) clusterpp is not writing directories for all clusters

Posted by "Matt Molek (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481433#comment-13481433 ] 

Matt Molek edited comment on MAHOUT-1103 at 10/22/12 3:23 PM:
--------------------------------------------------------------

Yes, I am clustering on ssvd output. I will try again with the vectors directly from seq2sparse and update once I'm done.

I was just reading up on the way the HashPartitioner works though, and I do think it is part of the issue. HashPartitioner uses the following logic to determine what partition a key belongs to: int partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;

That yields a partition of 0 for both VL-3742464 and VL-3742466. If however, they were named VL-0 and VL-1, they would be properly split up by the HashPartitioner. I think if clusters were always named VL-i where 0<=i<k, then there would not be an issue. Dealing with this weird naming scheme (which I don't know the origin of since I'm not familiar with the inner workings of kmeans) seems to be the issue.
                
      was (Author: mmolek):
    Yes, I am clustering on ssvd output. I will try again with the vectors directly from seq2sparse and update once I'm done.

I was just reading up on the way the HashPartitioner works though, and I do think it is part of the issue. HashPartitioner uses the following logic to determine what partition a key belongs to: int partition = (key.hashCode() & Integer.MAX_VALUE) % 2;

That yields a partition of 0 for both VL-3742464 and VL-3742466. If however, they were named VL-0 and VL-1, they would be properly split up by the HashPartitioner. I think if clusters were always named VL-i where 0<=i<k, then there would not be an issue. Dealing with this weird naming scheme (which I don't know the origin of since I'm not familiar with the inner workings of kmeans) seems to be the issue.
                  
> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
>                 Key: MAHOUT-1103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Matt Molek
>            Assignee: Paritosh Ranjan
>              Labels: clusterpp
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by the default hadoop hash partitioner. The hashes of these two clusters aren't identical, but they are close. Putting both cluster names into a Text and caling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAHOUT-1103) clusterpp is not writing directories for all clusters

Posted by "Paritosh Ranjan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paritosh Ranjan updated MAHOUT-1103:
------------------------------------

    Attachment: MAHOUT-1103.patch

Dmitriy - yes, I think it was the same error. 

Matt - I have created a partitioner and applied it at ClusterOutputPostProcessorDriver assuming the valid cluster Ids are the latest and sequential i.e. ids will be VL-8543 to VL 8563 if 20 unique clusters are there. The attached test case demonstrates that it will work for this scenario.

If you want, you can try this patch on trunk, and check whether it works or not. I am not sure about it, as I still need to figure out the nomenclature of relevant cluster Ids.  
                
> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
>                 Key: MAHOUT-1103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Matt Molek
>            Assignee: Paritosh Ranjan
>              Labels: clusterpp
>         Attachments: MAHOUT-1103.patch
>
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by the default hadoop hash partitioner. The hashes of these two clusters aren't identical, but they are close. Putting both cluster names into a Text and caling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAHOUT-1103) clusterpp is not writing directories for all clusters

Posted by "Matt Molek (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Molek updated MAHOUT-1103:
-------------------------------

    Description: 
After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.

I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2

Even with k=2 only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory.

Here is my command sequence for the k=2 run:
{noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl

bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt

bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 

The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.

Discussion on the user mailing list suggested that this might be caused by the default hadoop hash partitioner. 

  was:
After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.

I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2

Even with k=2 only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory.

Here is my command sequence for the k=2 run:
{{bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl}}

{{bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt}}

{{bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom}}


The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.

Discussion on the user mailing list suggested that this might be caused by the default hadoop hash partitioner. 

    
> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
>                 Key: MAHOUT-1103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Matt Molek
>              Labels: clusterpp
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by the default hadoop hash partitioner. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-1103) clusterpp is not writing directories for all clusters

Posted by "Matt Molek (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481433#comment-13481433 ] 

Matt Molek commented on MAHOUT-1103:
------------------------------------

Yes, I am clustering on ssvd output. I will try again with the vectors directly from seq2sparse and update once I'm done.

I was just reading up on the way the HashPartitioner works though, and I do think it is part of the issue. HashPartitioner uses the following logic to determine what partition a key belongs to: int partition = (key.hashCode() & Integer.MAX_VALUE) % 2;

That yields a partition of 0 for both VL-3742464 and VL-3742466. If however, they were named VL-0 and VL-1, they would be properly split up by the HashPartitioner. I think if clusters were always named VL-i where 0<=i<k, then there would not be an issue. Dealing with this weird naming scheme (which I don't know the origin of since I'm not familiar with the inner workings of kmeans) seems to be the issue.
                
> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
>                 Key: MAHOUT-1103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Matt Molek
>            Assignee: Paritosh Ranjan
>              Labels: clusterpp
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by the default hadoop hash partitioner. The hashes of these two clusters aren't identical, but they are close. Putting both cluster names into a Text and caling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-1103) clusterpp is not writing directories for all clusters

Posted by "Paritosh Ranjan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489762#comment-13489762 ] 

Paritosh Ranjan commented on MAHOUT-1103:
-----------------------------------------

I tried test cluster post processing with Junit but the problem is while running through Junits ( standalone mode ), the numpartitions is always 1. So, the partitioner does not behave in the way it should. 

I will try running it on a cluster to test the results. 

I think the current approach to post process is highly dependent on cluster ids + partitioner which is somehow not correct. Any change in clustering approach will have an impact on cluster post processing. I will also think of another ways to solve it.
                
> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
>                 Key: MAHOUT-1103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Matt Molek
>            Assignee: Paritosh Ranjan
>              Labels: clusterpp
>         Attachments: MAHOUT-1103.patch
>
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by the default hadoop hash partitioner. The hashes of these two clusters aren't identical, but they are close. Putting both cluster names into a Text and caling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Assigned] (MAHOUT-1103) clusterpp is not writing directories for all clusters

Posted by "Paritosh Ranjan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paritosh Ranjan reassigned MAHOUT-1103:
---------------------------------------

    Assignee: Paritosh Ranjan
    
> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
>                 Key: MAHOUT-1103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Matt Molek
>            Assignee: Paritosh Ranjan
>              Labels: clusterpp
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by the default hadoop hash partitioner. The hashes of these two clusters aren't identical, but they are close. Putting both cluster names into a Text and caling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-1103) clusterpp is not writing directories for all clusters

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481546#comment-13481546 ] 

Dmitriy Lyubimov commented on MAHOUT-1103:
------------------------------------------

bq. Since its not working for even two clusters, I don't see any problem due to the Partitioner. The input here looks like the output of SSVD. There has been problems reported earlier also, where SSVD output was creating problems in clustering.


What issues? Can you be more specific? 

The only discussion i am aware of was with Pat and he was having problem using embedded style and with the fact that there were no --USigma output available at that time. 

I am not aware of any issues in the HEAD. he should be able to use --pca true option and if it retains enough variance (i.e. fairly rapid spectrum decay in the first 100-ish values) he should be fine at least with euclidean coordinates clustering. 

if he looks at cosine similarities for topical clustering (aka LSA) he doesn't need --pca option.

                
> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
>                 Key: MAHOUT-1103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Matt Molek
>            Assignee: Paritosh Ranjan
>              Labels: clusterpp
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by the default hadoop hash partitioner. The hashes of these two clusters aren't identical, but they are close. Putting both cluster names into a Text and caling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-1103) clusterpp is not writing directories for all clusters

Posted by "Matt Molek (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481550#comment-13481550 ] 

Matt Molek commented on MAHOUT-1103:
------------------------------------

{quote}he should be able to use --pca true option and if it retains enough variance (i.e. fairly rapid spectrum decay in the first 100-ish values) he should be fine at least with euclidean coordinates clustering.

if he looks at cosine similarities for topical clustering (aka LSA) he doesn't need --pca option.{quote}

Hi Dmitriy, sorry for going a little off topic here, but could you elaborate on this? I've been experimenting with using either cosine or tanimoto distance on the USigma output of ssvd with -pca true. Are those not appropriate distance measures for the -pca output?
                
> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
>                 Key: MAHOUT-1103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Matt Molek
>            Assignee: Paritosh Ranjan
>              Labels: clusterpp
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by the default hadoop hash partitioner. The hashes of these two clusters aren't identical, but they are close. Putting both cluster names into a Text and caling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-1103) clusterpp is not writing directories for all clusters

Posted by "Paritosh Ranjan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481408#comment-13481408 ] 

Paritosh Ranjan commented on MAHOUT-1103:
-----------------------------------------

"I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2"

Since its not working for even two clusters, I don't see any problem due to the Partitioner. The input here looks like the output of SSVD. There has been problems reported earlier also, where SSVD output was creating problems in clustering.

Can your try kmeans + clusterpp without performing SSVD on the vectors? I suspect this to be the problem for now, but we will have to test it. 

The sequential and mapreduce versions are completely differernt implementations, so, its normal to have a bug in one version which is not present in the second version.

Please update once you test it.
                
> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
>                 Key: MAHOUT-1103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Matt Molek
>              Labels: clusterpp
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by the default hadoop hash partitioner. The hashes of these two clusters aren't identical, but they are close. Putting both cluster names into a Text and caling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-1103) clusterpp is not writing directories for all clusters

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481557#comment-13481557 ] 

Dmitriy Lyubimov commented on MAHOUT-1103:
------------------------------------------

bq. Since its not working for even two clusters, I don't see any problem due to the Partitioner. The input here looks like the output of SSVD. There has been problems reported earlier also, where SSVD output was creating problems in clustering.

I'll reply on user list, that way somebody will surely try to correct me since i have been doing straightforward LSA only.
                
> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
>                 Key: MAHOUT-1103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Matt Molek
>            Assignee: Paritosh Ranjan
>              Labels: clusterpp
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by the default hadoop hash partitioner. The hashes of these two clusters aren't identical, but they are close. Putting both cluster names into a Text and caling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-1103) clusterpp is not writing directories for all clusters

Posted by "Matt Molek (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481626#comment-13481626 ] 

Matt Molek commented on MAHOUT-1103:
------------------------------------

I've finished a test with k=25 on input directly from seq2sparse, and the problem has persisted. I'm pretty sure it's because of the way the HashPartitioner is splitting up the "VL-*" Text keys, and doesn't have anything to do with the input vectors (whether they came from seq2sparse or ssvd).

I can't start working on a patch right now, but I will check back when I have some free time and see what I can do if there hasn't been any progress.
                
> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
>                 Key: MAHOUT-1103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Matt Molek
>            Assignee: Paritosh Ranjan
>              Labels: clusterpp
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by the default hadoop hash partitioner. The hashes of these two clusters aren't identical, but they are close. Putting both cluster names into a Text and caling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-1103) clusterpp is not writing directories for all clusters

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481702#comment-13481702 ] 

Dmitriy Lyubimov commented on MAHOUT-1103:
------------------------------------------

You mean, propagation of names in NamedVector to products of U? yes this has been addressed at the same time as --USigma, current trunk should be good.
                
> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
>                 Key: MAHOUT-1103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Matt Molek
>            Assignee: Paritosh Ranjan
>              Labels: clusterpp
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by the default hadoop hash partitioner. The hashes of these two clusters aren't identical, but they are close. Putting both cluster names into a Text and caling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Comment Edited] (MAHOUT-1103) clusterpp is not writing directories for all clusters

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481557#comment-13481557 ] 

Dmitriy Lyubimov edited comment on MAHOUT-1103 at 10/22/12 6:00 PM:
--------------------------------------------------------------------

bq. Hi Dmitriy, sorry for going a little off topic here, but could you elaborate on this? I've been experimenting with using either cosine or tanimoto distance on the USigma output of ssvd with -pca true. Are those not appropriate distance measures for the -pca output?

I'll reply on user list, that way somebody will surely try to correct me since i have been doing straightforward LSA only.
                
      was (Author: dlyubimov):
    bq. Since its not working for even two clusters, I don't see any problem due to the Partitioner. The input here looks like the output of SSVD. There has been problems reported earlier also, where SSVD output was creating problems in clustering.

I'll reply on user list, that way somebody will surely try to correct me since i have been doing straightforward LSA only.
                  
> clusterpp is not writing directories for all clusters
> -----------------------------------------------------
>
>                 Key: MAHOUT-1103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1103
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.8
>            Reporter: Matt Molek
>            Assignee: Paritosh Ranjan
>              Labels: clusterpp
>
> After running kmeans clustering on a set of ~3M points, clusterpp fails to populate directories for some clusters, no matter what k is.
> I've tested this on my data with k = 300, 250, 150, 100, 50, 25, 10, 5, 2
> Even with k=2 only one cluster directory was created. For each reducer that fails to produce directories there is an empty part-r-* file in the output directory.
> Here is my command sequence for the k=2 run:
> {noformat}bin/mahout kmeans -i ssvd2/USigma -c 2clusters/init-clusters -o 2clusters/pca-clusters -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05 -k 2 -x 15 -cl
> bin/mahout clusterdump -i 2clusters/pca-clusters/clusters-*-final -o 2clusters.txt
> bin/mahout clusterpp -i 2clusters/pca-clusters -o 2clusters/bottom{noformat} 
> The output of clusterdump shows two clusters: VL-3742464 and VL-3742466 containing 2585843 and 1156624 points respectively.
> Discussion on the user mailing list suggested that this might be caused by the default hadoop hash partitioner. The hashes of these two clusters aren't identical, but they are close. Putting both cluster names into a Text and caling hashCode() gives:
> VL-3742464 -> -685560454
> VL-3742466 -> -685560452
> Finally, when running with "-xm sequential", everything performs as expected.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira