You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by "Harry Lang (JIRA)" <ji...@apache.org> on 2014/04/27 04:12:15 UTC
[jira] [Commented] (MAHOUT-1469) Streaming KMeans fails when executed in MapReduce mode and REDUCE_STREAMING_KMEANS is set to true

    [ https://issues.apache.org/jira/browse/MAHOUT-1469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982162#comment-13982162 ] 

Harry Lang commented on MAHOUT-1469:
------------------------------------

Maxim Arap and I wrote documentation for StreamingKMeans earlier this week and Suneel Marthi pointed out this thread.

The high-probability approximation guarantee of the StreamingKMeans algorithm is proven only for Euclidean distance, but in general it will work for any metric with an isometry to Euclidean space (such as a scaling).

If the L1 metric is needed, an applicable algorithm would be Ke Chen (see "On Coresets for k-Median...", 2009).

> Streaming KMeans fails when executed in MapReduce mode and REDUCE_STREAMING_KMEANS is set to true
> -------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1469
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1469
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.9
>            Reporter: Suneel Marthi
>            Assignee: Suneel Marthi
>             Fix For: 1.0
>
>
> Centroids are not being generated when executed in MR mode with -rskm flag set. 
> {Code}
> 14/03/20 02:42:12 INFO mapreduce.StreamingKMeansThread: Estimated Points: 282
> 14/03/20 02:42:12 INFO mapred.JobClient:  map 100% reduce 0%
> 14/03/20 02:42:14 INFO mapreduce.StreamingKMeansReducer: Number of Centroids: 0
> 14/03/20 02:42:14 WARN mapred.LocalJobRunner: job_local1374896815_0001
> java.lang.IllegalArgumentException: Must have nonzero number of training and test vectors. Asked for %.1f %% of %d vectors for test [10.000000149011612, 0]
> 	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:148)
> 	at org.apache.mahout.clustering.streaming.cluster.BallKMeans.splitTrainTest(BallKMeans.java:176)
> 	at org.apache.mahout.clustering.streaming.cluster.BallKMeans.cluster(BallKMeans.java:192)
> 	at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.getBestCentroids(StreamingKMeansReducer.java:107)
> 	at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:73)
> 	at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:37)
> 	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
> 	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
> 	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
> 14/03/20 02:42:14 INFO mapred.JobClient: Job complete: job_local1374896815_0001
> 14/03/20 02:42:14 INFO mapred.JobClient: Counters: 16
> 14/03/20 02:42:14 INFO mapred.JobClient:   File Input Format Counters 
> 14/03/20 02:42:14 INFO mapred.JobClient:     Bytes Read=17156391
> 14/03/20 02:42:14 INFO mapred.JobClient:   FileSystemCounters
> 14/03/20 02:42:14 INFO mapred.JobClient:     FILE_BYTES_READ=41925624
> 14/03/20 02:42:14 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=25974741
> 14/03/20 02:42:14 INFO mapred.JobClient:   Map-Reduce Framework
> 14/03/20 02:42:14 INFO mapred.JobClient:     Map output materialized bytes=956293
> 14/03/20 02:42:14 INFO mapred.JobClient:     Map input records=21578
> 14/03/20 02:42:14 INFO mapred.JobClient:     Reduce shuffle bytes=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     Spilled Records=282
> 14/03/20 02:42:14 INFO mapred.JobClient:     Map output bytes=1788012
> 14/03/20 02:42:14 INFO mapred.JobClient:     Total committed heap usage (bytes)=217214976
> 14/03/20 02:42:14 INFO mapred.JobClient:     Combine input records=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     SPLIT_RAW_BYTES=163
> 14/03/20 02:42:14 INFO mapred.JobClient:     Reduce input records=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     Reduce input groups=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     Combine output records=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     Reduce output records=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     Map output records=282
> 14/03/20 02:42:14 INFO driver.MahoutDriver: Program took 506269 ms (Minutes: 8.437816666666667)
> {Code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)