Posted to dev@mahout.apache.org by "Harry Lang (JIRA)" <ji...@apache.org> on 2014/04/27 04:12:15 UTC
[jira] [Commented] (MAHOUT-1469) Streaming KMeans fails when
executed in MapReduce mode and REDUCE_STREAMING_KMEANS is set to true
[ https://issues.apache.org/jira/browse/MAHOUT-1469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982162#comment-13982162 ]
Harry Lang commented on MAHOUT-1469:
------------------------------------
Maxim Arap and I wrote documentation for StreamingKMeans earlier this week, and Suneel Marthi pointed us to this thread.
The high-probability approximation guarantee of the StreamingKMeans algorithm is proven only for Euclidean distance, but it carries over to any metric that is isometric to Euclidean space (for example, a scaled Euclidean distance).
If the L1 metric is needed, an applicable algorithm would be the one by Ke Chen (see "On Coresets for k-Median...", 2009).
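To make the scaling remark concrete, here is a minimal sketch in plain Java (not Mahout's DistanceMeasure API; the class and method names are made up for illustration). Multiplying Euclidean distance by a positive constant c gives a metric that the map x -> c*x carries isometrically onto ordinary Euclidean space, so every distance is rescaled by the same factor and the nearest centroid of a point does not change.

```java
final class ScaledEuclideanDemo {
    // Plain Euclidean distance between two vectors of equal length.
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Index of the centroid closest to the point under the metric
    // scale * euclidean(., .); scale is assumed to be positive.
    static int nearest(double scale, double[] point, double[][] centroids) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centroids.length; i++) {
            double distance = scale * euclidean(point, centroids[i]);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centroids = {{0, 0}, {5, 5}, {10, 0}};
        double[] point = {4, 3};
        // Both calls pick the same centroid, because scaling preserves
        // the ordering of distances.
        System.out.println(nearest(1.0, point, centroids));
        System.out.println(nearest(3.5, point, centroids));
    }
}
```

This only illustrates why a scaled metric behaves like the Euclidean one for assignment purposes; it says nothing about metrics such as L1, which are not isometric to Euclidean space.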
> Streaming KMeans fails when executed in MapReduce mode and REDUCE_STREAMING_KMEANS is set to true
> -------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-1469
> URL: https://issues.apache.org/jira/browse/MAHOUT-1469
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.9
> Reporter: Suneel Marthi
> Assignee: Suneel Marthi
> Fix For: 1.0
>
>
> Centroids are not being generated when executed in MR mode with the -rskm flag set.
> {Code}
> 14/03/20 02:42:12 INFO mapreduce.StreamingKMeansThread: Estimated Points: 282
> 14/03/20 02:42:12 INFO mapred.JobClient: map 100% reduce 0%
> 14/03/20 02:42:14 INFO mapreduce.StreamingKMeansReducer: Number of Centroids: 0
> 14/03/20 02:42:14 WARN mapred.LocalJobRunner: job_local1374896815_0001
> java.lang.IllegalArgumentException: Must have nonzero number of training and test vectors. Asked for %.1f %% of %d vectors for test [10.000000149011612, 0]
> at com.google.common.base.Preconditions.checkArgument(Preconditions.java:148)
> at org.apache.mahout.clustering.streaming.cluster.BallKMeans.splitTrainTest(BallKMeans.java:176)
> at org.apache.mahout.clustering.streaming.cluster.BallKMeans.cluster(BallKMeans.java:192)
> at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.getBestCentroids(StreamingKMeansReducer.java:107)
> at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:73)
> at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:37)
> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
> at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
> 14/03/20 02:42:14 INFO mapred.JobClient: Job complete: job_local1374896815_0001
> 14/03/20 02:42:14 INFO mapred.JobClient: Counters: 16
> 14/03/20 02:42:14 INFO mapred.JobClient: File Input Format Counters
> 14/03/20 02:42:14 INFO mapred.JobClient: Bytes Read=17156391
> 14/03/20 02:42:14 INFO mapred.JobClient: FileSystemCounters
> 14/03/20 02:42:14 INFO mapred.JobClient: FILE_BYTES_READ=41925624
> 14/03/20 02:42:14 INFO mapred.JobClient: FILE_BYTES_WRITTEN=25974741
> 14/03/20 02:42:14 INFO mapred.JobClient: Map-Reduce Framework
> 14/03/20 02:42:14 INFO mapred.JobClient: Map output materialized bytes=956293
> 14/03/20 02:42:14 INFO mapred.JobClient: Map input records=21578
> 14/03/20 02:42:14 INFO mapred.JobClient: Reduce shuffle bytes=0
> 14/03/20 02:42:14 INFO mapred.JobClient: Spilled Records=282
> 14/03/20 02:42:14 INFO mapred.JobClient: Map output bytes=1788012
> 14/03/20 02:42:14 INFO mapred.JobClient: Total committed heap usage (bytes)=217214976
> 14/03/20 02:42:14 INFO mapred.JobClient: Combine input records=0
> 14/03/20 02:42:14 INFO mapred.JobClient: SPLIT_RAW_BYTES=163
> 14/03/20 02:42:14 INFO mapred.JobClient: Reduce input records=0
> 14/03/20 02:42:14 INFO mapred.JobClient: Reduce input groups=0
> 14/03/20 02:42:14 INFO mapred.JobClient: Combine output records=0
> 14/03/20 02:42:14 INFO mapred.JobClient: Reduce output records=0
> 14/03/20 02:42:14 INFO mapred.JobClient: Map output records=282
> 14/03/20 02:42:14 INFO driver.MahoutDriver: Program took 506269 ms (Minutes: 8.437816666666667)
> {Code}
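The log above shows the mapper emitting 282 records while the reducer reports zero centroids and zero reduce input records, after which the train/test split rejects the empty input. The sketch below is a hypothetical stand-in for that kind of size check, written in plain Java. It is not Mahout's BallKMeans code; the class name, method name, and the assumption that the test-set size is computed as fraction times vector count are all illustrative.

```java
final class TrainTestSplitCheck {
    // Hypothetical size check for a train/test split: both parts must be
    // non-empty. Assumes the test-set size is (int) (fraction * count).
    static int testSetSize(int numVectors, double testFraction) {
        int numTest = (int) (testFraction * numVectors);
        if (numTest <= 0 || numTest >= numVectors) {
            throw new IllegalArgumentException(String.format(
                "Need a nonzero number of training and test vectors; "
                    + "asked for %.1f%% of %d vectors for test",
                testFraction * 100, numVectors));
        }
        return numTest;
    }

    public static void main(String[] args) {
        // 282 intermediate centroids, the count the mapper side logged.
        System.out.println(testSetSize(282, 0.1));
        try {
            // Zero centroids, the count the reducer logged.
            testSetSize(0, 0.1);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With any positive test fraction, an empty input always fails such a check, so the exception in the trace is a symptom of the reducer receiving no centroids rather than a problem in the split itself. As a side note, the raw %.1f and %d in the logged message, with the values appended in brackets, are what Guava's Preconditions.checkArgument produces, since its message templates substitute only %s.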