Posted to issues@spark.apache.org by "Iaroslav Zeigerman (JIRA)" <ji...@apache.org> on 2015/07/21 17:21:05 UTC

[jira] [Commented] (SPARK-9220) Streaming K-means implementation exception while processing windowed stream

    [ https://issues.apache.org/jira/browse/SPARK-9220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635240#comment-14635240 ] 

Iaroslav Zeigerman commented on SPARK-9220:
-------------------------------------------

It looks like the issue reproduces only when the training and test data streams are linked to the same directory. Can someone confirm whether this causes the issue?
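
For reference, here is a minimal sketch of the setup being described. The directory path, master, batch/window durations, and model parameters are assumptions for illustration; the original job's configuration is not shown in this report.

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingKMeansRepro").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))

// Both streams watch the same directory ("/data/points" is a hypothetical path);
// this is the condition under which the exception seems to reproduce.
val trainingSet = ssc.textFileStream("/data/points")
  .map(Vectors.parse)
  .window(Seconds(30))
val testSet = ssc.textFileStream("/data/points")
  .map(Vectors.parse)

val model = new StreamingKMeans()
  .setK(3)                  // assumed k; the report does not state one
  .setDecayFactor(1.0)
  .setRandomCenters(2, 0.0) // assumed 2-dimensional input, zero initial weight

model.trainOn(trainingSet)
model.predictOn(testSet).print()

ssc.start()
ssc.awaitTermination()
{code}

With this wiring, any 30-second window covering an interval with no new files in the directory yields an empty training batch, which matches the "no new data" condition mentioned in the issue description below.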

> Streaming K-means implementation exception while processing windowed stream
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-9220
>                 URL: https://issues.apache.org/jira/browse/SPARK-9220
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib, Streaming
>    Affects Versions: 1.4.1
>            Reporter: Iaroslav Zeigerman
>
> Spark throws an exception when the Streaming K-means algorithm trains on a windowed stream. The stream looks like the following:
> {{val trainingSet = ssc.textFileStream(TrainingDataSet).window(Seconds(30))...}}
> The exception occurs when there is no new data in the stream. Here is the exception:
> 15/07/21 17:36:08 ERROR JobScheduler: Error running job streaming job 1437489368000 ms.0
> java.lang.ArrayIndexOutOfBoundsException: 13
> 	at org.apache.spark.mllib.clustering.StreamingKMeansModel$$anonfun$update$1.apply(StreamingKMeans.scala:105)
> 	at org.apache.spark.mllib.clustering.StreamingKMeansModel$$anonfun$update$1.apply(StreamingKMeans.scala:102)
> 	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> 	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> 	at org.apache.spark.mllib.clustering.StreamingKMeansModel.update(StreamingKMeans.scala:102)
> 	at org.apache.spark.mllib.clustering.StreamingKMeans$$anonfun$trainOn$1.apply(StreamingKMeans.scala:235)
> 	at org.apache.spark.mllib.clustering.StreamingKMeans$$anonfun$trainOn$1.apply(StreamingKMeans.scala:234)
> 	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42)
> 	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
> 	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
> 	at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)
> 	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:40)
> 	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> 	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> 	at scala.util.Try$.apply(Try.scala:161)
> 	at org.apache.spark.streaming.scheduler.Job.run(Job.scala:34)
> 	at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:193)
> 	at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193)
> 	at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193)
> 	at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
> 	at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:192)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> When new data arrives, the algorithm works as expected.
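
One way to check the correlation with empty windows described above is to log the size of each windowed batch alongside the training job. This diagnostic is a sketch, not part of the original report; {{trainingSet}} refers to the windowed training stream shown above, and {{count()}} launches an extra job per window, so it is for debugging only.

{code:scala}
// Diagnostic sketch (assumption, not from the report): record the element count
// of each window so exception timestamps can be matched against empty windows.
trainingSet.foreachRDD { (rdd, time) =>
  println(s"Window at $time contains ${rdd.count()} training vectors")
}
{code}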


