Posted to user@spark.apache.org by Hao Ren <in...@gmail.com> on 2016/11/22 13:48:25 UTC
[Spark Streaming] map and window operation on DStream only process one batch
Spark Streaming v 1.6.2
Kafka v0.10.1
I am reading messages from Kafka.
What surprises me is that the following DStream processes only the first batch.
KafkaUtils.createDirectStream[
    String,
    String,
    StringDecoder,
    StringDecoder](streamingContext, kafkaParams, Set(topic))
  .map(_._2)
  .window(Seconds(windowLengthInSec))
Log lines like the following repeat endlessly:
16/11/22 14:20:40 INFO MappedDStream: Slicing from 1479820835000 ms to
1479820840000 ms (aligned to 1479820835000 ms and 1479820840000 ms)
16/11/22 14:20:40 INFO JobScheduler: Added jobs for time 1479820840000 ms
The only action on the DStream is an RDD count:

windowedStream foreachRDD { rdd => rdd.count }
In the web UI, only the first batch is in status Processing; all the others
are Queued.
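For context, here is a minimal self-contained driver around the snippet above, as a sketch only: the app name, broker address, topic name, and the 5-second batch interval (matching the 5000 ms gaps between the timestamps in the log) are my assumptions, not from the original program.

    // Hypothetical minimal repro driver (Spark 1.6 + spark-streaming-kafka).
    // Broker, topic, and intervals below are placeholder assumptions.
    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object WindowRepro {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("window-repro").setMaster("local[2]")
        // Batch interval of 5 s, inferred from the 5000 ms spacing in the logs.
        val streamingContext = new StreamingContext(conf, Seconds(5))
        val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
        val topic = "test-topic"
        val windowLengthInSec = 30

        val windowedStream = KafkaUtils.createDirectStream[
            String, String, StringDecoder, StringDecoder](
            streamingContext, kafkaParams, Set(topic))
          .map(_._2)                           // keep only the message value
          .window(Seconds(windowLengthInSec))  // this ordering stalls after batch 1

        windowedStream foreachRDD { rdd => rdd.count }

        streamingContext.start()
        streamingContext.awaitTermination()
      }
    }

Running this against a live broker reproduces the behaviour described: the first batch processes and the rest queue up.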
However, if I swap the map and window operations, everything works:
KafkaUtils.createDirectStream[
    String,
    String,
    StringDecoder,
    StringDecoder](streamingContext, kafkaParams, Set(topic))
  .window(Seconds(windowLengthInSec))
  .map(_._2)
I would expect the two to be equivalent, but they are not.
Furthermore, if I replace the Kafka DStream with a QueueStream, both
orderings of map and window work.
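The QueueStream control experiment looks roughly like this; again a sketch, with the object name, key/value pairing, and intervals chosen by me to mirror the Kafka version.

    // Control experiment: same map-then-window ordering over a QueueStream.
    // Names and intervals are illustrative, not from the original program.
    import scala.collection.mutable
    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object QueueStreamControl {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("queue-control").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(5))

        // (String, String) pairs stand in for the Kafka (key, value) records.
        val queue = mutable.Queue[RDD[(String, String)]]()
        val windowedStream = ssc.queueStream(queue)
          .map(_._2)            // same ordering that stalls with Kafka...
          .window(Seconds(30))  // ...but here every batch is processed

        windowedStream foreachRDD { rdd => rdd.count }

        queue += ssc.sparkContext.makeRDD(Seq(("key", "value")))
        ssc.start()
        ssc.awaitTermination()
      }
    }

With this input source the batches keep flowing regardless of whether map comes before or after window.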
I am not sure whether this is specific to the Kafka direct stream or a
general DStream issue.
Any help is appreciated.
--
Hao Ren
Data Engineer @ leboncoin
Paris, France