You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Hao Ren <in...@gmail.com> on 2016/11/22 13:48:25 UTC

[Spark Streaming] map and window operation on DStream only process one batch

Spark Streaming v 1.6.2
Kafka v0.10.1

I am reading msgs from Kafka.
What surprised me is the following DStream only process the first batch.

KafkaUtils.createDirectStream[
  String,
  String,
  StringDecoder,
  StringDecoder](streamingContext, kafkaParams, Set(topic))
  .map(_._2)
  .window(Seconds(windowLengthInSec))

Some logs as below are endlessly repeated:

16/11/22 14:20:40 INFO MappedDStream: Slicing from 1479820835000 ms to
1479820840000 ms (aligned to 1479820835000 ms and 1479820840000 ms)
16/11/22 14:20:40 INFO JobScheduler: Added jobs for time 1479820840000 ms

And the action on the DStream is just a rdd count

windowedStream foreachRDD { rdd => rdd.count }

From the webUI, only the first batch is in status: Processing, the others
are all Queued.
However, if I permute map and window operation, everything is ok.

KafkaUtils.createDirectStream[
  String,
  String,
  StringDecoder,
  StringDecoder](streamingContext, kafkaParams, Set(topic))
  .window(Seconds(windowLengthInSec))
  .map(_._2)

I think the two are equivalent. But they are not.

Furthermore, if I replace my KafkaDStream with a QueueStream, it works for
no matter which order of map and window operation.

I am not sure whether this is related with KafkaDStream or just DStream.

Any help is appreciated.

-- 
Hao Ren

Data Engineer @ leboncoin

Paris, France