Posted to user@spark.apache.org by Michael Armbrust <mi...@databricks.com> on 2017/03/16 22:45:52 UTC

Re: Streaming 2.1.0 - window vs. batch duration

Have you considered trying event time aggregation in structured streaming
instead?
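
For reference, a minimal sketch of what an event-time window aggregation
could look like in Structured Streaming (Scala, using the
spark-sql-kafka-0-10 source); the broker address, topic name and the use of
the Kafka record timestamp as event time are placeholder assumptions, not
taken from your setup:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, window}

val spark = SparkSession.builder.appName("EventTimeWindows").getOrCreate()

// Read the topic as a streaming DataFrame (names are placeholders).
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "input")
  .load()
  .select(col("timestamp"), col("value"))

// Count records per non-overlapping 60 second event-time window. The
// watermark declares how late events may arrive relative to event time.
val counts = events
  .withWatermark("timestamp", "1 minute")
  .groupBy(window(col("timestamp"), "60 seconds"))
  .count()

// Each record is counted in exactly one window regardless of how the
// micro-batches are scheduled. The console sink is used here for inspection.
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()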

On Thu, Mar 16, 2017 at 12:34 PM, Dominik Safaric <do...@gmail.com>
wrote:

> Hi all,
>
> I have implemented a streaming application that pulls data from Kafka
> every second (the batch interval), and I am observing some quite strange
> behaviour (I haven’t used Spark extensively in the past, but rather
> continuous operator based engines).
>
> Namely, the dstream.window(Seconds(60)) windowed stream, when written back
> to Kafka, contains more messages than were consumed (for debugging
> purposes I am using a small dataset of one million Kafka messages
> deserialized as byte arrays). In particular, I streamed exactly 1 million
> messages in total, yet upon window expiry 60 million messages are written
> back to Kafka.
>
> I’ve read in the official docs that both the window duration and the slide
> duration must be multiples of the batch interval. Does this mean that the
> RDD of a given batch interval *t* is ingested into the windowed stream 59
> more times, i.e. once for every overlapping window it falls into?
>
> If what I would like is a batch interval of one second and a window
> duration of 60 seconds, with each consumed message appearing in the output
> only once, how might one achieve this?
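> For example, would explicitly setting the slide duration equal to the
> window duration, along the lines of
>
>   stream.window(Seconds(60), Seconds(60))
>
> be the correct way to obtain non-overlapping 60 second windows, so that
> each consumed message is written out exactly once?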
>
> I would appreciate it if anyone could correct me in case I have got the
> internals of Spark’s windowed operations wrong, and elaborate a bit.
>
> Thanks,
> Dominik
>