Posted to user@spark.apache.org by Tathagata Das <ta...@gmail.com> on 2017/06/26 21:11:21 UTC

Re: Spark Streaming reduceByKeyAndWindow with inverse function seems to iterate over all the keys in the window even though they are not present in the current batch

Unfortunately, the way reduceByKeyAndWindow is implemented, it does iterate
through all the counts in the window, not just those for keys in the current
batch. For something more efficient, you may have to implement your own
windowing logic using mapWithState. Something like:

eventDStream.flatMap { event =>
   // find the windows each event maps to, and return tuples of
   // (window, valueToReduce)
}.mapWithState {
   // for each window, reduce the new value into the partial reduce in state
}
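
Fleshed out, a sketch could look like this (the Event case class and the
windowsFor helper are placeholders for whatever your events and windows
actually look like; note that mapWithState takes a StateSpec, not a bare
closure, and needs checkpointing enabled on the streaming context):

import org.apache.spark.streaming.{Minutes, State, StateSpec}

// Placeholder event type: a key, a value to reduce, and an event time in ms.
case class Event(key: String, value: Long, time: Long)

// Placeholder helper: start times of the 60-minute windows (sliding every
// minute) that a given event time falls into.
def windowsFor(time: Long): Seq[Long] =
  (0L until 60L).map(i => (time / 60000L - i) * 60000L)

// Fold each new value into the running partial reduce kept in state for
// its (windowStart, key) pair, and emit the updated total.
def updateFunc(windowAndKey: (Long, String),
               value: Option[Long],
               state: State[Long]): ((Long, String), Long) = {
  val total = state.getOption.getOrElse(0L) + value.getOrElse(0L)
  state.update(total)
  (windowAndKey, total)
}

val partialCounts = eventDStream
  .flatMap { event =>
    // one ((windowStart, key), value) tuple per window the event falls in
    windowsFor(event.time).map(w => ((w, event.key), event.value))
  }
  // the timeout lets Spark drop state for windows that stop updating
  .mapWithState(StateSpec.function(updateFunc _).timeout(Minutes(90)))

This way each batch only touches the keys it actually contains, instead of
re-reducing every key in the window.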


That said, if you are going to implement complex functions like this, I
highly recommend that you instead start using Structured Streaming, which is
far more optimized than DStreams, and gives AT LEAST 10x the throughput. It
already has the performance benefits that you are looking for.
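
For the hourly rolling count, for example, the Structured Streaming version
is just a windowed aggregation (a sketch, assuming a streaming DataFrame
named events with eventTime and key columns):

import org.apache.spark.sql.functions.{col, window}

// One-hour window sliding every minute; the watermark bounds how late
// data may arrive before Spark drops the state for old windows.
val hourlyCounts = events
  .withWatermark("eventTime", "2 hours")
  .groupBy(window(col("eventTime"), "1 hour", "1 minute"), col("key"))
  .count()

The aggregation state is updated incrementally, so each trigger only touches
keys that appear in the new data rather than re-reducing the whole window.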

On Mon, Jun 26, 2017 at 12:53 PM, SRK <sw...@gmail.com> wrote:

> Hi,
>
> We use the reduceByKeyAndWindow-with-inverse-function feature in our
> Streaming job to calculate rolling counts for the past hour and for the
> past 24 hours. The functionality seems to iterate over all the keys in
> the window even when they are not present in the current batch, causing
> processing times to be high. My batch size is 1 minute. Is there a way
> for reduceByKeyAndWindow to iterate only over the keys present in the
> current batch instead of reducing over all the keys in the window?
> Typically the updates would happen only for the keys present in the
> current batch.
>
> Thanks!