Posted to user@spark.apache.org by Marco Platania <ma...@yahoo.it.INVALID> on 2016/05/27 20:16:35 UTC

Spark Streaming - Is window() caching DStreams?

Dear all,
Can someone please explain to me how Spark Streaming executes the window() operation? From the Spark 1.6.1 documentation, it seems that windowed batches are automatically cached in memory, but the web UI suggests that operations already executed in previous batches are executed again. For your convenience, I attach a screenshot of my running application below.
In the web UI, the flatMapValues() RDDs appear to be cached (green dot; this is the last operation executed before I call window() on the DStream), but at the same time all the transformations that led to flatMapValues() in previous batches appear to be executed again. If this is the case, the window() operation may incur a huge performance penalty, especially if the window duration is 1 or 2 hours (as I expect for my application). Do you think that checkpointing the DStream at that point would help? Note that the expected slide interval is about 5 minutes.
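To make the concern concrete, here is a rough back-of-the-envelope sketch in plain Python (no Spark calls). The 10-second batch interval is an assumed figure for illustration only; the window and slide values are the ones mentioned above. It only counts how many batch RDDs a window spans and how many of them were already processed in the previous window, which is the work that would be repeated if nothing were reused.

```python
# Rough estimate only. BATCH_S is an assumption for illustration;
# WINDOW_S and SLIDE_S are the values mentioned in this message.

def batches_per_window(window_s: int, batch_s: int) -> int:
    """Number of batch RDDs that a single window spans."""
    # Spark Streaming requires the window duration to be a multiple
    # of the batch interval, so reject anything else.
    if window_s % batch_s != 0:
        raise ValueError("window duration must be a multiple of the batch interval")
    return window_s // batch_s


def overlap_with_previous_window(window_s: int, slide_s: int, batch_s: int) -> int:
    """Batches in the current window that were also in the previous window."""
    if slide_s % batch_s != 0:
        raise ValueError("slide interval must be a multiple of the batch interval")
    new_batches = slide_s // batch_s
    # If the slide is longer than the window, no batch is shared.
    return max(batches_per_window(window_s, batch_s) - new_batches, 0)


BATCH_S = 10             # assumed batch interval (not stated in this message)
SLIDE_S = 5 * 60         # slide interval of about 5 minutes
WINDOW_S = 2 * 60 * 60   # window duration of 2 hours

total = batches_per_window(WINDOW_S, BATCH_S)
shared = overlap_with_previous_window(WINDOW_S, SLIDE_S, BATCH_S)

print(f"batches per window: {total}")
print(f"already seen in the previous window: {shared}")
print(f"new batches per slide: {total - shared}")
```

Under these assumed numbers, most of each window overlaps with the previous one, so whether the upstream transformations are reused or recomputed would dominate the cost.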
Hope someone can clarify this point.
Thanks,
Marco