Posted to user@spark.apache.org by aaronjosephs <aa...@placeiq.com> on 2014/07/18 21:09:38 UTC

Spark Streaming with long batch / window duration

Would it be a reasonable use case for Spark Streaming to have a very large
window size (let's say on the scale of weeks)? In this particular case the
reduce function would be invertible, which would aid efficiency. I
assume that using a larger batch size, since the window is so large, would
also lighten the workload for Spark. The slide duration is not too
important; I just want to know whether this is reasonable for Spark to handle
with any slide duration.
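
For readers unfamiliar with the term: an invertible reduce lets a windowed aggregate be updated by adding the batch that enters the window and subtracting the batch that leaves it, instead of re-reducing the whole window. The sketch below illustrates the idea in plain Python rather than Spark. The function name `sliding_totals` and its parameters are hypothetical, and addition and subtraction stand in for the reduce function and its inverse.

```python
from collections import deque
import operator


def sliding_totals(batches, window_len, reduce_fn=operator.add, inv_fn=operator.sub):
    """Yield the windowed aggregate after each batch, updating it incrementally.

    batches: iterable of already-reduced per-batch values (hypothetical input).
    window_len: number of batches covered by the window.
    """
    window = deque()  # per-batch values still inside the window
    total = None
    for value in batches:
        window.append(value)
        # Reduce: fold the entering batch into the running total.
        total = value if total is None else reduce_fn(total, value)
        if len(window) > window_len:
            # Inverse reduce: remove the batch that slid out of the window.
            total = inv_fn(total, window.popleft())
        yield total


batch_counts = [3, 1, 4, 1, 5, 9]
incremental = list(sliding_totals(batch_counts, window_len=3))

# Reference result: re-reduce every window from scratch.
recomputed = [sum(batch_counts[max(0, i - 2): i + 1]) for i in range(len(batch_counts))]
```

Both lists hold the same values, but the incremental version does a constant amount of work per slide. Note that the per-batch values still have to be retained until they leave the window, which is the state that grows with a weeks-long window.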



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-with-long-batch-window-duration-tp10191.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark Streaming with long batch / window duration

Posted by aaronjosephs <aa...@placeiq.com>.
So I think I may end up using Hourglass
(https://engineering.linkedin.com/datafu/datafus-hourglass-incremental-data-processing-hadoop),
a Hadoop framework for incremental data processing. It would be very cool if
Spark (not Streaming) could support something like this.




Re: Spark Streaming with long batch / window duration

Posted by aaronjosephs <aa...@placeiq.com>.
Unfortunately, for reasons I won't go into, my options for what I can use are
limited. It was more of a curiosity to see whether Spark could handle a use case
like this, since the functionality I wanted fit perfectly into the
reduceByKeyAndWindow frame of thinking. Anyway, thanks for answering.




Re: Spark Streaming with long batch / window duration

Posted by emceemouli <ch...@calgary.ca>.
Thanks. If I do not use a window and instead stream the data to HDFS,
could you suggest how to store only one week's worth of data? Should I create a
cron job to delete HDFS files older than a week? Please let me know if you
have any other suggestions.
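
One way to approach the cron-job idea is a small scheduled cleanup script. This is a minimal sketch, assuming the streaming job writes one directory per day named `YYYY-MM-DD` under a base path. That layout and the function name `expired_partitions` are hypothetical and not taken from the thread. The function only selects the directories that have aged out; the resulting paths could then be passed to `hdfs dfs -rm -r` by the cron job.

```python
from datetime import date, datetime, timedelta


def expired_partitions(dir_names, today, keep_days=7):
    """Return date-named directories strictly older than keep_days before today."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in dir_names:
        try:
            day = datetime.strptime(name, "%Y-%m-%d").date()
        except ValueError:
            continue  # skip anything that is not a date-named directory
        if day < cutoff:
            expired.append(name)
    return sorted(expired)


# Hypothetical directory listing under the base path.
names = ["2014-07-01", "2014-07-10", "2014-07-11", "2014-07-18", "_tmp"]
to_delete = expired_partitions(names, today=date(2014, 7, 18))
```

Keeping the selection logic separate from the delete command makes it easy to dry-run the job and inspect what would be removed before anything is deleted.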




Re: Spark Streaming with long batch / window duration

Posted by Tathagata Das <ta...@gmail.com>.
If you want to process data that spans weeks, then it is best to use a
dedicated data store (file system, SQL / NoSQL database, etc.) that is
designed for long-term data storage and retrieval. Spark Streaming is not
designed as a long-term data store. Also, it does not seem like you need low
latency. So it might be better to use a combination of Spark Streaming and
Spark programs - Spark Streaming to receive data and store it in some
long-term data store, and Spark to periodically (every hour, day?) pull the data
from the store and process it. You can implement the invertible function
yourself in Spark by storing the previous "reduced values" in the same data
store every time the Spark program is run, and then using that data the
next time.

The great thing is that both of these programs can share all the map and
reduce functions.
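
The "store the previous reduced values and reuse them on the next run" idea can be sketched roughly as follows. This is plain Python rather than a Spark job, and everything named here is an assumption for illustration: a local JSON file stands in for the data store, the function `update_window_totals` is hypothetical, and per-key counts with addition and subtraction stand in for the reduce function and its inverse.

```python
import json
import os
import tempfile


def update_window_totals(state_path, day, day_counts, window_days=7):
    """Fold one day's per-key counts into the stored window totals.

    The state file holds the running totals plus the per-day partial
    results that are still inside the window.
    """
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)
    else:
        state = {"totals": {}, "days": []}

    totals = state["totals"]
    # Reduce: add the new day's counts to the previous reduced values.
    for key, count in day_counts.items():
        totals[key] = totals.get(key, 0) + count
    state["days"].append([day, day_counts])

    # Inverse reduce: subtract days that have slid out of the window.
    while len(state["days"]) > window_days:
        _, old_counts = state["days"].pop(0)
        for key, count in old_counts.items():
            totals[key] -= count
            if totals[key] == 0:
                del totals[key]  # assumption: a zero count means the key is gone

    # Persist the state so the next periodic run can start from it.
    with open(state_path, "w") as f:
        json.dump(state, f)
    return totals


# Three simulated daily runs with a 2-day window.
state_file = os.path.join(tempfile.mkdtemp(), "window_state.json")
update_window_totals(state_file, "2014-07-16", {"a": 2, "b": 1}, window_days=2)
update_window_totals(state_file, "2014-07-17", {"a": 1}, window_days=2)
result = update_window_totals(state_file, "2014-07-18", {"c": 5}, window_days=2)
```

Each run touches only the newest day and the day that expires, so the cost per run does not depend on how long the window is. The per-day partial results are kept because the inverse step needs to know what to subtract.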

TD

