Posted to user@spark.apache.org by qihong <qc...@pivotal.io> on 2014/09/05 21:09:32 UTC

Re: how to choose right DStream batch interval

Reposting, since the original message was marked with "This post has NOT been
accepted by the mailing list yet."

I have some questions regarding DStream batch interval: 

1. If it only takes 0.5 seconds to process a batch 99% of the time, but 1% of
batches need 5 seconds to process (due to some random factor or failures),
then what's the right batch interval? 5 seconds (the worst case)?

2. What happens to DStream processing if one batch takes longer than the batch
interval? Can Spark recover from that?

Thanks,
Qihong



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-choose-right-DStream-batch-interval-tp13578p13579.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: how to choose right DStream batch interval

Posted by Tim Smith <se...@gmail.com>.
http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617

Slide 39 covers it.

On Tue, Sep 9, 2014 at 9:23 PM, qihong <qc...@pivotal.io> wrote:


Re: how to choose right DStream batch interval

Posted by qihong <qc...@pivotal.io>.
Hi Mayur,

Thanks for your response. I wrote a simple test that sets up a DStream with
5 batches. The batch duration is 1 second, and the 3rd batch takes an extra
2 seconds. The output of the test shows that the 3rd batch causes a backlog,
and Spark Streaming does catch up on the 4th and 5th batches (DStream.print
was modified to also output the system time):

-------------------------------------------
Time: 1409959708000 ms, system time: 1409959708269
-------------------------------------------
1155
-------------------------------------------
Time: 1409959709000 ms, system time: 1409959709033
-------------------------------------------
2255
delay 2000 ms
-------------------------------------------
Time: 1409959710000 ms, system time: 1409959712036
-------------------------------------------
3355
-------------------------------------------
Time: 1409959711000 ms, system time: 1409959712059
-------------------------------------------
4455
-------------------------------------------
Time: 1409959712000 ms, system time: 1409959712083
-------------------------------------------
5555
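The catch-up pattern in that output can be reproduced outside Spark with a toy
model of the batch scheduler (plain Python, not Spark code; the interval and
delay values simply mirror the test above):

```python
# Toy model of Spark Streaming's batch scheduler: batches are generated
# every `interval` seconds and processed one at a time, in order. A slow
# batch delays the ones queued behind it, but the scheduler catches up
# as soon as processing time drops back below the interval.

def simulate(num_batches, interval, proc_times):
    """Return a (scheduled_time, completion_time) pair for each batch."""
    results = []
    now = 0.0  # time at which the processor becomes free
    for i in range(num_batches):
        scheduled = i * interval          # when the batch is generated
        start = max(scheduled, now)       # may wait behind a backlog
        done = start + proc_times[i]
        results.append((scheduled, done))
        now = done
    return results

# 5 batches, 1 s interval; the 3rd batch takes an extra 2 s, as in the test
times = simulate(5, 1.0, [0.1, 0.1, 2.1, 0.1, 0.1])
for sched, done in times:
    print(f"scheduled t={sched:.1f}s  finished t={done:.1f}s")
```

As in the real output, batches 4 and 5 finish almost immediately after the
slow batch 3, because their input was already queued while batch 3 ran.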

Thanks!



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-choose-right-DStream-batch-interval-tp13578p13855.html


Re: how to choose right DStream batch interval

Posted by Mayur Rustagi <ma...@gmail.com>.
Spark will simply build up a backlog of batches; it will still manage to
process them, though if it keeps falling behind you may run out of memory
or see unreasonable latency. For momentary spikes, Spark Streaming will
manage.
If you need 100% on-time processing, you'll have to go with a 5-second
interval; the alternative is to process the data in two pipelines (0.5 s and
5 s) as two Spark Streaming jobs and overwrite the results of one with the
other.
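One way to put numbers on this trade-off (a back-of-the-envelope sketch, not
anything from the Spark API): a streaming job keeps up in the long run as long
as the *mean* processing time stays below the batch interval; the worst case
only determines how large the transient backlog gets.

```python
# Stability check for a batch interval, using the distribution from the
# original question: 99% of batches take 0.5 s, 1% take 5 s. The job is
# stable when mean processing time < batch interval; a slow batch then
# causes only a transient backlog, not unbounded growth.

fast_frac, fast_time = 0.99, 0.5   # seconds
slow_frac, slow_time = 0.01, 5.0

mean_proc = fast_frac * fast_time + slow_frac * slow_time
print(f"mean processing time: {mean_proc:.3f} s")

for interval in (0.6, 1.0, 5.0):
    stable = mean_proc < interval
    # A slow batch overruns the interval by (slow_time - interval) seconds;
    # each subsequent fast batch claws back (interval - fast_time) seconds.
    if stable and slow_time > interval:
        catch_up = (slow_time - interval) / (interval - fast_time)
        note = f"catches up after ~{catch_up:.0f} fast batches"
    elif stable:
        note = "never falls behind"
    else:
        note = "backlog grows without bound"
    print(f"interval {interval:.1f} s: {'stable' if stable else 'UNSTABLE'}; {note}")
```

With these numbers, even a 1-second interval is stable on average (mean 0.545 s),
which matches the catch-up behavior qihong observed in his test.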

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>


On Sat, Sep 6, 2014 at 12:39 AM, qihong <qc...@pivotal.io> wrote:
