Posted to user@spark.apache.org by sherif98 <sh...@gmail.com> on 2018/08/26 09:22:17 UTC

Spark Structured Streaming using S3 as data source

I have data that is continuously pushed to multiple S3 buckets. I want to set
up a Structured Streaming application that uses the S3 buckets as its data
source and performs stream-stream joins.

My question: if the application goes down for some reason, will restarting
it continue processing data from S3 where it left off?



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

Re: Spark Structured Streaming using S3 as data source

Posted by Sherif Hamdy <sh...@gmail.com>.

Thanks for the reply. To make sure I got this right:

Suppose I have 5 JSON files with 100 records in each file, and Spark fails
while processing the tenth record in the 3rd file. When the query runs again,
will it begin processing from the tenth record in the 3rd file? Did I get that
right?

This assumes that checkpointing is used.
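
The five-file scenario above can be modelled with a small toy program. This is
only a sketch and not Spark's implementation: it assumes that progress is
recorded per file (one file per batch) rather than per record, and that a
file's output is committed only after the whole file has been read. The names
run_query, simulate, crash_in and checkpoint.json are hypothetical and exist
only for this illustration.

```python
import json
import tempfile
from pathlib import Path


def run_query(input_dir, checkpoint, sink, crash_in=None):
    """Process each input file that is not yet recorded in the checkpoint."""
    done = set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()
    for path in sorted(input_dir.glob("*.json")):
        if path.name in done:
            continue  # committed by an earlier run, so skip it
        staged = []
        for index, line in enumerate(path.read_text().splitlines()):
            if path.name == crash_in and index == 9:
                raise RuntimeError("simulated failure on the tenth record")
            staged.append(json.loads(line))
        # Assumption: output and progress are committed together, and only
        # once the whole file has been read successfully.
        sink.extend(staged)
        done.add(path.name)
        checkpoint.write_text(json.dumps(sorted(done)))


def simulate():
    """Return (records after crash, records after restart, distinct records)."""
    with tempfile.TemporaryDirectory() as tmp:
        input_dir = Path(tmp) / "input"
        input_dir.mkdir()
        for f in range(1, 6):  # 5 files with 100 records each
            lines = [json.dumps({"file": f, "record": r}) for r in range(1, 101)]
            (input_dir / f"file{f}.json").write_text("\n".join(lines))
        checkpoint = Path(tmp) / "checkpoint.json"
        sink = []
        try:
            run_query(input_dir, checkpoint, sink, crash_in="file3.json")
        except RuntimeError:
            pass  # the "application" went down
        after_crash = len(sink)
        run_query(input_dir, checkpoint, sink)  # restart with the same checkpoint
        distinct = len({(r["file"], r["record"]) for r in sink})
        return after_crash, len(sink), distinct
```

In this toy model, simulate() returns (200, 500, 500): files 1 and 2 are
committed before the failure, and the restart reprocesses file 3 from its
first record instead of resuming at the tenth. Whether a real query behaves
this way depends on how its batches are formed and on the sink in use.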

On Mon, Aug 27, 2018 at 12:11 AM Burak Yavuz <br...@gmail.com> wrote:

> Yes, the checkpoint makes sure that you start off from where you left off.

Re: Spark Structured Streaming using S3 as data source

Posted by Burak Yavuz <br...@gmail.com>.

Yes, the checkpoint makes sure that you start off from where you left off.
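
For reference, a checkpoint is normally attached to a query through the
checkpointLocation option of the stream writer. The following is a minimal
PySpark sketch of the setup described in the question, not a tested
configuration. The function name, the column names (left_id, left_time,
right_id, right_time), the one-hour bounds, the Parquet sink and all paths
are assumptions made for illustration; none of them come from this thread.

```python
def start_joined_stream(spark, left_path, right_path, output_path, checkpoint_path):
    """Join two JSON file streams and write the result with a checkpoint."""
    # Imported here so the module can be loaded without Spark installed.
    from pyspark.sql.functions import expr
    from pyspark.sql.types import StringType, StructField, StructType, TimestampType

    # Streaming file sources need an explicit schema by default.
    # These columns are hypothetical.
    left_schema = StructType([
        StructField("left_id", StringType()),
        StructField("left_time", TimestampType()),
    ])
    right_schema = StructType([
        StructField("right_id", StringType()),
        StructField("right_time", TimestampType()),
    ])

    # Watermarks plus a time-range condition let Spark drop old join state;
    # for inner joins they are optional.
    left = (spark.readStream.schema(left_schema).json(left_path)
            .withWatermark("left_time", "1 hour"))
    right = (spark.readStream.schema(right_schema).json(right_path)
             .withWatermark("right_time", "1 hour"))

    joined = left.join(
        right,
        expr("left_id = right_id AND "
             "right_time BETWEEN left_time AND left_time + interval 1 hour"),
    )

    # Restarting with the same checkpoint location is what lets the query
    # resume from its recorded progress.
    return (joined.writeStream
            .format("parquet")
            .option("path", output_path)
            .option("checkpointLocation", checkpoint_path)
            .outputMode("append")
            .start())
```

A call might look like start_joined_stream(spark, "s3a://bucket-a/in/",
"s3a://bucket-b/in/", "s3a://bucket-c/out/", "s3a://bucket-c/checkpoints/join/"),
where the bucket names are placeholders. The checkpoint path has to stay the
same across restarts for the recorded progress to be found.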
