Posted to user@spark.apache.org by Scott W <de...@gmail.com> on 2016/10/26 16:20:48 UTC

Resiliency with SparkStreaming - fileStream

Hello,

I'm planning to use fileStream Spark streaming API to stream data from
HDFS. My Spark job would essentially process these files and post the
results to an external endpoint.

How does the fileStream API handle checkpointing of the files it has processed? In
other words, if my Spark job fails while posting the results to an
external endpoint, I want that same original file to be picked up again and
reprocessed.
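
For context, here is roughly the shape of what I have in mind (the paths,
input format, and endpoint logic below are just placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new SparkConf().setAppName("FileStreamJob")
val ssc = new StreamingContext(conf, Seconds(60))
ssc.checkpoint("hdfs:///checkpoints/filestream-job")  // metadata checkpointing

// Pick up new files as they land in the directory
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat]("hdfs:///incoming/data")
  .map { case (_, text) => text.toString }

lines.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // post the processed results to the external endpoint here
  }
}

ssc.start()
ssc.awaitTermination()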

Thanks much!

Re: Resiliency with SparkStreaming - fileStream

Posted by Michael Armbrust <mi...@databricks.com>.
I'll answer in the context of structured streaming (the new streaming API
built on DataFrames). When reading from files, the FileSource records
which files are included in each batch inside the given
checkpointLocation.  If you fail in the middle of a batch, the streaming
engine will retry that batch the next time the query is restarted.
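
For example, something along these lines (the paths and schema are just
placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("FileSourceExample").getOrCreate()

// File sources need a schema up front
val schema = new StructType().add("id", LongType).add("payload", StringType)

val input = spark.readStream
  .schema(schema)
  .json("hdfs:///incoming/data")

// The files included in each batch are tracked under checkpointLocation,
// so an interrupted batch is retried on restart.
val query = input.writeStream
  .format("parquet")
  .option("path", "hdfs:///output/results")
  .option("checkpointLocation", "hdfs:///checkpoints/file-source-query")
  .start()

query.awaitTermination()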

If you are concerned about exactly-once semantics, you can get that too.
The FileSink (i.e. using writeStream) writing out to something like parquet
does this automatically.  If you are writing to something like a
transactional database yourself, you can also implement similar
functionality.  Specifically, you can record the partition and version that
are provided by the open method
<https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/sql/ForeachWriter.html#open(long,%20long)>
into the database in the same transaction that is writing the data.  This
way, when you recover, you can avoid writing the same updates more than
once.
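
A rough sketch of that pattern, reusing the streaming `input` DataFrame from
the example above (the database bookkeeping is only described in comments,
and the table/column names are made up):

import org.apache.spark.sql.ForeachWriter

class TransactionalWriter extends ForeachWriter[String] {
  private var partitionId: Long = _
  private var version: Long = _

  // Called once per partition per batch; returning false tells Spark to skip
  // a (partitionId, version) pair that was already committed.
  override def open(partitionId: Long, version: Long): Boolean = {
    this.partitionId = partitionId
    this.version = version
    // e.g. return false if (partitionId, version) already exists in a
    // committed_batches table
    true
  }

  override def process(value: String): Unit = {
    // stage the row as part of the same database transaction
  }

  override def close(errorCause: Throwable): Unit = {
    // on success: also record (partitionId, version) in committed_batches,
    // then commit; on failure: roll back
  }
}

val query = input.as[String].writeStream
  .foreach(new TransactionalWriter)
  .option("checkpointLocation", "hdfs:///checkpoints/foreach-query")
  .start()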
