You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Jungtaek Lim (JIRA)" <ji...@apache.org> on 2019/08/09 02:33:00 UTC

[jira] [Commented] (SPARK-28650) Fix the guarantee of ForeachWriter

    [ https://issues.apache.org/jira/browse/SPARK-28650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16903510#comment-16903510 ] 

Jungtaek Lim commented on SPARK-28650:
--------------------------------------

It sounds like either correctness or data-loss, and most of cases end users should change their implementation of open method to always return true for safety.

Are you planning to work on this? If you don't plan to address this sooner, I'd like to take this up.

Given we've changed guarantee, do we want to keep the signature of "open" as it is? By leaving it as it is, we still give a chance to skip writing but according to the guarantee it only makes sense when skipping whole batch.

> Fix the guarantee of ForeachWriter
> ----------------------------------
>
>                 Key: SPARK-28650
>                 URL: https://issues.apache.org/jira/browse/SPARK-28650
>             Project: Spark
>          Issue Type: Documentation
>          Components: Structured Streaming
>    Affects Versions: 2.4.3
>            Reporter: Shixiong Zhu
>            Priority: Major
>
> Right now ForeachWriter has the following guarantee:
> {code}
> If the streaming query is being executed in the micro-batch mode, then every partition
> represented by a unique tuple (partitionId, epochId) is guaranteed to have the same data.
> Hence, (partitionId, epochId) can be used to deduplicate and/or transactionally commit data
> and achieve exactly-once guarantees.
> {code}
>  
> But we can break this easily actually when restarting a query but a batch is re-run (e.g., upgrade Spark)
>  * Source returns a different DataFrame that has a different partition number (e.g., we start to not create empty partitions in Kafka Source V2).
>  * A new added optimization rule may change the number of partitions in the new run.
>  * Change the file split size in the new run.
> Since we cannot guarantee that the same (partitionId, epochId) has the same data. We should update the document for "ForeachWriter".



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org