Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/08/29 23:13:50 UTC

[GitHub] [spark] HeartSaVioR edited a comment on issue #25618: [SPARK-28908][SS]Implement Kafka EOS sink for Structured Streaming

URL: https://github.com/apache/spark/pull/25618#issuecomment-526394910

I just skimmed the design doc (I still need to look more deeply at the fault tolerance part), and it's basically the known approach that Flink uses (2PC). Please mention what inspired you, to give credit.

I was planning to propose something similar before (I never proposed the design itself). More precisely, I asked for 2PC support at the DSv2 API level, since Spark doesn't support 2PC natively, but the feedback wasn't positive because it would be a very invasive change to the Spark codebase. There have been more cases asking for exactly-once writes, and I guess the common answer was to leverage intermediate output. While some storage can leverage it (e.g. RDBMS: writers write to a temp table, and the driver copies the rows that writers reported to the output table), it doesn't make sense for Kafka, at least for performance reasons, as there's no way to have Kafka copy records from topic A to topic B (right?), so I gave up.

If the code change implements 2PC correctly, I guess it would work in many cases in general, though, as the doc explains, a transaction timeout leads to data loss. I identified the transaction timeout issue when I worked on my own design, and it was one of my major concerns as well. Once the producer writes something, it must be committed within the timeout under any kind of failure; otherwise data loss happens. Even if we decide to invalidate that batch and rerun it, we are then "at-least-once". (I wonder whether Flink's Kafka producer with 2PC has a similar issue, or whether they have some safeguard.)
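
The timeout concern above can be illustrated with a toy model. `ToyBroker` and `run_batch` are hypothetical names, the clock is a plain counter, and none of this is the Kafka client API; it only assumes that the broker discards a transaction that is not committed within its timeout.

```python
class ToyBroker:
    """In-memory stand-in for a transactional log (NOT the Kafka API)."""

    def __init__(self, txn_timeout):
        self.txn_timeout = txn_timeout
        self.now = 0          # logical clock, advanced by the caller
        self.open_txns = {}   # txn_id -> (start_time, buffered records)
        self.log = []         # committed records, visible to readers

    def begin(self, txn_id):
        self.open_txns[txn_id] = (self.now, [])

    def send(self, txn_id, record):
        self.open_txns[txn_id][1].append(record)

    def commit(self, txn_id):
        started, records = self.open_txns.pop(txn_id)
        if self.now - started > self.txn_timeout:
            return False      # timed out: the buffered records are gone
        self.log.extend(records)
        return True


def run_batch(broker, batch_id, partitions, commit_delays):
    """Phase 1: each writer opens a transaction and writes its records.
    Phase 2: the driver commits them one by one; commit_delays models the
    time lost to a failure and recovery before a given commit."""
    txn_ids = {}
    for pid, records in partitions.items():
        txn_id = f"batch{batch_id}-part{pid}"
        broker.begin(txn_id)
        for record in records:
            broker.send(txn_id, record)
        txn_ids[pid] = txn_id
    results = {}
    for pid, txn_id in txn_ids.items():
        broker.now += commit_delays.get(pid, 0)
        results[pid] = broker.commit(txn_id)
    return results
```

With a timeout of 10, running a batch `{0: ["a"], 1: ["b"]}` where the commit for partition 1 is delayed by 20 commits only `"a"`, so `"b"` is lost. Rerunning the whole batch afterwards writes `"a"` a second time, which is the "at-least-once" outcome described above.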
