Posted to user@flink.apache.org by amran dean <ad...@gmail.com> on 2019/10/23 22:33:32 UTC

Flink Kafka->S3 exactly-once guarantees

Hello,

Suppose I am using a *nondeterministic*, time-based partitioning scheme (e.g.
Flink processing time) to bucket S3 objects via the *BucketAssigner*,
configured through the *BulkFormatBuilder* of a StreamingFileSink.
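
For concreteness, this is roughly the shape of the sink I have in mind
(MyRecord, the S3 path, and ParquetAvroWriters as the bulk writer factory are
just placeholders for my actual setup):

import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.DateTimeBucketAssigner;

// Bulk-format sink: part files are rolled on every checkpoint.
StreamingFileSink<MyRecord> sink = StreamingFileSink
        .forBulkFormat(new Path("s3://my-bucket/output"),
                ParquetAvroWriters.forReflectRecord(MyRecord.class))
        // Buckets are chosen by the subtask's wall-clock (processing) time,
        // so the assignment is nondeterministic across restarts.
        .withBucketAssigner(new DateTimeBucketAssigner<>("yyyy-MM-dd"))
        .build();

// `stream` is the DataStream<MyRecord> coming out of the Kafka source.
stream.addSink(sink);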

Suppose that an S3 multipart upload (MPU) has completed, but the job crashes
*before* Flink internally commits the newest offsets (whether via ZK or by
committing to Kafka directly), so those offsets are lost.
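
To be clear about what I mean by committing offsets: I'm assuming the usual
setup where the Kafka consumer commits offsets only on completed checkpoints,
roughly like this sketch (topic, group, and broker names are placeholders):

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

Properties props = new Properties();
props.setProperty("bootstrap.servers", "broker:9092");
props.setProperty("group.id", "my-group");

FlinkKafkaConsumer<String> consumer =
        new FlinkKafkaConsumer<>("my-topic", new SimpleStringSchema(), props);
// Offsets are committed back to Kafka only when a checkpoint completes,
// so a crash after the MPU but before the checkpoint commit loses them.
consumer.setCommitOffsetsOnCheckpoints(true);

DataStream<String> stream = env.addSource(consumer);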

If processing time is used to bucket S3 objects, will this result in
duplicate objects being written?
For example:
An S3 object for the 10-25 bucket is written, but the job crashes before the
offsets are committed.
When the job resumes, the system time is now 10-26, so instead of overwriting
the existing S3 object, a new one is created in a different bucket,
duplicating the replayed data.
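
The kind of assigner I mean is something like this hypothetical one, which
derives the bucket from the task's current processing time rather than from
anything in the record itself:

import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import org.apache.flink.core.io.SimpleVersionedSerializer;
import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;

public class ProcessingTimeBucketAssigner<IN> implements BucketAssigner<IN, String> {

    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneId.of("UTC"));

    @Override
    public String getBucketId(IN element, BucketAssigner.Context context) {
        // The bucket depends on wall-clock time, not on the record, so the same
        // record replayed after a crash can land in a different bucket
        // (e.g. 10-26 instead of 10-25).
        return FORMAT.format(Instant.ofEpochMilli(context.currentProcessingTime()));
    }

    @Override
    public SimpleVersionedSerializer<String> getSerializer() {
        return SimpleVersionedStringSerializer.INSTANCE;
    }
}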

How does Flink prevent this?