Posted to dev@spark.apache.org by Yash Sharma <ya...@gmail.com> on 2018/01/11 23:35:23 UTC

Structured Streaming with S3 file source duplicates data because of eventual consistency

Hi Team,
I have been using Structured Streaming with the S3 file source, but I am
seeing it duplicate data intermittently. A new run seems to fix it, but the
duplication happens ~10% of the time, and the ratio increases with the
number of files in the source. Investigating further, this looks clearly
like an issue with S3's eventual consistency: Spark re-processes a task
twice because it is not able to verify that the completed task successfully
wrote its output.
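To make the failure mode concrete, here is a toy simulation (plain Python, not Spark code, and all names are made up for illustration): a store whose existence checks lag behind writes returns a stale "not found" when the task's output is verified, so the task is retried and every record lands in the sink twice.

```python
class EventuallyConsistentStore:
    """Simulated object store whose existence checks lag behind writes."""

    def __init__(self, stale_reads=1):
        self.objects = {}
        self.stale_reads = stale_reads  # reads that still miss a fresh write

    def put(self, key, records):
        self.objects[key] = records

    def exists(self, key):
        # Return a stale "not found" for the first reads after a write,
        # mimicking S3's (pre-2020) eventual consistency on listings.
        if key in self.objects and self.stale_reads > 0:
            self.stale_reads -= 1
            return False
        return key in self.objects


def run_task(store, sink, key, records):
    """Write records, then verify the output; retry the task on failure."""
    attempts = 0
    while True:
        attempts += 1
        store.put(key, records)
        sink.extend(records)       # side effect: records reach the sink
        if store.exists(key):      # stale read -> whole task re-runs
            return attempts


store = EventuallyConsistentStore(stale_reads=1)
sink = []
attempts = run_task(store, sink, "part-00000", ["a", "b"])
print(attempts, sink)  # 2 ['a', 'b', 'a', 'b']
```

With a single stale read the task runs twice and the sink holds each record twice, which matches the intermittent ~10% duplication described above: duplication only occurs on the fraction of runs where verification happens to hit a stale read.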

I have added all the details of the investigation, with code and error
logs, in the ticket below. Is there a way we can address this issue, and
is there anything I can help out with?

https://issues.apache.org/jira/browse/SPARK-23050

Cheers