Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/04 21:03:29 UTC

[GitHub] [beam] damccorm opened a new issue, #21082: FileIO can produce duplicates in output files

damccorm opened a new issue, #21082:
URL: https://github.com/apache/beam/issues/21082

   FileIO can produce duplicates in output files, depending on the runner.
   
   A concrete example for Spark when executing in batch mode:
   
   When FileIO is used with a specific number of shards, it uses the default sharding function, which is a round-robin shard assignment with a random seed. In a multistage pipeline, data between stages is held by the shuffle service until the downstream stage requests it for further computation. If the shuffle results computed with this seeded shard function are lost - e.g. the shuffle service fails because of a HW error - then Spark will attempt to recover the data by recomputing it from the source data. Because the sharding is randomly seeded, the recomputation assigns a different shard - and therefore a different key - to the same element, so the same element can end up in more than one output file.
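
The effect described above can be reproduced with a small toy model. This is only a sketch and not Beam's or Spark's actual code: the function names, the fixed start offsets standing in for the random seed, and the assumption that shard 0 survives from the first attempt while the other shards are recomputed are all hypothetical and chosen for illustration.

```python
import zlib
from collections import Counter


def round_robin_shards(elements, num_shards, start):
    """Round-robin shard assignment; `start` stands in for the random seed."""
    return {e: (start + i) % num_shards for i, e in enumerate(elements)}


def hash_shards(elements, num_shards):
    """Shard assignment derived only from the element, so it is repeatable."""
    return {e: zlib.crc32(e.encode("utf-8")) % num_shards for e in elements}


def written_output(first, retry, num_shards, survived):
    """Collect what gets written when shards in `survived` come from the
    first attempt and all other shards come from the recomputation."""
    out = []
    for shard in range(num_shards):
        source = first if shard in survived else retry
        out.extend(sorted(e for e, s in source.items() if s == shard))
    return out


ELEMENTS = ["a", "b", "c", "d", "e", "f"]
NUM_SHARDS = 3

# Two attempts of the same stage that happened to draw different offsets.
first = round_robin_shards(ELEMENTS, NUM_SHARDS, start=0)
retry = round_robin_shards(ELEMENTS, NUM_SHARDS, start=1)
mixed = written_output(first, retry, NUM_SHARDS, survived={0})
duplicates = sorted(e for e, n in Counter(mixed).items() if n > 1)
missing = sorted(set(ELEMENTS) - set(mixed))
print("seeded round robin, duplicates:", duplicates)  # ['a', 'd']
# In this toy model 'c' and 'f' are also dropped from the output.

# With a repeatable assignment both attempts agree, so mixing is harmless.
stable_first = hash_shards(ELEMENTS, NUM_SHARDS)
stable_retry = hash_shards(ELEMENTS, NUM_SHARDS)
stable_mixed = written_output(stable_first, stable_retry, NUM_SHARDS, survived={0})
print("hash-based, output:", sorted(stable_mixed))
```

The hash-based variant is shown only for contrast: any shard assignment that depends solely on the element gives the same result when the stage is recomputed. Whether that is the right fix for FileIO is a separate question, discussed in the thread linked below.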
   
   More details are discussed in this thread:
   https://lists.apache.org/thread.html/r5e91d1996479defbf5e896dca3cf237ee2d9b59396cb3c4edf619df1%40%3Cdev.beam.apache.org%3E
   
   Imported from Jira [BEAM-12654](https://issues.apache.org/jira/browse/BEAM-12654). Original Jira may contain additional context.
   Reported by: jvilcek.

