Posted to issues@beam.apache.org by "Beam JIRA Bot (Jira)" <ji...@apache.org> on 2021/10/05 17:25:01 UTC

[jira] [Commented] (BEAM-12654) FileIO can produce duplicates in output files

    [ https://issues.apache.org/jira/browse/BEAM-12654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424605#comment-17424605 ] 

Beam JIRA Bot commented on BEAM-12654:
--------------------------------------

This issue was marked "stale-P2" and has not received a public comment in 14 days. It is now automatically moved to P3. If you are still affected by it, you can comment and move it back to P2.

> FileIO can produce duplicates in output files
> ---------------------------------------------
>
>                 Key: BEAM-12654
>                 URL: https://issues.apache.org/jira/browse/BEAM-12654
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-core
>            Reporter: Jozef Vilcek
>            Priority: P3
>
> FileIO can produce duplicates in output files, depending on the runner.
> Concrete example for Spark when executing as batch:
> When FileIO is used with a specific number of shards, it uses the default sharding function, which is a round-robin shard assignment with a random seed. In a multistage pipeline, data between stages is held by the shuffle service until a downstream stage requests it for further computation. If shuffle results computed with this seeded sharding function are lost (e.g. the shuffle service fails because of a HW error), Spark will attempt to recover the data by recomputing it from the source data. Because the sharding uses a random seed, the recomputation can assign a different shard, and therefore a different key, to the same element, which is how duplicates can end up in the output files.
> More details are discussed in this thread:
> https://lists.apache.org/thread.html/r5e91d1996479defbf5e896dca3cf237ee2d9b59396cb3c4edf619df1%40%3Cdev.beam.apache.org%3E
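
The effect described above can be illustrated with a small sketch. It is not Beam or Spark code: the function names, the explicit `start` offset standing in for the random seed, and the failure model in `mixed_output` (shard 0 kept from the first attempt, the rest taken from a retry) are all assumptions made for illustration. It only shows why a shard assignment that changes between attempts can write an element twice or not at all, while an assignment derived from the element itself stays stable.

```python
import zlib
from collections import Counter


def round_robin_shards(elements, num_shards, start):
    """Assign shards round-robin, beginning at `start`.

    `start` stands in for the randomly seeded offset: a recomputation
    would normally draw a different value.
    """
    return {e: (start + i) % num_shards for i, e in enumerate(elements)}


def hash_shards(elements, num_shards):
    """Assign shards from a stable hash of the element itself."""
    return {e: zlib.crc32(e.encode("utf-8")) % num_shards for e in elements}


def mixed_output(first, retry, num_shards):
    """Hypothetical failure model: shard 0 is taken from the first attempt,
    every other shard from the retry. Counts how often each element is written."""
    written = Counter()
    for shard in range(num_shards):
        source = first if shard == 0 else retry
        written.update(e for e, s in source.items() if s == shard)
    return written


elements = ["a", "b", "c"]

first = round_robin_shards(elements, 2, start=0)  # {'a': 0, 'b': 1, 'c': 0}
retry = round_robin_shards(elements, 2, start=1)  # {'a': 1, 'b': 0, 'c': 1}
print(mixed_output(first, retry, 2))  # 'a' and 'c' written twice, 'b' never

stable = hash_shards(elements, 2)
print(mixed_output(stable, hash_shards(elements, 2), 2))  # each element once
```

In this simplified model the hash-based assignment gives the same result on every attempt, so mixing output from two attempts still writes each element exactly once. Whether a deterministic sharding function is the right remedy for FileIO is a question for the linked thread, not something this sketch settles.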


