Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/04 22:46:54 UTC

[GitHub] [beam] damccorm opened a new issue, #21346: Users cannot provide their own sharding function when using FileIO

damccorm opened a new issue, #21346:
URL: https://github.com/apache/beam/issues/21346

   Beam uses RandomShardingFunction ([https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java#L834](https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java#L834)) by default for sharding when using FileIO.write().
   
    
   
   RandomShardingFunction doesn’t work well with the Flink Runner. It assigns the same key, hashDestination(destination, destinationCoder), along with the shard number ([https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java#L863](https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java#L863)).
   
   This causes different ShardedKeys to go to the same task slot. As a result, the files written are not evenly distributed across task slots.
   
   E.g. task slot 1 writes 2 files, task slot 2 writes 3 files, and the other task slots write 0 files.
   
    
   
   Consequently, we are seeing out-of-memory errors on the pods that write more files.
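
The skew described above can be illustrated with a small, self-contained sketch. It is not taken from Beam or Flink: `slotFor` is a hypothetical, simplified stand-in for how a runner might map a sharded key to a task slot (scrambled hash modulo parallelism), and `mix`, `hashedAssignment` and `roundRobinAssignment` are illustrative names. It only shows that hashing a handful of shard keys onto a handful of slots gives no guarantee of one file per slot, whereas an explicit round-robin assignment does.

```java
import java.util.Arrays;

final class ShardSkewSketch {
    // murmur3-style finalizer, used here only to scramble the bits of a hash.
    static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // ASSUMPTION: simplified slot assignment (scrambled key hash modulo parallelism).
    // This is not the Flink Runner's actual partitioning logic.
    static int slotFor(int destinationHash, int shard, int parallelism) {
        return Math.floorMod(mix(31 * destinationHash + shard), parallelism);
    }

    // Counts how many shards (files) each slot receives when slots are chosen by hash.
    static int[] hashedAssignment(int destinationHash, int numShards, int parallelism) {
        int[] filesPerSlot = new int[parallelism];
        for (int shard = 0; shard < numShards; shard++) {
            filesPerSlot[slotFor(destinationHash, shard, parallelism)]++;
        }
        return filesPerSlot;
    }

    // Counts how many shards each slot receives when shards are dealt out in turn.
    static int[] roundRobinAssignment(int numShards, int parallelism) {
        int[] filesPerSlot = new int[parallelism];
        for (int shard = 0; shard < numShards; shard++) {
            filesPerSlot[shard % parallelism]++;
        }
        return filesPerSlot;
    }

    public static void main(String[] args) {
        System.out.println("hashed:      " + Arrays.toString(hashedAssignment(42, 5, 5)));
        System.out.println("round-robin: " + Arrays.toString(roundRobinAssignment(5, 5)));
    }
}
```

With 5 shards and 5 slots, the round-robin counts are always one file per slot. The hashed counts depend on the hash values and can easily come out uneven, which matches the kind of imbalance reported here.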
   
    
   
   WriteFiles has an option to supply a different sharding function ([https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java#L695-L698](https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java#L695-L698)).
   
   However, the FileIO class does not expose this option, so users of FileIO.write() have no way to specify their own sharding function.
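
As a rough sketch of what a user-supplied sharding function might look like, the following self-contained example uses hypothetical stand-in types. `SketchShardedKey` and `SketchShardingFunction` are not Beam classes; they are only modeled loosely on the idea of a function that maps a destination and an element to a key plus a shard number. `RoundRobinSharding` is likewise illustrative, and whether such a strategy balances task slots on a given runner is untested.

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a sharded key: a grouping key plus a shard number.
final class SketchShardedKey {
    final int key;
    final int shardNumber;

    SketchShardedKey(int key, int shardNumber) {
        this.key = key;
        this.shardNumber = shardNumber;
    }
}

// Hypothetical stand-in for a pluggable sharding function (not Beam's interface).
interface SketchShardingFunction<UserT, DestinationT> {
    SketchShardedKey assignShardKey(DestinationT destination, UserT element, int shardCount);
}

// Cycles through shard numbers in turn instead of picking one at random.
final class RoundRobinSharding<UserT, DestinationT>
        implements SketchShardingFunction<UserT, DestinationT> {
    private final AtomicInteger next = new AtomicInteger();

    @Override
    public SketchShardedKey assignShardKey(DestinationT destination, UserT element, int shardCount) {
        int shard = Math.floorMod(next.getAndIncrement(), shardCount);
        // ASSUMPTION: fold the shard number into the key so that each shard of a
        // destination gets a distinct key, unlike the shared key described above.
        int key = 31 * Objects.hashCode(destination) + shard;
        return new SketchShardedKey(key, shard);
    }

    public static void main(String[] args) {
        RoundRobinSharding<String, String> fn = new RoundRobinSharding<>();
        for (int i = 0; i < 6; i++) {
            SketchShardedKey k = fn.assignShardKey("dest", "element-" + i, 3);
            System.out.println("element-" + i + " -> shard " + k.shardNumber);
        }
    }
}
```

The counter is local to each function instance, so in a distributed job each worker would cycle through the shards independently. This sketch illustrates only the shape of the extension point requested in this issue, not an existing FileIO API.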
   
   Imported from Jira [BEAM-13667](https://issues.apache.org/jira/browse/BEAM-13667). Original Jira may contain additional context.
   Reported by: kathulasandeep.

