Posted to issues@beam.apache.org by "Jozef Vilcek (Jira)" <ji...@apache.org> on 2021/07/23 06:51:00 UTC

[jira] [Commented] (BEAM-12493) FileIO should allow to opt-in for custom sharding function

    [ https://issues.apache.org/jira/browse/BEAM-12493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17386003#comment-17386003 ] 

Jozef Vilcek commented on BEAM-12493:
-------------------------------------

Per discussion here: 
[https://lists.apache.org/thread.html/r5e91d1996479defbf5e896dca3cf237ee2d9b59396cb3c4edf619df1%40%3Cdev.beam.apache.org%3E]

closing this request. Custom sharding needs to be handled manually by Beam users around FileIO by customizing dynamic destinations.
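
For illustration, a minimal sketch of that workaround, assuming a hypothetical MyRecord type with getBucketKey() and toCsvLine() methods (the output path and the bucket count of 16 are placeholders as well). Instead of a sharding function, each element is routed deterministically to a bucket via dynamic destinations:

import org.apache.beam.sdk.coders.VarIntCoder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Contextful;
import org.apache.beam.sdk.values.PCollection;

class BucketedWrite {
  static void writeBucketed(PCollection<MyRecord> records) {
    records.apply(
        FileIO.<Integer, MyRecord>writeDynamic()
            // Derive the bucket deterministically from the element itself,
            // instead of relying on the default RandomShardingFunction.
            .by(r -> Math.floorMod(r.getBucketKey().hashCode(), 16))
            .withDestinationCoder(VarIntCoder.of())
            .via(Contextful.fn(MyRecord::toCsvLine), TextIO.sink())
            .to("/path/to/output")
            .withNaming(bucket -> FileIO.Write.defaultNaming("part-" + bucket, ".csv"))
            // A single shard per destination keeps the bucket -> file mapping stable.
            .withNumShards(1));
  }
}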

The problem of FileIO generating duplicates in output files is tracked by a separate Jira (linked here).

> FileIO should allow to opt-in for custom sharding function
> ----------------------------------------------------------
>
>                 Key: BEAM-12493
>                 URL: https://issues.apache.org/jira/browse/BEAM-12493
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-java-core
>    Affects Versions: 2.29.0
>            Reporter: Jozef Vilcek
>            Assignee: Jozef Vilcek
>            Priority: P2
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> When the number of shards is explicitly specified, the default sharding function is `RandomShardingFunction`. `WriteFiles` does have an option to pass in a custom sharding function, but that is not surfaced in the user-facing API of `FileIO`.
> This is limiting in these two use cases:
>  # I need to generate shards which are compatible with Hive bucketing, and therefore need to decide shard assignment based on data fields of the element being sharded
>  # When running e.g. on Spark and the job encounters a failure which causes loss of some data from previous stages, Spark recomputes the necessary tasks in the necessary stages. Because shard assignment is random, some data will end up in different shards and cause duplicates in the final dataset
> I propose to surface `.withShardingFunction()` at the FileIO level so users can choose a custom sharding strategy when desired.
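
For reference, a data-dependent sharding function for use case 1 could look roughly like the sketch below. It is written against the ShardingFunction interface that WriteFiles already accepts (per the description above); MyRecord and getBucketKey() are hypothetical, and the destination key is a simplification of the destination hashing that WriteFiles' built-in RandomShardingFunction performs internally:

import org.apache.beam.sdk.io.ShardingFunction;
import org.apache.beam.sdk.values.ShardedKey;

// Hive-bucketing-style sharding: the shard is derived from a field of the
// element, so re-computed elements always land in the same shard again.
class BucketingShardingFunction<DestinationT>
    implements ShardingFunction<MyRecord, DestinationT> {

  @Override
  public ShardedKey<Integer> assignShardKey(
      DestinationT destination, MyRecord element, int shardCount) {
    // Deterministic assignment based on the element's bucketing key.
    int shard = Math.floorMod(element.getBucketKey().hashCode(), shardCount);
    // Simplified: WriteFiles keys the ShardedKey by a hash of the encoded destination.
    return ShardedKey.of(destination.hashCode(), shard);
  }
}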


