Posted to issues@beam.apache.org by "Kyle Winkelman (Jira)" <ji...@apache.org> on 2022/03/01 14:27:00 UTC

[jira] [Assigned] (BEAM-6735) WriteFiles with runner-determined sharding is forced to handle spilling

     [ https://issues.apache.org/jira/browse/BEAM-6735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kyle Winkelman reassigned BEAM-6735:
------------------------------------

    Fix Version/s: 2.12.0
         Assignee: Kyle Winkelman

> WriteFiles with runner-determined sharding is forced to handle spilling
> -----------------------------------------------------------------------
>
>                 Key: BEAM-6735
>                 URL: https://issues.apache.org/jira/browse/BEAM-6735
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-java-core
>            Reporter: Kyle Winkelman
>            Assignee: Kyle Winkelman
>            Priority: P3
>              Labels: Clarified
>             Fix For: 2.12.0
>
>          Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> As a result of BEAM-2302, files in excess of WriteFiles' maxNumWritersPerBundle are shuffled to be written later. The downside is that even if you can guarantee maxNumWritersPerBundle is high enough to handle all writes, you still pay the overhead of the write now being a MultiOutput ParDo.
> For example, in the Spark Runner, when a ParDo has multiple outputs the returned data is cached; when the disableCache pipeline option is used, this causes recalculation and all the temp files are written again.
> I'm sure the Spark Runner is not the only runner that would benefit from an optional setting for WriteFiles that skips this spilling and simplifies the pipeline.
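
A minimal sketch of how the proposed opt-out might look from user code, assuming the setting surfaces as a withNoSpilling() option on TextIO.write(); the method name is an assumption based on the behaviour described above, not something stated in this issue:

    // Sketch: write text files with runner-determined sharding while opting out
    // of the spilling path described in the issue. withNoSpilling() is the
    // assumed name of the proposed WriteFiles/TextIO setting.
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;

    public class NoSpillingWriteExample {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply(Create.of("a", "b", "c"))
         // No withNumShards(): sharding stays runner-determined.
         .apply(TextIO.write()
             .to("/tmp/output/part")
             // Assumed opt-out: skip spilling so the write is not forced
             // into a MultiOutput ParDo (relevant to the Spark Runner
             // caching/recalculation issue mentioned above).
             .withNoSpilling());

        p.run().waitUntilFinish();
      }
    }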



--
This message was sent by Atlassian Jira
(v8.20.1#820001)