Posted to issues@flink.apache.org by "Jingsong Lee (Jira)" <ji...@apache.org> on 2022/06/21 05:48:00 UTC

[jira] [Updated] (FLINK-27696) Add bin-pack strategy to split the whole bucket data files into several small splits

     [ https://issues.apache.org/jira/browse/FLINK-27696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jingsong Lee updated FLINK-27696:
---------------------------------
    Summary: Add bin-pack strategy to split the whole bucket data files into several small splits  (was: Add bin-pack strategy to split the whole bucket data files into several small splits for append-only table.)

> Add bin-pack strategy to split the whole bucket data files into several small splits
> ------------------------------------------------------------------------------------
>
>                 Key: FLINK-27696
>                 URL: https://issues.apache.org/jira/browse/FLINK-27696
>             Project: Flink
>          Issue Type: Sub-task
>            Reporter: Zheng Hu
>            Assignee: Jingsong Lee
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: table-store-0.2.0
>
>
> We don't have to assign all data files of a bucket to a single task. Instead, we can use an algorithm (such as bin-packing) to split a bucket's data files into multiple fragments and improve the job parallelism.
> For a merge tree table:
> Suppose the bucket now contains files with these key ranges: [1, 2] [3, 4] [5, 180] [5, 190] [200, 600] [210, 700]
> Files whose ranges do not intersect are unrelated, so we do not need to put all files into one split; we can slice them into multiple splits, and executing those in parallel is faster. We should not slice too finely either: each split should be as large as possible, up to 128 MB. Using BinPack to slice, the final result will be:
>  * split1: [1, 2] [3, 4]
>  * split2: [5, 180] [5, 190]
>  * split3: [200, 600] [210, 700]
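
The slicing described above can be sketched roughly as follows. This is only an illustration, not the table store implementation: the names DataFile, group_overlapping and pack_sections are hypothetical, the file sizes are made up (the issue gives none), and the ordered "next-fit" packing is an assumed variant of bin-packing. Files whose key ranges intersect are first grouped into a section that must stay together, then sections are packed in key order into splits of at most 128 MB.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataFile:
    # Hypothetical stand-in for a data file's metadata.
    min_key: int
    max_key: int
    size_mb: int  # assumed size; the issue does not state file sizes


def group_overlapping(files: List[DataFile]) -> List[List[DataFile]]:
    """Group files whose key ranges intersect into sections."""
    sections: List[List[DataFile]] = []
    current: List[DataFile] = []
    current_max = 0
    for f in sorted(files, key=lambda f: (f.min_key, f.max_key)):
        # A file starting after the section's largest key opens a new section.
        if current and f.min_key > current_max:
            sections.append(current)
            current = []
        current_max = f.max_key if not current else max(current_max, f.max_key)
        current.append(f)
    if current:
        sections.append(current)
    return sections


def pack_sections(sections: List[List[DataFile]],
                  target_mb: int = 128) -> List[List[DataFile]]:
    """Pack sections, in order, into splits of at most target_mb.

    A single section larger than target_mb still becomes its own split,
    because intersecting files cannot be separated.
    """
    splits: List[List[DataFile]] = []
    current: List[DataFile] = []
    current_size = 0
    for section in sections:
        section_size = sum(f.size_mb for f in section)
        # Close the current split if adding this section would exceed the target.
        if current and current_size + section_size > target_mb:
            splits.append(current)
            current, current_size = [], 0
        current.extend(section)
        current_size += section_size
    if current:
        splits.append(current)
    return splits


# Key ranges from the issue; all sizes below are assumptions for illustration.
files = [
    DataFile(1, 2, 10), DataFile(3, 4, 10),
    DataFile(5, 180, 60), DataFile(5, 190, 60),
    DataFile(200, 600, 60), DataFile(210, 700, 60),
]
splits = pack_sections(group_overlapping(files))
for i, split in enumerate(splits, start=1):
    print(f"split{i}:", [(f.min_key, f.max_key) for f in split])
```

With these assumed sizes, the two small files share one split, while each pair of intersecting files fills a split of its own. Different sizes would change how many sections fit into one split, but intersecting files always stay together.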



--
This message was sent by Atlassian Jira
(v8.20.7#820007)