Posted to issues@flink.apache.org by "Jingsong Lee (Jira)" <ji...@apache.org> on 2022/06/20 03:20:00 UTC

[jira] [Updated] (FLINK-27696) Add bin-pack strategy to split the whole bucket data files into several small splits for append-only table.

     [ https://issues.apache.org/jira/browse/FLINK-27696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jingsong Lee updated FLINK-27696:
---------------------------------
    Description: 
We don't have to assign a whole bucket's data files to a single task. Instead, we can use an algorithm (such as bin-packing) to split a bucket's data files into multiple splits and improve job parallelism.

For a merge tree table:
Suppose a bucket contains files with the following key ranges: [1, 2] [3, 4] [5, 180] [5, 190] [200, 600] [210, 700]
Files whose key ranges do not intersect are unrelated, so they do not all need to go into one split. Slicing them into multiple splits lets them be read in parallel, which is faster. We should not slice too finely either: each split should be as large as possible, up to a target of 128 MB. Using BinPack to slice, the final result will be:
 * split1: [1, 2] [3, 4]
 * split2: [5, 180] [5, 190]
 * split3: [200, 600] [210, 700]
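
The slicing above can be sketched roughly as follows. This is only an illustration, not the Table Store implementation: the `DataFile` class, the function names, and the per-file sizes are hypothetical (the issue gives no file sizes), and the packing step is assumed to be a simple ordered bin-pack that closes a split once adding the next group would exceed the target size. Files with intersecting key ranges are first grouped together so that they always land in the same split.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a data file: key range plus size in MB."""
    min_key: int
    max_key: int
    size_mb: int


def group_overlapping(files: List[DataFile]) -> List[List[DataFile]]:
    """Group files whose key ranges intersect; such files must stay together."""
    sections: List[List[DataFile]] = []
    current: List[DataFile] = []
    current_max = 0
    for f in sorted(files, key=lambda f: (f.min_key, f.max_key)):
        # A file starting after the current group's max key begins a new group.
        if current and f.min_key > current_max:
            sections.append(current)
            current = []
        current_max = f.max_key if not current else max(current_max, f.max_key)
        current.append(f)
    if current:
        sections.append(current)
    return sections


def bin_pack_splits(files: List[DataFile], target_mb: int) -> List[List[DataFile]]:
    """Pack groups, in key order, into splits of at most target_mb where possible."""
    splits: List[List[DataFile]] = []
    current: List[DataFile] = []
    current_size = 0
    for section in group_overlapping(files):
        section_size = sum(f.size_mb for f in section)
        # Close the current split if this group would push it past the target.
        if current and current_size + section_size > target_mb:
            splits.append(current)
            current, current_size = [], 0
        current.extend(section)
        current_size += section_size
    if current:
        splits.append(current)
    return splits


# Key ranges are from the example above; the sizes (in MB) are made up.
example_files = [
    DataFile(1, 2, 40), DataFile(3, 4, 50),
    DataFile(5, 180, 70), DataFile(5, 190, 50),
    DataFile(200, 600, 60), DataFile(210, 700, 60),
]

for i, split in enumerate(bin_pack_splits(example_files, target_mb=128), start=1):
    print(f"split{i}: " + " ".join(f"[{f.min_key}, {f.max_key}]" for f in split))
# split1: [1, 2] [3, 4]
# split2: [5, 180] [5, 190]
# split3: [200, 600] [210, 700]
```

With these assumed sizes, the first two files fit together under 128 MB, while each overlapping pair already totals 120 MB and so gets a split of its own. A group that alone exceeds the target is still kept whole in this sketch, since overlapping files cannot be separated.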

  was: For an append-only table, we don't have to assign a whole bucket's data files to a single task. Instead, we can use an algorithm (such as bin-packing) to split a bucket's data files into multiple fragments and improve job parallelism.


> Add bin-pack strategy to split the whole bucket data files into several small splits for append-only table.
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-27696
>                 URL: https://issues.apache.org/jira/browse/FLINK-27696
>             Project: Flink
>          Issue Type: Sub-task
>            Reporter: Zheng Hu
>            Assignee: Jingsong Lee
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: table-store-0.2.0
>
>
> We don't have to assign a whole bucket's data files to a single task. Instead, we can use an algorithm (such as bin-packing) to split a bucket's data files into multiple splits and improve job parallelism.
> For a merge tree table:
> Suppose a bucket contains files with the following key ranges: [1, 2] [3, 4] [5, 180] [5, 190] [200, 600] [210, 700]
> Files whose key ranges do not intersect are unrelated, so they do not all need to go into one split. Slicing them into multiple splits lets them be read in parallel, which is faster. We should not slice too finely either: each split should be as large as possible, up to a target of 128 MB. Using BinPack to slice, the final result will be:
>  * split1: [1, 2] [3, 4]
>  * split2: [5, 180] [5, 190]
>  * split3: [200, 600] [210, 700]



--
This message was sent by Atlassian Jira
(v8.20.7#820007)