Posted to issues@flink.apache.org by "luoyuxia (Jira)" <ji...@apache.org> on 2022/04/21 07:50:00 UTC

[jira] [Updated] (FLINK-27338) Improve splitting files for Hive source

     [ https://issues.apache.org/jira/browse/FLINK-27338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

luoyuxia updated FLINK-27338:
-----------------------------
    Description: 
Currently, the Hive source uses the HDFS block size, configured with the key dfs.block.size in hdfs-site.xml, as the maximum split size when splitting files. The default value is usually 128 MB or 256 MB, depending on the configuration.

This splitting strategy is not ideal: the number of splits tends to be small, so the job cannot make good use of parallel computing.

What's more, when parallelism inference is enabled for the Hive source, the source parallelism is set to the number of splits as long as it does not exceed the max parallelism. A small number of splits therefore limits the source parallelism and can degrade performance.

To solve this problem, the idea is to calculate a reasonable split size based on the files' total size, the block size, and the default parallelism or the parallelism configured by the user.
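The proposed sizing step could be sketched as below. This is only an illustrative calculation under the assumptions stated in the description, not Flink's actual implementation; the class and method names (SplitSizeCalculator, calculateSplitSize, minSplitSize) are hypothetical:

```java
// Hypothetical sketch: derive a max split size from the files' total size,
// the desired parallelism, and the HDFS block size. Not the Flink API.
public class SplitSizeCalculator {

    /**
     * Returns a split size that yields roughly {@code parallelism} splits,
     * capped at the HDFS block size (to keep splits block-aligned) and
     * floored at a configured minimum (to avoid tiny splits).
     */
    static long calculateSplitSize(long totalFileSize, long blockSize,
                                   int parallelism, long minSplitSize) {
        // Aim for at least one split per parallel task.
        long targetSize = (long) Math.ceil((double) totalFileSize / parallelism);
        return Math.max(minSplitSize, Math.min(blockSize, targetSize));
    }

    public static void main(String[] args) {
        long blockSize   = 128L * 1024 * 1024;        // 128 MB HDFS block size
        long totalSize   = 10L * 1024 * 1024 * 1024;  // 10 GB of input files
        int  parallelism = 200;                       // configured or inferred
        long minSplit    = 1L * 1024 * 1024;          // 1 MB floor

        // Prints the chosen split size in bytes (~51 MB here, well under
        // the 128 MB block-size cap, so all 200 tasks get work).
        System.out.println(
            calculateSplitSize(totalSize, blockSize, parallelism, minSplit));
    }
}
```

With the block-size-only strategy, the same 10 GB of files would produce only about 80 splits of 128 MB, leaving 120 of the 200 tasks idle; sizing against the parallelism closes that gap.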

  was:
Currently, the Hive source uses the HDFS block size, configured as dfs.block.size, as the maximum split size when splitting files. The default value is usually 128 MB or 256 MB, depending on the configuration.

This splitting strategy is not ideal: each split tends to be large, so the job cannot make good use of parallel computing.

What's more, when parallelism inference is enabled for the Hive source, the source parallelism is set to the number of splits as long as it does not exceed the max parallelism. A small number of splits therefore limits the source parallelism and can degrade performance.

To solve this problem, the idea is to calculate a reasonable split size based on the files' total size, the block size, and the default parallelism or the parallelism configured by the user.


> Improve splitting files for Hive source
> ---------------------------------------
>
>                 Key: FLINK-27338
>                 URL: https://issues.apache.org/jira/browse/FLINK-27338
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / Hive
>            Reporter: luoyuxia
>            Priority: Major
>
> Currently, the Hive source uses the HDFS block size, configured with the key dfs.block.size in hdfs-site.xml, as the maximum split size when splitting files. The default value is usually 128 MB or 256 MB, depending on the configuration.
> This splitting strategy is not ideal: the number of splits tends to be small, so the job cannot make good use of parallel computing.
> What's more, when parallelism inference is enabled for the Hive source, the source parallelism is set to the number of splits as long as it does not exceed the max parallelism. A small number of splits therefore limits the source parallelism and can degrade performance.
> To solve this problem, the idea is to calculate a reasonable split size based on the files' total size, the block size, and the default parallelism or the parallelism configured by the user.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)