Posted to issues-all@impala.apache.org by "Tim Armstrong (Jira)" <ji...@apache.org> on 2020/06/02 16:47:00 UTC

[jira] [Comment Edited] (IMPALA-8081) Avoid over-parallelizing queries when there are small input splits

    [ https://issues.apache.org/jira/browse/IMPALA-8081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124041#comment-17124041 ] 

Tim Armstrong edited comment on IMPALA-8081 at 6/2/20, 4:46 PM:
----------------------------------------------------------------

Some context on how this is solved in Hive-on-Tez: the default minimum split size is hardcoded to 50MB in the Tez split grouper (https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/grouper/TezSplitGrouper.java#L77), but Ambari overrides it to 16MB (https://github.com/apache/ambari/blame/5460e8952729854f1c032a781c9a8de608ba4475/ambari-web/app/assets/data/configurations/config_versions.json#L1611). The 16MB value is probably the one more commonly used in production for Hive LLAP.

See also https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works
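
To make the idea of a minimum split size concrete, here is a rough sketch of grouping small splits until each group reaches a minimum number of bytes. This is not Tez's actual grouping algorithm (which also considers locality, a maximum size, and the desired number of tasks); the function name group_splits and its parameters are made up for illustration.

```python
MB = 1024 * 1024


def group_splits(split_sizes, min_group_bytes):
    """Greedily pack splits into groups of at least min_group_bytes each.

    Hypothetical helper for illustration only; it ignores data locality
    and any upper bound on group size.
    """
    groups = []
    current = []
    current_bytes = 0
    for size in split_sizes:
        current.append(size)
        current_bytes += size
        if current_bytes >= min_group_bytes:
            # The group is large enough to justify its own task.
            groups.append(current)
            current = []
            current_bytes = 0
    if current:
        # Leftover splits are too small for their own task: fold them into
        # the last group, or emit them alone if no group exists yet.
        if groups:
            groups[-1].extend(current)
        else:
            groups.append(current)
    return groups


if __name__ == "__main__":
    # Ten 4MB splits with a 16MB minimum collapse into two groups.
    for group in group_splits([4 * MB] * 10, 16 * MB):
        print(len(group), "splits,", sum(group) // MB, "MB")
```

With a 16MB minimum, ten 4MB splits would run as two tasks instead of ten, which is the effect the minimum split size is meant to have.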


> Avoid over-parallelizing queries when there are small input splits
> ------------------------------------------------------------------
>
>                 Key: IMPALA-8081
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8081
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>            Reporter: Janaki Lahorani
>            Priority: Major
>              Labels: multithreading
>
> Currently we maximise parallelism given the number of input splits available. This is often a good decision, unless there are very many small input splits, particularly small files. We could avoid this pathological behaviour by enforcing a minimum threshold of input bytes per instance (this is still pretty crude, since the number of input bytes correlates only loosely with the amount of work required).
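
The threshold proposed in the description could look roughly like the sketch below. It is only an illustration under stated assumptions, not Impala's scheduler code: the function target_instance_count and the parameters max_instances and min_bytes_per_instance are hypothetical names, and the 64MB threshold in the example is an arbitrary value.

```python
MB = 1024 * 1024


def target_instance_count(split_sizes, max_instances, min_bytes_per_instance):
    """Pick an instance count so each instance gets enough input bytes.

    Hypothetical helper for illustration; input bytes are only a crude
    proxy for the amount of work, as the issue description notes.
    """
    if not split_sizes:
        return 0
    total_bytes = sum(split_sizes)
    # How many instances the data volume can justify (always at least one).
    by_bytes = max(1, total_bytes // min_bytes_per_instance)
    # Never exceed the number of splits or the configured maximum.
    return min(len(split_sizes), max_instances, by_bytes)


if __name__ == "__main__":
    # 1000 small 1MB files: splits alone would allow 48 instances,
    # but the byte threshold caps the count well below that.
    print(target_instance_count([1 * MB] * 1000, 48, 64 * MB))
    # 100 large 1GB splits: the configured maximum is the limiting factor.
    print(target_instance_count([1024 * MB] * 100, 48, 64 * MB))
```

For many small files the byte threshold becomes the limiting factor, while for large inputs the count is still bounded by the configured maximum, so the existing behaviour is unchanged there.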


