Posted to issues@hive.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2020/11/13 01:43:00 UTC

[jira] [Updated] (HIVE-24262) Optimise NullScanTaskDispatcher for cloud storage

     [ https://issues.apache.org/jira/browse/HIVE-24262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HIVE-24262:
----------------------------------
    Labels: pull-request-available  (was: )

> Optimise NullScanTaskDispatcher for cloud storage
> -------------------------------------------------
>
>                 Key: HIVE-24262
>                 URL: https://issues.apache.org/jira/browse/HIVE-24262
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Assignee: Mustafa İman
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> {noformat}
> select count(DISTINCT ss_sold_date_sk) from store_sales;
> ----------------------------------------------------------------------------------------------
>         VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
> ----------------------------------------------------------------------------------------------
> Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0
> Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0
> ----------------------------------------------------------------------------------------------
> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 5.55 s
> ----------------------------------------------------------------------------------------------
> INFO  : Status: DAG finished successfully in 5.44 seconds
> INFO  :
> INFO  : Query Execution Summary
> INFO  : ----------------------------------------------------------------------------------------------
> INFO  : OPERATION                            DURATION
> INFO  : ----------------------------------------------------------------------------------------------
> INFO  : Compile Query                         102.02s
> INFO  : Prepare Plan                            0.51s
> INFO  : Get Query Coordinator (AM)              0.01s
> INFO  : Submit Plan                             0.33s
> INFO  : Start DAG                               0.56s
> INFO  : Run DAG                                 5.44s
> INFO  : ----------------------------------------------------------------------------------------------
> {noformat}
> The 102-second compilation time is caused by an "isEmptyPath" check that is performed for every partition path, which takes a long time in the compilation phase.
> If all the paths share the same parent directory, we could do a single recursive listing of that parent on cloud storage systems, instead of listing each directory sequentially, one at a time.
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/NullScanTaskDispatcher.java#L158
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/NullScanTaskDispatcher.java#L121
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/NullScanTaskDispatcher.java#L101
> With a temporary, hacky fix, it comes down from 100+ seconds to about 2 seconds.
> {noformat}
> INFO  : Dag name: select count(DISTINCT ss_sold_...store_sales (Stage-1)
> INFO  : Status: Running (Executing on YARN cluster with App id application_1602500203747_0003)
> ----------------------------------------------------------------------------------------------
>         VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
> ----------------------------------------------------------------------------------------------
> Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0
> Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0
> ----------------------------------------------------------------------------------------------
> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 1.23 s
> ----------------------------------------------------------------------------------------------
> INFO  : Status: DAG finished successfully in 1.20 seconds
> INFO  :
> INFO  : Query Execution Summary
> INFO  : ----------------------------------------------------------------------------------------------
> INFO  : OPERATION                            DURATION
> INFO  : ----------------------------------------------------------------------------------------------
> INFO  : Compile Query                           0.85s
> INFO  : Prepare Plan                            0.17s
> INFO  : Get Query Coordinator (AM)              0.00s
> INFO  : Submit Plan                             0.03s
> INFO  : Start DAG                               0.03s
> INFO  : Run DAG                                 1.20s
> INFO  : ----------------------------------------------------------------------------------------------
> {noformat}
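
The single recursive listing proposed in the description can be sketched roughly as follows. This is only an illustration of the idea, not the actual patch: it uses java.nio.file on a local directory so that it runs standalone, and the class name RecursiveEmptyCheck, the method nonEmptyPartitions, and the file name 000000_0 are hypothetical. In Hive the equivalent would presumably go through the Hadoop FileSystem API (for example a recursive FileSystem.listFiles call), and a real implementation would likely also need to skip hidden or temporary files, which this sketch does not do.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;

public class RecursiveEmptyCheck {

    /**
     * Returns the subset of partitionDirs that contain at least one regular
     * file, found with one recursive walk of the common parent instead of
     * one listing per partition directory.
     */
    static Set<Path> nonEmptyPartitions(Path tableRoot, Set<Path> partitionDirs)
            throws IOException {
        Set<Path> nonEmpty = new HashSet<>();
        try (Stream<Path> walk = Files.walk(tableRoot)) {
            walk.filter(Files::isRegularFile).forEach(file -> {
                // Climb from the file towards the root until a known
                // partition directory is reached, then mark it non-empty.
                for (Path dir = file.getParent();
                     dir != null && dir.startsWith(tableRoot);
                     dir = dir.getParent()) {
                    if (partitionDirs.contains(dir)) {
                        nonEmpty.add(dir);
                        break;
                    }
                }
            });
        }
        return nonEmpty;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical layout: two partitions, only the first holds a file.
        Path root = Files.createTempDirectory("store_sales");
        Path p1 = root.resolve("ss_sold_date_sk=1");
        Path p2 = root.resolve("ss_sold_date_sk=2");
        Files.createDirectories(p1);
        Files.createDirectories(p2);
        Files.createFile(p1.resolve("000000_0"));

        Set<Path> nonEmpty = nonEmptyPartitions(root, Set.of(p1, p2));
        System.out.println(nonEmpty.contains(p1)); // prints true
        System.out.println(nonEmpty.contains(p2)); // prints false
    }
}
```

Any partition path that is absent from the returned set can be treated as empty. The trade-off is that the walk visits every file under the parent, so this only pays off when most of the children of that parent are partitions of interest and a recursive listing is cheap compared with many individual listings, as is typically the case on object stores.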



--
This message was sent by Atlassian Jira
(v8.3.4#803005)