You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Rajesh Balamohan (Jira)" <ji...@apache.org> on 2022/05/02 13:47:00 UTC
[jira] [Reopened] (HIVE-26185) Need support for metadataonly operations with iceberg (e.g select distinct on partition column)

     [ https://issues.apache.org/jira/browse/HIVE-26185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan reopened HIVE-26185:
-------------------------------------

> Need support for metadataonly operations with iceberg (e.g select distinct on partition column)
> -----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-26185
>                 URL: https://issues.apache.org/jira/browse/HIVE-26185
>             Project: Hive
>          Issue Type: Bug
>          Components: HiveServer2
>            Reporter: Rajesh Balamohan
>            Priority: Major
>              Labels: performance
>
> {noformat}
> select distinct ss_sold_date_sk from store_sales
> {noformat}
> This query scans 1800+ rows in hive acid. But takes ages to process with NullScanOptimiser during compilation phase (https://issues.apache.org/jira/browse/HIVE-24262)
> {noformat}
> Hive ACID
> INFO  : Executing command(queryId=hive_20220427233926_282bc9d8-220c-4a09-928d-411601c2ef14): select distinct ss_sold_date_sk from store_sales
> INFO  : Compute 'ndembla-test2' is active.
> INFO  : Query ID = hive_20220427233926_282bc9d8-220c-4a09-928d-411601c2ef14
> INFO  : Total jobs = 1
> INFO  : Launching Job 1 out of 1
> INFO  : Starting task [Stage-1:MAPRED] in serial mode
> INFO  : Subscribed to counters: [] for queryId: hive_20220427233926_282bc9d8-220c-4a09-928d-411601c2ef14
> INFO  : Tez session hasn't been created yet. Opening session
> INFO  : Dag name: select distinct ss_sold_date_s...store_sales (Stage-1)
> INFO  : Status: Running (Executing on YARN cluster with App id application_1651102345385_0000)
> INFO  : Status: DAG finished successfully in 1.81 seconds
> INFO  : DAG ID: dag_1651102345385_0000_5
> INFO  :
> INFO  : Query Execution Summary
> INFO  : ----------------------------------------------------------------------------------------------
> INFO  : OPERATION                            DURATION
> INFO  : ----------------------------------------------------------------------------------------------
> INFO  : Compile Query                          55.47s
> INFO  : Prepare Plan                            2.32s
> INFO  : Get Query Coordinator (AM)              0.13s
> INFO  : Submit Plan                             0.03s
> INFO  : Start DAG                               0.09s
> INFO  : Run DAG                                 1.80s
> INFO  : ----------------------------------------------------------------------------------------------
> INFO  :
> INFO  : Task Execution Summary
> INFO  : ----------------------------------------------------------------------------------------------
> INFO  :   VERTICES      DURATION(ms)   CPU_TIME(ms)    GC_TIME(ms)   INPUT_RECORDS   OUTPUT_RECORDS
> INFO  : ----------------------------------------------------------------------------------------------
> INFO  :      Map 1           1009.00              0              0           1,824            1,824
> INFO  :  Reducer 2              0.00              0              0           1,824                0
> INFO  : ----------------------------------------------------------------------------------------------
> INFO  :
> {noformat}
> However, same query scans *2.8 Billion records.* in iceberg format. This can be fixed.
> {noformat}
> INFO  : Executing command(queryId=hive_20220427233519_cddc6dd1-95a3-4f0e-afa5-e11e9dc5fa72): select distinct ss_sold_date_sk from store_sales
> INFO  : Compute 'ndembla-test2' is active.
> INFO  : Query ID = hive_20220427233519_cddc6dd1-95a3-4f0e-afa5-e11e9dc5fa72
> INFO  : Total jobs = 1
> INFO  : Launching Job 1 out of 1
> INFO  : Starting task [Stage-1:MAPRED] in serial mode
> INFO  : Subscribed to counters: [] for queryId: hive_20220427233519_cddc6dd1-95a3-4f0e-afa5-e11e9dc5fa72
> INFO  : Tez session hasn't been created yet. Opening session
> INFO  : Dag name: select distinct ss_sold_date_s...store_sales (Stage-1)
> INFO  : Status: Running (Executing on YARN cluster with App id application_1651102345385_0000)
> ----------------------------------------------------------------------------------------------
>         VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
> ----------------------------------------------------------------------------------------------
> Map 1 ..........      llap     SUCCEEDED   7141       7141        0        0       0       0
> Reducer 2 ......      llap     SUCCEEDED      2          2        0        0       0       0
> ----------------------------------------------------------------------------------------------
> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 18.48 s
> ----------------------------------------------------------------------------------------------
> INFO  : Status: DAG finished successfully in 17.97 seconds
> INFO  : DAG ID: dag_1651102345385_0000_4
> INFO  :
> INFO  : Query Execution Summary
> INFO  : ----------------------------------------------------------------------------------------------
> INFO  : OPERATION                            DURATION
> INFO  : ----------------------------------------------------------------------------------------------
> INFO  : Compile Query                           1.81s
> INFO  : Prepare Plan                            0.04s
> INFO  : Get Query Coordinator (AM)              0.14s
> INFO  : Submit Plan                             0.02s
> INFO  : Start DAG                               0.03s
> INFO  : Run DAG                                17.97s
> INFO  : ----------------------------------------------------------------------------------------------
> INFO  :
> INFO  : Task Execution Summary
> INFO  : ----------------------------------------------------------------------------------------------
> INFO  :   VERTICES      DURATION(ms)   CPU_TIME(ms)    GC_TIME(ms)   INPUT_RECORDS   OUTPUT_RECORDS
> INFO  : ----------------------------------------------------------------------------------------------
> INFO  :      Map 1          15142.00              0              0  28,800,426,268            8,611
> INFO  :  Reducer 2            956.00              0              0           8,611                0
> INFO  : ----------------------------------------------------------------------------------------------
> INFO  :
> {noformat} 
> It will be beneficial to make use of MetadataOnlyOptimizer for iceberg tables as well. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)