
Posted to issues-all@impala.apache.org by "Li Penglin (Jira)" <ji...@apache.org> on 2023/03/08 09:11:00 UTC

[jira] [Created] (IMPALA-11986) Optimize MIN(part_col)/ MAX(part_col)/ COUNT(DISTINCT part_col)/ queries for Iceberg tables

Li Penglin created IMPALA-11986:
-----------------------------------

             Summary: Optimize MIN(part_col)/ MAX(part_col)/ COUNT(DISTINCT part_col)/ queries for Iceberg tables
                 Key: IMPALA-11986
                 URL: https://issues.apache.org/jira/browse/IMPALA-11986
             Project: IMPALA
          Issue Type: Improvement
            Reporter: Li Penglin


For Iceberg V1 and V2 tables without deletes:
The OPTIMIZE_PARTITION_KEY_SCANS query option (https://impala.apache.org/docs/build/html/topics/impala_optimize_partition_key_scans.html) optimizes MIN(key_column), MAX(key_column), and COUNT(DISTINCT key_column) by answering them from the partition key values stored in the HMS metadata (the 'TBLS' and 'PARTITION_KEY_VALS' tables). For Iceberg tables, partition information is not stored in the HMS, but it can be obtained through the Iceberg API. We can apply the same idea to speed up MIN(key_column), MAX(key_column), and COUNT(DISTINCT key_column) on Iceberg tables, but only when the partition transform of the column is 'identity', because only then is the partition value equal to the column value.
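
As a rough illustration of the idea above, here is a minimal Python sketch that computes the three aggregates from per-file partition values only. It is not Impala or Iceberg code: PartitionField, DataFile, and partition_key_aggregates are hypothetical names, and the per-file partition values are assumed to have been fetched from Iceberg metadata beforehand. Returning None means "not applicable, fall back to a normal scan".

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class PartitionField:
    """Hypothetical stand-in for one field of an Iceberg partition spec."""
    source_column: str
    transform: str  # e.g. "identity", "bucket[16]", "day"


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for the metadata of one data file."""
    path: str
    partition: Dict[str, Any]  # source column name -> partition value
    record_count: int


def partition_key_aggregates(
    fields: List[PartitionField], files: List[DataFile], column: str
) -> Optional[Tuple[Any, Any, int]]:
    """Return (min, max, distinct_count) for `column` from metadata only.

    Returns None when the optimization is not applicable.
    """
    # Only an identity transform keeps the partition value equal to the
    # column value; bucket/truncate/day etc. do not.
    if not any(
        f.source_column == column and f.transform == "identity" for f in fields
    ):
        return None

    values = set()
    for f in files:
        if f.record_count == 0:
            continue  # an empty file contributes no rows
        if column not in f.partition:
            return None  # e.g. file written under a different spec: give up
        value = f.partition[column]
        if value is not None:  # MIN/MAX/COUNT(DISTINCT) ignore NULLs
            values.add(value)

    if not values:
        return (None, None, 0)
    return (min(values), max(values), len(values))
```

The sketch assumes there are no delete files, matching the scope stated above; with deletes, a partition value could refer to rows that no longer exist.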
For non-partition columns, if min/max information is stored in the Iceberg metadata, could MIN(column) and MAX(column) queries also be optimized in the same way?
However, Impala does not guarantee that the statistics for these non-partition columns are complete, which makes this case less clear-cut.
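
To make the completeness concern concrete, the following Python sketch derives MIN/MAX from per-file column bounds and gives up as soon as any file lacks them. FileStats and min_max_from_stats are hypothetical names, not Impala or Iceberg APIs; the bounds are assumed to be exact (not truncated) and the table is assumed to have no delete files.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file column statistics (column name -> bound)."""
    path: str
    lower_bounds: Dict[str, Any] = field(default_factory=dict)
    upper_bounds: Dict[str, Any] = field(default_factory=dict)


def min_max_from_stats(
    files: List[FileStats], column: str
) -> Optional[Tuple[Any, Any]]:
    """Return (min, max) of `column` from file-level bounds.

    Returns None when the statistics are incomplete, in which case the
    caller should fall back to scanning the data.
    """
    lows, highs = [], []
    for f in files:
        # A single file without bounds makes the metadata-only answer
        # unreliable, so the whole optimization is abandoned.
        if column not in f.lower_bounds or column not in f.upper_bounds:
            return None
        lows.append(f.lower_bounds[column])
        highs.append(f.upper_bounds[column])

    if not lows:
        return None  # no files at all: let the regular plan handle it
    return (min(lows), max(highs))
```

This is deliberately conservative: a file whose column holds only NULLs would carry no bounds here and would also force the fallback.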



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
