Posted to issues@impala.apache.org by "Zoltán Borók-Nagy (Jira)" <ji...@apache.org> on 2022/12/15 15:58:00 UTC
[jira] [Created] (IMPALA-11802) Optimize count(*) queries for Iceberg V2 tables
Zoltán Borók-Nagy created IMPALA-11802:
------------------------------------------
Summary: Optimize count(*) queries for Iceberg V2 tables
Key: IMPALA-11802
URL: https://issues.apache.org/jira/browse/IMPALA-11802
Project: IMPALA
Issue Type: Bug
Components: Frontend
Reporter: Zoltán Borók-Nagy
Simple {{SELECT count( * ) FROM ice_v2_tbl;}} queries could be optimized.
First we need to investigate whether the following is true:
If a V2 table has only position delete files, then the table's cardinality is
{noformat}
Cardinality(data files) - Cardinality(delete files)
{noformat}
If this is true, we can answer count( * ) queries via a query rewrite, similar to what we do for V1 tables (IMPALA-11279).
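For illustration, the metadata-only computation could be sketched as below. This is only a sketch under stated assumptions: FileStats and metadata_only_count are hypothetical names, not Impala or Iceberg APIs, and the result is only correct if every position delete record removes exactly one distinct, existing row, which is the point that still needs investigation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical stand-in for per-file metadata (not an Impala/Iceberg class)."""
    path: str
    record_count: int


def metadata_only_count(data_files, position_delete_files):
    """Return Cardinality(data files) - Cardinality(delete files).

    Assumption: each position delete record removes exactly one distinct,
    existing row. Whether that always holds is the open question above.
    """
    data_rows = sum(f.record_count for f in data_files)
    deleted_rows = sum(f.record_count for f in position_delete_files)
    return data_rows - deleted_rows


data = [FileStats("a.parquet", 100), FileStats("b.parquet", 50)]
deletes = [FileStats("d1.parquet", 7)]
print(metadata_only_count(data, deletes))  # 150 data rows - 7 delete records
```

No data or delete file contents are read here; only the record counts from metadata are summed.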
If the above is not true, we can still optimize count( * ) queries with the following plan:
{noformat}
                SUM
                 |
               UNION
              /     \
             /       \
            /         \
      COUNT(*)       COUNT(*)
         /               \
       SCAN           ANTI JOIN
    data files         /     \
     without          /       \
     deletes        SCAN      SCAN
                 data files  delete files
                with deletes
{noformat}
The SCAN operator over "data files without deletes" could benefit from the count( * ) optimization, since it would only need to read file metadata. In the common case (when there are few deletes), this SCAN is responsible for the vast majority of data files.
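A toy model of this plan, to make the two branches concrete. This is a sketch only, not how the planner would implement it: count_star is a hypothetical name, data files are modelled as a mapping from path to metadata record count, and position deletes as (file path, row position) pairs. The duplicate delete record in the example is a made-up case showing an input for which plain subtraction of delete-file cardinality would be off.

```python
def count_star(data_files, position_deletes):
    """Toy model of SUM(UNION(COUNT(*), COUNT(*))) from the plan above.

    data_files: dict of file path -> record count taken from file metadata.
    position_deletes: list of (file path, row position) pairs from delete files.
    """
    deleted = set(position_deletes)  # duplicate delete records collapse here
    files_with_deletes = {path for path, _ in deleted}

    # Left branch: files that no delete record references.
    # Only the metadata record counts are needed.
    left = sum(
        count for path, count in data_files.items()
        if path not in files_with_deletes
    )

    # Right branch: anti join the rows of the remaining files against the deletes.
    right = sum(
        1
        for path in files_with_deletes
        if path in data_files
        for pos in range(data_files[path])
        if (path, pos) not in deleted
    )
    return left + right


data_files = {"a.parquet": 1000, "b.parquet": 500, "c.parquet": 10}
# Position 3 of c.parquet is (hypothetically) deleted twice.
position_deletes = [("c.parquet", 0), ("c.parquet", 3), ("c.parquet", 3)]
print(count_star(data_files, position_deletes))
```

In this example only c.parquet has to be scanned row by row; a.parquet and b.parquet contribute their metadata counts directly.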