Posted to issues-all@impala.apache.org by "Zoltán Borók-Nagy (Jira)" <ji...@apache.org> on 2023/01/03 17:52:00 UTC
[jira] [Commented] (IMPALA-11802) Optimize count(*) queries for Iceberg V2 tables

    [ https://issues.apache.org/jira/browse/IMPALA-11802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17654121#comment-17654121 ] 

Zoltán Borók-Nagy commented on IMPALA-11802:
--------------------------------------------

I think we should assume that Cardinality(table) is not necessarily equal to Cardinality(data files) - Cardinality(delete files), for the following reasons:

 * Concurrent deletes might create delete files that reference the same rows:
 ** [https://github.com/apache/iceberg/blob/cecb10bb8ab0458fb3f6a650692a8e432f08cbd2/api/src/main/java/org/apache/iceberg/RowDelta.java#L131-L133]
 * Partial compactions, e.g.:
 ## Table has data files: A, B, X and delete file: D
 ## D references A and B
 ## Now we rewrite the small files, which are A and X
 ## So now the table has data files AX', B, and delete file D
 ## D still contains entries for rows in A, which no longer exists, so it's clear that numRows(table) is not equal to numRows(dataFiles) - numRows(deleteFiles)
 ## (The above could be fixed by rewriting the delete file to D' so that it only references rows in B. AFAICT Iceberg does not do that.)
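
Both cases can be illustrated with a small sketch. It models a position delete file as a list of (data file, row position) entries. The file names, row counts, and helper functions are made up for illustration and are not Iceberg or Impala APIs. It also assumes that rewriting A and X applies the deletes that targeted A.

```python
def naive_count(data_files, delete_files):
    """numRows(dataFiles) - numRows(deleteFiles), using only per-file record counts."""
    return sum(data_files.values()) - sum(len(d) for d in delete_files)


def actual_count(data_files, delete_files):
    """Row count after applying position deletes.

    Duplicate entries and entries that point at missing files have no effect.
    """
    live = {(name, pos) for name, n in data_files.items() for pos in range(n)}
    deleted = {entry for d in delete_files for entry in d}
    return len(live - deleted)


# Case 1: two concurrent writers both delete row 0 of data file A.
data = {"A": 10}
d1 = [("A", 0), ("A", 1)]
d2 = [("A", 0)]
concurrent = (naive_count(data, [d1, d2]), actual_count(data, [d1, d2]))

# Case 2: partial compaction. D references rows in A and B.
D = [("A", 0), ("A", 1), ("B", 3)]
# A (10 rows, 2 deleted) and X (5 rows) are rewritten into AX' (13 rows).
# D is left untouched, so its two entries for A now point at a missing file.
after = {"AX'": 13, "B": 10}
compaction = (naive_count(after, [D]), actual_count(after, [D]))

print(concurrent)   # (7, 8)
print(compaction)   # (20, 22)
```

In both cases the naive formula undercounts, because it subtracts delete entries that do not remove any live row.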

> Optimize count(*) queries for Iceberg V2 tables
> -----------------------------------------------
>
>                 Key: IMPALA-11802
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11802
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>            Reporter: Zoltán Borók-Nagy
>            Priority: Major
>              Labels: impala-iceberg
>
> Simple {{SELECT count( * ) FROM ice_v2_tbl;}} could be optimized.
> First we need to investigate whether the following is true:
> If a V2 table only has position delete files, then the cardinality is
> {noformat}
> Cardinality(data files) - Cardinality(delete files)
> {noformat}
> If this is true, then we can answer count( * ) queries via a query rewrite similarly to what we do for V1 tables: IMPALA-11279
> If the above is not true, we can still optimize count( * ) queries by:
> {noformat}
>         SUM
>          |
>      UNION ALL
>       /     \
>      /       \
>     /         \
> COUNT(*)     COUNT(*)
>   /                \
> SCAN             ANTI JOIN
> data files         /      \
> without           /        \
> deletes       SCAN         SCAN
>               data files   delete files
>               with deletes
> {noformat}
> The SCAN operator with "data files without deletes" could benefit from count( * ) optimization (they would only need to read file metadata). In the common case (when there are few deletes) this SCAN is in charge of scanning the vast majority of data files.
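
The plan above can be sketched in a few lines. This is only a model of the idea, not Impala's implementation: the function name and the in-memory representation (a dict of per-file record counts, a list of position delete entries) are assumptions made for illustration.

```python
def count_star(data_files, position_deletes):
    """Model of SUM(UNION ALL(COUNT(*), COUNT(*))) from the plan above.

    data_files: dict mapping data file name -> record count from file metadata
    position_deletes: iterable of (data file name, row position) entries
    """
    # Group delete positions by the data file they reference. Using sets
    # collapses duplicates; entries for files not in the table are dropped.
    deleted_by_file = {}
    for name, pos in position_deletes:
        if name in data_files:
            deleted_by_file.setdefault(name, set()).add(pos)

    # Left branch: data files without deletes, answered from metadata only.
    metadata_count = sum(
        n for name, n in data_files.items() if name not in deleted_by_file
    )

    # Right branch: data files with deletes, counted row by row with an
    # anti join against the delete positions.
    anti_join_count = 0
    for name, deleted in deleted_by_file.items():
        anti_join_count += sum(
            1 for pos in range(data_files[name]) if pos not in deleted
        )

    return metadata_count + anti_join_count


files = {"A": 10, "B": 10, "C": 1000}
deletes = [("A", 0), ("A", 0), ("B", 3), ("gone", 1)]
print(count_star(files, deletes))  # 1018
```

Only A and B are scanned row by row here; C, which holds most of the rows, contributes its count straight from metadata.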


