You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Gabor Kaszab (Jira)" <ji...@apache.org> on 2024/03/12 14:17:00 UTC
[jira] [Updated] (IMPALA-12894) Optimized count(*) for Iceberg gives wrong results after a Spark rewrite_data_files
[ https://issues.apache.org/jira/browse/IMPALA-12894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gabor Kaszab updated IMPALA-12894:
----------------------------------
Attachment: count_star_correctness_repro.tar.gz
> Optimized count(*) for Iceberg gives wrong results after a Spark rewrite_data_files
> -----------------------------------------------------------------------------------
>
> Key: IMPALA-12894
> URL: https://issues.apache.org/jira/browse/IMPALA-12894
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Affects Versions: Impala 4.3.0
> Reporter: Gabor Kaszab
> Priority: Critical
> Labels: correctness, impala-iceberg
> Attachments: count_star_correctness_repro.tar.gz
>
>
> Issue was introduced by https://issues.apache.org/jira/browse/IMPALA-11802 that implemented an optimized way to get results for count(*). However, if the table was compacted by Spark this optimization can give incorrect results.
> The reason is that Spark can[ skip dropping delete files|https://iceberg.apache.org/docs/latest/spark-procedures/#rewrite_position_delete_files] that are pointing to compacted data files, as a result there might be delete files after compaction that are no longer applied to any data files.
> Repro:
> With Impala
> {code:java}
> create table default.iceberg_testing (id int, j bigint) STORED AS ICEBERG
> TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
> 'iceberg.catalog_location'='/tmp/spark_iceberg_catalog/',
> 'iceberg.table_identifier'='iceberg_testing',
> 'format-version'='2');
> insert into iceberg_testing values
> (1, 1), (2, 4), (3, 9), (4, 16), (5, 25);
> update iceberg_testing set j = -100 where id = 4;
> delete from iceberg_testing where id = 4;{code}
> Count * returns 4 at this point.
> Run compaction in Spark:
> {code:java}
> spark.sql(s"CALL local.system.rewrite_data_files(table => 'default.iceberg_testing', options => map('min-input-files','2') )").show() {code}
> Now count * in Impala returns 8 (might require an IM if in HadoopCatalog). Hive returns correct results. Also a SELECT * returns correct results.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org