Posted to issues-all@impala.apache.org by "LiPenglin (Jira)" <ji...@apache.org> on 2022/10/15 07:32:00 UTC

[jira] [Comment Edited] (IMPALA-11608) Impala SHOW TABLE STATS shows wrong number of files for Iceberg tables

    [ https://issues.apache.org/jira/browse/IMPALA-11608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17617875#comment-17617875 ] 

LiPenglin edited comment on IMPALA-11608 at 10/15/22 7:31 AM:
--------------------------------------------------------------

Hey [~boroknagyz], could you please assign this Jira to me?

While fixing this problem, I would also like to make Impala pre-load the data_location of Iceberg tables instead of the table_location.

1. Obtain the data_location of the Iceberg table (see [https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/Table.java#L309] and [https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/LocationProviders.java#L89]) before this line: [https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/IcebergTable.java#L357]
2. Pass the data_location through hdfsTable_.load(...) to [https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L1358]
3. Make 'partDir_' in the 'FileMetadataLoader.load()' method be the data_location of the Iceberg table instead of the table_location.

 

UPDATE:

The Iceberg LocationProvider cannot guarantee that the data files are under the data_location, so this bug cannot be fixed in the way described above. I will create another Jira to track pre-loading the data_location of Iceberg tables.
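
To illustrate the point in the UPDATE, here is a minimal, self-contained sketch of a check that finds data files of the current snapshot that do not live under the data_location. It is a toy model, not Impala or Iceberg code: the class name, the method names and the example paths are all hypothetical, and the snapshot's data files are represented as plain path strings.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Toy model (hypothetical names): which snapshot data files lie outside a directory? */
final class DataLocationCheck {

  /** Returns true if filePath is located somewhere below the directory dir. */
  static boolean isUnder(String filePath, String dir) {
    // Append a trailing slash so that ".../data" does not match ".../data_old/...".
    String prefix = dir.endsWith("/") ? dir : dir + "/";
    return filePath.startsWith(prefix);
  }

  /** Returns the data files that a listing of dataLocation alone would miss. */
  static List<String> filesOutside(List<String> snapshotDataFiles, String dataLocation) {
    return snapshotDataFiles.stream()
        .filter(path -> !isUnder(path, dataLocation))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Hypothetical paths: one file under the data_location, two elsewhere.
    List<String> snapshotDataFiles = Arrays.asList(
        "hdfs://nn/wh/t/data/a.parquet",
        "hdfs://nn/legacy/t/b.parquet",
        "hdfs://nn/wh/t/data_old/c.parquet");
    System.out.println(filesOutside(snapshotDataFiles, "hdfs://nn/wh/t/data"));
  }
}
```

If such a check can return a non-empty list for a valid table, then listing only the data_location would miss files, which is why the three steps above are not sufficient as a fix.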


> Impala SHOW TABLE STATS shows wrong number of files for Iceberg tables
> ----------------------------------------------------------------------
>
>                 Key: IMPALA-11608
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11608
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>            Reporter: Zoltán Borók-Nagy
>            Priority: Major
>              Labels: impala-iceberg, ramp-up
>
> Impala SHOW TABLE STATS outputs the wrong number of files for Iceberg tables. It should count only data files, but it counts all files under the table directory, including metadata files, orphaned files, and old data files that do not belong to the current snapshot.
> It should output only the number of data files in the current snapshot, making the output consistent with SHOW FILES IN tbl;
> {noformat}
> create table test (i int) stored as iceberg;
> compute stats test;
> show table stats test;
> +-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------+
> | #Rows | #Files | Size   | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                   |
> +-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------+
> | -1    | 2      | 2.70KB | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://localhost:20500/test-warehouse/test |
> +-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------+
> {noformat}
> SHOW TABLE STATS is handled here: https://github.com/apache/impala/blob/66484a4c081f3242750a3a0e04159dd4580b37a4/fe/src/main/java/org/apache/impala/service/Frontend.java#L1429-L1457
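
The difference between the reported and the expected file count can be sketched with a small, self-contained model. This is not the Frontend.java code linked above; the class, the method names and the two metadata file names are hypothetical, chosen only so that the directory listing yields 2 files as in the output shown in the issue. In Iceberg itself the current snapshot's data files can be enumerated through a table scan (table.newScan().planFiles()), but which mechanism Impala should use is not stated in the issue.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;

/** Toy model (hypothetical names) of the two ways to compute #Files. */
final class IcebergFileCountSketch {

  /** Behaviour described as wrong: count every file found under the table directory. */
  static int countByDirectoryListing(List<String> filesUnderTableDir) {
    return filesUnderTableDir.size();
  }

  /** Expected behaviour: count the distinct data files of the current snapshot. */
  static int countCurrentSnapshotDataFiles(List<String> currentSnapshotDataFiles) {
    return new HashSet<>(currentSnapshotDataFiles).size();
  }

  public static void main(String[] args) {
    // A freshly created, empty table: only metadata files exist (names are made up).
    List<String> filesUnderTableDir = Arrays.asList(
        "hdfs://localhost:20500/test-warehouse/test/metadata/v1.metadata.json",
        "hdfs://localhost:20500/test-warehouse/test/metadata/version-hint.text");
    List<String> currentSnapshotDataFiles = Collections.emptyList();

    System.out.println("directory listing: " + countByDirectoryListing(filesUnderTableDir));
    System.out.println("current snapshot:  "
        + countCurrentSnapshotDataFiles(currentSnapshotDataFiles));
  }
}
```

For the empty table in the reproduction, the second count is what SHOW FILES IN tbl would agree with, since no data files have been written yet.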


