Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2022/04/05 07:44:00 UTC
[jira] [Commented] (IMPALA-10737) Optimize Iceberg metadata handling

    [ https://issues.apache.org/jira/browse/IMPALA-10737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517283#comment-17517283 ] 

ASF subversion and git services commented on IMPALA-10737:
----------------------------------------------------------

Commit efba58f5f05da5dc1f1f5bb3c6fd812bf7f679b9 in impala's branch refs/heads/master from Tamas Mate
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=efba58f5f ]

IMPALA-10737: Optimize the number of Iceberg API Metadata requests

Iceberg stores the table metadata next to the data files. When the
metadata is accessed through the Iceberg API, a filesystem call is
executed (HDFS, S3, ADLS). These calls were made in various places
during query processing. This patch unifies the Iceberg metadata
requests in the CatalogD and ImpalaD:
 - CatalogD loads and caches the org.apache.iceberg.Table object.
 - When ImpalaDs request the Table metadata, the current catalog
   snapshot id is sent over and the ImpalaD loads and caches the
   org.apache.iceberg.Table object through the Iceberg API as well.
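
For illustration only, the caching idea in the two points above can be
sketched as below. This is not the code from this patch: the class
SnapshotKeyedCache, its methods, and the loader function are hypothetical
names, and the loader merely stands in for the filesystem-backed Iceberg
table load. The sketch assumes one cached object per table name, reloaded
only when the requested snapshot id differs from the cached one.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/**
 * Hypothetical sketch: caches one loaded table object per table name and
 * reloads it only when the requested snapshot id changes.
 */
class SnapshotKeyedCache<T> {
  /** A cached value together with the snapshot id it was loaded for. */
  private static final class Entry<V> {
    final long snapshotId;
    final V table;

    Entry(long snapshotId, V table) {
      this.snapshotId = snapshotId;
      this.table = table;
    }
  }

  private final Map<String, Entry<T>> cache = new ConcurrentHashMap<>();
  // Stand-in for the expensive load (assumption: a filesystem round trip).
  private final Function<String, T> loader;

  SnapshotKeyedCache(Function<String, T> loader) {
    this.loader = loader;
  }

  /** Returns the cached object if the snapshot id matches, else reloads. */
  T get(String tableName, long snapshotId) {
    Entry<T> entry = cache.compute(tableName, (name, old) ->
        (old != null && old.snapshotId == snapshotId)
            ? old
            : new Entry<T>(snapshotId, loader.apply(name)));
    return entry.table;
  }

  /** Drops the cached object so that the next get() reloads it. */
  void invalidate(String tableName) {
    cache.remove(tableName);
  }

  public static void main(String[] args) {
    AtomicInteger loads = new AtomicInteger();
    SnapshotKeyedCache<String> cache =
        new SnapshotKeyedCache<>(name -> name + "#" + loads.incrementAndGet());

    cache.get("db.t", 1L);  // first request: loader is called
    cache.get("db.t", 1L);  // same snapshot id: served from the cache
    System.out.println("loads after two requests: " + loads.get());

    cache.get("db.t", 2L);  // new snapshot id: loader is called again
    System.out.println("loads after snapshot change: " + loads.get());
  }
}
```

In this sketch a changed snapshot id or an explicit invalidate() is the
only thing that triggers a reload, so repeated requests for the same
snapshot do not repeat the expensive load.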

This approach (loading the Iceberg table twice) was chosen because
the org.apache.iceberg.Table could not be meaningfully serialized and
deserialized. The result of serializing a Table is a lightweight
SerializableTable object, which lives in the Iceberg core package.

As a result, REFRESH/INVALIDATE METADATA is required to pick up any
Iceberg metadata changes, and the metadata load time is improved.
This improvement is more significant for smaller queries, where the
metadata request has a larger impact on the query execution time.

Additionally, the dependency on the Iceberg core package has been
reduced: most uses of the TableMetadata/BaseTable classes have been
replaced with the Table class from the Iceberg api package.

Testing:
 - Passed Iceberg E2E tests.

Change-Id: I5492e0cdb31602f0276029c2645d14ff5cb2f672
Reviewed-on: http://gerrit.cloudera.org:8080/18353
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Optimize Iceberg metadata handling
> ----------------------------------
>
>                 Key: IMPALA-10737
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10737
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>            Reporter: Zoltán Borók-Nagy
>            Assignee: Tamas Mate
>            Priority: Major
>              Labels: impala-iceberg
>             Fix For: Impala 4.1.0
>
>
> Currently we re-read Iceberg table metadata in several cases.
> We should rather keep it in memory and use it when possible.
> Also, when refreshing a table we should use Iceberg's refresh() API to avoid unnecessary re-reads of manifest files:
> [https://github.com/apache/iceberg/blob/282b6f9f1cae8d4fd5ff7c73de513ca91f01fddc/core/src/main/java/org/apache/iceberg/TableOperations.java#L45]
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
