Posted to issues-all@impala.apache.org by "Zoltán Borók-Nagy (Jira)" <ji...@apache.org> on 2021/01/18 15:34:00 UTC

[jira] [Updated] (IMPALA-10254) Load data files via Iceberg for Iceberg Tables

     [ https://issues.apache.org/jira/browse/IMPALA-10254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zoltán Borók-Nagy updated IMPALA-10254:
---------------------------------------
    Description: 
Currently we still load the file descriptors of an Iceberg table via recursive file listing.

This lists too many files, e.g. metadata files, files that are being written (can later throw checksum errors), files from aborted INSERTs, removed files, etc.

We should use the Iceberg API to load the file descriptors corresponding to the table snapshot. Iceberg DataFiles might also already contain the split offsets.

  was:
Currently we still load the file descriptors of an Iceberg table via recursive file listing.

This lists too many files, e.g. metadata files, files that are being written (can later throw checksum errors), files from aborted INSERTs, removed files, etc.

We should use the Iceberg API to load the file descriptors corresponding to the table snapshot.

Note that we already load data files through the Iceberg APIs to fill the 'path_hash_to_file_descriptor' map (https://github.com/apache/impala/blob/master/common/thrift/CatalogObjects.thrift#L551).


> Load data files via Iceberg for Iceberg Tables
> ----------------------------------------------
>
>                 Key: IMPALA-10254
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10254
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Catalog
>            Reporter: Zoltán Borók-Nagy
>            Priority: Major
>              Labels: impala-iceberg
>
> Currently we still load the file descriptors of an Iceberg table via recursive file listing.
> This lists too many files, e.g. metadata files, files that are being written (can later throw checksum errors), files from aborted INSERTs, removed files, etc.
> We should use the Iceberg API to load the file descriptors corresponding to the table snapshot. Iceberg DataFiles might also already contain the split offsets.
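
To illustrate the difference described above, here is a minimal, self-contained sketch that contrasts a recursive file listing with loading only the files recorded in a table snapshot. It is not Impala or Iceberg code: the directory layout, the JSON "snapshot-1.json" file, and all function names are hypothetical stand-ins (real Iceberg manifests are not plain JSON like this). In the Iceberg Java API the snapshot-based approach would presumably go through something like Table.newScan().planFiles() and DataFile.splitOffsets(), but that is an assumption about how the fix could look, not a description of the actual patch.

```python
import json
import os
import tempfile


def list_recursively(table_root):
    """Return every file under table_root (the approach the issue wants to replace)."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(table_root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.append(os.path.relpath(full, table_root).replace(os.sep, "/"))
    return sorted(found)


def load_from_snapshot(table_root, snapshot_name):
    """Return {path: split_offsets} for the data files recorded in one snapshot.

    The snapshot format here is a hypothetical simplification used only for
    this sketch.
    """
    with open(os.path.join(table_root, "metadata", snapshot_name)) as f:
        snapshot = json.load(f)
    return {e["path"]: e["split_offsets"] for e in snapshot["data_files"]}


def build_demo_table(table_root):
    """Create a toy table directory with committed and stray files."""
    files = [
        "data/00000-0-committed.parquet",
        "data/00001-0-committed.parquet",
        "data/00002-0-aborted-insert.parquet",    # left over from an aborted INSERT
        "data/.00003-0-in-progress.parquet.tmp",  # still being written
        "metadata/v1.metadata.json",              # metadata, not a data file
    ]
    for rel in files:
        full = os.path.join(table_root, *rel.split("/"))
        os.makedirs(os.path.dirname(full), exist_ok=True)
        with open(full, "wb") as f:
            f.write(b"")
    # Only the committed files are part of the snapshot.
    snapshot = {
        "data_files": [
            {"path": "data/00000-0-committed.parquet", "split_offsets": [4]},
            {"path": "data/00001-0-committed.parquet", "split_offsets": [4, 1024]},
        ]
    }
    with open(os.path.join(table_root, "metadata", "snapshot-1.json"), "w") as f:
        json.dump(snapshot, f)


def demo():
    """Build the toy table and return (recursive listing, snapshot files)."""
    with tempfile.TemporaryDirectory() as root:
        build_demo_table(root)
        return list_recursively(root), load_from_snapshot(root, "snapshot-1.json")


if __name__ == "__main__":
    listed, from_snapshot = demo()
    print("recursive listing:", listed)
    print("snapshot files:   ", from_snapshot)
```

In this toy setup the recursive listing also picks up the metadata files, the aborted INSERT output, and the in-progress file, while the snapshot-based load returns only the committed data files together with their split offsets.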



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
