Posted to issues-all@impala.apache.org by "Gergely Fürnstáhl (Jira)" <ji...@apache.org> on 2022/11/28 11:02:00 UTC

[jira] [Updated] (IMPALA-11577) Optimize getting stored file types for Iceberg tables

     [ https://issues.apache.org/jira/browse/IMPALA-11577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gergely Fürnstáhl updated IMPALA-11577:
---------------------------------------
    Description: 
Spawned from IMPALA-10610
Impala supports mixed file formats for Iceberg tables: each data file can have a different file format, and Impala uses the set of file formats present in the table for planning purposes. Currently Impala iterates over the metadata of every file to aggregate this information, which can be slow if there are many data files.

We could optimize this by storing the aggregated information somewhere (e.g. in Iceberg, where this is yet to be implemented: [https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/SnapshotSummary.java])

Update:

IcebergContentFileStore supports aggregating file formats, similar to the proposed change in Iceberg, but using it might not be the correct approach. It represents the state of the current snapshot, whereas IcebergScanNode receives a possibly time-travelled or pruned version of the file descriptor list.

Further optimization ideas:
 * Use the aggregated info from the Iceberg SnapshotSummary/IcebergContentFileStore if possible (e.g. current snapshot, no pruning)
 * Exit the loop once all the available file formats have been found (might be too costly if we check the condition in every iteration)
 * Parallelize the aggregation
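
The three ideas above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Impala's actual code: FileFormat, FileDesc and every method name below are hypothetical stand-ins, and the real IcebergScanNode/IcebergContentFileStore classes are not used. For the early exit, the completeness check runs only when a new format is added to the set, which is one possible way to avoid paying for the check in every iteration.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative sketch only: none of these types are Impala's or Iceberg's real classes.
final class FileFormatAggregation {
  // Hypothetical stand-in for the file formats a table may contain.
  enum FileFormat { PARQUET, ORC, AVRO }

  // Hypothetical stand-in for a file descriptor; only the format matters here.
  static final class FileDesc {
    final String path;
    final FileFormat format;
    FileDesc(String path, FileFormat format) {
      this.path = path;
      this.format = format;
    }
  }

  // Baseline: visit every descriptor, as the issue describes the current behavior.
  static Set<FileFormat> aggregateAll(Collection<FileDesc> files) {
    Set<FileFormat> found = EnumSet.noneOf(FileFormat.class);
    for (FileDesc fd : files) found.add(fd.format);
    return found;
  }

  // Idea 1: reuse a precomputed set only when the scan sees exactly the
  // current snapshot's files (no time travel, no pruning); otherwise rescan.
  static Set<FileFormat> aggregateOrReuse(Set<FileFormat> snapshotFormats,
      boolean isCurrentSnapshot, boolean isPruned, Collection<FileDesc> files) {
    if (isCurrentSnapshot && !isPruned) return snapshotFormats;
    return aggregateAll(files);
  }

  // Idea 2: stop once every candidate format has been seen. Set.add returns
  // true only when a new format appears, so the completeness check runs at
  // most once per distinct format rather than once per file.
  static Set<FileFormat> aggregateWithEarlyExit(Collection<FileDesc> files,
      Set<FileFormat> candidates) {
    Set<FileFormat> found = EnumSet.noneOf(FileFormat.class);
    for (FileDesc fd : files) {
      if (found.add(fd.format) && found.containsAll(candidates)) break;
    }
    return found;
  }

  // Idea 3: aggregate in parallel; each worker fills its own EnumSet and the
  // collector merges the partial sets.
  static Set<FileFormat> aggregateInParallel(Collection<FileDesc> files) {
    return files.parallelStream()
        .map(fd -> fd.format)
        .collect(Collectors.toCollection(() -> EnumSet.noneOf(FileFormat.class)));
  }

  public static void main(String[] args) {
    List<FileDesc> files = Arrays.asList(
        new FileDesc("a.parquet", FileFormat.PARQUET),
        new FileDesc("b.orc", FileFormat.ORC),
        new FileDesc("c.parquet", FileFormat.PARQUET));
    System.out.println(aggregateAll(files));
    System.out.println(aggregateWithEarlyExit(files, EnumSet.allOf(FileFormat.class)));
    System.out.println(aggregateInParallel(files));
  }
}
```

Whether the parallel variant pays off depends on how cheap the per-file work is; for a simple field read, the single-threaded loop with early exit may well be faster, so this would need measuring.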

  was:
Spawned from IMPALA-10610
Impala supports mixed file formats for Iceberg tables: each data file can have a different file format, and Impala uses the set of file formats present in the table for planning purposes. Currently Impala iterates over the metadata of every file to aggregate this information, which can be slow if there are many data files.

We could optimize this by storing the aggregated information somewhere (e.g. in Iceberg, where this is yet to be implemented: [https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/SnapshotSummary.java])


> Optimize getting stored file types for Iceberg tables
> -----------------------------------------------------
>
>                 Key: IMPALA-11577
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11577
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>            Reporter: Gergely Fürnstáhl
>            Priority: Major
>              Labels: impala-iceberg



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
