Posted to issues-all@impala.apache.org by "Steve Loughran (Jira)" <ji...@apache.org> on 2022/11/02 21:11:00 UTC

[jira] [Commented] (IMPALA-11662) Improve "refresh iceberg_tbl_on_oss;" performance

    [ https://issues.apache.org/jira/browse/IMPALA-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627987#comment-17627987 ] 

Steve Loughran commented on IMPALA-11662:
-----------------------------------------

If you use the iterator APIs to list files (listStatusIterator(), listFiles(), listLocatedStatus(), ...), you get back an iterator which, on both abfs and s3a, fetches pages of results in the background rather than blocking until every page has arrived. That hides a lot of the latency.
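
To illustrate the idea of background page fetches, here is a toy model in plain Java. It is not the abfs or s3a implementation: the local RemoteIterator interface only mirrors the shape of Hadoop's org.apache.hadoop.fs.RemoteIterator, and the prefetching() helper and its page-fetch function are hypothetical names made up for this sketch. The point it shows is that the request for page N+1 is already in flight while the caller is still consuming page N.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.function.IntFunction;

/** Toy model of a listing iterator that prefetches the next page in the background. */
final class PrefetchingListing {

    /** Stand-in with the same shape as Hadoop's org.apache.hadoop.fs.RemoteIterator. */
    interface RemoteIterator<T> {
        boolean hasNext() throws IOException;
        T next() throws IOException;
    }

    /**
     * Hypothetical helper: fetchPage returns page N of the listing,
     * or an empty list once the listing is exhausted.
     */
    static <T> RemoteIterator<T> prefetching(IntFunction<List<T>> fetchPage) {
        return new RemoteIterator<T>() {
            private int nextPage = 0;
            // The first page is requested as soon as the iterator is created.
            private CompletableFuture<List<T>> pending = request();
            private Iterator<T> current = Collections.emptyIterator();
            private boolean exhausted = false;

            private CompletableFuture<List<T>> request() {
                final int page = nextPage++;
                return CompletableFuture.supplyAsync(() -> fetchPage.apply(page));
            }

            @Override
            public boolean hasNext() throws IOException {
                while (!current.hasNext() && !exhausted) {
                    List<T> page;
                    try {
                        // Blocks only if the background fetch has not finished yet.
                        page = pending.get();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        throw new IOException("interrupted while listing", e);
                    } catch (ExecutionException e) {
                        throw new IOException("listing failed", e.getCause());
                    }
                    if (page.isEmpty()) {
                        exhausted = true;
                    } else {
                        current = page.iterator();
                        // Start fetching the next page while the caller consumes this one.
                        pending = request();
                    }
                }
                return current.hasNext();
            }

            @Override
            public T next() throws IOException {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return current.next();
            }
        };
    }

    /** Consumes entries one at a time, as a caller of the iterator APIs would. */
    static <T> List<T> drain(RemoteIterator<T> it) throws IOException {
        List<T> out = new ArrayList<>();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }
}
```

A caller that does real work per entry (building a file descriptor, say) overlaps that work with the next page request, which is where the latency hiding comes from. Collecting everything into a list first, as drain() does here for demonstration, gives up part of that benefit.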

Note that listing is a lot slower on versioned buckets where older versions of files have been overwritten or deleted; even deleted directory markers cause problems. Are you testing on versioned buckets? If so, turning off directory marker deletion makes a big difference.
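
For s3a, marker deletion is controlled by the fs.s3a.directory.marker.retention option in Hadoop releases that support it. A minimal core-site.xml fragment, assuming such a release and assuming every client that writes to the bucket is marker-aware (check the S3A directory marker documentation for your Hadoop version before changing this):

```xml
<!-- Keep directory markers instead of deleting them when files are created beneath them. -->
<property>
  <name>fs.s3a.directory.marker.retention</name>
  <value>keep</value>
</property>
```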

> Improve "refresh iceberg_tbl_on_oss;" performance
> -------------------------------------------------
>
>                 Key: IMPALA-11662
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11662
>             Project: IMPALA
>          Issue Type: Improvement
>            Reporter: LiPenglin
>            Priority: Major
>              Labels: impala-iceberg
>
> Iceberg provides rich metadata, and directory listing on an OSS service, e.g. S3A, costs more than on HDFS, so we could create the file descriptors from Iceberg metadata instead of using org.apache.hadoop.fs.FileSystem#listFiles. See https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java#L189.
> The only thing missing there is the last_modification_time of the files. But since Iceberg files are immutable, maybe we could just come up with a special timestamp for these files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
