Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2023/01/18 16:58:00 UTC

[jira] [Commented] (IMPALA-11591) Avoid calling planFiles() on Iceberg tables when there are no predicates

    [ https://issues.apache.org/jira/browse/IMPALA-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17678327#comment-17678327 ] 

ASF subversion and git services commented on IMPALA-11591:
----------------------------------------------------------

Commit 1e1b8f25b686471b088148dd296a2eb731160302 in impala's branch refs/heads/master from Zoltan Borok-Nagy
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=1e1b8f25b ]

IMPALA-11826: Avoid calling planFiles() on Iceberg V2 tables when there are no predicates

Similar to IMPALA-11591 but this Jira extends it to V2 tables. With
this patch we group data files into two categories in
IcebergContentFileStore:
 * data files without deletes
 * data files with deletes

With this information we can avoid calling planFiles() when planning
the scans of Iceberg tables. We can just set the lists of the file
descriptors based on IcebergContentFileStore then invoke the regular
planning methods.
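
The grouping described above can be sketched roughly as follows. This is a
simplified Python illustration, not Impala's Java code: ContentFileStore,
FileDescriptor, and scan_file_lists are hypothetical names that stand in for
IcebergContentFileStore and the planner logic, and the file paths are made up.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class FileDescriptor:
    """Hypothetical stand-in for a cached file descriptor."""
    path: str


@dataclass
class ContentFileStore:
    """Hypothetical model: two data file categories plus the delete files."""
    data_files_without_deletes: List[FileDescriptor] = field(default_factory=list)
    data_files_with_deletes: List[FileDescriptor] = field(default_factory=list)
    delete_files: List[FileDescriptor] = field(default_factory=list)

    def add_data_file(self, fd: FileDescriptor, has_deletes: bool) -> None:
        # Categorize once, at metadata load time, so planning can reuse it.
        if has_deletes:
            self.data_files_with_deletes.append(fd)
        else:
            self.data_files_without_deletes.append(fd)


def scan_file_lists(
    store: ContentFileStore,
) -> Tuple[List[FileDescriptor], List[FileDescriptor], List[FileDescriptor]]:
    """Build the scan's file lists from the cache alone (no planFiles() call)."""
    return (
        list(store.data_files_without_deletes),
        list(store.data_files_with_deletes),
        list(store.delete_files),
    )


# Usage with made-up file names.
store = ContentFileStore()
store.add_data_file(FileDescriptor("data/a.parquet"), has_deletes=False)
store.add_data_file(FileDescriptor("data/b.parquet"), has_deletes=True)
store.delete_files.append(FileDescriptor("data/b-deletes.parquet"))
plain, with_deletes, deletes = scan_file_lists(store)
```

In this sketch the files in the first list could be scanned directly, while
only the files in the second list would need the delete files applied.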

iceberg-v2-tables.test had to be updated slightly because we now
calculate the lengths of the file paths from Impala's file
descriptor objects plus the table location, rather than from the data
file information in Iceberg metadata (which has the file system prefix
stripped).
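
A small sketch with made-up values shows why the two ways of computing a path
length can differ. The table location, the relative file path, and the
full_path helper are assumptions for illustration only (they are not taken
from the test file), and the sketch assumes the descriptor holds a path
relative to the table location.

```python
from urllib.parse import urlparse


def full_path(table_location: str, relative_path: str) -> str:
    """Hypothetical helper: join a table location and a relative file path."""
    return table_location.rstrip("/") + "/" + relative_path.lstrip("/")


table_location = "hdfs://localhost:20500/test-warehouse/tbl"  # assumed value
relative_path = "data/00000-0-data.parquet"                   # assumed value

# Path built from file descriptor + table location: keeps the prefix.
with_prefix = full_path(table_location, relative_path)

# The same path with the file system prefix (scheme and authority) stripped.
without_prefix = urlparse(with_prefix).path

# The lengths differ by exactly the length of the stripped prefix.
prefix_length = len(with_prefix) - len(without_prefix)
```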

Testing:
  * Executed existing tests
  * Updated plan tests

Change-Id: Ia46bd2dce248a9e096fc1c0bd914fc3fa4686fb0
Reviewed-on: http://gerrit.cloudera.org:8080/19419
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Avoid calling planFiles() on Iceberg tables when there are no predicates
> ------------------------------------------------------------------------
>
>                 Key: IMPALA-11591
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11591
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Catalog, Frontend
>            Reporter: Zoltán Borók-Nagy
>            Assignee: Zoltán Borók-Nagy
>            Priority: Major
>              Labels: impala-iceberg
>             Fix For: Impala 4.2.0
>
>
> Currently we always invoke Iceberg's planFiles() API for creating Iceberg scans.
> When there are no predicates (and no time travel) on the table we could avoid that because we already cache everything we need (schema, partition information, file descriptors).
> We can also consider only pushing down predicates if at least one of the predicates refers to a partition column. Otherwise it's possible that the overhead of reading, decoding, and evaluating all the manifest files is too large.
> I think the change should be fairly simple; we just need to take care of the following:
>  * -store delete files separately, so we can still do the V2 scans from cache- (will be implemented by IMPALA-11826)
>  * During time travel we also cache old file descriptors, so we need to separate them from the current snapshot's file descriptors.
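
The conditions in the description above can be sketched as two small checks.
This is a Python illustration under stated assumptions, not Impala's
implementation: Predicate, can_plan_from_cache, and should_push_down are
hypothetical names, and a predicate is modeled only by the column names it
references.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Predicate:
    """Hypothetical predicate: only the referenced column names matter here."""
    columns: FrozenSet[str]


def can_plan_from_cache(
    predicates: List[Predicate], time_travel_snapshot_id: Optional[int]
) -> bool:
    """True when there are no predicates and no time travel on the table."""
    return not predicates and time_travel_snapshot_id is None


def should_push_down(
    predicates: List[Predicate], partition_columns: FrozenSet[str]
) -> bool:
    """True when at least one predicate refers to a partition column."""
    return any(p.columns & partition_columns for p in predicates)


# Usage with made-up column names.
partition_cols = frozenset({"event_date"})
on_partition = Predicate(frozenset({"event_date"}))
on_other = Predicate(frozenset({"user_id"}))

no_filter = can_plan_from_cache([], None)
filtered = can_plan_from_cache([on_other], None)
time_travel = can_plan_from_cache([], 1234)
push_partition = should_push_down([on_other, on_partition], partition_cols)
push_other = should_push_down([on_other], partition_cols)
```

The second check reflects the idea floated in the description: when no
predicate touches a partition column, reading and evaluating all the manifest
files may cost more than it saves.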



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
