Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2022/11/03 01:20:00 UTC

[jira] [Commented] (IMPALA-11591) Avoid calling planFiles() on Iceberg tables when there are no predicates

    [ https://issues.apache.org/jira/browse/IMPALA-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17628046#comment-17628046 ] 

ASF subversion and git services commented on IMPALA-11591:
----------------------------------------------------------

Commit 301c3cebad814c73ff7f7ccfd528f7ab7832f4ab in impala's branch refs/heads/master from Zoltan Borok-Nagy
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=301c3ceba ]

IMPALA-11591: Avoid calling planFiles() on Iceberg tables

Iceberg's planFiles() API is very expensive as it needs to read all
the relevant manifest files. It's especially expensive on object
stores like S3.

When there are no predicates on the table and we are not doing
time travel it's possible to avoid calling planFiles() and do the
scan planning from cached metadata. When none of the predicates are
on partition columns there's little benefit in pushing down predicates
to Iceberg. So with this patch we only push down predicates (and
hence invoke planFiles()) when at least one of the predicates is
on a partition column.
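
The decision described above can be sketched roughly as follows. This is
only an illustration of the rule, not the code from the patch: the class
name ScanPlanningSketch, the method needsPlanFiles(), and the idea of
reducing each predicate to the name of the column it references are all
assumptions made for the example.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch: none of these names come from Impala's code base.
final class ScanPlanningSketch {
    /**
     * Returns true if scan planning has to go through Iceberg's planFiles(),
     * false if it can be done from cached metadata.
     *
     * @param predicateColumns one entry per predicate: the column it references
     * @param partitionColumns names of the table's partition columns
     * @param isTimeTravel     whether the query reads an older snapshot
     */
    static boolean needsPlanFiles(List<String> predicateColumns,
                                  Set<String> partitionColumns,
                                  boolean isTimeTravel) {
        // Time travel cannot be served from the current snapshot's cache.
        if (isTimeTravel) return true;
        // Push down (and hence call planFiles()) only when at least one
        // predicate is on a partition column.
        for (String column : predicateColumns) {
            if (partitionColumns.contains(column)) return true;
        }
        // No predicates, or none on partition columns: plan from the cache.
        return false;
    }
}
```

With a table partitioned by a hypothetical column "dt", a predicate on
"dt" would go through planFiles(), while a predicate only on a
non-partition column would be planned from cached metadata.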

This patch introduces a new class to store content files:
IcebergContentFileStore. It separates data, delete, and "old" content
files. "Old" content files are the ones that are not part of the current
snapshot. We add such data files during time travel. Storing "old"
content files in a separate concurrent hash map also fixes a concurrency
bug in the current code.
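
A minimal sketch of the separation described above, for illustration
only. It is not the actual IcebergContentFileStore: the class name
ContentFileStoreSketch, the method names, and the use of plain strings
for paths and file descriptors are assumptions. It also assumes that
data and delete files are filled in once at table load time, before the
store is shared between threads, while "old" files can be added later by
concurrent time travel queries.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: names and types are illustrative, not Impala's.
final class ContentFileStoreSketch {
    // Files of the current snapshot. Assumed to be populated at load time only.
    private final Map<String, String> dataFiles = new HashMap<>();
    private final Map<String, String> deleteFiles = new HashMap<>();
    // Files that are not part of the current snapshot. Queries doing time
    // travel may add entries concurrently, hence the concurrent map.
    private final ConcurrentHashMap<String, String> oldFiles = new ConcurrentHashMap<>();

    void addDataFile(String path, String descriptor) { dataFiles.put(path, descriptor); }

    void addDeleteFile(String path, String descriptor) { deleteFiles.put(path, descriptor); }

    /** Caches an old file and returns the descriptor that ends up stored for the path. */
    String addOldFile(String path, String descriptor) {
        String previous = oldFiles.putIfAbsent(path, descriptor);
        return previous != null ? previous : descriptor;
    }

    /** Looks up a data file in the current snapshot first, then among the old files. */
    String findDataFile(String path) {
        String descriptor = dataFiles.get(path);
        return descriptor != null ? descriptor : oldFiles.get(path);
    }

    String findDeleteFile(String path) { return deleteFiles.get(path); }

    int currentDataFileCount() { return dataFiles.size(); }

    int oldFileCount() { return oldFiles.size(); }
}
```

Keeping the old files in their own map means the current snapshot's
file list is never modified by a time travel query, so a scan planned
from the cache only sees files that belong to the current snapshot.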

Testing:
 * executed current e2e tests
 * updated predicate push down tests

Change-Id: Iadb883a28602bb68cf4f61e57cdd691605045ac5
Reviewed-on: http://gerrit.cloudera.org:8080/19043
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Avoid calling planFiles() on Iceberg tables when there are no predicates
> ------------------------------------------------------------------------
>
>                 Key: IMPALA-11591
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11591
>             Project: IMPALA
>          Issue Type: Improvement
>            Reporter: Zoltán Borók-Nagy
>            Assignee: Zoltán Borók-Nagy
>            Priority: Major
>              Labels: impala-iceberg
>
> Currently we always invoke Iceberg's planFiles() API for creating Iceberg scans.
> When there are no predicates (and no time travel) on the table we could avoid that because we already cache everything we need (schema, partition information, file descriptors).
> We can also consider only pushing down predicates if at least one of the predicates refers to a partition column. Otherwise it's possible that the overhead of reading, decoding, and evaluating all the manifest files is too large.
> I think the change should be fairly simple; we just need to take care to:
>  * store delete files separately, so we can still do the V2 scans from cache
>  * keep the old file descriptors that we cache during time travel separate from the current snapshot's file descriptors



