You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2023/04/21 23:06:00 UTC

[jira] [Commented] (IMPALA-11701) Slow query problem about querying iceberg table by impala

    [ https://issues.apache.org/jira/browse/IMPALA-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17715190#comment-17715190 ] 

ASF subversion and git services commented on IMPALA-11701:
----------------------------------------------------------

Commit 7e0feb4a8e436ddbf618bd0e090dcc24c3563fc4 in impala's branch refs/heads/master from Gabor Kaszab
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=7e0feb4a8 ]

IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

We push down predicates to Iceberg that uses them to filter out files
when getting the results of planFiles(). Using the
FileScanTask.residual() function we can find out if we have to use
the predicates to further filter the rows of the given files or if
Iceberg has already performed all the filtering.
Basically if we only filter on IDENTITY-partition columns then Iceberg
can filter the files and using these filters in Impala wouldn't filter
any more rows from the output (assuming that no partition evolution was
performed on the table).

An additional benefit of not pushing down no-op predicates to the
scanner is that we can potentially materialize less slots.
For example:

SELECT count(1) from iceberg_tbl where part_col = 10;

Another additional benefit comes with count(*) queries. If all the
predicates are skipped from being pushed to Impala's scanner for a
count(*) query then the Parquet scanner can go to an optimized path
where it uses stats instead of reading actual data to answer the query.

In the above query Iceberg filters the files using the predicate on
a partition column and then there won't be any need to materialize
'part_col' in Impala, nor to push down the 'part_col = 10' predicate.

Note, this is an all or nothing approach, meaning that assuming N
number of predicates we either push down all predicates to the scanner
or none of them. There is a room for improvement to identify a subset
of the predicates that we still have to push down to the scanner.
However, for this we'd need a mapping between Impala predicates and the
predicates returned by Iceberg's FileScanTask.residual() function that
would significantly increase the complexity of the relevant code.

Testing:
  - Some existing tests needed some extra care as they were checking
    for predicates being pushed down to the scanner, but with this
    patch not all of them are pushed down. For these tests I added some
    extra predicates to achieve that all of the predicates are pushed
    down to the scanner.
  - Added a new planner test suite for checking how predicate push down
    works with Iceberg tables.

Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Reviewed-on: http://gerrit.cloudera.org:8080/19534
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Slow query problem about querying iceberg table by impala
> ---------------------------------------------------------
>
>                 Key: IMPALA-11701
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11701
>             Project: IMPALA
>          Issue Type: Bug
>            Reporter: Qizhu Chan
>            Assignee: Gabor Kaszab
>            Priority: Critical
>              Labels: impala-iceberg
>         Attachments: image-2022-11-03-17-37-14-712.png, profile_cf446a1ab3a5e852_1b1005de00000000.txt
>
>
> I use impala to query iceberg table, but the query efficiency is not ideal, compared with querying the hive format table of the same data, the time-consuming increase is dozens of times.
> The sql statement used is a very simple statistical query, be like :
> select count(*)  from `db_name`.tbl_name where datekey='20221001' and event='xxx'
> ('datekey' and 'event' are the partition fields)
> My personal feeling is that impala might fetch iceberg's metadata stats and return results very quickly, but it doesn't.
> The catalog of iceberg table is of the hadoop type, and Impala can access it by creating an external table in hive. By the way,  iceberg table will perform snapshot expiration and data compaction on a daily basis, so there should be no small file problems.
> I found this warning using the explain statement:
> {code:java}
> | WARNING: The following tables are missing relevant table and/or column statistics. |
> | iceberg.gamebox_event_iceberg
> {code}
> Query: SHOW TABLE STATS `iceberg`.gamebox_event_iceberg
> +-------+--------+--------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------------+
> | #Rows | #Files | Size   | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                                        |
> +-------+--------+--------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------------+
> | 0     | 590509 | 1.91TB | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs:///hive/warehouse/iceberg/gamebox_event_iceberg |
> +-------+--------+--------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------------+
> It seems like Impala is not syncing iceberg's table and column statistics. I'm not sure if this has anything to do with slow queries.
> As shown in the screenshot, i think the query time is mainly on planning and execution backends , but I don't know what is the reason for these two time consuming.
> Attachment is the complete profile for this query.
> How do I speed up the query? Can someone help with my question?plz.....
>  !image-2022-11-03-17-37-14-712.png! 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org