You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Fang-Yu Rao (Jira)" <ji...@apache.org> on 2023/05/19 00:46:00 UTC

[jira] [Created] (IMPALA-12151) Formula used to estimate the cost of join could be improved

Fang-Yu Rao created IMPALA-12151:
------------------------------------

             Summary: Formula used to estimate the cost of join could be improved
                 Key: IMPALA-12151
                 URL: https://issues.apache.org/jira/browse/IMPALA-12151
             Project: IMPALA
          Issue Type: Bug
          Components: Frontend
    Affects Versions: Impala 4.1.2
            Reporter: Fang-Yu Rao


We found that the formula used in [Planner#|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/planner/Planner.java#L719-L724] to estimate the cost of a join (per node) sometimes could lead to a bad join order.

The issue could shown using by the following steps.
{code:java}
create database test_db;
create table test_db.larger_tbl (string_col string, bigint_col bigint, int_col_0 int, int_col_1 int) partitioned by (date_string_col string) stored as parquet;
create table test_db.smaller_tbl (bigint_col bigint) partitioned by (date_string_col string) stored as parquet;

insert into test_db.smaller_tbl partition (date_string_col='2023-05-05') values (1000);
insert into test_db.smaller_tbl partition (date_string_col='2023-05-05') values (1000);
insert into test_db.smaller_tbl partition (date_string_col='2023-05-05') values (1000);

insert into test_db.larger_tbl partition (date_string_col='2023-05-05') values ('wa', 1000, 6, 1);

alter table test_db.smaller_tbl partition (date_string_col='2023-05-05')
 set tblproperties('numrows'='17000', 'stats_generated_via_stats_task'='true');
alter table test_db.larger_tbl partition (date_string_col='2023-05-05')
 set tblproperties('numrows'='28900000', 'stats_generated_via_stats_task'='true');

explain select
  distinct t0.`string_col`
from
  `test_db`.`larger_tbl` t0
  left outer join `test_db`.`smaller_tbl` t1 on (
    t0.`date_string_col` = t1.`date_string_col`
    and t0.`bigint_col` = t1.`bigint_col`
  )
where
t0.`date_string_col` in ('2023-05-05') and t0.`int_col_1` in (1)
order by 1 asc
limit 1000;
{code}
 

The query plan shows that Impala will be using the larger table ('larger_tbl') as the build side table in the hash join node. When there is data skew in the larger table, it's possible that there will be only one single executor working on building the hash table based on the only hash partition that contains data, which in turn could cause the executor node to run into memory issue.
{code:java}
+------------------------------------------------------------------------------------------------------+
| Explain String                                                                                       |
+------------------------------------------------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=110.03MB Threads=7                                         |
| Per-Host Resource Estimates: Memory=414MB                                                            |
| WARNING: The following tables are missing relevant table and/or column statistics.                   |
| test_db.larger_tbl, test_db.smaller_tbl                                                              |
|                                                                                                      |
| PLAN-ROOT SINK                                                                                       |
| |                                                                                                    |
| 09:MERGING-EXCHANGE [UNPARTITIONED]                                                                  |
| |  order by: t0.string_col ASC                                                                       |
| |  limit: 1001                                                                                       |
| |                                                                                                    |
| 04:TOP-N [LIMIT=1001]                                                                                |
| |  order by: t0.string_col ASC                                                                       |
| |  row-size=12B cardinality=1.00K                                                                    |
| |                                                                                                    |
| 08:AGGREGATE [FINALIZE]                                                                              |
| |  group by: t0.string_col                                                                           |
| |  row-size=12B cardinality=2.89M                                                                    |
| |                                                                                                    |
| 07:EXCHANGE [HASH(t0.string_col)]                                                                    |
| |                                                                                                    |
| 03:AGGREGATE [STREAMING]                                                                             |
| |  group by: t0.string_col                                                                           |
| |  row-size=12B cardinality=2.89M                                                                    |
| |                                                                                                    |
| 02:HASH JOIN [RIGHT OUTER JOIN, PARTITIONED]                                                         |
| |  hash predicates: t1.bigint_col = t0.bigint_col, t1.date_string_col = t0.date_string_col           |
| |  runtime filters: RF000 <- t0.bigint_col, RF001 <- t0.date_string_col, RF003 <- t0.date_string_col |
| |  row-size=56B cardinality=2.89M                                                                    |
| |                                                                                                    |
| |--06:EXCHANGE [HASH(t0.bigint_col,t0.date_string_col)]                                              |
| |  |                                                                                                 |
| |  00:SCAN HDFS [test_db.larger_tbl t0]                                                              |
| |     partition predicates: t0.date_string_col IN ('2023-05-05')                                     |
| |     HDFS partitions=1/1 files=1 size=1.14KB                                                        |
| |     predicates: t0.int_col_1 IN (1)                                                                |
| |     row-size=36B cardinality=2.89M                                                                 |
| |                                                                                                    |
| 05:EXCHANGE [HASH(t1.bigint_col,t1.date_string_col)]                                                 |
| |                                                                                                    |
| 01:SCAN HDFS [test_db.smaller_tbl t1]                                                                |
|    partition predicates: t1.date_string_col IN ('2023-05-05')                                        |
|    HDFS partitions=1/1 files=3 size=1.18KB                                                           |
|    runtime filters: RF003 -> t1.date_string_col, RF000 -> t1.bigint_col, RF001 -> t1.date_string_col |
|    row-size=20B cardinality=17.00K                                                                   |
+------------------------------------------------------------------------------------------------------+
{code}
The following is the corresponding trace-level log produced by the Planner class.
{code:java}
I0518 16:36:03.611824  8891 Planner.java:726] 97457635012ca55d:e8f7d0f300000000] isInvertedJoinCheaper() (0 1) 
I0518 16:36:03.611925  8891 Planner.java:727] 97457635012ca55d:e8f7d0f300000000] lhsCard 2890000 lhsBytes 1.0404E8 lhsNumNodes 1
I0518 16:36:03.611968  8891 Planner.java:729] 97457635012ca55d:e8f7d0f300000000] rhsCard 17000 rhsBytes 340000.0 rhsNumNodes 3
I0518 16:36:03.612016  8891 Planner.java:731] 97457635012ca55d:e8f7d0f300000000] cost 1.1028566059551687E9 invCost 9.043482988224154E8
I0518 16:36:03.612049  8891 Planner.java:732] 97457635012ca55d:e8f7d0f300000000] INVERT? true
I0518 16:36:03.625046  8891 Planner.java:191] 97457635012ca55d:e8f7d0f300000000] desctbl: tuples:
{code}
The following are collected metrics above used by Impala to estimate the join cost.
 # lhsCard: 2890000 v.s. rhsCard: 17000.
 # lhsBytes: 1.1028566059551687E9 v.s. rhsBytes: 1.0404E8.
 # lhsNumNodes: 1 v.s. rhsNumNodes: 3.

It would be great if there is a configurable way for an Impala user to tweak how Impala estimate the join cost.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org