You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Aman Sinha (Jira)" <ji...@apache.org> on 2023/03/18 19:58:00 UTC

[jira] [Commented] (IMPALA-12006) Outer/inner join cardinality highly overestimated

    [ https://issues.apache.org/jira/browse/IMPALA-12006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17702173#comment-17702173 ] 

Aman Sinha commented on IMPALA-12006:
-------------------------------------

The overestimation occurs for both inner and outer joins.  Changed the Jira title to reflect that.  I created a simplified example on TPC-H dataset:

{noformat}
create view v4 as
 select a.o_clerk from orders a inner join
  (select o_clerk, max(cast(o_orderdate as DATE)) max_date from orders where o_orderdate < DATE '1998-01-01' group by o_clerk) t1
   on a.o_clerk = t1.o_clerk AND a.o_orderdate = max_date;

explain select * from
orders a left outer join v4
on a.o_clerk = v4.o_clerk;

+--------------------------------------------------------------------------------------------+
| Explain String                                                                             |
+--------------------------------------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=69.88MB Threads=9                                |
| Per-Host Resource Estimates: Memory=778MB                                                  |
|                                                                                            |
| PLAN-ROOT SINK                                                                             |
| |                                                                                          |
| 11:EXCHANGE [UNPARTITIONED]                                                                |
| |                                                                                          |
| 05:HASH JOIN [LEFT OUTER JOIN, PARTITIONED]                                                |
| |  hash predicates: a.o_clerk = a.o_clerk                                                  |
| |  row-size=251B cardinality=2.24G                                                         |
| |                                                                                          |
| |--10:EXCHANGE [HASH(a.o_clerk)]                                                           |
| |  |                                                                                       |
| |  04:HASH JOIN [INNER JOIN, BROADCAST]                                                    |
| |  |  hash predicates: a.o_orderdate = max(CAST(o_orderdate AS DATE)), a.o_clerk = o_clerk |
| |  |  runtime filters: RF000 <- max(CAST(o_orderdate AS DATE)), RF001 <- o_clerk           |
| |  |  row-size=80B cardinality=1.50M                                                       |
| |  |                                                                                       |
| |  |--08:EXCHANGE [BROADCAST]                                                              |
| |  |  |                                                                                    |
| |  |  07:AGGREGATE [FINALIZE]                                                              |
| |  |  |  output: max:merge(CAST(o_orderdate AS DATE))                                      |
| |  |  |  group by: o_clerk                                                                 |
| |  |  |  row-size=31B cardinality=1.01K                                                    |
| |  |  |                                                                                    |
| |  |  06:EXCHANGE [HASH(o_clerk)]                                                          |
| |  |  |                                                                                    |
| |  |  03:AGGREGATE [STREAMING]                                                             |
| |  |  |  output: max(CAST(o_orderdate AS DATE))                                            |
| |  |  |  group by: o_clerk                                                                 |
| |  |  |  row-size=31B cardinality=1.01K                                                    |
| |  |  |                                                                                    |
| |  |  02:SCAN HDFS [tpch.orders]                                                           |
| |  |     HDFS partitions=1/1 files=1 size=162.56MB                                         |
| |  |     predicates: o_orderdate < DATE '1998-01-01'                                       |
| |  |     row-size=49B cardinality=150.00K                                                  |
| |  |                                                                                       |
| |  01:SCAN HDFS [tpch.orders a]                                                            |
| |     HDFS partitions=1/1 files=1 size=162.56MB                                            |
| |     runtime filters: RF000 -> a.o_orderdate, RF001 -> a.o_clerk                          |
| |     row-size=49B cardinality=1.50M                                                       |
| |                                                                                          |
| 09:EXCHANGE [HASH(a.o_clerk)]                                                              |
| |                                                                                          |
| 00:SCAN HDFS [tpch.orders a]                                                               |
|    HDFS partitions=1/1 files=1 size=162.56MB                                               |
|    row-size=171B cardinality=1.50M                                                         |
+--------------------------------------------------------------------------------------------+

tpch> select count(*) from
                      > orders a left outer join v4
                      > on a.o_clerk = v4.o_clerk;
+----------+                                                                                             
| count(*) |
+----------+
| 2009241  |
+----------+

{noformat}

So the estimated cardinality of the top Left Outer Join is 2.24B.  Actual cardinality is 2M, so an overestimation by 3 orders of magnitude.  I deliberately chose o_clerk as the join key since that has lots of duplicates.  Here's the NDV stats for orders.o_clerk:
{noformat}
+-----------------+---------------+------------------+--------+----------+---------------+--------+---------+
| Column          | Type          | #Distinct Values | #Nulls | Max Size | Avg Size      | #Trues | #Falses |
+-----------------+---------------+------------------+--------+----------+---------------+--------+---------+
| o_clerk         | STRING        | 1006             | 0      | 15       | 15.0          | -1     | -1      |
{noformat}



> Outer/inner join cardinality highly overestimated
> -------------------------------------------------
>
>                 Key: IMPALA-12006
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12006
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 4.2.0
>            Reporter: Aman Sinha
>            Assignee: Aman Sinha
>            Priority: Major
>
> In one of the use cases, we have seen the cardinality estimate for left outer join highly overestimated.  The plan is complex and only a partial output is shown below (with the column names anonymized): 
> {noformat}
>   57:HASH JOIN [LEFT OUTER JOIN, BROADCAST]
>   |  hash-table-id=121
>   |  hash predicates: a.id = a.id
>   |  fk/pk conjuncts: none
>   |  mem-estimate=0B mem-reservation=0B spill-buffer=2.00MB thread-reservation=0
>   |  tuple-ids=4N,3,6N,8N,9N,11N,14N,16N,19N,21N,24N,26N,29N,31N row-size=2.63KB cardinality=3.90T
>   |  in pipelines: 06(GETNEXT), 26(OPEN)
>   |
>   |--F1253:PLAN FRAGMENT hosts=13 instances=13
>   |  |  Per-Instance Resources: mem-estimate=1.10GB mem-reservation=204.00MB thread-reservation=1
>   |  JOIN BUILD
>   |  |  join-table-id=121 plan-id=122 cohort-id=25
>   |  |  build expressions: a.id
>   |  |  mem-estimate=1.08GB mem-reservation=204.00MB spill-buffer=2.00MB thread-reservation=0
>   |  |
>   |  1758:EXCHANGE [BROADCAST]
>   |  |  mem-estimate=20.87MB mem-reservation=0B thread-reservation=0
>   |  |  tuple-ids=29,31 row-size=85B cardinality=9.56M
>   |  |  in pipelines: 26(GETNEXT)
>   |  |
>   ...
>   ...
>   ...
>   56:HASH JOIN [LEFT OUTER JOIN, BROADCAST]
>   |  hash predicates: ifnull(a.id, a.id) = a.id
>   |  fk/pk conjuncts: assumed fk/pk
>   |  mem-estimate=0B mem-reservation=0B spill-buffer=2.00MB thread-reservation=0
>   |  tuple-ids=4N,3,6N,8N,9N,11N,14N,16N,19N,21N,24N,26N row-size=2.55KB cardinality=14.97G
>   |  in pipelines: 06(GETNEXT), 22(OPEN)
> {noformat}
> Note that the left input of the join is estimated as 14.97G rows, right input as 9.56M rows but the LOJ estimate is 3.9T rows.  We need to investigate why that is so and fix it.  The NDV of the based column involved in the join is 36661  but in the lower join there are functions involved in the join condition. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org