You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2020/09/01 16:12:00 UTC

[jira] [Commented] (IMPALA-7310) Compute Stats not computing NULLs as a distinct value causing wrong estimates

    [ https://issues.apache.org/jira/browse/IMPALA-7310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17188602#comment-17188602 ] 

ASF subversion and git services commented on IMPALA-7310:
---------------------------------------------------------

Commit f4273a40fe2e2ab0806f0324ecf33a25b3df6279 in impala's branch refs/heads/master from Shant Hovsepian
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=f4273a4 ]

IMPALA-7310: Partial fix for NDV cardinality with NULLs.

This fix just handles the case where a column's cardinality is zero
however it's nullable and we have null stats to indicate there are null
values, therefore we adjust the cardinality from 0 to 1.

The cardinality of zero was especially problematic when calculating
cardinalities for multiple predicates with multiplication. The 0 would
propagate up the plan tree and result in poor plan choices such as
always using broadcast joins where shuffle would've been more optimal.

Testing:
  * 26 Node TPC-DS 30TB run had better plans for Q4 and Q11
    - Q4 172s -> 80s
    - Q11 103s -> 77s
  * CardinalityTest
  * TpcdsPlannerTest

Change-Id: Iec967053b4991f8c67cde62adf003cbd3f429032
Reviewed-on: http://gerrit.cloudera.org:8080/16349
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Compute Stats not computing NULLs as a distinct value causing wrong estimates
> -----------------------------------------------------------------------------
>
>                 Key: IMPALA-7310
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7310
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, Impala 2.11.0, Impala 3.0, Impala 2.12.0
>            Reporter: Zsombor Fedor
>            Priority: Major
>
> As seen in other DBMSs
> {code:java}
> NDV(col){code}
> not counting NULL as a distinct value. The same also applies to
> {code:java}
> COUNT(DISTINCT col){code}
> This is working as intended, but when computing column statistics it can cause some anomalies (i.g. bad join order) as compute stats uses NDV() to determine columns NDVs.
>  
> For example when aggregating more columns, the estimated cardinality is [counted as the product of the columns' number of distinct values.|https://github.com/cloudera/Impala/blob/64cd0bb0c3529efa0ab5452c4e9e2a04fd815b4f/fe/src/main/java/org/apache/impala/analysis/Expr.java#L669] If there is a column full of NULLs the whole product will be 0.
>  
> There are two possible fix for this.
> Either we should count NULLs as a distinct value when Computing Stats in the query:
> {code:java}
> SELECT NDV(a) + COUNT(DISTINCT CASE WHEN a IS NULL THEN 1 END) AS a, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
> instead of
> {code:java}
> SELECT NDV(a) AS a, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
>  
>  
> Or we should change the planner [function|https://github.com/cloudera/Impala/blob/2d2579cb31edda24457d33ff5176d79b7c0432c5/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#L169] to take care of this bug.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org