You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@impala.apache.org by "Zsombor Fedor (JIRA)" <ji...@apache.org> on 2018/07/17 13:35:00 UTC

[jira] [Created] (IMPALA-7310) Compute Stats not computing NULLs as a distinct value causing wrong estimates

Zsombor Fedor created IMPALA-7310:
-------------------------------------

             Summary: Compute Stats not computing NULLs as a distinct value causing wrong estimates
                 Key: IMPALA-7310
                 URL: https://issues.apache.org/jira/browse/IMPALA-7310
             Project: IMPALA
          Issue Type: Bug
          Components: Frontend
    Affects Versions: Impala 2.12.0, Impala 3.0, Impala 2.11.0, Impala 2.10.0, Impala 2.9.0, Impala 2.7.1, Impala 2.8.0
            Reporter: Zsombor Fedor


As seen in other DBMSs
{code:java}
NDV(col){code}
not counting NULL as a distinct value. The same also applies to
{code:java}
COUNT(DISTINCT col){code}
This is working as intended, but when computing column statistics it can cause some anomalies (i.g. bad join order) as compute stats uses NDV() to determine columns NDVs.

 

For example when aggregating more columns, the estimated cardinality is [counted as the product of the columns' number of distinct values.|[https://github.com/cloudera/Impala/blob/64cd0bb0c3529efa0ab5452c4e9e2a04fd815b4f/fe/src/main/java/org/apache/impala/analysis/Expr.java#L669].] If there is a column full of NULLs the whole product will be 0.

 

There are two possible fix for this.

Either we should count NULLs as a distinct value when Computing Stats in the query:
{code:java}
SELECT NDV(a) + COUNT(DISTINCT CASE WHEN a IS NULL THEN 1 END) AS a, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
instead of
{code:java}
SELECT NDV(a) AS a, CAST(-1 as BIGINT), 4, CAST(4 as DOUBLE) FROM test;{code}
 

 

Or we should change the planner [function|https://github.com/cloudera/Impala/blob/2d2579cb31edda24457d33ff5176d79b7c0432c5/fe/src/main/java/org/apache/impala/planner/AggregationNode.java#L169] to take care of this bug.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)