You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues-all@impala.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2019/01/05 01:06:00 UTC

[jira] [Created] (IMPALA-8050) IS [NOT] NULL gives wrong selectivity when null count is missing

Paul Rogers created IMPALA-8050:
-----------------------------------

             Summary: IS [NOT] NULL gives wrong selectivity when null count is missing
                 Key: IMPALA-8050
                 URL: https://issues.apache.org/jira/browse/IMPALA-8050
             Project: IMPALA
          Issue Type: Bug
          Components: Frontend
    Affects Versions: Impala 3.1.0
            Reporter: Paul Rogers
            Assignee: Paul Rogers


Suppose we have the following query:

{noformat}
select *
from tpch.customer c
where c.c_mktsegment is null
{noformat}

If we have a null count, we can estimate selectivity based on that number. In the case of the TPC-H test data, after a recent fix to add null count back, null count is zero so the cardinality of the predicate {{c.c_mktsegment is null}} is 0 and no rows should be returned. Yet, the query plan shows:

{noformat}
PLAN-ROOT SINK
|
00:SCAN HDFS [tpch.customer c]
   partitions=1/1 files=1 size=23.08MB row-size=218B cardinality=15.00K
   predicates: c.c_comment IS NULL
{noformat}

So, the first bug is that the existing code which is supposed to consider null count (found in {{IsNullPredicate.analyzeImpl()}} does not work. Reason: the code in {{ColumnStats}} to check if we have nulls is wrong:

{code:java}
  public boolean hasNulls() { return numNulls_ > 0; }
{code}

Zero is a perfectly valid null count: it means a NOT NULL column. The marker for a missing null count is -1 as shown in another method:

{code:java}
  public boolean hasStats() { return numNulls_ != -1 || numDistinctValues_ != -1; }
{code}

This is probably an ambiguity in the name: does "has nulls" mean:

* Do we have valid null count stats?
* Do we have null count stats and we have at least some nulls?

Fortunately, the only other use of this method is in (disabled) tests.

Second, if the null count is not available (for older stats), the next-best approximation is 1/NDV. The code currently guesses 0.1. The 0.1 estimate is fine if NDV is not available either.

Note that to properly test some of these cases requires new tables in the test suite with no or partial stats.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org