Posted to issues-all@impala.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2018/09/20 21:07:00 UTC

[jira] [Created] (IMPALA-7602) Definition of NDV differs between planner and stats mechanism

Paul Rogers created IMPALA-7602:
-----------------------------------

             Summary: Definition of NDV differs between planner and stats mechanism
                 Key: IMPALA-7602
                 URL: https://issues.apache.org/jira/browse/IMPALA-7602
             Project: IMPALA
          Issue Type: Improvement
          Components: Frontend
            Reporter: Paul Rogers


IMPALA-7310 states that the Impala NDV function is implemented as "number of non-null distinct values" and that the stats-gathering mechanism uses the same definition.

The comments on that issue point to [{{ExprNdvTest}}|https://github.com/apache/impala/blob/master/fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java], which shows that the planner itself, when working with constant expressions, counts NULL as a distinct value.

In the case described in IMPALA-7310, this means that a column containing only nulls has NDV=0 if stats are used, but NDV=1 if constants are used.
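
To make the difference concrete, here is a minimal sketch of the two counting rules. It is written in Python rather than taken from Impala's code, and the function names are hypothetical; NULL is modeled as None.

```python
def ndv_non_null(values):
    """Distinct values ignoring NULL: the rule IMPALA-7310 attributes to NDV() and stats."""
    return len({v for v in values if v is not None})


def ndv_null_as_value(values):
    """Distinct values with NULL counted as one more value: the rule seen for constants."""
    return len(set(values))


all_nulls = [None, None, None]
print(ndv_non_null(all_nulls))       # 0
print(ndv_null_as_value(all_nulls))  # 1
```

The two rules agree on any column without nulls and differ by exactly one otherwise.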

This is a minor point, but it would be good to use a single definition everywhere. That way, if we use the "number of non-null distinct values" rule, the "adjusted NDV" is always one more than the "raw" NDV. As it is now, we can't be sure when to add the null adjustment because we don't know whether it is already included.
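
The ambiguity can be sketched as follows. This is an illustration only: adjusted_ndv is a hypothetical helper, not a planner function, and it assumes its input follows the "non-null distinct values" rule.

```python
def adjusted_ndv(raw_ndv):
    """Add one slot for NULL, assuming raw_ndv does NOT already count NULL."""
    return raw_ndv + 1


column = ["a", "b", None]

# Raw NDV under each of the two definitions (None models NULL).
raw_from_stats = len({v for v in column if v is not None})  # NULL excluded
raw_from_constants = len(set(column))                       # NULL included

print(adjusted_ndv(raw_from_stats))      # 3: "a", "b", and NULL
print(adjusted_ndv(raw_from_constants))  # 4: NULL is counted twice
```

With a single definition, the caller would not need to know which path produced the raw value.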


