You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Paul Rogers (Code Review)" <ge...@cloudera.org> on 2018/10/02 19:50:17 UTC

[Impala-ASF-CR] Use NDV=1 for a Column with all nulls

Hello Philip Zeyliger, Impala Public Jenkins, Vuk Ercegovac, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11528

to look at the new patch set (#10).

Change subject: Use NDV=1 for a Column with all nulls
......................................................................

Use NDV=1 for a Column with all nulls

Modified the planner to handle low-value NDVs by adjusting them
upward by one to account for null values. Thus, even an all-null
column, which has an NDV of 0 in stats, will have an NDV of 1 in
the planner. (The planner already expects NDV to include nulls.)

Modified the front end to allow capturing the full plan for use in
a unit test. Added unit tests that verify estimated cardinality
for a plan as a way to verify that the fix solved the scenario
in IMPALA-7310.

Testing required a new table, similar to the existing nulltable,
but which has multiple rows and has stats calculated.

The change was limited to a very narrow range of cases:

* Base table column (not an internal column such as COUNT(*))
* Type is not BOOLEAN (turns out metadata does the needed
  NDV correction for BOOLEAN only.)
* Column has stats
* Column is nullable
* Column does not provide a null count, or null count > 0
* Reported NDV <= 1

Testing showed that, at least for the functional test tables,
we do have cases in which stats are computed, but the null
count is -1 (undefined), which is why null count had to
be considered. If we know the null count, and the null count
is zero, then no adjustment is needed, But, if we don't know
the null count, or it is positive, then adjustment may be
needed.

Research for this patch revealed that Impala treats NDVs in
two distinct ways:

* Stats (which use the NDV function) computes NDV as the number
  of distinct non-null values. (That is, the NDV of (0, null) is
  1.)
* The planner itself when working with constants, uses a definition
  of NDV that includes nulls. That is, the NDV of (0, null) is 2.

This fix attempts to bridge the two definitions, Since we know
that the NDV in stats excludes nulls (except for the BOOLEAN
type), and we know that the column contains nulls, we can bump
up the NDV to convert from the stats definition to the planner
definition. But, to avoid regressions, we do so in a very narrow
range of NDV values: only 0 and 1.

Technically, the adjustment should apply to all NDV values. However,
it turns out that if we do so, we end up with many failures in
PlannerTest in those tests that work with TPC-H tables.
The TPC-H tables have multiple columns marked as nullable but which
actually have no nulls. Some of these columns also have a low NDV.

By limiting the NDV adjustment to the narrow range, the TPC-H tests
need not be updated.  Since end users could have a similar situation,
the narrow range reduces the chance that this fix might impact such
workloads.

Although the change minimized impact on PlannerTest, some
memory numbers needed adjusting for a test in which one
column hit the criteria listed above and had its NDV adjusted.

Bug: IMPALA-7310: All-null columns give wrong estimates in planner
Change-Id: Ife657a43c9cafc451bd12ddf857dcb7169e97459
---
M fe/src/main/java/org/apache/impala/analysis/SlotDescriptor.java
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java
A fe/src/test/java/org/apache/impala/planner/CardinalityTest.java
A testdata/NullRows/data.csv
M testdata/bin/compute-table-stats.sh
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds-all.test
10 files changed, 452 insertions(+), 23 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/28/11528/10
-- 
To view, visit http://gerrit.cloudera.org:8080/11528
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ife657a43c9cafc451bd12ddf857dcb7169e97459
Gerrit-Change-Number: 11528
Gerrit-PatchSet: 10
Gerrit-Owner: Paul Rogers <pa...@yahoo.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pa...@yahoo.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <ve...@cloudera.com>