You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Qifan Chen (Code Review)" <ge...@cloudera.org> on 2020/06/09 02:23:23 UTC
[Impala-ASF-CR] IMPALA-2658: Extend the NDV function to accept a precision

Qifan Chen has uploaded a new patch set (#14). ( http://gerrit.cloudera.org:8080/15997 )

Change subject: IMPALA-2658: Extend the NDV function to accept a precision
......................................................................

IMPALA-2658: Extend the NDV function to accept a precision

This work addresses the current limitation in NDV function by
extending the function to take the 2nd integer-typed argument,
which must be an abstract value in the range of 1 to 10. This
abstract value specifies a real precision value used in the HLL
algorithm for the function.

Front end work:
1. Add a new template ndv function in builtin db that takes two
   arguments.
2. Verify that the 2nd argument of a NDV() is an integer literal in
   [1,10];
3. A new method to implement the mapping of the abstract value to the
   hll precision (the real work is TBD);
4. The length of the intermediate data type is computed based on the
   actual hll precision. When the 2nd argument is missing, the length
   is 1024 as in the current implementation;
5. The 2nd argument, if present, will be carried over all the way to
   the BE.

Back end work:
1. Remove the hardcoded precision (10) from these functions:
     AggregateFunctions::HllInit(),
     AggregateFunctions::HllUpdate(),
     AggregateFunctions::HllMerge(),
     AggregateFunctions::HllFinalEstimate(),
     AggregateFunctions::HllFinalize(),
     HllEstimateBias();
2. Instead, the actual precision is computed from the
   length of the intermediate data type;
3. Verify that the length of the intermediate data type is
   correct according to the 2nd argument (if present).

Testing:
1. Add a regression test (test_ndv)) in TestAggregationQueries
   section to computes ndv() for every supported Impala data type.
2.  Run unit tests against other tables such as tpcds.store_sales
    and tpch.customer in both serial and parallel plan settings.
3.  Ran core tests

select ndv(c_name, 1) "one", ndv(c_name, 2) two, ndv(c_name, 3) three,
ndv(c_name, 4) as four, ndv(c_name, 5) as five, ndv(c_name, 6) as six,
ndv(c_name, 7) as seven, ndv(c_name, 8) as eight, ndv(c_name, 9) as nine,
ndv(c_name, 10)  as ten
from tpch.customer;

select ndv(ss_sold_time_sk, 1) "one", ndv(ss_sold_time_sk, 2) two,
ndv(ss_sold_time_sk, 3) three, ndv(ss_sold_time_sk, 4) as four,
ndv(ss_sold_time_sk, 5) as five, ndv(ss_sold_time_sk, 6) as six,
ndv(ss_sold_time_sk, 7) as seven, ndv(ss_sold_time_sk, 8) as eight,
ndv(ss_sold_time_sk, 9) as nine, ndv(ss_sold_time_sk, 10) as ten
from tpcds.store_sales;

Performance: ran two experiments on EC2 VM with a debug build:
1. Estimation error tests against a total of 22 different data sets:
10 with 10million unique values (5 strings, and 5 integers),
10 with around 100million unique values (5 strings, and 5 integers),
and 2 with around 500million unique values (1 string and 1 integer set).
The error is computed as abs(<true_UEC> - <estimated_UEC>) / <true_UEC>.

Overall, the precision of 18 (or the abstact value of 10) gives
the best result with worst estimation error at 0.42% (for a set of
10 mliion random 32 bit Integers), and average error no more than 0.17%,
at the cost of 256Kb of memory for the internal data structures per
evaluation of the Hll algorithm.  Other precisions (such as 16 and 17)
are also very reasonable but with slightly larger estimation errors.

2. Execution time tests against a total of 6 different data sets: it is
found the total execution time is relatively stable across different
anstraction values.

- 10 million unique strings: ~3.5s
- 10 million unique integers: ~3.34s
- 100 million unique strings: ~5.0s
- ~100 million unique integers: 5.0s
- ~500million unique strings: ~22.0s
- ~500million unique integers: ~19.0s

Change-Id: I48a4517bd0959f7021143073d37505a46c551a58
---
M be/src/common/logging.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M tests/query_test/test_aggregation.py
9 files changed, 368 insertions(+), 75 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/97/15997/14
-- 
To view, visit http://gerrit.cloudera.org:8080/15997
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I48a4517bd0959f7021143073d37505a46c551a58
Gerrit-Change-Number: 15997
Gerrit-PatchSet: 14
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>