Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2021/04/22 05:35:00 UTC
[jira] [Commented] (IMPALA-2658) Extend the NDV function to accept a precision

    [ https://issues.apache.org/jira/browse/IMPALA-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327114#comment-17327114 ] 

ASF subversion and git services commented on IMPALA-2658:
---------------------------------------------------------

Commit 1fb7dbac0d43f3ccbbbbaaf9c41db10d3320fc48 in impala's branch refs/heads/master from fifteencai
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=1fb7dba ]

IMPALA-10445: Adjust NDV's scale with query option

This is a new way to control NDV's scale.

Since IMPALA-2658, we can trade memory for more accurate
estimation by setting a larger `scale` in the SQL function
NDV(<expr>, <scale>). However, using a larger NDV scale requires
modifying SQL queries, which may not be practical in certain
applications:

- Firstly, SQL writers are reluctant to lower the scale. They are
prone to max it out, which can make the cluster unstable, especially
when there are `group by`s with high cardinalities. So it is wiser to
let the cluster admin, rather than the SQL writer, choose an
appropriate scale.

- Secondly, in some application scenarios, queries are stored in
databases. In a BI system, for example, rewriting thousands of SQL
statements is risky.

In this commit, we introduced a new Query Option `DEFAULT_NDV_SCALE`
with the following semantics:

1. The allowed value is in the range [1..10];
2. Previously, the scale used in NDV(<expr>) functions was fixed at 2.
Now the scale is provided by the newly added query option.
3. It does not influence the NDV scale for SQL function
NDV(<expr>, <scale>) in which the NDV scale is provided by the 2nd
argument <scale>.
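
The three rules above amount to a simple resolution order. The
following is a minimal Python sketch of that order, not Impala's
actual code; the helper name `effective_ndv_scale` is hypothetical
and used only for illustration.

```python
MIN_SCALE, MAX_SCALE = 1, 10  # allowed range stated in rule 1


def effective_ndv_scale(explicit_scale=None, default_ndv_scale=2):
    """Return the scale an NDV call would use (hypothetical helper).

    explicit_scale    -- the 2nd argument of NDV(<expr>, <scale>), if given
    default_ndv_scale -- the value of the DEFAULT_NDV_SCALE query option
    """
    # Rule 3: an explicit second argument always wins over the option.
    # Rule 2: otherwise the query option supplies the scale.
    scale = default_ndv_scale if explicit_scale is None else explicit_scale
    # Rule 1: only values in [1..10] are accepted.
    if not MIN_SCALE <= scale <= MAX_SCALE:
        raise ValueError(
            f"NDV scale must be in [{MIN_SCALE}..{MAX_SCALE}], got {scale}")
    return scale
```

With the option left at its default, `effective_ndv_scale()` yields 2,
which matches the previous fixed behavior.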

We also refactored method `Analyze` to make sure APPX_COUNT_DISTINCT
can work with this query option. After this, cluster admins can
substitute `count(distinct <expr>)` with `ndv(<expr>, scale)`.

Implementation details:

- The default value of DEFAULT_NDV_SCALE is 2, so we won't change
the default ndv behavior.
- We ported the `CountDistinctToNdv` transform logic from
`SelectStmt.analyze()` to `ExprRewriter`, making it compatible with
further rewrite rules.
- The newly added rewrite rule `DefaultNdvScaleRule` is applied
after `CountDistinctToNdvRule`.
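
To illustrate why the rule order matters, here is a toy Python model
of the two rewrites. It is only a sketch under simplifying
assumptions: expressions are plain tuples, the function names are
hypothetical, and the scale is always made explicit. The real rules
are part of Impala's Java front end and are not reproduced here.

```python
def count_distinct_to_ndv(expr, appx_count_distinct):
    # ("count_distinct", arg) -> ("ndv", arg), only when the
    # APPX_COUNT_DISTINCT query option is enabled.
    if appx_count_distinct and expr[0] == "count_distinct":
        return ("ndv", expr[1])
    return expr


def default_ndv_scale_rule(expr, default_ndv_scale):
    # ("ndv", arg) -> ("ndv", arg, scale). A call that already
    # carries an explicit scale is left untouched.
    if expr[0] == "ndv" and len(expr) == 2:
        return ("ndv", expr[1], default_ndv_scale)
    return expr


def rewrite(expr, appx_count_distinct=False, default_ndv_scale=2):
    # Order as described above: count(distinct) becomes ndv first,
    # so the scale rule can see the resulting ndv call.
    expr = count_distinct_to_ndv(expr, appx_count_distinct)
    return default_ndv_scale_rule(expr, default_ndv_scale)
```

If the two steps ran in the opposite order, a `count(distinct <expr>)`
would be turned into a bare `ndv(<expr>)` after the scale rule had
already run, and the query option would not reach it.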

Usage:

To set a default ndv scale:
```
SET DEFAULT_NDV_SCALE = 10;
```

To restore the default:
```
SET DEFAULT_NDV_SCALE = 2;
```

Here are test results of a typical workload (cardinality=40,090,650):
+====================================================================+
|   Metric    | Count Distinct |    NDV2    |    NDV5    |    NDV10  |
+--------------------------------------------------------------------+
|  Memory(GB) |       3.83     |    1.84    |    1.85    |     1.89  |
| Duration(s) |      182.89    |   30.22    |    29.72   |     29.24 |
|  ErrorRate  |        0%      |    1.8%    |    1.17%   |     0.06% |
+====================================================================+
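
The ratios implied by the table can be derived with a few lines of
Python. This sketch only copies the figures above and divides them;
the dictionary layout and function names are illustrative.

```python
# Figures copied from the table above.
results = {
    "count_distinct": {"memory_gb": 3.83, "duration_s": 182.89, "error_pct": 0.0},
    "ndv2": {"memory_gb": 1.84, "duration_s": 30.22, "error_pct": 1.8},
    "ndv5": {"memory_gb": 1.85, "duration_s": 29.72, "error_pct": 1.17},
    "ndv10": {"memory_gb": 1.89, "duration_s": 29.24, "error_pct": 0.06},
}


def speedup(variant):
    # How many times faster the variant ran than exact count(distinct).
    return results["count_distinct"]["duration_s"] / results[variant]["duration_s"]


def memory_ratio(variant, baseline="count_distinct"):
    # Memory used by the variant relative to the chosen baseline.
    return results[variant]["memory_gb"] / results[baseline]["memory_gb"]


for name in ("ndv2", "ndv5", "ndv10"):
    print(f"{name}: {speedup(name):.2f}x faster, "
          f"{memory_ratio(name):.0%} of exact memory, "
          f"{results[name]['error_pct']}% error")
```

In this workload, `memory_ratio("ndv10", "ndv2")` shows that moving
from scale 2 to scale 10 costs under 3% more memory while the error
rate drops from 1.8% to 0.06%.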

Testing:
1) Added 3 unit test cases in `ExprRewriteRulesTest`.
2) Added 5 unit test cases in `ExprRewriterTest`.
3) Ran all front-end unit tests; all passed.
4) Added a new query-option test.

Change-Id: I1669858a6e8252e167b464586e8d0b6cb0d0bd50
Reviewed-on: http://gerrit.cloudera.org:8080/17306
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Extend the NDV function to accept a precision
> ---------------------------------------------
>
>                 Key: IMPALA-2658
>                 URL: https://issues.apache.org/jira/browse/IMPALA-2658
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 2.2.4
>            Reporter: Peter Ebert
>            Assignee: Qifan Chen
>            Priority: Minor
>              Labels: ramp-up
>             Fix For: Impala 4.0
>
>         Attachments: Comparison of HLL Memory usage, Query Duration and Accuracy.jpg
>
>
> The HyperLogLog algorithm used by NDV defaults to a precision of 10.  Being able to set this precision would have two benefits:
> # Lower precisions can improve performance, as a precision of 9 has half the number of registers of a precision of 10 (the register count grows exponentially with precision) and may be just as accurate depending on the expected cardinality.
> # Higher precision can help with very large cardinalities (100 million to billion range) and will typically provide more accurate data.  Those who are presenting estimates to end users will likely be willing to trade some performance cost for more accuracy, while still outperforming the naive approach by a large margin.
> Propose adding the overloaded function NDV(expression, int precision)
> with an accepted range between 4 and 18 inclusive.
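
The register arithmetic in the description can be sketched in Python.
This assumes the standard HyperLogLog formulation, in which a
precision p uses 2^p registers and the relative standard error is
roughly 1.04 / sqrt(2^p). The function names are illustrative, and
the sketch says nothing about how Impala lays out its registers.

```python
import math


def hll_registers(precision):
    # Standard HyperLogLog: one register per p-bit hash prefix.
    return 1 << precision


def hll_relative_std_error(precision):
    # Textbook approximation of the relative standard error.
    return 1.04 / math.sqrt(hll_registers(precision))


for p in (9, 10, 18):
    print(f"precision {p}: {hll_registers(p)} registers, "
          f"~{hll_relative_std_error(p):.2%} std error")
```

Each extra bit of precision doubles the register count but shrinks
the expected error only by a factor of sqrt(2), which is the
memory-for-accuracy trade the issue describes.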


