Posted to issues-all@impala.apache.org by "Tim Armstrong (JIRA)" <ji...@apache.org> on 2019/01/23 23:53:00 UTC

[jira] [Commented] (IMPALA-7603) Incorrect NDV expression for col1 mathop col2

    [ https://issues.apache.org/jira/browse/IMPALA-7603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16750555#comment-16750555 ] 

Tim Armstrong commented on IMPALA-7603:
---------------------------------------

In the specific example of alltypes, we'd expect the NDV to ultimately be capped at 7300, right? That's the total number of rows in the table. Otherwise I generally agree with the idea of this JIRA: NDV(a) * NDV(b) is probably a more conservative estimate than max(NDV(a), NDV(b)), although it's impossible to know which is closer without knowing the correlation between the columns.
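
For illustration only, here is a minimal Java sketch of the two estimates being compared, with the product capped at the table's row count as suggested above. This is not Impala's frontend code: the class name, method names, and the sample NDV values in main() are hypothetical, and the sketch assumes non-negative inputs.

```java
/**
 * Hypothetical sketch, not Impala's implementation: compares the
 * max-based NDV estimate with a product-based estimate capped at
 * the table's row count. Assumes all inputs are non-negative.
 */
public final class NdvEstimateSketch {
    private NdvEstimateSketch() {}

    /** Estimate described in the issue: max(NDV(a), NDV(b)). */
    public static long maxEstimate(long ndvA, long ndvB) {
        return Math.max(ndvA, ndvB);
    }

    /**
     * Proposed estimate: NDV(a) * NDV(b), capped at rowCount, since a
     * table cannot produce more distinct values than it has rows.
     */
    public static long cappedProductEstimate(long ndvA, long ndvB, long rowCount) {
        long product;
        try {
            product = Math.multiplyExact(ndvA, ndvB);
        } catch (ArithmeticException overflow) {
            // The product does not fit in a long; the cap applies anyway.
            product = Long.MAX_VALUE;
        }
        return Math.min(product, rowCount);
    }

    public static void main(String[] args) {
        long rowCount = 7300;  // row count of alltypes, per the comment above
        long ndvId = 7300;     // NDV(id), per the verifyNdv calls in the issue
        long ndvOther = 10;    // assumed NDV for a second column (illustrative)

        System.out.println("max estimate:            " + maxEstimate(ndvId, ndvOther));
        System.out.println("capped product estimate: "
            + cappedProductEstimate(ndvId, ndvOther, rowCount));
    }
}
```

With one column already at NDV 7300, both estimates come out the same for this table; they only diverge when both columns have NDVs well below the row count.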

> Incorrect NDV expression for col1 mathop col2
> ---------------------------------------------
>
>                 Key: IMPALA-7603
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7603
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>            Reporter: Paul Rogers
>            Priority: Minor
>
> Consider the [{{ExprNdvTest}}|https://github.com/apache/impala/blob/master/fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java] test case. The code contains tests for the CASE expression. Add tests for simple arithmetic expressions:
> {noformat}
>     verifyNdv("id + 2", 7300);
>     verifyNdv("id * 2", 7300);
> {noformat}
> The above suggests that the NDV of a column op const is
> {noformat}
> max(NDV(column), NDV(const)) =
> max(NDV(column), 1) = NDV(column)
> {noformat}
> This is good and as expected.
> Now try two columns:
> {noformat}
>     verifyNdv("id + int_col", 7300);
>     verifyNdv("id * int_col", 7300);
> {noformat}
> This is *not* expected. Though the two columns are from the same table, they are not correlated: there is no reason to believe that the value of "id" determines the value of "int_col" in the general case. (Perhaps the table is the Cartesian product of the two fields.)
> In this case, the calculation should be:
> {noformat}
> NDV(a op b) = NDV(a) * NDV(b)
> {noformat}
> There might be some back-off to account for overlapping results (distinct input pairs that produce the same output value). Could not readily find a reference for these calculations.


