You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Joe McDonnell (Code Review)" <ge...@cloudera.org> on 2017/01/23 19:44:59 UTC

[Impala-ASF-CR] IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

Joe McDonnell has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/5768

Change subject: IMPALA-4792: Fix number of distinct values for a CASE with constant outputs
......................................................................

IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

If a Case expression has only constants as return values, then the number of
distinct values can be restricted to the number of outputs.

There is a limitation to this fix:
Expr.java calculates the number of distinct values by finding any SlotRefs
in its tree and taking the maximum of the distinct values from those SlotRefs.
Due to this, this fix will only take effect if the CaseExpr is at the top level
(i.e. no function is applied on top of it). For example:

-- explaining this statement shows cardinality = 2
explain select distinct case when id = 1 then 'yes' else 'no' end
from functional.alltypes;

-- explaining this statement shows cardinality = 7300
explain select distinct char_length(case when id = 1 then 'yes' else 'no' end)
from functional.alltypes;

Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
---
M fe/src/main/java/org/apache/impala/analysis/CaseExpr.java
1 file changed, 30 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/5768/1
-- 
To view, visit http://gerrit.cloudera.org:8080/5768
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>

[Impala-ASF-CR] IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change.

Change subject: IMPALA-4792: Fix number of distinct values for a CASE with constant outputs
......................................................................


Patch Set 3:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/5768/3/fe/src/main/java/org/apache/impala/analysis/CaseExpr.java
File fe/src/main/java/org/apache/impala/analysis/CaseExpr.java:

Line 393:     long numOutputConstants = 0;
> int is fine here, if you overflow that we have other problems.
Done


Line 402:       if ((i - loopStart) % 2 == 1 || (i == children_.size() - 1 && hasElseExpr_)) {
> check the negation and continue (to save one indentation level).
Done


Line 409:             if (constLiteralSet.add(outputLiteral)) ++numOutputConstants;
> nice optimization. is there a test for that?
I hand tested it. I was looking through our tests, and it would be easy for me to add a SQL to QueryTest/explain-level2.test to test this patch's behavior.


-- 
To view, visit http://gerrit.cloudera.org:8080/5768
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Hello Marcel Kornacker,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5768

to look at the new patch set (#8).

Change subject: IMPALA-4792: Fix number of distinct values for a CASE with constant outputs
......................................................................

IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

If all the return values of a Case expression have a known number of
distinct values (i.e. they are constant or statistics exist), then
the number of distinct values for the Case can be computed using this
information.

In order for the value from Case to be used at higher levels in the tree,
the implementation of computeNumDistinctValues for Expr needed to change.
Previously, Expr calculated the number of distinct values by finding any
SlotRefs in its tree and taking the maximum of the distinct values from
those SlotRefs. This would ignore the value from CaseExpr. To fix this,
Expr now takes the maximum number of distinct values across all of its
children.

-- explaining this statement shows cardinality = 2
explain select distinct case when id = 1 then 'yes' else 'no' end
from functional.alltypes;

-- explaining this statement shows cardinality = 2
explain select distinct char_length(case when id = 1 then 'yes' else 'no' end)
from functional.alltypes;

-- explaining this statement shows cardinality = 7300
explain select distinct case when id = 1 then 0 else id end
from functional.alltypes;

-- explaining this statement shows cardinality = 737 (date_string_col has lower
-- cardinality than id)
explain select distinct case when id = 1 then 'yes' else date_string_col end
from functional.alltypes;

For cases when the number of distinct values is not known for all the outputs,
this will return -1, indicating that the number of distinct values is not
known. The inputs (whens) are not used for calculating the number of distinct
values.

Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
---
M fe/src/main/java/org/apache/impala/analysis/CaseExpr.java
M fe/src/main/java/org/apache/impala/analysis/Expr.java
A fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java
3 files changed, 185 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/5768/8
-- 
To view, visit http://gerrit.cloudera.org:8080/5768
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>

[Impala-ASF-CR] IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

Posted by "Marcel Kornacker (Code Review)" <ge...@cloudera.org>.
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-4792: Fix number of distinct values for a CASE with constant outputs
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5768/2/fe/src/main/java/org/apache/impala/analysis/CaseExpr.java
File fe/src/main/java/org/apache/impala/analysis/CaseExpr.java:

Line 384:     // Otherwise, take a max over all the outputs as well as all the slotrefs that
> Yes. The alternative is to take the max over the 'then/else exprs' and igno
hm, not sure what to do here. we could try out both variants, and i can make arguments for both (don't throw information away; counterexample: case when a < 10 then 1 else a end will get it really wrong if ndv(a) == -1 but in reality it has 10^9 values).

or maybe start with the -1 propagation, and then see how many benchmark queries we regress.


-- 
To view, visit http://gerrit.cloudera.org:8080/5768
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has uploaded a new patch set (#7).

Change subject: IMPALA-4792: Fix number of distinct values for a CASE with constant outputs
......................................................................

IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

If all the return values of a Case expression have a known number of
distinct values (i.e. they are constant or statistics exist), then
the number of distinct values for the Case can be computed using this
information.

In order for the value from Case to be used at higher levels in the tree,
the implementation of computeNumDistinctValues for Expr needed to change.
Previously, Expr calculated the number of distinct values by finding any
SlotRefs in its tree and taking the maximum of the distinct values from
those SlotRefs. This would ignore the value from CaseExpr. To fix this,
Expr now takes the maximum number of distinct values across all of its
children.

-- explaining this statement shows cardinality = 2
explain select distinct case when id = 1 then 'yes' else 'no' end
from functional.alltypes;

-- explaining this statement shows cardinality = 2
explain select distinct char_length(case when id = 1 then 'yes' else 'no' end)
from functional.alltypes;

-- explaining this statement shows cardinality = 7300
explain select distinct case when id = 1 then 0 else id end
from functional.alltypes;

-- explaining this statement shows cardinality = 737 (date_string_col has lower
-- cardinality than id)
explain select distinct case when id = 1 then 'yes' else date_string_col end
from functional.alltypes;

For cases when the number of distinct values is not known for all the outputs,
this will return -1, indicating that the number of distinct values is not
known. The inputs (whens) are not used for calculating the number of distinct
values.

Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
---
M fe/src/main/java/org/apache/impala/analysis/CaseExpr.java
M fe/src/main/java/org/apache/impala/analysis/Expr.java
A fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java
3 files changed, 183 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/5768/7
-- 
To view, visit http://gerrit.cloudera.org:8080/5768
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>

[Impala-ASF-CR] IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-4792: Fix number of distinct values for a CASE with constant outputs
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5768/2/fe/src/main/java/org/apache/impala/analysis/CaseExpr.java
File fe/src/main/java/org/apache/impala/analysis/CaseExpr.java:

Line 384:     // Otherwise, take a max over all the outputs as well as all the slotrefs that
> hm, not sure what to do here. we could try out both variants, and i can mak
Agree. Either approach is fine to me. I don't think we use this Expr NDV information for any critical planning decisions, so I don't think the choice here will have any impact on plans. That's why I vote to fix these NDV issues aggressively while we are not yet relying on them too much :)


-- 
To view, visit http://gerrit.cloudera.org:8080/5768
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

Posted by "Marcel Kornacker (Code Review)" <ge...@cloudera.org>.
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-4792: Fix number of distinct values for a CASE with constant outputs
......................................................................


Patch Set 7: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5768/7/fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java
File fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java:

Line 37:   // This test queries two tables to allow testing missing statistics.
use java-style function comments

/**
 * This ...
 */


-- 
To view, visit http://gerrit.cloudera.org:8080/5768
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-4792: Fix number of distinct values for a CASE with constant outputs
......................................................................


Patch Set 6:

(5 comments)

I'm pretty happy with this change minus the remaining nits.

http://gerrit.cloudera.org:8080/#/c/5768/6/fe/src/main/java/org/apache/impala/analysis/CaseExpr.java
File fe/src/main/java/org/apache/impala/analysis/CaseExpr.java:

Line 385:     // sum the number of constants with the maximum distinct value for the
... sum the number of distinct constants with the maximum NDV for the non-constants.


http://gerrit.cloudera.org:8080/#/c/5768/6/fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java
File fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java:

Line 27:  * Tests computeNumDistinctValues estimates for Exprs
computeNumDistinctValues()


Line 44:   public void VerifyNdv(String stmtStr, long expectedNdv)
can we make this VerifyNdvStmt() or something so that we can use VerifyNdv() for the common case of writing tests?


Line 93:     // functional.tinytable (tiny) does not
move this into the function comment of VerifyNdvTwoTable()


Line 96:     VerifyNdvTwoTable("case when a.id = 1 then 'yes' " + 
remove whitespace


-- 
To view, visit http://gerrit.cloudera.org:8080/5768
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged.

Change subject: IMPALA-4792: Fix number of distinct values for a CASE with constant outputs
......................................................................


IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

If all the return values of a Case expression have a known number of
distinct values (i.e. they are constant or statistics exist), then
the number of distinct values for the Case can be computed using this
information.

In order for the value from Case to be used at higher levels in the tree,
the implementation of computeNumDistinctValues for Expr needed to change.
Previously, Expr calculated the number of distinct values by finding any
SlotRefs in its tree and taking the maximum of the distinct values from
those SlotRefs. This would ignore the value from CaseExpr. To fix this,
Expr now takes the maximum number of distinct values across all of its
children.

-- explaining this statement shows cardinality = 2
explain select distinct case when id = 1 then 'yes' else 'no' end
from functional.alltypes;

-- explaining this statement shows cardinality = 2
explain select distinct char_length(case when id = 1 then 'yes' else 'no' end)
from functional.alltypes;

-- explaining this statement shows cardinality = 7300
explain select distinct case when id = 1 then 0 else id end
from functional.alltypes;

-- explaining this statement shows cardinality = 737 (date_string_col has lower
-- cardinality than id)
explain select distinct case when id = 1 then 'yes' else date_string_col end
from functional.alltypes;

For cases when the number of distinct values is not known for all the outputs,
this will return -1, indicating that the number of distinct values is not
known. The inputs (whens) are not used for calculating the number of distinct
values.

Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
Reviewed-on: http://gerrit.cloudera.org:8080/5768
Reviewed-by: Alex Behm <al...@cloudera.com>
Tested-by: Impala Public Jenkins
---
M fe/src/main/java/org/apache/impala/analysis/CaseExpr.java
M fe/src/main/java/org/apache/impala/analysis/Expr.java
A fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java
3 files changed, 185 insertions(+), 8 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Alex Behm: Looks good to me, approved



-- 
To view, visit http://gerrit.cloudera.org:8080/5768
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>

[Impala-ASF-CR] IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-4792: Fix number of distinct values for a CASE with constant outputs
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5768/2/fe/src/main/java/org/apache/impala/analysis/CaseExpr.java
File fe/src/main/java/org/apache/impala/analysis/CaseExpr.java:

Line 384:     // Otherwise, take a max over all the outputs as well as all the slotrefs that
> would this make us throw away information where we're not doing that today?
Yes. The alternative is to take the max over the 'then/else exprs' and ignore those with an NDV of -1. This is more similar (but still different) than the old behavior.


-- 
To view, visit http://gerrit.cloudera.org:8080/5768
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change.

Change subject: IMPALA-4792: Fix number of distinct values for a CASE with constant outputs
......................................................................


Patch Set 6:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/5768/6/fe/src/main/java/org/apache/impala/analysis/CaseExpr.java
File fe/src/main/java/org/apache/impala/analysis/CaseExpr.java:

Line 385:     // sum the number of constants with the maximum distinct value for the
> ... sum the number of distinct constants with the maximum NDV for the non-c
Done


http://gerrit.cloudera.org:8080/#/c/5768/6/fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java
File fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java:

Line 27:  * Tests computeNumDistinctValues estimates for Exprs
> computeNumDistinctValues()
Done


Line 31:   public void VerifyNdvBasic(String expr, long expectedNdv)
> yeah, i think i messed that up.
Done


Line 44:   public void VerifyNdv(String stmtStr, long expectedNdv)
> can we make this VerifyNdvStmt() or something so that we can use VerifyNdv(
Done


Line 93:     // functional.tinytable (tiny) does not
> move this into the function comment of VerifyNdvTwoTable()
Done


Line 96:     VerifyNdvTwoTable("case when a.id = 1 then 'yes' " + 
> remove whitespace
Done, also fixed emacs to highlight these


-- 
To view, visit http://gerrit.cloudera.org:8080/5768
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-4792: Fix number of distinct values for a CASE with constant outputs
......................................................................


Patch Set 8:

Build started: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/229/

-- 
To view, visit http://gerrit.cloudera.org:8080/5768
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has uploaded a new patch set (#3).

Change subject: IMPALA-4792: Fix number of distinct values for a CASE with constant outputs
......................................................................

IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

If all the return values of a Case expression have a known number of
distinct values (i.e. they are constant or statistics exist), then
the number of distinct values for the Case can be computed using this
information.

In order for the value from Case to be used at higher levels in the tree,
the implementation of computeNumDistinctValues for Expr needed to change.
Previously, Expr calculated the number of distinct values by finding any
SlotRefs in its tree and taking the maximum of the distinct values from
those SlotRefs. This would ignore the value from CaseExpr. To fix this,
Expr now takes the maximum number of distinct values across all of its
children.

-- explaining this statement shows cardinality = 2
explain select distinct case when id = 1 then 'yes' else 'no' end
from functional.alltypes;

-- explaining this statement shows cardinality = 2
explain select distinct char_length(case when id = 1 then 'yes' else 'no' end)
from functional.alltypes;

-- explaining this statement shows cardinality = 7300
explain select distinct case when id = 1 then 0 else id end
from functional.alltypes;

-- explaining this statement shows cardinality = 737 (date_string_col has lower
-- cardinality than id)
explain select distinct case when id = 1 then 'yes' else date_string_col end
from functional.alltypes;

For cases when the number of distinct values is not known for all the outputs,
this will return -1, indicating that the number of distinct values is not
known.

Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
---
M fe/src/main/java/org/apache/impala/analysis/CaseExpr.java
M fe/src/main/java/org/apache/impala/analysis/Expr.java
2 files changed, 69 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/5768/3
-- 
To view, visit http://gerrit.cloudera.org:8080/5768
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>

[Impala-ASF-CR] IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change.

Change subject: IMPALA-4792: Fix number of distinct values for a CASE with constant outputs
......................................................................


Patch Set 5:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/5768/5/fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java
File fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java:

Line 32: 	String stmtStr = "select " + expr + " from functional.alltypes";
> remove tabs and trailing spaces
Done


Line 42:   public Expr CalcsNdvOk(String stmtStr, long expectedNdv)
> odd/not immediately intuitive name
Changed to VerifyNdv


Line 48:     // Go into the select list and get the expression
> superfluous comment (it simply states what's apparent from the next line), 
Done


Line 51: 
> use blank lines judiciously, to separate logically separate pieces of code
Done


Line 54:     return analyzedExpr;
> why return anything?
No reason. I have changed this to void.


Line 71:     CalcsNdvOkBasic("case when id = 1 then 'yes' when id = 2 then 'maybe' else 'no' end", 3);
> long line (and elsewhere)
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/5768
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

Posted by "Marcel Kornacker (Code Review)" <ge...@cloudera.org>.
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-4792: Fix number of distinct values for a CASE with constant outputs
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5768/2/fe/src/main/java/org/apache/impala/analysis/CaseExpr.java
File fe/src/main/java/org/apache/impala/analysis/CaseExpr.java:

Line 384:     // Otherwise, take a max over all the outputs as well as all the slotrefs that
> I agree. The behavior with incomplete stats is debatable.
would this make us throw away information where we're not doing that today?


-- 
To view, visit http://gerrit.cloudera.org:8080/5768
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

Posted by "Marcel Kornacker (Code Review)" <ge...@cloudera.org>.
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-4792: Fix number of distinct values for a CASE with constant outputs
......................................................................


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5768/6/fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java
File fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java:

Line 31:   public void VerifyNdvBasic(String expr, long expectedNdv)
follow java naming convention


-- 
To view, visit http://gerrit.cloudera.org:8080/5768
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-4792: Fix number of distinct values for a CASE with constant outputs
......................................................................


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5768/6/fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java
File fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java:

Line 31:   public void VerifyNdvBasic(String expr, long expectedNdv)
> follow java naming convention
Just curious: do you mean change it to verifyNdvBasic()? Our test validation functions usually start uppercase like AnalyzesOk(), ParsesOk(), RewritesOk() etc.


-- 
To view, visit http://gerrit.cloudera.org:8080/5768
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has uploaded a new patch set (#2).

Change subject: IMPALA-4792: Fix number of distinct values for a CASE with constant outputs
......................................................................

IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

If all the return values of a Case expression have a known number of
distinct values (i.e. they are constant or statistics exist), then
the number of distinct values for the Case can be computed using this
information.

In order for the value from Case to be used at higher levels in the tree,
the implementation of computeNumDistinctValues for Expr needed to change.
Previously, Expr calculated the number of distinct values by finding any
SlotRefs in its tree and taking the maximum of the distinct values from
those SlotRefs. This would ignore the value from CaseExpr. To fix this,
Expr now takes the maximum number of distinct values across all of its
children.

-- explaining this statement shows cardinality = 2
explain select distinct case when id = 1 then 'yes' else 'no' end
from functional.alltypes;

-- explaining this statement shows cardinality = 2
explain select distinct char_length(case when id = 1 then 'yes' else 'no' end)
from functional.alltypes;

-- explaining this statement shows cardinality = 7300
explain select distinct case when id = 1 then 0 else id end
from functional.alltypes;

-- explaining this statement shows cardinality = 737 (date_string_col has lower
-- cardinality than id)
explain select distinct case when id = 1 then 'yes' else date_string_col end
from functional.alltypes;

For cases when the number of distinct values is not known for all the outputs,
this will take a max over all the SlotRefs that are children of the inputs
as well as all the known outputs.

Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
---
M fe/src/main/java/org/apache/impala/analysis/CaseExpr.java
M fe/src/main/java/org/apache/impala/analysis/Expr.java
2 files changed, 101 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/5768/2
-- 
To view, visit http://gerrit.cloudera.org:8080/5768
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>

[Impala-ASF-CR] IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

Posted by "Marcel Kornacker (Code Review)" <ge...@cloudera.org>.
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-4792: Fix number of distinct values for a CASE with constant outputs
......................................................................


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5768/6/fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java
File fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java:

Line 31:   public void VerifyNdvBasic(String expr, long expectedNdv)
> Just curious: do you mean change it to verifyNdvBasic()? Our test validatio
yeah, i think i messed that up.


-- 
To view, visit http://gerrit.cloudera.org:8080/5768
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-4792: Fix number of distinct values for a CASE with constant outputs
......................................................................


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/5768/2/fe/src/main/java/org/apache/impala/analysis/CaseExpr.java
File fe/src/main/java/org/apache/impala/analysis/CaseExpr.java:

Line 384:     // Otherwise, take a max over all the outputs as well as all the slotrefs that
> The existing algorithm looks at all SlotRef descendants of the Expr. For th
I agree. The behavior with incomplete stats is debatable.

The old behavior doesn't really make sense in several cases (like case expr). I don't think we should try to preserve "compatibility" with it too much, especially since this Expr NDV information is not used for making critical plan choices (for now, there is almost no downside to trying to fix expr NDVs aggressively). Using the NDV of the 'when' exprs in any way seems wrong for case expr.

I'm thinking we should only consider the 'then exprs' and if the NDV for one of them is unknown, then we should make the NDV of the case expr unknown as well. Might want to ask Marcel what he thinks.


Line 388:     boolean allOutputsKnown = true;
> I will look into this. The way it works today is that Expr's analyze calls 
Agree.


-- 
To view, visit http://gerrit.cloudera.org:8080/5768
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-4792: Fix number of distinct values for a CASE with constant outputs
......................................................................


Patch Set 8: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/5768
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has uploaded a new patch set (#5).

Change subject: IMPALA-4792: Fix number of distinct values for a CASE with constant outputs
......................................................................

IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

If all the return values of a Case expression have a known number of
distinct values (i.e. they are constant or statistics exist), then
the number of distinct values for the Case can be computed using this
information.

In order for the value from Case to be used at higher levels in the tree,
the implementation of computeNumDistinctValues for Expr needed to change.
Previously, Expr calculated the number of distinct values by finding any
SlotRefs in its tree and taking the maximum of the distinct values from
those SlotRefs. This would ignore the value from CaseExpr. To fix this,
Expr now takes the maximum number of distinct values across all of its
children.

-- explaining this statement shows cardinality = 2
explain select distinct case when id = 1 then 'yes' else 'no' end
from functional.alltypes;

-- explaining this statement shows cardinality = 2
explain select distinct char_length(case when id = 1 then 'yes' else 'no' end)
from functional.alltypes;

-- explaining this statement shows cardinality = 7300
explain select distinct case when id = 1 then 0 else id end
from functional.alltypes;

-- explaining this statement shows cardinality = 737 (date_string_col has lower
-- cardinality than id)
explain select distinct case when id = 1 then 'yes' else date_string_col end
from functional.alltypes;

For cases when the number of distinct values is not known for all the outputs,
this will return -1, indicating that the number of distinct values is not
known. The inputs (whens) are not used for calculating the number of distinct
values.

Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
---
M fe/src/main/java/org/apache/impala/analysis/CaseExpr.java
M fe/src/main/java/org/apache/impala/analysis/Expr.java
A fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java
3 files changed, 183 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/5768/5
-- 
To view, visit http://gerrit.cloudera.org:8080/5768
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>

[Impala-ASF-CR] IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

Posted by "Marcel Kornacker (Code Review)" <ge...@cloudera.org>.
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-4792: Fix number of distinct values for a CASE with constant outputs
......................................................................


Patch Set 3:

(3 comments)

regarding testing, let's discuss that.

http://gerrit.cloudera.org:8080/#/c/5768/3/fe/src/main/java/org/apache/impala/analysis/CaseExpr.java
File fe/src/main/java/org/apache/impala/analysis/CaseExpr.java:

Line 393:     long numOutputConstants = 0;
int is fine here, if you overflow that we have other problems.


Line 402:       if ((i - loopStart) % 2 == 1 || (i == children_.size() - 1 && hasElseExpr_)) {
check the negation and continue (to save one indentation level).


Line 409:             if (constLiteralSet.add(outputLiteral)) ++numOutputConstants;
nice optimization. is there a test for that?


-- 
To view, visit http://gerrit.cloudera.org:8080/5768
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change.

Change subject: IMPALA-4792: Fix number of distinct values for a CASE with constant outputs
......................................................................


Patch Set 2:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/5768/2/fe/src/main/java/org/apache/impala/analysis/CaseExpr.java
File fe/src/main/java/org/apache/impala/analysis/CaseExpr.java:

Line 384:     // Otherwise, take a max over all the outputs as well as all the slotrefs that
> Why would we consider the slotrefs of the input? They don't affect the NDV 
The existing algorithm looks at all SlotRef descendants of the Expr. For the CaseExpr, that includes the when expressions. My thought for the code change was to improve cases where we know we can do better and fall back to the existing behavior otherwise.

If we know the number of distinct values for all the outputs, then we ignore the inputs completely.

Otherwise, we have at least one output with no statistics. I think it is debatable what to do in this case. Here is an example SQL:

select distinct case when a.id > 10000 then b.number else a.number end
from a, b;

If we have statistics on table a but not table b, what do we do?

Option 1: Return that we don't know the number of distinct values.
Option 2: Return the max number of distinct values over the then+else clauses. In this case, that would mean returning the number of distinct values for a.number.
Option 3: Return the max over all SlotRefs (input and output). This would be the max distinct values over a.number and a.id.

Suppose a.number has 500 distinct values but a.id has 7300 distinct values. 7300 distinct values implies that there could be 7300 rows coming in (but even this is inexact, as where clauses can completely change this). Are we better off returning 500 or 7300? Neither is particularly good. There might be 7300 distinct values for a.id and 1,500,000 for b.number. There might be 7300 distinct values for a.id and 300 for b.number. 

It is more a question of what to do with incomplete information. Option 3 would err on the side of guessing too many distinct values, but it can still be completely wrong.


Line 388:     boolean allOutputsKnown = true;
> Why not put this new code into an override of computeNumDistinctValues()?
I will look into this. The way it works today is that Expr's analyze calls computeNumDistinctValues and subclasses override that by setting numDistinctValues_. This is done in LiteralExpr, Predicate, and SlotRef.

I think this might be due to the order of execution. SlotRef calls super.analyze first, but then it still needs to do more setup before initializing numDistinctValues_. CaseExpr casts expressions after super.analyze. I don't think that would impact the correctness of computeNumDistinctValues. LiteralExpr and Predicate should be fine overriding computeNumDistinctValues.


Line 390:     long    constOutputVals = 0;
> formatting: long constOutputVals (just one space after type)
Done


Line 392:     long    maxOutputVals = -1;
> maxNonConstNdv? add comment describing what this is
Done


Line 415:           if (!constLiteralSet.contains(thenLiteral)) {
> you can add() and check the return value to see whether the item was alread
Done


Line 424: 
> remove bank line
Done


Line 425:         if (thenNumDistinct == -1) {
> single line
Done


Line 432:     if (hasElseExpr_) {
> good bit of common code between the handing of this else expr and the then 
Combined the then/else part by changing the iteration


Line 460:         // All must be constant, because if we hit any slot, this would be set
> slot -> SlotRef
Done


http://gerrit.cloudera.org:8080/#/c/5768/2/fe/src/main/java/org/apache/impala/analysis/Expr.java
File fe/src/main/java/org/apache/impala/analysis/Expr.java:

Line 304:       // get the max number of distinct values over all the children of this node
> sufficient to write:
Done


Line 308:         // only the SlotRefs, except that it allows a Expr to override the values
> nit: an Expr
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/5768
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-4792: Fix number of distinct values for a CASE with constant outputs
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5768/1//COMMIT_MSG
Commit Message:

Line 12: There is a limitation to this fix:
To fix this we might consider changing Expr.computeNumDistinctValues() to compute the max NDV over all children, instead of the max NDV over the contained SlotRefs.

I think that should yield the same NDV for those Exprs where the original SlotRef-based NDV made sense.

What do you think?


-- 
To view, visit http://gerrit.cloudera.org:8080/5768
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

Posted by "Marcel Kornacker (Code Review)" <ge...@cloudera.org>.
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-4792: Fix number of distinct values for a CASE with constant outputs
......................................................................


Patch Set 4:

looks good, but let's add tests

-- 
To view, visit http://gerrit.cloudera.org:8080/5768
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has uploaded a new patch set (#4).

Change subject: IMPALA-4792: Fix number of distinct values for a CASE with constant outputs
......................................................................

IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

If all the return values of a Case expression have a known number of
distinct values (i.e. they are constant or statistics exist), then
the number of distinct values for the Case can be computed using this
information.

In order for the value from Case to be used at higher levels in the tree,
the implementation of computeNumDistinctValues for Expr needed to change.
Previously, Expr calculated the number of distinct values by finding any
SlotRefs in its tree and taking the maximum of the distinct values from
those SlotRefs. This would ignore the value from CaseExpr. To fix this,
Expr now takes the maximum number of distinct values across all of its
children.

-- explaining this statement shows cardinality = 2
explain select distinct case when id = 1 then 'yes' else 'no' end
from functional.alltypes;

-- explaining this statement shows cardinality = 2
explain select distinct char_length(case when id = 1 then 'yes' else 'no' end)
from functional.alltypes;

-- explaining this statement shows cardinality = 7300
explain select distinct case when id = 1 then 0 else id end
from functional.alltypes;

-- explaining this statement shows cardinality = 737 (date_string_col has lower
-- cardinality than id)
explain select distinct case when id = 1 then 'yes' else date_string_col end
from functional.alltypes;

For cases when the number of distinct values is not known for all the outputs,
this will return -1, indicating that the number of distinct values is not
known. The inputs (whens) are not used for calculating the number of distinct
values.

Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
---
M fe/src/main/java/org/apache/impala/analysis/CaseExpr.java
M fe/src/main/java/org/apache/impala/analysis/Expr.java
2 files changed, 73 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/5768/4
-- 
To view, visit http://gerrit.cloudera.org:8080/5768
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>

[Impala-ASF-CR] IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-4792: Fix number of distinct values for a CASE with constant outputs
......................................................................


Patch Set 8: Verified+1

-- 
To view, visit http://gerrit.cloudera.org:8080/5768
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-4792: Fix number of distinct values for a CASE with constant outputs
......................................................................


Patch Set 2:

(13 comments)

http://gerrit.cloudera.org:8080/#/c/5768/2/fe/src/main/java/org/apache/impala/analysis/CaseExpr.java
File fe/src/main/java/org/apache/impala/analysis/CaseExpr.java:

Line 384:     // Otherwise, take a max over all the outputs as well as all the slotrefs that
Why would we consider the slotrefs of the input? They don't affect the NDV of the output.


Line 388:     boolean allOutputsKnown = true;
Why not put this new code into an override of computeNumDistinctValues()?


Line 390:     long    constOutputVals = 0;
formatting: long constOutputVals (just one space after type)


Line 392:     long    maxOutputVals = -1;
maxNonConstNdv? add comment describing what this is


Line 400:       // look at the when is to get an idea of the number of input rows, so
I'm a little confused by this. The input of the when clauses won't tell us anything about the number of input rows. Do you mwan input NDV? Why would we even consider the when exprs for computing the NDV? Are you trying to determine how likely a when expr is going to be true?


Line 415:           if (!constLiteralSet.contains(thenLiteral)) {
you can add() and check the return value to see whether the item was already contained


Line 424: 
remove bank line


Line 425:         if (thenNumDistinct == -1) {
single line


Line 432:     if (hasElseExpr_) {
good bit of common code between the handing of this else expr and the then exprs above, please factor out common code (e.g. with a helper function, or by first assembling a list of all relevant exprs)


Line 460:         // All must be constant, because if we hit any slot, this would be set
slot -> SlotRef


Line 466:       numDistinctValues_ = Math.max(maxInputVals, maxOutputVals);
Still not following this part. Why are the maxInputVals relevant here?


http://gerrit.cloudera.org:8080/#/c/5768/2/fe/src/main/java/org/apache/impala/analysis/Expr.java
File fe/src/main/java/org/apache/impala/analysis/Expr.java:

Line 304:       // get the max number of distinct values over all the children of this node
sufficient to write:
"... over all children"


Line 308:         // only the SlotRefs, except that it allows a Expr to override the values
nit: an Expr


-- 
To view, visit http://gerrit.cloudera.org:8080/5768
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change.

Change subject: IMPALA-4792: Fix number of distinct values for a CASE with constant outputs
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5768/1//COMMIT_MSG
Commit Message:

Line 12: There is a limitation to this fix:
> To fix this we might consider changing Expr.computeNumDistinctValues() to c
I agree that this is what we should be doing.

My first thought was that this would be substantially more complicated and might require overriding this function in several classes. Now that I'm looking at it again, I think it wouldn't break anything to do this in Expr and then only override CaseExpr.java. I will modify this transaction to try this out.


-- 
To view, visit http://gerrit.cloudera.org:8080/5768
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

Posted by "Marcel Kornacker (Code Review)" <ge...@cloudera.org>.
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-4792: Fix number of distinct values for a CASE with constant outputs
......................................................................


Patch Set 5:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/5768/5/fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java
File fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java:

Line 32: 	String stmtStr = "select " + expr + " from functional.alltypes";
remove tabs and trailing spaces


Line 42:   public Expr CalcsNdvOk(String stmtStr, long expectedNdv)
odd/not immediately intuitive name


Line 48:     // Go into the select list and get the expression
superfluous comment (it simply states what's apparent from the next line), remove


Line 51: 
use blank lines judiciously, to separate logically separate pieces of code


Line 54:     return analyzedExpr;
why return anything?


Line 71:     CalcsNdvOkBasic("case when id = 1 then 'yes' when id = 2 then 'maybe' else 'no' end", 3);
long line (and elsewhere)


-- 
To view, visit http://gerrit.cloudera.org:8080/5768
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has uploaded a new patch set (#6).

Change subject: IMPALA-4792: Fix number of distinct values for a CASE with constant outputs
......................................................................

IMPALA-4792: Fix number of distinct values for a CASE with constant outputs

If all the return values of a Case expression have a known number of
distinct values (i.e. they are constant or statistics exist), then
the number of distinct values for the Case can be computed using this
information.

In order for the value from Case to be used at higher levels in the tree,
the implementation of computeNumDistinctValues for Expr needed to change.
Previously, Expr calculated the number of distinct values by finding any
SlotRefs in its tree and taking the maximum of the distinct values from
those SlotRefs. This would ignore the value from CaseExpr. To fix this,
Expr now takes the maximum number of distinct values across all of its
children.

-- explaining this statement shows cardinality = 2
explain select distinct case when id = 1 then 'yes' else 'no' end
from functional.alltypes;

-- explaining this statement shows cardinality = 2
explain select distinct char_length(case when id = 1 then 'yes' else 'no' end)
from functional.alltypes;

-- explaining this statement shows cardinality = 7300
explain select distinct case when id = 1 then 0 else id end
from functional.alltypes;

-- explaining this statement shows cardinality = 737 (date_string_col has lower
-- cardinality than id)
explain select distinct case when id = 1 then 'yes' else date_string_col end
from functional.alltypes;

For cases when the number of distinct values is not known for all the outputs,
this will return -1, indicating that the number of distinct values is not
known. The inputs (whens) are not used for calculating the number of distinct
values.

Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
---
M fe/src/main/java/org/apache/impala/analysis/CaseExpr.java
M fe/src/main/java/org/apache/impala/analysis/Expr.java
A fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java
3 files changed, 184 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/5768/6
-- 
To view, visit http://gerrit.cloudera.org:8080/5768
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I21dbdaad8452b7e58c477612b47847dccd9d98d2
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>