You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Paul Rogers (Code Review)" <ge...@cloudera.org> on 2019/03/04 19:55:04 UTC

[Impala-ASF-CR] IMPALA-8014: Incorrect FK/PK cardinality estimation

Paul Rogers has uploaded this change for review. ( http://gerrit.cloudera.org:8080/12535


Change subject: IMPALA-8014: Incorrect FK/PK cardinality estimation
......................................................................

IMPALA-8014: Incorrect FK/PK cardinality estimation

Impala uses two techniques to estimate join cardinality: one for FK/PK,
another for the "generic" case. (IMPALA-8018 suggests we combine the
two. But, for the purposes of this fix, we leave them separate.)

The code for the FK/PK estimation is mostly right for a key that
consists of a singe column. But, if a key is compound (contains more
than one column), the calculations can produce incorrect results.

This patch modifies the code to use the standard join calculation:
|join| = |left| * |right| / max(|left key|, |right key|).

The math for compound keys had a number of errors. Since no test tables
represent a proper compound key, all tests use unrealistic keys such as
a unique Id along with a second, superfluous column. To make the math
come out OK for these, the code included adjustments that misfired on
real-world keys. This patch fixes those errors. A separate patch
introduces new tables with the required schema.

Tests: Many planner tests were modified to reflect the new join
cardinality estimate. Some plans changed, including some TPC-H plans.
Inspected each change to verify that the numbers are correct using the
formula cited above.

Turns out that the change caused a spill test to fail. The new plan is
more efficient and didn't require as much memory. Shrank the memory
given to the query to force the newer, more efficient plan to also
spill.

Testing was also done using a production query that misfired without
this fix, but that produced correct results with this fix.

Change-Id: Id0db82564596b0a9271b89b8e3f7a3cf92d306da
---
M fe/src/main/java/org/apache/impala/planner/JoinNode.java
M fe/src/main/java/org/apache/impala/planner/PlanNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/workloads/functional-planner/queries/PlannerTest/card-inner-join.test
M testdata/workloads/functional-planner/queries/PlannerTest/card-multi-join.test
M testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection.test
M testdata/workloads/functional-planner/queries/PlannerTest/inline-view.test
M testdata/workloads/functional-planner/queries/PlannerTest/join-order.test
M testdata/workloads/functional-planner/queries/PlannerTest/joins.test
M testdata/workloads/functional-planner/queries/PlannerTest/nested-collections.test
M testdata/workloads/functional-planner/queries/PlannerTest/order.test
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds-all.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpch-all.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpch-kudu.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpch-views.test
M testdata/workloads/functional-query/queries/QueryTest/spilling.test
17 files changed, 1,909 insertions(+), 1,829 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/35/12535/4
-- 
To view, visit http://gerrit.cloudera.org:8080/12535
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Id0db82564596b0a9271b89b8e3f7a3cf92d306da
Gerrit-Change-Number: 12535
Gerrit-PatchSet: 4
Gerrit-Owner: Paul Rogers <pr...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pr...@cloudera.com>

[Impala-ASF-CR] IMPALA-8014: Incorrect FK/PK cardinality estimation

Posted by "Paul Rogers (Code Review)" <ge...@cloudera.org>.
Paul Rogers has posted comments on this change. ( http://gerrit.cloudera.org:8080/12535 )

Change subject: IMPALA-8014: Incorrect FK/PK cardinality estimation
......................................................................


Patch Set 4:

Passed pre-review tests: https://jenkins.impala.io/job/pre-review-test/328/


-- 
To view, visit http://gerrit.cloudera.org:8080/12535
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id0db82564596b0a9271b89b8e3f7a3cf92d306da
Gerrit-Change-Number: 12535
Gerrit-PatchSet: 4
Gerrit-Owner: Paul Rogers <pr...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pr...@cloudera.com>
Gerrit-Comment-Date: Mon, 04 Mar 2019 19:55:31 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8014: Revise FK/PK cardinality estimation

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/12535 )

Change subject: IMPALA-8014: Revise FK/PK cardinality estimation
......................................................................


Patch Set 5:

(13 comments)

http://gerrit.cloudera.org:8080/#/c/12535/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12535/5//COMMIT_MSG@28
PS5, Line 28: including some TPC-H plans
did you have any chance to compare the before-and-after plans from a performance perspective? Same with DS, where it seems a few plans changed?

(not that we should slavishly adhere to those benchmarks if we think the new heuristic is better, but if we've regressed it would be good to be explicit about how much)


http://gerrit.cloudera.org:8080/#/c/12535/5/fe/src/main/java/org/apache/impala/planner/JoinNode.java
File fe/src/main/java/org/apache/impala/planner/JoinNode.java:

http://gerrit.cloudera.org:8080/#/c/12535/5/fe/src/main/java/org/apache/impala/planner/JoinNode.java@327
PS5, Line 327:    * By definition, the detail (fk) table is on the left, the master (pk) table
I haven't seen the terminology "detail" and "master" being used anywhere else in the planner. Are these standard DB terminology? I think we should avoid introducing new terms not used elsewhere in the codebase.

To me, using the terms "fact" and "dimension" might make sense, or just stick to LHS and RHS, since it's conventional throughput Impala that RHS is the smaller "build" table whereas LHS is the larger "probe" table.


http://gerrit.cloudera.org:8080/#/c/12535/5/fe/src/main/java/org/apache/impala/planner/JoinNode.java@331
PS5, Line 331:    * can assume a cap on the NDV of a joint FK: it must be no larger than the NDV
when you say "NDV of a joint FK", should that be "of a composite FK"? ie the case you're differentiating is when the equijoin conjunct is on multiple columns? By some quick googling, the term "composite key" seems to be the most common name for this case


http://gerrit.cloudera.org:8080/#/c/12535/5/fe/src/main/java/org/apache/impala/planner/JoinNode.java@332
PS5, Line 332:    * of the FK, which is (by definition) the same as the master table cardinality.
isn't it by definition _bound_ by the master table cardinality, not the same as? ie there may be items in the PK table that are not referenced, so its cardinality may be larger.


http://gerrit.cloudera.org:8080/#/c/12535/5/fe/src/main/java/org/apache/impala/planner/JoinNode.java@339
PS5, Line 339:    * 3. Containment: that if the FK side has fewer keys than the PK side, all
excuse my ignorance, but is "containment" the common term for this assumption? To me, this is "referential integrity" -- ie that all FKs reference an existing row in the PK table.


http://gerrit.cloudera.org:8080/#/c/12535/5/fe/src/main/java/org/apache/impala/planner/JoinNode.java@351
PS5, Line 351:     // that the RHS is a fact (master, FK) table.
again confused on terminology here (and seems there is at least one typo). Above, it says that the table with the FK is on the left. As such, isn't the LHS the fact table (table with the FK)? And the right side is the dimension table (table with the PK)?


http://gerrit.cloudera.org:8080/#/c/12535/5/fe/src/main/java/org/apache/impala/planner/JoinNode.java@353
PS5, Line 353: |RHS key| 
I think this and some of the text below would be easier to follow if we don't overload |...| to mean both count and NDV. i.e. here what you mean is NDV(RHS key) = |RHS table|

This also matches the notation used in some of the block comments above


http://gerrit.cloudera.org:8080/#/c/12535/5/fe/src/main/java/org/apache/impala/planner/JoinNode.java@355
PS5, Line 355: In the typical case
Isn't this one of the assumptions above? (the "containment"/referential integrity one)


http://gerrit.cloudera.org:8080/#/c/12535/5/fe/src/main/java/org/apache/impala/planner/JoinNode.java@355
PS5, Line 355: very
nit: every


http://gerrit.cloudera.org:8080/#/c/12535/5/fe/src/main/java/org/apache/impala/planner/JoinNode.java@360
PS5, Line 360: |N|
what is |N| here?

In a PK relationship isn't the probability of a detail (fact) row finding a match in an unfiltered dimension table always 1?


http://gerrit.cloudera.org:8080/#/c/12535/5/fe/src/main/java/org/apache/impala/planner/JoinNode.java@370
PS5, Line 370:  |D"
typo at end of line here (that should say |D|, right?)


One question: would it not be valid to just say this is a degenerate case of the compound key calculation, where the product of one number is just the number itself? ie:

  |D.fk'| = min(|D.fk|, |M|) * |D'|/|D|

That's slightly different than what you have here, but in a "true" PK relationship, we are guaranteed that |D.fk| <= |M|, and if our assumption was off by a bit, we're at still guaranteed that |D.fk| is within some bound of |M| due to the fact that we're in this code path anyway. That would allow us to simplify the code below a bit, right?


http://gerrit.cloudera.org:8080/#/c/12535/5/fe/src/main/java/org/apache/impala/planner/JoinNode.java@373
PS5, Line 373: s
nit: typo


http://gerrit.cloudera.org:8080/#/c/12535/5/fe/src/main/java/org/apache/impala/planner/JoinNode.java@376
PS5, Line 376:     // |D' >< M'| = -------------------
nit: the '><' notation isn't used elsewhere in impala. Maybe easier to just read |D' JOIN M'|



-- 
To view, visit http://gerrit.cloudera.org:8080/12535
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id0db82564596b0a9271b89b8e3f7a3cf92d306da
Gerrit-Change-Number: 12535
Gerrit-PatchSet: 5
Gerrit-Owner: Paul Rogers <pr...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pr...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 21 Mar 2019 04:14:28 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8014: Incorrect FK/PK cardinality estimation

Posted by "Bharath Vissapragada (Code Review)" <ge...@cloudera.org>.
Bharath Vissapragada has posted comments on this change. ( http://gerrit.cloudera.org:8080/12535 )

Change subject: IMPALA-8014: Incorrect FK/PK cardinality estimation
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12535/4/fe/src/main/java/org/apache/impala/planner/JoinNode.java
File fe/src/main/java/org/apache/impala/planner/JoinNode.java:

http://gerrit.cloudera.org:8080/#/c/12535/4/fe/src/main/java/org/apache/impala/planner/JoinNode.java@416
PS4, Line 416: fkCard = Math.max(fkCard, largestKeyCard);
> Pardon my ignorance, isn't fkCard always >= largestKeyCard here? You are ju
Sorry, I meant you are multiplying all lhsNdv() for each slot and comparing it to the max among them. Generally I think the else {} block here could use an example.



-- 
To view, visit http://gerrit.cloudera.org:8080/12535
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id0db82564596b0a9271b89b8e3f7a3cf92d306da
Gerrit-Change-Number: 12535
Gerrit-PatchSet: 4
Gerrit-Owner: Paul Rogers <pr...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pr...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 08 Mar 2019 18:27:37 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8014: Incorrect FK/PK cardinality estimation

Posted by "Paul Rogers (Code Review)" <ge...@cloudera.org>.
Paul Rogers has posted comments on this change. ( http://gerrit.cloudera.org:8080/12535 )

Change subject: IMPALA-8014: Incorrect FK/PK cardinality estimation
......................................................................


Patch Set 4:

Here are two references to get you started:

https://pdfs.semanticscholar.org/2735/672262940a23cfce8d4cc1e25c0191de7da7.pdf
See Section 2, equation 1.

http://www.vldb.org/pvldb/vol9/p204-leis.pdf
Section 2.3

The basic formula is pretty standard, as is the math to calculate NDV for compound keys. The ad-hoc bits are the adjustments that limit values.

The current version is a bit odd: it uses NDVs to infer PK/FK, then uses PK/FK to adjust NDVs. Didn't want to fix too much in a single patch, so let this odd bit of reasoning.

We can go over the details in person.


-- 
To view, visit http://gerrit.cloudera.org:8080/12535
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id0db82564596b0a9271b89b8e3f7a3cf92d306da
Gerrit-Change-Number: 12535
Gerrit-PatchSet: 4
Gerrit-Owner: Paul Rogers <pr...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pr...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Tue, 05 Mar 2019 00:32:27 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8014: Revise FK/PK cardinality estimation

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12535 )

Change subject: IMPALA-8014: Revise FK/PK cardinality estimation
......................................................................


Patch Set 5:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/2398/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/12535
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id0db82564596b0a9271b89b8e3f7a3cf92d306da
Gerrit-Change-Number: 12535
Gerrit-PatchSet: 5
Gerrit-Owner: Paul Rogers <pr...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pr...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 08 Mar 2019 21:16:42 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8014: Revise FK/PK cardinality estimation

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12535 )

Change subject: IMPALA-8014: Revise FK/PK cardinality estimation
......................................................................


Patch Set 5:

I've been meaning to make some time to look at this. I think my high-level concern is about potential regressions (i.e. where we were wrong for the right reasons). I know you've argued against having a flag for everything because of the configuration space, testing and code complexity problems, but I wonder if there's a way we can guard against the worst issues.

E.g. can we factor out the cardinality code into a strategy class with a simple interface to encapsulate the different versions of the logic? Can we limit the configuration space by using something coarser-grained than planner versioning, e.g. have a planner "epoch" that is incremented for a group of significant planner changes.

Definitely don't want to be stuck being overly conservative with preserving semi-broken logic, but it would be good to mitigate the risk if it can be done with reasonable effort.


-- 
To view, visit http://gerrit.cloudera.org:8080/12535
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id0db82564596b0a9271b89b8e3f7a3cf92d306da
Gerrit-Change-Number: 12535
Gerrit-PatchSet: 5
Gerrit-Owner: Paul Rogers <pr...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pr...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 03 Apr 2019 00:01:21 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8014: Incorrect FK/PK cardinality estimation

Posted by "Paul Rogers (Code Review)" <ge...@cloudera.org>.
Paul Rogers has posted comments on this change. ( http://gerrit.cloudera.org:8080/12535 )

Change subject: IMPALA-8014: Incorrect FK/PK cardinality estimation
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12535/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12535/4//COMMIT_MSG@32
PS4, Line 32: Turns out that the change caused a spill test to fail. The new plan is
> This is my bad, I should have hinted the plan to force a join order and str
I agree that your solution is more stable. Fudging with magic numbers is asking for trouble.

Can you recommend the proper fix? How should the test in question be modified to achieve the goal you had in mind?



-- 
To view, visit http://gerrit.cloudera.org:8080/12535
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id0db82564596b0a9271b89b8e3f7a3cf92d306da
Gerrit-Change-Number: 12535
Gerrit-PatchSet: 4
Gerrit-Owner: Paul Rogers <pr...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pr...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Mon, 04 Mar 2019 20:23:33 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8014: Revise FK/PK cardinality estimation

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/12535 )

Change subject: IMPALA-8014: Revise FK/PK cardinality estimation
......................................................................


Patch Set 5:

Hey Paul. Did you plan on continuing to look at this or should someone else take this over?


-- 
To view, visit http://gerrit.cloudera.org:8080/12535
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id0db82564596b0a9271b89b8e3f7a3cf92d306da
Gerrit-Change-Number: 12535
Gerrit-PatchSet: 5
Gerrit-Owner: Paul Rogers <pr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pr...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 03 May 2019 05:51:56 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8014: Incorrect FK/PK cardinality estimation

Posted by "Bharath Vissapragada (Code Review)" <ge...@cloudera.org>.
Bharath Vissapragada has posted comments on this change. ( http://gerrit.cloudera.org:8080/12535 )

Change subject: IMPALA-8014: Incorrect FK/PK cardinality estimation
......................................................................


Patch Set 4:

(2 comments)

My understanding of the math here is limited but I think the patch generally makes sense. I have a couple of nits.

http://gerrit.cloudera.org:8080/#/c/12535/4/fe/src/main/java/org/apache/impala/planner/JoinNode.java
File fe/src/main/java/org/apache/impala/planner/JoinNode.java:

http://gerrit.cloudera.org:8080/#/c/12535/4/fe/src/main/java/org/apache/impala/planner/JoinNode.java@350
PS4, Line 350:  // The code has decided that the RHS is a dimension (detail, FK) table, and
             :     // that the LHS is a fact (master, FK) table
isn't it the opposite?


http://gerrit.cloudera.org:8080/#/c/12535/4/fe/src/main/java/org/apache/impala/planner/JoinNode.java@416
PS4, Line 416: fkCard = Math.max(fkCard, largestKeyCard);
Pardon my ignorance, isn't fkCard always >= largestKeyCard here? You are just multiplying. Can you please give an example where it isn't?



-- 
To view, visit http://gerrit.cloudera.org:8080/12535
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id0db82564596b0a9271b89b8e3f7a3cf92d306da
Gerrit-Change-Number: 12535
Gerrit-PatchSet: 4
Gerrit-Owner: Paul Rogers <pr...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pr...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 08 Mar 2019 18:25:22 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8014: Incorrect FK/PK cardinality estimation

Posted by "Bharath Vissapragada (Code Review)" <ge...@cloudera.org>.
Bharath Vissapragada has posted comments on this change. ( http://gerrit.cloudera.org:8080/12535 )

Change subject: IMPALA-8014: Incorrect FK/PK cardinality estimation
......................................................................


Patch Set 4:

I have some questions around how the Fk-Pk cardinality estimate formula was derived but it is probably easier to ask them in person sometime.


-- 
To view, visit http://gerrit.cloudera.org:8080/12535
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id0db82564596b0a9271b89b8e3f7a3cf92d306da
Gerrit-Change-Number: 12535
Gerrit-PatchSet: 4
Gerrit-Owner: Paul Rogers <pr...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pr...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Tue, 05 Mar 2019 00:21:35 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8014: Incorrect FK/PK cardinality estimation

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12535 )

Change subject: IMPALA-8014: Incorrect FK/PK cardinality estimation
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12535/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12535/4//COMMIT_MSG@32
PS4, Line 32: Turns out that the change caused a spill test to fail. The new plan is
> I agree that your solution is more stable. Fudging with magic numbers is as
I would look at the plan that was generated before this change (join order and strategy), then modify the query by adding straight_join and BROADCAST/PARTITIONED hints to force that join order.

Actually I guess if the join order didn't change, then it could be that the memory estimates were lower and it allocated smaller buffer sizes. In that case we could force the buffer sizes to be larger, but that seems overkill.



-- 
To view, visit http://gerrit.cloudera.org:8080/12535
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id0db82564596b0a9271b89b8e3f7a3cf92d306da
Gerrit-Change-Number: 12535
Gerrit-PatchSet: 4
Gerrit-Owner: Paul Rogers <pr...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pr...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Mon, 04 Mar 2019 20:30:40 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8014: Incorrect FK/PK cardinality estimation

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12535 )

Change subject: IMPALA-8014: Incorrect FK/PK cardinality estimation
......................................................................


Patch Set 4:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/2341/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/12535
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id0db82564596b0a9271b89b8e3f7a3cf92d306da
Gerrit-Change-Number: 12535
Gerrit-PatchSet: 4
Gerrit-Owner: Paul Rogers <pr...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pr...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Mon, 04 Mar 2019 20:32:50 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8014: Incorrect FK/PK cardinality estimation

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12535 )

Change subject: IMPALA-8014: Incorrect FK/PK cardinality estimation
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12535/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12535/4//COMMIT_MSG@32
PS4, Line 32: Turns out that the change caused a spill test to fail. The new plan is
This is my bad, I should have hinted the plan to force a join order and strategy. Your current solution seems fine, but ideally for these query execution tests we should be forcing a plan.



-- 
To view, visit http://gerrit.cloudera.org:8080/12535
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id0db82564596b0a9271b89b8e3f7a3cf92d306da
Gerrit-Change-Number: 12535
Gerrit-PatchSet: 4
Gerrit-Owner: Paul Rogers <pr...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pr...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Mon, 04 Mar 2019 20:00:44 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8014: Revise FK/PK cardinality estimation

Posted by "Paul Rogers (Code Review)" <ge...@cloudera.org>.
Paul Rogers has posted comments on this change. ( http://gerrit.cloudera.org:8080/12535 )

Change subject: IMPALA-8014: Revise FK/PK cardinality estimation
......................................................................


Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/12535/4/fe/src/main/java/org/apache/impala/planner/JoinNode.java
File fe/src/main/java/org/apache/impala/planner/JoinNode.java:

http://gerrit.cloudera.org:8080/#/c/12535/4/fe/src/main/java/org/apache/impala/planner/JoinNode.java@350
PS4, Line 350:  // The code has decided that the RHS is a dimension (detail, FK) table, and
             :     // that the LHS is a fact (master, FK) table
> isn't it the opposite?
Done


http://gerrit.cloudera.org:8080/#/c/12535/4/fe/src/main/java/org/apache/impala/planner/JoinNode.java@416
PS4, Line 416: fkCard = Math.max(fkCard, largestKeyCard);
> Sorry, I meant you are multiplying all lhsNdv() for each slot and comparing
Right. Very confusing. This code is only for the compound key case. We are dealing with the lack of an NDV for the key as a whole. (This is where we can very much use Anurag's FK/PK enhancement.)

Note the step above in which we constrain the fk cardinality to be no greater than the master table cardinality. This can occur, as in our tests, when the fk consists of (id, int_col), so NDV(compound key) = NDV(id) * NDV(int_col) > master tale cardinality. So, we constrain the foreign key to be no greater than the master table (based on our M:1 assumption.)

But, then we have to deal with the case that our guess is wrong: the FK cardinality really is greater. Even if the two columns are perfectly correlated, their joint NDV has to be at least as large as the largest individual NDV.

To handle the reality of messy schemas and stats, we:

* Compute the "raw" joint NDV
* Constrain it by the detail table size (can't have more keys than rows)
* Constrain it by the master table size (can't have more FKs than PKs, ideally)
* Constrain it by largest key NDV size (can't have fewer compound keys than column NDV, even if that is greater than master table size)

I *think* this is right. The above messy adjustments are the result of inspecting tests that changed. But please do think it through to be certain.



-- 
To view, visit http://gerrit.cloudera.org:8080/12535
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id0db82564596b0a9271b89b8e3f7a3cf92d306da
Gerrit-Change-Number: 12535
Gerrit-PatchSet: 4
Gerrit-Owner: Paul Rogers <pr...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pr...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 08 Mar 2019 20:31:26 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8014: Revise FK/PK cardinality estimation

Posted by "Paul Rogers (Code Review)" <ge...@cloudera.org>.
Hello Bharath Vissapragada, Tim Armstrong, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12535

to look at the new patch set (#5).

Change subject: IMPALA-8014: Revise FK/PK cardinality estimation
......................................................................

IMPALA-8014: Revise FK/PK cardinality estimation

Impala uses two techniques to estimate join cardinality: one for FK/PK,
another for the "generic" case. (IMPALA-8018 suggests we combine the
two. But, for the purposes of this fix, we leave them separate.)

The code for the FK/PK estimation is mostly right for a key that
consists of a singe column. But, if a key is compound (contains more
than one column), the calculations can produce incorrect results.

This patch modifies the code to use the standard join calculation:
|join| = |left| * |right| / max(|left key|, |right key|).

The math for compound keys had a number of errors. Since no test tables
represent a proper compound key, all tests use unrealistic keys such as
a unique Id along with a second, superfluous column. To make the math
come out OK for these, the code included adjustments that misfired on
real-world keys. This patch fixes those errors. A separate patch
introduces new tables with the required schema.

Tests: Many planner tests were modified to reflect the new join
cardinality estimate. Some plans changed, including some TPC-H plans.
Inspected each change to verify that the numbers are correct using the
formula cited above.

Turns out that the change caused a spill test to fail. The new plan is
more efficient and didn't require as much memory. Shrank the memory
given to the query to force the newer, more efficient plan to also
spill.

Testing was also done using a production query that misfired without
this fix, but that produced correct results with this fix.

Change-Id: Id0db82564596b0a9271b89b8e3f7a3cf92d306da
---
M fe/src/main/java/org/apache/impala/planner/JoinNode.java
M fe/src/main/java/org/apache/impala/planner/PlanNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/workloads/functional-planner/queries/PlannerTest/card-inner-join.test
M testdata/workloads/functional-planner/queries/PlannerTest/card-multi-join.test
M testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection.test
M testdata/workloads/functional-planner/queries/PlannerTest/inline-view.test
M testdata/workloads/functional-planner/queries/PlannerTest/join-order.test
M testdata/workloads/functional-planner/queries/PlannerTest/joins.test
M testdata/workloads/functional-planner/queries/PlannerTest/nested-collections.test
M testdata/workloads/functional-planner/queries/PlannerTest/order.test
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds-all.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpch-all.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpch-kudu.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpch-views.test
M testdata/workloads/functional-query/queries/QueryTest/spilling.test
17 files changed, 1,920 insertions(+), 1,829 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/35/12535/5
-- 
To view, visit http://gerrit.cloudera.org:8080/12535
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id0db82564596b0a9271b89b8e3f7a3cf92d306da
Gerrit-Change-Number: 12535
Gerrit-PatchSet: 5
Gerrit-Owner: Paul Rogers <pr...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pr...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>