You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Alex Behm (Code Review)" <ge...@cloudera.org> on 2017/06/21 23:41:44 UTC

[Impala-ASF-CR] IMPALA-5547: Rework FK/PK join detection.

Alex Behm has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/7257

Change subject: IMPALA-5547: Rework FK/PK join detection.
......................................................................

IMPALA-5547: Rework FK/PK join detection.

Reworks the FK/PK join detection logic to:
- more accurately recognize many-to-many joins
- avoid dim/dim joins for multi-column PKs

The new detection logic maintains our existing philosophy of generally
assuming a FK/PK join, unless there is strong evidence to the
contrary, as follows.

For each set of simple equi-join conjuncts between two tables, we
compute the joint NDV of the right-hand side columns by
multiplication, and if the joint NDV is significantly smaller than
the right-hand side row count, then we are fairly confident that the
right-hand side is not a PK. Otherwise, we assume the set of conjuncts
could represent a FK/PK relationship.

Extends the explain plan to include the outcome of the FK/PK detection
at EXPLAIN_LEVEL > STANDARD.

Performance testing:
1. Full TPC-DS run on 10TB:
   - Q10 improved by >100x
   - Q72 improved by >25x
   - Q17,Q26,Q29 improved by 2x
   - Q64 regressed by 10x
   - Total runtime: Improved by 2x
   - Geomean: Minor improvement
   The regression of Q64 is understood and we will try to address it
   in follow-on changes. The previous plan was better by accident and
   not because of superior logic.
2. Nightly TPC-H and TPC-DS runs:
   - No perf differences

Testing:
- The existing planner test cover the changes.
- Code/hdfs run passed.

Change-Id: I49074fe743a28573cff541ef7dbd0edd88892067
---
M fe/src/main/java/org/apache/impala/analysis/JoinOperator.java
M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
M fe/src/main/java/org/apache/impala/planner/JoinNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test
M testdata/workloads/functional-planner/queries/PlannerTest/hbase.test
M testdata/workloads/functional-planner/queries/PlannerTest/join-order.test
M testdata/workloads/functional-planner/queries/PlannerTest/joins.test
M testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation.test
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds-all.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpch-all.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpch-kudu.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpch-nested.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpch-views.test
M testdata/workloads/functional-query/queries/QueryTest/explain-level2.test
M testdata/workloads/functional-query/queries/QueryTest/explain-level3.test
17 files changed, 1,500 insertions(+), 1,351 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/57/7257/1
-- 
To view, visit http://gerrit.cloudera.org:8080/7257
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I49074fe743a28573cff541ef7dbd0edd88892067
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>

[Impala-ASF-CR] IMPALA-5547: Rework FK/PK join detection.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-5547: Rework FK/PK join detection.
......................................................................


Patch Set 2:

(9 comments)

Responding to comments. New tests are still WIP.

http://gerrit.cloudera.org:8080/#/c/7257/2/fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
File fe/src/main/java/org/apache/impala/planner/HashJoinNode.java:

PS2, Line 27: import org.apache.impala.analysis.SlotDescriptor;
> Is this used?
No. Done.


PS2, Line 40: import org.apache.kudu.shaded.com.google.common.base.Joiner;
> replace
Sorry, IDE. Done.


http://gerrit.cloudera.org:8080/#/c/7257/1/fe/src/main/java/org/apache/impala/planner/JoinNode.java
File fe/src/main/java/org/apache/impala/planner/JoinNode.java:

Line 253:     if (fkPkEqJoinConjuncts_ != null) {
> My concern was that if we have compelling evidence that a FK/PK relationshi
Makes sense, thanks. In that case the adjustment based on the NDV ratio in L312 would be wrong and lead to underestimation.


http://gerrit.cloudera.org:8080/#/c/7257/2/fe/src/main/java/org/apache/impala/planner/JoinNode.java
File fe/src/main/java/org/apache/impala/planner/JoinNode.java:

PS2, Line 79: -
> nit: remove
Done


PS2, Line 80: ordered based on the tuple ids
> I am not sure I get what that ordering is. Do you mean grouped?
Reworded with "grouped by"


PS2, Line 202: based on the single most
             :    * selective join predicate. We do not attempt to estimate the joint selectivity of
             :    * multiple join predicates to avoid underestimation.
> Much better. Thanks
Done


PS2, Line 280: fkPkCandidate.get(0)
> Can you plz explain this? I could be wrong about this but it seems that the
Might have misunderstood your comment, but I'll give it a shot.

In L277 we iterate over all groups of conjuncts that belong to the same joined tuple id pair. For each group, we compute the join NDV of the rhs slots and compare it to the number of rows in the rhs table.

I didn't follow your comment about processing order. I don't think the processing order matters for what we want to check.

Are you asking why it's ok to get(0)? All entries belong to the same tuple id pair, so the num rows is the same for all of them.

Added a comment, but not sure if it addresses your concern.


PS2, Line 299: Preconditions.checkState(lhsCard >= 0 && rhsCard >= 0);
> Are we certain this is a valid precondition check? Why isn't a 0 rhsCard po
Maybe you misread the code: It's greater or equal to zero, so zero is valid.


PS2, Line 337: we adjustments
> nit: grammar
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7257
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I49074fe743a28573cff541ef7dbd0edd88892067
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5547: Rework FK/PK join detection.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged.

Change subject: IMPALA-5547: Rework FK/PK join detection.
......................................................................


IMPALA-5547: Rework FK/PK join detection.

Reworks the FK/PK join detection logic to:
- more accurately recognize many-to-many joins
- avoid dim/dim joins for multi-column PKs

The new detection logic maintains our existing philosophy of generally
assuming a FK/PK join, unless there is strong evidence to the
contrary, as follows.

For each set of simple equi-join conjuncts between two tables, we
compute the joint NDV of the right-hand side columns by
multiplication, and if the joint NDV is significantly smaller than
the right-hand side row count, then we are fairly confident that the
right-hand side is not a PK. Otherwise, we assume the set of conjuncts
could represent a FK/PK relationship.

Extends the explain plan to include the outcome of the FK/PK detection
at EXPLAIN_LEVEL > STANDARD.

Performance testing:
1. Full TPC-DS run on 10TB:
   - Q10 improved by >100x
   - Q72 improved by >25x
   - Q17,Q26,Q29 improved by 2x
   - Q64 regressed by 10x
   - Total runtime: Improved by 2x
   - Geomean: Minor improvement
   The regression of Q64 is understood and we will try to address it
   in follow-on changes. The previous plan was better by accident and
   not because of superior logic.
2. Nightly TPC-H and TPC-DS runs:
   - No perf differences

Testing:
- The existing planner test cover the changes.
- Code/hdfs run passed.

Change-Id: I49074fe743a28573cff541ef7dbd0edd88892067
Reviewed-on: http://gerrit.cloudera.org:8080/7257
Reviewed-by: Alex Behm <al...@cloudera.com>
Tested-by: Impala Public Jenkins
---
M fe/src/main/java/org/apache/impala/analysis/JoinOperator.java
M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
M fe/src/main/java/org/apache/impala/planner/JoinNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test
A testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection.test
M testdata/workloads/functional-planner/queries/PlannerTest/hbase.test
M testdata/workloads/functional-planner/queries/PlannerTest/joins.test
M testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation.test
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-planner/queries/PlannerTest/spillable-buffer-sizing.test
M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds-all.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpch-nested.test
M testdata/workloads/functional-query/queries/QueryTest/explain-level2.test
M testdata/workloads/functional-query/queries/QueryTest/explain-level3.test
16 files changed, 1,616 insertions(+), 1,018 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Alex Behm: Looks good to me, approved



-- 
To view, visit http://gerrit.cloudera.org:8080/7257
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I49074fe743a28573cff541ef7dbd0edd88892067
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>

[Impala-ASF-CR] IMPALA-5547: Rework FK/PK join detection.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Hello Dimitris Tsirogiannis,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7257

to look at the new patch set (#6).

Change subject: IMPALA-5547: Rework FK/PK join detection.
......................................................................

IMPALA-5547: Rework FK/PK join detection.

Reworks the FK/PK join detection logic to:
- more accurately recognize many-to-many joins
- avoid dim/dim joins for multi-column PKs

The new detection logic maintains our existing philosophy of generally
assuming a FK/PK join, unless there is strong evidence to the
contrary, as follows.

For each set of simple equi-join conjuncts between two tables, we
compute the joint NDV of the right-hand side columns by
multiplication, and if the joint NDV is significantly smaller than
the right-hand side row count, then we are fairly confident that the
right-hand side is not a PK. Otherwise, we assume the set of conjuncts
could represent a FK/PK relationship.

Extends the explain plan to include the outcome of the FK/PK detection
at EXPLAIN_LEVEL > STANDARD.

Performance testing:
1. Full TPC-DS run on 10TB:
   - Q10 improved by >100x
   - Q72 improved by >25x
   - Q17,Q26,Q29 improved by 2x
   - Q64 regressed by 10x
   - Total runtime: Improved by 2x
   - Geomean: Minor improvement
   The regression of Q64 is understood and we will try to address it
   in follow-on changes. The previous plan was better by accident and
   not because of superior logic.
2. Nightly TPC-H and TPC-DS runs:
   - No perf differences

Testing:
- The existing planner test cover the changes.
- Code/hdfs run passed.

Change-Id: I49074fe743a28573cff541ef7dbd0edd88892067
---
M fe/src/main/java/org/apache/impala/analysis/JoinOperator.java
M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
M fe/src/main/java/org/apache/impala/planner/JoinNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test
A testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection.test
M testdata/workloads/functional-planner/queries/PlannerTest/hbase.test
M testdata/workloads/functional-planner/queries/PlannerTest/joins.test
M testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation.test
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-planner/queries/PlannerTest/spillable-buffer-sizing.test
M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds-all.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpch-nested.test
M testdata/workloads/functional-query/queries/QueryTest/explain-level2.test
M testdata/workloads/functional-query/queries/QueryTest/explain-level3.test
16 files changed, 1,616 insertions(+), 1,018 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/57/7257/6
-- 
To view, visit http://gerrit.cloudera.org:8080/7257
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I49074fe743a28573cff541ef7dbd0edd88892067
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>

[Impala-ASF-CR] IMPALA-5547: Rework FK/PK join detection.

Posted by "Dimitris Tsirogiannis (Code Review)" <ge...@cloudera.org>.
Dimitris Tsirogiannis has posted comments on this change.

Change subject: IMPALA-5547: Rework FK/PK join detection.
......................................................................


Patch Set 1:

(17 comments)

I have one comment about testing. We have lots of test changes but it's not easy to verify if the end result makes sense or not. I think it may worth having a planner test with explain_level > 2 that covers a few interesting join cases so that we can see what kind of pk/fk conjuncts are detected and what the impact is on estimated join cardinalities.

http://gerrit.cloudera.org:8080/#/c/7257/1/fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
File fe/src/main/java/org/apache/impala/planner/HashJoinNode.java:

PS1, Line 170: else if (fkPkEqJoinConjuncts_.isEmpty()) {
             :           output.append("assumed fk/pk");
I haven't read the entire patch yet, so I am curious what this case represents.


PS1, Line 178: for (int j = 0; j < slots.size(); ++j) {
             :               output.append(slots.get(j).toString());
             :               if (j + 1 != slots.size()) output.append(", ");
             :             }
Can't you use the Guava's Joiner class here? Joiner.on(",").join(slots)


http://gerrit.cloudera.org:8080/#/c/7257/1/fe/src/main/java/org/apache/impala/planner/JoinNode.java
File fe/src/main/java/org/apache/impala/planner/JoinNode.java:

PS1, Line 37: import org.apache.kudu.client.shaded.com.google.common.collect.Maps;
?


PS1, Line 271: scanSlotsByTids
nit: Each pair represents two joined tables, no? So, maybe rename this to 'scanSlotsByJoinedTids'? A bit more explicit on what this thing represents.


PS1, Line 281: numRows
rhsTableCardinality?


Line 290: 
nit: extra line


PS1, Line 294: conjuncts
conjunct slots


PS1, Line 300:     Preconditions.checkState(joinOp_.isInnerJoin() || joinOp_.isOuterJoin());
This private function is only called in L253? I think you can remove this check.


PS1, Line 312: lhsNdv
Can this ever be 0?


PS1, Line 313: rhsNumRows
Same here, can this be 0?


PS1, Line 318: result = Math.min(result, joinCard);
That part I think is very important and I am not sure it is explained anywhere. Maybe expand the comment in L209?


PS1, Line 329: conjuncts
conjunct slots


PS1, Line 341: slots.lhs().getStats().getNumDistinctValues();
nit: these are used in couple of places. Since you already have the EqJoinConjunctScanSlots class, why don't your wrap them in some nice util functions there (e.g. getLhsNdvs() or something like that).


PS1, Line 343: slots.lhs().getParent().getTable().getNumRows();
same comment as above.


PS1, Line 373: Preconditions
I don't think these checks are useful. This private constructor is only called through the static function that has tight control on what gets passed in this ctor.


PS1, Line 382: .
nit: remove


PS1, Line 399: tupleDesc
nit: inline


-- 
To view, visit http://gerrit.cloudera.org:8080/7257
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I49074fe743a28573cff541ef7dbd0edd88892067
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5547: Rework FK/PK join detection.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-5547: Rework FK/PK join detection.
......................................................................


Patch Set 6: Verified+1

-- 
To view, visit http://gerrit.cloudera.org:8080/7257
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I49074fe743a28573cff541ef7dbd0edd88892067
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5547: Rework FK/PK join detection.

Posted by "Dimitris Tsirogiannis (Code Review)" <ge...@cloudera.org>.
Dimitris Tsirogiannis has posted comments on this change.

Change subject: IMPALA-5547: Rework FK/PK join detection.
......................................................................


Patch Set 2:

(2 comments)

Just minor responses; thanks for the clarifications. I'll take a look at the tests when submitted.

http://gerrit.cloudera.org:8080/#/c/7257/2/fe/src/main/java/org/apache/impala/planner/JoinNode.java
File fe/src/main/java/org/apache/impala/planner/JoinNode.java:

PS2, Line 280: fkPkCandidate.get(0)
> Might have misunderstood your comment, but I'll give it a shot.
Ah yes, thanks for clarifying. I forgot the grouping effect :)


PS2, Line 299: Preconditions.checkState(lhsCard >= 0 && rhsCard >= 0);
> Maybe you misread the code: It's greater or equal to zero, so zero is valid
oops yes. sorry for the confusion


-- 
To view, visit http://gerrit.cloudera.org:8080/7257
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I49074fe743a28573cff541ef7dbd0edd88892067
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5547: Rework FK/PK join detection.

Posted by "Dimitris Tsirogiannis (Code Review)" <ge...@cloudera.org>.
Dimitris Tsirogiannis has posted comments on this change.

Change subject: IMPALA-5547: Rework FK/PK join detection.
......................................................................


Patch Set 2:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/7257/2/fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
File fe/src/main/java/org/apache/impala/planner/HashJoinNode.java:

PS2, Line 27: import org.apache.impala.analysis.SlotDescriptor;
Is this used?


PS2, Line 40: import org.apache.kudu.shaded.com.google.common.base.Joiner;
replace


http://gerrit.cloudera.org:8080/#/c/7257/2/fe/src/main/java/org/apache/impala/planner/JoinNode.java
File fe/src/main/java/org/apache/impala/planner/JoinNode.java:

PS2, Line 79: -
nit: remove


PS2, Line 80: ordered based on the tuple ids
I am not sure I get what that ordering is. Do you mean grouped?


PS2, Line 202: based on the single most
             :    * selective join predicate. We do not attempt to estimate the joint selectivity of
             :    * multiple join predicates to avoid underestimation.
Much better. Thanks


PS2, Line 280: fkPkCandidate.get(0)
Can you plz explain this? I could be wrong about this but it seems that the decision of whether this is a pk/fk depends on which equi-join slots we process first. Maybe I am missing something. In either case, plz add a comment.


PS2, Line 299: Preconditions.checkState(lhsCard >= 0 && rhsCard >= 0);
Are we certain this is a valid precondition check? Why isn't a 0 rhsCard possible? Unless this is guaranteed to be the case...


PS2, Line 337: we adjustments
nit: grammar


-- 
To view, visit http://gerrit.cloudera.org:8080/7257
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I49074fe743a28573cff541ef7dbd0edd88892067
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5547: Rework FK/PK join detection.

Posted by "Dimitris Tsirogiannis (Code Review)" <ge...@cloudera.org>.
Dimitris Tsirogiannis has posted comments on this change.

Change subject: IMPALA-5547: Rework FK/PK join detection.
......................................................................


Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/7257/4/testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection.test
File testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection.test:

PS4, Line 261:  
nit: trailing space


PS4, Line 327: ====
Also, consider adding one or two cases with outer joins. You may expand some existing query if you want. Also, you may want to add a few cases that we know are kind of problematic, similar to q64, that has regression due to the group by or analytic function.


-- 
To view, visit http://gerrit.cloudera.org:8080/7257
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I49074fe743a28573cff541ef7dbd0edd88892067
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5547: Rework FK/PK join detection.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-5547: Rework FK/PK join detection.
......................................................................


Patch Set 6:

Build started: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/819/

-- 
To view, visit http://gerrit.cloudera.org:8080/7257
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I49074fe743a28573cff541ef7dbd0edd88892067
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5547: Rework FK/PK join detection.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has uploaded a new patch set (#2).

Change subject: IMPALA-5547: Rework FK/PK join detection.
......................................................................

IMPALA-5547: Rework FK/PK join detection.

Reworks the FK/PK join detection logic to:
- more accurately recognize many-to-many joins
- avoid dim/dim joins for multi-column PKs

The new detection logic maintains our existing philosophy of generally
assuming a FK/PK join, unless there is strong evidence to the
contrary, as follows.

For each set of simple equi-join conjuncts between two tables, we
compute the joint NDV of the right-hand side columns by
multiplication, and if the joint NDV is significantly smaller than
the right-hand side row count, then we are fairly confident that the
right-hand side is not a PK. Otherwise, we assume the set of conjuncts
could represent a FK/PK relationship.

Extends the explain plan to include the outcome of the FK/PK detection
at EXPLAIN_LEVEL > STANDARD.

Performance testing:
1. Full TPC-DS run on 10TB:
   - Q10 improved by >100x
   - Q72 improved by >25x
   - Q17,Q26,Q29 improved by 2x
   - Q64 regressed by 10x
   - Total runtime: Improved by 2x
   - Geomean: Minor improvement
   The regression of Q64 is understood and we will try to address it
   in follow-on changes. The previous plan was better by accident and
   not because of superior logic.
2. Nightly TPC-H and TPC-DS runs:
   - No perf differences

Testing:
- The existing planner test cover the changes.
- Code/hdfs run passed.

Change-Id: I49074fe743a28573cff541ef7dbd0edd88892067
---
M fe/src/main/java/org/apache/impala/analysis/JoinOperator.java
M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
M fe/src/main/java/org/apache/impala/planner/JoinNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test
M testdata/workloads/functional-planner/queries/PlannerTest/hbase.test
M testdata/workloads/functional-planner/queries/PlannerTest/join-order.test
M testdata/workloads/functional-planner/queries/PlannerTest/joins.test
M testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation.test
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds-all.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpch-all.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpch-kudu.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpch-nested.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpch-views.test
M testdata/workloads/functional-query/queries/QueryTest/explain-level2.test
M testdata/workloads/functional-query/queries/QueryTest/explain-level3.test
17 files changed, 1,485 insertions(+), 1,351 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/57/7257/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7257
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I49074fe743a28573cff541ef7dbd0edd88892067
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>

[Impala-ASF-CR] IMPALA-5547: Rework FK/PK join detection.

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has posted comments on this change.

Change subject: IMPALA-5547: Rework FK/PK join detection.
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7257/1/fe/src/main/java/org/apache/impala/planner/JoinNode.java
File fe/src/main/java/org/apache/impala/planner/JoinNode.java:

Line 253:       return getFkPkJoinCardinality(eqJoinConjunctSlots, lhsCard, rhsCard);
> 1. Changed the code to only pass those FK/PK conjuncts because that makes t
My concern was that if we have compelling evidence that a FK/PK relationship is not present, we shouldn't pass those eqJoin conjuncts to the FK/PK cardinality estimation - e.g.,

SELECT orders.customer_id, shipments.ship_id from orders, shipments, WHERE orders.order_date = shipments.ship_date

where the highly selective non-PK ship_date should not reduce the cardinality of the join - ideally, we multiply the cardinality by (|shipments| / NDV(shipments.ship_date))


-- 
To view, visit http://gerrit.cloudera.org:8080/7257
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I49074fe743a28573cff541ef7dbd0edd88892067
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5547: Rework FK/PK join detection.

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has posted comments on this change.

Change subject: IMPALA-5547: Rework FK/PK join detection.
......................................................................


Patch Set 1:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/7257/1/fe/src/main/java/org/apache/impala/planner/JoinNode.java
File fe/src/main/java/org/apache/impala/planner/JoinNode.java:

Line 91:   protected List<List<EqJoinConjunctScanSlots>> fkPkEqJoinConjuncts_;
Class member is only used in one function in the base class.  Maybe document that this is preserved just for printing values and wrap it with a getter so it can't be modified?


Line 253:       return getFkPkJoinCardinality(eqJoinConjunctSlots, lhsCard, rhsCard);
This computes estimates based on both PK conjuncts and non-PK conjuncts using the same formula, instead of filtering out non-PK joins.  If we have a RHS which is an extremely selective non-PK, this could bias the cardinality estimate to a very low value, since we take the minimum computed cardinality from these assumed PK values.

Is there any value in preserving the <List<List<>>> construction, or should getFkPkEqJoinConjuncts consider the conjuncts by tid order and just return a flattened list (which can be passed to getFkPkJoinCardinality)


Line 275:     for (List<EqJoinConjunctScanSlots> fkPkCandi: scanSlotsByTids.values()) {
fkPkCandi - this could use a better name, it isn't clear to me what this is trying to describe.


-- 
To view, visit http://gerrit.cloudera.org:8080/7257
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I49074fe743a28573cff541ef7dbd0edd88892067
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5547: Rework FK/PK join detection.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-5547: Rework FK/PK join detection.
......................................................................


Patch Set 6: Code-Review+2

Rebase and adjust test output. One final fix to cap NDV at the row count. Did another TPCDS 10TB per run and the numbers look good.

-- 
To view, visit http://gerrit.cloudera.org:8080/7257
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I49074fe743a28573cff541ef7dbd0edd88892067
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5547: Rework FK/PK join detection.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-5547: Rework FK/PK join detection.
......................................................................


Patch Set 1:

(20 comments)

I still need to add the targeted planner tests and adjust new tests after the rebase. In the meantime, you can continue looking at the code changes.

http://gerrit.cloudera.org:8080/#/c/7257/1/fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
File fe/src/main/java/org/apache/impala/planner/HashJoinNode.java:

PS1, Line 170: else if (fkPkEqJoinConjuncts_.isEmpty()) {
             :           output.append("assumed fk/pk");
> I haven't read the entire patch yet, so I am curious what this case represe
In this case we have no equi-join conjuncts that we can reason about, so we are optimistic and assume fk/pk with a join selectivity of 1.

We cannot reason about an equi-join conjunct if the underlying table/columns have no stats or if the conjunct involves non-trivial expressions (anything that is not a SlotRef).


PS1, Line 178: for (int j = 0; j < slots.size(); ++j) {
             :               output.append(slots.get(j).toString());
             :               if (j + 1 != slots.size()) output.append(", ");
             :             }
> Can't you use the Guava's Joiner class here? Joiner.on(",").join(slots)
Better. Done.


http://gerrit.cloudera.org:8080/#/c/7257/1/fe/src/main/java/org/apache/impala/planner/JoinNode.java
File fe/src/main/java/org/apache/impala/planner/JoinNode.java:

PS1, Line 37: import org.apache.kudu.client.shaded.com.google.common.collect.Maps;
> ?
Sigh. Fixed.


Line 91:   protected List<List<EqJoinConjunctScanSlots>> fkPkEqJoinConjuncts_;
> Class member is only used in one function in the base class.  Maybe documen
Added comment. Not sure it's worth protecting against accidental modifications in inheriting classes. I think it's fair to assume that subclasses should treat protected members carefully. Subclasses can already muck with pretty much anything here, and adding getters that wrap members with immutables seems cumbersome, and generates objects "unnecessarily".


Line 253:       return getFkPkJoinCardinality(eqJoinConjunctSlots, lhsCard, rhsCard);
> This computes estimates based on both PK conjuncts and non-PK conjuncts usi
1. Changed the code to only pass those FK/PK conjuncts because that makes the behavior and code clearer. I'm not sure I follow your described scenario. If the RHS is very selective, then it's correct to apply that to the join cardinality - if we believe that FK/PK is indeed the case. Are you concerned that our FK/PK assumption could be wrong and we might underestimate the join cardinality

2. Good idea. A flattened, ordered list makes things simpler in various places. Done.


PS1, Line 271: scanSlotsByTids
> nit: Each pair represents two joined tables, no? So, maybe rename this to '
Done


Line 275:     for (List<EqJoinConjunctScanSlots> fkPkCandi: scanSlotsByTids.values()) {
> fkPkCandi - this could use a better name, it isn't clear to me what this is
I renamed it to fkPkCandidate, but I doubt that's what you had in mind. Can you help me come up with a better name? This thing is a list of equi-join conjuncts between the same two tables. These conjuncts could represent a single- or multi-column FK/PK join condition.


PS1, Line 281: numRows
> rhsTableCardinality?
Used rhsNumRows because we typically use 'numRows' to indicate values that come directly from stats, although technically table cardinality is also correct of course.


Line 290: 
> nit: extra line
Done


PS1, Line 294: conjuncts
> conjunct slots
Done


PS1, Line 300:     Preconditions.checkState(joinOp_.isInnerJoin() || joinOp_.isOuterJoin());
> This private function is only called in L253? I think you can remove this c
Done


PS1, Line 312: lhsNdv
> Can this ever be 0?
Probably. Better be defensive. Fixed here and elsewhere.


PS1, Line 313: rhsNumRows
> Same here, can this be 0?
Probably. Better be defensive. Fixed here and elsewhere.


PS1, Line 318: result = Math.min(result, joinCard);
> That part I think is very important and I am not sure it is explained anywh
This part applies to both methods (generic and FK/PK). I expanded the comment on getJoinCardinality() to explain why we take the min().

We should certainly try (in a subsequent patch) whether estimating the joint selectivity of multiple predicates with exponential backoff or similar works well.


PS1, Line 329: conjuncts
> conjunct slots
Done


PS1, Line 341: slots.lhs().getStats().getNumDistinctValues();
> nit: these are used in couple of places. Since you already have the EqJoinC
Done


PS1, Line 343: slots.lhs().getParent().getTable().getNumRows();
> same comment as above.
Done


PS1, Line 373: Preconditions
> I don't think these checks are useful. This private constructor is only cal
Done


PS1, Line 382: .
> nit: remove
Done


PS1, Line 399: tupleDesc
> nit: inline
Removed tupleDesc instead, otherwise the if condition below becomes multi-line and harder to read imo.


-- 
To view, visit http://gerrit.cloudera.org:8080/7257
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I49074fe743a28573cff541ef7dbd0edd88892067
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5547: Rework FK/PK join detection.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-5547: Rework FK/PK join detection.
......................................................................


Patch Set 3:

New tests are complete now.

-- 
To view, visit http://gerrit.cloudera.org:8080/7257
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I49074fe743a28573cff541ef7dbd0edd88892067
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5547: Rework FK/PK join detection.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-5547: Rework FK/PK join detection.
......................................................................


Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/7257/4/testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection.test
File testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection.test:

PS4, Line 261:  
> nit: trailing space
Done


PS4, Line 327: ====
> Also, consider adding one or two cases with outer joins. You may expand som
* Added left/right outer join tests.
* Added test with group by on rhs

Analytic functions do not produce keys, so did not did not add this test.


-- 
To view, visit http://gerrit.cloudera.org:8080/7257
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I49074fe743a28573cff541ef7dbd0edd88892067
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5547: Rework FK/PK join detection.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has uploaded a new patch set (#5).

Change subject: IMPALA-5547: Rework FK/PK join detection.
......................................................................

IMPALA-5547: Rework FK/PK join detection.

Reworks the FK/PK join detection logic to:
- more accurately recognize many-to-many joins
- avoid dim/dim joins for multi-column PKs

The new detection logic maintains our existing philosophy of generally
assuming a FK/PK join, unless there is strong evidence to the
contrary, as follows.

For each set of simple equi-join conjuncts between two tables, we
compute the joint NDV of the right-hand side columns by
multiplication, and if the joint NDV is significantly smaller than
the right-hand side row count, then we are fairly confident that the
right-hand side is not a PK. Otherwise, we assume the set of conjuncts
could represent a FK/PK relationship.

Extends the explain plan to include the outcome of the FK/PK detection
at EXPLAIN_LEVEL > STANDARD.

Performance testing:
1. Full TPC-DS run on 10TB:
   - Q10 improved by >100x
   - Q72 improved by >25x
   - Q17,Q26,Q29 improved by 2x
   - Q64 regressed by 10x
   - Total runtime: Improved by 2x
   - Geomean: Minor improvement
   The regression of Q64 is understood and we will try to address it
   in follow-on changes. The previous plan was better by accident and
   not because of superior logic.
2. Nightly TPC-H and TPC-DS runs:
   - No perf differences

Testing:
- The existing planner test cover the changes.
- Code/hdfs run passed.

Change-Id: I49074fe743a28573cff541ef7dbd0edd88892067
---
M fe/src/main/java/org/apache/impala/analysis/JoinOperator.java
M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
M fe/src/main/java/org/apache/impala/planner/JoinNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test
A testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection.test
M testdata/workloads/functional-planner/queries/PlannerTest/hbase.test
M testdata/workloads/functional-planner/queries/PlannerTest/join-order.test
M testdata/workloads/functional-planner/queries/PlannerTest/joins.test
M testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation.test
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpch-all.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpch-kudu.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpch-nested.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpch-views.test
M testdata/workloads/functional-query/queries/QueryTest/explain-level2.test
M testdata/workloads/functional-query/queries/QueryTest/explain-level3.test
18 files changed, 1,171 insertions(+), 588 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/57/7257/5
-- 
To view, visit http://gerrit.cloudera.org:8080/7257
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I49074fe743a28573cff541ef7dbd0edd88892067
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>

[Impala-ASF-CR] IMPALA-5547: Rework FK/PK join detection.

Posted by "Dimitris Tsirogiannis (Code Review)" <ge...@cloudera.org>.
Dimitris Tsirogiannis has posted comments on this change.

Change subject: IMPALA-5547: Rework FK/PK join detection.
......................................................................


Patch Set 5: Code-Review+1

-- 
To view, visit http://gerrit.cloudera.org:8080/7257
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I49074fe743a28573cff541ef7dbd0edd88892067
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5547: Rework FK/PK join detection.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has uploaded a new patch set (#3).

Change subject: IMPALA-5547: Rework FK/PK join detection.
......................................................................

IMPALA-5547: Rework FK/PK join detection.

Reworks the FK/PK join detection logic to:
- more accurately recognize many-to-many joins
- avoid dim/dim joins for multi-column PKs

The new detection logic maintains our existing philosophy of generally
assuming a FK/PK join, unless there is strong evidence to the
contrary, as follows.

For each set of simple equi-join conjuncts between two tables, we
compute the joint NDV of the right-hand side columns by
multiplication, and if the joint NDV is significantly smaller than
the right-hand side row count, then we are fairly confident that the
right-hand side is not a PK. Otherwise, we assume the set of conjuncts
could represent a FK/PK relationship.

Extends the explain plan to include the outcome of the FK/PK detection
at EXPLAIN_LEVEL > STANDARD.

Performance testing:
1. Full TPC-DS run on 10TB:
   - Q10 improved by >100x
   - Q72 improved by >25x
   - Q17,Q26,Q29 improved by 2x
   - Q64 regressed by 10x
   - Total runtime: Improved by 2x
   - Geomean: Minor improvement
   The regression of Q64 is understood and we will try to address it
   in follow-on changes. The previous plan was better by accident and
   not because of superior logic.
2. Nightly TPC-H and TPC-DS runs:
   - No perf differences

Testing:
- The existing planner test cover the changes.
- Code/hdfs run passed.

Change-Id: I49074fe743a28573cff541ef7dbd0edd88892067
---
M fe/src/main/java/org/apache/impala/analysis/JoinOperator.java
M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
M fe/src/main/java/org/apache/impala/planner/JoinNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test
A testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection.test
M testdata/workloads/functional-planner/queries/PlannerTest/hbase.test
M testdata/workloads/functional-planner/queries/PlannerTest/join-order.test
M testdata/workloads/functional-planner/queries/PlannerTest/joins.test
M testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation.test
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds-all.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpch-all.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpch-kudu.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpch-nested.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpch-views.test
M testdata/workloads/functional-query/queries/QueryTest/explain-level2.test
M testdata/workloads/functional-query/queries/QueryTest/explain-level3.test
19 files changed, 1,633 insertions(+), 1,351 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/57/7257/3
-- 
To view, visit http://gerrit.cloudera.org:8080/7257
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I49074fe743a28573cff541ef7dbd0edd88892067
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>

[Impala-ASF-CR] IMPALA-5547: Rework FK/PK join detection.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has uploaded a new patch set (#4).

Change subject: IMPALA-5547: Rework FK/PK join detection.
......................................................................

IMPALA-5547: Rework FK/PK join detection.

Reworks the FK/PK join detection logic to:
- more accurately recognize many-to-many joins
- avoid dim/dim joins for multi-column PKs

The new detection logic maintains our existing philosophy of generally
assuming a FK/PK join, unless there is strong evidence to the
contrary, as follows.

For each set of simple equi-join conjuncts between two tables, we
compute the joint NDV of the right-hand side columns by
multiplication, and if the joint NDV is significantly smaller than
the right-hand side row count, then we are fairly confident that the
right-hand side is not a PK. Otherwise, we assume the set of conjuncts
could represent a FK/PK relationship.

Extends the explain plan to include the outcome of the FK/PK detection
at EXPLAIN_LEVEL > STANDARD.

Performance testing:
1. Full TPC-DS run on 10TB:
   - Q10 improved by >100x
   - Q72 improved by >25x
   - Q17,Q26,Q29 improved by 2x
   - Q64 regressed by 10x
   - Total runtime: Improved by 2x
   - Geomean: Minor improvement
   The regression of Q64 is understood and we will try to address it
   in follow-on changes. The previous plan was better by accident and
   not because of superior logic.
2. Nightly TPC-H and TPC-DS runs:
   - No perf differences

Testing:
- The existing planner test cover the changes.
- Code/hdfs run passed.

Change-Id: I49074fe743a28573cff541ef7dbd0edd88892067
---
M fe/src/main/java/org/apache/impala/analysis/JoinOperator.java
M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
M fe/src/main/java/org/apache/impala/planner/JoinNode.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test
A testdata/workloads/functional-planner/queries/PlannerTest/fk-pk-join-detection.test
M testdata/workloads/functional-planner/queries/PlannerTest/hbase.test
M testdata/workloads/functional-planner/queries/PlannerTest/join-order.test
M testdata/workloads/functional-planner/queries/PlannerTest/joins.test
M testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation.test
M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds-all.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpch-all.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpch-kudu.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpch-nested.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpch-views.test
M testdata/workloads/functional-query/queries/QueryTest/explain-level2.test
M testdata/workloads/functional-query/queries/QueryTest/explain-level3.test
19 files changed, 1,829 insertions(+), 1,354 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/57/7257/4
-- 
To view, visit http://gerrit.cloudera.org:8080/7257
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I49074fe743a28573cff541ef7dbd0edd88892067
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>