You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Aman Sinha (Code Review)" <ge...@cloudera.org> on 2020/12/12 00:11:24 UTC

[Impala-ASF-CR] IMPALA-10287: Include parallelism in cost comparison of broadcast vs partition

Aman Sinha has uploaded this change for review. ( http://gerrit.cloudera.org:8080/16864


Change subject: IMPALA-10287: Include parallelism in cost comparison of broadcast vs partition
......................................................................

IMPALA-10287: Include parallelism in cost comparison of broadcast vs partition

The current planner tends to pick broadcast distribution in some cases
even when partition distribution would be more optimal (seen in
TPC-DS performance runs).

This patch adds 2 query options:
 - use_dop_for_costing (type:boolean, default:false)
 - broadcast_to_partition_factor (type:double, default:1.0)
By default, they don't alter the current behavior of the planner. If
use_dop_for_costing is enabled, the distributed planner will increase
the cost of the broadcast join's build side by C.ln(m) where
m = degree of parallelism of the join node and,
C = the broadcast_to_partition_factor
This allows the planner to more favorably consider partition distribution
where appropriate.

The choice of natural log in the calculation is not a final choice
at this point but is intended to model a non-linear relationship
between mt_dop and the query performance. After further performance
testing with tuning the above factor, we can establish a better
correlation and refine the formula.

Testing:
 - Added a new test file with TPC-DS Q78 which shows partition
   distribution for a left-outer join in the query when the
   query options are enabled (it chooses broadcast otherwise).

Change-Id: Idff569299e5c78720ca17c616a531adac78208e1
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A testdata/workloads/functional-planner/queries/PlannerTest/tpcds-dist-method.test
7 files changed, 603 insertions(+), 4 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/64/16864/1
-- 
To view, visit http://gerrit.cloudera.org:8080/16864
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Idff569299e5c78720ca17c616a531adac78208e1
Gerrit-Change-Number: 16864
Gerrit-PatchSet: 1
Gerrit-Owner: Aman Sinha <am...@cloudera.com>

[Impala-ASF-CR] IMPALA-10287: Include parallelism in cost comparison of broadcast vs partition

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16864 )

Change subject: IMPALA-10287: Include parallelism in cost comparison of broadcast vs partition
......................................................................


Patch Set 1:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/7844/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16864
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Idff569299e5c78720ca17c616a531adac78208e1
Gerrit-Change-Number: 16864
Gerrit-PatchSet: 1
Gerrit-Owner: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Sat, 12 Dec 2020 00:32:44 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10287: Include parallelism in cost comparison of broadcast vs partition

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16864 )

Change subject: IMPALA-10287: Include parallelism in cost comparison of broadcast vs partition
......................................................................


Patch Set 3: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/16864
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Idff569299e5c78720ca17c616a531adac78208e1
Gerrit-Change-Number: 16864
Gerrit-PatchSet: 3
Gerrit-Owner: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Tue, 15 Dec 2020 12:05:45 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10287: Include parallelism in cost comparison of broadcast vs partition

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16864 )

Change subject: IMPALA-10287: Include parallelism in cost comparison of broadcast vs partition
......................................................................


Patch Set 2:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/7856/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16864
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Idff569299e5c78720ca17c616a531adac78208e1
Gerrit-Change-Number: 16864
Gerrit-PatchSet: 2
Gerrit-Owner: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Tue, 15 Dec 2020 00:55:33 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10287: Include parallelism in cost comparison of broadcast vs partition

Posted by "Aman Sinha (Code Review)" <ge...@cloudera.org>.
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16864 )

Change subject: IMPALA-10287: Include parallelism in cost comparison of broadcast vs partition
......................................................................


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/16864/1/common/thrift/ImpalaInternalService.thrift
File common/thrift/ImpalaInternalService.thrift:

http://gerrit.cloudera.org:8080/#/c/16864/1/common/thrift/ImpalaInternalService.thrift@473
PS1, Line 473:   118: optional bool use_dop_for_costing = true;
> I'd consider enabling it by default, it seems like a fairly clear win parti
Enabled this.  It didn't actually change the plans in the test suite since mt_dop is small for dev testing.


http://gerrit.cloudera.org:8080/#/c/16864/1/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
File fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java:

http://gerrit.cloudera.org:8080/#/c/16864/1/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java@515
PS1, Line 515: sqrt(
> I wonder if natural log reduces the influence of DOP too much, e.g. log(16)
I was actually debating whether to use sqrt at one point but fell back to a more conservative function. But given your feedback and also offline feedback from Kurt about this not being aggressive enough and also the fact that the whole costing change can be easily disabled,  I went ahead and changed this to sqrt . 

I figured it would be easier to do the performance evaluation once this patch is merged and so I have created a follow-on JIRA (https://issues.apache.org/jira/browse/IMPALA-10395) to gather more performance data and use a curve-fitting to determine what the right equation should be.



-- 
To view, visit http://gerrit.cloudera.org:8080/16864
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Idff569299e5c78720ca17c616a531adac78208e1
Gerrit-Change-Number: 16864
Gerrit-PatchSet: 2
Gerrit-Owner: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Tue, 15 Dec 2020 00:31:14 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10287: Include parallelism in cost comparison of broadcast vs partition

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16864 )

Change subject: IMPALA-10287: Include parallelism in cost comparison of broadcast vs partition
......................................................................


Patch Set 1:

(2 comments)

Code change makes sense and seems fine. I have some questions about whether this is going far enough, but it sounds like you maybe have plans to improve further - is there a follow-on JIRA for that work?

http://gerrit.cloudera.org:8080/#/c/16864/1/common/thrift/ImpalaInternalService.thrift
File common/thrift/ImpalaInternalService.thrift:

http://gerrit.cloudera.org:8080/#/c/16864/1/common/thrift/ImpalaInternalService.thrift@473
PS1, Line 473:   118: optional bool use_dop_for_costing = false;
I'd consider enabling it by default, it seems like a fairly clear win particularly given the log() function making the estimate of the benefit fairly conservative.


http://gerrit.cloudera.org:8080/#/c/16864/1/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
File fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java:

http://gerrit.cloudera.org:8080/#/c/16864/1/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java@515
PS1, Line 515: log1p
I wonder if natural log reduces the influence of DOP too much, e.g. log(16) = 2.7, log(64) = 4.2 and presumably the speedup in hash table build time would be more than that. sqrt is also pretty conservative but seems to produce numbers that would be a bit closer to what we'd expect sqrt(16) = 4, sqrt(64) = 8.

Not a blocker for sure.

Edit: oh I see you're planning to investigate more.



-- 
To view, visit http://gerrit.cloudera.org:8080/16864
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Idff569299e5c78720ca17c616a531adac78208e1
Gerrit-Change-Number: 16864
Gerrit-PatchSet: 1
Gerrit-Owner: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Mon, 14 Dec 2020 20:43:35 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10287: Include parallelism in cost comparison of broadcast vs partition

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16864 )

Change subject: IMPALA-10287: Include parallelism in cost comparison of broadcast vs partition
......................................................................


Patch Set 3:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6777/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/16864
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Idff569299e5c78720ca17c616a531adac78208e1
Gerrit-Change-Number: 16864
Gerrit-PatchSet: 3
Gerrit-Owner: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Tue, 15 Dec 2020 06:35:14 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10287: Include parallelism in cost comparison of broadcast vs partition

Posted by "Aman Sinha (Code Review)" <ge...@cloudera.org>.
Hello Tim Armstrong, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16864

to look at the new patch set (#2).

Change subject: IMPALA-10287: Include parallelism in cost comparison of broadcast vs partition
......................................................................

IMPALA-10287: Include parallelism in cost comparison of broadcast vs partition

The current planner tends to pick broadcast distribution in some cases
even when partition distribution would be more optimal (seen in
TPC-DS performance runs).

This patch adds 2 query options:
 - use_dop_for_costing (type:boolean, default:true)
 - broadcast_to_partition_factor (type:double, default:1.0)
With use_dop_for_costing enabled, the distributed planner will increase
the cost of the broadcast join's build side by C.sqrt(m) where
m = degree of parallelism of the join node and,
C = the broadcast_to_partition_factor
This allows the planner to more favorably consider partition distribution
where appropriate.

The choice of sqrt in the calculation is not a final choice
at this point but is intended to model a non-linear relationship
between mt_dop and the query performance. After further performance
testing with tuning the above factor, we can establish a better
correlation and refine the formula (tracked by IMPALA-10395).

Testing:
 - Added a new test file with TPC-DS Q78 which shows partition
   distribution for a left-outer join (with store_returns on the right
   input) in the query when the query options are enabled (it chooses
   broadcast otherwise).
 - Ran PlannerTest and TpcdsPlannerTest.
 - Ran e2e tests for Tpcds and Tpch.

Change-Id: Idff569299e5c78720ca17c616a531adac78208e1
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A testdata/workloads/functional-planner/queries/PlannerTest/tpcds-dist-method.test
7 files changed, 603 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/64/16864/2
-- 
To view, visit http://gerrit.cloudera.org:8080/16864
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Idff569299e5c78720ca17c616a531adac78208e1
Gerrit-Change-Number: 16864
Gerrit-PatchSet: 2
Gerrit-Owner: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-10287: Include parallelism in cost comparison of broadcast vs partition

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16864 )

Change subject: IMPALA-10287: Include parallelism in cost comparison of broadcast vs partition
......................................................................


Patch Set 3: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/16864
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Idff569299e5c78720ca17c616a531adac78208e1
Gerrit-Change-Number: 16864
Gerrit-PatchSet: 3
Gerrit-Owner: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Tue, 15 Dec 2020 06:35:13 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10287: Include parallelism in cost comparison of broadcast vs partition

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16864 )

Change subject: IMPALA-10287: Include parallelism in cost comparison of broadcast vs partition
......................................................................


Patch Set 2: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/16864
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Idff569299e5c78720ca17c616a531adac78208e1
Gerrit-Change-Number: 16864
Gerrit-PatchSet: 2
Gerrit-Owner: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Tue, 15 Dec 2020 05:50:36 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10287: Include parallelism in cost comparison of broadcast vs partition

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/16864 )

Change subject: IMPALA-10287: Include parallelism in cost comparison of broadcast vs partition
......................................................................

IMPALA-10287: Include parallelism in cost comparison of broadcast vs partition

The current planner tends to pick broadcast distribution in some cases
even when partition distribution would be more optimal (seen in
TPC-DS performance runs).

This patch adds 2 query options:
 - use_dop_for_costing (type:boolean, default:true)
 - broadcast_to_partition_factor (type:double, default:1.0)
With use_dop_for_costing enabled, the distributed planner will increase
the cost of the broadcast join's build side by C.sqrt(m) where
m = degree of parallelism of the join node and,
C = the broadcast_to_partition_factor
This allows the planner to more favorably consider partition distribution
where appropriate.

The choice of sqrt in the calculation is not a final choice
at this point but is intended to model a non-linear relationship
between mt_dop and the query performance. After further performance
testing with tuning the above factor, we can establish a better
correlation and refine the formula (tracked by IMPALA-10395).

Testing:
 - Added a new test file with TPC-DS Q78 which shows partition
   distribution for a left-outer join (with store_returns on the right
   input) in the query when the query options are enabled (it chooses
   broadcast otherwise).
 - Ran PlannerTest and TpcdsPlannerTest.
 - Ran e2e tests for Tpcds and Tpch.

Change-Id: Idff569299e5c78720ca17c616a531adac78208e1
Reviewed-on: http://gerrit.cloudera.org:8080/16864
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
A testdata/workloads/functional-planner/queries/PlannerTest/tpcds-dist-method.test
7 files changed, 603 insertions(+), 4 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/16864
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Idff569299e5c78720ca17c616a531adac78208e1
Gerrit-Change-Number: 16864
Gerrit-PatchSet: 4
Gerrit-Owner: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>