You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by viirya <gi...@git.apache.org> on 2017/10/15 06:06:45 UTC
[GitHub] spark pull request #19501: [SPARK-22223][SQL] ObjectHashAggregate should not...
GitHub user viirya opened a pull request:
https://github.com/apache/spark/pull/19501
[SPARK-22223][SQL] ObjectHashAggregate should not introduce unnecessary shuffle
## What changes were proposed in this pull request?
`ObjectHashAggregateExec` should override `outputPartitioning` in order to avoid unnecessary shuffle.
## How was this patch tested?
Added Jenkins test.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/viirya/spark-1 SPARK-22223
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19501.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19501
----
commit c84562763034e3fc6a7ddba785131cb4a1c36eb4
Author: Liang-Chi Hsieh <vi...@gmail.com>
Date: 2017-10-15T06:02:59Z
ObjectHashAggregate should not introduce unnecessary shuffle.
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19501
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19501
**[Test build #82769 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82769/testReport)** for PR 19501 at commit [`c845627`](https://github.com/apache/spark/commit/c84562763034e3fc6a7ddba785131cb4a1c36eb4).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19501: [SPARK-22223][SQL] ObjectHashAggregate should not...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19501#discussion_r144742648
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala ---
@@ -636,4 +637,33 @@ class DataFrameAggregateSuite extends QueryTest with SharedSQLContext {
spark.sql("SELECT 3 AS c, 4 AS d, SUM(b) FROM testData2 GROUP BY c, d"),
Seq(Row(3, 4, 9)))
}
+
+ test("SPARK-22223: ObjectHashAggregate should not introduce unnecessary shuffle") {
+ withSQLConf(SQLConf.USE_OBJECT_HASH_AGG.key -> "true") {
+ val df = Seq(("1", "2", 1), ("1", "2", 2), ("2", "3", 3), ("2", "3", 4)).toDF("a", "b", "c")
+ .repartition(col("a"))
+
+ val objHashAggDF = df
+ .withColumn("d", expr("(a, b, c)"))
+ .groupBy("a", "b").agg(collect_list("d").as("e"))
+ .withColumn("f", expr("(b, e)"))
+ .groupBy("a").agg(collect_list("f").as("g"))
+ val aggPlan = objHashAggDF.queryExecution.executedPlan
+
+ val sortAggPlans = aggPlan.collect {
+ case sortAgg: SortAggregateExec => sortAgg
+ }
+ assert(sortAggPlans.isEmpty)
+
+ val objHashAggPlans = aggPlan.collect {
+ case objHashAgg: ObjectHashAggregateExec => objHashAgg
+ }
+ assert(objHashAggPlans.length > 0)
--- End diff --
Ok. Sounds good.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19501
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82769/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19501
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82768/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19501
LGTM
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19501
**[Test build #82774 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82774/testReport)** for PR 19501 at commit [`45899fd`](https://github.com/apache/spark/commit/45899fd4fd4ee1efe5b75028c8fd9d2453c90aa8).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19501
**[Test build #82768 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82768/testReport)** for PR 19501 at commit [`c845627`](https://github.com/apache/spark/commit/c84562763034e3fc6a7ddba785131cb4a1c36eb4).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19501
retest this please
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19501
**[Test build #82768 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82768/testReport)** for PR 19501 at commit [`c845627`](https://github.com/apache/spark/commit/c84562763034e3fc6a7ddba785131cb4a1c36eb4).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19501
**[Test build #82769 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82769/testReport)** for PR 19501 at commit [`c845627`](https://github.com/apache/spark/commit/c84562763034e3fc6a7ddba785131cb4a1c36eb4).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19501
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82774/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19501
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19501
Thanks @HyukjinKwon @cloud-fan
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19501
thanks, merging to master!
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19501
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19501
**[Test build #82774 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82774/testReport)** for PR 19501 at commit [`45899fd`](https://github.com/apache/spark/commit/45899fd4fd4ee1efe5b75028c8fd9d2453c90aa8).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19501: [SPARK-22223][SQL] ObjectHashAggregate should not...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/19501
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19501: [SPARK-22223][SQL] ObjectHashAggregate should not...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/19501#discussion_r144739274
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala ---
@@ -636,4 +637,33 @@ class DataFrameAggregateSuite extends QueryTest with SharedSQLContext {
spark.sql("SELECT 3 AS c, 4 AS d, SUM(b) FROM testData2 GROUP BY c, d"),
Seq(Row(3, 4, 9)))
}
+
+ test("SPARK-22223: ObjectHashAggregate should not introduce unnecessary shuffle") {
+ withSQLConf(SQLConf.USE_OBJECT_HASH_AGG.key -> "true") {
+ val df = Seq(("1", "2", 1), ("1", "2", 2), ("2", "3", 3), ("2", "3", 4)).toDF("a", "b", "c")
+ .repartition(col("a"))
+
+ val objHashAggDF = df
+ .withColumn("d", expr("(a, b, c)"))
+ .groupBy("a", "b").agg(collect_list("d").as("e"))
+ .withColumn("f", expr("(b, e)"))
+ .groupBy("a").agg(collect_list("f").as("g"))
+ val aggPlan = objHashAggDF.queryExecution.executedPlan
+
+ val sortAggPlans = aggPlan.collect {
+ case sortAgg: SortAggregateExec => sortAgg
+ }
+ assert(sortAggPlans.isEmpty)
+
+ val objHashAggPlans = aggPlan.collect {
+ case objHashAgg: ObjectHashAggregateExec => objHashAgg
+ }
+ assert(objHashAggPlans.length > 0)
--- End diff --
Not a big deal at all but maybe `nonEmpty`?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org