You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by viirya <gi...@git.apache.org> on 2017/10/15 06:06:45 UTC

[GitHub] spark pull request #19501: [SPARK-22223][SQL] ObjectHashAggregate should not...

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/19501

    [SPARK-22223][SQL] ObjectHashAggregate should not introduce unnecessary shuffle

    ## What changes were proposed in this pull request?
    
    `ObjectHashAggregateExec` should override `outputPartitioning` in order to avoid unnecessary shuffle.
    
    ## How was this patch tested?
    
    Added Jenkins test.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 SPARK-22223

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19501.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19501
    
----
commit c84562763034e3fc6a7ddba785131cb4a1c36eb4
Author: Liang-Chi Hsieh <vi...@gmail.com>
Date:   2017-10-15T06:02:59Z

    ObjectHashAggregate should not introduce unnecessary shuffle.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19501
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19501
  
    **[Test build #82769 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82769/testReport)** for PR 19501 at commit [`c845627`](https://github.com/apache/spark/commit/c84562763034e3fc6a7ddba785131cb4a1c36eb4).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19501: [SPARK-22223][SQL] ObjectHashAggregate should not...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19501#discussion_r144742648
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala ---
    @@ -636,4 +637,33 @@ class DataFrameAggregateSuite extends QueryTest with SharedSQLContext {
           spark.sql("SELECT 3 AS c, 4 AS d, SUM(b) FROM testData2 GROUP BY c, d"),
           Seq(Row(3, 4, 9)))
       }
    +
    +  test("SPARK-22223: ObjectHashAggregate should not introduce unnecessary shuffle") {
    +    withSQLConf(SQLConf.USE_OBJECT_HASH_AGG.key -> "true") {
    +      val df = Seq(("1", "2", 1), ("1", "2", 2), ("2", "3", 3), ("2", "3", 4)).toDF("a", "b", "c")
    +        .repartition(col("a"))
    +
    +      val objHashAggDF = df
    +        .withColumn("d", expr("(a, b, c)"))
    +        .groupBy("a", "b").agg(collect_list("d").as("e"))
    +        .withColumn("f", expr("(b, e)"))
    +        .groupBy("a").agg(collect_list("f").as("g"))
    +      val aggPlan = objHashAggDF.queryExecution.executedPlan
    +
    +      val sortAggPlans = aggPlan.collect {
    +        case sortAgg: SortAggregateExec => sortAgg
    +      }
    +      assert(sortAggPlans.isEmpty)
    +
    +      val objHashAggPlans = aggPlan.collect {
    +        case objHashAgg: ObjectHashAggregateExec => objHashAgg
    +      }
    +      assert(objHashAggPlans.length > 0)
    --- End diff --
    
    Ok. Sounds good.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19501
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82769/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19501
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82768/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19501
  
    LGTM


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19501
  
    **[Test build #82774 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82774/testReport)** for PR 19501 at commit [`45899fd`](https://github.com/apache/spark/commit/45899fd4fd4ee1efe5b75028c8fd9d2453c90aa8).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19501
  
    **[Test build #82768 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82768/testReport)** for PR 19501 at commit [`c845627`](https://github.com/apache/spark/commit/c84562763034e3fc6a7ddba785131cb4a1c36eb4).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19501
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19501
  
    **[Test build #82768 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82768/testReport)** for PR 19501 at commit [`c845627`](https://github.com/apache/spark/commit/c84562763034e3fc6a7ddba785131cb4a1c36eb4).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19501
  
    **[Test build #82769 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82769/testReport)** for PR 19501 at commit [`c845627`](https://github.com/apache/spark/commit/c84562763034e3fc6a7ddba785131cb4a1c36eb4).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19501
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82774/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19501
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19501
  
    Thanks @HyukjinKwon @cloud-fan 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19501
  
    thanks, merging to master!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19501
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19501
  
    **[Test build #82774 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82774/testReport)** for PR 19501 at commit [`45899fd`](https://github.com/apache/spark/commit/45899fd4fd4ee1efe5b75028c8fd9d2453c90aa8).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19501: [SPARK-22223][SQL] ObjectHashAggregate should not...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19501


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19501: [SPARK-22223][SQL] ObjectHashAggregate should not...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19501#discussion_r144739274
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala ---
    @@ -636,4 +637,33 @@ class DataFrameAggregateSuite extends QueryTest with SharedSQLContext {
           spark.sql("SELECT 3 AS c, 4 AS d, SUM(b) FROM testData2 GROUP BY c, d"),
           Seq(Row(3, 4, 9)))
       }
    +
    +  test("SPARK-22223: ObjectHashAggregate should not introduce unnecessary shuffle") {
    +    withSQLConf(SQLConf.USE_OBJECT_HASH_AGG.key -> "true") {
    +      val df = Seq(("1", "2", 1), ("1", "2", 2), ("2", "3", 3), ("2", "3", 4)).toDF("a", "b", "c")
    +        .repartition(col("a"))
    +
    +      val objHashAggDF = df
    +        .withColumn("d", expr("(a, b, c)"))
    +        .groupBy("a", "b").agg(collect_list("d").as("e"))
    +        .withColumn("f", expr("(b, e)"))
    +        .groupBy("a").agg(collect_list("f").as("g"))
    +      val aggPlan = objHashAggDF.queryExecution.executedPlan
    +
    +      val sortAggPlans = aggPlan.collect {
    +        case sortAgg: SortAggregateExec => sortAgg
    +      }
    +      assert(sortAggPlans.isEmpty)
    +
    +      val objHashAggPlans = aggPlan.collect {
    +        case objHashAgg: ObjectHashAggregateExec => objHashAgg
    +      }
    +      assert(objHashAggPlans.length > 0)
    --- End diff --
    
    Not a big deal at all but maybe `nonEmpty`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org