Posted to reviews@spark.apache.org by mgaido91 <gi...@git.apache.org> on 2018/05/15 14:46:04 UTC

[GitHub] spark pull request #21333: [SPARK-23778][CORE] Avoid unneeded shuffle when u...

GitHub user mgaido91 opened a pull request:

    https://github.com/apache/spark/pull/21333

    [SPARK-23778][CORE] Avoid unneeded shuffle when union gets an empty RDD

    ## What changes were proposed in this pull request?
    
    When `union` is invoked on several RDDs and one of them is empty, the result of the operation is a `UnionRDD`. This causes an unneeded extra shuffle when all the other RDDs share the same partitioning.
    
    This PR makes the `union` method ignore empty input RDDs.
    
    ## How was this patch tested?
    
    Added a unit test.
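
A minimal sketch of the behaviour described above, written as a plain-Scala model rather than real Spark code. `FakeRdd`, `PlainUnion`, `PartitionerAwareUnion`, `plan`, and the `dropEmpty` flag are all hypothetical names introduced for illustration, and the sketch assumes that an empty RDD has zero partitions and no partitioner.

```scala
// Plain-Scala model of the selection logic; none of these types are Spark classes.
object UnionSketch {
  // Hypothetical stand-in for an RDD: a partition count and an optional partitioner id.
  final case class FakeRdd(name: String, numPartitions: Int, partitioner: Option[String])

  sealed trait Plan
  // Stands in for a plain union, which does not carry a partitioner,
  // so a later keyed operation would have to shuffle again.
  final case class PlainUnion(inputs: Seq[FakeRdd]) extends Plan
  // Stands in for a union that keeps the partitioner shared by its inputs.
  final case class PartitionerAwareUnion(inputs: Seq[FakeRdd]) extends Plan

  // dropEmpty = false models the behaviour before the change, true models it after.
  def plan(rdds: Seq[FakeRdd], dropEmpty: Boolean): Plan = {
    val inputs = if (dropEmpty) rdds.filter(_.numPartitions > 0) else rdds
    val partitioners = inputs.flatMap(_.partitioner).toSet
    if (inputs.forall(_.partitioner.isDefined) && partitioners.size == 1) {
      PartitionerAwareUnion(inputs)
    } else {
      PlainUnion(inputs)
    }
  }

  def main(args: Array[String]): Unit = {
    val a = FakeRdd("a", 4, Some("hash-4"))
    val b = FakeRdd("b", 4, Some("hash-4"))
    val empty = FakeRdd("empty", 0, None)
    println(plan(Seq(a, b, empty), dropEmpty = false)) // the empty input forces a plain union
    println(plan(Seq(a, b, empty), dropEmpty = true))  // the shared partitioner is kept
  }
}
```

In this model the single empty input is enough to lose the shared partitioner, which is the situation the PR description calls an unneeded extra shuffle.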


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-23778

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21333.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21333
    
----
commit f67a88dc95845ae078655b237cd5ed8873ba465b
Author: Marco Gaido <ma...@...>
Date:   2018-05-15T14:40:57Z

    [SPARK-23778][CORE] Avoid unneeded shuffle when union gets an empty RDD

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21333: [SPARK-23778][CORE] Avoid unneeded shuffle when union ge...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21333
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21333: [SPARK-23778][CORE] Avoid unneeded shuffle when union ge...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21333
  
    **[Test build #92098 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92098/testReport)** for PR 21333 at commit [`7f16ea0`](https://github.com/apache/spark/commit/7f16ea0c423461aeaeb7f3878d45074608558b1b).


---



[GitHub] spark issue #21333: [SPARK-23778][CORE] Avoid unneeded shuffle when union ge...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21333
  
    LGTM


---



[GitHub] spark pull request #21333: [SPARK-23778][CORE] Avoid unneeded shuffle when u...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21333#discussion_r192033104
  
    --- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
    @@ -154,6 +154,13 @@ class RDDSuite extends SparkFunSuite with SharedSparkContext {
         }
       }
     
    +  test("SPARK-23778: empty RDD in union should not produce a UnionRDD") {
    --- End diff --
    
    When all RDDs are empty, we still return a `UnionRDD`. That is not a big problem in this case, because shuffling an empty RDD is harmless.
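
One way to see why the all-empty case still ends up as a plain union is sketched below in plain Scala, with no Spark dependency. It assumes the method first drops inputs with zero partitions and then requires exactly one distinct partitioner among the rest; `FakeRdd` and `keepsPartitioner` are hypothetical names, not Spark API.

```scala
object AllEmptySketch {
  // Hypothetical stand-in for an RDD: a partition count and an optional partitioner id.
  final case class FakeRdd(numPartitions: Int, partitioner: Option[String])

  // True when the union could keep a shared partitioner under the assumed check.
  def keepsPartitioner(rdds: Seq[FakeRdd]): Boolean = {
    val nonEmpty = rdds.filter(_.numPartitions > 0)
    val partitioners = nonEmpty.flatMap(_.partitioner).toSet
    // forall is vacuously true on an empty Seq, so the size check decides the all-empty case.
    nonEmpty.forall(_.partitioner.isDefined) && partitioners.size == 1
  }

  def main(args: Array[String]): Unit = {
    val empty = FakeRdd(0, None)
    println(keepsPartitioner(Seq(empty, empty)))                  // no partitioner left to keep
    println(keepsPartitioner(Seq(FakeRdd(2, Some("p")), empty)))  // one non-empty input remains
  }
}
```

With every input filtered out, the set of partitioners is empty, so the check fails and the model falls back to a plain union over no parents.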


---



[GitHub] spark issue #21333: [SPARK-23778][CORE] Avoid unneeded shuffle when union ge...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21333
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92098/


---



[GitHub] spark issue #21333: [SPARK-23778][CORE] Avoid unneeded shuffle when union ge...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21333
  
    **[Test build #90647 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90647/testReport)** for PR 21333 at commit [`f67a88d`](https://github.com/apache/spark/commit/f67a88dc95845ae078655b237cd5ed8873ba465b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21333: [SPARK-23778][CORE] Avoid unneeded shuffle when union ge...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21333
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3236/


---



[GitHub] spark pull request #21333: [SPARK-23778][CORE] Avoid unneeded shuffle when u...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21333#discussion_r196521913
  
    --- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
    @@ -154,6 +154,13 @@ class RDDSuite extends SparkFunSuite with SharedSparkContext {
         }
       }
     
    +  test("SPARK-23778: empty RDD in union should not produce a UnionRDD") {
    --- End diff --
    
    Can we add a test? Just to make sure we are safe when `UnionRDD.rdds` is `Nil`.


---



[GitHub] spark issue #21333: [SPARK-23778][CORE] Avoid unneeded shuffle when union ge...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21333
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/327/


---



[GitHub] spark issue #21333: [SPARK-23778][CORE] Avoid unneeded shuffle when union ge...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21333
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/4221/


---



[GitHub] spark issue #21333: [SPARK-23778][CORE] Avoid unneeded shuffle when union ge...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21333
  
    **[Test build #90647 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90647/testReport)** for PR 21333 at commit [`f67a88d`](https://github.com/apache/spark/commit/f67a88dc95845ae078655b237cd5ed8873ba465b).


---



[GitHub] spark issue #21333: [SPARK-23778][CORE] Avoid unneeded shuffle when union ge...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21333
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90647/


---



[GitHub] spark issue #21333: [SPARK-23778][CORE] Avoid unneeded shuffle when union ge...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21333
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21333: [SPARK-23778][CORE] Avoid unneeded shuffle when union ge...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/21333
  
    cc @cloud-fan @JoshRosen 


---



[GitHub] spark issue #21333: [SPARK-23778][CORE] Avoid unneeded shuffle when union ge...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/21333
  
    @varuvish the Dataset API uses `sparkContext.union` under the hood, so the current change addresses it as well.


---



[GitHub] spark issue #21333: [SPARK-23778][CORE] Avoid unneeded shuffle when union ge...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21333
  
    **[Test build #92098 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92098/testReport)** for PR 21333 at commit [`7f16ea0`](https://github.com/apache/spark/commit/7f16ea0c423461aeaeb7f3878d45074608558b1b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21333: [SPARK-23778][CORE] Avoid unneeded shuffle when union ge...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21333
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #21333: [SPARK-23778][CORE] Avoid unneeded shuffle when u...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21333#discussion_r196541591
  
    --- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
    @@ -154,6 +154,13 @@ class RDDSuite extends SparkFunSuite with SharedSparkContext {
         }
       }
     
    +  test("SPARK-23778: empty RDD in union should not produce a UnionRDD") {
    --- End diff --
    
    Added, thanks.


---



[GitHub] spark issue #21333: [SPARK-23778][CORE] Avoid unneeded shuffle when union ge...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21333
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21333: [SPARK-23778][CORE] Avoid unneeded shuffle when union ge...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21333
  
    Thanks, merging to master!


---



[GitHub] spark issue #21333: [SPARK-23778][CORE] Avoid unneeded shuffle when union ge...

Posted by varuvish <gi...@git.apache.org>.
Github user varuvish commented on the issue:

    https://github.com/apache/spark/pull/21333
  
    Nice change! I tested this out as well and verified that the shuffle doesn't happen. I did notice that this change wasn't reflected in the Dataset API. Is that something that should be addressed in this change?


---



[GitHub] spark issue #21333: [SPARK-23778][CORE] Avoid unneeded shuffle when union ge...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21333
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #21333: [SPARK-23778][CORE] Avoid unneeded shuffle when u...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/21333


---



[GitHub] spark pull request #21333: [SPARK-23778][CORE] Avoid unneeded shuffle when u...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21333#discussion_r192000399
  
    --- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
    @@ -154,6 +154,13 @@ class RDDSuite extends SparkFunSuite with SharedSparkContext {
         }
       }
     
    +  test("SPARK-23778: empty RDD in union should not produce a UnionRDD") {
    --- End diff --
    
    Have we tested when all input RDDs are empty?


---
