
Posted to reviews@spark.apache.org by viirya <gi...@git.apache.org> on 2018/06/05 10:14:01 UTC

[GitHub] spark pull request #21498: [SPARK-24410][SQL][Core][WIP] Optimization for Un...

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/21498

    [SPARK-24410][SQL][Core][WIP] Optimization for Union outputPartitioning

    ## What changes were proposed in this pull request?
    
    Currently `Union` always reports an unknown output partitioning. As a result, even if its children are bucketed tables, we still shuffle the union result when we run an aggregation on it.
    
    This patch tries to choose a better output partitioning for the `Union` operator.
    
    This patch adds a private API `zipPartitions` to `RDD`. Since `zipPartitions` takes a function that runs on the elements of the RDDs, it supports only a fixed number of RDDs. For `Union`, however, the number of children is not fixed.
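
    As a rough illustration of why a variable-arity variant is useful (Spark's public `RDD.zipPartitions` overloads accept two, three, or four RDDs), the sketch below models an RDD as a plain sequence of partitions and zips the i-th partitions of any number of inputs. It is not the code of this patch: `zipAllPartitions`, `Partition`, and `Dataset` are hypothetical names, and no Spark classes are involved.

    ```scala
    // Spark-free model (assumption): a "dataset" is just a sequence of partitions.
    object ZipPartitionsSketch {
      type Partition[T] = Seq[T]
      type Dataset[T] = Seq[Partition[T]]

      // Zip the i-th partitions of every input and hand their iterators to `f`.
      // Unlike a fixed-arity signature, `inputs` may have any length.
      def zipAllPartitions[T, U](inputs: Seq[Dataset[T]])(
          f: Seq[Iterator[T]] => Iterator[U]): Dataset[U] = {
        require(inputs.nonEmpty, "need at least one input")
        val numPartitions = inputs.head.length
        require(inputs.forall(_.length == numPartitions),
          "all inputs must have the same number of partitions")
        (0 until numPartitions).map { i =>
          f(inputs.map(ds => ds(i).iterator)).toVector
        }
      }

      // Example: three inputs with two partitions each, concatenated partition-wise.
      val demo: Dataset[Int] = zipAllPartitions[Int, Int](Seq(
        Seq(Seq(1, 2), Seq(3)),
        Seq(Seq(10), Seq(20, 30)),
        Seq(Seq(100), Seq.empty[Int])
      ))(iters => iters.iterator.flatMap(it => it))
    }
    ```

    Here `demo` has two partitions, the same count as each input, rather than the six partitions a plain concatenation of the inputs would produce.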
    
    ## How was this patch tested?
    
    TBD.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 SPARK-24410

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21498.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21498
    
----
commit 9f25bd19c4802a59039cef6f006f6c6e0802e01d
Author: Liang-Chi Hsieh <vi...@...>
Date:   2018-06-05T10:05:27Z

    Optimization for Union outputPartitioning.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/21498
  
    > In theory, this should be done in a cost-based style. Changing the way union combines data will reduce the parallelism.
    > For example, suppose we union 2 tables that each have 5 partitions. Without this PR we launch 10 tasks to process the data, and locality should be easy to satisfy. After this PR, we launch only 5 tasks and locality is hard to meet, so we may have extra data transfer.
    
    Yes. The added `ZippedPartitionsRDD` for zipping RDDs works similarly to `PartitionerAwareUnionRDD`: the preferred location for each partition will be the most common preferred location among the zipped partitions.
    
    It would be good to have a smarter solution that can make a better choice between shuffle and locality/parallelism.
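
    The "most common preferred location" rule mentioned above can be sketched as follows. This is a minimal, Spark-free illustration under stated assumptions, not the code of `ZippedPartitionsRDD` or `PartitionerAwareUnionRDD`: `mostCommonLocation` is a hypothetical helper, and hosts are represented as plain strings.

    ```scala
    object PreferredLocationSketch {
      // `parentLocations` holds, for each zipped parent partition, the hosts it
      // prefers. Pick the host that appears most often overall; return None when
      // no parent reports any preference. Ties are broken arbitrarily.
      def mostCommonLocation(parentLocations: Seq[Seq[String]]): Option[String] = {
        val all = parentLocations.flatten
        if (all.isEmpty) None
        else Some(all.groupBy(identity).maxBy { case (_, hits) => hits.length }._1)
      }
    }
    ```

    With this rule, a task runs where most of its zipped input partitions are local, and the remaining partitions may have to be read remotely, which is the extra data transfer raised in the quoted concern.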
    
    



---


[GitHub] spark pull request #21498: [SPARK-24410][SQL][Core] Optimization for Union o...

Posted by viirya <gi...@git.apache.org>.

Github user viirya closed the pull request at:

    https://github.com/apache/spark/pull/21498


---


[GitHub] spark issue #21498: [SPARK-24410][SQL][Core][WIP] Optimization for Union out...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21498
  
    **[Test build #91493 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91493/testReport)** for PR 21498 at commit [`b058f89`](https://github.com/apache/spark/commit/b058f892af8204cb25a8daa1f1cd1a6de21c5fd6).


---


[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/21498
  
    @mgaido91 WDYT? Does the benchmark make sense to you?


---


[GitHub] spark issue #21498: [SPARK-24410][SQL][Core][WIP] Optimization for Union out...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21498
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3812/
    Test PASSed.


---


[GitHub] spark issue #21498: [SPARK-24410][SQL][Core][WIP] Optimization for Union out...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21498
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3818/
    Test PASSed.


---


[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21498
  
    **[Test build #91495 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91495/testReport)** for PR 21498 at commit [`0dedf44`](https://github.com/apache/spark/commit/0dedf44559fb6da11c5d903c51bb73f5f508fe6f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/21498
  
    When the condition is satisfied and we know the children of Union have the same partitioning, this makes the first partition of the union result include the first partitions of the children RDDs, and likewise for the 2nd, 3rd partitions, and so on.
    
    I'm not sure in which part the data transfer might happen?
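
    The partition-wise union described above can be sketched without Spark as follows. This is only an illustration under stated assumptions: `partitionOf` is a hypothetical stand-in for a hash partitioner, and `hashPartition`, `unionInSamePartition`, and `localCounts` are hypothetical helpers, not APIs of this patch.

    ```scala
    object CoPartitionedUnionSketch {
      type Row = (Int, String)

      // Hypothetical stand-in for a hash partitioner: key -> partition index.
      def partitionOf(key: Int, numPartitions: Int): Int =
        ((key % numPartitions) + numPartitions) % numPartitions

      // Split rows into `numPartitions` buckets by key.
      def hashPartition(rows: Seq[Row], numPartitions: Int): Vector[Vector[Row]] = {
        val grouped = rows.groupBy { case (k, _) => partitionOf(k, numPartitions) }
        Vector.tabulate(numPartitions)(i => grouped.getOrElse(i, Seq.empty[Row]).toVector)
      }

      // Output partition i is the concatenation of partition i of every child.
      def unionInSamePartition(children: Seq[Vector[Vector[Row]]]): Vector[Vector[Row]] = {
        require(children.nonEmpty, "need at least one child")
        val n = children.head.length
        require(children.forall(_.length == n), "children must be co-partitioned")
        Vector.tabulate(n)(i => children.flatMap(child => child(i)).toVector)
      }

      // Count rows per key inside each partition, with no data movement.
      def localCounts(partitions: Vector[Vector[Row]]): Vector[Map[Int, Int]] =
        partitions.map(p => p.groupBy(_._1).map { case (k, rows) => k -> rows.size })
    }
    ```

    Because every child places a given key in the same partition index, each key ends up in exactly one output partition, so a per-key aggregation can be computed locally on each partition without a shuffle. Two children with 5 partitions each yield 5 output partitions instead of 10.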


---


[GitHub] spark issue #21498: [SPARK-24410][SQL][Core][WIP] Optimization for Union out...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21498
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3819/
    Test FAILed.


---


[GitHub] spark issue #21498: [SPARK-24410][SQL][Core][WIP] Optimization for Union out...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21498
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91482/
    Test FAILed.


---


[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21498
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3821/
    Test PASSed.


---


[GitHub] spark issue #21498: [SPARK-24410][SQL][Core][WIP] Optimization for Union out...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21498
  
    Merged build finished. Test FAILed.


---


[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21498
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #21498: [SPARK-24410][SQL][Core][WIP] Optimization for Union out...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21498
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21498
  
    Merged build finished. Test FAILed.


---


[GitHub] spark issue #21498: [SPARK-24410][SQL][Core][WIP] Optimization for Union out...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21498
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/21498
  
    I also ran a benchmark on a 5-node Spark cluster on EC2.
    
    ```scala
    // Runs `func` once and prints the elapsed wall-clock time in nanoseconds.
    def benchmark(func: () => Unit): Unit = {
      val t0 = System.nanoTime()
      func()
      val t1 = System.nanoTime()
      println("Elapsed time: " + (t1 - t0) + "ns")
    }
    
    val N = 10000
    
    val data = (0 until N).map { i =>
      (i, i % 2, i % 3, Array.fill(10)(i), Array.fill(10)(i.toString), Array.fill(10)(i.toDouble), (i, i.toString, i.toDouble))
    }
    
    val df1 = sc.parallelize(data).toDF("key", "t1", "t2", "t3", "t4", "t5", "t6").repartition($"key")
    val df2 = sc.parallelize(data).toDF("key", "t1", "t2", "t3", "t4", "t5", "t6").repartition($"key")
    
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
    spark.conf.set("spark.sql.unionInSamePartition", "true")
    
    val df3 = df1.union(df2).groupBy("key").agg(count("*"))
    val df4 = df1.union(df2)
    val df5 = df3.sample(0.8).filter($"key" > 1000000).sample(0.4)
    val df6 = df4.sample(0.8).filter($"key" > 1000000).sample(0.4)
    
    benchmark(() => df3.collect)
    benchmark(() => df4.collect)
    benchmark(() => df5.collect)
    benchmark(() => df6.collect)
    ```
    
    Before:
    ```scala
    scala> benchmark(() => df3.collect)
    Elapsed time: 663668585ns
    scala> benchmark(() => df4.collect)
    Elapsed time: 547487953ns
    scala> benchmark(() => df5.collect)
    Elapsed time: 712634187ns
    scala> benchmark(() => df6.collect)
    Elapsed time: 491917400ns
    ```
    
    After:
    ```scala
    scala> benchmark(() => df3.collect)
    Elapsed time: 516797788ns
    scala> benchmark(() => df4.collect)
    Elapsed time: 557499803ns
    scala> benchmark(() => df5.collect)
    Elapsed time: 611327782ns
    scala> benchmark(() => df6.collect)
    Elapsed time: 495387557ns
    ```
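
    To make the numbers above easier to compare, the following sketch computes the relative change of each "After" timing against its "Before" timing. The timings are copied from the runs above; `BenchmarkDeltaSketch` and `relativeChange` are hypothetical names used only for this illustration.

    ```scala
    object BenchmarkDeltaSketch {
      // (query, before in ns, after in ns), copied from the runs reported above.
      val timings: Seq[(String, Long, Long)] = Seq(
        ("df3", 663668585L, 516797788L),
        ("df4", 547487953L, 557499803L),
        ("df5", 712634187L, 611327782L),
        ("df6", 491917400L, 495387557L)
      )

      // Relative change of `after` versus `before`; a negative value means faster.
      def relativeChange(before: Long, after: Long): Double =
        (after - before).toDouble / before

      def main(args: Array[String]): Unit =
        timings.foreach { case (name, before, after) =>
          println(f"$name: ${relativeChange(before, after) * 100}%.1f%%")
        }
    }
    ```

    On these single runs, the two queries built on the aggregation (`df3`, `df5`) come out roughly 22% and 14% faster, while the plain union queries (`df4`, `df6`) change by less than 2%, which may well be within run-to-run noise.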


---


[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21498
  
    **[Test build #91493 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91493/testReport)** for PR 21498 at commit [`b058f89`](https://github.com/apache/spark/commit/b058f892af8204cb25a8daa1f1cd1a6de21c5fd6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/21498
  
    Tests are added. cc @kiszk @mgaido91 


---


[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21498
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21498
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91496/
    Test FAILed.


---


[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/21498
  
    I'd like to close this for now and wait for the necessary changes on statistics.


---


[GitHub] spark issue #21498: [SPARK-24410][SQL][Core] Optimization for Union outputPa...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/21498
  
    Thanks for your benchmark @viirya. The performance improvement is sensible, and there seems to be no performance regression in the other case. Can we also have a similar benchmark with records that have a more complex schema? Thanks.


---
