You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by cloud-fan <gi...@git.apache.org> on 2018/04/18 14:45:17 UTC

[GitHub] spark pull request #21101: [SPARK-23989][SQL] exchange should copy data befo...

GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/21101

    [SPARK-23989][SQL] exchange should copy data before non-serialized shuffle

    ## What changes were proposed in this pull request?
    
    In Spark SQL, we usually reuse the `UnsafeRow` instance and need to copy the data when a place buffers non-serialized objects.
    
    Shuffle may buffer objects if we don't make it to the bypass merge shuffle or unsafe shuffle.
    
    `ShuffleExchangeExec.needToCopyObjectsBeforeShuffle` misses the case that, if `spark.sql.shuffle.partitions` is large enough, we could fail to run unsafe shuffle and go with the non-serialized shuffle.
    
    This bug is very hard to hit since users wouldn't set such a large number of partitions for Spark SQL exchange.
    
    TODO: test
    
    ## How was this patch tested?
    
    todo.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark shuffle

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21101.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21101
    
----
commit 40b2c5ca196427c1391e4abe60cb89dc36cbea77
Author: Wenchen Fan <we...@...>
Date:   2018-04-18T14:37:29Z

    SQL exchange should copy data before non-serialized shuffle

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21101: [SPARK-23989][SQL] exchange should copy data befo...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/21101


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21101: [SPARK-23989][SQL] exchange should copy data before non-...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21101
  
    **[Test build #89513 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89513/testReport)** for PR 21101 at commit [`40b2c5c`](https://github.com/apache/spark/commit/40b2c5ca196427c1391e4abe60cb89dc36cbea77).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21101: [SPARK-23989][SQL] exchange should copy data befo...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21101#discussion_r182609675
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala ---
    @@ -167,22 +164,24 @@ object ShuffleExchangeExec {
         val shuffleManager = SparkEnv.get.shuffleManager
         val sortBasedShuffleOn = shuffleManager.isInstanceOf[SortShuffleManager]
         val bypassMergeThreshold = conf.getInt("spark.shuffle.sort.bypassMergeThreshold", 200)
    +    val numParts = partitioner.numPartitions
         if (sortBasedShuffleOn) {
    -      val bypassIsSupported = SparkEnv.get.shuffleManager.isInstanceOf[SortShuffleManager]
    -      if (bypassIsSupported && partitioner.numPartitions <= bypassMergeThreshold) {
    +      if (numParts <= bypassMergeThreshold) {
             // If we're using the original SortShuffleManager and the number of output partitions is
             // sufficiently small, then Spark will fall back to the hash-based shuffle write path, which
             // doesn't buffer deserialized records.
             // Note that we'll have to remove this case if we fix SPARK-6026 and remove this bypass.
             false
    -      } else if (serializer.supportsRelocationOfSerializedObjects) {
    +      } else if (numParts <= SortShuffleManager.MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE) {
    --- End diff --
    
    I was almost going to suggest that we should we check for both conditions with an `&&` here just as future-proofing in case `serializer` was changed, but I can now see why that isn't necessary: we always use an `UnsafeRowSerializer` here now. It was only in the pre-Tungsten era that we could use either `UnsafeRowSerializer` or `SparkSqlSerializer` here.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21101: [SPARK-23989][SQL] exchange should copy data before non-...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21101
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21101: [SPARK-23989][SQL] exchange should copy data before non-...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21101
  
    cc @JoshRosen @hvanhovell @gatorsmile 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21101: [SPARK-23989][SQL] exchange should copy data before non-...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21101
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21101: [SPARK-23989][SQL] exchange should copy data before non-...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21101
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89513/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21101: [SPARK-23989][SQL] exchange should copy data before non-...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21101
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2439/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21101: [SPARK-23989][SQL] exchange should copy data before non-...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21101
  
    **[Test build #89513 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89513/testReport)** for PR 21101 at commit [`40b2c5c`](https://github.com/apache/spark/commit/40b2c5ca196427c1391e4abe60cb89dc36cbea77).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21101: [SPARK-23989][SQL] exchange should copy data before non-...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on the issue:

    https://github.com/apache/spark/pull/21101
  
    Merging to master, 2.3 and 2.2. Let me know if more further backports are needed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org