You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by cloud-fan <gi...@git.apache.org> on 2018/04/18 14:45:17 UTC
[GitHub] spark pull request #21101: [SPARK-23989][SQL] exchange should copy data befo...
GitHub user cloud-fan opened a pull request:
https://github.com/apache/spark/pull/21101
[SPARK-23989][SQL] exchange should copy data before non-serialized shuffle
## What changes were proposed in this pull request?
In Spark SQL, we usually reuse the `UnsafeRow` instance and need to copy the data when a place buffers non-serialized objects.
Shuffle may buffer objects if we don't make it to the bypass merge shuffle or unsafe shuffle.
`ShuffleExchangeExec.needToCopyObjectsBeforeShuffle` misses the case that, if `spark.sql.shuffle.partitions` is large enough, we could fail to run unsafe shuffle and go with the non-serialized shuffle.
This bug is very hard to hit since users wouldn't set such a large number of partitions for Spark SQL exchange.
TODO: test
## How was this patch tested?
todo.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/cloud-fan/spark shuffle
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21101.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21101
----
commit 40b2c5ca196427c1391e4abe60cb89dc36cbea77
Author: Wenchen Fan <we...@...>
Date: 2018-04-18T14:37:29Z
SQL exchange should copy data before non-serialized shuffle
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21101: [SPARK-23989][SQL] exchange should copy data befo...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/21101
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21101: [SPARK-23989][SQL] exchange should copy data before non-...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21101
**[Test build #89513 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89513/testReport)** for PR 21101 at commit [`40b2c5c`](https://github.com/apache/spark/commit/40b2c5ca196427c1391e4abe60cb89dc36cbea77).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21101: [SPARK-23989][SQL] exchange should copy data befo...
Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:
https://github.com/apache/spark/pull/21101#discussion_r182609675
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala ---
@@ -167,22 +164,24 @@ object ShuffleExchangeExec {
val shuffleManager = SparkEnv.get.shuffleManager
val sortBasedShuffleOn = shuffleManager.isInstanceOf[SortShuffleManager]
val bypassMergeThreshold = conf.getInt("spark.shuffle.sort.bypassMergeThreshold", 200)
+ val numParts = partitioner.numPartitions
if (sortBasedShuffleOn) {
- val bypassIsSupported = SparkEnv.get.shuffleManager.isInstanceOf[SortShuffleManager]
- if (bypassIsSupported && partitioner.numPartitions <= bypassMergeThreshold) {
+ if (numParts <= bypassMergeThreshold) {
// If we're using the original SortShuffleManager and the number of output partitions is
// sufficiently small, then Spark will fall back to the hash-based shuffle write path, which
// doesn't buffer deserialized records.
// Note that we'll have to remove this case if we fix SPARK-6026 and remove this bypass.
false
- } else if (serializer.supportsRelocationOfSerializedObjects) {
+ } else if (numParts <= SortShuffleManager.MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE) {
--- End diff --
I was almost going to suggest that we should we check for both conditions with an `&&` here just as future-proofing in case `serializer` was changed, but I can now see why that isn't necessary: we always use an `UnsafeRowSerializer` here now. It was only in the pre-Tungsten era that we could use either `UnsafeRowSerializer` or `SparkSqlSerializer` here.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21101: [SPARK-23989][SQL] exchange should copy data before non-...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21101
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21101: [SPARK-23989][SQL] exchange should copy data before non-...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/21101
cc @JoshRosen @hvanhovell @gatorsmile
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21101: [SPARK-23989][SQL] exchange should copy data before non-...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21101
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21101: [SPARK-23989][SQL] exchange should copy data before non-...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21101
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89513/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21101: [SPARK-23989][SQL] exchange should copy data before non-...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21101
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2439/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21101: [SPARK-23989][SQL] exchange should copy data before non-...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21101
**[Test build #89513 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89513/testReport)** for PR 21101 at commit [`40b2c5c`](https://github.com/apache/spark/commit/40b2c5ca196427c1391e4abe60cb89dc36cbea77).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21101: [SPARK-23989][SQL] exchange should copy data before non-...
Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on the issue:
https://github.com/apache/spark/pull/21101
Merging to master, 2.3 and 2.2. Let me know if more further backports are needed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org