You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/03/09 07:15:59 UTC

[GitHub] [spark] cloud-fan commented on a change in pull request #27833: [SPARK-31070][SQL] make skew join split skewed partitions more evenly

cloud-fan commented on a change in pull request #27833: [SPARK-31070][SQL] make skew join split skewed partitions more evenly
URL: https://github.com/apache/spark/pull/27833#discussion_r389492481
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala
 ##########
 @@ -115,18 +115,32 @@ case class OptimizeSkewedJoin(conf: SQLConf) extends Rule[SparkPlan] {
     partitionStartIndices += 0
     var i = 0
     var postMapPartitionSize = 0L
+    var lastPackagedPartitionSize = -1L
     while (i < mapPartitionSizes.length) {
       val nextMapPartitionSize = mapPartitionSizes(i)
       if (i > 0 && postMapPartitionSize + nextMapPartitionSize > targetSize) {
         partitionStartIndices += i
+        lastPackagedPartitionSize = postMapPartitionSize
         postMapPartitionSize = nextMapPartitionSize
       } else {
         postMapPartitionSize += nextMapPartitionSize
       }
       i += 1
     }
 
-    partitionStartIndices.toArray
+    if (lastPackagedPartitionSize > -1) {
+      val lastPartitionDiff = math.abs(targetSize - postMapPartitionSize)
+      val diffIfMergeLastPartition = math.abs(
+        lastPackagedPartitionSize + postMapPartitionSize - targetSize)
+      // If the last partition is very small, we should merge it to the previous partition.
+      if (lastPartitionDiff > diffIfMergeLastPartition * 2) {
 
 Review comment:
   Good discussion here! I'm thinking about more cases:
   ```
   targetSize = 7
   lastButOneSize = 1
   lastSize = 7
   ```
   shall we merge the last partition into the previous partition?
   
   Maybe we can use a simple heuristic: merge 2 adjacent partitions if they don't exceed the target size too much (say 20%?). And we can apply it to partitions in the middle, not have to be the last 2 partitions.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org