You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/05/19 18:03:04 UTC

[GitHub] [spark] huaxingao opened a new pull request, #34785: [SPARK-37523][SQL] Support optimize skewed partitions in Distribution and Ordering if numPartitions is not specified

huaxingao opened a new pull request, #34785:
URL: https://github.com/apache/spark/pull/34785

   
   ### What changes were proposed in this pull request?
   Support optimize skewed partitions in Distribution and Ordering if numPartitions is not specified
   
   ### Why are the changes needed?
   When doing repartition in distribution and sort, we will use Rebalance operator instead of RepartitionByExpression to optimize skewed partitions when
   1. numPartitions is not specified by the data source, and
   2. sortOrder is specified. This is because the requested distribution needs to be guaranteed, which can only be achieved by using RangePartitioning, not HashPartitioning.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Existing and new tests
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao closed pull request #34785: [SPARK-37523][SQL] Support optimize skewed partitions in Distribution and Ordering if numPartitions is not specified

Posted by GitBox <gi...@apache.org>.
huaxingao closed pull request #34785: [SPARK-37523][SQL] Support optimize skewed partitions in Distribution and Ordering if numPartitions is not specified
URL: https://github.com/apache/spark/pull/34785


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ulysses-you commented on pull request #34785: [SPARK-37523][SQL] Support optimize skewed partitions in Distribution and Ordering if numPartitions is not specified

Posted by GitBox <gi...@apache.org>.
ulysses-you commented on PR #34785:
URL: https://github.com/apache/spark/pull/34785#issuecomment-1132397474

   Looks correct to me. BTW, after Spark3.3 the RebalancePartitions supports specify the initialNumPartition, so the demo code can be:
   ```scala
   val optNumPartitions = if (numPartitions == 0) None else Some(numPartitions)
   if (!write.distributionStrictlyRequired()) {
     RebalancePartitions(distribution, query, optNumPartitions)
   } else {
     RepartitionByExpression(distribution, query, optNumPartitions)
   }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] commented on pull request #34785: [SPARK-37523][SQL] Support optimize skewed partitions in Distribution and Ordering if numPartitions is not specified

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #34785:
URL: https://github.com/apache/spark/pull/34785#issuecomment-1094146675

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on pull request #34785: [SPARK-37523][SQL] Support optimize skewed partitions in Distribution and Ordering if numPartitions is not specified

Posted by GitBox <gi...@apache.org>.
huaxingao commented on PR #34785:
URL: https://github.com/apache/spark/pull/34785#issuecomment-1135252175

   It's a bit too hard to rebase. I will close this PR and the new one is here https://github.com/apache/spark/pull/36644. Thank you all very much for reviewing!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] aokolnychyi commented on pull request #34785: [SPARK-37523][SQL] Support optimize skewed partitions in Distribution and Ordering if numPartitions is not specified

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on PR #34785:
URL: https://github.com/apache/spark/pull/34785#issuecomment-1132014116

   Thanks for the PR, @huaxingao. I think it is a great feature and it would be awesome to get it done.
   
   I spent some time thinking about this and have a few questions/proposals.
   
   If I understand correctly, we currently hard-code the number of shuffle partitions in `RepartitionByExpression`, which prohibits both coalescing and skew split optimizations.
   
   It seems reasonable to support cases when the requested distribution is best-effort but I also think there are valid cases when the distribution is required for correctness and it is actually the current API contract. What about extending `RequiredDistributionAndOrdering` to indicate the distribution is not strictly required? We can add some boolean method and default it to keep the existing behavior. If the distribution is required, we can still benefit from coalescing as I think `CoalesceShufflePartitions` and `AQEShuffleReadExec` would keep the original distribution in coalesce cases. That’s already a huge win. We can avoid too small files while keeping the requested distribution.
   
   I also agree about using `RebalancePartitions` when the distribution is not strictly required. What about extending `RebalancePartitions` to also support range partitioning? It currently supports only hash and round-robin. If we make that change, we will be able to remove unnecessary shuffles in the optimizer and keep the original distribution as long as there is no skew and we only coalesce. If there is a skew, an extra shuffle and changed distribution seems like a reasonable overhead.
   
   What does everybody else think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on pull request #34785: [SPARK-37523][SQL] Support optimize skewed partitions in Distribution and Ordering if numPartitions is not specified

Posted by GitBox <gi...@apache.org>.
huaxingao commented on PR #34785:
URL: https://github.com/apache/spark/pull/34785#issuecomment-1132364552

   cc @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] closed pull request #34785: [SPARK-37523][SQL] Support optimize skewed partitions in Distribution and Ordering if numPartitions is not specified

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #34785: [SPARK-37523][SQL] Support optimize skewed partitions in Distribution and Ordering if numPartitions is not specified
URL: https://github.com/apache/spark/pull/34785


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on pull request #34785: [SPARK-37523][SQL] Support optimize skewed partitions in Distribution and Ordering if numPartitions is not specified

Posted by GitBox <gi...@apache.org>.
huaxingao commented on PR #34785:
URL: https://github.com/apache/spark/pull/34785#issuecomment-1132363307

   Thanks @aokolnychyi for the proposal. I agree that we should support both strictly required distribution and best effort distribution. For best effort distribution, if user doesn't request a specific number of partitions, we will split skewed partitions and coalesce small partitions. For strictly required distribution, if user doesn't request a specific number of partitions, we will coalesce small partitions but we will NOT split skewed partitions since this changes the required distribution.
   
   In interface `RequiresDistributionAndOrdering`, I will add
   ```
   default boolean distributionStrictlyRequired() { return true; }
   ```
   Then in `DistributionAndOrderingUtils`.`prepareQuery`, I will change the code to something like this:
   ```      
         val queryWithDistribution = if (distribution.nonEmpty) {
           if (!write.distributionStrictlyRequired() && numPartitions == 0) {
             RebalancePartitions(distribution, query)
           } else {
             if (numPartitions > 0) {
               RepartitionByExpression(distribution, query, numPartitions)
             } else {
               RepartitionByExpression(distribution, query, None)
             }
           }
           ...
   ``` 
   Basically, in the best effort case, if the requested numPartitions is 0, we will use `RebalancePartitions` so both `OptimizeSkewInRebalancePartitions` and `CoalesceShufflePartitions` will be applied. In the strictly required distribution case,  if the requested numPartitions is 0, we will use `RepartitionByExpression(distribution, query, None)` so `CoalesceShufflePartitions` will be applied. 
   
   Does this sound correct for every one?
   
   
   
   
   
   
   
   
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] closed pull request #34785: [SPARK-37523][SQL] Support optimize skewed partitions in Distribution and Ordering if numPartitions is not specified

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #34785: [SPARK-37523][SQL] Support optimize skewed partitions in Distribution and Ordering if numPartitions is not specified
URL: https://github.com/apache/spark/pull/34785


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org