
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/05/20 01:49:08 UTC

[GitHub] [spark] huaxingao commented on pull request #34785: [SPARK-37523][SQL] Support optimize skewed partitions in Distribution and Ordering if numPartitions is not specified

huaxingao commented on PR #34785:
URL: https://github.com/apache/spark/pull/34785#issuecomment-1132363307

   Thanks @aokolnychyi for the proposal. I agree that we should support both strictly required distribution and best-effort distribution. For best-effort distribution, if the user doesn't request a specific number of partitions, we will split skewed partitions and coalesce small partitions. For strictly required distribution, if the user doesn't request a specific number of partitions, we will coalesce small partitions but will NOT split skewed partitions, since splitting would change the required distribution.
   
   In the `RequiresDistributionAndOrdering` interface, I will add:
   ```
   default boolean distributionStrictlyRequired() { return true; }
   ```
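   For illustration, here is a minimal self-contained sketch of how a write could opt in to best-effort distribution. The interface below is only a stand-in for the real `RequiresDistributionAndOrdering` (its other methods are omitted), and `LegacyWrite` and `BestEffortWrite` are hypothetical names. Because the default returns `true`, existing implementations would keep the strict behavior unless they override the method.

   ```java
   final class StrictnessDemo {
       // Stand-in for the proposed addition; the real interface declares more methods.
       interface RequiresDistributionAndOrdering {
           default boolean distributionStrictlyRequired() { return true; }
       }

       // Hypothetical existing write: inherits the strict default unchanged.
       static class LegacyWrite implements RequiresDistributionAndOrdering { }

       // Hypothetical write that tolerates having skewed partitions split.
       static class BestEffortWrite implements RequiresDistributionAndOrdering {
           @Override
           public boolean distributionStrictlyRequired() { return false; }
       }

       public static void main(String[] args) {
           System.out.println("legacy strict: " + new LegacyWrite().distributionStrictlyRequired());
           System.out.println("best-effort strict: " + new BestEffortWrite().distributionStrictlyRequired());
       }
   }
   ```
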
   Then in `DistributionAndOrderingUtils.prepareQuery`, I will change the code to something like this:
   ```
   val queryWithDistribution = if (distribution.nonEmpty) {
     if (!write.distributionStrictlyRequired() && numPartitions == 0) {
       RebalancePartitions(distribution, query)
     } else {
       if (numPartitions > 0) {
         RepartitionByExpression(distribution, query, numPartitions)
       } else {
         RepartitionByExpression(distribution, query, None)
       }
     }
     ...
   ```
   Basically, in the best-effort case, if the requested `numPartitions` is 0 (that is, no specific number was requested), we will use `RebalancePartitions`, so both `OptimizeSkewInRebalancePartitions` and `CoalesceShufflePartitions` will be applied. In the strictly required case, if the requested `numPartitions` is 0, we will use `RepartitionByExpression(distribution, query, None)`, so only `CoalesceShufflePartitions` will be applied.
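   
   To make the four combinations explicit, here is a rough self-contained model of that branch structure. It does not use Spark classes: the `Plan` enum and the `choose` method are hypothetical names that only label which node each branch would produce, and it assumes `numPartitions` is never negative.

   ```java
   final class PrepareQuerySketch {
       // Hypothetical labels for the plan node each branch would build.
       enum Plan { REBALANCE_PARTITIONS, REPARTITION_FIXED, REPARTITION_ADAPTIVE }

       // Mirrors the branch structure of the proposed prepareQuery change.
       static Plan choose(boolean strictlyRequired, int numPartitions) {
           if (!strictlyRequired && numPartitions == 0) {
               // RebalancePartitions: skewed partitions may be split, small ones coalesced.
               return Plan.REBALANCE_PARTITIONS;
           } else if (numPartitions > 0) {
               // RepartitionByExpression(distribution, query, numPartitions)
               return Plan.REPARTITION_FIXED;
           } else {
               // RepartitionByExpression(distribution, query, None): coalescing only.
               return Plan.REPARTITION_ADAPTIVE;
           }
       }

       public static void main(String[] args) {
           for (boolean strict : new boolean[] {true, false}) {
               for (int n : new int[] {0, 8}) {
                   System.out.println("strict=" + strict + ", numPartitions=" + n
                       + " -> " + choose(strict, n));
               }
           }
       }
   }
   ```

   Note that under this sketch an explicit `numPartitions` takes precedence in both modes, so a best-effort write that requests a specific number of partitions still gets a fixed repartition.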
   
   Does this sound correct to everyone?

