You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/05/23 23:52:57 UTC

[GitHub] [spark] huaxingao opened a new pull request, #36644: [SPARK-37523][SQL] Re-optimize partitions in Distribution and Ordering if numPartitions is not specified

huaxingao opened a new pull request, #36644:
URL: https://github.com/apache/spark/pull/36644

   
   ### What changes were proposed in this pull request?
   Re-optimize partitions in Distribution and Ordering if numPartitions is not specified.
   
   Support both strictly required distribution and best effort distribution. For best effort distribution, if user doesn't request a specific number of partitions, Spark will split skewed partitions and coalesce small partitions (`RebalancePartitions` will be used and  both `OptimizeSkewInRebalancePartitions` and `CoalesceShufflePartitions` will be applied). For strictly required distribution, if user doesn't request a specific number of partitions, Spark will coalesce small partitions but will NOT split skewed partitions(`RepartitionByExpression(distribution, query, None)` will be used and `CoalesceShufflePartitions` will be applied).
   
   
   
   ### Why are the changes needed?
   
   optimization
   
   
   ### Does this PR introduce _any_ user-facing change?
   In interface `RequiresDistributionAndOrdering`, a new API `distributionStrictlyRequired()` will be added to tell if the distribution required by this write is strictly required or best effort only
   ```
   default boolean distributionStrictlyRequired() { return true; }
   ```
   
   
   ### How was this patch tested?
   Existing and new tests
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ulysses-you commented on a diff in pull request #36644: [SPARK-37523][SQL] Re-optimize partitions in Distribution and Ordering if numPartitions is not specified

Posted by GitBox <gi...@apache.org>.
ulysses-you commented on code in PR #36644:
URL: https://github.com/apache/spark/pull/36644#discussion_r881114813


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
##########
@@ -1503,31 +1530,16 @@ case class Repartition(numPartitions: Int, shuffle: Boolean, child: LogicalPlan)
 case class RepartitionByExpression(
     partitionExpressions: Seq[Expression],
     child: LogicalPlan,
-    optNumPartitions: Option[Int]) extends RepartitionOperation {
+    optNumPartitions: Option[Int]) extends RepartitionOperation with HasPartitionExpressions {
 
   val numPartitions = optNumPartitions.getOrElse(conf.numShufflePartitions)
   require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")

Review Comment:
   shall we pull out this require to `HasPartitionExpressions` ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] aokolnychyi commented on pull request #36644: [SPARK-37523][SQL] Re-optimize partitions in Distribution and Ordering if numPartitions is not specified

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on PR #36644:
URL: https://github.com/apache/spark/pull/36644#issuecomment-1139145639

   I'd love to take a look in the evening today.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #36644: [SPARK-37523][SQL] Re-optimize partitions in Distribution and Ordering if numPartitions is not specified

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #36644:
URL: https://github.com/apache/spark/pull/36644#discussion_r882263893


##########
sql/core/src/test/scala/org/apache/spark/sql/connector/WriteDistributionAndOrderingSuite.scala:
##########
@@ -77,36 +101,47 @@ class WriteDistributionAndOrderingSuite extends DistributionAndOrderingSuiteBase
   }
 
   test("ordered distribution and sort with same exprs with numPartitions: append") {
-    checkOrderedDistributionAndSortWithSameExprs("append", Some(10))
+    checkOrderedDistributionAndSortWithSameExprs("append", Some(10), true, false, false)
   }
 
   test("ordered distribution and sort with same exprs with numPartitions: overwrite") {
-    checkOrderedDistributionAndSortWithSameExprs("overwrite", Some(10))
+    checkOrderedDistributionAndSortWithSameExprs("overwrite", Some(10), true, false, false)
   }
 
   test("ordered distribution and sort with same exprs with numPartitions: overwriteDynamic") {
-    checkOrderedDistributionAndSortWithSameExprs("overwriteDynamic", Some(10))
+    checkOrderedDistributionAndSortWithSameExprs("overwriteDynamic", Some(10), true, false, false)
   }
 
   test("ordered distribution and sort with same exprs with numPartitions: micro-batch append") {
-    checkOrderedDistributionAndSortWithSameExprs(microBatchPrefix + "append", Some(10))
+    checkOrderedDistributionAndSortWithSameExprs(microBatchPrefix + "append", Some(10),
+      true, false, false)
   }
 
   test("ordered distribution and sort with same exprs with numPartitions: micro-batch update") {
-    checkOrderedDistributionAndSortWithSameExprs(microBatchPrefix + "update", Some(10))
+    checkOrderedDistributionAndSortWithSameExprs(microBatchPrefix + "update", Some(10),
+      true, false, false)
   }
 
   test("ordered distribution and sort with same exprs with numPartitions: micro-batch complete") {
-    checkOrderedDistributionAndSortWithSameExprs(microBatchPrefix + "complete", Some(10))
+    checkOrderedDistributionAndSortWithSameExprs(microBatchPrefix + "complete", Some(10),
+      true, false, false)
   }
 
-  private def checkOrderedDistributionAndSortWithSameExprs(command: String): Unit = {
-    checkOrderedDistributionAndSortWithSameExprs(command, None)
+  private def checkOrderedDistributionAndSortWithSameExprs(

Review Comment:
   seems we can have only one `checkOrderedDistributionAndSortWithSameExprs` and provide default values for all the parameters except for `command: String`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on a diff in pull request #36644: [SPARK-37523][SQL] Re-optimize partitions in Distribution and Ordering if numPartitions is not specified

Posted by GitBox <gi...@apache.org>.
huaxingao commented on code in PR #36644:
URL: https://github.com/apache/spark/pull/36644#discussion_r880656277


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
##########
@@ -1493,6 +1493,33 @@ case class Repartition(numPartitions: Int, shuffle: Boolean, child: LogicalPlan)
     copy(child = newChild)
 }
 
+trait hasPartitionExpressions extends SQLConfHelper {

Review Comment:
   Fixed. Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] aokolnychyi commented on a diff in pull request #36644: [SPARK-37523][SQL] Re-optimize partitions in Distribution and Ordering if numPartitions is not specified

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on code in PR #36644:
URL: https://github.com/apache/spark/pull/36644#discussion_r883290220


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DistributionAndOrderingUtils.scala:
##########
@@ -41,15 +40,16 @@ object DistributionAndOrderingUtils {
       }
 
       val queryWithDistribution = if (distribution.nonEmpty) {
-        val finalNumPartitions = if (numPartitions > 0) {
-          numPartitions
-        } else {
-          conf.numShufflePartitions
-        }
+        val optNumPartitions = if (numPartitions > 0) Some(numPartitions) else None
         // the conversion to catalyst expressions above produces SortOrder expressions
         // for OrderedDistribution and generic expressions for ClusteredDistribution
-        // this allows RepartitionByExpression to pick either range or hash partitioning
-        RepartitionByExpression(distribution, query, finalNumPartitions)
+        // this allows RebalancePartitions/RepartitionByExpression to pick either
+        // range or hash partitioning
+        if (!write.distributionStrictlyRequired()) {

Review Comment:
   nit: I find it a bit easier to read when the condition is not negated so I'd flip the if statement



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on a diff in pull request #36644: [SPARK-37523][SQL] Re-optimize partitions in Distribution and Ordering if numPartitions is not specified

Posted by GitBox <gi...@apache.org>.
huaxingao commented on code in PR #36644:
URL: https://github.com/apache/spark/pull/36644#discussion_r882338312


##########
sql/core/src/test/scala/org/apache/spark/sql/connector/WriteDistributionAndOrderingSuite.scala:
##########
@@ -53,15 +56,36 @@ class WriteDistributionAndOrderingSuite extends DistributionAndOrderingSuiteBase
     .add("data", StringType)
 
   test("ordered distribution and sort with same exprs: append") {
-    checkOrderedDistributionAndSortWithSameExprs("append")
+    Seq(true, false).foreach { distributionStrictlyRequired =>
+      Seq(true, false).foreach { dataSkewed =>
+        Seq(true, false).foreach { coalesce =>

Review Comment:
   Fixed. Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on a diff in pull request #36644: [SPARK-37523][SQL] Re-optimize partitions in Distribution and Ordering if numPartitions is not specified

Posted by GitBox <gi...@apache.org>.
huaxingao commented on code in PR #36644:
URL: https://github.com/apache/spark/pull/36644#discussion_r883708937


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DistributionAndOrderingUtils.scala:
##########
@@ -41,15 +40,16 @@ object DistributionAndOrderingUtils {
       }
 
       val queryWithDistribution = if (distribution.nonEmpty) {
-        val finalNumPartitions = if (numPartitions > 0) {
-          numPartitions
-        } else {
-          conf.numShufflePartitions
-        }
+        val optNumPartitions = if (numPartitions > 0) Some(numPartitions) else None
         // the conversion to catalyst expressions above produces SortOrder expressions
         // for OrderedDistribution and generic expressions for ClusteredDistribution
-        // this allows RepartitionByExpression to pick either range or hash partitioning
-        RepartitionByExpression(distribution, query, finalNumPartitions)
+        // this allows RebalancePartitions/RepartitionByExpression to pick either
+        // range or hash partitioning
+        if (!write.distributionStrictlyRequired()) {

Review Comment:
   Changed. Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on a diff in pull request #36644: [SPARK-37523][SQL] Re-optimize partitions in Distribution and Ordering if numPartitions is not specified

Posted by GitBox <gi...@apache.org>.
huaxingao commented on code in PR #36644:
URL: https://github.com/apache/spark/pull/36644#discussion_r881239002


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
##########
@@ -1503,31 +1530,16 @@ case class Repartition(numPartitions: Int, shuffle: Boolean, child: LogicalPlan)
 case class RepartitionByExpression(
     partitionExpressions: Seq[Expression],
     child: LogicalPlan,
-    optNumPartitions: Option[Int]) extends RepartitionOperation {
+    optNumPartitions: Option[Int]) extends RepartitionOperation with HasPartitionExpressions {
 
   val numPartitions = optNumPartitions.getOrElse(conf.numShufflePartitions)
   require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")

Review Comment:
   Sounds good. Updated. Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #36644: [SPARK-37523][SQL] Re-optimize partitions in Distribution and Ordering if numPartitions is not specified

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #36644:
URL: https://github.com/apache/spark/pull/36644#discussion_r882261647


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DistributionAndOrderingUtils.scala:
##########
@@ -41,15 +40,16 @@ object DistributionAndOrderingUtils {
       }
 
       val queryWithDistribution = if (distribution.nonEmpty) {
-        val finalNumPartitions = if (numPartitions > 0) {
-          numPartitions
-        } else {
-          conf.numShufflePartitions
-        }
+        val optNumPartitions = if (numPartitions == 0) None else Some(numPartitions)

Review Comment:
   nit: I think the previous condition is safer: `if (numPartitions > 0) Some(numPartitions) else None`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on pull request #36644: [SPARK-37523][SQL] Re-optimize partitions in Distribution and Ordering if numPartitions is not specified

Posted by GitBox <gi...@apache.org>.
huaxingao commented on PR #36644:
URL: https://github.com/apache/spark/pull/36644#issuecomment-1135317504

   cc @cloud-fan @aokolnychyi @ulysses-you 
   This PR is ready for review. Thanks in advance!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on a diff in pull request #36644: [SPARK-37523][SQL] Re-optimize partitions in Distribution and Ordering if numPartitions is not specified

Posted by GitBox <gi...@apache.org>.
huaxingao commented on code in PR #36644:
URL: https://github.com/apache/spark/pull/36644#discussion_r880656630


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryTable.scala:
##########
@@ -359,24 +360,51 @@ class InMemoryTable(
         this
       }
 
-      override def build(): Write = new Write with RequiresDistributionAndOrdering {

Review Comment:
   Updated. Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #36644: [SPARK-37523][SQL] Re-optimize partitions in Distribution and Ordering if numPartitions is not specified

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #36644:
URL: https://github.com/apache/spark/pull/36644#discussion_r880564619


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/RequiresDistributionAndOrdering.java:
##########
@@ -47,6 +47,22 @@ public interface RequiresDistributionAndOrdering extends Write {
    */
   Distribution requiredDistribution();
 
+  /**
+   * Returns if the distribution required by this write is strictly required or best effort only.
+   * <p>
+   * If true, Spark will strictly distribute incoming records across partitions to satisfy

Review Comment:
   It's probably better to not mention the internal optimizations so clearly. Maybe just say
   ```
   If true, Spark will strictly distribute incoming records across partitions to satisfy
   the required distribution before passing the records to the data source table on write.
   Otherwise, Spark may apply certain optimizations to speed up the query but break
   the distribution requirement.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #36644: [SPARK-37523][SQL] Re-optimize partitions in Distribution and Ordering if numPartitions is not specified

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #36644:
URL: https://github.com/apache/spark/pull/36644#discussion_r880601737


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
##########
@@ -1493,6 +1493,33 @@ case class Repartition(numPartitions: Int, shuffle: Boolean, child: LogicalPlan)
     copy(child = newChild)
 }
 
+trait hasPartitionExpressions extends SQLConfHelper {

Review Comment:
   ```suggestion
   trait HasPartitionExpressions extends SQLConfHelper {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #36644: [SPARK-37523][SQL] Re-optimize partitions in Distribution and Ordering if numPartitions is not specified

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #36644:
URL: https://github.com/apache/spark/pull/36644#discussion_r882262535


##########
sql/core/src/test/scala/org/apache/spark/sql/connector/WriteDistributionAndOrderingSuite.scala:
##########
@@ -53,15 +56,36 @@ class WriteDistributionAndOrderingSuite extends DistributionAndOrderingSuiteBase
     .add("data", StringType)
 
   test("ordered distribution and sort with same exprs: append") {
-    checkOrderedDistributionAndSortWithSameExprs("append")
+    Seq(true, false).foreach { distributionStrictlyRequired =>
+      Seq(true, false).foreach { dataSkewed =>
+        Seq(true, false).foreach { coalesce =>

Review Comment:
   this pattern appears many times, can we deduplicate the code?
   ```
   def checkOrderedDistributionAndSortWithSameExprs(...) {
     Seq(true, false).foreach...
     ...
         checkOrderedDistributionAndSortWithSameExprs(..., distributionStrictlyRequired, dataSkewed, coalesce)
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on a diff in pull request #36644: [SPARK-37523][SQL] Re-optimize partitions in Distribution and Ordering if numPartitions is not specified

Posted by GitBox <gi...@apache.org>.
huaxingao commented on code in PR #36644:
URL: https://github.com/apache/spark/pull/36644#discussion_r880655780


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/RequiresDistributionAndOrdering.java:
##########
@@ -47,6 +47,22 @@ public interface RequiresDistributionAndOrdering extends Write {
    */
   Distribution requiredDistribution();
 
+  /**
+   * Returns if the distribution required by this write is strictly required or best effort only.
+   * <p>
+   * If true, Spark will strictly distribute incoming records across partitions to satisfy

Review Comment:
   Fixed. Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #36644: [SPARK-37523][SQL] Re-optimize partitions in Distribution and Ordering if numPartitions is not specified

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #36644:
URL: https://github.com/apache/spark/pull/36644#discussion_r882262535


##########
sql/core/src/test/scala/org/apache/spark/sql/connector/WriteDistributionAndOrderingSuite.scala:
##########
@@ -53,15 +56,36 @@ class WriteDistributionAndOrderingSuite extends DistributionAndOrderingSuiteBase
     .add("data", StringType)
 
   test("ordered distribution and sort with same exprs: append") {
-    checkOrderedDistributionAndSortWithSameExprs("append")
+    Seq(true, false).foreach { distributionStrictlyRequired =>
+      Seq(true, false).foreach { dataSkewed =>
+        Seq(true, false).foreach { coalesce =>

Review Comment:
   this pattern appears many times, can we deduplicate the code?
   ```
   def checkOrderedDistributionAndSortWithSameExprsInVariousCases(...) {
     Seq(true, false).foreach...
     ...
         checkOrderedDistributionAndSortWithSameExprs(..., distributionStrictlyRequired, dataSkewed, coalesce)
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on a diff in pull request #36644: [SPARK-37523][SQL] Re-optimize partitions in Distribution and Ordering if numPartitions is not specified

Posted by GitBox <gi...@apache.org>.
huaxingao commented on code in PR #36644:
URL: https://github.com/apache/spark/pull/36644#discussion_r882338418


##########
sql/core/src/test/scala/org/apache/spark/sql/connector/WriteDistributionAndOrderingSuite.scala:
##########
@@ -77,36 +101,47 @@ class WriteDistributionAndOrderingSuite extends DistributionAndOrderingSuiteBase
   }
 
   test("ordered distribution and sort with same exprs with numPartitions: append") {
-    checkOrderedDistributionAndSortWithSameExprs("append", Some(10))
+    checkOrderedDistributionAndSortWithSameExprs("append", Some(10), true, false, false)
   }
 
   test("ordered distribution and sort with same exprs with numPartitions: overwrite") {
-    checkOrderedDistributionAndSortWithSameExprs("overwrite", Some(10))
+    checkOrderedDistributionAndSortWithSameExprs("overwrite", Some(10), true, false, false)
   }
 
   test("ordered distribution and sort with same exprs with numPartitions: overwriteDynamic") {
-    checkOrderedDistributionAndSortWithSameExprs("overwriteDynamic", Some(10))
+    checkOrderedDistributionAndSortWithSameExprs("overwriteDynamic", Some(10), true, false, false)
   }
 
   test("ordered distribution and sort with same exprs with numPartitions: micro-batch append") {
-    checkOrderedDistributionAndSortWithSameExprs(microBatchPrefix + "append", Some(10))
+    checkOrderedDistributionAndSortWithSameExprs(microBatchPrefix + "append", Some(10),
+      true, false, false)
   }
 
   test("ordered distribution and sort with same exprs with numPartitions: micro-batch update") {
-    checkOrderedDistributionAndSortWithSameExprs(microBatchPrefix + "update", Some(10))
+    checkOrderedDistributionAndSortWithSameExprs(microBatchPrefix + "update", Some(10),
+      true, false, false)
   }
 
   test("ordered distribution and sort with same exprs with numPartitions: micro-batch complete") {
-    checkOrderedDistributionAndSortWithSameExprs(microBatchPrefix + "complete", Some(10))
+    checkOrderedDistributionAndSortWithSameExprs(microBatchPrefix + "complete", Some(10),
+      true, false, false)
   }
 
-  private def checkOrderedDistributionAndSortWithSameExprs(command: String): Unit = {
-    checkOrderedDistributionAndSortWithSameExprs(command, None)
+  private def checkOrderedDistributionAndSortWithSameExprs(

Review Comment:
   Changed. Thanks!
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on a diff in pull request #36644: [SPARK-37523][SQL] Re-optimize partitions in Distribution and Ordering if numPartitions is not specified

Posted by GitBox <gi...@apache.org>.
huaxingao commented on code in PR #36644:
URL: https://github.com/apache/spark/pull/36644#discussion_r882340555


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DistributionAndOrderingUtils.scala:
##########
@@ -41,15 +40,16 @@ object DistributionAndOrderingUtils {
       }
 
       val queryWithDistribution = if (distribution.nonEmpty) {
-        val finalNumPartitions = if (numPartitions > 0) {
-          numPartitions
-        } else {
-          conf.numShufflePartitions
-        }
+        val optNumPartitions = if (numPartitions == 0) None else Some(numPartitions)

Review Comment:
   Fixed. Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #36644: [SPARK-37523][SQL] Re-optimize partitions in Distribution and Ordering if numPartitions is not specified

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #36644:
URL: https://github.com/apache/spark/pull/36644#discussion_r880604211


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryTable.scala:
##########
@@ -359,24 +360,51 @@ class InMemoryTable(
         this
       }
 
-      override def build(): Write = new Write with RequiresDistributionAndOrdering {

Review Comment:
   do we need such a big change? It seems we can just add
   ```
   override def distributionStrictlyRequired = distributionStrictlyRequired 
   ```
   into the class



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on pull request #36644: [SPARK-37523][SQL] Re-optimize partitions in Distribution and Ordering if numPartitions is not specified

Posted by GitBox <gi...@apache.org>.
huaxingao commented on PR #36644:
URL: https://github.com/apache/spark/pull/36644#issuecomment-1141651608

   Thank you all!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan closed pull request #36644: [SPARK-37523][SQL] Re-optimize partitions in Distribution and Ordering if numPartitions is not specified

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #36644: [SPARK-37523][SQL] Re-optimize partitions in Distribution and Ordering if numPartitions is not specified
URL: https://github.com/apache/spark/pull/36644


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #36644: [SPARK-37523][SQL] Re-optimize partitions in Distribution and Ordering if numPartitions is not specified

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #36644:
URL: https://github.com/apache/spark/pull/36644#issuecomment-1140758769

   thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org