You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/01/27 07:32:55 UTC

[GitHub] [spark] HeartSaVioR opened a new pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

HeartSaVioR opened a new pull request #31355:
URL: https://github.com/apache/spark/pull/31355


   ### What changes were proposed in this pull request?
   
   This PR proposes to extend the functionality of requirement for distribution and ordering on V2 write to specify the number of partitioning on repartition, so that data source is able to control the parallelism and determine the data distribution per partition in prior.
   
   The partitioning with static number is optional, and by default disabled via default method, so only implementations required to restrict the number of partition statically need to override the method and provide the number.
   
   ### Why are the changes needed?
   
   The use case comes from feature parity with DSv1.
   
   I have state data source which enables the state in SS to be rewritten, which enables repartitioning, schema evolution, etc via batch query. The writer requires hash partitioning against group key, with the "desired number of partitions", which is same as what Spark does read and write against state.
   
   This is now implemented as DSv1, and the requirement is simply done by calling repartition with the "desired number".
   
   ```
   val fullPathsForKeyColumns = keySchema.map(key => new Column(s"key.${key.name}"))
   data
     .repartition(newPartitions, fullPathsForKeyColumns: _*)
     .queryExecution
     .toRdd
     .foreachPartition(
       writeFn(resolvedCpLocation, version, operatorId, storeName, keySchema, valueSchema,
         storeConf, hadoopConfBroadcast, queryId))
   ```
   
   Thanks to SPARK-34026, it's now possible to require the hash partitioning, but still not able to require the number of partitions. This PR will enable to let data source require the number of partitions.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, but only for data source implementors. Even for them, this is no breaking change as default method is added.
   
   ### How was this patch tested?
   
   Added UTs.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768801016


   > AQE won't kick in if users specify num partitions, e.g. df.repartition(5), I think the same applies here if the sink requires a certain num partitions.
   
   The case I imagine is like this, `df.repartition(5).write.(blabla).save()`, with the condition sink requires a specific distribution but doesn't require specific number of partitions. Now Spark would repartition to default number of shuffle partitions, and I'm unsure changing it to None (SPARK-34230) would keep the user intention.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768816190


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39176/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r576407822



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/distributions/ClusteredDistribution.java
##########
@@ -32,4 +32,13 @@
    * Returns clustering expressions.
    */
   Expression[] clustering();
+
+  /**
+   * Returns the number of partitions required by this write.
+   * <p>
+   * Implementations may want to override this if it requires the specific number of partitions.
+   *
+   * @return the required number of partitions, non-positive values mean no requirement.
+   */
+  default int requiredNumPartitions() { return 0; }

Review comment:
       Yeah I thought this a bit more, and agree restricting the "parallelism" is a valid use case regardless of distribution.
   
   "coalesce" vs "repartition" remains the question; when distribution is specified I expect "repartition" will take effect, so probably "repartition" is more consistent. And what I observed from troubleshooting is that coalesce could bring an edge-case if used blindly, like when sink places with the same stage shuffle was done and takes whole stage quite slow.
   (That was `coalesce(1)` and we simply fixed the problem via changing it to `repartition(1)`.)
   
   I'd feel safer if writer requirement just affects the writer operation and doesn't affect other operations. What do you think?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r601572006



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/connector/WriteDistributionAndOrderingSuite.scala
##########
@@ -569,4 +785,26 @@ class WriteDistributionAndOrderingSuite
         fail("expected V2TableWriteExec")
     }
   }
+
+  private def orderedWritePartitioning(
+      writeOrdering: Seq[catalyst.expressions.SortOrder],
+      targetNumPartitions: Option[Int]): physical.Partitioning = {
+    targetNumPartitions match {

Review comment:
       it's simply `RangePartitioning(writeOrdering, targetNumPartitions.getOrElse(conf.numShufflePartitions))`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768099637


   **[Test build #134539 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134539/testReport)** for PR 31355 at commit [`84108e4`](https://github.com/apache/spark/commit/84108e41cf89e48641f4c4f0cf573a96b0e7adab).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sunchao commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r565769723



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/distributions/OrderedDistribution.java
##########
@@ -32,4 +32,13 @@
    * Returns ordering expressions.
    */
   SortOrder[] ordering();
+
+  /**
+   * Returns the number of partitions required by this write.

Review comment:
       I'm not sure if `required by this write` is accurate since we could potentially reuse this interface for read as well in future?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-779577562


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39744/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768622003


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39156/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r585270435



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DistributionAndOrderingUtils.scala
##########
@@ -32,23 +32,34 @@ object DistributionAndOrderingUtils {
     case write: RequiresDistributionAndOrdering =>
       val resolver = conf.resolver
 
-      val distribution = write.requiredDistribution match {
+      val numParts = write.requiredNumPartitions()
+      val (distribution, numPartitions) = write.requiredDistribution match {
         case d: OrderedDistribution =>
-          d.ordering.map(e => toCatalyst(e, query, resolver))
+          val dist = d.ordering.map(e => toCatalyst(e, query, resolver))
+          (dist, numParts)
         case d: ClusteredDistribution =>
-          d.clustering.map(e => toCatalyst(e, query, resolver))
+          val dist = d.clustering.map(e => toCatalyst(e, query, resolver))
+          (dist, numParts)
         case _: UnspecifiedDistribution =>
-          Array.empty[Expression]
+          (Array.empty[Expression], numParts)
       }
 
       val queryWithDistribution = if (distribution.nonEmpty) {
-        val numShufflePartitions = conf.numShufflePartitions
+        val finalNumPartitions = if (numPartitions > 0) {
+          numPartitions
+        } else {
+          conf.numShufflePartitions
+        }
         // the conversion to catalyst expressions above produces SortOrder expressions
         // for OrderedDistribution and generic expressions for ClusteredDistribution
         // this allows RepartitionByExpression to pick either range or hash partitioning
-        RepartitionByExpression(distribution, query, numShufflePartitions)
+        RepartitionByExpression(distribution, query, finalNumPartitions)
       } else {
-        query
+        if (numPartitions > 0) {
+          Repartition(numPartitions, shuffle = true, query)

Review comment:
       @aokolnychyi kindly reminder.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-806666996


   **[Test build #136508 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136508/testReport)** for PR 31355 at commit [`e8fb6f1`](https://github.com/apache/spark/commit/e8fb6f18215ef43083cc2922328f90535380969c).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-799000237


   Kindly reminder @rdblue - looks like the static number of partitions for non specific distribution is the only thing we need to sort out, and it's not a one of initial goals of the PR. Unless we are very clear this has to be addressed altogether, I'd say this is beyond the PR. 
   
   Do you agree, or you have someone in real world cases and confirm static number of partitions work great for such cases? (Even hypothetically thinking on such cases, upper/lower bound makes more sense to me, which is different story.)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-809437969


   Thank you, @HeartSaVioR , @cloud-fan and all! :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-780132964


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39763/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-771826964






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-808623450


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41165/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-808632614


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41165/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768870502


   **[Test build #134601 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134601/testReport)** for PR 31355 at commit [`8b0eb56`](https://github.com/apache/spark/commit/8b0eb560b0ce71adf4c22dc4cf94dc2ac47d2856).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rdblue commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r576340131



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/distributions/ClusteredDistribution.java
##########
@@ -32,4 +32,13 @@
    * Returns clustering expressions.
    */
   Expression[] clustering();
+
+  /**
+   * Returns the number of partitions required by this write.
+   * <p>
+   * Implementations may want to override this if it requires the specific number of partitions.
+   *
+   * @return the required number of partitions, non-positive values mean no requirement.
+   */
+  default int requiredNumPartitions() { return 0; }

Review comment:
       @HeartSaVioR, I think this is part of the use case. If this interface can require some number of incoming partitions, then it should do that in all cases. It makes the behavior of this interface harder to understand if this is not enforced for certain values returned by the other methods.
   
   In addition, this interface is less useful if it can't be used to control the incoming parallelism without also changing the data. I think a coalesce makes a lot more sense than ignoring this.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-808614930


   **[Test build #136581 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136581/testReport)** for PR 31355 at commit [`00fb5fb`](https://github.com/apache/spark/commit/00fb5fb18644f4f3843349c7d66a3649a9a47649).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-779579371


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39744/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-806221984


   **[Test build #136482 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136482/testReport)** for PR 31355 at commit [`e8fb6f1`](https://github.com/apache/spark/commit/e8fb6f18215ef43083cc2922328f90535380969c).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-806660809


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41100/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-806861772


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136515/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-809429366


   thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-780240325


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135182/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768597841


   **[Test build #134571 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134571/testReport)** for PR 31355 at commit [`5d56dc0`](https://github.com/apache/spark/commit/5d56dc0b81a89e22f713d7160c04a5b9704bee29).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r662218289



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
##########
@@ -1097,4 +1097,9 @@ private[spark] object QueryCompilationErrors {
       s"and column $colName cannot be resolved. Expected $expectedNum columns named $colName but " +
       s"got ${actualCols.map(_.name).mkString("[", ",", "]")}")
   }
+
+  def numberOfPartitionsNotAllowedWithUnspecifiedDistributionError(): Throwable = {
+    throw new AnalysisException("The number of partitions can't be specified with unspecified" +

Review comment:
       Nice finding. Would you mind to submit a MINOR PR fixing this? I can also work on this as well.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-808663688


   **[Test build #136581 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136581/testReport)** for PR 31355 at commit [`00fb5fb`](https://github.com/apache/spark/commit/00fb5fb18644f4f3843349c7d66a3649a9a47649).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768780252


   Probably the discussion would be more constructive/productive if each idea may bring the explanation with the actual case, which storage is expected to get benefit on that, how the idea can boost the performance or help avoiding too many concurrent requests on the fly. After that we could determine "who" should decide the parallelism and "who" should provide the information for hints.
   (e.g. I wouldn't think higher parallelism always gives the best performance where storage only is ready for lower parallelism.)
   
   My feeling is that these things are all about "optimizations" whereas the use case I described is about "requirement", so if we agree with the necessity of latter, former looks to be beyond of topic and we'd better discussing in other thread. (Either mailing list or sketched WIP PR if someone has some ideas.)
   
   Btw, one thing might need to revisit would be, if I understand correctly, once the data source requires distribution/ordering, end users lose the possibility of giving hint (based on heuristic, knowing the output data source and the characteristic of output data roughly) on the number of output partitions. Would it be addressed when we replace using default number of shuffle partitions with None, or AQE may disregard it?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-806858316


   **[Test build #136515 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136515/testReport)** for PR 31355 at commit [`3e75641`](https://github.com/apache/spark/commit/3e7564173664482a503f7f43e37d3be71f590106).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-806660742


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41100/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sunchao commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r602555286



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
##########
@@ -1097,4 +1097,9 @@ private[spark] object QueryCompilationErrors {
       s"and column $colName cannot be resolved. Expected $expectedNum columns named $colName but " +
       s"got ${actualCols.map(_.name).mkString("[", ",", "]")}")
   }
+
+  def numberOfPartitionsNotAllowedWithUnspecifiedDistributionError(): Throwable = {
+    throw new AnalysisException("The number of partitions can't be specified with unspecified" +
+      " distribution. Invalid Sink requirements detected.")

Review comment:
       nit: why "Invalid Sink requirements detected"? is this error specific to streaming?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-770376985


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134687/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768866810


   **[Test build #134603 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134603/testReport)** for PR 31355 at commit [`60ebadf`](https://github.com/apache/spark/commit/60ebadf0bf26efb7c1579073c5d3e235c23cfa24).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r565712581



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/connector/distributions/distributions.scala
##########
@@ -39,21 +48,29 @@ private[sql] object UnspecifiedDistributionImpl extends UnspecifiedDistribution
 }
 
 private[sql] final case class ClusteredDistributionImpl(
-    clusteringExprs: Seq[Expression]) extends ClusteredDistribution {
+    clusteringExprs: Seq[Expression],
+    numPartitions: Int) extends ClusteredDistribution {
 
   override def clustering: Array[Expression] = clusteringExprs.toArray
 
+  override def requiredNumPartitions(): Int = numPartitions
+
   override def toString: String = {
-    s"ClusteredDistribution(${clusteringExprs.map(_.describe).mkString(", ")})"
+    s"ClusteredDistribution(${clusteringExprs.map(_.describe).mkString(", ")}" +
+      s" / numPartitions=$numPartitions)"

Review comment:
       Seeing "numPartitions=0" is a bit weird to me. Shall we hide numPartitions info if no requirement?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r565581188



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/connector/distributions/distributions.scala
##########
@@ -29,31 +29,47 @@ private[sql] object LogicalDistributions {
     ClusteredDistributionImpl(clustering)
   }
 
+  def clustered(clustering: Array[Expression], numPartitions: Int): ClusteredDistribution = {
+    ClusteredDistributionImpl(clustering, Some(numPartitions))
+  }
+
   def ordered(ordering: Array[SortOrder]): OrderedDistribution = {
     OrderedDistributionImpl(ordering)
   }
+
+  def ordered(ordering: Array[SortOrder], numPartitions: Int): OrderedDistribution = {
+    OrderedDistributionImpl(ordering, Some(numPartitions))
+  }
 }
 
 private[sql] object UnspecifiedDistributionImpl extends UnspecifiedDistribution {
   override def toString: String = "UnspecifiedDistribution"
 }
 
 private[sql] final case class ClusteredDistributionImpl(
-    clusteringExprs: Seq[Expression]) extends ClusteredDistribution {
+    clusteringExprs: Seq[Expression],
+    numPartitions: Option[Int] = None) extends ClusteredDistribution {

Review comment:
       I thought None is clearer than non-negative value in toString form but no big deal once we define the condition.
   Actually I wanted to define the return type of requiredNumPartitions() as Optional for the same reason, but didn't do it as I was already concerning about language interop.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r565709706



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DistributionAndOrderingUtils.scala
##########
@@ -32,21 +32,29 @@ object DistributionAndOrderingUtils {
     case write: RequiresDistributionAndOrdering =>
       val resolver = conf.resolver
 
-      val distribution = write.requiredDistribution match {
+      val (distribution, numPartitions) = write.requiredDistribution match {
         case d: OrderedDistribution =>
-          d.ordering.map(e => toCatalyst(e, query, resolver))
+          val dist = d.ordering.map(e => toCatalyst(e, query, resolver))
+          val numParts = d.requiredNumPartitions()
+          (dist, numParts)
         case d: ClusteredDistribution =>
-          d.clustering.map(e => toCatalyst(e, query, resolver))
+          val dist = d.clustering.map(e => toCatalyst(e, query, resolver))
+          val numParts = d.requiredNumPartitions()
+          (dist, numParts)
         case _: UnspecifiedDistribution =>
-          Array.empty[Expression]
+          (Array.empty[Expression], 0)
       }
 
       val queryWithDistribution = if (distribution.nonEmpty) {
-        val numShufflePartitions = conf.numShufflePartitions
+        val finalNumPartitions = if (numPartitions > 0) {

Review comment:
       ~Shall we keep the original variable name?~ Never mind. I was a little confused at `final` wording, but it looks reasonable.
   ```scala
   - val finalNumPartitions = if (numPartitions > 0) {
   + val numShufflePartitions = if (numPartitions > 0) {
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768753424


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134571/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-771852803


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39379/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-808210633


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136564/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-804645450


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40964/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-806726487


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136508/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768802108


   It should keep the user intention (may need to fix some bugs). But this seems unrelated to this PR.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dgd-contributor commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
dgd-contributor commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r661107708



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
##########
@@ -1097,4 +1097,9 @@ private[spark] object QueryCompilationErrors {
       s"and column $colName cannot be resolved. Expected $expectedNum columns named $colName but " +
       s"got ${actualCols.map(_.name).mkString("[", ",", "]")}")
   }
+
+  def numberOfPartitionsNotAllowedWithUnspecifiedDistributionError(): Throwable = {
+    throw new AnalysisException("The number of partitions can't be specified with unspecified" +

Review comment:
       this shouldn't be throw here




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768753424


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134571/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r565370445



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/distributions/OrderedDistribution.java
##########
@@ -32,4 +32,13 @@
    * Returns ordering expressions.
    */
   SortOrder[] ordering();
+
+  /**
+   * Returns the number of partitions required by this write.
+   * <p>
+   * Implementations may want to override this if it requires the specific number of partitions.
+   *
+   * @return the required number of partitions, 0 as no requirement

Review comment:
       ditto




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-780117441


   **[Test build #135182 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135182/testReport)** for PR 31355 at commit [`6807926`](https://github.com/apache/spark/commit/6807926187ec787f8d95d4bb205bf288da18547d).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rdblue commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r577006463



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DistributionAndOrderingUtils.scala
##########
@@ -32,23 +32,34 @@ object DistributionAndOrderingUtils {
     case write: RequiresDistributionAndOrdering =>
       val resolver = conf.resolver
 
-      val distribution = write.requiredDistribution match {
+      val numParts = write.requiredNumPartitions()
+      val (distribution, numPartitions) = write.requiredDistribution match {
         case d: OrderedDistribution =>
-          d.ordering.map(e => toCatalyst(e, query, resolver))
+          val dist = d.ordering.map(e => toCatalyst(e, query, resolver))
+          (dist, numParts)
         case d: ClusteredDistribution =>
-          d.clustering.map(e => toCatalyst(e, query, resolver))
+          val dist = d.clustering.map(e => toCatalyst(e, query, resolver))
+          (dist, numParts)
         case _: UnspecifiedDistribution =>
-          Array.empty[Expression]
+          (Array.empty[Expression], numParts)
       }
 
       val queryWithDistribution = if (distribution.nonEmpty) {
-        val numShufflePartitions = conf.numShufflePartitions
+        val finalNumPartitions = if (numPartitions > 0) {
+          numPartitions
+        } else {
+          conf.numShufflePartitions
+        }
         // the conversion to catalyst expressions above produces SortOrder expressions
         // for OrderedDistribution and generic expressions for ClusteredDistribution
         // this allows RepartitionByExpression to pick either range or hash partitioning
-        RepartitionByExpression(distribution, query, numShufflePartitions)
+        RepartitionByExpression(distribution, query, finalNumPartitions)
       } else {
-        query
+        if (numPartitions > 0) {

Review comment:
       You could avoid changing a few more lines by using `else if`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r601569282



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DistributionAndOrderingUtils.scala
##########
@@ -32,21 +32,26 @@ object DistributionAndOrderingUtils {
     case write: RequiresDistributionAndOrdering =>
       val resolver = conf.resolver
 
+      val numPartitions = write.requiredNumPartitions()
       val distribution = write.requiredDistribution match {
-        case d: OrderedDistribution =>
-          d.ordering.map(e => toCatalyst(e, query, resolver))
-        case d: ClusteredDistribution =>
-          d.clustering.map(e => toCatalyst(e, query, resolver))
-        case _: UnspecifiedDistribution =>
-          Array.empty[Expression]
+        case d: OrderedDistribution => d.ordering.map(e => toCatalyst(e, query, resolver))
+        case d: ClusteredDistribution => d.clustering.map(e => toCatalyst(e, query, resolver))
+        case _: UnspecifiedDistribution => Array.empty[Expression]
       }
 
       val queryWithDistribution = if (distribution.nonEmpty) {
-        val numShufflePartitions = conf.numShufflePartitions
+        val finalNumPartitions = if (numPartitions > 0) {
+          numPartitions
+        } else {
+          conf.numShufflePartitions
+        }
         // the conversion to catalyst expressions above produces SortOrder expressions
         // for OrderedDistribution and generic expressions for ClusteredDistribution
         // this allows RepartitionByExpression to pick either range or hash partitioning
-        RepartitionByExpression(distribution, query, numShufflePartitions)
+        RepartitionByExpression(distribution, query, finalNumPartitions)
+      } else if (numPartitions > 0) {
+        throw new AnalysisException("The number of partitions can't be specified with unspecified" +

Review comment:
       let's move it to `QueryCompilationErrors.scala`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768958759






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-806492370


   **[Test build #136515 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136515/testReport)** for PR 31355 at commit [`3e75641`](https://github.com/apache/spark/commit/3e7564173664482a503f7f43e37d3be71f590106).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768822993


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39176/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768226110


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134539/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-771852760


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39379/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r576407822



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/distributions/ClusteredDistribution.java
##########
@@ -32,4 +32,13 @@
    * Returns clustering expressions.
    */
   Expression[] clustering();
+
+  /**
+   * Returns the number of partitions required by this write.
+   * <p>
+   * Implementations may want to override this if it requires the specific number of partitions.
+   *
+   * @return the required number of partitions, non-positive values mean no requirement.
+   */
+  default int requiredNumPartitions() { return 0; }

Review comment:
       Yeah I thought this a bit more, and agree restricting the "parallelism" is a valid use case regardless of distribution/ordering.
   
   "coalesce" vs "repartition" remains the question; when distribution is specified I expect "repartition" will take effect, so probably "repartition" is more consistent. And what I observed from troubleshooting is that coalesce could bring an edge-case if used blindly, like when sink places with the same stage shuffle was done and takes whole stage quite slow.
   (That was `coalesce(1)` and we simply fixed the problem via changing it to `repartition(1)`.)
   
   I'd feel safer if writer requirement just affects the writer operation and doesn't affect other operations. What do you think?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768120265


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39124/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768155604


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39124/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-779849863


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135172/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-769055426


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134601/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r565709706



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DistributionAndOrderingUtils.scala
##########
@@ -32,21 +32,29 @@ object DistributionAndOrderingUtils {
     case write: RequiresDistributionAndOrdering =>
       val resolver = conf.resolver
 
-      val distribution = write.requiredDistribution match {
+      val (distribution, numPartitions) = write.requiredDistribution match {
         case d: OrderedDistribution =>
-          d.ordering.map(e => toCatalyst(e, query, resolver))
+          val dist = d.ordering.map(e => toCatalyst(e, query, resolver))
+          val numParts = d.requiredNumPartitions()
+          (dist, numParts)
         case d: ClusteredDistribution =>
-          d.clustering.map(e => toCatalyst(e, query, resolver))
+          val dist = d.clustering.map(e => toCatalyst(e, query, resolver))
+          val numParts = d.requiredNumPartitions()
+          (dist, numParts)
         case _: UnspecifiedDistribution =>
-          Array.empty[Expression]
+          (Array.empty[Expression], 0)
       }
 
       val queryWithDistribution = if (distribution.nonEmpty) {
-        val numShufflePartitions = conf.numShufflePartitions
+        val finalNumPartitions = if (numPartitions > 0) {

Review comment:
       Shall we keep the original variable name?
   ```scala
   - val finalNumPartitions = if (numPartitions > 0) {
   + val numShufflePartitions = if (numPartitions > 0) {
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768832811






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-806529208


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41093/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-770332998


   I agree it may not be a good idea to add the method directly in RequiresDistributionAndOrdering, if we are sure to expand the possibility of controlling partitions. I'm open to suggestion; I'm OK either way, having new interface which extends RequiresDistributionAndOrdering, or having new interface playing as mix-in with RequiresDistributionAndOrdering.
   
   Probably leaving it as it is, and moving out once we add another control is also feasible, as we have multiple months for 3.2.0 and it'll not be addressed in near future if no one proposes new changes within time.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] aokolnychyi commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r567124615



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/connector/WriteDistributionAndOrderingSuite.scala
##########
@@ -372,17 +492,82 @@ class WriteDistributionAndOrderingSuite
         Seq.empty
       )
     )
-    val writePartitioning = RangePartitioning(writeOrdering, conf.numShufflePartitions)
+    val writePartitioning = orderedWritePartitioning(writeOrdering, targetNumPartitions)
 
     checkWriteRequirements(
       tableDistribution,
+      targetNumPartitions,
       tableOrdering,
       expectedWritePartitioning = writePartitioning,
       expectedWriteOrdering = writeOrdering,
       writeTransform = df => df.sortWithinPartitions("data", "id"),
       writeCommand = command)
   }
 
+  ignore("ordered distribution and sort with manual repartition: append") {

Review comment:
       Shall we add the JIRA here? Is SPARK-34184 the right one?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-804751882


   **[Test build #136380 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136380/testReport)** for PR 31355 at commit [`7f6e82d`](https://github.com/apache/spark/commit/7f6e82de5a63750c4c5f210f4000ae1423007d3e).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-769023158


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134603/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rdblue commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r577007175



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DistributionAndOrderingUtils.scala
##########
@@ -32,23 +32,34 @@ object DistributionAndOrderingUtils {
     case write: RequiresDistributionAndOrdering =>
       val resolver = conf.resolver
 
-      val distribution = write.requiredDistribution match {
+      val numParts = write.requiredNumPartitions()
+      val (distribution, numPartitions) = write.requiredDistribution match {
         case d: OrderedDistribution =>
-          d.ordering.map(e => toCatalyst(e, query, resolver))
+          val dist = d.ordering.map(e => toCatalyst(e, query, resolver))
+          (dist, numParts)
         case d: ClusteredDistribution =>
-          d.clustering.map(e => toCatalyst(e, query, resolver))
+          val dist = d.clustering.map(e => toCatalyst(e, query, resolver))
+          (dist, numParts)
         case _: UnspecifiedDistribution =>
-          Array.empty[Expression]
+          (Array.empty[Expression], numParts)
       }
 
       val queryWithDistribution = if (distribution.nonEmpty) {
-        val numShufflePartitions = conf.numShufflePartitions
+        val finalNumPartitions = if (numPartitions > 0) {
+          numPartitions
+        } else {
+          conf.numShufflePartitions
+        }
         // the conversion to catalyst expressions above produces SortOrder expressions
         // for OrderedDistribution and generic expressions for ClusteredDistribution
         // this allows RepartitionByExpression to pick either range or hash partitioning
-        RepartitionByExpression(distribution, query, numShufflePartitions)
+        RepartitionByExpression(distribution, query, finalNumPartitions)
       } else {
-        query
+        if (numPartitions > 0) {
+          Repartition(numPartitions, shuffle = true, query)

Review comment:
       @aokolnychyi, what do you think about using a round-robin partition when there is no required distribution?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-806660809


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41100/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r601573880



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/connector/WriteDistributionAndOrderingSuite.scala
##########
@@ -569,4 +785,26 @@ class WriteDistributionAndOrderingSuite
         fail("expected V2TableWriteExec")
     }
   }
+
+  private def orderedWritePartitioning(
+      writeOrdering: Seq[catalyst.expressions.SortOrder],
+      targetNumPartitions: Option[Int]): physical.Partitioning = {
+    targetNumPartitions match {
+      case Some(parts) => RangePartitioning(writeOrdering, parts)
+      case _ => RangePartitioning(writeOrdering, conf.numShufflePartitions)
+    }
+  }
+
+  private def clusteredTableDistribution(clustering: Array[Expression]): Distribution = {
+    Distributions.clustered(clustering)

Review comment:
       `clusteredTableDistribution` doesn't seem to be shorter than `Distributions.clustered` ...




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-806726487


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136508/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r600880055



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/RequiresDistributionAndOrdering.java
##########
@@ -42,6 +42,19 @@
    */
   Distribution requiredDistribution();
 
+  /**
+   * Returns the number of partitions required by this write.
+   * <p>
+   * Implementations may override this to require a specific number of input partitions.
+   * <p>
+   * Note that Spark doesn't support the number of partitions on {@link UnspecifiedDistribution},
+   * if requiredDistribution() returns {@link UnspecifiedDistribution}, the return value for this
+   * method will be ignored.

Review comment:
       Let's make it fail so that the intention isn't silently ignored. Good point.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR edited a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR edited a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768107101


   Actually the proposal is more likely giving data source to force having static number of partitions regardless of output data. 
   
   I see valid concerns about drawbacks when data source has no idea on the output data but able to provide static number of partitions. As data source is blinded on the amount of data, the value it gives could be likely sub-optimal. But as I described on the PR description, there's some exceptional case where it requires "static" number of partitions. We might be able to say "not supported on DSv2", but then we'll never be able to deprecate DSv1 because it doesn't have exhaustive coverage.
   
   Furthermore, it's also true that Spark is blinded on the characteristic of data source. Spark assumes the data source has unlimited bandwidth and performance will always be better on higher parallelism. That would depend on the real range what Spark will provide, but I'm wondering it's always true for arbitrary external storage if we assume arbitrary number.
   
   If it's not ideal for the target data source to require arbitrary number of parallelism of writes, end users may try to repartition manually in prior to stick with some static number of partitions, but Spark will do repartition again if data source requires distribution/ordering and ignore the adjustment. That's probably different issue as human is giving some hint, but the point is that there could be a case on desired number of partitions. The static number of partitions may not also be good idea for such case (might not be too bad if that's an intention by human), but (lower limit, upper limit) can be given heuristically accounting the characteristic.
   
   Another case is that data source may know better about the optimization for the relation of data volume and the parallelism. This requires Spark to provide some statistic info back to data source and let data source decide the parallelism if possible.
   
   Except static partitioning, others are like sketched ideas. Just a 2 cents.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-794441176


   I don't have a real use case as I commented earlier on review comment, but you could imagine the case where external storage has limited bandwidth on concurrent writes. Quite earlier (5+ years ago) I had a case and had to deal with manual repartition. 
   
   Actually, specifying upper bound / lower bound looks more suitable for that case (so it triggers repartition only when the write operation's partitions exceed/under certain boundary), but that looks to be beyond the PR.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-791791990


   Kindly reminder @viirya @rdblue @sunchao @cloud-fan @aokolnychyi @dongjoon-hyun 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-770337518


   **[Test build #134687 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134687/testReport)** for PR 31355 at commit [`be94bb4`](https://github.com/apache/spark/commit/be94bb492f6f413d6813702833827e837b9fb515).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-806410924


   retest this, please


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768347645


   The use case makes sense to me.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768617000


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39156/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768622003


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39156/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-779683450


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135163/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-770337518


   **[Test build #134687 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134687/testReport)** for PR 31355 at commit [`be94bb4`](https://github.com/apache/spark/commit/be94bb492f6f413d6813702833827e837b9fb515).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768139344


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39124/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768226110


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134539/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-804645450


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40964/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r567372512



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/connector/WriteDistributionAndOrderingSuite.scala
##########
@@ -372,17 +492,82 @@ class WriteDistributionAndOrderingSuite
         Seq.empty
       )
     )
-    val writePartitioning = RangePartitioning(writeOrdering, conf.numShufflePartitions)
+    val writePartitioning = orderedWritePartitioning(writeOrdering, targetNumPartitions)
 
     checkWriteRequirements(
       tableDistribution,
+      targetNumPartitions,
       tableOrdering,
       expectedWritePartitioning = writePartitioning,
       expectedWriteOrdering = writeOrdering,
       writeTransform = df => df.sortWithinPartitions("data", "id"),
       writeCommand = command)
   }
 
+  ignore("ordered distribution and sort with manual repartition: append") {

Review comment:
       Ah OK this PR was based on your PR which didn't remove ignored tests. Probably safer to remove here as well, with adding change in JIRA comment. Will do.   

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/connector/WriteDistributionAndOrderingSuite.scala
##########
@@ -372,17 +492,82 @@ class WriteDistributionAndOrderingSuite
         Seq.empty
       )
     )
-    val writePartitioning = RangePartitioning(writeOrdering, conf.numShufflePartitions)
+    val writePartitioning = orderedWritePartitioning(writeOrdering, targetNumPartitions)
 
     checkWriteRequirements(
       tableDistribution,
+      targetNumPartitions,
       tableOrdering,
       expectedWritePartitioning = writePartitioning,
       expectedWriteOrdering = writeOrdering,
       writeTransform = df => df.sortWithinPartitions("data", "id"),
       writeCommand = command)
   }
 
+  ignore("ordered distribution and sort with manual repartition: append") {

Review comment:
       Ah OK this PR was based on your PR which didn't remove ignored tests yet. Probably safer to remove here as well, with adding change in JIRA comment. Will do.   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-806288626


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41066/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-806288651


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41066/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-804602562


   **[Test build #136380 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136380/testReport)** for PR 31355 at commit [`7f6e82d`](https://github.com/apache/spark/commit/7f6e82de5a63750c4c5f210f4000ae1423007d3e).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r565709706



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DistributionAndOrderingUtils.scala
##########
@@ -32,21 +32,29 @@ object DistributionAndOrderingUtils {
     case write: RequiresDistributionAndOrdering =>
       val resolver = conf.resolver
 
-      val distribution = write.requiredDistribution match {
+      val (distribution, numPartitions) = write.requiredDistribution match {
         case d: OrderedDistribution =>
-          d.ordering.map(e => toCatalyst(e, query, resolver))
+          val dist = d.ordering.map(e => toCatalyst(e, query, resolver))
+          val numParts = d.requiredNumPartitions()
+          (dist, numParts)
         case d: ClusteredDistribution =>
-          d.clustering.map(e => toCatalyst(e, query, resolver))
+          val dist = d.clustering.map(e => toCatalyst(e, query, resolver))
+          val numParts = d.requiredNumPartitions()
+          (dist, numParts)
         case _: UnspecifiedDistribution =>
-          Array.empty[Expression]
+          (Array.empty[Expression], 0)
       }
 
       val queryWithDistribution = if (distribution.nonEmpty) {
-        val numShufflePartitions = conf.numShufflePartitions
+        val finalNumPartitions = if (numPartitions > 0) {

Review comment:
       ~Shall we keep the original variable name?~ Never mind.
   ```scala
   - val finalNumPartitions = if (numPartitions > 0) {
   + val numShufflePartitions = if (numPartitions > 0) {
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-780230566


   **[Test build #135182 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135182/testReport)** for PR 31355 at commit [`6807926`](https://github.com/apache/spark/commit/6807926187ec787f8d95d4bb205bf288da18547d).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768597841


   **[Test build #134571 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134571/testReport)** for PR 31355 at commit [`5d56dc0`](https://github.com/apache/spark/commit/5d56dc0b81a89e22f713d7160c04a5b9704bee29).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-806423266


   **[Test build #136508 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136508/testReport)** for PR 31355 at commit [`e8fb6f1`](https://github.com/apache/spark/commit/e8fb6f18215ef43083cc2922328f90535380969c).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-806528341


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41093/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768832811






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-806332854


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136482/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-779683450


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135163/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-804779677


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136380/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-793991215


   > how to deal with static number of partitions when there's no required distribution.
   
   Do we have a real use case for it? It looks weird if the sink just requires a certain number of partitions.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR edited a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR edited a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768801016


   > AQE won't kick in if users specify num partitions, e.g. df.repartition(5), I think the same applies here if the sink requires a certain num partitions.
   
   The case I imagine is like this, `df.repartition(5).write.(blabla).save()`, with the condition sink requires a specific distribution but doesn't require specific number of partitions. Now Spark would repartition to default number of shuffle partitions, and I'm unsure changing it to None (SPARK-34230) would keep the user intention.
   
   There's a glitch where end users would know the good number of partitions but don't know about the requirement of distribution/ordering for the sink, so end users can't add the same repartition (but with number of partitions) sink will require to Spark and see the opportunity Spark will deduplicate two neighbor repartitions.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-770354919


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39273/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-804602562


   **[Test build #136380 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136380/testReport)** for PR 31355 at commit [`7f6e82d`](https://github.com/apache/spark/commit/7f6e82de5a63750c4c5f210f4000ae1423007d3e).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-771852803


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39379/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-771852803


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39379/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r565860036



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/distributions/OrderedDistribution.java
##########
@@ -32,4 +32,13 @@
    * Returns ordering expressions.
    */
   SortOrder[] ordering();
+
+  /**
+   * Returns the number of partitions required by this write.

Review comment:
       That was my first trial, but stepped back because the number is only used for specific distributions. Probably I was over-thinking. Let me give it a try again.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-780141389


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39763/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r565381214



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/connector/distributions/distributions.scala
##########
@@ -29,31 +29,47 @@ private[sql] object LogicalDistributions {
     ClusteredDistributionImpl(clustering)
   }
 
+  def clustered(clustering: Array[Expression], numPartitions: Int): ClusteredDistribution = {
+    ClusteredDistributionImpl(clustering, Some(numPartitions))
+  }
+
   def ordered(ordering: Array[SortOrder]): OrderedDistribution = {
     OrderedDistributionImpl(ordering)
   }
+
+  def ordered(ordering: Array[SortOrder], numPartitions: Int): OrderedDistribution = {
+    OrderedDistributionImpl(ordering, Some(numPartitions))
+  }
 }
 
 private[sql] object UnspecifiedDistributionImpl extends UnspecifiedDistribution {
   override def toString: String = "UnspecifiedDistribution"
 }
 
 private[sql] final case class ClusteredDistributionImpl(
-    clusteringExprs: Seq[Expression]) extends ClusteredDistribution {
+    clusteringExprs: Seq[Expression],
+    numPartitions: Option[Int] = None) extends ClusteredDistribution {

Review comment:
       why this needs to be optional while we define non-negative value as no requirement?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-780240325


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135182/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768580667


   I'd like to see our preference on dealing with "possible (but just floated ideas now) improvements" in public API.
   
   I think this use case is clear, and the change fulfills the requirement if we only consider such use case. Possible improvements might require the API to be changed as a backward incompatible way; I guess that is the reason @aokolnychyi said we should discuss. But again for me this use case is only solid, others are just ideas/possibility.
   
   One possible deal is adding this first, and give us enough time (feature freeze on 3.2.0) to make improvement with "free to break". Even after 3.2.0 we could make an excuse to break based on the fact these classes are marked as Experimental, though we'd like to minimize the occurrence.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-808668792


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136581/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-780117441


   **[Test build #135182 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135182/testReport)** for PR 31355 at commit [`6807926`](https://github.com/apache/spark/commit/6807926187ec787f8d95d4bb205bf288da18547d).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-779567396


   **[Test build #135163 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135163/testReport)** for PR 31355 at commit [`b6d389e`](https://github.com/apache/spark/commit/b6d389e0d5a7ba8430a6e0c0f5b6d0379d8591ed).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-779552011


   Addressed removing the requirement on any specific distribution to require static number of partitions. Also renamed the method as it's no longer specific to distribution.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-779579031


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39744/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rdblue commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r565733924



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/distributions/ClusteredDistribution.java
##########
@@ -32,4 +32,13 @@
    * Returns clustering expressions.
    */
   Expression[] clustering();
+
+  /**
+   * Returns the number of partitions required by this write.
+   * <p>
+   * Implementations may want to override this if it requires the specific number of partitions.
+   *
+   * @return the required number of partitions, non-positive values mean no requirement.
+   */
+  default int requiredNumPartitions() { return 0; }

Review comment:
       Why not also add a parallelism requirement to `UnspecifiedDistribution`? That could insert a coalesce.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r565785171



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/distributions/OrderedDistribution.java
##########
@@ -32,4 +32,13 @@
    * Returns ordering expressions.
    */
   SortOrder[] ordering();
+
+  /**
+   * Returns the number of partitions required by this write.

Review comment:
       Hmm... you're right we seem to try to co-use the interface in both read/write path, so it might be confusing if something is only saying about write.
   
   I'm not sure this method can be used in read path as well though. For read path, data source is already able to control the parallelism (say, partitions), so no need to pass the requirement and let Spark does it instead.
   
   So I guess this is a point of divergence between read and write. If we can explicitly and effectively classify the methods as "common", "read specific", "write specific" then that would be great, but let's hear others' voices on this.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768619236


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39156/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-806529208


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41093/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-806523361


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41093/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-806221984


   **[Test build #136482 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136482/testReport)** for PR 31355 at commit [`e8fb6f1`](https://github.com/apache/spark/commit/e8fb6f18215ef43083cc2922328f90535380969c).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-808668792


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136581/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-788617728


   I think now this PR only has a concern on how to deal with static number of partitions when there's no required distribution. 
   
   I picked repartition instead of coalesce to avoid messing up with the parallelism of computations in same stage. Picking up round-robin distribution is simply just because it looks to be the default for distribution without specific requirement of distribution.
   
   Please take a look again, and provide voices on above decision. Thanks in advance!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-769013099


   **[Test build #134603 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134603/testReport)** for PR 31355 at commit [`60ebadf`](https://github.com/apache/spark/commit/60ebadf0bf26efb7c1579073c5d3e235c23cfa24).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r577125087



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/RequiresDistributionAndOrdering.java
##########
@@ -42,6 +42,15 @@
    */
   Distribution requiredDistribution();
 
+  /**
+   * Returns the number of partitions required by this write.
+   * <p>
+   * Implementations may want to override this if it requires the specific number of partitions.
+   *
+   * @return the required number of partitions, non-positive values mean no requirement.

Review comment:
       Probably you've worried about the possible confusion on inclusion of "0". While non-positive isn't wrong I agree it's less intuitive (less straightforward). Thanks for the suggestion!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768949196


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39191/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-780141389


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39763/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-779694716


   **[Test build #135172 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135172/testReport)** for PR 31355 at commit [`1c42e7e`](https://github.com/apache/spark/commit/1c42e7e3cf7aec39d82a25c9e1c3896b91c84274).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-809823145


   Thanks all for reviewing and merging!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-779710833


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39753/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768802431


   Yes thanks for confirming. I wanted to make sure my understanding about possible behavioral change is correct. I agree that is not related to this PR.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] aokolnychyi commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r567373288



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/connector/WriteDistributionAndOrderingSuite.scala
##########
@@ -372,17 +492,82 @@ class WriteDistributionAndOrderingSuite
         Seq.empty
       )
     )
-    val writePartitioning = RangePartitioning(writeOrdering, conf.numShufflePartitions)
+    val writePartitioning = orderedWritePartitioning(writeOrdering, targetNumPartitions)
 
     checkWriteRequirements(
       tableDistribution,
+      targetNumPartitions,
       tableOrdering,
       expectedWritePartitioning = writePartitioning,
       expectedWriteOrdering = writeOrdering,
       writeTransform = df => df.sortWithinPartitions("data", "id"),
       writeCommand = command)
   }
 
+  ignore("ordered distribution and sort with manual repartition: append") {

Review comment:
       Thanks!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-794970669


   I'm OK to exclude it here and let someone come up with actual use case. As that was requested by @rdblue let's hear the voice before making change.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-804779677


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136380/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-804620342


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40964/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768107101


   Actually the proposal is more likely giving data source to force having static number of partitions regardless of output data. 
   
   I see valid concerns about drawbacks when data source has no idea on the output data but able to provide static number of partitions. As data source is blinded on the amount of data, the value it gives could be likely sub-optimal. But as I described on the PR description, there's some exceptional case where it requires "static" number of partitions. We might be able to say "not supported on DSv2", but then we'll never be able to deprecate DSv1 because it doesn't have exhaustive coverage.
   
   Furthermore, it's also true that Spark is blinded on the characteristic of data source. Spark assumes the data source has unlimited bandwidth and performance will always be better on higher parallelism. That would depend on the real range what Spark will provide, but I'm wondering it's always true for arbitrary external storage if we assume arbitrary number.
   
   If it's not ideal for the target data source to require arbitrary number of parallelism of writes, end users may try to repartition manually in prior to stick with some static number of partitions, but Spark will do repartition again if data source requires distribution/ordering and ignore the adjustment. That's probably different issue as human is giving some hint, but the point is that there could be a case on desired number of partitions. The static number of partitions may not also be good idea for such case, but (lower limit, upper limit) can be given heuristically accounting the characteristic.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768923566


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39191/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR edited a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR edited a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768107101


   Actually the proposal is more likely giving data source to force having static number of partitions regardless of output data. 
   
   I see valid concerns about drawbacks when data source has no idea on the output data but able to provide static number of partitions. As data source is blinded on the amount of data, the value it gives could be likely sub-optimal. But as I described on the PR description, there's some exceptional case where it requires "static" number of partitions. We might be able to say "not supported on DSv2", but then we'll never be able to deprecate DSv1 because it doesn't have exhaustive coverage.
   
   Furthermore, it's also true that Spark is blinded on the characteristic of data source. Spark assumes the data source has unlimited bandwidth and performance will always be better on higher parallelism. That would depend on the real range what Spark will provide, but I'm wondering it's always true for arbitrary external storage if we assume arbitrary number.
   
   If it's not ideal for the target data source to require arbitrary number of parallelism of writes, end users may try to repartition manually in prior to stick with some static number of partitions, but Spark will do repartition again if data source requires distribution/ordering and ignore the adjustment. That's probably different issue as human is giving some hint, but the point is that there could be a case on desired number of partitions. The static number of partitions may not also be good idea for such case, but (lower limit, upper limit) can be given heuristically accounting the characteristic.
   
   Another case is that data source may know better about the optimization for the relation of data volume and the parallelism. This requires Spark to provide some statistic info back to data source and let data source decide the parallelism if possible.
   
   Except static partitioning, others are like sketched ideas. Just a 2 cents.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-806327246


   > Test build #136482 has finished for PR 31355 at commit e8fb6f1.
   > 
   > This patch fails Spark unit tests.
   > This patch merges cleanly.
   > This patch adds no public classes.
   
   Failed tests are following:
   
   ```
   org.apache.spark.sql.TPCDSV1_4_PlanStabilitySuite.check simplified (tpcds-v1.4/q17)
   org.apache.spark.sql.TPCDSV1_4_PlanStabilityWithStatsSuite.check simplified sf100 (tpcds-v1.4/q17) 
   ```
   
   and I have no idea about the relationship. Is it known as flaky test, or is this PR possible to touch the plan?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r576402899



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/RequiresDistributionAndOrdering.java
##########
@@ -42,6 +42,19 @@
    */
   Distribution requiredDistribution();
 
+  /**
+   * Returns the number of partitions required by this write if specific distribution is required.
+   * <p>
+   * Implementations may want to override this if it requires the specific number of partitions
+   * on distribution.
+   * <p>
+   * {@link UnspecifiedDistribution} is not affected by this method, as it doesn't require the
+   * specific distribution.
+   *
+   * @return the required number of partitions, non-positive values mean no requirement.
+   */
+  default int requiredNumPartitionsOnDistribution() { return 0; }

Review comment:
       I'm actually more familiar with the word "parallelism" but the word looks to be less used in Spark - "partition" is being used almost everywhere. I'm OK to mention it as "parallelism" but let's hear more voices on this.
   
   The name comes from the fact the number is only effective when distribution is specified - longer name is to avoid misunderstanding that it also takes effect on sorting request, whereas it is not. Probably we could discuss the impact first and revisit this.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rdblue commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r577004701



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/RequiresDistributionAndOrdering.java
##########
@@ -42,6 +42,15 @@
    */
   Distribution requiredDistribution();
 
+  /**
+   * Returns the number of partitions required by this write.
+   * <p>
+   * Implementations may want to override this if it requires the specific number of partitions.
+   *
+   * @return the required number of partitions, non-positive values mean no requirement.

Review comment:
       Rather than "non-positive" how about "any value less than 1"? I think that's more clear.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-771826964


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39379/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-779729436


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39753/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768866804


   NOTE to reviewers: I've changed the location of required number of partitions from Distribution interfaces/implementations to RequiresDistributionAndOrdering, so that Distribution interfaces/implementations are still suitable for both read and write. (Addressed https://github.com/apache/spark/pull/31355#discussion_r565840043)
   
   Please let me know if the change works better. If majority prefers former change I can revert the commit.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768740266


   **[Test build #134571 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134571/testReport)** for PR 31355 at commit [`5d56dc0`](https://github.com/apache/spark/commit/5d56dc0b81a89e22f713d7160c04a5b9704bee29).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-779694716


   **[Test build #135172 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135172/testReport)** for PR 31355 at commit [`1c42e7e`](https://github.com/apache/spark/commit/1c42e7e3cf7aec39d82a25c9e1c3896b91c84274).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-770354919






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-806861772


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136515/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-806332643


   I took the diff on printed plans (expected, actual) in stack trace, and they seem to be identical.
   
   ```
   TakeOrderedAndProject [i_item_id,i_item_desc,s_state,store_sales_quantitycount,store_sales_quantityave,store_sales_quantitystdev,store_sales_quantitycov,as_store_returns_quantitycount,as_store_returns_quantityave,as_store_returns_quantitystdev,store_returns_quantitycov,catalog_sales_quantitycount,catalog_sales_quantityave,catalog_sales_quantitystdev,catalog_sales_quantitycov]
     WholeStageCodegen (9)
       HashAggregate [i_item_id,i_item_desc,s_state,count,sum,count,n,avg,m2,count,sum,count,n,avg,m2,count,sum,count,n,avg,m2] [count(ss_quantity),avg(ss_quantity),stddev_samp(cast(ss_quantity as double)),count(sr_return_quantity),avg(sr_return_quantity),stddev_samp(cast(sr_return_quantity as double)),count(cs_quantity),avg(cs_quantity),stddev_samp(cast(cs_quantity as double)),store_sales_quantitycount,store_sales_quantityave,store_sales_quantitystdev,store_sales_quantitycov,as_store_returns_quantitycount,as_store_returns_quantityave,as_store_returns_quantitystdev,store_returns_quantitycov,catalog_sales_quantitycount,catalog_sales_quantityave,catalog_sales_quantitystdev,catalog_sales_quantitycov,count,sum,count,n,avg,m2,count,sum,count,n,avg,m2,count,sum,count,n,avg,m2]
         InputAdapter
           Exchange [i_item_id,i_item_desc,s_state] #1
             WholeStageCodegen (8)
               HashAggregate [i_item_id,i_item_desc,s_state,ss_quantity,sr_return_quantity,cs_quantity] [count,sum,count,n,avg,m2,count,sum,count,n,avg,m2,count,sum,count,n,avg,m2,count,sum,count,n,avg,m2,count,sum,count,n,avg,m2,count,sum,count,n,avg,m2]
                 Project [ss_quantity,sr_return_quantity,cs_quantity,s_state,i_item_id,i_item_desc]
                   BroadcastHashJoin [ss_item_sk,i_item_sk]
                     Project [ss_item_sk,ss_quantity,sr_return_quantity,cs_quantity,s_state]
                       BroadcastHashJoin [ss_store_sk,s_store_sk]
                         Project [ss_item_sk,ss_store_sk,ss_quantity,sr_return_quantity,cs_quantity]
                           BroadcastHashJoin [cs_sold_date_sk,d_date_sk]
                             Project [ss_item_sk,ss_store_sk,ss_quantity,sr_return_quantity,cs_quantity,cs_sold_date_sk]
                               BroadcastHashJoin [sr_returned_date_sk,d_date_sk]
                                 Project [ss_item_sk,ss_store_sk,ss_quantity,sr_return_quantity,sr_returned_date_sk,cs_quantity,cs_sold_date_sk]
                                   BroadcastHashJoin [ss_sold_date_sk,d_date_sk]
                                     Project [ss_item_sk,ss_store_sk,ss_quantity,ss_sold_date_sk,sr_return_quantity,sr_returned_date_sk,cs_quantity,cs_sold_date_sk]
                                       BroadcastHashJoin [sr_customer_sk,sr_item_sk,cs_bill_customer_sk,cs_item_sk]
                                         Project [ss_item_sk,ss_store_sk,ss_quantity,ss_sold_date_sk,sr_item_sk,sr_customer_sk,sr_return_quantity,sr_returned_date_sk]
                                           BroadcastHashJoin [ss_customer_sk,ss_item_sk,ss_ticket_number,sr_customer_sk,sr_item_sk,sr_ticket_number]
                                             Filter [ss_customer_sk,ss_item_sk,ss_ticket_number,ss_store_sk]
                                               ColumnarToRow
                                                 InputAdapter
                                                   Scan parquet default.store_sales [ss_item_sk,ss_customer_sk,ss_store_sk,ss_ticket_number,ss_quantity,ss_sold_date_sk]
                                                     SubqueryBroadcast [d_date_sk] #1
                                                       ReusedExchange [d_date_sk] #2
                                             InputAdapter
                                               BroadcastExchange #3
                                                 WholeStageCodegen (1)
                                                   Filter [sr_customer_sk,sr_item_sk,sr_ticket_number]
                                                     ColumnarToRow
                                                       InputAdapter
                                                         Scan parquet default.store_returns [sr_item_sk,sr_customer_sk,sr_ticket_number,sr_return_quantity,sr_returned_date_sk]
                                                           SubqueryBroadcast [d_date_sk] #2
                                                             ReusedExchange [d_date_sk] #4
                                         InputAdapter
                                           BroadcastExchange #5
                                             WholeStageCodegen (2)
                                               Filter [cs_bill_customer_sk,cs_item_sk]
                                                 ColumnarToRow
                                                   InputAdapter
                                                     Scan parquet default.catalog_sales [cs_bill_customer_sk,cs_item_sk,cs_quantity,cs_sold_date_sk]
                                                       SubqueryBroadcast [d_date_sk] #3
                                                         ReusedExchange [d_date_sk] #4
                                     InputAdapter
                                       BroadcastExchange #2
                                         WholeStageCodegen (3)
                                           Project [d_date_sk]
                                             Filter [d_quarter_name,d_date_sk]
                                               ColumnarToRow
                                                 InputAdapter
                                                   Scan parquet default.date_dim [d_date_sk,d_quarter_name]
                                 InputAdapter
                                   BroadcastExchange #4
                                     WholeStageCodegen (4)
                                       Project [d_date_sk]
                                         Filter [d_quarter_name,d_date_sk]
                                           ColumnarToRow
                                             InputAdapter
                                               Scan parquet default.date_dim [d_date_sk,d_quarter_name]
                             InputAdapter
                               ReusedExchange [d_date_sk] #4
                         InputAdapter
                           BroadcastExchange #6
                             WholeStageCodegen (6)
                               Filter [s_store_sk]
                                 ColumnarToRow
                                   InputAdapter
                                     Scan parquet default.store [s_store_sk,s_state]
                     InputAdapter
                       BroadcastExchange #7
                         WholeStageCodegen (7)
                           Filter [i_item_sk]
                             ColumnarToRow
                               InputAdapter
                                 Scan parquet default.item [i_item_sk,i_item_id,i_item_desc]
   ```
   
   ```
   TakeOrderedAndProject [i_item_id,i_item_desc,s_state,store_sales_quantitycount,store_sales_quantityave,store_sales_quantitystdev,store_sales_quantitycov,as_store_returns_quantitycount,as_store_returns_quantityave,as_store_returns_quantitystdev,store_returns_quantitycov,catalog_sales_quantitycount,catalog_sales_quantityave,catalog_sales_quantitystdev,catalog_sales_quantitycov]
     WholeStageCodegen (9)
       HashAggregate [i_item_id,i_item_desc,s_state,count,sum,count,n,avg,m2,count,sum,count,n,avg,m2,count,sum,count,n,avg,m2] [count(ss_quantity),avg(ss_quantity),stddev_samp(cast(ss_quantity as double)),count(sr_return_quantity),avg(sr_return_quantity),stddev_samp(cast(sr_return_quantity as double)),count(cs_quantity),avg(cs_quantity),stddev_samp(cast(cs_quantity as double)),store_sales_quantitycount,store_sales_quantityave,store_sales_quantitystdev,store_sales_quantitycov,as_store_returns_quantitycount,as_store_returns_quantityave,as_store_returns_quantitystdev,store_returns_quantitycov,catalog_sales_quantitycount,catalog_sales_quantityave,catalog_sales_quantitystdev,catalog_sales_quantitycov,count,sum,count,n,avg,m2,count,sum,count,n,avg,m2,count,sum,count,n,avg,m2]
         InputAdapter
           Exchange [i_item_id,i_item_desc,s_state] #1
             WholeStageCodegen (8)
               HashAggregate [i_item_id,i_item_desc,s_state,ss_quantity,sr_return_quantity,cs_quantity] [count,sum,count,n,avg,m2,count,sum,count,n,avg,m2,count,sum,count,n,avg,m2,count,sum,count,n,avg,m2,count,sum,count,n,avg,m2,count,sum,count,n,avg,m2]
                 Project [ss_quantity,sr_return_quantity,cs_quantity,s_state,i_item_id,i_item_desc]
                   BroadcastHashJoin [ss_item_sk,i_item_sk]
                     Project [ss_item_sk,ss_quantity,sr_return_quantity,cs_quantity,s_state]
                       BroadcastHashJoin [ss_store_sk,s_store_sk]
                         Project [ss_item_sk,ss_store_sk,ss_quantity,sr_return_quantity,cs_quantity]
                           BroadcastHashJoin [cs_sold_date_sk,d_date_sk]
                             Project [ss_item_sk,ss_store_sk,ss_quantity,sr_return_quantity,cs_quantity,cs_sold_date_sk]
                               BroadcastHashJoin [sr_returned_date_sk,d_date_sk]
                                 Project [ss_item_sk,ss_store_sk,ss_quantity,sr_return_quantity,sr_returned_date_sk,cs_quantity,cs_sold_date_sk]
                                   BroadcastHashJoin [ss_sold_date_sk,d_date_sk]
                                     Project [ss_item_sk,ss_store_sk,ss_quantity,ss_sold_date_sk,sr_return_quantity,sr_returned_date_sk,cs_quantity,cs_sold_date_sk]
                                       BroadcastHashJoin [sr_customer_sk,sr_item_sk,cs_bill_customer_sk,cs_item_sk]
                                         Project [ss_item_sk,ss_store_sk,ss_quantity,ss_sold_date_sk,sr_item_sk,sr_customer_sk,sr_return_quantity,sr_returned_date_sk]
                                           BroadcastHashJoin [ss_customer_sk,ss_item_sk,ss_ticket_number,sr_customer_sk,sr_item_sk,sr_ticket_number]
                                             Filter [ss_customer_sk,ss_item_sk,ss_ticket_number,ss_store_sk]
                                               ColumnarToRow
                                                 InputAdapter
                                                   Scan parquet default.store_sales [ss_item_sk,ss_customer_sk,ss_store_sk,ss_ticket_number,ss_quantity,ss_sold_date_sk]
                                                     SubqueryBroadcast [d_date_sk] #1
                                                       ReusedExchange [d_date_sk] #2
                                             InputAdapter
                                               BroadcastExchange #3
                                                 WholeStageCodegen (1)
                                                   Filter [sr_customer_sk,sr_item_sk,sr_ticket_number]
                                                     ColumnarToRow
                                                       InputAdapter
                                                         Scan parquet default.store_returns [sr_item_sk,sr_customer_sk,sr_ticket_number,sr_return_quantity,sr_returned_date_sk]
                                                           SubqueryBroadcast [d_date_sk] #2
                                                             ReusedExchange [d_date_sk] #4
                                         InputAdapter
                                           BroadcastExchange #5
                                             WholeStageCodegen (2)
                                               Filter [cs_bill_customer_sk,cs_item_sk]
                                                 ColumnarToRow
                                                   InputAdapter
                                                     Scan parquet default.catalog_sales [cs_bill_customer_sk,cs_item_sk,cs_quantity,cs_sold_date_sk]
                                                       SubqueryBroadcast [d_date_sk] #3
                                                         ReusedExchange [d_date_sk] #4
                                     InputAdapter
                                       BroadcastExchange #2
                                         WholeStageCodegen (3)
                                           Project [d_date_sk]
                                             Filter [d_quarter_name,d_date_sk]
                                               ColumnarToRow
                                                 InputAdapter
                                                   Scan parquet default.date_dim [d_date_sk,d_quarter_name]
                                 InputAdapter
                                   BroadcastExchange #4
                                     WholeStageCodegen (4)
                                       Project [d_date_sk]
                                         Filter [d_quarter_name,d_date_sk]
                                           ColumnarToRow
                                             InputAdapter
                                               Scan parquet default.date_dim [d_date_sk,d_quarter_name]
                             InputAdapter
                               ReusedExchange [d_date_sk] #4
                         InputAdapter
                           BroadcastExchange #6
                             WholeStageCodegen (6)
                               Filter [s_store_sk]
                                 ColumnarToRow
                                   InputAdapter
                                     Scan parquet default.store [s_store_sk,s_state]
                     InputAdapter
                       BroadcastExchange #7
                         WholeStageCodegen (7)
                           Filter [i_item_sk]
                             ColumnarToRow
                               InputAdapter
                                 Scan parquet default.item [i_item_sk,i_item_id,i_item_desc]
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-806646293


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41100/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r601573150



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/connector/WriteDistributionAndOrderingSuite.scala
##########
@@ -569,4 +785,26 @@ class WriteDistributionAndOrderingSuite
         fail("expected V2TableWriteExec")
     }
   }
+
+  private def orderedWritePartitioning(
+      writeOrdering: Seq[catalyst.expressions.SortOrder],
+      targetNumPartitions: Option[Int]): physical.Partitioning = {
+    targetNumPartitions match {
+      case Some(parts) => RangePartitioning(writeOrdering, parts)
+      case _ => RangePartitioning(writeOrdering, conf.numShufflePartitions)
+    }
+  }
+
+  private def clusteredTableDistribution(clustering: Array[Expression]): Distribution = {
+    Distributions.clustered(clustering)
+  }
+
+  private def clusteredWritePartitioning(
+      writePartitioningExprs: Seq[catalyst.expressions.Expression],
+      targetNumPartitions: Option[Int]): physical.Partitioning = {
+    targetNumPartitions match {

Review comment:
       ditto




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-808614930


   **[Test build #136581 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136581/testReport)** for PR 31355 at commit [`00fb5fb`](https://github.com/apache/spark/commit/00fb5fb18644f4f3843349c7d66a3649a9a47649).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-780137252


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39763/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768866810


   **[Test build #134603 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134603/testReport)** for PR 31355 at commit [`60ebadf`](https://github.com/apache/spark/commit/60ebadf0bf26efb7c1579073c5d3e235c23cfa24).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-779756533


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39753/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] aokolnychyi commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-770091245


   Requiring is a static number of partitions is a valid use case so this change looks good to me. Thanks, @HeartSaVioR!
   
   In my comment on the original PR, I meant that most data sources (that's my guess) would want to control the parallelism based on the size of incoming data. For example, we may want to have reasonably sized files while writing to a Hive table. In that case, the number of shuffle partitions will depend on the number of records in the incoming batch and the estimated size of each record. To achieve that, we could rely solely on AQE or we could accept some info from the source and feed that to AQE (it is probably similar to @rdblue's idea on bytes per task).
   
   The question that was I asking myself is whether knowing that we would want to have this more dynamic control would influence the approach we take when a data source requires a static number of partitions. For example, I was thinking about some interfaces like `RequiresStaticNumberOfPartitions` that would extend `Write` or an interface that would extend `RequiresDistributionAndOrdering` instead of adding the control directly to the existing interface. However, I am just throwing ideas to see what everybody thinks.
   
   As I said in the beginning, this PR looks good to me too.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-779840881


   **[Test build #135172 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135172/testReport)** for PR 31355 at commit [`1c42e7e`](https://github.com/apache/spark/commit/1c42e7e3cf7aec39d82a25c9e1c3896b91c84274).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-808262001


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41148/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768870502


   **[Test build #134601 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134601/testReport)** for PR 31355 at commit [`8b0eb56`](https://github.com/apache/spark/commit/8b0eb560b0ce71adf4c22dc4cf94dc2ac47d2856).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rdblue commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768668219


   Overall, I have no problem adding this. My one reservation is that I would hope that most sources don't explicitly control parallelism and I think adding only this would cause implementations to do exactly that. Instead, I would like to see some factor given to Spark to control parallelism, like bytes per task. That way parallelism can grow as incoming data grows. That said, I know there are cases where parallelism does need to be purposely controlled, like writing to a system with just a few nodes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r565369889



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/distributions/ClusteredDistribution.java
##########
@@ -32,4 +32,13 @@
    * Returns clustering expressions.
    */
   Expression[] clustering();
+
+  /**
+   * Returns the number of partitions required by this write.
+   * <p>
+   * Implementations may want to override this if it requires the specific number of partitions.
+   *
+   * @return the required number of partitions, 0 as no requirement

Review comment:
       nit: `non-positive values mean no requirement`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rdblue commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r577005518



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/RequiresDistributionAndOrdering.java
##########
@@ -42,6 +42,15 @@
    */
   Distribution requiredDistribution();
 
+  /**
+   * Returns the number of partitions required by this write.
+   * <p>
+   * Implementations may want to override this if it requires the specific number of partitions.

Review comment:
       "Implementations" is plural, but later referred to with a singular "it". How about a slight change to "Implementations may override this to require a specific number of input partitions"?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-808268209


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41148/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR edited a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR edited a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768780252


   Probably the discussion would be more constructive/productive if each idea may bring the explanation with the actual case, which storage is expected to get benefit on that, how the idea can boost the performance or help avoiding too many concurrent requests on the fly. After that we could determine "who" should decide the parallelism and "who" should provide the information for hints.
   (e.g. I wouldn't think higher parallelism always gives the best performance where output storage is ready for lower parallelism. Simply saying, 1000 concurrent tasks writing to one Kafka topic which has only 10 partitions.)
   
   My feeling is that these things are all about "optimizations" whereas the use case I described is about "requirement", so if we agree with the necessity of latter, former looks to be beyond of topic and we'd better discussing in other thread. (Either mailing list or sketched WIP PR if someone has some ideas.)
   
   Btw, one thing might need to revisit would be, if I understand correctly, once the data source requires distribution/ordering, end users lose the possibility of giving hint (based on heuristic, knowing the output data source and the characteristic of output data roughly) on the number of output partitions. Would it be addressed when we replace using default number of shuffle partitions with None, or AQE may disregard it?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-806288651


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41066/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r600804966



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/RequiresDistributionAndOrdering.java
##########
@@ -42,6 +42,19 @@
    */
   Distribution requiredDistribution();
 
+  /**
+   * Returns the number of partitions required by this write.
+   * <p>
+   * Implementations may override this to require a specific number of input partitions.
+   * <p>
+   * Note that Spark doesn't support the number of partitions on {@link UnspecifiedDistribution},
+   * if requiredDistribution() returns {@link UnspecifiedDistribution}, the return value for this
+   * method will be ignored.

Review comment:
       Shall we ignore or fail? It's very easy to detect it before applying the requirement.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768096504


   cc. @cloud-fan @aokolnychyi 
   
   Let's discuss in this PR, or please let me know if we prefer to have discussion in dev@ mailing list.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-804640082


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40964/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-771852803


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39379/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-779756533


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39753/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sunchao commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r565840043



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/distributions/OrderedDistribution.java
##########
@@ -32,4 +32,13 @@
    * Returns ordering expressions.
    */
   SortOrder[] ordering();
+
+  /**
+   * Returns the number of partitions required by this write.

Review comment:
       Agree this is not quite useful for read. Does it make sense to put this in `RequiresDistributionAndOrdering`? it will deviate from the catalyst definition of distribution but this is for write path only.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768155604


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39124/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-770341216


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39273/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-806332854


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136482/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-808268209


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41148/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-808137808


   **[Test build #136564 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136564/testReport)** for PR 31355 at commit [`923cc69`](https://github.com/apache/spark/commit/923cc6947f63c756df52a89f2e1700d53b8b2675).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-808268182


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41148/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-794887512


   > where external storage has limited bandwidth on concurrent writes
   
   for such a use case, shall we use max num partitions? One option is to not allow it first, and revisit it when there are feature requests with real-world use cases.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-779436038


   As I just commented on comment thread I'm OK to apply the parallelism regardless of requirement on distribution/sort. That sounds like a valid case, like the case the writer doesn't require any distribution/sort but wants to control the parallelism to avoid massive load on external storage.
   
   If the static parallelism applies on all cases, adding this to `RequiresDistributionAndOrdering` looks natural unless we want to open the possibility we drop the functionality. I'm open to hear more voice on this as well.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-779849863


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135172/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] aokolnychyi commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768642659


   Thanks for submitting this follow-up, @HeartSaVioR! I'll take a look tomorrow. 
   
   Let me also add folks who participated in the review of the original change.
   
   cc @sunchao @dongjoon-hyun @HyukjinKwon @rdblue @viirya 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r565860036



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/distributions/OrderedDistribution.java
##########
@@ -32,4 +32,13 @@
    * Returns ordering expressions.
    */
   SortOrder[] ordering();
+
+  /**
+   * Returns the number of partitions required by this write.

Review comment:
       That was my first trial, but stepped back because the number is used for specific distributions. Probably I was over-thinking. Let me give it a try again.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-806322506


   **[Test build #136482 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136482/testReport)** for PR 31355 at commit [`e8fb6f1`](https://github.com/apache/spark/commit/e8fb6f18215ef43083cc2922328f90535380969c).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768225534


   **[Test build #134539 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134539/testReport)** for PR 31355 at commit [`84108e4`](https://github.com/apache/spark/commit/84108e41cf89e48641f4c4f0cf573a96b0e7adab).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r577126287



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DistributionAndOrderingUtils.scala
##########
@@ -32,23 +32,34 @@ object DistributionAndOrderingUtils {
     case write: RequiresDistributionAndOrdering =>
       val resolver = conf.resolver
 
-      val distribution = write.requiredDistribution match {
+      val numParts = write.requiredNumPartitions()
+      val (distribution, numPartitions) = write.requiredDistribution match {

Review comment:
       Missed one. Nice catch!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768797232


   AQE won't kick in if users specify num partitions, e.g. `df.repartition(5)`, I think the same applies here if the sink requires a certain num partitions.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-805430634


   Appreciate another round of reviewing, thanks in advance! @viirya @rdblue @sunchao @cloud-fan @aokolnychyi @dongjoon-hyun


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768958763






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rdblue commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-779365676


   Before moving on, I think we need to decide whether this controls parallelism all the time, or just when certain distributions are used. I think it should be all the time for consistent behavior.
   
   As for whether this is on `RequiresDistributionAndOrdering` or a separate interface, I don't have a strong opinion. I would probably leave it where it is. It would be good to hear from @aokolnychyi on it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rdblue commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r576338554



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/RequiresDistributionAndOrdering.java
##########
@@ -42,6 +42,19 @@
    */
   Distribution requiredDistribution();
 
+  /**
+   * Returns the number of partitions required by this write if specific distribution is required.
+   * <p>
+   * Implementations may want to override this if it requires the specific number of partitions
+   * on distribution.
+   * <p>
+   * {@link UnspecifiedDistribution} is not affected by this method, as it doesn't require the
+   * specific distribution.
+   *
+   * @return the required number of partitions, non-positive values mean no requirement.
+   */
+  default int requiredNumPartitionsOnDistribution() { return 0; }

Review comment:
       I think this name is hard to understand. What about something more straightforward, like `requiredParallelism`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-770351088


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39273/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-770371697


   **[Test build #134687 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134687/testReport)** for PR 31355 at commit [`be94bb4`](https://github.com/apache/spark/commit/be94bb492f6f413d6813702833827e837b9fb515).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-768099637


   **[Test build #134539 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134539/testReport)** for PR 31355 at commit [`84108e4`](https://github.com/apache/spark/commit/84108e41cf89e48641f4c4f0cf573a96b0e7adab).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r565789213



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/distributions/ClusteredDistribution.java
##########
@@ -32,4 +32,13 @@
    * Returns clustering expressions.
    */
   Expression[] clustering();
+
+  /**
+   * Returns the number of partitions required by this write.
+   * <p>
+   * Implementations may want to override this if it requires the specific number of partitions.
+   *
+   * @return the required number of partitions, non-positive values mean no requirement.
+   */
+  default int requiredNumPartitions() { return 0; }

Review comment:
       I'm trying to not over-engineering here; this PR addresses the actual use case and no more. If we'd like to address more, I'd like to see the actual use case or at least possible scenario for each one. Like describing the characteristic of specific storage and how the functionality will help or why the functionality is required.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-779579371


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39744/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-800730995


   I'll just remove the handling of non specific distribution case early next week and ask for review again unless I hear some voice before.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-769030172


   **[Test build #134601 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134601/testReport)** for PR 31355 at commit [`8b0eb56`](https://github.com/apache/spark/commit/8b0eb560b0ce71adf4c22dc4cf94dc2ac47d2856).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-806423266


   **[Test build #136508 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136508/testReport)** for PR 31355 at commit [`e8fb6f1`](https://github.com/apache/spark/commit/e8fb6f18215ef43083cc2922328f90535380969c).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r602650158



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
##########
@@ -1097,4 +1097,9 @@ private[spark] object QueryCompilationErrors {
       s"and column $colName cannot be resolved. Expected $expectedNum columns named $colName but " +
       s"got ${actualCols.map(_.name).mkString("[", ",", "]")}")
   }
+
+  def numberOfPartitionsNotAllowedWithUnspecifiedDistributionError(): Throwable = {
+    throw new AnalysisException("The number of partitions can't be specified with unspecified" +
+      " distribution. Invalid Sink requirements detected.")

Review comment:
       Nope. I thought the word "Sink" is not specific to streaming, but looks like it brings confusion. Just changed to "writer". Thanks!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-779567396


   **[Test build #135163 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135163/testReport)** for PR 31355 at commit [`b6d389e`](https://github.com/apache/spark/commit/b6d389e0d5a7ba8430a6e0c0f5b6d0379d8591ed).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-808637619


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41165/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-779671501


   **[Test build #135163 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135163/testReport)** for PR 31355 at commit [`b6d389e`](https://github.com/apache/spark/commit/b6d389e0d5a7ba8430a6e0c0f5b6d0379d8591ed).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-808210633


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136564/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-806492370


   **[Test build #136515 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136515/testReport)** for PR 31355 at commit [`3e75641`](https://github.com/apache/spark/commit/3e7564173664482a503f7f43e37d3be71f590106).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-804598821


   I just removed the handling of non specific distribution case. Please take a look again. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-806270342


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41066/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan closed pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #31355:
URL: https://github.com/apache/spark/pull/31355


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-808637619


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41165/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] rdblue commented on a change in pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #31355:
URL: https://github.com/apache/spark/pull/31355#discussion_r577006259



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DistributionAndOrderingUtils.scala
##########
@@ -32,23 +32,34 @@ object DistributionAndOrderingUtils {
     case write: RequiresDistributionAndOrdering =>
       val resolver = conf.resolver
 
-      val distribution = write.requiredDistribution match {
+      val numParts = write.requiredNumPartitions()
+      val (distribution, numPartitions) = write.requiredDistribution match {

Review comment:
       I don't think it is necessary to return `numParts` from the `match` statement here. It is no longer carried by the distribution.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #31355: [SPARK-34255][SQL] Support partitioning with static number on required distribution and ordering on V2 write

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #31355:
URL: https://github.com/apache/spark/pull/31355#issuecomment-778848458


   Kindly reminder: it would be appreciated if I could get another round of review, and/or get some voices on the interface. ("keep it as it is" vs "have a separate interface extending RequiresDistributionAndOrdering" vs "have a 'mix-in' interface defining required number of partitions") 
   
   Thanks in advance!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org