Posted to reviews@spark.apache.org by "aokolnychyi (via GitHub)" <gi...@apache.org> on 2023/03/14 16:54:24 UTC

[GitHub] [spark] aokolnychyi commented on a diff in pull request #40421: [SPARK-42779][SQL] Allow V2 writes to indicate advisory shuffle partition size

aokolnychyi commented on code in PR #40421:
URL: https://github.com/apache/spark/pull/40421#discussion_r1135900008


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/RequiresDistributionAndOrdering.java:
##########
@@ -66,12 +66,33 @@ public interface RequiresDistributionAndOrdering extends Write {
    * <p>
   * Note that Spark doesn't support a number of partitions with {@link UnspecifiedDistribution};
   * the query will fail if the number of partitions is provided but the distribution is
-   * unspecified.
+   * unspecified. Data sources may request either a particular number of partitions or
+   * a preferred partition size via {@link #advisoryPartitionSizeInBytes()}, but not both.
    *
   * @return the required number of partitions; any value less than 1 means no requirement.
    */
   default int requiredNumPartitions() { return 0; }
 
+  /**
+   * Returns the advisory (not guaranteed) shuffle partition size in bytes for this write.
+   * <p>
+   * Implementations may override this to indicate the preferred partition size in shuffles
+   * performed to satisfy the requested distribution. Note that Spark doesn't support setting
+   * the advisory partition size for {@link UnspecifiedDistribution}; the query will fail if
+   * the advisory partition size is set but the distribution is unspecified. Data sources may
+   * request either a particular number of partitions via {@link #requiredNumPartitions()} or
+   * a preferred partition size, but not both.
+   * <p>
+   * Data sources should be careful with large advisory sizes as they reduce the write
+   * parallelism and may degrade the overall job performance.
+   * <p>
+   * Note this value acts only as guidance: Spark does not guarantee that the actual and
+   * advisory shuffle partition sizes will match. This value is ignored if adaptive query
+   * execution is disabled.
+   *
+   * @return the advisory partition size in bytes; any value less than 1 means no preference.
+   */
+  default long advisoryPartitionSizeInBytes() { return 0; }

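For illustration only (a sketch, not part of the patch), a connector's `Write` could adopt the new hook roughly as follows; the class name and clustering column are made up:

```java
import org.apache.spark.sql.connector.distributions.Distribution;
import org.apache.spark.sql.connector.distributions.Distributions;
import org.apache.spark.sql.connector.expressions.Expression;
import org.apache.spark.sql.connector.expressions.Expressions;
import org.apache.spark.sql.connector.expressions.SortOrder;
import org.apache.spark.sql.connector.write.RequiresDistributionAndOrdering;

// Hypothetical write for a table clustered by "date" (illustrative only;
// toBatch()/toStreaming() inherit the default implementations from Write).
class ExampleWrite implements RequiresDistributionAndOrdering {

  @Override
  public Distribution requiredDistribution() {
    // shuffle incoming rows so that rows with the same "date" land together
    return Distributions.clustered(new Expression[] { Expressions.column("date") });
  }

  @Override
  public SortOrder[] requiredOrdering() {
    return new SortOrder[0]; // no ordering requirement
  }

  @Override
  public long advisoryPartitionSizeInBytes() {
    // hint AQE toward ~128 MiB shuffle partitions for this write;
    // requiredNumPartitions() keeps its default (0) since the two are mutually exclusive
    return 128L * 1024 * 1024;
  }
}
```

With adaptive execution enabled, the requested size would guide how Spark coalesces the shuffle partitions feeding this write, presumably in place of the session-wide spark.sql.adaptive.advisoryPartitionSizeInBytes default.
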
Review Comment:
   It is debatable whether we should use `Optional`, but I followed the existing `requiredNumPartitions` method.
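
   For context, a caller-side sketch of the trade-off (a hypothetical adapter, not code from this PR): the "less than 1 means no preference" sentinel can always be wrapped into an `OptionalLong` where needed.

   ```java
   import java.util.OptionalLong;
   import org.apache.spark.sql.connector.write.RequiresDistributionAndOrdering;

   // Hypothetical adapter: converts the sentinel convention into an OptionalLong.
   final class AdvisorySizes {
     static OptionalLong advisorySize(RequiresDistributionAndOrdering write) {
       long size = write.advisoryPartitionSizeInBytes();
       return size > 0 ? OptionalLong.of(size) : OptionalLong.empty();
     }
   }
   ```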



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org