Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2023/01/14 12:54:22 UTC

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6588: Spark 3.3: Add Default Parallelism Level for All Spark Driver Based Deletes

RussellSpitzer commented on code in PR #6588:
URL: https://github.com/apache/iceberg/pull/6588#discussion_r1070261827


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java:
##########
@@ -47,4 +47,8 @@ private SparkSQLProperties() {}
   public static final String PRESERVE_DATA_GROUPING =
       "spark.sql.iceberg.planning.preserve-data-grouping";
   public static final boolean PRESERVE_DATA_GROUPING_DEFAULT = false;
+
+  // Controls how many physical file deletes to execute in parallel when not otherwise specified
+  public static final String DELETE_PARALLELISM = "driver-delete-default-parallelism";
+  public static final String DELETE_PARALLELISM_DEFAULT = "25";

Review Comment:
   S3 throttles at roughly 4k requests per second, so this default leaves a lot of headroom.
   Assuming a 50 ms response time, each thread completes about 20 requests per second:
   4000 max requests per second / 20 requests per thread per second ≈ 200 max concurrent requests.
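
   The back-of-the-envelope estimate above can be written out as a short sketch. The class and method names are hypothetical and not part of the PR; the 4000 requests/second limit and the 50 ms latency are the assumptions stated in this comment, not measured values.

```java
// Illustrative only: reproduces the arithmetic from the review comment.
// Class and method names are hypothetical and do not exist in Iceberg.
final class DeleteParallelismEstimate {
  private DeleteParallelismEstimate() {}

  // Requests one thread can complete per second if each call blocks for responseTimeMillis.
  static double requestsPerThreadPerSecond(double responseTimeMillis) {
    return 1000.0 / responseTimeMillis;
  }

  // Largest number of concurrent threads that stays at or under the throttling limit.
  static long maxConcurrentRequests(double maxRequestsPerSecond, double responseTimeMillis) {
    return (long) Math.floor(maxRequestsPerSecond / requestsPerThreadPerSecond(responseTimeMillis));
  }

  public static void main(String[] args) {
    long ceiling = maxConcurrentRequests(4000, 50); // assumed limit and latency from the comment
    int defaultParallelism = 25;                    // DELETE_PARALLELISM_DEFAULT in the diff
    System.out.println("ceiling=" + ceiling + ", headroom=" + (ceiling / defaultParallelism) + "x");
  }
}
```

   With these inputs the ceiling is 200 concurrent requests, so the proposed default of 25 sits at one eighth of it.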
   
   Another option is to also incorporate the "bulk delete" APIs, but that would only help with S3-based filesystems.
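
   As a rough illustration of why bulk deletes would cut request volume, the sketch below groups file paths into batches of at most 1000, which is the per-request key limit of S3's DeleteObjects call. The class name, method name, and example paths are hypothetical, and no S3 client is invoked; this only shows the batching step, not how the PR or Iceberg would wire it in.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: batches paths for a bulk-delete style API.
// Names are hypothetical and no request is actually sent.
final class BulkDeleteBatches {
  private BulkDeleteBatches() {}

  // S3's DeleteObjects call accepts at most 1000 keys per request.
  static final int MAX_KEYS_PER_REQUEST = 1000;

  // Splits paths into consecutive batches of at most batchSize elements.
  static List<List<String>> partition(List<String> paths, int batchSize) {
    if (batchSize <= 0) {
      throw new IllegalArgumentException("batchSize must be positive");
    }
    List<List<String>> batches = new ArrayList<>();
    for (int start = 0; start < paths.size(); start += batchSize) {
      int end = Math.min(start + batchSize, paths.size());
      batches.add(new ArrayList<>(paths.subList(start, end)));
    }
    return batches;
  }

  public static void main(String[] args) {
    List<String> paths = new ArrayList<>();
    for (int i = 0; i < 2500; i++) {
      paths.add("s3://example-bucket/data/file-" + i + ".parquet"); // made-up paths
    }
    List<List<String>> batches = partition(paths, MAX_KEYS_PER_REQUEST);
    System.out.println(batches.size() + " bulk requests instead of " + paths.size() + " single deletes");
  }
}
```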
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
For additional commands, e-mail: issues-help@iceberg.apache.org