You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/08/01 23:00:47 UTC

[GitHub] [iceberg] dramaticlly opened a new pull request, #5412: Spark: Support Bulk deletion in expire-snapshots if fileIO allows

dramaticlly opened a new pull request, #5412:
URL: https://github.com/apache/iceberg/pull/5412

   In #4052, S3 fileIO now implement a new interfaces to support S3 batch deletion, this PR introduce it for Spark expire-snapshots procedure to conditionally support delete files in batch if underlying fileIO supports it (if implements `SupportsBulkOperations` interface and currently only S3FileIO support such)
   
   It default to use S3 batch deletion if fileIO is supported in catalog, allow for customization with `bulkDeleteWith` method
   - cannot reuse the existing  `bulkDelete` consumer function because it only take single file name at a time instead of a iterable
   - did not add interface override, want to keep this change as small as possible, once approved then we can retroactively apply to interface and previous spark version (2.4/3.0/3.1/3.2)
   - Considering the existing test fixture all use HadoopTables and it's very hard to test the integration of S3FileIO and Spark action together in unit tests, so I am looking for a way to do some integration tests and will share the results later 
   
   Similar to #5373 but for expire-snapshots procedure
   
   CC @rdblue , @danielcweeks, @amogh-jahagirdar, @szehon-ho 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on pull request #5412: Spark: Support Bulk deletion in expire-snapshots if fileIO allows

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on PR #5412:
URL: https://github.com/apache/iceberg/pull/5412#issuecomment-1203270642

   Probably we need to synchronize with @amogh-jahagirdar on whether we need a bulkDeleteFunc, and whether retry happens at S3FileIO layer or not.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] dramaticlly commented on a diff in pull request #5412: Spark: Support Bulk deletion in expire-snapshots if fileIO allows

Posted by GitBox <gi...@apache.org>.
dramaticlly commented on code in PR #5412:
URL: https://github.com/apache/iceberg/pull/5412#discussion_r935006214


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/ExpireSnapshotsSparkAction.java:
##########
@@ -139,6 +152,14 @@ public ExpireSnapshotsSparkAction deleteWith(Consumer<String> newDeleteFunc) {
     return this;
   }
 
+  public ExpireSnapshotsSparkAction bulkDeleteWith(Consumer<Iterable<String>> newBulkDeleteFunc) {
+    Preconditions.checkArgument(
+        ops.io() instanceof SupportsBulkOperations,
+        "FileIO %s does not support bulk deletion",
+        table.io().getClass().getName());

Review Comment:
   I shall probably move this `table.io()` to `ops.io()` for consistency, but given check take 1+hr to build, I will handle it together with other review feedback



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] dramaticlly commented on pull request #5412: Spark: Support Bulk deletion in expire-snapshots if fileIO allows

Posted by "dramaticlly (via GitHub)" <gi...@apache.org>.
dramaticlly commented on PR #5412:
URL: https://github.com/apache/iceberg/pull/5412#issuecomment-1412990170

   closed in favor of #6682, which cover more comprehensive spark actions and also implement bulk deletion for HadoopFileIO


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] dramaticlly commented on a diff in pull request #5412: Spark: Support Bulk deletion in expire-snapshots if fileIO allows

Posted by GitBox <gi...@apache.org>.
dramaticlly commented on code in PR #5412:
URL: https://github.com/apache/iceberg/pull/5412#discussion_r935006214


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/ExpireSnapshotsSparkAction.java:
##########
@@ -139,6 +152,14 @@ public ExpireSnapshotsSparkAction deleteWith(Consumer<String> newDeleteFunc) {
     return this;
   }
 
+  public ExpireSnapshotsSparkAction bulkDeleteWith(Consumer<Iterable<String>> newBulkDeleteFunc) {
+    Preconditions.checkArgument(
+        ops.io() instanceof SupportsBulkOperations,
+        "FileIO %s does not support bulk deletion",
+        table.io().getClass().getName());

Review Comment:
   I shall probably move this `table.io()` to `ops.io()` for consistency, but given check take 1+hr to complete, I will handle it together with other review feedback



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] dramaticlly closed pull request #5412: Spark: Support Bulk deletion in expire-snapshots if fileIO allows

Posted by "dramaticlly (via GitHub)" <gi...@apache.org>.
dramaticlly closed pull request #5412: Spark: Support Bulk deletion in expire-snapshots if fileIO allows
URL: https://github.com/apache/iceberg/pull/5412


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org