You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/06/20 19:58:44 UTC

[GitHub] [iceberg] danielcweeks opened a new pull request, #5096: Add interface for FileIO prefix operations and implementations

danielcweeks opened a new pull request, #5096:
URL: https://github.com/apache/iceberg/pull/5096

   This adds an interface for FileIO implementations to support prefix based operations for listing and deleting.
   
   The primary motivation is to enable supporting maintenance activities (like cleaning path directories or listing table locations) without the need to fall back to Hadoop FileSystem. 
   
   There are some notable behavioral differences between directory based and object based storage systems (e.g. for directory based storage, a the prefix must denote a directory).  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5096: Add interface for FileIO prefix operations and implementations

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5096:
URL: https://github.com/apache/iceberg/pull/5096#discussion_r902050700


##########
api/src/main/java/org/apache/iceberg/io/FileInfo.java:
##########
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.io;
+
+public class FileInfo {
+  private final String location;
+  private final long size;
+  private final long created;
+
+  public FileInfo(String location, long size, long created) {
+    this.location = location;
+    this.size = size;
+    this.created = created;
+  }
+
+  public String location() {
+    return location;
+  }
+
+  public long size() {
+    return size;
+  }
+
+  public long created() {

Review Comment:
   What about `createdAtMillis`? That way we specify the expected unit (millis) and it is clear that this is a timestamp?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #5096: Add interface for FileIO prefix operations and implementations

Posted by GitBox <gi...@apache.org>.
rdblue merged PR #5096:
URL: https://github.com/apache/iceberg/pull/5096


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] danielcweeks commented on a diff in pull request #5096: Add interface for FileIO prefix operations and implementations

Posted by GitBox <gi...@apache.org>.
danielcweeks commented on code in PR #5096:
URL: https://github.com/apache/iceberg/pull/5096#discussion_r902055003


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIO.java:
##########
@@ -241,6 +246,52 @@ private List<String> deleteObjectsInBucket(String bucket, Collection<String> obj
     return Lists.newArrayList();
   }
 
+  @Override
+  public Iterator<FileInfo> listPrefix(String prefix) {
+    S3URI s3uri = new S3URI(prefix, awsProperties.s3BucketToAccessPointMapping());
+
+    return internalListPrefix(s3uri.bucket(), s3uri.key()).stream()
+        .flatMap(r -> r.contents().stream())
+        .map(o -> new FileInfo(o.key(), o.size(), o.lastModified().toEpochMilli())).iterator();
+  }
+
+  /**
+   * This method provides a "best-effort" to delete all objects under the
+   * given prefix.
+   *
+   * Bulk delete operations are used and no reattempt is made for deletes if
+   * they fail, but will log any individual objects that are not deleted as part
+   * of the bulk operation.
+   *
+   * @param prefix prefix to delete
+   */
+  @Override
+  public void deletePrefix(String prefix) {

Review Comment:
   I'll take a look at consolidating the two delete paths.  It's a little messy because the bulk delete is not streaming and will pull everything into memory, so doesn't work well for large operations.
   
   I'll see if I can find a combination / compromise here. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5096: Add interface for FileIO prefix operations and implementations

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5096:
URL: https://github.com/apache/iceberg/pull/5096#discussion_r902040981


##########
api/src/main/java/org/apache/iceberg/io/SupportsPrefixOperations.java:
##########
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.io;
+
+import java.util.stream.Stream;
+
+/**
+ * This interface is intended as an extension for FileIO implementations
+ * to provide additional prefix based operations that may be useful in
+ * performing supporting operations.
+ */
+public interface SupportsPrefixOperations {
+
+  /**
+   * Return a stream of all files under a prefix.
+   *

Review Comment:
   Javadoc: if you want a paragraph break in the rendered version, you have to add `<p>` to the empty lines.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5096: Add interface for FileIO prefix operations and implementations

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5096:
URL: https://github.com/apache/iceberg/pull/5096#discussion_r902033126


##########
api/src/main/java/org/apache/iceberg/io/FileInfo.java:
##########
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.io;
+
+import java.time.Instant;
+
+public class FileInfo {
+  private final String location;
+  private final long size;
+  private final Instant created;
+
+  public FileInfo(String location, long size, Instant created) {
+    this.location = location;
+    this.size = size;
+    this.created = created;
+  }
+
+  public String getLocation() {

Review Comment:
   Can you remove `get` from these method names? We typically omit them.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5096: Add interface for FileIO prefix operations and implementations

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5096:
URL: https://github.com/apache/iceberg/pull/5096#discussion_r904205358


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIO.java:
##########
@@ -241,6 +245,33 @@ private List<String> deleteObjectsInBucket(String bucket, Collection<String> obj
     return Lists.newArrayList();
   }
 
+  @Override
+  public Iterable<FileInfo> listPrefix(String prefix) {
+    S3URI s3uri = new S3URI(prefix, awsProperties.s3BucketToAccessPointMapping());
+    ListObjectsV2Request request = ListObjectsV2Request.builder().bucket(s3uri.bucket()).prefix(s3uri.key()).build();
+
+    return () -> client().listObjectsV2Paginator(request).stream()
+        .flatMap(r -> r.contents().stream())
+        .map(o -> new FileInfo(
+            String.format("%s://%s/%s", s3uri.scheme(), s3uri.bucket(), o.key()),

Review Comment:
   Minor: The bucket returned by `S3URI` may actually be an access point reference, which could break URI parsing if this were to embed it in a location.
   
   I think the solution is to move the S3 access point mapping out of `S3URI`, since that class should remain simple and report what was parsed and avoid this kind of issue. Let's not fix it here, but we should revisit this.
   
   FYI @jackye1995.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] danielcweeks commented on a diff in pull request #5096: Add interface for FileIO prefix operations and implementations

Posted by GitBox <gi...@apache.org>.
danielcweeks commented on code in PR #5096:
URL: https://github.com/apache/iceberg/pull/5096#discussion_r904004083


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIO.java:
##########
@@ -241,6 +246,54 @@ private List<String> deleteObjectsInBucket(String bucket, Collection<String> obj
     return Lists.newArrayList();
   }
 
+  @Override
+  public Iterator<FileInfo> listPrefix(String prefix) {
+    S3URI s3uri = new S3URI(prefix, awsProperties.s3BucketToAccessPointMapping());
+
+    return internalListPrefix(s3uri.bucket(), s3uri.key()).stream()
+        .flatMap(r -> r.contents().stream())
+        .map(o -> new FileInfo(
+            String.format("%s://%s/%s", s3uri.scheme(), s3uri.bucket(), o.key()),
+            o.size(), o.lastModified().toEpochMilli())).iterator();
+  }
+
+  /**
+   * This method provides a "best-effort" to delete all objects under the
+   * given prefix.
+   *
+   * Bulk delete operations are used and no reattempt is made for deletes if
+   * they fail, but will log any individual objects that are not deleted as part
+   * of the bulk operation.
+   *
+   * @param prefix prefix to delete
+   */
+  @Override
+  public void deletePrefix(String prefix) {
+    S3URI s3uri = new S3URI(prefix, awsProperties.s3BucketToAccessPointMapping());
+
+    internalListPrefix(s3uri.bucket(), s3uri.key()).stream().parallel().forEach(listing -> {
+      List<ObjectIdentifier> objectIdentifiers = listing.contents().stream()
+          .map(o -> ObjectIdentifier.builder().key(o.key()).build())
+          .collect(Collectors.toList());
+
+      DeleteObjectsRequest request = DeleteObjectsRequest.builder()
+          .bucket(s3uri.bucket())
+          .delete(Delete.builder().objects(objectIdentifiers).build())
+          .build();
+
+      client().deleteObjects(request).errors().forEach((s3Error -> {
+        LOG.warn("Error occurred during delete operation. {}: {}. {}", s3Error.code(), s3Error.message(),
+            s3Error.key());
+      }));
+    });
+  }
+
+  private ListObjectsV2Iterable internalListPrefix(String bucket, String keyPrefix) {
+    ListObjectsV2Request request = ListObjectsV2Request.builder().bucket(bucket).prefix(keyPrefix).build();

Review Comment:
   This was removed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5096: Add interface for FileIO prefix operations and implementations

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5096:
URL: https://github.com/apache/iceberg/pull/5096#discussion_r904202899


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIO.java:
##########
@@ -241,6 +245,33 @@ private List<String> deleteObjectsInBucket(String bucket, Collection<String> obj
     return Lists.newArrayList();
   }
 
+  @Override
+  public Iterable<FileInfo> listPrefix(String prefix) {
+    S3URI s3uri = new S3URI(prefix, awsProperties.s3BucketToAccessPointMapping());
+    ListObjectsV2Request request = ListObjectsV2Request.builder().bucket(s3uri.bucket()).prefix(s3uri.key()).build();
+
+    return () -> client().listObjectsV2Paginator(request).stream()
+        .flatMap(r -> r.contents().stream())
+        .map(o -> new FileInfo(
+            String.format("%s://%s/%s", s3uri.scheme(), s3uri.bucket(), o.key()),
+            o.size(), o.lastModified().toEpochMilli())).iterator();
+  }
+
+  /**
+   * This method provides a "best-effort" to delete all objects under the
+   * given prefix.
+   *

Review Comment:
   Missing `<p>`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5096: Add interface for FileIO prefix operations and implementations

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5096:
URL: https://github.com/apache/iceberg/pull/5096#discussion_r902051530


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIO.java:
##########
@@ -241,6 +246,52 @@ private List<String> deleteObjectsInBucket(String bucket, Collection<String> obj
     return Lists.newArrayList();
   }
 
+  @Override
+  public Iterator<FileInfo> listPrefix(String prefix) {
+    S3URI s3uri = new S3URI(prefix, awsProperties.s3BucketToAccessPointMapping());
+
+    return internalListPrefix(s3uri.bucket(), s3uri.key()).stream()
+        .flatMap(r -> r.contents().stream())
+        .map(o -> new FileInfo(o.key(), o.size(), o.lastModified().toEpochMilli())).iterator();
+  }
+
+  /**
+   * This method provides a "best-effort" to delete all objects under the
+   * given prefix.
+   *
+   * Bulk delete operations are used and no reattempt is made for deletes if
+   * they fail, but will log any individual objects that are not deleted as part
+   * of the bulk operation.
+   *
+   * @param prefix prefix to delete
+   */
+  @Override
+  public void deletePrefix(String prefix) {

Review Comment:
   After looking at the bulk delete code, I think we may want to delegate to it, or at least refactor to share.
   
   Deletes now support tag-based deletion, so this needs to detect whether tags should be used. Also, S3 supports bulk deletes, which would make this go much more quickly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5096: Add interface for FileIO prefix operations and implementations

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5096:
URL: https://github.com/apache/iceberg/pull/5096#discussion_r902049477


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIO.java:
##########
@@ -241,6 +246,52 @@ private List<String> deleteObjectsInBucket(String bucket, Collection<String> obj
     return Lists.newArrayList();
   }
 
+  @Override
+  public Stream<FileInfo> listPrefix(String prefix) {
+    S3URI s3uri = new S3URI(prefix, awsProperties.s3BucketToAccessPointMapping());
+
+    return internalListPrefix(s3uri.bucket(), s3uri.key()).stream()
+        .flatMap(r -> r.contents().stream())
+        .map(o -> new FileInfo(o.key(), o.size(), o.lastModified()));
+  }
+
+  /**
+   * This method provides a "best-effort" to delete all objects under the
+   * given prefix.
+   *
+   * Bulk delete operations are used and no reattempt is made for deletes if
+   * they fail, but will log any individual objects that are not deleted as part
+   * of the bulk operation.
+   *
+   * @param prefix prefix to delete
+   */
+  @Override
+  public void deletePrefix(String prefix) {
+    S3URI s3uri = new S3URI(prefix, awsProperties.s3BucketToAccessPointMapping());
+
+    internalListPrefix(s3uri.bucket(), s3uri.key()).stream().parallel().forEach(listing -> {

Review Comment:
   Rather than using `parallel`, can you use `Tasks` and a thread pool? Tasks is how we prefer to parallelize operations because it gives you much better control over error handling:
   
   ```java
   Tasks.foreach(internalListPrefix(s3uri.bucket(), s3uri.key()).stream())
       .suppressFailureWhenFinished()
       .executeWith(executorService)
       .retry(5)
       .exponentialBackoff(10, 60_000, 600_000, 2)
       .onlyRetryOn(ThrottledException.class, TooManyRequestsException.class)
       .onFailure((task, exc) -> LOG.warn("Failed to delete S3 key: %s", task.key(), exc))
       .run(o -> {
         ObjectIdentifier obj = ObjectIdentifier.builder().key(o.key()).build();
         ...
       });
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5096: Add interface for FileIO prefix operations and implementations

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5096:
URL: https://github.com/apache/iceberg/pull/5096#discussion_r902046602


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIO.java:
##########
@@ -241,6 +246,52 @@ private List<String> deleteObjectsInBucket(String bucket, Collection<String> obj
     return Lists.newArrayList();
   }
 
+  @Override
+  public Stream<FileInfo> listPrefix(String prefix) {
+    S3URI s3uri = new S3URI(prefix, awsProperties.s3BucketToAccessPointMapping());
+
+    return internalListPrefix(s3uri.bucket(), s3uri.key()).stream()
+        .flatMap(r -> r.contents().stream())
+        .map(o -> new FileInfo(o.key(), o.size(), o.lastModified()));

Review Comment:
   I think this needs to reconstruct a URI rather than returning just the key. Otherwise, it isn't really a "location" as we think of them in Iceberg. For example, when listing `"s3://bucket/prefix"`, this might produce `[FileInfo("prefix/file1.parquet", 10, ...), ...]`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] danielcweeks commented on a diff in pull request #5096: Add interface for FileIO prefix operations and implementations

Posted by GitBox <gi...@apache.org>.
danielcweeks commented on code in PR #5096:
URL: https://github.com/apache/iceberg/pull/5096#discussion_r904002585


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIO.java:
##########
@@ -241,6 +246,52 @@ private List<String> deleteObjectsInBucket(String bucket, Collection<String> obj
     return Lists.newArrayList();
   }
 
+  @Override
+  public Stream<FileInfo> listPrefix(String prefix) {
+    S3URI s3uri = new S3URI(prefix, awsProperties.s3BucketToAccessPointMapping());
+
+    return internalListPrefix(s3uri.bucket(), s3uri.key()).stream()
+        .flatMap(r -> r.contents().stream())
+        .map(o -> new FileInfo(o.key(), o.size(), o.lastModified()));
+  }
+
+  /**
+   * This method provides a "best-effort" to delete all objects under the
+   * given prefix.
+   *
+   * Bulk delete operations are used and no reattempt is made for deletes if
+   * they fail, but will log any individual objects that are not deleted as part
+   * of the bulk operation.
+   *
+   * @param prefix prefix to delete
+   */
+  @Override
+  public void deletePrefix(String prefix) {
+    S3URI s3uri = new S3URI(prefix, awsProperties.s3BucketToAccessPointMapping());
+
+    internalListPrefix(s3uri.bucket(), s3uri.key()).stream().parallel().forEach(listing -> {

Review Comment:
   This is now handled by the existing batch delete call.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5096: Add interface for FileIO prefix operations and implementations

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5096:
URL: https://github.com/apache/iceberg/pull/5096#discussion_r902056812


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIO.java:
##########
@@ -241,6 +246,52 @@ private List<String> deleteObjectsInBucket(String bucket, Collection<String> obj
     return Lists.newArrayList();
   }
 
+  @Override
+  public Iterator<FileInfo> listPrefix(String prefix) {
+    S3URI s3uri = new S3URI(prefix, awsProperties.s3BucketToAccessPointMapping());
+
+    return internalListPrefix(s3uri.bucket(), s3uri.key()).stream()
+        .flatMap(r -> r.contents().stream())
+        .map(o -> new FileInfo(o.key(), o.size(), o.lastModified().toEpochMilli())).iterator();
+  }
+
+  /**
+   * This method provides a "best-effort" to delete all objects under the
+   * given prefix.
+   *
+   * Bulk delete operations are used and no reattempt is made for deletes if
+   * they fail, but will log any individual objects that are not deleted as part
+   * of the bulk operation.
+   *
+   * @param prefix prefix to delete
+   */
+  @Override
+  public void deletePrefix(String prefix) {

Review Comment:
   The bulk delete accepts an Iterable, so I think it should work with streaming.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] danielcweeks commented on a diff in pull request #5096: Add interface for FileIO prefix operations and implementations

Posted by GitBox <gi...@apache.org>.
danielcweeks commented on code in PR #5096:
URL: https://github.com/apache/iceberg/pull/5096#discussion_r902047078


##########
api/src/main/java/org/apache/iceberg/io/FileInfo.java:
##########
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.io;
+
+import java.time.Instant;
+
+public class FileInfo {
+  private final String location;
+  private final long size;
+  private final Instant created;
+
+  public FileInfo(String location, long size, Instant created) {

Review Comment:
   No specific need (other than it removes confusion about the unit).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5096: Add interface for FileIO prefix operations and implementations

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5096:
URL: https://github.com/apache/iceberg/pull/5096#discussion_r902050353


##########
core/src/main/java/org/apache/iceberg/hadoop/HadoopFileIO.java:
##########
@@ -89,4 +97,62 @@ public Configuration getConf() {
   public void serializeConfWith(Function<Configuration, SerializableSupplier<Configuration>> confSerializer) {
     this.hadoopConf = confSerializer.apply(getConf());
   }
+
+  @Override
+  public Stream<FileInfo> listPrefix(String prefix) {
+    Path prefixToList = new Path(prefix);
+    FileSystem fs = Util.getFs(prefixToList, hadoopConf.get());
+
+    try {
+      return Streams.stream(new AdaptingIterator<>(fs.listFiles(prefixToList, true)))
+          .map(fileStatus -> new FileInfo(fileStatus.getPath().toString(), fileStatus.getLen(),
+              Instant.ofEpochMilli(fileStatus.getModificationTime())));
+    } catch (IOException e) {
+      throw new UncheckedIOException(e);
+    }
+  }
+
+  @Override
+  public void deletePrefix(String prefix) {
+    Path prefixToDelete = new Path(prefix);
+    FileSystem fs = Util.getFs(prefixToDelete, hadoopConf.get());
+
+    try {
+      fs.delete(prefixToDelete, true);

Review Comment:
   Same here: `fs.delete(prefixToDelete, true /* recursive */);`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5096: Add interface for FileIO prefix operations and implementations

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5096:
URL: https://github.com/apache/iceberg/pull/5096#discussion_r902046007


##########
api/src/main/java/org/apache/iceberg/io/FileInfo.java:
##########
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.io;
+
+import java.time.Instant;
+
+public class FileInfo {
+  private final String location;
+  private final long size;
+  private final Instant created;
+
+  public FileInfo(String location, long size, Instant created) {

Review Comment:
   In the rest of the code, we use a plain long value in milliseconds for timestamps like this rather than using `Instant`. Can you switch to `long` or is there a reason to use `Instant` instead?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] nastra commented on a diff in pull request #5096: Add interface for FileIO prefix operations and implementations

Posted by GitBox <gi...@apache.org>.
nastra commented on code in PR #5096:
URL: https://github.com/apache/iceberg/pull/5096#discussion_r902558622


##########
api/src/main/java/org/apache/iceberg/io/SupportsPrefixOperations.java:
##########
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.io;
+
+import java.util.Iterator;
+
+/**
+ * This interface is intended as an extension for FileIO implementations
+ * to provide additional prefix based operations that may be useful in
+ * performing supporting operations.
+ */
+public interface SupportsPrefixOperations {
+
+  /**
+   * Return a stream of all files under a prefix.

Review Comment:
   nit: stream -> iterator



##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIO.java:
##########
@@ -241,6 +246,54 @@ private List<String> deleteObjectsInBucket(String bucket, Collection<String> obj
     return Lists.newArrayList();
   }
 
+  @Override
+  public Iterator<FileInfo> listPrefix(String prefix) {
+    S3URI s3uri = new S3URI(prefix, awsProperties.s3BucketToAccessPointMapping());
+
+    return internalListPrefix(s3uri.bucket(), s3uri.key()).stream()
+        .flatMap(r -> r.contents().stream())
+        .map(o -> new FileInfo(
+            String.format("%s://%s/%s", s3uri.scheme(), s3uri.bucket(), o.key()),
+            o.size(), o.lastModified().toEpochMilli())).iterator();
+  }
+
+  /**
+   * This method provides a "best-effort" to delete all objects under the
+   * given prefix.
+   *
+   * Bulk delete operations are used and no reattempt is made for deletes if
+   * they fail, but will log any individual objects that are not deleted as part
+   * of the bulk operation.
+   *
+   * @param prefix prefix to delete
+   */
+  @Override
+  public void deletePrefix(String prefix) {
+    S3URI s3uri = new S3URI(prefix, awsProperties.s3BucketToAccessPointMapping());
+
+    internalListPrefix(s3uri.bucket(), s3uri.key()).stream().parallel().forEach(listing -> {
+      List<ObjectIdentifier> objectIdentifiers = listing.contents().stream()
+          .map(o -> ObjectIdentifier.builder().key(o.key()).build())
+          .collect(Collectors.toList());
+
+      DeleteObjectsRequest request = DeleteObjectsRequest.builder()
+          .bucket(s3uri.bucket())
+          .delete(Delete.builder().objects(objectIdentifiers).build())
+          .build();
+
+      client().deleteObjects(request).errors().forEach((s3Error -> {
+        LOG.warn("Error occurred during delete operation. {}: {}. {}", s3Error.code(), s3Error.message(),
+            s3Error.key());
+      }));
+    });
+  }
+
+  private ListObjectsV2Iterable internalListPrefix(String bucket, String keyPrefix) {
+    ListObjectsV2Request request = ListObjectsV2Request.builder().bucket(bucket).prefix(keyPrefix).build();

Review Comment:
   nit: maybe `request` could be inlined?



##########
api/src/main/java/org/apache/iceberg/io/SupportsPrefixOperations.java:
##########
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.io;
+
+import java.util.Iterator;
+
+/**
+ * This interface is intended as an extension for FileIO implementations
+ * to provide additional prefix based operations that may be useful in
+ * performing supporting operations.
+ */
+public interface SupportsPrefixOperations {
+
+  /**
+   * Return a stream of all files under a prefix.
+   * <p>
+   * Hierarchical file systems (e.g. HDFS) may impose additional restrictions
+   * like the prefix mush fully match a directory whereas key/value object

Review Comment:
   nit: mush -> must



##########
api/src/main/java/org/apache/iceberg/io/SupportsPrefixOperations.java:
##########
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.io;
+
+import java.util.Iterator;
+
+/**
+ * This interface is intended as an extension for FileIO implementations
+ * to provide additional prefix based operations that may be useful in
+ * performing supporting operations.
+ */
+public interface SupportsPrefixOperations {
+
+  /**
+   * Return a stream of all files under a prefix.
+   * <p>
+   * Hierarchical file systems (e.g. HDFS) may impose additional restrictions
+   * like the prefix mush fully match a directory whereas key/value object
+   * stores may allow for arbitrary prefixes.
+   *
+   * @param prefix prefix to list
+   * @return iterator of file information
+   */
+  Iterator<FileInfo> listPrefix(String prefix);
+
+  /**
+   * Delete all files under a prefix.
+   * <p>
+   * Hierarchical file systems (e.g. HDFS) may impose additional restrictions
+   * like the prefix mush fully match a directory whereas key/value object

Review Comment:
   nit: same here



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] danielcweeks commented on a diff in pull request #5096: Add interface for FileIO prefix operations and implementations

Posted by GitBox <gi...@apache.org>.
danielcweeks commented on code in PR #5096:
URL: https://github.com/apache/iceberg/pull/5096#discussion_r903033706


##########
api/src/main/java/org/apache/iceberg/io/SupportsPrefixOperations.java:
##########
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.io;
+
+import java.util.Iterator;
+
+/**
+ * This interface is intended as an extension for FileIO implementations
+ * to provide additional prefix based operations that may be useful in
+ * performing supporting operations.
+ */
+public interface SupportsPrefixOperations {
+
+  /**
+   * Return a stream of all files under a prefix.

Review Comment:
   Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] danielcweeks commented on a diff in pull request #5096: Add interface for FileIO prefix operations and implementations

Posted by GitBox <gi...@apache.org>.
danielcweeks commented on code in PR #5096:
URL: https://github.com/apache/iceberg/pull/5096#discussion_r902045781


##########
api/src/main/java/org/apache/iceberg/io/SupportsPrefixOperations.java:
##########
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.io;
+
+import java.util.stream.Stream;
+
+/**
+ * This interface is intended as an extension for FileIO implementations
+ * to provide additional prefix based operations that may be useful in
+ * performing supporting operations.
+ */
+public interface SupportsPrefixOperations {
+
+  /**
+   * Return a stream of all files under a prefix.
+   *
+   * Hierarchical file systems (e.g. HDFS) may impose additional restrictions
+   * like the prefix mush fully match a directory whereas key/value object
+   * stores may allow for arbitrary prefixes.
+   *
+   * @param prefix prefix to list
+   * @return stream of file information
+   */
+  Stream<FileInfo> listPrefix(String prefix);

Review Comment:
   We probably shouldn't expose `Iterable` here because this should really be "traversable once" and it gets overly complicated to fully support the contract.  Hadoop uses `Iterator`, but has its own implementation as you can see from the adaptor.  If your goal is to simply keep streams out of the API, we should probably fallback to `Iterator`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5096: Add interface for FileIO prefix operations and implementations

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5096:
URL: https://github.com/apache/iceberg/pull/5096#discussion_r902041138


##########
api/src/main/java/org/apache/iceberg/io/SupportsPrefixOperations.java:
##########
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.io;
+
+import java.util.stream.Stream;
+
+/**
+ * This interface is intended as an extension for FileIO implementations
+ * to provide additional prefix based operations that may be useful in
+ * performing supporting operations.
+ */
+public interface SupportsPrefixOperations {
+
+  /**
+   * Return a stream of all files under a prefix.
+   *
+   * Hierarchical file systems (e.g. HDFS) may impose additional restrictions
+   * like the prefix mush fully match a directory whereas key/value object
+   * stores may allow for arbitrary prefixes.
+   *
+   * @param prefix prefix to list
+   * @return stream of file information
+   */
+  Stream<FileInfo> listPrefix(String prefix);

Review Comment:
   We don't usually expose `Stream` in our APIs. Normally this would be an `Iterable<FileInfo>`. Is there a reason to use a stream instead?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] danielcweeks commented on a diff in pull request #5096: Add interface for FileIO prefix operations and implementations

Posted by GitBox <gi...@apache.org>.
danielcweeks commented on code in PR #5096:
URL: https://github.com/apache/iceberg/pull/5096#discussion_r902043327


##########
api/src/main/java/org/apache/iceberg/io/FileInfo.java:
##########
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.io;
+
+import java.time.Instant;
+
+public class FileInfo {
+  private final String location;
+  private final long size;
+  private final Instant created;
+
+  public FileInfo(String location, long size, Instant created) {
+    this.location = location;
+    this.size = size;
+    this.created = created;
+  }
+
+  public String getLocation() {

Review Comment:
   Yeah, sorry, those were IDE generated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] danielcweeks commented on a diff in pull request #5096: Add interface for FileIO prefix operations and implementations

Posted by GitBox <gi...@apache.org>.
danielcweeks commented on code in PR #5096:
URL: https://github.com/apache/iceberg/pull/5096#discussion_r902048896


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIO.java:
##########
@@ -241,6 +246,52 @@ private List<String> deleteObjectsInBucket(String bucket, Collection<String> obj
     return Lists.newArrayList();
   }
 
+  @Override
+  public Stream<FileInfo> listPrefix(String prefix) {
+    S3URI s3uri = new S3URI(prefix, awsProperties.s3BucketToAccessPointMapping());
+
+    return internalListPrefix(s3uri.bucket(), s3uri.key()).stream()
+        .flatMap(r -> r.contents().stream())
+        .map(o -> new FileInfo(o.key(), o.size(), o.lastModified()));

Review Comment:
   Good catch, the bucket is actually not returned as part of the object, but I think it can be reconstructed from the original path.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5096: Add interface for FileIO prefix operations and implementations

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5096:
URL: https://github.com/apache/iceberg/pull/5096#discussion_r902050251


##########
core/src/main/java/org/apache/iceberg/hadoop/HadoopFileIO.java:
##########
@@ -89,4 +97,62 @@ public Configuration getConf() {
   public void serializeConfWith(Function<Configuration, SerializableSupplier<Configuration>> confSerializer) {
     this.hadoopConf = confSerializer.apply(getConf());
   }
+
+  @Override
+  public Stream<FileInfo> listPrefix(String prefix) {
+    Path prefixToList = new Path(prefix);
+    FileSystem fs = Util.getFs(prefixToList, hadoopConf.get());
+
+    try {
+      return Streams.stream(new AdaptingIterator<>(fs.listFiles(prefixToList, true)))

Review Comment:
   Minor: when passing boolean arguments, we prefer to add an inline comment to explain them: `fs.listFiles(prefixToList, true /* recursive *);`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] danielcweeks commented on a diff in pull request #5096: Add interface for FileIO prefix operations and implementations

Posted by GitBox <gi...@apache.org>.
danielcweeks commented on code in PR #5096:
URL: https://github.com/apache/iceberg/pull/5096#discussion_r902053424


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIO.java:
##########
@@ -241,6 +246,52 @@ private List<String> deleteObjectsInBucket(String bucket, Collection<String> obj
     return Lists.newArrayList();
   }
 
+  @Override
+  public Stream<FileInfo> listPrefix(String prefix) {
+    S3URI s3uri = new S3URI(prefix, awsProperties.s3BucketToAccessPointMapping());
+
+    return internalListPrefix(s3uri.bucket(), s3uri.key()).stream()
+        .flatMap(r -> r.contents().stream())
+        .map(o -> new FileInfo(o.key(), o.size(), o.lastModified()));

Review Comment:
   Updated and made sure the original scheme was returned.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] danielcweeks commented on a diff in pull request #5096: Add interface for FileIO prefix operations and implementations

Posted by GitBox <gi...@apache.org>.
danielcweeks commented on code in PR #5096:
URL: https://github.com/apache/iceberg/pull/5096#discussion_r903035140


##########
aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIO.java:
##########
@@ -241,6 +246,52 @@ private List<String> deleteObjectsInBucket(String bucket, Collection<String> obj
     return Lists.newArrayList();
   }
 
+  @Override
+  public Iterator<FileInfo> listPrefix(String prefix) {
+    S3URI s3uri = new S3URI(prefix, awsProperties.s3BucketToAccessPointMapping());
+
+    return internalListPrefix(s3uri.bucket(), s3uri.key()).stream()
+        .flatMap(r -> r.contents().stream())
+        .map(o -> new FileInfo(o.key(), o.size(), o.lastModified().toEpochMilli())).iterator();
+  }
+
+  /**
+   * This method provides a "best-effort" to delete all objects under the
+   * given prefix.
+   *
+   * Bulk delete operations are used and no reattempt is made for deletes if
+   * they fail, but will log any individual objects that are not deleted as part
+   * of the bulk operation.
+   *
+   * @param prefix prefix to delete
+   */
+  @Override
+  public void deletePrefix(String prefix) {

Review Comment:
   I've updated to use iterable everywhere and we'll just reissue the underlying listing.
   
   The latest update reuses the existing bulk delete path (which is functionally equivalent, but avoid duplication and respects the tagging behavior).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org