You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/12/20 00:35:21 UTC

[GitHub] [iceberg] rdblue opened a new pull request, #6460: Core: Refactor ManifestListReadTask to avoid extra S3 calls

rdblue opened a new pull request, #6460:
URL: https://github.com/apache/iceberg/pull/6460

   We noticed throttling on `getLength` calls to S3 in the all manifests table. This avoids those unnecessary calls by no longer creating a fake `DataFile` or `FileScanTask`. Previously, the manifest list scan task wrapped a file scan task so that it could be read as a normal data file. However, since v2 the task has been converted to a `DataTask` to fill in inherited sequence numbers correctly. As a result, the wrapped `FileScanTask` is no longer needed.
   
   Not creating the wrapped `FileScanTask` make it possible to avoid creating a `DataFile` (unless one is for some reason requested by calling `file()` on the task).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6460: Core: Refactor ManifestListReadTask to avoid extra S3 calls

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on code in PR #6460:
URL: https://github.com/apache/iceberg/pull/6460#discussion_r1052814712


##########
core/src/main/java/org/apache/iceberg/AllManifestsTable.java:
##########
@@ -212,19 +200,22 @@ public CloseableIterable<StructLike> rows() {
         return CloseableIterable.transform(rowIterable, projection::wrap);
 
       } catch (IOException e) {
-        throw new RuntimeIOException(
-            e, "Cannot read manifest list file: %s", manifestListTask.file().path());
+        throw new RuntimeIOException(e, "Cannot read manifest list file: %s", manifestListLocation);
       }
     }
 
     @Override
     public DataFile file() {
-      return manifestListTask.file();
+      return DataFiles.builder(PartitionSpec.unpartitioned())

Review Comment:
   +1 for having a lazy field



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6460: Core: Refactor ManifestListReadTask to avoid extra S3 calls

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on code in PR #6460:
URL: https://github.com/apache/iceberg/pull/6460#discussion_r1053415013


##########
core/src/main/java/org/apache/iceberg/AllManifestsTable.java:
##########
@@ -168,31 +153,34 @@ static class ManifestListReadTask implements DataTask {
     private final FileIO io;
     private final Schema schema;
     private final Map<Integer, PartitionSpec> specs;
-    private final FileScanTask manifestListTask;
+    private final String manifestListLocation;
+    private final Expression residual;
     private final long referenceSnapshotId;
 
     ManifestListReadTask(
         FileIO io,
         Schema schema,
         Map<Integer, PartitionSpec> specs,
-        FileScanTask manifestListTask,
+        String manifestListLocation,
+        Expression residual,
         long referenceSnapshotId) {
       this.io = io;
       this.schema = schema;
       this.specs = specs;
-      this.manifestListTask = manifestListTask;
+      this.manifestListLocation = manifestListLocation;
+      this.residual = residual;
       this.referenceSnapshotId = referenceSnapshotId;
     }
 
     @Override
     public List<DeleteFile> deletes() {
-      return manifestListTask.deletes();
+      return ImmutableList.of();

Review Comment:
   Because we don't have any possible delete files here since we are just reading for manifests?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] nastra commented on a diff in pull request #6460: Core: Refactor ManifestListReadTask to avoid extra S3 calls

Posted by GitBox <gi...@apache.org>.
nastra commented on code in PR #6460:
URL: https://github.com/apache/iceberg/pull/6460#discussion_r1053246907


##########
core/src/main/java/org/apache/iceberg/AllManifestsTable.java:
##########
@@ -212,19 +200,22 @@ public CloseableIterable<StructLike> rows() {
         return CloseableIterable.transform(rowIterable, projection::wrap);
 
       } catch (IOException e) {
-        throw new RuntimeIOException(
-            e, "Cannot read manifest list file: %s", manifestListTask.file().path());
+        throw new RuntimeIOException(e, "Cannot read manifest list file: %s", manifestListLocation);
       }
     }
 
     @Override
     public DataFile file() {
-      return manifestListTask.file();
+      return DataFiles.builder(PartitionSpec.unpartitioned())

Review Comment:
   +1, would it make sense to store this in a memoized supplier so that `DataFile` only gets initialized when required?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6460: Core: Refactor ManifestListReadTask to avoid extra S3 calls

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on code in PR #6460:
URL: https://github.com/apache/iceberg/pull/6460#discussion_r1052809830


##########
core/src/main/java/org/apache/iceberg/AllManifestsTable.java:
##########
@@ -212,19 +200,22 @@ public CloseableIterable<StructLike> rows() {
         return CloseableIterable.transform(rowIterable, projection::wrap);
 
       } catch (IOException e) {
-        throw new RuntimeIOException(
-            e, "Cannot read manifest list file: %s", manifestListTask.file().path());
+        throw new RuntimeIOException(e, "Cannot read manifest list file: %s", manifestListLocation);
       }
     }
 
     @Override
     public DataFile file() {
-      return manifestListTask.file();
+      return DataFiles.builder(PartitionSpec.unpartitioned())

Review Comment:
   Optional: If this is called repeatedly maybe worth creating/caching this in the CTOR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6460: Core: Refactor ManifestListReadTask to avoid extra S3 calls

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6460:
URL: https://github.com/apache/iceberg/pull/6460#discussion_r1053448341


##########
core/src/main/java/org/apache/iceberg/AllManifestsTable.java:
##########
@@ -212,19 +200,22 @@ public CloseableIterable<StructLike> rows() {
         return CloseableIterable.transform(rowIterable, projection::wrap);
 
       } catch (IOException e) {
-        throw new RuntimeIOException(
-            e, "Cannot read manifest list file: %s", manifestListTask.file().path());
+        throw new RuntimeIOException(e, "Cannot read manifest list file: %s", manifestListLocation);
       }
     }
 
     @Override
     public DataFile file() {
-      return manifestListTask.file();
+      return DataFiles.builder(PartitionSpec.unpartitioned())

Review Comment:
   I don't think it gets called, but that could change. I'll update to a lazy field.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #6460: Core: Refactor ManifestListReadTask to avoid extra S3 calls

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #6460:
URL: https://github.com/apache/iceberg/pull/6460#issuecomment-1359711616

   Thanks for the reviews, everyone!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6460: Core: Refactor ManifestListReadTask to avoid extra S3 calls

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6460:
URL: https://github.com/apache/iceberg/pull/6460#discussion_r1053447615


##########
core/src/main/java/org/apache/iceberg/AllManifestsTable.java:
##########
@@ -168,31 +153,34 @@ static class ManifestListReadTask implements DataTask {
     private final FileIO io;
     private final Schema schema;
     private final Map<Integer, PartitionSpec> specs;
-    private final FileScanTask manifestListTask;
+    private final String manifestListLocation;
+    private final Expression residual;
     private final long referenceSnapshotId;
 
     ManifestListReadTask(
         FileIO io,
         Schema schema,
         Map<Integer, PartitionSpec> specs,
-        FileScanTask manifestListTask,
+        String manifestListLocation,
+        Expression residual,
         long referenceSnapshotId) {
       this.io = io;
       this.schema = schema;
       this.specs = specs;
-      this.manifestListTask = manifestListTask;
+      this.manifestListLocation = manifestListLocation;
+      this.residual = residual;
       this.referenceSnapshotId = referenceSnapshotId;
     }
 
     @Override
     public List<DeleteFile> deletes() {
-      return manifestListTask.deletes();
+      return ImmutableList.of();

Review Comment:
   That's right. We don't have delete files against manifests.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6460: Core: Refactor ManifestListReadTask to avoid extra S3 calls

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on code in PR #6460:
URL: https://github.com/apache/iceberg/pull/6460#discussion_r1052809830


##########
core/src/main/java/org/apache/iceberg/AllManifestsTable.java:
##########
@@ -212,19 +200,22 @@ public CloseableIterable<StructLike> rows() {
         return CloseableIterable.transform(rowIterable, projection::wrap);
 
       } catch (IOException e) {
-        throw new RuntimeIOException(
-            e, "Cannot read manifest list file: %s", manifestListTask.file().path());
+        throw new RuntimeIOException(e, "Cannot read manifest list file: %s", manifestListLocation);
       }
     }
 
     @Override
     public DataFile file() {
-      return manifestListTask.file();
+      return DataFiles.builder(PartitionSpec.unpartitioned())

Review Comment:
   Optional: If this is called repeatedly maybe worth making this in the CTOR.



##########
core/src/main/java/org/apache/iceberg/AllManifestsTable.java:
##########
@@ -234,12 +225,14 @@ public long start() {
 
     @Override
     public long length() {
-      return manifestListTask.length();
+      // return a generic length to avoid looking up the actual length
+      return 8192;
     }
 
     @Override
     public Expression residual() {
-      return manifestListTask.residual();
+      // this table is unpartitioned so the residual is always constant
+      return residual;

Review Comment:
   Is this ever used?  I wonder can we just return ResidualEvaluator.unpartitioned(Expressions.alwaysFalse()) or something



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #6460: Core: Refactor ManifestListReadTask to avoid extra S3 calls

Posted by GitBox <gi...@apache.org>.
rdblue merged PR #6460:
URL: https://github.com/apache/iceberg/pull/6460


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6460: Core: Refactor ManifestListReadTask to avoid extra S3 calls

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6460:
URL: https://github.com/apache/iceberg/pull/6460#discussion_r1053449241


##########
core/src/main/java/org/apache/iceberg/AllManifestsTable.java:
##########
@@ -234,12 +225,14 @@ public long start() {
 
     @Override
     public long length() {
-      return manifestListTask.length();
+      // return a generic length to avoid looking up the actual length
+      return 8192;
     }
 
     @Override
     public Expression residual() {
-      return manifestListTask.residual();
+      // this table is unpartitioned so the residual is always constant
+      return residual;

Review Comment:
   I don't think this is used, but I would still make it the correct residual. An unpartitioned table always has the same residual as the original filter, so this is correct.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org