You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/08/25 12:12:49 UTC

[GitHub] [iceberg] ConeyLiu opened a new pull request, #5632: Core: Avoid reading ManifestFile when create ManifestReader

ConeyLiu opened a new pull request, #5632:
URL: https://github.com/apache/iceberg/pull/5632

   This patch aims to reduce the time of Iceberg task planning. We notice the task plan could not benefit from ParallelIterator a lot when there are many manifest files to read. The problem is the ManifestReader needs to read the manifest file to get the PartitionSpec which is not needed for most cases (because the ManifestFile object has the PartitionSpec Id and the Table has the mapping from PartitionSpec Id to PartitionSpec). It needs to read all these manifest files in serial when the following code of ParallelIterator is initing. And this is very slow when the HDFS traffic is busy. 
   
   https://github.com/apache/iceberg/blob/dbb8a404f6632a55acb36e949f0e7b84b643cede/core/src/main/java/org/apache/iceberg/util/ParallelIterable.java#L62
   
   After this change, we could get several times (depending on the number of driver threads and count of manifest files) time reduction.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #5632: Core: Avoid reading ManifestFile when create ManifestReader

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on code in PR #5632:
URL: https://github.com/apache/iceberg/pull/5632#discussion_r956853567


##########
core/src/main/java/org/apache/iceberg/ManifestReader.java:
##########
@@ -101,20 +100,34 @@ private String fileClass() {
 
   protected ManifestReader(
       InputFile file,
+      int specId,
       Map<Integer, PartitionSpec> specsById,
       InheritableMetadata inheritableMetadata,
       FileType content) {
     this.file = file;
     this.inheritableMetadata = inheritableMetadata;
     this.content = content;
 
+    if (specsById != null) {
+      this.spec = specsById.get(specId);
+      Preconditions.checkArgument(
+          spec != null, "Could not find PartitionSpec for specId: %s", specId);
+    } else {
+      this.spec = readPartitionSpec(file);
+    }
+
+    this.fileSchema = new Schema(DataFile.getType(spec.partitionType()).fields());
+  }
+
+  private PartitionSpec readPartitionSpec(InputFile inputFile) {

Review Comment:
   Updated to `static` and changed to use `specsById` if it is present.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5632: Core: Avoid reading ManifestFile when create ManifestReader

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5632:
URL: https://github.com/apache/iceberg/pull/5632#discussion_r956791002


##########
core/src/main/java/org/apache/iceberg/ManifestReader.java:
##########
@@ -101,20 +100,34 @@ private String fileClass() {
 
   protected ManifestReader(
       InputFile file,
+      int specId,
       Map<Integer, PartitionSpec> specsById,
       InheritableMetadata inheritableMetadata,
       FileType content) {
     this.file = file;
     this.inheritableMetadata = inheritableMetadata;
     this.content = content;
 
+    if (specsById != null) {
+      this.spec = specsById.get(specId);
+      Preconditions.checkArgument(
+          spec != null, "Could not find PartitionSpec for specId: %s", specId);
+    } else {
+      this.spec = readPartitionSpec(file);
+    }
+
+    this.fileSchema = new Schema(DataFile.getType(spec.partitionType()).fields());
+  }
+
+  private PartitionSpec readPartitionSpec(InputFile inputFile) {

Review Comment:
   This should still pass in `specsById`. There is no reason to change the behavior of loading. If the spec is already loaded and defined, it should be used rather than parsed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5632: Core: Avoid reading ManifestFile when create ManifestReader

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5632:
URL: https://github.com/apache/iceberg/pull/5632#discussion_r956790762


##########
core/src/main/java/org/apache/iceberg/ManifestReader.java:
##########
@@ -101,20 +100,34 @@ private String fileClass() {
 
   protected ManifestReader(
       InputFile file,
+      int specId,
       Map<Integer, PartitionSpec> specsById,
       InheritableMetadata inheritableMetadata,
       FileType content) {
     this.file = file;
     this.inheritableMetadata = inheritableMetadata;
     this.content = content;
 
+    if (specsById != null) {
+      this.spec = specsById.get(specId);
+      Preconditions.checkArgument(
+          spec != null, "Could not find PartitionSpec for specId: %s", specId);
+    } else {
+      this.spec = readPartitionSpec(file);
+    }
+
+    this.fileSchema = new Schema(DataFile.getType(spec.partitionType()).fields());
+  }
+
+  private PartitionSpec readPartitionSpec(InputFile inputFile) {

Review Comment:
   Can this be `static`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on pull request #5632: Core: Avoid reading ManifestFile when create ManifestReader

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on PR #5632:
URL: https://github.com/apache/iceberg/pull/5632#issuecomment-1227915439

   I think this is an interesting find, it makes sense to me on the first reading, so partition-spec-id on the header of manifestFile is always the same as partition-spec-id of the manifest-file entry?   Looks that way from the writer.  Curious for @rdblue context why we read it from the header
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #5632: Core: Avoid reading ManifestFile when create ManifestReader

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on code in PR #5632:
URL: https://github.com/apache/iceberg/pull/5632#discussion_r956853038


##########
core/src/main/java/org/apache/iceberg/ManifestReader.java:
##########
@@ -101,20 +100,34 @@ private String fileClass() {
 
   protected ManifestReader(
       InputFile file,
+      int specId,
       Map<Integer, PartitionSpec> specsById,
       InheritableMetadata inheritableMetadata,
       FileType content) {
     this.file = file;
     this.inheritableMetadata = inheritableMetadata;
     this.content = content;
 
+    if (specsById != null) {
+      this.spec = specsById.get(specId);
+      Preconditions.checkArgument(
+          spec != null, "Could not find PartitionSpec for specId: %s", specId);

Review Comment:
   Thanks for the advice



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5632: Core: Avoid reading ManifestFile when create ManifestReader

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5632:
URL: https://github.com/apache/iceberg/pull/5632#discussion_r956790500


##########
core/src/main/java/org/apache/iceberg/ManifestReader.java:
##########
@@ -101,20 +100,34 @@ private String fileClass() {
 
   protected ManifestReader(
       InputFile file,
+      int specId,
       Map<Integer, PartitionSpec> specsById,
       InheritableMetadata inheritableMetadata,
       FileType content) {
     this.file = file;
     this.inheritableMetadata = inheritableMetadata;
     this.content = content;
 
+    if (specsById != null) {
+      this.spec = specsById.get(specId);
+      Preconditions.checkArgument(
+          spec != null, "Could not find PartitionSpec for specId: %s", specId);

Review Comment:
   In Iceberg, we don't use class or variable names in error messages. The people reading these messages are not Iceberg developers so it is better to just use normal words. Here, this should be `"Cannot find partition spec for id: %s"`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #5632: Core: Avoid reading ManifestFile when create ManifestReader

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on code in PR #5632:
URL: https://github.com/apache/iceberg/pull/5632#discussion_r1007807328


##########
core/src/main/java/org/apache/iceberg/ManifestReader.java:
##########
@@ -126,15 +138,12 @@ protected ManifestReader(
       specId = Integer.parseInt(specProperty);
     }
 
-    if (specsById != null) {
-      this.spec = specsById.get(specId);
+    if (specsById != null && specsById.containsKey(specId)) {

Review Comment:
   Updated the code, I think the most concerns come from the guard code: `specsById.contains(specId)`. I removed it and match the existing implementation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho merged pull request #5632: Core: Avoid reading ManifestFile when create ManifestReader

Posted by GitBox <gi...@apache.org>.
szehon-ho merged PR #5632:
URL: https://github.com/apache/iceberg/pull/5632


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #5632: Core: Avoid reading ManifestFile when create ManifestReader

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on code in PR #5632:
URL: https://github.com/apache/iceberg/pull/5632#discussion_r977903306


##########
core/src/main/java/org/apache/iceberg/ManifestReader.java:
##########
@@ -126,15 +138,12 @@ protected ManifestReader(
       specId = Integer.parseInt(specProperty);
     }
 
-    if (specsById != null) {
-      this.spec = specsById.get(specId);
+    if (specsById != null && specsById.containsKey(specId)) {

Review Comment:
   Not sure why we need to add this  extra case here?  This spec id is for sure read from file in this branch (not the one you passed in), so all assumptions should still hold right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on pull request #5632: Core: Avoid reading ManifestFile when create ManifestReader

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on PR #5632:
URL: https://github.com/apache/iceberg/pull/5632#issuecomment-1302833958

   Merged, thanks @ConeyLiu and @rdblue for review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #5632: Core: Avoid reading ManifestFile when create ManifestReader

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on PR #5632:
URL: https://github.com/apache/iceberg/pull/5632#issuecomment-1303486785

   Thanks @szehon-ho @rdblue @nastra @zinking 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #5632: Core: Avoid reading ManifestFile when create ManifestReader

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on code in PR #5632:
URL: https://github.com/apache/iceberg/pull/5632#discussion_r978438340


##########
core/src/main/java/org/apache/iceberg/ManifestReader.java:
##########
@@ -126,15 +138,12 @@ protected ManifestReader(
       specId = Integer.parseInt(specProperty);
     }
 
-    if (specsById != null) {
-      this.spec = specsById.get(specId);
+    if (specsById != null && specsById.containsKey(specId)) {

Review Comment:
   Yes, this part of the code is the same as the above part. Make sure we could get the spec from the `specId` and `specsById`. Here we check again because the `specId` is read from the file.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5632: Core: Avoid reading ManifestFile when create ManifestReader

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5632:
URL: https://github.com/apache/iceberg/pull/5632#discussion_r956790600


##########
core/src/main/java/org/apache/iceberg/ManifestReader.java:
##########
@@ -101,20 +100,34 @@ private String fileClass() {
 
   protected ManifestReader(
       InputFile file,
+      int specId,
       Map<Integer, PartitionSpec> specsById,
       InheritableMetadata inheritableMetadata,
       FileType content) {
     this.file = file;
     this.inheritableMetadata = inheritableMetadata;
     this.content = content;
 
+    if (specsById != null) {
+      this.spec = specsById.get(specId);
+      Preconditions.checkArgument(

Review Comment:
   Why does this fail? I think it should warn and fall back to reading the spec. We don't want to needlessly introduce places where bad metadata can cause a table to be unreadable.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5632: Core: Avoid reading ManifestFile when create ManifestReader

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5632:
URL: https://github.com/apache/iceberg/pull/5632#discussion_r984084652


##########
core/src/main/java/org/apache/iceberg/BaseRewriteManifests.java:
##########
@@ -164,7 +164,13 @@ private ManifestFile copyManifest(ManifestFile manifest) {
     InputFile toCopy = ops.io().newInputFile(manifest.path());
     OutputFile newFile = newManifestOutput();
     return ManifestFiles.copyRewriteManifest(
-        current.formatVersion(), toCopy, specsById, newFile, snapshotId(), summaryBuilder);
+        current.formatVersion(),
+        manifest.partitionSpecId(),
+        toCopy,

Review Comment:
   `toCopy` should be just after format version.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #5632: Core: Avoid reading ManifestFile when create ManifestReader

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on PR #5632:
URL: https://github.com/apache/iceberg/pull/5632#issuecomment-1228120281

   > so partition-spec-id on the header of manifestFile is always the same as partition-spec-id of the manifest-file entry
   
   From here: https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/ManifestFile.java#L34. I think it should be. And the partition spec id stored in the Avro header is got from the given partition spec.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #5632: Core: Avoid reading ManifestFile when create ManifestReader

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #5632:
URL: https://github.com/apache/iceberg/pull/5632#issuecomment-1229569319

   This mostly looks good, but there are some unnecessary behavior changes to fix. Thanks, @ConeyLiu!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] zinking commented on pull request #5632: Core: Avoid reading ManifestFile when create ManifestReader

Posted by GitBox <gi...@apache.org>.
zinking commented on PR #5632:
URL: https://github.com/apache/iceberg/pull/5632#issuecomment-1248099574

   @rdblue @szehon-ho  can we get this merged? thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #5632: Core: Avoid reading ManifestFile when create ManifestReader

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on code in PR #5632:
URL: https://github.com/apache/iceberg/pull/5632#discussion_r956853143


##########
core/src/main/java/org/apache/iceberg/ManifestReader.java:
##########
@@ -101,20 +100,34 @@ private String fileClass() {
 
   protected ManifestReader(
       InputFile file,
+      int specId,
       Map<Integer, PartitionSpec> specsById,
       InheritableMetadata inheritableMetadata,
       FileType content) {
     this.file = file;
     this.inheritableMetadata = inheritableMetadata;
     this.content = content;
 
+    if (specsById != null) {
+      this.spec = specsById.get(specId);
+      Preconditions.checkArgument(

Review Comment:
   Updated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #5632: Core: Avoid reading ManifestFile when create ManifestReader

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on code in PR #5632:
URL: https://github.com/apache/iceberg/pull/5632#discussion_r1007807328


##########
core/src/main/java/org/apache/iceberg/ManifestReader.java:
##########
@@ -126,15 +138,12 @@ protected ManifestReader(
       specId = Integer.parseInt(specProperty);
     }
 
-    if (specsById != null) {
-      this.spec = specsById.get(specId);
+    if (specsById != null && specsById.containsKey(specId)) {

Review Comment:
   Updated the code, I think the most concerns come from the guard code: `specsById.contains(specId)`. I removed it and keep it to match the existing implementation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #5632: Core: Avoid reading ManifestFile when create ManifestReader

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on code in PR #5632:
URL: https://github.com/apache/iceberg/pull/5632#discussion_r1008680666


##########
core/src/main/java/org/apache/iceberg/ManifestReader.java:
##########
@@ -101,20 +100,32 @@ private String fileClass() {
 
   protected ManifestReader(
       InputFile file,
+      int specId,
       Map<Integer, PartitionSpec> specsById,
       InheritableMetadata inheritableMetadata,
       FileType content) {
     this.file = file;
     this.inheritableMetadata = inheritableMetadata;
     this.content = content;
 
+    if (specsById != null) {
+      this.spec = specsById.get(specId);
+    } else {
+      this.spec = readPartitionSpec(file);
+    }
+
+    this.fileSchema = new Schema(DataFile.getType(spec.partitionType()).fields());
+  }
+
+  private <T extends ContentFile<T>> PartitionSpec readPartitionSpec(InputFile inputFile) {
+    Map<String, String> metadata;
     try {

Review Comment:
   Yes, the existing one.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #5632: Core: Avoid reading ManifestFile when create ManifestReader

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on PR #5632:
URL: https://github.com/apache/iceberg/pull/5632#issuecomment-1227177690

   Hi @rdblue @kbendick @szehon-ho could you help to review this when you are free? Thanks a lot.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #5632: Core: Avoid reading ManifestFile when create ManifestReader

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on code in PR #5632:
URL: https://github.com/apache/iceberg/pull/5632#discussion_r979320542


##########
core/src/main/java/org/apache/iceberg/ManifestReader.java:
##########
@@ -126,15 +138,12 @@ protected ManifestReader(
       specId = Integer.parseInt(specProperty);
     }
 
-    if (specsById != null) {
-      this.spec = specsById.get(specId);
+    if (specsById != null && specsById.containsKey(specId)) {

Review Comment:
   I see it now.  Yea I'm not entirely sure we need all these checks if specsById contains specId, now unfortunately it gets really complicated.  But I dont have the background to say we don't need it.
   
   I do notice here, there is a bit inefficiency in case specsById == null.  Here you will read specId from file.  It's  unnecessarily , as there is no map to look it up, instead it should go directly to read the spec from file.  In previous code it would go straight to read spec from file.
   
   Your code:
   ```
   if (specsById != null && specsById.contains(specId)) {
      return specsById.get(specId)
   else
      specId = readSpecId()
      if (specsById != null && specsById.contains(specId)
         return specById.get(specId)
   ```
   
   More optimal code:
   ```
   if (specsById == null)
      return readSpec
   
   if (specsById.contains(specId)
      return specsById.get(specId)
   else 
      specId = readSpecId()
      if (specsById.contains(specId)
         return specsById.get(specId)
      else
         return readSpec()
   ```
   
   
   I hope I understand it right.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on pull request #5632: Core: Avoid reading ManifestFile when create ManifestReader

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on PR #5632:
URL: https://github.com/apache/iceberg/pull/5632#issuecomment-1299324287

   Note: will merge it soon if there's no further comments.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #5632: Core: Avoid reading ManifestFile when create ManifestReader

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on code in PR #5632:
URL: https://github.com/apache/iceberg/pull/5632#discussion_r977903306


##########
core/src/main/java/org/apache/iceberg/ManifestReader.java:
##########
@@ -126,15 +138,12 @@ protected ManifestReader(
       specId = Integer.parseInt(specProperty);
     }
 
-    if (specsById != null) {
-      this.spec = specsById.get(specId);
+    if (specsById != null && specsById.containsKey(specId)) {

Review Comment:
   Not sure why we need to add this  extra case here?  This spec id is for sure read from file (not the one you passed in), so all assumptions should still hold right?



##########
core/src/main/java/org/apache/iceberg/ManifestReader.java:
##########
@@ -101,20 +100,33 @@ private String fileClass() {
 
   protected ManifestReader(
       InputFile file,
+      int specId,
       Map<Integer, PartitionSpec> specsById,
       InheritableMetadata inheritableMetadata,
       FileType content) {
     this.file = file;
     this.inheritableMetadata = inheritableMetadata;
     this.content = content;
 
+    if (specsById != null && specsById.containsKey(specId)) {

Review Comment:
   Curious, specsById.containsKey(specId) should always be true, shouldnt it?  Does it break if we remove this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a diff in pull request #5632: Core: Avoid reading ManifestFile when create ManifestReader

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on code in PR #5632:
URL: https://github.com/apache/iceberg/pull/5632#discussion_r978433178


##########
core/src/main/java/org/apache/iceberg/ManifestReader.java:
##########
@@ -101,20 +100,33 @@ private String fileClass() {
 
   protected ManifestReader(
       InputFile file,
+      int specId,
       Map<Integer, PartitionSpec> specsById,
       InheritableMetadata inheritableMetadata,
       FileType content) {
     this.file = file;
     this.inheritableMetadata = inheritableMetadata;
     this.content = content;
 
+    if (specsById != null && specsById.containsKey(specId)) {

Review Comment:
   It should be always true. It is a guard code because we have checked the specsById. I can remove it if you are concerned about it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #5632: Core: Avoid reading ManifestFile when create ManifestReader

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on code in PR #5632:
URL: https://github.com/apache/iceberg/pull/5632#discussion_r1008565020


##########
core/src/main/java/org/apache/iceberg/ManifestReader.java:
##########
@@ -101,20 +100,32 @@ private String fileClass() {
 
   protected ManifestReader(
       InputFile file,
+      int specId,
       Map<Integer, PartitionSpec> specsById,
       InheritableMetadata inheritableMetadata,
       FileType content) {
     this.file = file;
     this.inheritableMetadata = inheritableMetadata;
     this.content = content;
 
+    if (specsById != null) {
+      this.spec = specsById.get(specId);
+    } else {
+      this.spec = readPartitionSpec(file);
+    }
+
+    this.fileSchema = new Schema(DataFile.getType(spec.partitionType()).fields());
+  }
+
+  private <T extends ContentFile<T>> PartitionSpec readPartitionSpec(InputFile inputFile) {
+    Map<String, String> metadata;
     try {

Review Comment:
   Interesting, noticed there looks duplicate layer of try-finally, but guess it's there before.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org