You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/12/01 06:51:16 UTC

[GitHub] [iceberg] shardulm94 opened a new pull request, #6327: ORC: Fix error when projecting nested indentity partition column

shardulm94 opened a new pull request, #6327:
URL: https://github.com/apache/iceberg/pull/6327

   Closes #4604
   
   This is an alternate and arguably simpler implementation to #4599. The issue is that ORC read path did not support projecting nested structs which have just the partition columns selected. As part of the ORC read path, we drop constant fields from the projected schema before passing it to the ORC file reader. Example:
   ```
   Schema readSchemaWithoutConstantAndMetadataFields =
           TypeUtil.selectNot(
               readSchema, Sets.union(idToConstant.keySet(), MetadataColumns.metadataFieldIds()));
   ```
   This step also results in dropping of structs which contain just the partition columns as they now become empty.
   
   #4599 tries to fix this by not dropping nested struct containing partition columns, thus reading the partition values from the file. This PR instead takes a different approach by preserving empty struct when dropping constant fields. This allows the existing constant handling in the ORC read path to work as expected even for nested partition fields.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #6327: ORC: Fix error when projecting nested indentity partition column

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on PR #6327:
URL: https://github.com/apache/iceberg/pull/6327#issuecomment-1424699284

   @shardulm94, yes. I'll put it in the queue. Thanks for pinging me.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] shardulm94 commented on a diff in pull request #6327: ORC: Fix error when projecting nested indentity partition column

Posted by GitBox <gi...@apache.org>.
shardulm94 commented on code in PR #6327:
URL: https://github.com/apache/iceberg/pull/6327#discussion_r1042628482


##########
orc/src/main/java/org/apache/iceberg/orc/ORCSchemaUtil.java:
##########
@@ -442,4 +445,23 @@ static TypeDescription applyNameMapping(TypeDescription orcSchema, NameMapping n
   public static Map<Integer, String> idToOrcName(Schema schema) {
     return TypeUtil.visit(schema, new IdToOrcName());
   }
+
+  /**
+   * Returns a {@link Schema} which has constant fields and metadata fields removed from the
+   * provided schema. This utility can be used to create a "read schema" which can be passed to the
+   * ORC file reader and hence avoiding deserialization and memory costs associated with column
+   * values already available through Iceberg metadata.
+   *
+   * <p>NOTE: This method, unlike {@link TypeUtil#selectNot(Schema, Set)}, preserves empty structs
+   * (caused due to a struct having all constant fields) so that Iceberg ORC readers can later add
+   * constant fields in these structs

Review Comment:
   Kept only the first part of the comment now



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] shardulm94 commented on pull request #6327: ORC: Fix error when projecting nested indentity partition column

Posted by GitBox <gi...@apache.org>.
shardulm94 commented on PR #6327:
URL: https://github.com/apache/iceberg/pull/6327#issuecomment-1333293921

   @tprelle @kbendick Can you help take a look? This is an alternate implementation to #4599.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] autumnust commented on a diff in pull request #6327: ORC: Fix error when projecting nested indentity partition column

Posted by GitBox <gi...@apache.org>.
autumnust commented on code in PR #6327:
URL: https://github.com/apache/iceberg/pull/6327#discussion_r1042526245


##########
orc/src/main/java/org/apache/iceberg/orc/ORCSchemaUtil.java:
##########
@@ -442,4 +445,23 @@ static TypeDescription applyNameMapping(TypeDescription orcSchema, NameMapping n
   public static Map<Integer, String> idToOrcName(Schema schema) {
     return TypeUtil.visit(schema, new IdToOrcName());
   }
+
+  /**
+   * Returns a {@link Schema} which has constant fields and metadata fields removed from the
+   * provided schema. This utility can be used to create a "read schema" which can be passed to the
+   * ORC file reader and hence avoiding deserialization and memory costs associated with column
+   * values already available through Iceberg metadata.
+   *
+   * <p>NOTE: This method, unlike {@link TypeUtil#selectNot(Schema, Set)}, preserves empty structs
+   * (caused due to a struct having all constant fields) so that Iceberg ORC readers can later add
+   * constant fields in these structs

Review Comment:
   nit: Doesn't have to mention the cause for empty structs as there might be other scenarios like intentional empty struct as part of schema ? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] shardulm94 commented on a diff in pull request #6327: ORC: Fix error when projecting nested indentity partition column

Posted by GitBox <gi...@apache.org>.
shardulm94 commented on code in PR #6327:
URL: https://github.com/apache/iceberg/pull/6327#discussion_r1041512368


##########
api/src/main/java/org/apache/iceberg/types/TypeUtil.java:
##########
@@ -136,7 +136,16 @@ public static Types.StructType selectNot(Types.StructType struct, Set<Integer> f
   }
 
   public static Schema selectNot(Schema schema, Set<Integer> fieldIds) {
-    Set<Integer> projectedIds = getIdsInternal(schema.asStruct(), false);
+    return selectNot(schema, fieldIds, false);
+  }
+
+  public static Schema selectNotPreserveEmptyStructs(Schema schema, Set<Integer> fieldIds) {

Review Comment:
   That makes sense. Since ORC is the only format using this behavior, I now moved this code to `OrcSchemaUtil`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] shardulm94 commented on pull request #6327: ORC: Fix error when projecting nested indentity partition column

Posted by GitBox <gi...@apache.org>.
shardulm94 commented on PR #6327:
URL: https://github.com/apache/iceberg/pull/6327#issuecomment-1341617676

   @aokolnychyi Mind taking a look as well? This PR paves the way to fix https://github.com/apache/iceberg/issues/3192


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] shardulm94 commented on pull request #6327: ORC: Fix error when projecting nested indentity partition column

Posted by "shardulm94 (via GitHub)" <gi...@apache.org>.
shardulm94 commented on PR #6327:
URL: https://github.com/apache/iceberg/pull/6327#issuecomment-1401365132

   Hey @rdblue, can you take a look at this? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] nastra commented on a diff in pull request #6327: ORC: Fix error when projecting nested indentity partition column

Posted by GitBox <gi...@apache.org>.
nastra commented on code in PR #6327:
URL: https://github.com/apache/iceberg/pull/6327#discussion_r1040867729


##########
api/src/main/java/org/apache/iceberg/types/TypeUtil.java:
##########
@@ -136,7 +136,16 @@ public static Types.StructType selectNot(Types.StructType struct, Set<Integer> f
   }
 
   public static Schema selectNot(Schema schema, Set<Integer> fieldIds) {
-    Set<Integer> projectedIds = getIdsInternal(schema.asStruct(), false);
+    return selectNot(schema, fieldIds, false);
+  }
+
+  public static Schema selectNotPreserveEmptyStructs(Schema schema, Set<Integer> fieldIds) {

Review Comment:
   maybe we should just have a single utility method for reading the schema without constants and without metadata fields rather than doing it in multiple places? Then we wouldn't need to expose `selectNotPreserveEmptyStructs`. wdyt?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] nastra commented on pull request #6327: ORC: Fix error when projecting nested indentity partition column

Posted by GitBox <gi...@apache.org>.
nastra commented on PR #6327:
URL: https://github.com/apache/iceberg/pull/6327#issuecomment-1339188679

   > As part of the ORC read path, we drop constant fields from the projected schema before passing it to the ORC file reader.
   
   @shardulm94 do we know why dropping those constant fields is required in the read path? I see that this was introduced by https://github.com/apache/iceberg/pull/1191 but I'm lacking some historical context and would appreciate your input here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] tprelle commented on pull request #6327: ORC: Fix error when projecting nested indentity partition column

Posted by GitBox <gi...@apache.org>.
tprelle commented on PR #6327:
URL: https://github.com/apache/iceberg/pull/6327#issuecomment-1338004394

   hi @shardulm94, it's seems less intrusive and better than https://github.com/apache/iceberg/pull/4599


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] shardulm94 commented on pull request #6327: ORC: Fix error when projecting nested indentity partition column

Posted by GitBox <gi...@apache.org>.
shardulm94 commented on PR #6327:
URL: https://github.com/apache/iceberg/pull/6327#issuecomment-1340022074

   > @shardulm94 do we know why dropping those constant fields is required in the read path? I see that this was introduced by #1191 but I'm lacking some historical context and would appreciate your input here.
   
   Here we are construct the "readSchema" i.e. the schema that should be passed to the underlying file reader e.g. ORC when reading the data file. We already have the value of constant columns per file available via Iceberg, so reading them from the file is unnecessary. Doing so, we avoid the read & deserialization cost for the constant columns and also avoid the memory footprint to store the column vector in memory.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] shardulm94 commented on a diff in pull request #6327: ORC: Fix error when projecting nested indentity partition column

Posted by GitBox <gi...@apache.org>.
shardulm94 commented on code in PR #6327:
URL: https://github.com/apache/iceberg/pull/6327#discussion_r1041512368


##########
api/src/main/java/org/apache/iceberg/types/TypeUtil.java:
##########
@@ -136,7 +136,16 @@ public static Types.StructType selectNot(Types.StructType struct, Set<Integer> f
   }
 
   public static Schema selectNot(Schema schema, Set<Integer> fieldIds) {
-    Set<Integer> projectedIds = getIdsInternal(schema.asStruct(), false);
+    return selectNot(schema, fieldIds, false);
+  }
+
+  public static Schema selectNotPreserveEmptyStructs(Schema schema, Set<Integer> fieldIds) {

Review Comment:
   That makes sense. Since ORC is the only format using this behavior, I now added this code to `OrcSchemaUtil`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org