You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/04/13 17:53:51 UTC

[GitHub] [iceberg] yyanyy commented on a change in pull request #2465: Core: add row identifier to schema

yyanyy commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r612661016



##########
File path: core/src/main/java/org/apache/iceberg/SchemaParser.java
##########
@@ -158,7 +169,7 @@ public static String toJson(Schema schema, boolean pretty) {
       if (pretty) {
         generator.useDefaultPrettyPrinter();
       }
-      toJson(schema.asStruct(), generator);
+      toJson(schema.asStruct(), schema.schemaId(), schema.rowIdentifiers(), generator);

Review comment:
       Haven't reviewed the entire PR yet, just for this specific comment - the reason was mentioned in https://github.com/apache/iceberg/pull/2096#discussion_r575716542. Basically I think there are a lot of places where this method is called for serialization/deserialization of the schema as well as preserve the schema to some file metadata, and at that point sometimes `schema` object is already transformed/projected so it is no longer an existing schema of the table, and thus shouldn't have schema ID. Since schema ID is in theory only useful when retrieved directly from table metadata to get the right schema version for time travel, I think it's better to not explicitly write a schema Id except for writing to table metadata file, to avoid persisting an incorrect version. 
   
   For the case of supporting row identifier to schema, I wonder if we want to do something similar - since row identifier is mostly useful during writing (to check table to get the default row identifier for delete) and not to read (since data/delete file should have this preserved in another place), we might want to have a separate `toJson`/`fromJson` to handle the row identifier parsing separately, and normally when handling schema we don't include this information; i.e. to make the internal model of rowId similar to what we have in #2354, except that serialization/deserialization is different. 
   
   @rdblue since you involved in both conversations, could you please take a look? Thank you!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org