You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/01/16 02:01:12 UTC

[GitHub] [iceberg] yyanyy opened a new pull request #2096: Core: add schema id and schemas to table metadata

yyanyy opened a new pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096


   This PR adds `current-schema-id` and `schemas` to table metadata. It also introduces a wrapper around schema to associate table schema with id. 
   The reason to not add ID directly into `Schema` is that currently schema creation is widely used as a convenient method for a lot of actions that don't involve a "real" table schema. 
   
   Next steps:
   - adds `schema-id` to snapshot logs/history entries and populate
   - use history entries and `schemas` to look up the right schema in time travel queries; this may mean to add `schemas()` in `Table` API
   - spec update
   - add schema id to `historyTable` (will be mentioned later) 
   
   Open questions:
   1. Current approach writes the newly introduced fields to JSON by default even in v1, and there could be forward/backward compatibility concern with the current approach: if a new writer writes (with ID 0) and then update (with ID 1) schema, metadata will store both schema ID 0 and 1, and default ID will be 1. Then an old writer reads and writes the metadata for whatever change, which drops schema 0 in metadata. Then when a new writer picks up the metadata again, the original schema 0 is gone, and 1 is replaced with ID 0. This could result in schema ID consistency issue among different writers.    
      - Since ID is introduced in this PR, there is no metadata table that exposes these inconsistent schema IDs, so we may not have this problem for now. However when we start to add schema ID to `historyTable` metadata table, at different time ID 0 could mean different things in this history table. We could potentially workaround this by only exposing `schemaID` field in `historyTable` only for v2 tables, or mention this caveat on spec. 
      - Alternatively we can expose these two fields only in v2 table, and time travel queries in v1 always rely on looking at old table metadata files as implemented in #1508. This could mean in future any new changes that may depend on schema ID cannot be introduced in v1. 
   2. Do we want to add a `last-assigned-schema-id` to table metadata? My answer would be yes, for a similar reason mentioned in [this comment](https://github.com/apache/iceberg/pull/2089#issuecomment-761184851)
   3. I think currently when replacing a table, earlier history entries/`snapshotLog` will be reset to empty (second to last argument in [here](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/TableMetadata.java#L712)). Is this expected? do we want to fix this as a separate issue?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r570695071



##########
File path: api/src/main/java/org/apache/iceberg/types/TypeUtil.java
##########
@@ -157,6 +157,21 @@ public static Schema assignFreshIds(Schema schema, NextID nextId) {
         .fields());
   }
 
+  /**
+   * Assigns fresh ids from the {@link NextID nextId function} for all fields in a schema.

Review comment:
       Thanks for the catch! Will update




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r567323955



##########
File path: api/src/main/java/org/apache/iceberg/Schema.java
##########
@@ -328,6 +329,24 @@ private Schema internalSelect(Collection<String> names, boolean caseSensitive) {
     return TypeUtil.select(this, selected);
   }
 
+  @Override
+  public boolean equals(Object other) {
+    if (this == other) {
+      return true;
+    } else if (!(other instanceof Schema)) {
+      return false;
+    }
+
+    Schema that = (Schema) other;
+    return struct.equals(that.struct) &&
+        Objects.equals(aliasToId, that.aliasToId);
+  }
+
+  @Override
+  public int hashCode() {
+    return Objects.hash(struct, aliasToId);
+  }

Review comment:
       Why add `equals` and `hashCode` to `Schema`? The reason why we didn't already have this implemented is that you can compare the type using `schema.asStruct().equals(...)`. And we don't want to define whether two schemas are "equal" based on the aliases because the same schema can have different original aliases.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#issuecomment-783773073


   Looks great, thanks for updating this, @yyanyy!
   
   There were a few minor comments in tests, but we can pick up those fixes later. I think overall this looks great!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r570695154



##########
File path: core/src/main/java/org/apache/iceberg/SchemaParser.java
##########
@@ -134,6 +143,10 @@ static void toJson(Type type, JsonGenerator generator) throws IOException {
     }
   }
 
+  public static void toJsonWithVersion(Schema schema, JsonGenerator generator) throws IOException {

Review comment:
       Good point, will update




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r567324641



##########
File path: api/src/main/java/org/apache/iceberg/VersionedSchema.java
##########
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Objects;
+
+/**
+ * A wrapper around Schema to include its schema ID
+ */
+public class VersionedSchema implements Serializable {

Review comment:
       I think I agree with @jackye1995 here. We could add the ID to `Schema` and just have it use either `null` or `0` if there is no schema ID. I think we can use `0`.
   
   If there is no schema ID, then there will not be a `schemas` list, and all of the snapshots would not have a schema ID. So if an older writer updates a table, then we can expect the current schema ID to be dropped, older schemas to be dropped, snapshot schema IDs to be dropped, etc. So it is consistent with having just one schema. That should be safe for v1 because we won't leave any dangling IDs that could reference the wrong schema later.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r575716935



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -451,10 +470,24 @@ public TableMetadata withUUID() {
       return this;
     } else {
       return new TableMetadata(null, formatVersion, UUID.randomUUID().toString(), location,
-          lastSequenceNumber, lastUpdatedMillis, lastColumnId, schema, defaultSpecId, specs, lastAssignedPartitionId,
-          defaultSortOrderId, sortOrders, properties, currentSnapshotId, snapshots, snapshotLog,
-          addPreviousFile(file, lastUpdatedMillis));
+          lastSequenceNumber, lastUpdatedMillis, lastColumnId, currentSchemaId, schemas, defaultSpecId, specs,
+          lastAssignedPartitionId, defaultSortOrderId, sortOrders, properties,
+          currentSnapshotId, snapshots, snapshotLog, addPreviousFile(file, lastUpdatedMillis));
+    }
+  }
+
+  private int reuseOrCreateNewSchemaId(Schema newSchema) {

Review comment:
       Can you move this down to the bottom with other private helper methods?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r575717535



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadataParser.java
##########
@@ -262,7 +277,41 @@ static TableMetadata fromJson(FileIO io, InputFile file, JsonNode node) {
       lastSequenceNumber = TableMetadata.INITIAL_SEQUENCE_NUMBER;
     }
     int lastAssignedColumnId = JsonUtil.getInt(LAST_COLUMN_ID, node);
-    Schema schema = SchemaParser.fromJson(node.get(SCHEMA));
+
+    List<Schema> schemas;
+    int currentSchemaId;
+    Schema schema = null;
+
+    JsonNode schemaArray = node.get(SCHEMAS);
+    if (schemaArray != null) {
+      Preconditions.checkArgument(schemaArray.isArray(),
+          "Cannot parse schemas from non-array: %s", schemaArray);
+      // current schema ID is required when the schema array is present
+      currentSchemaId = JsonUtil.getInt(CURRENT_SCHEMA_ID, node);
+
+      // parse the schema array
+      ImmutableList.Builder<Schema> builder = ImmutableList.builder();
+      for (JsonNode schemaNode : schemaArray) {
+        Schema current = SchemaParser.fromJsonWithId(schemaNode);
+        if (current.schemaId() == currentSchemaId) {
+          schema = current;
+        }
+        builder.add(current);
+      }
+
+      Preconditions.checkArgument(schema != null,
+          "Cannot find schema with %s=%s from %s", CURRENT_SCHEMA_ID, currentSchemaId, SCHEMAS);
+
+      schemas = builder.build();
+
+    } else {
+      Preconditions.checkArgument(formatVersion == 1,
+          "%s must exist in format v%s", SCHEMAS, formatVersion);
+
+      currentSchemaId = TableMetadata.INITIAL_SCHEMA_ID;

Review comment:
       I think this should be `schema.schemaId()` instead of referencing the constant.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#issuecomment-772102052


   > Do you have any suggestion on if we want to add schema id to the history table?
   
   I don't think that we want to add it to history. The snapshot log should refer to snapshots by ID and we can get the schema easily enough from there. For the actual metadata table, we don't want to expose schema ID because that's something that users should ideally never need to know much about. We've avoided adding IDs in other similar situations. For example, you can rewrite manifests for a specific partition spec, by ID. But we didn't include that in the stored procedure because we haven't exposed the partition spec ID anywhere to SQL users.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r568292231



##########
File path: api/src/main/java/org/apache/iceberg/Schema.java
##########
@@ -328,6 +329,24 @@ private Schema internalSelect(Collection<String> names, boolean caseSensitive) {
     return TypeUtil.select(this, selected);
   }
 
+  @Override
+  public boolean equals(Object other) {
+    if (this == other) {
+      return true;
+    } else if (!(other instanceof Schema)) {
+      return false;
+    }
+
+    Schema that = (Schema) other;
+    return struct.equals(that.struct) &&
+        Objects.equals(aliasToId, that.aliasToId);
+  }
+
+  @Override
+  public int hashCode() {
+    return Objects.hash(struct, aliasToId);
+  }

Review comment:
       In the new change I introduced, we can reuse the old schema ID during schema update, if the exact same schema already exists in the schema list. I think in this case we may need to confirm both schemas' aliases should be the same? Or do you think there shouldn't be a case when schema are the same while aliases are not?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r575717651



##########
File path: core/src/test/java/org/apache/iceberg/TestTableMetadata.java
##########
@@ -100,8 +101,15 @@ public void testJsonConversion() throws Exception {
         .add(new SnapshotLogEntry(currentSnapshot.timestampMillis(), currentSnapshot.snapshotId()))
         .build();
 
+    Schema schema = new Schema(6,
+        Types.NestedField.required(10, "x", Types.StringType.get()));
+

Review comment:
       Nit: extra newline.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r569016246



##########
File path: api/src/main/java/org/apache/iceberg/Schema.java
##########
@@ -328,6 +329,24 @@ private Schema internalSelect(Collection<String> names, boolean caseSensitive) {
     return TypeUtil.select(this, selected);
   }
 
+  @Override
+  public boolean equals(Object other) {
+    if (this == other) {
+      return true;
+    } else if (!(other instanceof Schema)) {
+      return false;
+    }
+
+    Schema that = (Schema) other;
+    return struct.equals(that.struct) &&
+        Objects.equals(aliasToId, that.aliasToId);
+  }
+
+  @Override
+  public int hashCode() {
+    return Objects.hash(struct, aliasToId);
+  }

Review comment:
       I don't think that should use aliases. Aliases are populated when converting from a file schema to Iceberg, so that we can look up the file's column names from IDs and vice versa. That information is used by integration and not by Iceberg itself. The aliases are not serialized when writing a schema into Iceberg metadata.
   
   For the purposes of maintaining the schema list, we can ignore aliases and just use `schema.asStruct()`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r575715362



##########
File path: core/src/main/java/org/apache/iceberg/SchemaParser.java
##########
@@ -134,6 +143,10 @@ static void toJson(Type type, JsonGenerator generator) throws IOException {
     }
   }
 
+  public static void toJsonWithId(Schema schema, JsonGenerator generator) throws IOException {

Review comment:
       This is only called in `TableMetadataParser` and I don't see why there should be two `toJson` methods. It isn't a problem for the v1 metadata to be written with the schema ID because extra fields are ignored. So I think that this method isn't needed and `toJson` should always pass the ID to the struct writer.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
yyanyy commented on pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#issuecomment-771320010


   > > Do we want to add a last-assigned-schema-id to table metadata?
   > 
   > My initial thought was no because the schema IDs are completely local to a metadata file. We can reassign all IDs from one file to the next and as long as they are internally consistent we are okay. The IDs are not embedded in separate files that might get improperly used.
   > 
   > But, the Nessie comment you pointed to is a good one to think about. I think it is still okay because Nessie merges would either be fast-forward, or would re-apply changes and create entirely new metadata files. So the internal consistency argument still holds.
   > 
   > > Do we want to assign special schema IDs to metadata tables, to avoid potential collision on schema IDs?
   > 
   > I don't think so.
   > 
   > > Do we want to add snapshot-id to only history entries, or Snapshot interface?
   > 
   > I would add it to Snapshot, not just History. Snapshot is what will be used for time travel queries. I would avoid adding too much metadata to history.
   > 
   > > I think currently when replacing a table, earlier history entries/snapshotLog will be reset to empty (second to last argument in here). Is this expected? do we want to fix this as a separate issue?
   > 
   > I think we should follow up and keep the history. I think the reason for this is because we previously didn't have any compatibility across schemas (IDs were completely reassigned, so time travel would be incorrect). But that's fixed now.
   
   Thank you for the review and the input! I'll create a separate issue to mention the history problem and link here. Do you have any suggestion on if we want to add schema id to the history table? 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r575716542



##########
File path: core/src/main/java/org/apache/iceberg/SchemaParser.java
##########
@@ -253,4 +266,15 @@ public static Schema fromJson(String json) {
       }
     });
   }
+
+  public static Schema fromJsonWithId(int schemaId, JsonNode json) {

Review comment:
       I'd prefer if all of the parsing methods returned the correct ID instead of having some that use a version that ignores the ID.
   
   I think you chose to use two methods because there are some situations where the ID wasn't written and you don't want the schema to have the wrong ID. But, because the `fromJson` method uses the `Schema` constructor that does not pass the ID, the schemas will have the default ID, 0, anyway.
   
   As long as the schemas have an ID, I think there is no need to have multiple parser methods, both with and without ID. Also, we need to make sure that the IDs are consistent when reading the schema from files that were written before IDs. That should just be manifest files, where the schema is embedded in the header.
   
   For manifests, we need to add the schema ID when writing (https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/ManifestWriter.java#L190). And when reading, we will need to update how the spec is constructed. If the spec is loaded by ID then it's fine. Otherwise we should load the schema by ID to parse the spec, and if that doesn't work we should parse both the schema and the spec (https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/ManifestReader.java#L115-L120). I think we should probably also try to take the parsed schema and find the correct one from table metadata so that we don't have an incorrect ID anywhere.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r575716817



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -451,10 +470,24 @@ public TableMetadata withUUID() {
       return this;
     } else {
       return new TableMetadata(null, formatVersion, UUID.randomUUID().toString(), location,
-          lastSequenceNumber, lastUpdatedMillis, lastColumnId, schema, defaultSpecId, specs, lastAssignedPartitionId,
-          defaultSortOrderId, sortOrders, properties, currentSnapshotId, snapshots, snapshotLog,
-          addPreviousFile(file, lastUpdatedMillis));
+          lastSequenceNumber, lastUpdatedMillis, lastColumnId, currentSchemaId, schemas, defaultSpecId, specs,
+          lastAssignedPartitionId, defaultSortOrderId, sortOrders, properties,
+          currentSnapshotId, snapshots, snapshotLog, addPreviousFile(file, lastUpdatedMillis));
+    }
+  }
+
+  private int reuseOrCreateNewSchemaId(Schema newSchema) {
+    // if the schema already exists, use its id; otherwise use the highest id + 1
+    int newSchemaId = currentSchemaId;

Review comment:
       Shouldn't this be `currentSchemaId + 1`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r559023819



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadataParser.java
##########
@@ -242,6 +255,7 @@ public static TableMetadata read(FileIO io, InputFile file) {
     }
   }
 
+  @SuppressWarnings("checkstyle:CyclomaticComplexity")

Review comment:
       I remember this also shows up when adding primary key #2010 , I think we should probably find some way to refactor the code a bit.

##########
File path: api/src/main/java/org/apache/iceberg/VersionedSchema.java
##########
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Objects;
+
+/**
+ * A wrapper around Schema to include its schema ID
+ */
+public class VersionedSchema implements Serializable {

Review comment:
       The biggest question I have for this PR is whether we should have this wrapped class for adding schema ID, or we should just add schema ID to the `Schema` class itself, maybe as an optional `Integer`. Because Iceberg have its own jackson parser for serdes, it is not hard to ignore the `schemaId` field or set a default for it for backwards compatibility. What is the benefit of this `VersionedSchema` approach?
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
yyanyy commented on pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#issuecomment-775597041


   > Works fine for me so far, tested a few operations among a few clusters of different versions.
   > 
   > One behavior I see is that, if an old v1 writer is somehow still used to create a new table version after v2 writer is used, all schema history will be dropped, so new schema written by v2 writer will have schema id starting from 0 again, which can break time travel query.
   > 
   > In theory, once a table is written by a v2 writer, it should never be touched by a v1 writer, but we cannot prevent people from doing that in real life, especially when the organization is large and people use different versions of the same software all the time. So I feel we still need something like a `last-assigned-schema-id`, so that the schemas don't accidentally share the same id and we can backfill the earlier schema and their IDs if necessary. But `last-assigned-schema-id` field should not as a part of the table metadata field because it will be dropped by an old writer. One way to go is to maybe put it in table properties, but there might be better ways. What do you think?
   
   Thanks for putting efforts into testing this! 
   
   I think the problem you described seem to be similar to what I had in my question 1, and I think we might be fine with that since the IDs within the metadata file should always be consistent, and we don't expose them in metadata table as Ryan mentioned above. I think it wouldn't break time travel query, as after the old writer writes the data, the schema ID with the snapshot got dropped, in which case we will likely need to rely on #1508 for time travel? 
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r560442043



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadataParser.java
##########
@@ -242,6 +255,7 @@ public static TableMetadata read(FileIO io, InputFile file) {
     }
   }
 
+  @SuppressWarnings("checkstyle:CyclomaticComplexity")

Review comment:
       Yeah I imagine there would be some conflict when either of the PRs got merged. I think the main problem with refactoring this is that a lot of the things it parsed should be parsed together/validate each other; e.g. parse of schema array has to come together with current schema id since both will either present at the same time, or derived from a different field, and same goes for sort order/partition spec. Since Java only allows returning a single object per method, we either have to refactor the code a bit and ensure the logic flow doesn't change and is readable, or let the new methods to return things like `tuple<Integer, List<Schema>>` that I'm not sure which way is cleaner. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#issuecomment-778682629


   Thanks @yyanyy! I think this is getting close. The main thing is not exposing extra id-aware methods in the parser and also reading/writing manifests with IDs. Adding schema ID to snapshots will be a follow-up, right?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#issuecomment-770291719


   > Do we want to add a last-assigned-schema-id to table metadata?
   
   My initial thought was no because the schema IDs are completely local to a metadata file. We can reassign all IDs from one file to the next and as long as they are internally consistent we are okay. The IDs are not embedded in separate files that might get improperly used.
   
   But, the Nessie comment you pointed to is a good one to think about. I think it is still okay because Nessie merges would either be fast-forward, or would re-apply changes and create entirely new metadata files. So the internal consistency argument still holds.
   
   > Do we want to assign special schema IDs to metadata tables, to avoid potential collision on schema IDs?
   
   I don't think so.
   
   > Do we want to add snapshot-id to only history entries, or Snapshot interface?
   
   I would add it to Snapshot, not just History. Snapshot is what will be used for time travel queries. I would avoid adding too much metadata to history.
   
   > I think currently when replacing a table, earlier history entries/snapshotLog will be reset to empty (second to last argument in here). Is this expected? do we want to fix this as a separate issue?
   
   I think we should follow up and keep the history. I think the reason for this is because we previously didn't have any compatibility across schemas (IDs were completely reassigned, so time travel would be incorrect). But that's fixed now.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r578066096



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -451,10 +470,24 @@ public TableMetadata withUUID() {
       return this;
     } else {
       return new TableMetadata(null, formatVersion, UUID.randomUUID().toString(), location,
-          lastSequenceNumber, lastUpdatedMillis, lastColumnId, schema, defaultSpecId, specs, lastAssignedPartitionId,
-          defaultSortOrderId, sortOrders, properties, currentSnapshotId, snapshots, snapshotLog,
-          addPreviousFile(file, lastUpdatedMillis));
+          lastSequenceNumber, lastUpdatedMillis, lastColumnId, currentSchemaId, schemas, defaultSpecId, specs,
+          lastAssignedPartitionId, defaultSortOrderId, sortOrders, properties,
+          currentSnapshotId, snapshots, snapshotLog, addPreviousFile(file, lastUpdatedMillis));
+    }
+  }
+
+  private int reuseOrCreateNewSchemaId(Schema newSchema) {
+    // if the schema already exists, use its id; otherwise use the highest id + 1
+    int newSchemaId = currentSchemaId;

Review comment:
       In the for loop below it will check if each existing schema's ID is greater or equal to this `newSchemaId` and +1 if so, so I think at the end of the method it will be currentSchemaId + 1 if no reuse happens? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r580684458



##########
File path: core/src/test/java/org/apache/iceberg/TestTableMetadata.java
##########
@@ -599,4 +639,81 @@ public void testUpdateSortOrder() {
     Assert.assertEquals("Should be nulls first",
         NullOrder.NULLS_FIRST, sortedByX.sortOrder().fields().get(0).nullOrder());
   }
+
+  @Test
+  public void testUpdateSchema() {
+    Schema schema = new Schema(0,
+        Types.NestedField.required(1, "y", Types.LongType.get(), "comment")
+    );
+    TableMetadata freshTable = TableMetadata.newTableMetadata(
+        schema, PartitionSpec.unpartitioned(), null, ImmutableMap.of());
+    Assert.assertEquals("Should use TableMetadata.INITIAL_SCHEMA_ID for current schema id",
+        TableMetadata.INITIAL_SCHEMA_ID, freshTable.currentSchemaId());
+    assertSameSchemaList(ImmutableList.of(schema), freshTable.schemas());
+    Assert.assertEquals("Should have expected schema upon return",
+        schema.asStruct(), freshTable.schema().asStruct());
+    Assert.assertEquals("Should return expected last column id", 1, freshTable.lastColumnId());
+
+    // update schema
+    Schema schema2 = new Schema(
+        Types.NestedField.required(1, "y", Types.LongType.get(), "comment"),
+        Types.NestedField.required(2, "x", Types.StringType.get())
+    );
+    TableMetadata twoSchemasTable = freshTable.updateSchema(schema2, 2);
+    Assert.assertEquals("Should have current schema id as 1",
+        1, twoSchemasTable.currentSchemaId());
+    assertSameSchemaList(ImmutableList.of(schema, new Schema(1, schema2.columns())),
+        twoSchemasTable.schemas());
+    Assert.assertEquals("Should have expected schema upon return",
+        schema2.asStruct(), twoSchemasTable.schema().asStruct());
+    Assert.assertEquals("Should return expected last column id", 2, twoSchemasTable.lastColumnId());
+
+    // update schema with the the same schema and last column ID as current shouldn't cause change
+    Schema sameSchema2 = new Schema(
+        Types.NestedField.required(1, "y", Types.LongType.get(), "comment"),
+        Types.NestedField.required(2, "x", Types.StringType.get())
+    );
+    TableMetadata sameSchemaTable = twoSchemasTable.updateSchema(sameSchema2, 2);
+    Assert.assertEquals("Should return same table metadata",
+        twoSchemasTable, sameSchemaTable);
+
+    // update schema with the the same schema and different last column ID as current should create a new table
+    TableMetadata differentColumnIdTable = sameSchemaTable.updateSchema(sameSchema2, 3);
+    Assert.assertEquals("Should have current schema id as 1",
+        1, differentColumnIdTable.currentSchemaId());
+    assertSameSchemaList(ImmutableList.of(schema, new Schema(1, schema2.columns())),
+        differentColumnIdTable.schemas());
+    Assert.assertEquals("Should have expected schema upon return",
+        schema2.asStruct(), differentColumnIdTable.schema().asStruct());
+    Assert.assertEquals("Should return expected last column id",
+        3, differentColumnIdTable.lastColumnId());
+
+    // update schema with old schema does not change schemas
+    TableMetadata revertSchemaTable = differentColumnIdTable.updateSchema(schema, 3);
+    Assert.assertEquals("Should have current schema id as 0",
+        0, revertSchemaTable.currentSchemaId());
+    assertSameSchemaList(ImmutableList.of(schema, new Schema(1, schema2.columns())),
+        revertSchemaTable.schemas());
+    Assert.assertEquals("Should have expected schema upon return",
+        schema.asStruct(), revertSchemaTable.schema().asStruct());
+    Assert.assertEquals("Should return expected last column id",
+        3, revertSchemaTable.lastColumnId());
+
+    // create new schema will use the largest schema id + 1
+    Schema schema3 = new Schema(
+        Types.NestedField.required(2, "y", Types.LongType.get(), "comment"),
+        Types.NestedField.required(4, "x", Types.StringType.get()),
+        Types.NestedField.required(6, "z", Types.IntegerType.get())
+    );
+    TableMetadata threeSchemaTable = revertSchemaTable.updateSchema(schema3, 3);

Review comment:
       Nit: the last assigned ID should be 6 since `z` has ID 6.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r578065645



##########
File path: core/src/test/java/org/apache/iceberg/TestTableMetadataSerialization.java
##########
@@ -78,4 +82,21 @@ public void testSerialization() throws Exception {
         Lists.transform(result.snapshots(), Snapshot::snapshotId));
     Assert.assertEquals("History should match", meta.snapshotLog(), result.snapshotLog());
   }
+
+  private void assertSameSchemaList(List<Schema> list1, List<Schema> list2) {

Review comment:
       I didn't find a test util class in core, but looks like there's a `TestHelpers` in api module so I'll put it there. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r575717562



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadataParser.java
##########
@@ -262,7 +277,41 @@ static TableMetadata fromJson(FileIO io, InputFile file, JsonNode node) {
       lastSequenceNumber = TableMetadata.INITIAL_SEQUENCE_NUMBER;
     }
     int lastAssignedColumnId = JsonUtil.getInt(LAST_COLUMN_ID, node);
-    Schema schema = SchemaParser.fromJson(node.get(SCHEMA));
+
+    List<Schema> schemas;
+    int currentSchemaId;
+    Schema schema = null;

Review comment:
       Why set this to null but not the two others?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r575716962



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -463,14 +496,30 @@ public TableMetadata updateSchema(Schema newSchema, int newLastColumnId) {
     // rebuild all of the partition specs and sort orders for the new current schema
     List<PartitionSpec> updatedSpecs = Lists.transform(specs, spec -> updateSpecSchema(newSchema, spec));
     List<SortOrder> updatedSortOrders = Lists.transform(sortOrders, order -> updateSortOrderSchema(newSchema, order));
+
+    int newSchemaId = reuseOrCreateNewSchemaId(newSchema);
+    if (currentSchemaId == newSchemaId && newLastColumnId == lastColumnId) {
+      // the new spec and last column Id is already current and no change is needed
+      return this;
+    }
+
+    ImmutableList.Builder<Schema> builder = ImmutableList.<Schema>builder()
+        .addAll(schemas);

Review comment:
       Can this be on the previous line?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r575716644



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -116,8 +117,9 @@ static TableMetadata newTableMetadata(Schema schema,
 
     return new TableMetadata(null, formatVersion, UUID.randomUUID().toString(), location,
         INITIAL_SEQUENCE_NUMBER, System.currentTimeMillis(),
-        lastColumnId.get(), freshSchema, INITIAL_SPEC_ID, ImmutableList.of(freshSpec),
-        freshSpec.lastAssignedFieldId(), freshSortOrderId, ImmutableList.of(freshSortOrder),
+        lastColumnId.get(), freshSchema.schemaId(), ImmutableList.of(freshSchema),
+        INITIAL_SPEC_ID, ImmutableList.of(freshSpec), freshSpec.lastAssignedFieldId(),

Review comment:
       I like that you use `freshSchema.schemaId()` instead of using `INITIAL_SCHEMA_ID` again. Should we do the same for `INITIAL_SPEC_ID`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#issuecomment-775665530


   > I think the problem you described seem to be similar to what I had in my question 1, and I think we might be fine with that since the IDs within the metadata file should always be consistent, and we don't expose them in metadata table as Ryan mentioned above. I think it wouldn't break time travel query, as after the old writer writes the data, the schema ID with the snapshot got dropped, in which case we will likely need to rely on #1508 for time travel?
   
   I see, that makes sense.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r570459069



##########
File path: api/src/main/java/org/apache/iceberg/types/TypeUtil.java
##########
@@ -157,6 +157,21 @@ public static Schema assignFreshIds(Schema schema, NextID nextId) {
         .fields());
   }
 
+  /**
+   * Assigns fresh ids from the {@link NextID nextId function} for all fields in a schema.

Review comment:
       nit: function should be outside the bracket

##########
File path: core/src/main/java/org/apache/iceberg/SchemaParser.java
##########
@@ -134,6 +143,10 @@ static void toJson(Type type, JsonGenerator generator) throws IOException {
     }
   }
 
+  public static void toJsonWithVersion(Schema schema, JsonGenerator generator) throws IOException {

Review comment:
       nit: should this be called `toJsonWithId`, just to match the `fromJsonWithId` below?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r575717370



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -744,9 +804,9 @@ public TableMetadata upgradeToFormatVersion(int newFormatVersion) {
     }
 
     return new TableMetadata(null, newFormatVersion, uuid, location,
-        lastSequenceNumber, System.currentTimeMillis(), lastColumnId, schema, defaultSpecId, specs,
-        lastAssignedPartitionId, defaultSortOrderId, sortOrders, properties, currentSnapshotId,
-        snapshots, snapshotLog, addPreviousFile(file, lastUpdatedMillis));
+        lastSequenceNumber, System.currentTimeMillis(), lastColumnId, currentSchemaId, schemas, defaultSpecId, specs,
+        lastAssignedPartitionId, defaultSortOrderId, sortOrders, properties, currentSnapshotId, snapshots, snapshotLog,
+        addPreviousFile(file, lastUpdatedMillis));

Review comment:
       Same here. Could you remove unnecessary reformatting?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r567326067



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadataParser.java
##########
@@ -242,6 +255,7 @@ public static TableMetadata read(FileIO io, InputFile file) {
     }
   }
 
+  @SuppressWarnings("checkstyle:CyclomaticComplexity")

Review comment:
       I think we could have a method that parses schemas and a separate one that returns the current schema ID. That said, I'm fine with suppressing the complexity warning for now.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r578065391



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -499,13 +548,14 @@ public TableMetadata updatePartitionSpec(PartitionSpec newPartitionSpec) {
     }
 
     return new TableMetadata(null, formatVersion, uuid, location,
-        lastSequenceNumber, System.currentTimeMillis(), lastColumnId, schema, newDefaultSpecId,
+        lastSequenceNumber, System.currentTimeMillis(), lastColumnId, currentSchemaId, schemas, newDefaultSpecId,
         builder.build(), Math.max(lastAssignedPartitionId, newPartitionSpec.lastAssignedFieldId()),
-        defaultSortOrderId, sortOrders, properties,
-        currentSnapshotId, snapshots, snapshotLog, addPreviousFile(file, lastUpdatedMillis));
+        defaultSortOrderId, sortOrders, properties, currentSnapshotId, snapshots, snapshotLog,
+        addPreviousFile(file, lastUpdatedMillis));

Review comment:
       Sorry about these, I got confused on which changes were mine when resolving merge conflicts, I'll be more careful next time. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] wypoon commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
wypoon commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r589966136



##########
File path: core/src/test/java/org/apache/iceberg/TestTableMetadata.java
##########
@@ -223,6 +237,7 @@ public static String toJsonWithoutSpecList(TableMetadata metadata) {
       generator.writeNumberField(LAST_UPDATED_MILLIS, metadata.lastUpdatedMillis());
       generator.writeNumberField(LAST_COLUMN_ID, metadata.lastColumnId());
 
+      // mimic an old writer by writing only schema and not the current ID or schema list
       generator.writeFieldName(SCHEMA);
       SchemaParser.toJson(metadata.schema(), generator);

Review comment:
       I think that to properly test backward compatibility, the schema in this `schema` field should not have a `schema-id`, so this should be `SchemaParser.toJson(metadata.schema().asStruct(), generator);`

##########
File path: core/src/test/java/org/apache/iceberg/TestTableMetadata.java
##########
@@ -100,8 +101,14 @@ public void testJsonConversion() throws Exception {
         .add(new SnapshotLogEntry(currentSnapshot.timestampMillis(), currentSnapshot.snapshotId()))
         .build();
 
+    Schema schema = new Schema(6,
+        Types.NestedField.required(10, "x", Types.StringType.get()));
+
     TableMetadata expected = new TableMetadata(null, 2, UUID.randomUUID().toString(), TEST_LOCATION,
-        SEQ_NO, System.currentTimeMillis(), 3, TEST_SCHEMA, 5, ImmutableList.of(SPEC_5), SPEC_5.lastAssignedFieldId(),
+        SEQ_NO, System.currentTimeMillis(), 3,
+        7, ImmutableList.of(TEST_SCHEMA, schema),
+        5, ImmutableList.of(SPEC_5), SPEC_5.lastAssignedFieldId(),
+

Review comment:
       Nit: extraneous blank line.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#issuecomment-772102052


   > Do you have any suggestion on if we want to add schema id to the history table?
   
   I don't think that we want to add it to history. The snapshot log should refer to snapshots by ID and we can get the schema easily enough from there. For the actual metadata table, we don't want to expose schema ID because that's something that users should ideally never need to know much about. We've avoided adding IDs in other similar situations. For example, you can rewrite manifests for a specific partition spec, by ID. But we didn't include that in the stored procedure because we haven't exposed the partition spec ID anywhere to SQL users.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r575717275



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -661,9 +713,9 @@ public TableMetadata removeSnapshotLogEntries(Set<Long> snapshotIds) {
         "Cannot set invalid snapshot log: latest entry is not the current snapshot");
 
     return new TableMetadata(null, formatVersion, uuid, location,
-        lastSequenceNumber, System.currentTimeMillis(), lastColumnId, schema, defaultSpecId, specs,
-        lastAssignedPartitionId, defaultSortOrderId, sortOrders, properties, currentSnapshotId,
-        snapshots, newSnapshotLog, addPreviousFile(file, lastUpdatedMillis));
+        lastSequenceNumber, System.currentTimeMillis(), lastColumnId, currentSchemaId, schemas, defaultSpecId, specs,
+        lastAssignedPartitionId, defaultSortOrderId, sortOrders, properties, currentSnapshotId, snapshots,
+        newSnapshotLog, addPreviousFile(file, lastUpdatedMillis));

Review comment:
       Did `snapshots` need to move?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r570459069



##########
File path: api/src/main/java/org/apache/iceberg/types/TypeUtil.java
##########
@@ -157,6 +157,21 @@ public static Schema assignFreshIds(Schema schema, NextID nextId) {
         .fields());
   }
 
+  /**
+   * Assigns fresh ids from the {@link NextID nextId function} for all fields in a schema.

Review comment:
       nit: function should be outside the bracket

##########
File path: core/src/main/java/org/apache/iceberg/SchemaParser.java
##########
@@ -134,6 +143,10 @@ static void toJson(Type type, JsonGenerator generator) throws IOException {
     }
   }
 
+  public static void toJsonWithVersion(Schema schema, JsonGenerator generator) throws IOException {

Review comment:
       nit: should this be called `toJsonWithId`, just to match the `fromJsonWithId` below?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r575717337



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -718,18 +770,26 @@ public TableMetadata buildReplacement(Schema updatedSchema, PartitionSpec update
     newProperties.putAll(this.properties);
     newProperties.putAll(updatedProperties);
 
+    // determine the next schema id
+    int freshSchemaId = reuseOrCreateNewSchemaId(freshSchema);
+    ImmutableList.Builder<Schema> schemasBuilder = ImmutableList.<Schema>builder().addAll(schemas);
+
+    if (!schemasById.containsKey(freshSchemaId)) {
+      schemasBuilder.add(new Schema(freshSchemaId, freshSchema.columns()));
+    }
+
     return new TableMetadata(null, formatVersion, uuid, newLocation,
-        lastSequenceNumber, System.currentTimeMillis(), newLastColumnId.get(), freshSchema,
+        lastSequenceNumber, System.currentTimeMillis(), newLastColumnId.get(), freshSchemaId, schemasBuilder.build(),
         specId, specListBuilder.build(), Math.max(lastAssignedPartitionId, freshSpec.lastAssignedFieldId()),
         orderId, sortOrdersBuilder.build(), ImmutableMap.copyOf(newProperties),
         -1, snapshots, ImmutableList.of(), addPreviousFile(file, lastUpdatedMillis, newProperties));
   }
 
   public TableMetadata updateLocation(String newLocation) {
     return new TableMetadata(null, formatVersion, uuid, newLocation,
-        lastSequenceNumber, System.currentTimeMillis(), lastColumnId, schema, defaultSpecId, specs,
-        lastAssignedPartitionId, defaultSortOrderId, sortOrders, properties, currentSnapshotId,
-        snapshots, snapshotLog, addPreviousFile(file, lastUpdatedMillis));
+        lastSequenceNumber, System.currentTimeMillis(), lastColumnId, currentSchemaId, schemas, defaultSpecId, specs,
+        lastAssignedPartitionId, defaultSortOrderId, sortOrders, properties, currentSnapshotId, snapshots, snapshotLog,
+        addPreviousFile(file, lastUpdatedMillis));

Review comment:
       Looks like it isn't necessary to move `snapshots` and `snapshotLog`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#issuecomment-774773226


   Works fine for me so far, tested a few operations among a few clusters of different versions. 
   
   One behavior I see is that, if an old v1 writer is somehow still used to create a new table version after v2 writer is used, all schema history will be dropped, so new schema written by v2 writer will have schema id starting from 0 again, which can break time travel query. 
   
   In theory, once a table is written by a v2 writer, it should never be touched by a v1 writer, but we cannot prevent people from doing that in real life, especially when the organization is large and people use different versions of the same software all the time. So I feel we still need something like a `last-assigned-schema-id`, so that the schemas don't accidentally share the same id and we can backfill the earlier schema and their IDs if necessary. But `last-assigned-schema-id` field should not as a part of the table metadata field because it will be dropped by an old writer. One way to go is to maybe put it in table properties, but there might be better ways. What do you think?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r575717631



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadataParser.java
##########
@@ -351,8 +400,8 @@ static TableMetadata fromJson(FileIO io, InputFile file, JsonNode node) {
     }
 
     return new TableMetadata(file, formatVersion, uuid, location,
-        lastSequenceNumber, lastUpdatedMillis, lastAssignedColumnId, schema, defaultSpecId, specs,
-        lastAssignedPartitionId, defaultSortOrderId, sortOrders, properties, currentVersionId,
-        snapshots, entries.build(), metadataEntries.build());
+        lastSequenceNumber, lastUpdatedMillis, lastAssignedColumnId, currentSchemaId, schemas, defaultSpecId, specs,
+        lastAssignedPartitionId, defaultSortOrderId, sortOrders, properties, currentVersionId, snapshots,
+        entries.build(), metadataEntries.build());

Review comment:
       Snapshots doesn't need to move.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r569016246



##########
File path: api/src/main/java/org/apache/iceberg/Schema.java
##########
@@ -328,6 +329,24 @@ private Schema internalSelect(Collection<String> names, boolean caseSensitive) {
     return TypeUtil.select(this, selected);
   }
 
+  @Override
+  public boolean equals(Object other) {
+    if (this == other) {
+      return true;
+    } else if (!(other instanceof Schema)) {
+      return false;
+    }
+
+    Schema that = (Schema) other;
+    return struct.equals(that.struct) &&
+        Objects.equals(aliasToId, that.aliasToId);
+  }
+
+  @Override
+  public int hashCode() {
+    return Objects.hash(struct, aliasToId);
+  }

Review comment:
       I don't think that should use aliases. Aliases are populated when converting from a file schema to Iceberg, so that we can look up the file's column names from IDs and vice versa. That information is used by integration and not by Iceberg itself. The aliases are not serialized when writing a schema into Iceberg metadata.
   
   For the purposes of maintaining the schema list, we can ignore aliases and just use `schema.asStruct()`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r570695071



##########
File path: api/src/main/java/org/apache/iceberg/types/TypeUtil.java
##########
@@ -157,6 +157,21 @@ public static Schema assignFreshIds(Schema schema, NextID nextId) {
         .fields());
   }
 
+  /**
+   * Assigns fresh ids from the {@link NextID nextId function} for all fields in a schema.

Review comment:
       Hmm actually I think either way may make sense? We either link `nextId function` or `nextId` to `NextId` interface, and I think both are valid interpretations? I'm also a bit hesitant to change since I copied this from another method, and there are a few others that use this style. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r575717192



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -499,13 +548,14 @@ public TableMetadata updatePartitionSpec(PartitionSpec newPartitionSpec) {
     }
 
     return new TableMetadata(null, formatVersion, uuid, location,
-        lastSequenceNumber, System.currentTimeMillis(), lastColumnId, schema, newDefaultSpecId,
+        lastSequenceNumber, System.currentTimeMillis(), lastColumnId, currentSchemaId, schemas, newDefaultSpecId,
         builder.build(), Math.max(lastAssignedPartitionId, newPartitionSpec.lastAssignedFieldId()),
-        defaultSortOrderId, sortOrders, properties,
-        currentSnapshotId, snapshots, snapshotLog, addPreviousFile(file, lastUpdatedMillis));
+        defaultSortOrderId, sortOrders, properties, currentSnapshotId, snapshots, snapshotLog,
+        addPreviousFile(file, lastUpdatedMillis));

Review comment:
       It looks like this change wasn't needed and just moves 3 variables?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r567326556



##########
File path: api/src/main/java/org/apache/iceberg/VersionedSchema.java
##########
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Objects;
+
+/**
+ * A wrapper around Schema to include its schema ID
+ */
+public class VersionedSchema implements Serializable {

Review comment:
       My comment above is thinking through your question 1 in the description. I think it is okay to do this as you have it here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r567325845



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadataParser.java
##########
@@ -161,8 +163,19 @@ private static void toJson(TableMetadata metadata, JsonGenerator generator) thro
     generator.writeNumberField(LAST_UPDATED_MILLIS, metadata.lastUpdatedMillis());
     generator.writeNumberField(LAST_COLUMN_ID, metadata.lastColumnId());
 
-    generator.writeFieldName(SCHEMA);
-    SchemaParser.toJson(metadata.schema(), generator);
+    // for older readers, continue writing the current schema as "schema"

Review comment:
       Might want to note that this is done until v2 because support for `schemas` and `current-schema-id` is required in v2 and later.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r575714561



##########
File path: api/src/main/java/org/apache/iceberg/Schema.java
##########
@@ -54,6 +58,7 @@
   private transient Map<Integer, String> idToName = null;
 
   public Schema(List<NestedField> columns, Map<String, Integer> aliases) {
+    this.schemaId = 0;

Review comment:
       I think we should add a constant for the default schema ID. That way we can reference it in other places when we fill in the default.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r575715520



##########
File path: core/src/main/java/org/apache/iceberg/SchemaParser.java
##########
@@ -58,9 +59,17 @@ private SchemaParser() {
   private static final String VALUE_REQUIRED = "value-required";
 
   static void toJson(Types.StructType struct, JsonGenerator generator) throws IOException {
+    toJson(struct, null, generator);
+  }
+
+  static void toJson(Types.StructType struct, Integer schemaId, JsonGenerator generator) throws IOException {

Review comment:
       Looks like the struct `toJson` methods aren't called outside of this file. What do you think about making them private?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r560449979



##########
File path: api/src/main/java/org/apache/iceberg/VersionedSchema.java
##########
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Objects;
+
+/**
+ * A wrapper around Schema to include its schema ID
+ */
+public class VersionedSchema implements Serializable {

Review comment:
       I mentioned this briefly in the PR description: 
   
   >The reason to not add ID directly into Schema is that currently schema creation is widely used as a convenient method for a lot of actions that don't involve a "real" table schema. 
   
   The main concern I had with adding schema ID is that it might be confusing to tell from the code base when a schema should have ID and when not. A schema only needs to have ID when it's really a table schema strictly associated with table metadata, but outside of that context, the ID is not useful in any "real" operation of the schema, especially considering that it's often used for conveniently mutate fields for things like projection. Maybe we should rename `VersionedSchema` to something like `TableSchema` to better reflect this? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r570695071



##########
File path: api/src/main/java/org/apache/iceberg/types/TypeUtil.java
##########
@@ -157,6 +157,21 @@ public static Schema assignFreshIds(Schema schema, NextID nextId) {
         .fields());
   }
 
+  /**
+   * Assigns fresh ids from the {@link NextID nextId function} for all fields in a schema.

Review comment:
       Thanks for the catch! Will update

##########
File path: core/src/main/java/org/apache/iceberg/SchemaParser.java
##########
@@ -134,6 +143,10 @@ static void toJson(Type type, JsonGenerator generator) throws IOException {
     }
   }
 
+  public static void toJsonWithVersion(Schema schema, JsonGenerator generator) throws IOException {

Review comment:
       Good point, will update

##########
File path: api/src/main/java/org/apache/iceberg/types/TypeUtil.java
##########
@@ -157,6 +157,21 @@ public static Schema assignFreshIds(Schema schema, NextID nextId) {
         .fields());
   }
 
+  /**
+   * Assigns fresh ids from the {@link NextID nextId function} for all fields in a schema.

Review comment:
       Hmm actually I think either way may make sense? We either link `nextId function` or `nextId` to `NextId` interface, and I think both are valid interpretations? I'm also a bit hesitant to change since I copied this from another method, and there are a few others that use this style. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r578065083



##########
File path: core/src/main/java/org/apache/iceberg/SchemaParser.java
##########
@@ -253,4 +266,15 @@ public static Schema fromJson(String json) {
       }
     });
   }
+
+  public static Schema fromJsonWithId(int schemaId, JsonNode json) {

Review comment:
       I have removed `fromJsonWithId` and update `fromJson` to use the constructor with schema Id, if Id present in Json blob, or default to constructor that doesn't require Id. Please let me know if it is expected!
   
   Regarding persisting schema ID in manifest, since schema and partition-spec in both v1 and v2 are required in manifest metadata, I think it is unlikely that we will miss schema itself but has schema ID for lookup? And I think in manifest reader, schema is only used for constructing partition spec, so that having a correct schema ID might not be super necessary. 
   
   Today there are a lot of places that rely on toJson/fromJson for schema, e.g. Avro metadata, Hive table properties, scan tasks. In this PR, except for table metadata parser, those are all using the `toJson` version that do not write schemaId at all, which means that when the strings are read back to Schema object they will always have id of 0; and this goes back to the persisting schema ID question above, that we do not persist id in the manifest header, so we are not persisting an incorrect ID, but instead constructed a schema without ID that default to be 0 like everywhere else. 
   
   My original thinking was that schema Id should always be ignored/considered inaccurate unless we directly get it from table metadata, as after all kinds of projections it might be hard to track if the ID is correct. And the ID is only used for lookup, so if we have the schema struct, it's fine to have incorrect ID since we are unlikely to use it. But I might be overcomplicating the problem. Do you think we are able to guarantee the ID to be right everywhere, and do you have comment about the current state of `toJson` variations (that only the one used by metadata parser writes out ID)? 
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r568292231



##########
File path: api/src/main/java/org/apache/iceberg/Schema.java
##########
@@ -328,6 +329,24 @@ private Schema internalSelect(Collection<String> names, boolean caseSensitive) {
     return TypeUtil.select(this, selected);
   }
 
+  @Override
+  public boolean equals(Object other) {
+    if (this == other) {
+      return true;
+    } else if (!(other instanceof Schema)) {
+      return false;
+    }
+
+    Schema that = (Schema) other;
+    return struct.equals(that.struct) &&
+        Objects.equals(aliasToId, that.aliasToId);
+  }
+
+  @Override
+  public int hashCode() {
+    return Objects.hash(struct, aliasToId);
+  }

Review comment:
       In the new change I introduced, we can reuse the old schema ID during schema update, if the exact same schema already exists in the schema list. I think in this case we may need to confirm both schemas' aliases should be the same? Or do you think there shouldn't be a case when schema are the same while aliases are not?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r578065567



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadataParser.java
##########
@@ -262,7 +277,41 @@ static TableMetadata fromJson(FileIO io, InputFile file, JsonNode node) {
       lastSequenceNumber = TableMetadata.INITIAL_SEQUENCE_NUMBER;
     }
     int lastAssignedColumnId = JsonUtil.getInt(LAST_COLUMN_ID, node);
-    Schema schema = SchemaParser.fromJson(node.get(SCHEMA));
+
+    List<Schema> schemas;
+    int currentSchemaId;
+    Schema schema = null;

Review comment:
       This is because in the first if statement, after parsing the schema array and assign the schema, I did a check to ensure the schema holding current schema ID exists in schema array. Without this null check, the `Preconditions.checkArgument(schema != null, ...)` will highlight with error warning "schema variable might not be initialized".
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r567324641



##########
File path: api/src/main/java/org/apache/iceberg/VersionedSchema.java
##########
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Objects;
+
+/**
+ * A wrapper around Schema to include its schema ID
+ */
+public class VersionedSchema implements Serializable {

Review comment:
       I think I agree with @jackye1995 here. We could add the ID to `Schema` and just have it return `null` if there is no schema ID.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r593551188



##########
File path: core/src/test/java/org/apache/iceberg/TestTableMetadata.java
##########
@@ -599,4 +639,81 @@ public void testUpdateSortOrder() {
     Assert.assertEquals("Should be nulls first",
         NullOrder.NULLS_FIRST, sortedByX.sortOrder().fields().get(0).nullOrder());
   }
+
+  @Test
+  public void testUpdateSchema() {
+    Schema schema = new Schema(0,
+        Types.NestedField.required(1, "y", Types.LongType.get(), "comment")
+    );
+    TableMetadata freshTable = TableMetadata.newTableMetadata(
+        schema, PartitionSpec.unpartitioned(), null, ImmutableMap.of());
+    Assert.assertEquals("Should use TableMetadata.INITIAL_SCHEMA_ID for current schema id",
+        TableMetadata.INITIAL_SCHEMA_ID, freshTable.currentSchemaId());

Review comment:
       Do you mean to change the schema id supplied to `TableMetadata.newTableMetadata` to be non-0? Actually in `TableMetadata` the input schema Id will be ignored and always assign from 0, so changing this will break the test... Although I think this current behavior should be fine since I think this method is only called when creating a new table, so there shouldn't be any existing schema within the table that will collide on schema ID. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
yyanyy commented on pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#issuecomment-780975835


   > Thanks @yyanyy! I think this is getting close. The main thing is not exposing extra id-aware methods in the parser and also reading/writing manifests with IDs. Adding schema ID to snapshots will be a follow-up, right?
   
   Thanks for the review and sorry for delay responding! And yes, schema ID to snapshot will be a follow up PR. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r580682837



##########
File path: core/src/test/java/org/apache/iceberg/TestTableMetadata.java
##########
@@ -599,4 +639,81 @@ public void testUpdateSortOrder() {
     Assert.assertEquals("Should be nulls first",
         NullOrder.NULLS_FIRST, sortedByX.sortOrder().fields().get(0).nullOrder());
   }
+
+  @Test
+  public void testUpdateSchema() {
+    Schema schema = new Schema(0,
+        Types.NestedField.required(1, "y", Types.LongType.get(), "comment")
+    );
+    TableMetadata freshTable = TableMetadata.newTableMetadata(
+        schema, PartitionSpec.unpartitioned(), null, ImmutableMap.of());
+    Assert.assertEquals("Should use TableMetadata.INITIAL_SCHEMA_ID for current schema id",
+        TableMetadata.INITIAL_SCHEMA_ID, freshTable.currentSchemaId());

Review comment:
       Minor: The schema passed in uses the same ID as the one that is validated (`INITIAL_SCHEMA_ID=0`). I think to test this, it would be better to pass in a schema with a non-zero ID.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r580683983



##########
File path: core/src/test/java/org/apache/iceberg/TestTableMetadata.java
##########
@@ -599,4 +639,81 @@ public void testUpdateSortOrder() {
     Assert.assertEquals("Should be nulls first",
         NullOrder.NULLS_FIRST, sortedByX.sortOrder().fields().get(0).nullOrder());
   }
+
+  @Test
+  public void testUpdateSchema() {
+    Schema schema = new Schema(0,
+        Types.NestedField.required(1, "y", Types.LongType.get(), "comment")
+    );
+    TableMetadata freshTable = TableMetadata.newTableMetadata(
+        schema, PartitionSpec.unpartitioned(), null, ImmutableMap.of());
+    Assert.assertEquals("Should use TableMetadata.INITIAL_SCHEMA_ID for current schema id",
+        TableMetadata.INITIAL_SCHEMA_ID, freshTable.currentSchemaId());
+    assertSameSchemaList(ImmutableList.of(schema), freshTable.schemas());
+    Assert.assertEquals("Should have expected schema upon return",
+        schema.asStruct(), freshTable.schema().asStruct());
+    Assert.assertEquals("Should return expected last column id", 1, freshTable.lastColumnId());
+
+    // update schema
+    Schema schema2 = new Schema(
+        Types.NestedField.required(1, "y", Types.LongType.get(), "comment"),
+        Types.NestedField.required(2, "x", Types.StringType.get())
+    );
+    TableMetadata twoSchemasTable = freshTable.updateSchema(schema2, 2);
+    Assert.assertEquals("Should have current schema id as 1",
+        1, twoSchemasTable.currentSchemaId());
+    assertSameSchemaList(ImmutableList.of(schema, new Schema(1, schema2.columns())),
+        twoSchemasTable.schemas());
+    Assert.assertEquals("Should have expected schema upon return",
+        schema2.asStruct(), twoSchemasTable.schema().asStruct());
+    Assert.assertEquals("Should return expected last column id", 2, twoSchemasTable.lastColumnId());
+
+    // update schema with the the same schema and last column ID as current shouldn't cause change
+    Schema sameSchema2 = new Schema(
+        Types.NestedField.required(1, "y", Types.LongType.get(), "comment"),
+        Types.NestedField.required(2, "x", Types.StringType.get())
+    );
+    TableMetadata sameSchemaTable = twoSchemasTable.updateSchema(sameSchema2, 2);
+    Assert.assertEquals("Should return same table metadata",
+        twoSchemasTable, sameSchemaTable);

Review comment:
       Should this use `assertSame` since the object must be the same `TableMetadata` instance and not just equal to it?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
yyanyy commented on pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#issuecomment-771320010


   > > Do we want to add a last-assigned-schema-id to table metadata?
   > 
   > My initial thought was no because the schema IDs are completely local to a metadata file. We can reassign all IDs from one file to the next and as long as they are internally consistent we are okay. The IDs are not embedded in separate files that might get improperly used.
   > 
   > But, the Nessie comment you pointed to is a good one to think about. I think it is still okay because Nessie merges would either be fast-forward, or would re-apply changes and create entirely new metadata files. So the internal consistency argument still holds.
   > 
   > > Do we want to assign special schema IDs to metadata tables, to avoid potential collision on schema IDs?
   > 
   > I don't think so.
   > 
   > > Do we want to add snapshot-id to only history entries, or Snapshot interface?
   > 
   > I would add it to Snapshot, not just History. Snapshot is what will be used for time travel queries. I would avoid adding too much metadata to history.
   > 
   > > I think currently when replacing a table, earlier history entries/snapshotLog will be reset to empty (second to last argument in here). Is this expected? do we want to fix this as a separate issue?
   > 
   > I think we should follow up and keep the history. I think the reason for this is because we previously didn't have any compatibility across schemas (IDs were completely reassigned, so time travel would be incorrect). But that's fixed now.
   
   Thank you for the review and the input! I'll create a separate issue to mention the history problem and link here. Do you have any suggestion on if we want to add schema id to the history table? 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r567324931



##########
File path: core/src/main/java/org/apache/iceberg/SchemaParser.java
##########
@@ -58,9 +59,16 @@ private SchemaParser() {
   private static final String VALUE_REQUIRED = "value-required";
 
   static void toJson(Types.StructType struct, JsonGenerator generator) throws IOException {
+    toJson(struct, null, generator);
+  }
+  static void toJson(Types.StructType struct, Integer schemaId, JsonGenerator generator) throws IOException {

Review comment:
       Nit: missing newline between methods.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r575717760



##########
File path: core/src/test/java/org/apache/iceberg/TestTableMetadataSerialization.java
##########
@@ -78,4 +82,21 @@ public void testSerialization() throws Exception {
         Lists.transform(result.snapshots(), Snapshot::snapshotId));
     Assert.assertEquals("History should match", meta.snapshotLog(), result.snapshotLog());
   }
+
+  private void assertSameSchemaList(List<Schema> list1, List<Schema> list2) {

Review comment:
       Is it possible to factor this into a common util rather than copying it?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org