You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/04/12 23:08:10 UTC

[GitHub] [iceberg] jackye1995 opened a new pull request #2465: Core: add row identifier to schema

jackye1995 opened a new pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465


   
   Continuation of #2354
   
   @yyanyy @rdblue @openinx @aokolnychyi 
   
   Spec with row identifier:
   
   ```
   {
         "type": "struct",
         "schema-id": 1,
         "row-identifiers": [
           1,
           2
         ],
         "fields": [
           {
             "id": 1,
             "name": "x",
             "required": true,
             "type": "long"
           },
           {
             "id": 2,
             "name": "y",
             "required": true,
             "type": "long",
             "doc": "comment"
           },
           {
             "id": 3,
             "name": "z",
             "required": true,
             "type": "long"
           }
         ]
       }
   
   ```
   
   New Schema toString:
   
   ```
   table {
     fields {
       1: x: required long
       2: y: required long (comment)
       3: z: required long
     }
     row identifiers { 1,2 }
   }
   ```
   
   
   Update row identifier rules:
   1. row identifier should be added through `UpdateSchema.addRowIdentifier(columnName)`
   2. the column added should exist in schema or a part of the newly added columns (to make adding a new primary key a single atomic update)
   3. rename, move column should not affect row identifier because it is referencing the field IDs
   4. row identifier should be dropped through `UpdateSchema.deleteRowIdentifier(columnName)`
   5. it can only drop existing row identifier column
   6. a row identifier column cannot be dropped unless it is first dropped in the row identifiers list, to satisfy both use cases (1) user want to actually drop that row identifier column, (2) prevent user from directly dropping that column without knowing the implications
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] openinx commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

openinx commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r612211781



##########
File path: core/src/test/java/org/apache/iceberg/TestSchemaUpdate.java
##########
@@ -1223,4 +1223,98 @@ public void testMoveBetweenStructsFails() {
                 .moveBefore("s2.x", "s1.a")
                 .apply());
   }
+
+  @Test
+  public void testAddRowIdentifier() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .allowIncompatibleChanges()
+        .addRowIdentifier("id")
+        .addRequiredColumn("id2", Types.StringType.get())
+        .addRowIdentifier("id2")
+        .apply();
+    Assert.assertEquals(Sets.newHashSet(1, SCHEMA_LAST_COLUMN_ID + 1), newSchema.rowIdentifiers());
+
+    AssertHelpers.assertThrows("Should fail if column with name not exist",
+        IllegalArgumentException.class,
+        "Cannot find column of the given name in existing or newly added columns",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID).addRowIdentifier("unknown").apply());
+
+    AssertHelpers.assertThrows("Should fail if column with name not exist",
+        IllegalArgumentException.class,
+        "it is not a required field",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID).addRowIdentifier("data").apply());
+
+    AssertHelpers.assertThrows("Should fail if column with name not exist",
+        IllegalArgumentException.class,
+        "it is not a primitive field",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID).addRowIdentifier("locations").apply());
+  }
+
+  @Test
+  public void testRenameRowIdentifier() {
+    Schema schemaWithRowIdentifier = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .allowIncompatibleChanges()
+        .addRowIdentifier("id")
+        .apply();
+
+    Schema newSchema = new SchemaUpdate(schemaWithRowIdentifier, SCHEMA_LAST_COLUMN_ID)
+        .renameColumn("id", "id2")
+        .apply();
+
+    Assert.assertEquals(Sets.newHashSet(1), newSchema.rowIdentifiers());
+  }
+

Review comment:
       What's the behavior if we try to add a unit tests based on this [comment](https://github.com/apache/iceberg/pull/2465/files#r612199953) ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r620747946



##########
File path: core/src/test/java/org/apache/iceberg/TestSchemaUpdate.java
##########
@@ -1223,4 +1223,232 @@ public void testMoveBetweenStructsFails() {
                 .moveBefore("s2.x", "s1.a")
                 .apply());
   }
+
+  @Test
+  public void testAddExistingIdentifierFields() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("id"))
+        .apply();
+
+    Assert.assertEquals("add an existing field as identifier field should succeed",
+        Sets.newHashSet(newSchema.findField("id").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testAddNewIdentifierFieldColumns() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new_field", Types.StringType.get())
+        .setIdentifierFields(Sets.newHashSet("id", "new_field"))
+        .apply();
+
+    Assert.assertEquals("add column then set as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("id").fieldId(), newSchema.findField("new_field").fieldId()),
+        newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("id", "new_field"))
+        .addColumn("new_field", Types.StringType.get())
+        .apply();
+
+    Assert.assertEquals("set identifier then add column should succeed",
+        Sets.newHashSet(newSchema.findField("id").fieldId(), newSchema.findField("new_field").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testAddNestedIdentifierFieldColumns() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("preferences.feature1"))
+        .apply();
+
+    Assert.assertEquals("set existing nested field as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("preferences.feature1").fieldId()),
+        newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new", Types.StructType.of(
+            Types.NestedField.optional(SCHEMA_LAST_COLUMN_ID + 1, "field", Types.StringType.get())
+        ))
+        .setIdentifierFields(Sets.newHashSet("new.field"))
+        .apply();
+
+    Assert.assertEquals("set newly added nested field as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("new.field").fieldId()),
+        newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new", Types.StructType.of(
+            Types.NestedField.optional(SCHEMA_LAST_COLUMN_ID + 1, "field", Types.StructType.of(
+                Types.NestedField.optional(SCHEMA_LAST_COLUMN_ID + 2, "nested", Types.StringType.get())))))
+        .setIdentifierFields(Sets.newHashSet("new.field.nested"))
+        .apply();
+
+    Assert.assertEquals("set newly added multi-layer nested field as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("new.field.nested").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testAddDottedIdentifierFieldColumns() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn(null, "dot.field", Types.StringType.get())
+        .setIdentifierFields(Sets.newHashSet("id", "dot.field"))
+        .apply();
+
+    Assert.assertEquals("add a field with dot as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("id").fieldId(), newSchema.findField("dot.field").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testRemoveIdentifierFields() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new_field", Types.StringType.get())
+        .addColumn("new_field2", Types.StringType.get())
+        .setIdentifierFields(Sets.newHashSet("id", "new_field", "new_field2"))
+        .apply();
+
+    newSchema = new SchemaUpdate(newSchema, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("new_field", "new_field2"))
+        .apply();
+
+    Assert.assertEquals("remove an identifier field should succeed",
+        Sets.newHashSet(newSchema.findField("new_field").fieldId(), newSchema.findField("new_field2").fieldId()),
+        newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(newSchema, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet())
+        .apply();
+
+    Assert.assertEquals("remove all identifier fields should succeed",
+        Sets.newHashSet(),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testSetIdentifierFieldsFails() {
+    AssertHelpers.assertThrows("add a field with name not exist should fail",
+        IllegalArgumentException.class,
+        "not found in current schema or added columns",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+            .setIdentifierFields(Sets.newHashSet("unknown"))
+            .apply());
+
+    AssertHelpers.assertThrows("add a field of non-primitive type should fail",
+        IllegalArgumentException.class,
+        "not a primitive type field",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+            .setIdentifierFields(Sets.newHashSet("locations"))
+            .apply());
+
+    AssertHelpers.assertThrows("add a nested field in map should fail",
+        IllegalArgumentException.class,
+        "must not be nested in " + SCHEMA.findField("locations"),
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+            .setIdentifierFields(Sets.newHashSet("locations.key.zip"))
+            .apply());
+
+    AssertHelpers.assertThrows("add a nested field in list should fail",
+        IllegalArgumentException.class,
+        "must not be nested in " + SCHEMA.findField("points"),
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+            .setIdentifierFields(Sets.newHashSet("points.element.x"))
+            .apply());
+
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new", Types.StructType.of(
+            Types.NestedField.optional(SCHEMA_LAST_COLUMN_ID + 1, "fields", Types.ListType.ofOptional(
+                SCHEMA_LAST_COLUMN_ID + 2, Types.StructType.of(
+                    Types.NestedField.optional(SCHEMA_LAST_COLUMN_ID + 3, "nested", Types.StringType.get())
+                ))
+            )
+        ))
+        .apply();
+
+    AssertHelpers.assertThrows("add a nested field in struct of a map should fail",

Review comment:
       This is trying to test struct -> list -> struct, instead of just list -> struct to trigger the while loop.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] openinx commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

openinx commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r612913679



##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -193,6 +204,11 @@ public UpdateSchema renameColumn(String name, String newName) {
       updates.put(fieldId, Types.NestedField.of(fieldId, field.isOptional(), newName, field.type(), field.doc()));
     }
 
+    if (identifierFieldNames.contains(name)) {

Review comment:
       Should we throw exception if the `newName` has been already existed in `identifierFieldNames` (Maybe actually `schema`) ? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] yyanyy commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

yyanyy commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r612759918



##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -408,11 +450,15 @@ private TableMetadata applyChangesToMapping(TableMetadata metadata) {
   private static Schema applyChanges(Schema schema, List<Integer> deletes,
                                      Map<Integer, Types.NestedField> updates,
                                      Multimap<Integer, Types.NestedField> adds,
-                                     Multimap<Integer, Move> moves) {
+                                     Multimap<Integer, Move> moves,
+                                     Set<String> rowIdentifierNames) {
     Types.StructType struct = TypeUtil
         .visit(schema, new ApplyChanges(deletes, updates, adds, moves))
         .asNestedType().asStructType();
-    return new Schema(struct.fields());
+    Set<Integer> refreshedRowIdentifiers = rowIdentifierNames.stream()
+        .map(name -> struct.field(name).fieldId())
+        .collect(Collectors.toSet());
+    return new Schema(struct.fields(), refreshedRowIdentifiers);

Review comment:
       I think we may want a change to `TableMetadata` to ensure an update to only row identifier should still create a new schema; I think currently if a schema update does not change the underlying `struct`, we will consider the operation a no-op (comparison on `struct` is also why we don't have `equals`/`hashcode` in `Schema`)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r615310750



##########
File path: core/src/main/java/org/apache/iceberg/SchemaParser.java
##########
@@ -158,7 +169,7 @@ public static String toJson(Schema schema, boolean pretty) {
       if (pretty) {
         generator.useDefaultPrettyPrinter();
       }
-      toJson(schema.asStruct(), generator);
+      toJson(schema.asStruct(), schema.schemaId(), schema.rowIdentifiers(), generator);

Review comment:
       I think we do want to pass the schema ID here. Whenever we can, we should preserve the ID. My comment on the other PR was recommending that we preserve the ID where possible even when reading and writing manifest files.
   
   I think I missed @yyanyy's follow-up comment about other places that embed the schema ID. I tend to agree with that line of reasoning, but because of it I'm coming to a different conclusion. Rather than removing the schema ID from those places, I think we should try to maintain them.
   
   The risk to keeping schema IDs in other files is that another file may be written with a schema ID that is lost because an older writer (e.g. v1) didn't write out the schema list, only the current schema. If that happens, then an Avro file might write a schema with ID 1, which is then lost and later redefined. But as Yan noted, the actual schema (fields) are stored in those files and we don't need to look up the schema from the ID. So we have a choice of whether to discard the schema ID or not. I'd prefer not to drop it, since we can easily detect whether the schema actually matches the one with its ID and because we won't really need to do the lookup. I think it adds complexity to remember whether or not to drop the ID.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r616245321



##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -193,6 +204,11 @@ public UpdateSchema renameColumn(String name, String newName) {
       updates.put(fieldId, Types.NestedField.of(fieldId, field.isOptional(), newName, field.type(), field.doc()));
     }
 
+    if (identifierFieldNames.contains(name)) {

Review comment:
       Yeah, I was expecting the lines above to catch this, but surprisingly it did not do this check.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r617123843



##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -317,6 +320,31 @@ public UpdateSchema unionByNameWith(Schema newSchema) {
     return this;
   }
 
+  @Override
+  public UpdateSchema setIdentifierFields(Set<String> names) {

Review comment:
       I think we may want to also keep `addIdentifierField`. There's still a use case around adding a field at a time. 
   ~~Plus, given that I think we want to be able to specify the parent (see my comment below), this form of the method is a bit awkward because we can't pass pairs of strings, `(parent, name)`.~~
   
   ~~What about changing this to `resetIdentifierFields`, `addIdentifierField`, and `removeIdentifierField`? I think those get the same job done because `set` is `reset` plus `add` calls.~~
   
   Edit: I think I was wrong on changing the API, but `addIdentifierField` is still valid.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] yyanyy commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

yyanyy commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r612661016



##########
File path: core/src/main/java/org/apache/iceberg/SchemaParser.java
##########
@@ -158,7 +169,7 @@ public static String toJson(Schema schema, boolean pretty) {
       if (pretty) {
         generator.useDefaultPrettyPrinter();
       }
-      toJson(schema.asStruct(), generator);
+      toJson(schema.asStruct(), schema.schemaId(), schema.rowIdentifiers(), generator);

Review comment:
       Haven't reviewed the entire PR yet, just for this specific comment - the reason was mentioned in https://github.com/apache/iceberg/pull/2096#discussion_r575716542. Basically I think there are a lot of places where this method is called for serialization/deserialization of the schema as well as preserve the schema to some file metadata, and at that point sometimes `schema` object is already transformed/projected so it is no longer an existing schema of the table, and thus shouldn't have schema ID. Since schema ID is in theory only useful when retrieved directly from table metadata to get the right schema version for time travel, I think it's better to not explicitly write a schema Id except for writing to table metadata file, to avoid persisting an incorrect version. 
   
   For the case of supporting row identifier to schema, I wonder if we want to do something similar - since row identifier is mostly useful during writing (to check table to get the default row identifier for delete) and not to read (since data/delete file should have this preserved in another place), we might want to have a separate `toJson`/`fromJson` to handle the row identifier parsing separately, and normally when handling schema we don't include this information; i.e. to make the internal model of rowId similar to what we have in #2354, except that serialization/deserialization is different. 
   
   @rdblue since you involved in both conversations, could you please take a look? Thank you!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r620693472



##########
File path: api/src/main/java/org/apache/iceberg/UpdateSchema.java
##########
@@ -384,4 +386,24 @@ default UpdateSchema updateColumn(String name, Type.PrimitiveType newType, Strin
    *                                  with other additions, renames, or updates.
    */
   UpdateSchema unionByNameWith(Schema newSchema);
+
+  /**
+   * Set the identifier fields given a set of field names.
+   * See {@link Schema#identifierFieldIds()} to learn more about Iceberg identifier.
+   *
+   * @param names names of the columns to set as identifier fields
+   * @return this for method chaining
+   */
+  UpdateSchema setIdentifierFields(Set<String> names);

Review comment:
       Why require this to be a set instead of accepting any `Collection` or `Iterable`? Would a user never need to pass a `List`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r615319740



##########
File path: core/src/test/java/org/apache/iceberg/TestSchemaUpdate.java
##########
@@ -1223,4 +1223,98 @@ public void testMoveBetweenStructsFails() {
                 .moveBefore("s2.x", "s1.a")
                 .apply());
   }
+
+  @Test
+  public void testAddIdentifierField() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .allowIncompatibleChanges()
+        .addIdentifierField("id")
+        .addRequiredColumn("id2", Types.StringType.get())
+        .addIdentifierField("id2")
+        .apply();
+    Assert.assertEquals(Sets.newHashSet(1, SCHEMA_LAST_COLUMN_ID + 1), newSchema.identifierFieldIds());
+
+    AssertHelpers.assertThrows("Should fail if column with name not exist",
+        IllegalArgumentException.class,
+        "Cannot find column of the given name in existing or newly added columns",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID).addIdentifierField("unknown").apply());
+
+    AssertHelpers.assertThrows("Should fail if column with name not exist",
+        IllegalArgumentException.class,
+        "it is not a required field",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID).addIdentifierField("data").apply());
+
+    AssertHelpers.assertThrows("Should fail if column with name not exist",
+        IllegalArgumentException.class,
+        "it is not a primitive field",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID).addIdentifierField("locations").apply());
+  }
+
+  @Test
+  public void testRenameIdentifierField() {
+    Schema schemaWithIdentifierFields = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .allowIncompatibleChanges()
+        .addIdentifierField("id")
+        .apply();
+
+    Schema newSchema = new SchemaUpdate(schemaWithIdentifierFields, SCHEMA_LAST_COLUMN_ID)
+        .renameColumn("id", "id2")
+        .apply();
+
+    Assert.assertEquals(Sets.newHashSet(1), newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testMoveIdentifierField() {
+    Schema schemaWithIdentifierFields = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .allowIncompatibleChanges()
+        .addIdentifierField("id")
+        .apply();
+
+    Schema newSchema = new SchemaUpdate(schemaWithIdentifierFields, SCHEMA_LAST_COLUMN_ID)
+        .moveAfter("id", "locations")
+        .apply();
+    Assert.assertEquals(Sets.newHashSet(1), newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(schemaWithIdentifierFields, SCHEMA_LAST_COLUMN_ID)
+        .moveBefore("id", "locations")
+        .apply();
+    Assert.assertEquals(Sets.newHashSet(1), newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(schemaWithIdentifierFields, SCHEMA_LAST_COLUMN_ID)
+        .moveFirst("id")
+        .apply();
+    Assert.assertEquals(Sets.newHashSet(1), newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testRemoveIdentifierField() {
+    Schema schemaWithIdentifierFields = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .allowIncompatibleChanges()
+        .addIdentifierField("id")
+        .apply();
+
+    AssertHelpers.assertThrows("Should fail if deleting column before deleting it as identifier field",
+        IllegalArgumentException.class,
+        "Cannot delete a column that is an identifier field",

Review comment:
       Does this error message include the column name that the user wanted to delete?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r615317012



##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -317,6 +333,38 @@ public UpdateSchema unionByNameWith(Schema newSchema) {
     return this;
   }
 
+  @Override
+  public UpdateSchema addIdentifierField(String name) {
+    Types.NestedField field = schema.findField(name);
+    if (field == null) {
+      field = adds.get(TABLE_ROOT_ID).stream()
+          .filter(f -> f.name().equals(name))
+          .findAny().orElse(null);
+    }
+
+    Preconditions.checkArgument(field != null,
+        "Cannot find column of the given name in existing or newly added columns: %s", name);
+    Preconditions.checkArgument(field.isRequired(),
+        "Cannot add column %s as an identifier field because it is not a required field", name);
+    Preconditions.checkArgument(field.type().isPrimitiveType(),
+        "Cannot add column %s as an identifier field because it is not a primitive field", name);
+    Preconditions.checkArgument(!identifierFieldNames.contains(name),
+        "Cannot add column %s as an identifier field because it already is", name);
+    identifierFieldNames.add(name);

Review comment:
       I think this should also check that the column isn't nested within a map or list because those contain repeated keys, values, or elements. Anything within a repeated field can't be part of an identifier because there is more than one.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 edited a comment on pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

jackye1995 edited a comment on pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#issuecomment-824442288


   @rdblue @openinx sorry for the delayed update, I tried multiple different ways to implement this and went back and forth, and in the end the cleanest one I found was to use name to record all the identifier fields set, and at `applyChange` time, deletion of identifier column is checked before generating the new schema struct to validate deletion of identifier field, other validations are done after the new schema struct is generated.
   
   Ryan talked about using 2 sets, one for existing IDs and one for new names. The issue with that approach was that the validation checks done for the 2 sets are very similar. For existing IDs, I need to check recursively that all the parents of the field up to root are struct, which can be done using `idToParent` index. For new names, I need to do something very similar, but also merge the `IdToParent` with the parent information of newly added fields. So in the end the code was very similar to just creating a new schema and recreating the `IdToParent` index and a bit redundant. Combining those two facts, I think it is much cleaner to just store everything as name and then validate name and resolve back to ID after the new schema struct is generated.
   
   Ryan also talked about building an index of added name to ID when adding new columns, and with this approach that is not needed because the new schema would have that index when calling `findField`. I think that is a win to not increase complexity in `addColumn` and avoid new indexes in `UpdateSchema`.
   
   I can also do the `deletes` check in the validation function, but in the end I chose to do it in a separated code block before generating the schema struct, because doing it in the validation method requires passing in many arguments, and that flow was not intuitive to code readers, it only increased the complexity of the code and reduced readability.
   
   There are some other methods I tried, such as filtering all potential identifier fields in schema and then check if the identifiers are in that potential set. But that approach could not produce informative error message so I did not go with that route.
   
   I have added some more tests to cover all possible use cases, please let me know if you have any additional concern with the approach.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r620700102



##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -408,11 +428,50 @@ private TableMetadata applyChangesToMapping(TableMetadata metadata) {
   private static Schema applyChanges(Schema schema, List<Integer> deletes,
                                      Map<Integer, Types.NestedField> updates,
                                      Multimap<Integer, Types.NestedField> adds,
-                                     Multimap<Integer, Move> moves) {
+                                     Multimap<Integer, Move> moves,
+                                     Set<String> identifierNames) {
+    // validate existing identifier fields are not deleted
+    for (String name : identifierNames) {
+      Types.NestedField field = schema.findField(name);
+      if (field != null) {
+        Preconditions.checkArgument(!deletes.contains(field.fieldId()),
+            "Cannot delete identifier field %s. To force deletion, " +
+                "also call setIdentifierFields to update identifier fields.", field);
+      }
+    }
+
+    // apply schema changes
     Types.StructType struct = TypeUtil
         .visit(schema, new ApplyChanges(deletes, updates, adds, moves))
         .asNestedType().asStructType();
-    return new Schema(struct.fields());
+
+    // validate identifier requirements based on latest schema
+    Schema noIdentifierSchema = new Schema(struct.fields());
+    Set<Integer> validatedIdentifiers = identifierNames.stream()
+        .map(n -> validateIdentifierField(n, noIdentifierSchema))
+        .collect(Collectors.toSet());
+
+    return new Schema(struct.fields(), validatedIdentifiers);
+  }
+
+  private static int validateIdentifierField(String name, Schema schema) {

Review comment:
       I don't think that this method is quite right. Each time it is called, it will index the parents in the schema that is passed in, and that schema was only created so that `findFIeld` would work. It also mixes validation with returning the field IDs of each identifier field.
   
   I think it would be cleaner to create the parent and name index just once. And there is no need to create a schema just to index the names. It would also be better to create the list of identifier IDs, then validate the ID list separately.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r615319641



##########
File path: core/src/test/java/org/apache/iceberg/TestSchemaUpdate.java
##########
@@ -1223,4 +1223,98 @@ public void testMoveBetweenStructsFails() {
                 .moveBefore("s2.x", "s1.a")
                 .apply());
   }
+
+  @Test
+  public void testAddIdentifierField() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .allowIncompatibleChanges()
+        .addIdentifierField("id")
+        .addRequiredColumn("id2", Types.StringType.get())
+        .addIdentifierField("id2")
+        .apply();
+    Assert.assertEquals(Sets.newHashSet(1, SCHEMA_LAST_COLUMN_ID + 1), newSchema.identifierFieldIds());
+
+    AssertHelpers.assertThrows("Should fail if column with name not exist",
+        IllegalArgumentException.class,
+        "Cannot find column of the given name in existing or newly added columns",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID).addIdentifierField("unknown").apply());
+
+    AssertHelpers.assertThrows("Should fail if column with name not exist",
+        IllegalArgumentException.class,
+        "it is not a required field",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID).addIdentifierField("data").apply());
+
+    AssertHelpers.assertThrows("Should fail if column with name not exist",
+        IllegalArgumentException.class,
+        "it is not a primitive field",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID).addIdentifierField("locations").apply());
+  }
+
+  @Test
+  public void testRenameIdentifierField() {
+    Schema schemaWithIdentifierFields = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .allowIncompatibleChanges()
+        .addIdentifierField("id")
+        .apply();
+
+    Schema newSchema = new SchemaUpdate(schemaWithIdentifierFields, SCHEMA_LAST_COLUMN_ID)
+        .renameColumn("id", "id2")
+        .apply();
+
+    Assert.assertEquals(Sets.newHashSet(1), newSchema.identifierFieldIds());

Review comment:
       Also, I think that assertions should almost always have context strings. Here, I would expect something like "rename should not change identifier field ids".




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r617119193



##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -317,6 +320,31 @@ public UpdateSchema unionByNameWith(Schema newSchema) {
     return this;
   }
 
+  @Override
+  public UpdateSchema setIdentifierFields(Set<String> names) {
+    identifierFields = Sets.newHashSet();

Review comment:
       When setting an instance field, we prefix with `this.` so that it is clear that we're initializing state with a larger scope than just the method.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r620692694



##########
File path: api/src/main/java/org/apache/iceberg/Schema.java
##########
@@ -158,6 +186,33 @@ public StructType asStruct() {
     return struct.fields();
   }
 
+  /**
+   * The set of identifier field IDs.
+   * <p>
+   * Identifier is a concept similar to primary key in a relational database system.
+   * A row should be unique in a table based on the values of the identifier fields.
+   * However, unlike a primary key, Iceberg identifier differs in the following ways:
+   * <ul>
+   * <li>Iceberg does not enforce the uniqueness of a row based on this identifier information.
+   * It is used for operations like upsert to define the default upsert key.</li>
+   * <li>NULL can be used as value of an identifier field. Iceberg ensures null-safe equality check.</li>
+   * <li>A nested field in a struct can be used as an identifier. For example, if there is a "last_name" field
+   * inside a "user" struct in a schema, field "user.last_name" can be set as a part of the identifier field.</li>
+   * </ul>
+   * <p>
+   * A field can be used as a part of the identifier only if:
+   * <ul>
+   * <li>its type is primitive</li>
+   * <li>it exists in the current schema, or have been added in this update</li>

Review comment:
       This point doesn't make much sense here because there is no "this update" and the "current schema" is this schema. You could say that these field IDs are guaranteed to be non-repeated primitive fields in this schema.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r612816624



##########
File path: core/src/main/java/org/apache/iceberg/SchemaParser.java
##########
@@ -40,6 +41,7 @@ private SchemaParser() {
   }
 
   private static final String SCHEMA_ID = "schema-id";
+  private static final String ROW_IDENTIFIERS = "row-identifiers";

Review comment:
       As noted above, I think this should be `identifier-field-ids`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r615319518



##########
File path: core/src/test/java/org/apache/iceberg/TestSchemaUpdate.java
##########
@@ -1223,4 +1223,98 @@ public void testMoveBetweenStructsFails() {
                 .moveBefore("s2.x", "s1.a")
                 .apply());
   }
+
+  @Test
+  public void testAddIdentifierField() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .allowIncompatibleChanges()
+        .addIdentifierField("id")
+        .addRequiredColumn("id2", Types.StringType.get())
+        .addIdentifierField("id2")
+        .apply();
+    Assert.assertEquals(Sets.newHashSet(1, SCHEMA_LAST_COLUMN_ID + 1), newSchema.identifierFieldIds());
+
+    AssertHelpers.assertThrows("Should fail if column with name not exist",
+        IllegalArgumentException.class,
+        "Cannot find column of the given name in existing or newly added columns",

Review comment:
       What is the given name? I think this error message should be "Cannot find column in existing or new columns: unknown"




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r616244656



##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -317,6 +333,38 @@ public UpdateSchema unionByNameWith(Schema newSchema) {
     return this;
   }
 
+  @Override
+  public UpdateSchema addIdentifierField(String name) {
+    Types.NestedField field = schema.findField(name);
+    if (field == null) {
+      field = adds.get(TABLE_ROOT_ID).stream()
+          .filter(f -> f.name().equals(name))
+          .findAny().orElse(null);
+    }
+
+    Preconditions.checkArgument(field != null,
+        "Cannot find column of the given name in existing or newly added columns: %s", name);
+    Preconditions.checkArgument(field.isRequired(),
+        "Cannot add column %s as an identifier field because it is not a required field", name);

Review comment:
       Sounds good, I was trying to match the primary key spec in relational SQL, but I agree that in our case of upsert null still makes sense. I will update the logic.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r615318360



##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -317,6 +333,38 @@ public UpdateSchema unionByNameWith(Schema newSchema) {
     return this;
   }
 
+  @Override
+  public UpdateSchema addIdentifierField(String name) {
+    Types.NestedField field = schema.findField(name);
+    if (field == null) {
+      field = adds.get(TABLE_ROOT_ID).stream()
+          .filter(f -> f.name().equals(name))
+          .findAny().orElse(null);
+    }
+
+    Preconditions.checkArgument(field != null,
+        "Cannot find column of the given name in existing or newly added columns: %s", name);
+    Preconditions.checkArgument(field.isRequired(),
+        "Cannot add column %s as an identifier field because it is not a required field", name);
+    Preconditions.checkArgument(field.type().isPrimitiveType(),
+        "Cannot add column %s as an identifier field because it is not a primitive field", name);
+    Preconditions.checkArgument(!identifierFieldNames.contains(name),
+        "Cannot add column %s as an identifier field because it already is", name);
+    identifierFieldNames.add(name);
+    return this;
+  }
+
+  @Override
+  public UpdateSchema removeIdentifierField(String name) {
+    Types.NestedField field = schema.findField(name);
+    Preconditions.checkArgument(field != null,
+        "Cannot find column of name %s in existing columns of schema %s", name, schema);
+    Preconditions.checkArgument(identifierFieldNames.contains(name),
+        "Cannot remove column from identifier fields %s because it is not an identifier field", name);

Review comment:
       Same comments here about using "because" and a long sentence instead of a simple `:`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r615319859



##########
File path: core/src/test/java/org/apache/iceberg/TestSchemaUpdate.java
##########
@@ -1223,4 +1223,98 @@ public void testMoveBetweenStructsFails() {
                 .moveBefore("s2.x", "s1.a")
                 .apply());
   }
+
+  @Test
+  public void testAddIdentifierField() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .allowIncompatibleChanges()
+        .addIdentifierField("id")
+        .addRequiredColumn("id2", Types.StringType.get())
+        .addIdentifierField("id2")
+        .apply();
+    Assert.assertEquals(Sets.newHashSet(1, SCHEMA_LAST_COLUMN_ID + 1), newSchema.identifierFieldIds());
+
+    AssertHelpers.assertThrows("Should fail if column with name not exist",
+        IllegalArgumentException.class,
+        "Cannot find column of the given name in existing or newly added columns",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID).addIdentifierField("unknown").apply());
+
+    AssertHelpers.assertThrows("Should fail if column with name not exist",
+        IllegalArgumentException.class,
+        "it is not a required field",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID).addIdentifierField("data").apply());
+
+    AssertHelpers.assertThrows("Should fail if column with name not exist",
+        IllegalArgumentException.class,
+        "it is not a primitive field",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID).addIdentifierField("locations").apply());
+  }
+
+  @Test
+  public void testRenameIdentifierField() {
+    Schema schemaWithIdentifierFields = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .allowIncompatibleChanges()
+        .addIdentifierField("id")
+        .apply();
+
+    Schema newSchema = new SchemaUpdate(schemaWithIdentifierFields, SCHEMA_LAST_COLUMN_ID)
+        .renameColumn("id", "id2")
+        .apply();
+
+    Assert.assertEquals(Sets.newHashSet(1), newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testMoveIdentifierField() {
+    Schema schemaWithIdentifierFields = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .allowIncompatibleChanges()
+        .addIdentifierField("id")
+        .apply();
+
+    Schema newSchema = new SchemaUpdate(schemaWithIdentifierFields, SCHEMA_LAST_COLUMN_ID)
+        .moveAfter("id", "locations")
+        .apply();
+    Assert.assertEquals(Sets.newHashSet(1), newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(schemaWithIdentifierFields, SCHEMA_LAST_COLUMN_ID)
+        .moveBefore("id", "locations")
+        .apply();
+    Assert.assertEquals(Sets.newHashSet(1), newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(schemaWithIdentifierFields, SCHEMA_LAST_COLUMN_ID)
+        .moveFirst("id")
+        .apply();
+    Assert.assertEquals(Sets.newHashSet(1), newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testRemoveIdentifierField() {
+    Schema schemaWithIdentifierFields = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .allowIncompatibleChanges()
+        .addIdentifierField("id")
+        .apply();
+
+    AssertHelpers.assertThrows("Should fail if deleting column before deleting it as identifier field",
+        IllegalArgumentException.class,
+        "Cannot delete a column that is an identifier field",
+        () -> new SchemaUpdate(schemaWithIdentifierFields, SCHEMA_LAST_COLUMN_ID)
+            .deleteColumn("id").apply());
+
+    Schema newSchema = new SchemaUpdate(schemaWithIdentifierFields, SCHEMA_LAST_COLUMN_ID)
+        .removeIdentifierField("id")
+        .apply();
+    Assert.assertEquals(Sets.newHashSet(), newSchema.identifierFieldIds());
+
+    AssertHelpers.assertThrows("Should fail if column with name not exist",
+        IllegalArgumentException.class,
+        "Cannot find column of name",
+        () -> new SchemaUpdate(schemaWithIdentifierFields, SCHEMA_LAST_COLUMN_ID)
+            .removeIdentifierField("unknown").apply());

Review comment:
       I think that this should be a separate test case. This is testing that you can't remove an unknown identifier field. The code above this point tests the behavior of `deleteColumn` with an identifier field, then is a basic test of `removeIdentifierField`. Those are both separate cases as well.
   
   Can you separate each idea into its own test case? That makes tests much easier to maintain and to find out what's going wrong when they fail because they are only testing one specific case. In addition, it also makes more test failures visible because one test failure won't prevent another test case from running.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r615319469



##########
File path: core/src/test/java/org/apache/iceberg/TestSchemaUpdate.java
##########
@@ -1223,4 +1223,98 @@ public void testMoveBetweenStructsFails() {
                 .moveBefore("s2.x", "s1.a")
                 .apply());
   }
+
+  @Test
+  public void testAddIdentifierField() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .allowIncompatibleChanges()
+        .addIdentifierField("id")
+        .addRequiredColumn("id2", Types.StringType.get())
+        .addIdentifierField("id2")
+        .apply();
+    Assert.assertEquals(Sets.newHashSet(1, SCHEMA_LAST_COLUMN_ID + 1), newSchema.identifierFieldIds());
+
+    AssertHelpers.assertThrows("Should fail if column with name not exist",
+        IllegalArgumentException.class,
+        "Cannot find column of the given name in existing or newly added columns",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID).addIdentifierField("unknown").apply());
+
+    AssertHelpers.assertThrows("Should fail if column with name not exist",
+        IllegalArgumentException.class,
+        "it is not a required field",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID).addIdentifierField("data").apply());
+
+    AssertHelpers.assertThrows("Should fail if column with name not exist",
+        IllegalArgumentException.class,
+        "it is not a primitive field",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID).addIdentifierField("locations").apply());
+  }
+
+  @Test
+  public void testRenameIdentifierField() {
+    Schema schemaWithIdentifierFields = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .allowIncompatibleChanges()
+        .addIdentifierField("id")
+        .apply();
+
+    Schema newSchema = new SchemaUpdate(schemaWithIdentifierFields, SCHEMA_LAST_COLUMN_ID)
+        .renameColumn("id", "id2")
+        .apply();
+
+    Assert.assertEquals(Sets.newHashSet(1), newSchema.identifierFieldIds());

Review comment:
       Would it be better to use `Sets.newHashSet(SCHEMA.findField("id").fieldId())` instead of assuming the ID?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r617736448



##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -317,6 +333,38 @@ public UpdateSchema unionByNameWith(Schema newSchema) {
     return this;
   }
 
+  @Override
+  public UpdateSchema addIdentifierField(String name) {
+    Types.NestedField field = schema.findField(name);
+    if (field == null) {
+      field = adds.get(TABLE_ROOT_ID).stream()
+          .filter(f -> f.name().equals(name))
+          .findAny().orElse(null);

Review comment:
       @openinx, that's what I was thinking at first as well, but I don't think that we need a version with parent now.
   
   The reason is that once a column is added, it is uniquely identified by a dotted name: `a.b` might be a struct named `"a` with a field named `b` or might be a field named `a.b`, but it cannot be both. We guarantee this so that we never have ambiguous names.
   
   Because we know that the string passed in here will always identify just one field, we don't need to handle the ambiguous case by adding an explicit parent like we do with `addColumn`.
   
   I think that means we can have a much nicer API:
   * `setIdentifierFields(String...)` and `setIdentifierFields(Collection<String>)`
   * `addIdentifierField(String)`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#issuecomment-827254021


   @rdblue updated based on your comments. For `Set` versus `List` or anything else in the API, please let me know what you think should be the general guidance for this type of change in Iceberg.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r616245321



##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -193,6 +204,11 @@ public UpdateSchema renameColumn(String name, String newName) {
       updates.put(fieldId, Types.NestedField.of(fieldId, field.isOptional(), newName, field.type(), field.doc()));
     }
 
+    if (identifierFieldNames.contains(name)) {

Review comment:
       Yeah, I was expecting the lines above to catch this, but surprisingly it did not do this check. I can put up another small PR to fix it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r617124334



##########
File path: core/src/test/java/org/apache/iceberg/TestSchemaUpdate.java
##########
@@ -1223,4 +1223,173 @@ public void testMoveBetweenStructsFails() {
                 .moveBefore("s2.x", "s1.a")
                 .apply());
   }
+
+  @Test
+  public void testSetExistingIdentifierFields() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("id"))
+        .apply();
+
+    Assert.assertEquals("add an existing field as identifier field should succeed",
+        Sets.newHashSet(newSchema.findField("id").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testSetNewIdentifierFieldColumns() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new_field", Types.StringType.get())
+        .setIdentifierFields(Sets.newHashSet("id", "new_field"))
+        .apply();
+
+    Assert.assertEquals("add a new field as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("id").fieldId(), newSchema.findField("new_field").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testAddDottedIdentifierFieldColumns() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID + 1)
+        .addColumn(null, "dot.field", Types.StringType.get())

Review comment:
       I think we also need a test where `dot` is a struct with a field named `field`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r612604259



##########
File path: core/src/test/java/org/apache/iceberg/TestSchemaUpdate.java
##########
@@ -1223,4 +1223,98 @@ public void testMoveBetweenStructsFails() {
                 .moveBefore("s2.x", "s1.a")
                 .apply());
   }
+
+  @Test
+  public void testAddRowIdentifier() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .allowIncompatibleChanges()
+        .addRowIdentifier("id")
+        .addRequiredColumn("id2", Types.StringType.get())
+        .addRowIdentifier("id2")
+        .apply();
+    Assert.assertEquals(Sets.newHashSet(1, SCHEMA_LAST_COLUMN_ID + 1), newSchema.rowIdentifiers());
+
+    AssertHelpers.assertThrows("Should fail if column with name not exist",
+        IllegalArgumentException.class,
+        "Cannot find column of the given name in existing or newly added columns",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID).addRowIdentifier("unknown").apply());
+
+    AssertHelpers.assertThrows("Should fail if column with name not exist",
+        IllegalArgumentException.class,
+        "it is not a required field",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID).addRowIdentifier("data").apply());
+
+    AssertHelpers.assertThrows("Should fail if column with name not exist",
+        IllegalArgumentException.class,
+        "it is not a primitive field",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID).addRowIdentifier("locations").apply());
+  }
+
+  @Test
+  public void testRenameRowIdentifier() {
+    Schema schemaWithRowIdentifier = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .allowIncompatibleChanges()
+        .addRowIdentifier("id")
+        .apply();
+
+    Schema newSchema = new SchemaUpdate(schemaWithRowIdentifier, SCHEMA_LAST_COLUMN_ID)
+        .renameColumn("id", "id2")
+        .apply();
+
+    Assert.assertEquals(Sets.newHashSet(1), newSchema.rowIdentifiers());
+  }
+

Review comment:
       I think what you describe is `Should fail if deleting column before deleting it as row identifier` case in `testDeleteRowIdentifier` test, please let me know if I misunderstood.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue merged pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue merged pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r615319289



##########
File path: core/src/main/java/org/apache/iceberg/util/JsonUtil.java
##########
@@ -117,18 +119,67 @@ public static String getStringOrNull(String property, JsonNode node) {
 
   public static List<String> getStringList(String property, JsonNode node) {
     Preconditions.checkArgument(node.has(property), "Cannot parse missing list %s", property);
-    JsonNode pNode = node.get(property);
-    Preconditions.checkArgument(pNode != null && !pNode.isNull() && pNode.isArray(),
-        "Cannot parse %s from non-array value: %s", property, pNode);
-
-    ImmutableList.Builder<String> builder = ImmutableList.builder();
-    Iterator<JsonNode> elements = pNode.elements();
-    while (elements.hasNext()) {
-      JsonNode element = elements.next();
-      Preconditions.checkArgument(element.isTextual(),
-          "Cannot parse string from non-text value: %s", element);
-      builder.add(element.asText());
+    return ImmutableList.<String>builder()
+        .addAll(new JsonStringArrayVisitor(property, node))
+        .build();
+  }
+
+  public static Set<Integer> getIntegerSetOrNull(String property, JsonNode node) {
+    if (!node.has(property)) {
+      return null;
+    }
+
+    return ImmutableSet.<Integer>builder()
+        .addAll(new JsonIntegerArrayVisitor(property, node))
+        .build();
+  }
+
+  abstract static class JsonArrayIterator<T> implements Iterator<T> {
+
+    private final Iterator<JsonNode> elements;
+
+    JsonArrayIterator(String property, JsonNode node) {
+      JsonNode pNode = node.get(property);
+      Preconditions.checkArgument(pNode != null && !pNode.isNull() && pNode.isArray(),
+          "Cannot parse %s from non-array value: %s", property, pNode);
+      this.elements = pNode.elements();
+    }
+
+    @Override
+    public boolean hasNext() {
+      return elements().hasNext();
+    }
+
+    Iterator<JsonNode> elements() {

Review comment:
       Rather than exposing elements, I think it is a bit cleaner to implement `next` here and add an abstract `T convert(JsonNode element)` method that the sub-classes implement. That way they don't need to know about `elements` or call `next` manually.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] openinx commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

openinx commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r612908946



##########
File path: api/src/main/java/org/apache/iceberg/Schema.java
##########
@@ -46,34 +48,53 @@
  */
 public class Schema implements Serializable {
   private static final Joiner NEWLINE = Joiner.on('\n');
+  private static final Joiner COMMA = Joiner.on(',');
   private static final String ALL_COLUMNS = "*";
   private static final int DEFAULT_SCHEMA_ID = 0;
 
   private final StructType struct;
   private final int schemaId;
+  private final int[] identifierFieldIds;
+
   private transient BiMap<String, Integer> aliasToId = null;
   private transient Map<Integer, NestedField> idToField = null;
   private transient Map<String, Integer> nameToId = null;
   private transient Map<String, Integer> lowerCaseNameToId = null;
   private transient Map<Integer, Accessor<StructLike>> idToAccessor = null;
   private transient Map<Integer, String> idToName = null;
+  private transient Set<Integer> identifierFieldIdSet = null;
 
   public Schema(List<NestedField> columns, Map<String, Integer> aliases) {
+    this(columns, aliases, ImmutableSet.of());
+  }
+
+  public Schema(List<NestedField> columns, Map<String, Integer> aliases, Set<Integer> identifierFieldIds) {

Review comment:
       Nit:  Is it possible to reuse the construct logics from this [constructor](https://github.com/apache/iceberg/pull/2465/files#diff-67561e1c7140f30c816a0934fa2f6beea66883a4b25b37b5cc9ca6b3cd034681R93) ? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r620700484



##########
File path: core/src/test/java/org/apache/iceberg/TestSchemaUpdate.java
##########
@@ -1223,4 +1223,232 @@ public void testMoveBetweenStructsFails() {
                 .moveBefore("s2.x", "s1.a")
                 .apply());
   }
+
+  @Test
+  public void testAddExistingIdentifierFields() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("id"))
+        .apply();
+
+    Assert.assertEquals("add an existing field as identifier field should succeed",
+        Sets.newHashSet(newSchema.findField("id").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testAddNewIdentifierFieldColumns() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new_field", Types.StringType.get())
+        .setIdentifierFields(Sets.newHashSet("id", "new_field"))
+        .apply();
+
+    Assert.assertEquals("add column then set as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("id").fieldId(), newSchema.findField("new_field").fieldId()),
+        newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("id", "new_field"))
+        .addColumn("new_field", Types.StringType.get())
+        .apply();
+
+    Assert.assertEquals("set identifier then add column should succeed",
+        Sets.newHashSet(newSchema.findField("id").fieldId(), newSchema.findField("new_field").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testAddNestedIdentifierFieldColumns() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("preferences.feature1"))
+        .apply();
+
+    Assert.assertEquals("set existing nested field as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("preferences.feature1").fieldId()),
+        newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new", Types.StructType.of(
+            Types.NestedField.optional(SCHEMA_LAST_COLUMN_ID + 1, "field", Types.StringType.get())
+        ))
+        .setIdentifierFields(Sets.newHashSet("new.field"))
+        .apply();
+
+    Assert.assertEquals("set newly added nested field as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("new.field").fieldId()),
+        newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new", Types.StructType.of(
+            Types.NestedField.optional(SCHEMA_LAST_COLUMN_ID + 1, "field", Types.StructType.of(
+                Types.NestedField.optional(SCHEMA_LAST_COLUMN_ID + 2, "nested", Types.StringType.get())))))
+        .setIdentifierFields(Sets.newHashSet("new.field.nested"))
+        .apply();
+
+    Assert.assertEquals("set newly added multi-layer nested field as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("new.field.nested").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testAddDottedIdentifierFieldColumns() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn(null, "dot.field", Types.StringType.get())
+        .setIdentifierFields(Sets.newHashSet("id", "dot.field"))
+        .apply();
+
+    Assert.assertEquals("add a field with dot as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("id").fieldId(), newSchema.findField("dot.field").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testRemoveIdentifierFields() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new_field", Types.StringType.get())
+        .addColumn("new_field2", Types.StringType.get())
+        .setIdentifierFields(Sets.newHashSet("id", "new_field", "new_field2"))
+        .apply();
+
+    newSchema = new SchemaUpdate(newSchema, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("new_field", "new_field2"))
+        .apply();
+
+    Assert.assertEquals("remove an identifier field should succeed",
+        Sets.newHashSet(newSchema.findField("new_field").fieldId(), newSchema.findField("new_field2").fieldId()),
+        newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(newSchema, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet())
+        .apply();
+
+    Assert.assertEquals("remove all identifier fields should succeed",
+        Sets.newHashSet(),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testSetIdentifierFieldsFails() {
+    AssertHelpers.assertThrows("add a field with name not exist should fail",
+        IllegalArgumentException.class,
+        "not found in current schema or added columns",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+            .setIdentifierFields(Sets.newHashSet("unknown"))
+            .apply());
+
+    AssertHelpers.assertThrows("add a field of non-primitive type should fail",
+        IllegalArgumentException.class,
+        "not a primitive type field",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+            .setIdentifierFields(Sets.newHashSet("locations"))
+            .apply());
+
+    AssertHelpers.assertThrows("add a nested field in map should fail",
+        IllegalArgumentException.class,
+        "must not be nested in " + SCHEMA.findField("locations"),
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+            .setIdentifierFields(Sets.newHashSet("locations.key.zip"))
+            .apply());
+
+    AssertHelpers.assertThrows("add a nested field in list should fail",
+        IllegalArgumentException.class,
+        "must not be nested in " + SCHEMA.findField("points"),
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+            .setIdentifierFields(Sets.newHashSet("points.element.x"))
+            .apply());
+
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new", Types.StructType.of(
+            Types.NestedField.optional(SCHEMA_LAST_COLUMN_ID + 1, "fields", Types.ListType.ofOptional(
+                SCHEMA_LAST_COLUMN_ID + 2, Types.StructType.of(
+                    Types.NestedField.optional(SCHEMA_LAST_COLUMN_ID + 3, "nested", Types.StringType.get())
+                ))
+            )
+        ))
+        .apply();
+
+    AssertHelpers.assertThrows("add a nested field in struct of a map should fail",
+        IllegalArgumentException.class,
+        "must not be nested in " + newSchema.findField("new.fields"),
+        () -> new SchemaUpdate(newSchema, SCHEMA_LAST_COLUMN_ID + 3)
+            .setIdentifierFields(Sets.newHashSet("new.fields.element.nested"))
+            .apply());
+  }
+
+  @Test
+  public void testDeleteIdentifierFieldColumns() {
+    Schema schemaWithIdentifierFields = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("id"))
+        .apply();
+
+    Assert.assertEquals("delete column and then reset identifier field should succeed",
+        Sets.newHashSet(),
+        new SchemaUpdate(schemaWithIdentifierFields, SCHEMA_LAST_COLUMN_ID)
+            .deleteColumn("id").setIdentifierFields(Sets.newHashSet()).apply()
+            .identifierFieldIds());
+
+    Assert.assertEquals("delete reset identifier field and then delete column should succeed",
+        Sets.newHashSet(),
+        new SchemaUpdate(schemaWithIdentifierFields, SCHEMA_LAST_COLUMN_ID)
+            .setIdentifierFields(Sets.newHashSet()).deleteColumn("id").apply()
+            .identifierFieldIds());
+  }
+
+  @Test
+  public void testDeleteIdentifierFieldColumnsFails() {
+    Schema schemaWithIdentifierFields = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("id"))
+        .apply();
+
+    AssertHelpers.assertThrows("delete an identifier column without setting identifier fields should fail",
+        IllegalArgumentException.class,
+        "Cannot delete identifier field 1: id: required int. To force deletion, " +
+            "also call setIdentifierFields to update identifier fields.",
+        () -> new SchemaUpdate(schemaWithIdentifierFields, SCHEMA_LAST_COLUMN_ID).deleteColumn("id").apply());
+  }
+
+  @Test
+  public void testRenameIdentifierFields() {
+    Schema schemaWithIdentifierFields = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("id"))
+        .apply();
+
+    Schema newSchema = new SchemaUpdate(schemaWithIdentifierFields, SCHEMA_LAST_COLUMN_ID)
+        .renameColumn("id", "id2")
+        .apply();
+
+    Assert.assertEquals("rename should not affect identifier fields",
+        Sets.newHashSet(SCHEMA.findField("id").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testMoveIdentifierFields() {
+    Schema schemaWithIdentifierFields = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .allowIncompatibleChanges()

Review comment:
       This isn't needed.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] openinx commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

openinx commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r618143119



##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -59,14 +62,17 @@
   private final Multimap<Integer, Move> moves = Multimaps.newListMultimap(Maps.newHashMap(), Lists::newArrayList);
   private int lastColumnId;
   private boolean allowIncompatibleChanges = false;
-
+  private Set<String> identifierNames;
 
   SchemaUpdate(TableOperations ops) {
     this.ops = ops;
     this.base = ops.current();
     this.schema = base.schema();
     this.lastColumnId = base.lastColumnId();
     this.idToParent = Maps.newHashMap(TypeUtil.indexParents(schema.asStruct()));
+    this.identifierNames = schema.identifierFieldIds().stream()

Review comment:
       Nit:  It may be unrelated to this PR but I think it's good to reuse this two constructor into one so that we could have a simpler version of code: 
   
   ```java
   SchemaUpdate(TableOperations ops) {
       this(ops, ops.current(), ops.current().schema(), ops.current().lastColumnId());
     }
   
     /**
      * For testing only.
      */
     SchemaUpdate(Schema schema, int lastColumnId) {
       this(null, null, schema, lastColumnId);
     }
   
     private SchemaUpdate(TableOperations ops, TableMetadata base, Schema schema, int lastColumnId) {
       this.ops = ops;
       this.base = base;
       this.schema = schema;
       this.lastColumnId = lastColumnId;
       this.idToParent = Maps.newHashMap(TypeUtil.indexParents(schema.asStruct()));
       this.identifierNames = schema.identifierFieldIds().stream()
           .map(id -> schema.findField(id).name())
           .collect(Collectors.toSet());
     }
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r615320049



##########
File path: core/src/test/java/org/apache/iceberg/TestTableMetadata.java
##########
@@ -646,6 +646,34 @@ public void testUpdateSortOrder() {
         NullOrder.NULLS_FIRST, sortedByX.sortOrder().fields().get(0).nullOrder());
   }
 
+  @Test
+  public void testParseSchemaIdentifierFields() throws Exception {
+    String data = readTableMetadataInputFile("TableMetadataV2Valid.json");
+    TableMetadata parsed = TableMetadataParser.fromJson(
+        ops.io(), null, JsonUtil.mapper().readValue(data, JsonNode.class));
+    Assert.assertEquals(Sets.newHashSet(), parsed.schemasById().get(0).identifierFieldIds());
+    Assert.assertEquals(Sets.newHashSet(1, 2), parsed.schemasById().get(1).identifierFieldIds());
+  }
+
+  @Test
+  public void testUpdateSchemaIdentifierFields() {
+    Schema schema = new Schema(
+        Types.NestedField.required(10, "x", Types.StringType.get())
+    );
+
+    TableMetadata meta = TableMetadata.newTableMetadata(
+        schema, PartitionSpec.unpartitioned(), null, ImmutableMap.of());
+    Assert.assertTrue("Should default to no identifier field", meta.schema().identifierFieldIds().isEmpty());

Review comment:
       I don't think this is correct, or if it is then it isn't testing `TableMetadata`. The `newTableMetadata` result should use whatever identifier fields were set in the schema. So this is really validating that the schema didn't create identifier fields without them being passed in.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r617122753



##########
File path: api/src/main/java/org/apache/iceberg/UpdateSchema.java
##########
@@ -384,4 +385,21 @@ default UpdateSchema updateColumn(String name, Type.PrimitiveType newType, Strin
    *                                  with other additions, renames, or updates.
    */
   UpdateSchema unionByNameWith(Schema newSchema);
+
+  /**
+   * Set the identifier fields given a set of column names.
+   * <p>
+   * Identifier field is a similar concept as primary key in a relational database system.
+   * A row should be unique based on the values of the identifier fields.
+   * However, unlike a primary key, Iceberg does not enforce the uniqueness of a row based on this information.
+   * It is used for operations like upsert to define the default upsert key.
+   * <p>
+   * A column in the identifier fields must be of a primitive type.
+   * Each column set in this operation must exist in the current schema to update,
+   * or is added as a part of this update.
+   *
+   * @param names names of the columns to set as identifier fields
+   * @return this for method chaining
+   */
+  UpdateSchema setIdentifierFields(Set<String> names);

Review comment:
       We normally also add a `String...` variant for convenience so you don't need to call `setIdentifierFields(ImmutableSet.of("a", "b"))`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 edited a comment on pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

jackye1995 edited a comment on pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#issuecomment-824442288


   @rdblue @openinx sorry for the delayed update, I tried multiple different ways to implement this and went back and forth, and in the end the cleanest one I found was to use name to record all the identifier fields set, and at `applyChange` time, deletion of identifier column is checked before generating the new schema struct to validate deletion of identifier field, other validations are done after the new schema struct is generated.
   
   Ryan talked about using 2 sets, one for existing IDs and one for new names. The issue with that approach was that the validation checks done for the 2 sets are very similar. For existing IDs, I need to check recursively that all the parents of the field up to root are struct, which can be done using `idToParent` index. For new names, I need to do something very similar, but also merge the `IdToParent` with the parent information newly added fields. So in the end the code was very similar to just creating a new schema and recreating the `IdToParent` index. So combining those two facts, I think it is much cleaner to just store everything as name and then validate name and resolve back to ID after the new schema struct is generated.
   
   Ryan also talked about building an index of added name to ID when adding new columns, and with this approach that is not needed because the new schema would have that index when calling `findField`. I think that is a win to not increase complexity in `addColumn` and avoid new indexes in `UpdateSchema`.
   
   I can also do the `deletes` check in the validation function, but in the end I chose to do it in a separated code block before generating the schema struct, because doing it in the validation method requires passing in many arguments, and that flow was not intuitive to code readers, it only increased the complexity of the code and reduced readability.
   
   There are some other methods I tried to implement this whole thing, such as filtering all potential identifier fields in schema and then check if the identifiers are in that potential set. But that approach could not produce informative error message so I did not go with that route.
   
   I have added some more tests to cover all possible use cases, please let me know if you have any additional concern with the approach.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r615310999



##########
File path: api/src/main/java/org/apache/iceberg/Schema.java
##########
@@ -355,9 +397,15 @@ private Schema internalSelect(Collection<String> names, boolean caseSensitive) {
 
   @Override
   public String toString() {
-    return String.format("table {\n%s\n}",
-        NEWLINE.join(struct.fields().stream()
-            .map(f -> "  " + f)
-            .collect(Collectors.toList())));
+    StringBuilder sb = new StringBuilder();
+    sb.append("table {\n");
+    sb.append("  fields {\n");
+    sb.append(NEWLINE.join(struct.fields().stream()
+        .map(f -> "    " + f)
+        .collect(Collectors.toList())));
+    sb.append("\n  }\n  identifier fields { ");
+    sb.append(COMMA.join(identifierFieldIds()));
+    sb.append(" }\n}");
+    return sb.toString();

Review comment:
       Minor: I'd prefer a smaller change to the `toString` format. Maybe something like adding `(id)` to identifier fields in the list, rather than adding the `identifier fields { ... }`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] openinx commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

openinx commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r618164919



##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -408,11 +427,55 @@ private TableMetadata applyChangesToMapping(TableMetadata metadata) {
   private static Schema applyChanges(Schema schema, List<Integer> deletes,
                                      Map<Integer, Types.NestedField> updates,
                                      Multimap<Integer, Types.NestedField> adds,
-                                     Multimap<Integer, Move> moves) {
+                                     Multimap<Integer, Move> moves,
+                                     Set<String> identifierNames) {
+    // validate existing identifier fields are not deleted
+    for (String name : identifierNames) {
+      Types.NestedField field = schema.findField(name);
+      if (field != null) {
+        Preconditions.checkArgument(!deletes.contains(field.fieldId()),
+            "Cannot delete identifier field %s. To force deletion, " +
+                "also call setIdentifierFields to update identifier fields.", field);
+      }
+    }
+
+    // apply schema changes
     Types.StructType struct = TypeUtil
         .visit(schema, new ApplyChanges(deletes, updates, adds, moves))
         .asNestedType().asStructType();
-    return new Schema(struct.fields());
+
+    // validate identifier requirements based on latest schema
+    Schema noIdentifierSchema = new Schema(struct.fields());
+    Set<Integer> validatedIdentifiers = identifierNames.stream()
+        .map(n -> validateIdentifierField(n, noIdentifierSchema))
+        .collect(Collectors.toSet());
+
+    return new Schema(struct.fields(), validatedIdentifiers);
+  }
+
+  private static int validateIdentifierField(String name, Schema schema) {
+    Types.NestedField field = schema.findField(name);
+    Preconditions.checkArgument(field != null,
+        "Cannot add field %s as an identifier field, not found in current schema or added columns");
+    Preconditions.checkArgument(field.type().isPrimitiveType(),
+        "Cannot add field %s as an identifier field: not a primitive type field", name);
+    Map<Integer, Integer> newIdToParent = TypeUtil.indexParents(schema.asStruct());
+    validateIdentifierFieldParent(field.name(), field.fieldId(), newIdToParent, schema);

Review comment:
       I would prefer to use a while loop (rather than a recursive method) to describe the parent field validation logics because that makes more clear:
   
   ```java
     private static int validateIdentifierField(String name, Schema schema) {
       Types.NestedField field = schema.findField(name);
       Preconditions.checkArgument(field != null,
           "Cannot add field %s as an identifier field, not found in current schema or added columns");
       Preconditions.checkArgument(field.type().isPrimitiveType(),
           "Cannot add field %s as an identifier field: not a primitive type field", name);
   
       // Check whether the nested column is in a chain of struct field list.
       Map<Integer, Integer> newIdToParent = TypeUtil.indexParents(schema.asStruct());
       Integer parentId = newIdToParent.get(field.fieldId());
       while (parentId != null) {
         Types.NestedField parent = schema.findField(parentId);
         ValidationException.check(parent.type().isStructType(),
             "Cannot add field %s as an identifier field: must not be nested in %s", name, parent);
         parentId = newIdToParent.get(parent.fieldId());
       }
       return field.fieldId();
     }
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r617120019



##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -317,6 +333,38 @@ public UpdateSchema unionByNameWith(Schema newSchema) {
     return this;
   }
 
+  @Override
+  public UpdateSchema addIdentifierField(String name) {
+    Types.NestedField field = schema.findField(name);
+    if (field == null) {
+      field = adds.get(TABLE_ROOT_ID).stream()
+          .filter(f -> f.name().equals(name))
+          .findAny().orElse(null);

Review comment:
       A column could be nested in a struct without being in a map or a list. For example, if you had a table of HTTP logs, you might have a column of `request` information with `request.path` and `request.headers`, along with a `response` column that has `response.code`. It would be valid to refer to `response.code` because there is only one per row. That's the case where we need the parent so we can distinguish between a top-level `response.code` and `code` within the `response` struct.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] openinx commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

openinx commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r617170445



##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -317,6 +333,38 @@ public UpdateSchema unionByNameWith(Schema newSchema) {
     return this;
   }
 
+  @Override
+  public UpdateSchema addIdentifierField(String name) {
+    Types.NestedField field = schema.findField(name);
+    if (field == null) {
+      field = adds.get(TABLE_ROOT_ID).stream()
+          .filter(f -> f.name().equals(name))
+          .findAny().orElse(null);

Review comment:
       I did not realize that we could add a single column name `response.code` in this `addColumn` API (Checked the code, sounds like the correct way to add `response.code` as top-level column is: `addColumn(null, "response.code",  type, doc)` ).
   
   If we wanna to distinguish between the nested columns  and the top-level column whose name contains DOT,  then we will need to add extra API for this , right ? 
   
   ```java
   // This is used to add columns that does not contains DOT.
   UpdateSchema setIdentifierFields(Colleciton<String> names);
   
   // This is used to add nested columns or top-level columns that contains DOT
   UpdateSchema setIdentifierFields(Collection<Pair<String, String>> parentAndNames); 
   ```
   
   

##########
File path: api/src/main/java/org/apache/iceberg/Schema.java
##########
@@ -158,6 +186,10 @@ public StructType asStruct() {
     return struct.fields();
   }
 
+  public Set<Integer> identifierFieldIds() {

Review comment:
       Nit:  Better to add a javadoc for this public API.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r612011371



##########
File path: core/src/main/java/org/apache/iceberg/SchemaParser.java
##########
@@ -158,7 +169,7 @@ public static String toJson(Schema schema, boolean pretty) {
       if (pretty) {
         generator.useDefaultPrettyPrinter();
       }
-      toJson(schema.asStruct(), generator);
+      toJson(schema.asStruct(), schema.schemaId(), schema.rowIdentifiers(), generator);

Review comment:
       not sure why this method was not updated in the past, is there any risk to use the v2 json writer?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r615313057



##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -317,6 +333,38 @@ public UpdateSchema unionByNameWith(Schema newSchema) {
     return this;
   }
 
+  @Override
+  public UpdateSchema addIdentifierField(String name) {
+    Types.NestedField field = schema.findField(name);
+    if (field == null) {
+      field = adds.get(TABLE_ROOT_ID).stream()
+          .filter(f -> f.name().equals(name))
+          .findAny().orElse(null);
+    }
+
+    Preconditions.checkArgument(field != null,
+        "Cannot find column of the given name in existing or newly added columns: %s", name);
+    Preconditions.checkArgument(field.isRequired(),
+        "Cannot add column %s as an identifier field because it is not a required field", name);

Review comment:
       I don't think that this requirement is correct. In the case where we're adding a column and then making it part of the identifier, values for all of the existing data will be null. I don't see a reason why we can't consider, for example, `(1, null)` to be an identifier.
   
   The strongest argument for making this a required field is SQL's boolean logic, where `null` is not equal to itself. But in Iceberg, we translate those expressions so that we can use normal boolean logic throughout the library. Equality deletes consider null equal to itself in Iceberg, just like set operations in SQL use null-safe equality.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r612816486



##########
File path: api/src/main/java/org/apache/iceberg/UpdateSchema.java
##########
@@ -384,4 +384,25 @@ default UpdateSchema updateColumn(String name, Type.PrimitiveType newType, Strin
    *                                  with other additions, renames, or updates.
    */
   UpdateSchema unionByNameWith(Schema newSchema);
+
+  /**
+   * Add a new row identifier given the column name.
+   * <p>
+   * The column must be a required field of a primitive type.
+   * It must exist in the current schema to update, or is added as a part of this update.
+   *
+   * @param columnName name of the column to add as a row identifier
+   * @return this for method chaining
+   */
+  UpdateSchema addRowIdentifier(String columnName);

Review comment:
       Here also, I don't think that "row identifier" is the right noun. I'd prefer `addIdentifierField` and `removeIdentifierField`. I'd use `remove` rather than `delete` because it is more clear that the field itself will not be deleted, just removed from the identifier. I also think that we should make sure that behavior (removing from the identifier field list, not dropping the column) is clear in the Javadoc.
   
   What about also adding `setIdentifierFields`? I think that a bulk set/replace operation fits more cleanly with the use case where users are setting the identifier for the first time or replacing it. I think that this will tend to be idempotent: set the identifier fields to `account_id` and `profile_id`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r612872453



##########
File path: core/src/main/java/org/apache/iceberg/util/JsonUtil.java
##########
@@ -116,19 +122,42 @@ public static String getStringOrNull(String property, JsonNode node) {
   }
 
   public static List<String> getStringList(String property, JsonNode node) {
-    Preconditions.checkArgument(node.has(property), "Cannot parse missing list %s", property);
+    ImmutableList.Builder<String> builder = ImmutableList.builder();
+    return getCollection(property, node, builder::build, builder::add, JsonNode::asText,
+        e -> Preconditions.checkArgument(e.isTextual(), "Cannot parse string from non-text value: %s", e),
+        false);
+  }
+
+  public static Set<Integer> getIntegerSetOrNull(String property, JsonNode node) {
+    ImmutableSet.Builder<Integer> builder = ImmutableSet.builder();
+    return getCollection(property, node, builder::build, builder::add, JsonNode::asInt,
+        e -> Preconditions.checkArgument(e.isInt(), "Cannot parse string from non-integer value: %s", e),
+        true);
+  }
+
+  private static <C extends Collection<T>, T> C getCollection(
+      String property, JsonNode node,
+      Supplier<C> builder, Consumer<T> appender,
+      Function<JsonNode, T> getter, Consumer<JsonNode> validator,

Review comment:
       thanks for the suggestion, please see if the new approach looks better.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] openinx commented on pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

openinx commented on pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#issuecomment-826536901


   @rdblue ,  would you have any other concern ?  If no,  I will plan to get this merged.  Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r620696855



##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -59,25 +62,32 @@
   private final Multimap<Integer, Move> moves = Multimaps.newListMultimap(Maps.newHashMap(), Lists::newArrayList);
   private int lastColumnId;
   private boolean allowIncompatibleChanges = false;
-
+  private Set<String> identifierNames;

Review comment:
       I would prefer not to track these in the update by name since it makes the code more complex. I think the only missing part is tracking what fields have been added, for which I think we could keep a map from added name to new ID that is used if the current schema doesn't have the field.
   
   I understand this has taken a while to get in, so we can do that in a follow-up if you'd like.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r612604259



##########
File path: core/src/test/java/org/apache/iceberg/TestSchemaUpdate.java
##########
@@ -1223,4 +1223,98 @@ public void testMoveBetweenStructsFails() {
                 .moveBefore("s2.x", "s1.a")
                 .apply());
   }
+
+  @Test
+  public void testAddRowIdentifier() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .allowIncompatibleChanges()
+        .addRowIdentifier("id")
+        .addRequiredColumn("id2", Types.StringType.get())
+        .addRowIdentifier("id2")
+        .apply();
+    Assert.assertEquals(Sets.newHashSet(1, SCHEMA_LAST_COLUMN_ID + 1), newSchema.rowIdentifiers());
+
+    AssertHelpers.assertThrows("Should fail if column with name not exist",
+        IllegalArgumentException.class,
+        "Cannot find column of the given name in existing or newly added columns",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID).addRowIdentifier("unknown").apply());
+
+    AssertHelpers.assertThrows("Should fail if column with name not exist",
+        IllegalArgumentException.class,
+        "it is not a required field",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID).addRowIdentifier("data").apply());
+
+    AssertHelpers.assertThrows("Should fail if column with name not exist",
+        IllegalArgumentException.class,
+        "it is not a primitive field",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID).addRowIdentifier("locations").apply());
+  }
+
+  @Test
+  public void testRenameRowIdentifier() {
+    Schema schemaWithRowIdentifier = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .allowIncompatibleChanges()
+        .addRowIdentifier("id")
+        .apply();
+
+    Schema newSchema = new SchemaUpdate(schemaWithRowIdentifier, SCHEMA_LAST_COLUMN_ID)
+        .renameColumn("id", "id2")
+        .apply();
+
+    Assert.assertEquals(Sets.newHashSet(1), newSchema.rowIdentifiers());
+  }
+

Review comment:
       I think what you describe is `Should fail if deleting column before deleting it as row identifier` case in `testDeleteRowIdentifier` test




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r620747363



##########
File path: core/src/test/java/org/apache/iceberg/TestSchemaUpdate.java
##########
@@ -1223,4 +1223,232 @@ public void testMoveBetweenStructsFails() {
                 .moveBefore("s2.x", "s1.a")
                 .apply());
   }
+
+  @Test
+  public void testAddExistingIdentifierFields() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("id"))
+        .apply();
+
+    Assert.assertEquals("add an existing field as identifier field should succeed",
+        Sets.newHashSet(newSchema.findField("id").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testAddNewIdentifierFieldColumns() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new_field", Types.StringType.get())
+        .setIdentifierFields(Sets.newHashSet("id", "new_field"))
+        .apply();
+
+    Assert.assertEquals("add column then set as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("id").fieldId(), newSchema.findField("new_field").fieldId()),
+        newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("id", "new_field"))
+        .addColumn("new_field", Types.StringType.get())
+        .apply();
+
+    Assert.assertEquals("set identifier then add column should succeed",
+        Sets.newHashSet(newSchema.findField("id").fieldId(), newSchema.findField("new_field").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testAddNestedIdentifierFieldColumns() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("preferences.feature1"))
+        .apply();
+
+    Assert.assertEquals("set existing nested field as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("preferences.feature1").fieldId()),
+        newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new", Types.StructType.of(
+            Types.NestedField.optional(SCHEMA_LAST_COLUMN_ID + 1, "field", Types.StringType.get())
+        ))
+        .setIdentifierFields(Sets.newHashSet("new.field"))
+        .apply();
+
+    Assert.assertEquals("set newly added nested field as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("new.field").fieldId()),
+        newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new", Types.StructType.of(
+            Types.NestedField.optional(SCHEMA_LAST_COLUMN_ID + 1, "field", Types.StructType.of(
+                Types.NestedField.optional(SCHEMA_LAST_COLUMN_ID + 2, "nested", Types.StringType.get())))))
+        .setIdentifierFields(Sets.newHashSet("new.field.nested"))
+        .apply();
+
+    Assert.assertEquals("set newly added multi-layer nested field as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("new.field.nested").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testAddDottedIdentifierFieldColumns() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn(null, "dot.field", Types.StringType.get())
+        .setIdentifierFields(Sets.newHashSet("id", "dot.field"))
+        .apply();
+
+    Assert.assertEquals("add a field with dot as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("id").fieldId(), newSchema.findField("dot.field").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testRemoveIdentifierFields() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new_field", Types.StringType.get())
+        .addColumn("new_field2", Types.StringType.get())
+        .setIdentifierFields(Sets.newHashSet("id", "new_field", "new_field2"))
+        .apply();
+
+    newSchema = new SchemaUpdate(newSchema, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("new_field", "new_field2"))
+        .apply();
+
+    Assert.assertEquals("remove an identifier field should succeed",
+        Sets.newHashSet(newSchema.findField("new_field").fieldId(), newSchema.findField("new_field2").fieldId()),
+        newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(newSchema, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet())
+        .apply();
+
+    Assert.assertEquals("remove all identifier fields should succeed",
+        Sets.newHashSet(),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testSetIdentifierFieldsFails() {
+    AssertHelpers.assertThrows("add a field with name not exist should fail",
+        IllegalArgumentException.class,
+        "not found in current schema or added columns",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+            .setIdentifierFields(Sets.newHashSet("unknown"))
+            .apply());
+
+    AssertHelpers.assertThrows("add a field of non-primitive type should fail",
+        IllegalArgumentException.class,
+        "not a primitive type field",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+            .setIdentifierFields(Sets.newHashSet("locations"))
+            .apply());
+
+    AssertHelpers.assertThrows("add a nested field in map should fail",
+        IllegalArgumentException.class,
+        "must not be nested in " + SCHEMA.findField("locations"),
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+            .setIdentifierFields(Sets.newHashSet("locations.key.zip"))
+            .apply());
+
+    AssertHelpers.assertThrows("add a nested field in list should fail",
+        IllegalArgumentException.class,
+        "must not be nested in " + SCHEMA.findField("points"),
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+            .setIdentifierFields(Sets.newHashSet("points.element.x"))
+            .apply());
+
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new", Types.StructType.of(
+            Types.NestedField.optional(SCHEMA_LAST_COLUMN_ID + 1, "fields", Types.ListType.ofOptional(
+                SCHEMA_LAST_COLUMN_ID + 2, Types.StructType.of(
+                    Types.NestedField.optional(SCHEMA_LAST_COLUMN_ID + 3, "nested", Types.StringType.get())
+                ))
+            )
+        ))
+        .apply();
+
+    AssertHelpers.assertThrows("add a nested field in struct of a map should fail",
+        IllegalArgumentException.class,
+        "must not be nested in " + newSchema.findField("new.fields"),
+        () -> new SchemaUpdate(newSchema, SCHEMA_LAST_COLUMN_ID + 3)
+            .setIdentifierFields(Sets.newHashSet("new.fields.element.nested"))
+            .apply());
+  }
+
+  @Test
+  public void testDeleteIdentifierFieldColumns() {
+    Schema schemaWithIdentifierFields = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("id"))
+        .apply();
+
+    Assert.assertEquals("delete column and then reset identifier field should succeed",
+        Sets.newHashSet(),
+        new SchemaUpdate(schemaWithIdentifierFields, SCHEMA_LAST_COLUMN_ID)
+            .deleteColumn("id").setIdentifierFields(Sets.newHashSet()).apply()
+            .identifierFieldIds());
+
+    Assert.assertEquals("delete reset identifier field and then delete column should succeed",
+        Sets.newHashSet(),
+        new SchemaUpdate(schemaWithIdentifierFields, SCHEMA_LAST_COLUMN_ID)
+            .setIdentifierFields(Sets.newHashSet()).deleteColumn("id").apply()
+            .identifierFieldIds());
+  }
+
+  @Test
+  public void testDeleteIdentifierFieldColumnsFails() {
+    Schema schemaWithIdentifierFields = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("id"))

Review comment:
       tests updated.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] openinx commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

openinx commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r618189003



##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -408,11 +427,55 @@ private TableMetadata applyChangesToMapping(TableMetadata metadata) {
   private static Schema applyChanges(Schema schema, List<Integer> deletes,
                                      Map<Integer, Types.NestedField> updates,
                                      Multimap<Integer, Types.NestedField> adds,
-                                     Multimap<Integer, Move> moves) {
+                                     Multimap<Integer, Move> moves,
+                                     Set<String> identifierNames) {
+    // validate existing identifier fields are not deleted
+    for (String name : identifierNames) {
+      Types.NestedField field = schema.findField(name);

Review comment:
       If we have two columns `(a, b)` in the table, and the original identifier fields is `a`.  Then people want to delete the column `a` first and then rename the column `b` to column `a` and also change the identifier fields from old `a` to new `a`.  As the identifier is a constraint for delete, so people will need to call the API by : 
   
   ```java
   SchemaUpdate update = ...
   update.deleteColumn("a").renameColumn("b", "a")
   ```
   
   In this way,  he will encounter an argument exception and try to call the `setIdentifierFields`: 
   
   ```java
   update.setIdentifierFields("b")
     .deleteColumn("a")
     .renameColumn("b", "a")
     .setIdentiferFields("a")
   ```
   
   But this way, people will still encounter the exception.  So maybe the final solution will be: 
   
   ```java
   Step.1   update.setIdentifierFields(ImmutableSet.of()).commit();
   Step.2   update.deleteColumn("a").renameColumn("b", "a").commit();
   Step.3   update.setIdentifierFields("a").commit();
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r615317981



##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -51,6 +53,7 @@
   private final TableMetadata base;
   private final Schema schema;
   private final Map<Integer, Integer> idToParent;
+  private final Set<String> identifierFieldNames;

Review comment:
       Why did you choose to use names instead of ids here? I think IDs would be a little simpler because you wouldn't need to translate in more cases. For example, rename wouldn't need to modify the identifier field ID set because it doesn't change the field IDs. You'd also only need to copy the set of field IDs returned by the current schema in the constructor.
   
   I was looking for a reason why you might have chosen to use names, but I think that it is straightforward to use IDs instead of names in all cases. For example, the check in `deleteColumn` validates the name passed in directly, which seems easier; but the name has already been looked up in the schema so we have a field and could validate `identifierFieldIds.contains(field.fieldId())` instead like the other checks do.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r615319173



##########
File path: core/src/main/java/org/apache/iceberg/util/JsonUtil.java
##########
@@ -117,18 +119,67 @@ public static String getStringOrNull(String property, JsonNode node) {
 
   public static List<String> getStringList(String property, JsonNode node) {
     Preconditions.checkArgument(node.has(property), "Cannot parse missing list %s", property);
-    JsonNode pNode = node.get(property);
-    Preconditions.checkArgument(pNode != null && !pNode.isNull() && pNode.isArray(),
-        "Cannot parse %s from non-array value: %s", property, pNode);
-
-    ImmutableList.Builder<String> builder = ImmutableList.builder();
-    Iterator<JsonNode> elements = pNode.elements();
-    while (elements.hasNext()) {
-      JsonNode element = elements.next();
-      Preconditions.checkArgument(element.isTextual(),
-          "Cannot parse string from non-text value: %s", element);
-      builder.add(element.asText());
+    return ImmutableList.<String>builder()
+        .addAll(new JsonStringArrayVisitor(property, node))
+        .build();
+  }
+
+  public static Set<Integer> getIntegerSetOrNull(String property, JsonNode node) {
+    if (!node.has(property)) {
+      return null;
+    }
+
+    return ImmutableSet.<Integer>builder()
+        .addAll(new JsonIntegerArrayVisitor(property, node))
+        .build();
+  }
+
+  abstract static class JsonArrayIterator<T> implements Iterator<T> {
+
+    private final Iterator<JsonNode> elements;
+
+    JsonArrayIterator(String property, JsonNode node) {
+      JsonNode pNode = node.get(property);
+      Preconditions.checkArgument(pNode != null && !pNode.isNull() && pNode.isArray(),
+          "Cannot parse %s from non-array value: %s", property, pNode);
+      this.elements = pNode.elements();
+    }
+
+    @Override
+    public boolean hasNext() {
+      return elements().hasNext();
+    }
+
+    Iterator<JsonNode> elements() {
+      return elements;
+    }
+  }
+
+  static class JsonStringArrayVisitor extends JsonArrayIterator<String> {

Review comment:
       This isn't a visitor, it's an `Iterator` so I think we should name it `JsonStringArrayIterator`. Same for the int version.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#issuecomment-824442288


   @rdblue @openinx sorry for the delayed update, I tried multiple different ways to implement this and went back and forth, and in the end the cleanest one I found was to use name to record all the updates, and at `applyChange` time, deletion of identifier column is checked before generating the new schema struct to validate deletion of identifier field, other validations are done after the new schema struct is generated.
   
   Ryan talked about using 2 sets, one for existing IDs and one for new names. The issue with that approach was that the validation checks done for the 2 sets are very similar. For existing IDs, I need to check recursively that all the parents of the field up to root are struct, which can be done using `idToParent` index. For new names, I need to do something very similar, but also merge the `IdToParent` with the parent information newly added fields. So in the end the code was very similar to just creating a new schema and recreating the `IdToParent` index. So combining those two facts, I think it is much cleaner to just store everything as name and then validate name and resolve back to ID after the new schema struct is generated.
   
   Ryan also talked about building an index of added name to ID when adding new columns, and with this approach that is not needed because the new schema would have that index when calling `findField`. I think that is a win to not increase complexity in `addColumn` and avoid new indexes in `UpdateSchema`.
   
   I can also do the `deletes` check in the validation function, but in the end I chose to do it in a separated code block before generating the schema struct, because doing it in the validation method requires passing in many arguments, and that flow was not intuitive to code readers, it only increased the complexity of the code and reduced readability.
   
   There are some other methods I tried to implement this whole thing, such as filtering all potential identifier fields in schema and then check if the identifiers are in that potential set. But that approach could not produce informative error message so I did not go with that route.
   
   I have added some more tests to cover all possible use cases, please let me know if you have any additional concern with the approach.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r615320262



##########
File path: core/src/test/java/org/apache/iceberg/TestSchemaUpdate.java
##########
@@ -1223,4 +1223,98 @@ public void testMoveBetweenStructsFails() {
                 .moveBefore("s2.x", "s1.a")
                 .apply());
   }
+
+  @Test
+  public void testAddIdentifierField() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .allowIncompatibleChanges()

Review comment:
       I'd prefer to separate the existing column case from the new column case. And these tests should also validate that adding the identifier field before adding the column fails even if they are part of the same `SchemaUpdate` operation.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r620740478



##########
File path: api/src/main/java/org/apache/iceberg/UpdateSchema.java
##########
@@ -384,4 +386,24 @@ default UpdateSchema updateColumn(String name, Type.PrimitiveType newType, Strin
    *                                  with other additions, renames, or updates.
    */
   UpdateSchema unionByNameWith(Schema newSchema);
+
+  /**
+   * Set the identifier fields given a set of field names.
+   * See {@link Schema#identifierFieldIds()} to learn more about Iceberg identifier.
+   *
+   * @param names names of the columns to set as identifier fields
+   * @return this for method chaining
+   */
+  UpdateSchema setIdentifierFields(Set<String> names);

Review comment:
       `Set` was used to explicitly tell users that identifier fields need to be unique, instead of giving people an illusion that the update operation can still succeed with repeated value.
   
   Technically we can use `List`, `Iterable` or `Collection`. If we would like to make a broader use case for the API, I can document the behavior in javadoc.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r617736448



##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -317,6 +333,38 @@ public UpdateSchema unionByNameWith(Schema newSchema) {
     return this;
   }
 
+  @Override
+  public UpdateSchema addIdentifierField(String name) {
+    Types.NestedField field = schema.findField(name);
+    if (field == null) {
+      field = adds.get(TABLE_ROOT_ID).stream()
+          .filter(f -> f.name().equals(name))
+          .findAny().orElse(null);

Review comment:
       @openinx, that's what I was thinking at first as well, but I don't think that we need a version with parent now.
   
   The reason is that once a column is added, it is uniquely identified by a dotted name: `a.b` might be a struct named `a` with a field named `b` or might be a field named `a.b`, but it cannot be both. We guarantee this so that we never have ambiguous names.
   
   Because we know that the string passed in here will always identify just one field, we don't need to handle the ambiguous case by adding an explicit parent like we do with `addColumn`.
   
   I think that means we can have a much nicer API:
   * `setIdentifierFields(String...)` and `setIdentifierFields(Collection<String>)`
   * `addIdentifierField(String)`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r624752115



##########
File path: api/src/main/java/org/apache/iceberg/UpdateSchema.java
##########
@@ -384,4 +386,24 @@ default UpdateSchema updateColumn(String name, Type.PrimitiveType newType, Strin
    *                                  with other additions, renames, or updates.
    */
   UpdateSchema unionByNameWith(Schema newSchema);
+
+  /**
+   * Set the identifier fields given a set of field names.
+   * See {@link Schema#identifierFieldIds()} to learn more about Iceberg identifier.
+   *
+   * @param names names of the columns to set as identifier fields
+   * @return this for method chaining
+   */
+  UpdateSchema setIdentifierFields(Set<String> names);

Review comment:
       I would prefer to use `Collection`. It's fine to collapse that into a set internally since a set of columns or a multi-set of columns has the same identity behavior. `Collection` is easier to use for the caller.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r612814134



##########
File path: api/src/main/java/org/apache/iceberg/Schema.java
##########
@@ -46,34 +48,53 @@
  */
 public class Schema implements Serializable {
   private static final Joiner NEWLINE = Joiner.on('\n');
+  private static final Joiner COMMA = Joiner.on(',');
   private static final String ALL_COLUMNS = "*";
   private static final int DEFAULT_SCHEMA_ID = 0;
 
   private final StructType struct;
   private final int schemaId;
+  private final int[] rowIdentifiers;

Review comment:
       I think we need to fix the term used here. There isn't more than one identifier, so it isn't quite right to call this "identifiers". We also want to make sure that it is a bit more clear just what these are, so I think we should clearly call out that what is returned are field IDs. That's why I originally suggested `identifier-field-ids` for the JSON field and `identifierFieldIds` for the `Schema` method.
   
   I think it would be fine to use the "row" prefix, but I don't think that it adds much value. I typically opt to go with the simpler option so I prefer `identifierFieldIds`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] openinx commented on pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

openinx commented on pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#issuecomment-825365008


   The current patch looks good to me overall,  I just left few minor comments.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] openinx commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

openinx commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r612917605



##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -408,11 +450,15 @@ private TableMetadata applyChangesToMapping(TableMetadata metadata) {
   private static Schema applyChanges(Schema schema, List<Integer> deletes,
                                      Map<Integer, Types.NestedField> updates,
                                      Multimap<Integer, Types.NestedField> adds,
-                                     Multimap<Integer, Move> moves) {
+                                     Multimap<Integer, Move> moves,
+                                     Set<String> rowIdentifierNames) {
     Types.StructType struct = TypeUtil
         .visit(schema, new ApplyChanges(deletes, updates, adds, moves))
         .asNestedType().asStructType();
-    return new Schema(struct.fields());
+    Set<Integer> refreshedRowIdentifiers = rowIdentifierNames.stream()
+        .map(name -> struct.field(name).fieldId())

Review comment:
       Shoud we add the unit tests for above cases ?  I mean we are deleting the rowIdentifierField and columns in the same txn .




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r617122093



##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -317,6 +320,31 @@ public UpdateSchema unionByNameWith(Schema newSchema) {
     return this;
   }
 
+  @Override
+  public UpdateSchema setIdentifierFields(Set<String> names) {
+    identifierFields = Sets.newHashSet();
+    names.forEach(this::addIdentifierField);
+    return this;
+  }
+
+  private void addIdentifierField(String name) {
+    Types.NestedField field = schema.findField(name);
+    if (field == null) {
+      field = adds.get(TABLE_ROOT_ID).stream()
+          .filter(f -> f.name().equals(name))
+          .findAny().orElse(null);
+    } else {
+      Preconditions.checkArgument(idToParent.get(field.fieldId()) == null,
+          "Cannot add column %s as an identifier: must not be nested in a map or list", name);

Review comment:
       I see, in this implementation I did not allow a field in a struct, that's why I implemented it this way. But it does make sense now I think about it, I will update the code for that use case.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r618846426



##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -408,11 +427,55 @@ private TableMetadata applyChangesToMapping(TableMetadata metadata) {
   private static Schema applyChanges(Schema schema, List<Integer> deletes,
                                      Map<Integer, Types.NestedField> updates,
                                      Multimap<Integer, Types.NestedField> adds,
-                                     Multimap<Integer, Move> moves) {
+                                     Multimap<Integer, Move> moves,
+                                     Set<String> identifierNames) {
+    // validate existing identifier fields are not deleted
+    for (String name : identifierNames) {
+      Types.NestedField field = schema.findField(name);

Review comment:
       I think multiple steps is right. We should not allow redefining the ID column's ID with a delete/rename.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r612602734



##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -408,11 +450,15 @@ private TableMetadata applyChangesToMapping(TableMetadata metadata) {
   private static Schema applyChanges(Schema schema, List<Integer> deletes,
                                      Map<Integer, Types.NestedField> updates,
                                      Multimap<Integer, Types.NestedField> adds,
-                                     Multimap<Integer, Move> moves) {
+                                     Multimap<Integer, Move> moves,
+                                     Set<String> rowIdentifierNames) {
     Types.StructType struct = TypeUtil
         .visit(schema, new ApplyChanges(deletes, updates, adds, moves))
         .asNestedType().asStructType();
-    return new Schema(struct.fields());
+    Set<Integer> refreshedRowIdentifiers = rowIdentifierNames.stream()
+        .map(name -> struct.field(name).fieldId())

Review comment:
       Yes, that is why I decide to throw exception when `deleteColumn` is called and the column is a row identifier. There are 2 possible situations:
   
   1. the user actually wants to delete the column, then he/she should call:
   
   ```
   update.deleteRowIdentifier(columnName).deleteColumn(columnName);
   ```
   
   2. the user does not know the column is used as a row identifier, and in that case just calling `deleteColumn` would result in exception.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r617815077



##########
File path: api/src/main/java/org/apache/iceberg/Schema.java
##########
@@ -158,6 +186,10 @@ public StructType asStruct() {
     return struct.fields();
   }
 
+  public Set<Integer> identifierFieldIds() {

Review comment:
       Good point, I will move the explanation of identifier fields in `setIdentifierFields` here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r615317429



##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -317,6 +333,38 @@ public UpdateSchema unionByNameWith(Schema newSchema) {
     return this;
   }
 
+  @Override
+  public UpdateSchema addIdentifierField(String name) {
+    Types.NestedField field = schema.findField(name);
+    if (field == null) {
+      field = adds.get(TABLE_ROOT_ID).stream()
+          .filter(f -> f.name().equals(name))
+          .findAny().orElse(null);

Review comment:
       This assumes that the field is not a newly added field within a struct.
   
   I think that this should use the same strategy that we use for `addColumn`. There are two variations of `addColumn`, one that accepts a parent name and a field name, and one that accepts only a field name. If the latter is called with a name that contains `.`, then an exception is thrown because the reference is ambiguous. For example, `a.b` could be a top-level field named `a.b` or could be field `b` nested within field `a`.
   
   By structuring the methods that way, we always know the parent field and can use that instead of assuming `TABLE_ROOT_ID` here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 edited a comment on pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

jackye1995 edited a comment on pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#issuecomment-824442288


   @rdblue @openinx sorry for the delayed update, I tried multiple different ways to implement this and went back and forth, and in the end the cleanest one I found was to use name to record all the identifier fields set, and at `applyChange` time, deletion of identifier column is checked before generating the new schema struct to validate deletion of identifier field, other validations are done after the new schema struct is generated.
   
   Ryan talked about using 2 sets, one for existing IDs and one for new names. The issue with that approach was that the validation checks done for the 2 sets are very similar. For existing IDs, I need to check recursively that all the parents of the field up to root are struct, which can be done using `idToParent` index. For new names, I need to do something very similar, but also merge the `IdToParent` with the parent information of newly added fields. So in the end the code was very similar to just creating a new schema and recreating the `IdToParent` index and a bit redundant. Combining those two facts, I think it is much cleaner to just store everything as name and then validate name and resolve back to ID after the new schema struct is generated.
   
   Ryan also talked about building an index of added name to ID when adding new columns, and with this approach that is not needed because the new schema would have that index when calling `findField`. I think that is a win to not increase complexity in `addColumn` and avoid new indexes in `UpdateSchema`.
   
   I can also do the `deletes` check in the validation function, but in the end I chose to do it in a separated code block before generating the schema struct, because doing it in the validation method requires passing in many arguments, and that flow was not intuitive to code readers, it only increased the complexity of the code and reduced readability.
   
   There are some other methods I tried to implement this whole thing, such as filtering all potential identifier fields in schema and then check if the identifiers are in that potential set. But that approach could not produce informative error message so I did not go with that route.
   
   I have added some more tests to cover all possible use cases, please let me know if you have any additional concern with the approach.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] yyanyy commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

yyanyy commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r616222010



##########
File path: core/src/main/java/org/apache/iceberg/SchemaParser.java
##########
@@ -158,7 +169,7 @@ public static String toJson(Schema schema, boolean pretty) {
       if (pretty) {
         generator.useDefaultPrettyPrinter();
       }
-      toJson(schema.asStruct(), generator);
+      toJson(schema.asStruct(), schema.schemaId(), schema.rowIdentifiers(), generator);

Review comment:
       Thanks for the response @rdblue ! Just want to mention an additional case: in addition to the possibility of writing incorrect ID for a table schema, writing id here also makes us persisting id for things not related to the table schema itself, e.g. [manifest list](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/ManifestListWriter.java#L94-L98) and [delete files](https://github.com/apache/iceberg/blob/master/parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java#L387-L414). But if we are fine with writing incorrect ID, we probably will be fine with writing ID for non-schema structs as well... 
   
   I guess essentially we are choosing between 1) simpler code with potential confusing IDs and 2) code complexities (due to avoid writing incorrect/unuseful IDs); I personally would prefer the latter but I don't strongly oppose the former if we think it's easier to reason about in code. Also we might want to involve people from Hive as part of this discussion since they are using this method as well to preserve the schema ([example](https://github.com/apache/iceberg/blob/master/mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java#L119))




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r615313160



##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -317,6 +333,38 @@ public UpdateSchema unionByNameWith(Schema newSchema) {
     return this;
   }
 
+  @Override
+  public UpdateSchema addIdentifierField(String name) {
+    Types.NestedField field = schema.findField(name);
+    if (field == null) {
+      field = adds.get(TABLE_ROOT_ID).stream()
+          .filter(f -> f.name().equals(name))
+          .findAny().orElse(null);
+    }
+
+    Preconditions.checkArgument(field != null,
+        "Cannot find column of the given name in existing or newly added columns: %s", name);
+    Preconditions.checkArgument(field.isRequired(),
+        "Cannot add column %s as an identifier field because it is not a required field", name);
+    Preconditions.checkArgument(field.type().isPrimitiveType(),
+        "Cannot add column %s as an identifier field because it is not a primitive field", name);

Review comment:
       We normally omit the "because ..." and use `:` instead. Like this: `Cannot add %s as an identifier field: not a primitive` or `Cannot add %s as an identifier field: already an identifier field`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r617127118



##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -317,6 +320,31 @@ public UpdateSchema unionByNameWith(Schema newSchema) {
     return this;
   }
 
+  @Override
+  public UpdateSchema setIdentifierFields(Set<String> names) {

Review comment:
       I'm rethinking this; sorry for the churn.
   
   We should actually be able to do this with a single string and no parent name. When `add` or `set` is called, the column must already exist. We ensure that each name is unique: `a.b` is either a field named `a.b` or is a field named `b` within field `a`, but never both. Because of that, the name here is never ambiguous. We will always be able to look it up. So supporting `Collection<String>` or `String...` should work as long as we track two things: existing identifier fields as ids and new identifier fields as names. Then when we create the identifier list later we can look up the names.
   
   Does that sound reasonable?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r620700928



##########
File path: core/src/test/java/org/apache/iceberg/TestSchemaUpdate.java
##########
@@ -1223,4 +1223,232 @@ public void testMoveBetweenStructsFails() {
                 .moveBefore("s2.x", "s1.a")
                 .apply());
   }
+
+  @Test
+  public void testAddExistingIdentifierFields() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("id"))
+        .apply();
+
+    Assert.assertEquals("add an existing field as identifier field should succeed",
+        Sets.newHashSet(newSchema.findField("id").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testAddNewIdentifierFieldColumns() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new_field", Types.StringType.get())
+        .setIdentifierFields(Sets.newHashSet("id", "new_field"))
+        .apply();
+
+    Assert.assertEquals("add column then set as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("id").fieldId(), newSchema.findField("new_field").fieldId()),
+        newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("id", "new_field"))
+        .addColumn("new_field", Types.StringType.get())
+        .apply();
+
+    Assert.assertEquals("set identifier then add column should succeed",
+        Sets.newHashSet(newSchema.findField("id").fieldId(), newSchema.findField("new_field").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testAddNestedIdentifierFieldColumns() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("preferences.feature1"))
+        .apply();
+
+    Assert.assertEquals("set existing nested field as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("preferences.feature1").fieldId()),
+        newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new", Types.StructType.of(
+            Types.NestedField.optional(SCHEMA_LAST_COLUMN_ID + 1, "field", Types.StringType.get())
+        ))
+        .setIdentifierFields(Sets.newHashSet("new.field"))
+        .apply();
+
+    Assert.assertEquals("set newly added nested field as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("new.field").fieldId()),
+        newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new", Types.StructType.of(
+            Types.NestedField.optional(SCHEMA_LAST_COLUMN_ID + 1, "field", Types.StructType.of(
+                Types.NestedField.optional(SCHEMA_LAST_COLUMN_ID + 2, "nested", Types.StringType.get())))))
+        .setIdentifierFields(Sets.newHashSet("new.field.nested"))
+        .apply();
+
+    Assert.assertEquals("set newly added multi-layer nested field as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("new.field.nested").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testAddDottedIdentifierFieldColumns() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn(null, "dot.field", Types.StringType.get())
+        .setIdentifierFields(Sets.newHashSet("id", "dot.field"))
+        .apply();
+
+    Assert.assertEquals("add a field with dot as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("id").fieldId(), newSchema.findField("dot.field").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testRemoveIdentifierFields() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new_field", Types.StringType.get())
+        .addColumn("new_field2", Types.StringType.get())
+        .setIdentifierFields(Sets.newHashSet("id", "new_field", "new_field2"))
+        .apply();
+
+    newSchema = new SchemaUpdate(newSchema, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("new_field", "new_field2"))
+        .apply();
+
+    Assert.assertEquals("remove an identifier field should succeed",
+        Sets.newHashSet(newSchema.findField("new_field").fieldId(), newSchema.findField("new_field2").fieldId()),
+        newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(newSchema, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet())
+        .apply();
+
+    Assert.assertEquals("remove all identifier fields should succeed",
+        Sets.newHashSet(),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testSetIdentifierFieldsFails() {
+    AssertHelpers.assertThrows("add a field with name not exist should fail",
+        IllegalArgumentException.class,
+        "not found in current schema or added columns",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+            .setIdentifierFields(Sets.newHashSet("unknown"))
+            .apply());
+
+    AssertHelpers.assertThrows("add a field of non-primitive type should fail",
+        IllegalArgumentException.class,
+        "not a primitive type field",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+            .setIdentifierFields(Sets.newHashSet("locations"))
+            .apply());
+
+    AssertHelpers.assertThrows("add a nested field in map should fail",
+        IllegalArgumentException.class,
+        "must not be nested in " + SCHEMA.findField("locations"),
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+            .setIdentifierFields(Sets.newHashSet("locations.key.zip"))
+            .apply());
+
+    AssertHelpers.assertThrows("add a nested field in list should fail",
+        IllegalArgumentException.class,
+        "must not be nested in " + SCHEMA.findField("points"),
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+            .setIdentifierFields(Sets.newHashSet("points.element.x"))
+            .apply());
+
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new", Types.StructType.of(
+            Types.NestedField.optional(SCHEMA_LAST_COLUMN_ID + 1, "fields", Types.ListType.ofOptional(
+                SCHEMA_LAST_COLUMN_ID + 2, Types.StructType.of(
+                    Types.NestedField.optional(SCHEMA_LAST_COLUMN_ID + 3, "nested", Types.StringType.get())
+                ))
+            )
+        ))
+        .apply();
+
+    AssertHelpers.assertThrows("add a nested field in struct of a map should fail",
+        IllegalArgumentException.class,
+        "must not be nested in " + newSchema.findField("new.fields"),
+        () -> new SchemaUpdate(newSchema, SCHEMA_LAST_COLUMN_ID + 3)
+            .setIdentifierFields(Sets.newHashSet("new.fields.element.nested"))
+            .apply());
+  }
+
+  @Test
+  public void testDeleteIdentifierFieldColumns() {
+    Schema schemaWithIdentifierFields = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("id"))
+        .apply();
+
+    Assert.assertEquals("delete column and then reset identifier field should succeed",
+        Sets.newHashSet(),
+        new SchemaUpdate(schemaWithIdentifierFields, SCHEMA_LAST_COLUMN_ID)
+            .deleteColumn("id").setIdentifierFields(Sets.newHashSet()).apply()
+            .identifierFieldIds());
+
+    Assert.assertEquals("delete reset identifier field and then delete column should succeed",
+        Sets.newHashSet(),
+        new SchemaUpdate(schemaWithIdentifierFields, SCHEMA_LAST_COLUMN_ID)
+            .setIdentifierFields(Sets.newHashSet()).deleteColumn("id").apply()
+            .identifierFieldIds());
+  }
+
+  @Test
+  public void testDeleteIdentifierFieldColumnsFails() {
+    Schema schemaWithIdentifierFields = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("id"))
+        .apply();
+
+    AssertHelpers.assertThrows("delete an identifier column without setting identifier fields should fail",
+        IllegalArgumentException.class,
+        "Cannot delete identifier field 1: id: required int. To force deletion, " +
+            "also call setIdentifierFields to update identifier fields.",
+        () -> new SchemaUpdate(schemaWithIdentifierFields, SCHEMA_LAST_COLUMN_ID).deleteColumn("id").apply());
+  }
+
+  @Test
+  public void testRenameIdentifierFields() {
+    Schema schemaWithIdentifierFields = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("id"))
+        .apply();
+
+    Schema newSchema = new SchemaUpdate(schemaWithIdentifierFields, SCHEMA_LAST_COLUMN_ID)
+        .renameColumn("id", "id2")
+        .apply();
+
+    Assert.assertEquals("rename should not affect identifier fields",
+        Sets.newHashSet(SCHEMA.findField("id").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testMoveIdentifierFields() {
+    Schema schemaWithIdentifierFields = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .allowIncompatibleChanges()
+        .setIdentifierFields(Sets.newHashSet("id"))
+        .apply();
+
+    Schema newSchema = new SchemaUpdate(schemaWithIdentifierFields, SCHEMA_LAST_COLUMN_ID)
+        .moveAfter("id", "locations")
+        .apply();
+
+    Assert.assertEquals("move after should not affect identifier fields",
+        Sets.newHashSet(SCHEMA.findField("id").fieldId()),
+        newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(schemaWithIdentifierFields, SCHEMA_LAST_COLUMN_ID)
+        .moveBefore("id", "locations")
+        .apply();
+
+    Assert.assertEquals("move before should not affect identifier fields",
+        Sets.newHashSet(SCHEMA.findField("id").fieldId()),
+        newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(schemaWithIdentifierFields, SCHEMA_LAST_COLUMN_ID)
+        .moveFirst("id")
+        .apply();
+
+    Assert.assertEquals("move first should not affect identifier fields",
+        Sets.newHashSet(1), newSchema.identifierFieldIds());

Review comment:
       Looks like this wasn't updated to load the field by name.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r616304204



##########
File path: core/src/test/java/org/apache/iceberg/TestTableMetadata.java
##########
@@ -646,6 +646,34 @@ public void testUpdateSortOrder() {
         NullOrder.NULLS_FIRST, sortedByX.sortOrder().fields().get(0).nullOrder());
   }
 
+  @Test
+  public void testParseSchemaIdentifierFields() throws Exception {
+    String data = readTableMetadataInputFile("TableMetadataV2Valid.json");
+    TableMetadata parsed = TableMetadataParser.fromJson(
+        ops.io(), null, JsonUtil.mapper().readValue(data, JsonNode.class));
+    Assert.assertEquals(Sets.newHashSet(), parsed.schemasById().get(0).identifierFieldIds());
+    Assert.assertEquals(Sets.newHashSet(1, 2), parsed.schemasById().get(1).identifierFieldIds());
+  }
+
+  @Test
+  public void testUpdateSchemaIdentifierFields() {
+    Schema schema = new Schema(
+        Types.NestedField.required(10, "x", Types.StringType.get())
+    );
+
+    TableMetadata meta = TableMetadata.newTableMetadata(
+        schema, PartitionSpec.unpartitioned(), null, ImmutableMap.of());
+    Assert.assertTrue("Should default to no identifier field", meta.schema().identifierFieldIds().isEmpty());

Review comment:
       Yes you are correct, this is not validating TableMetadata behavior. Let me just remove it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] openinx commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

openinx commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r612904117



##########
File path: api/src/main/java/org/apache/iceberg/UpdateSchema.java
##########
@@ -384,4 +384,25 @@ default UpdateSchema updateColumn(String name, Type.PrimitiveType newType, Strin
    *                                  with other additions, renames, or updates.
    */
   UpdateSchema unionByNameWith(Schema newSchema);
+
+  /**
+   * Add a new row identifier given the column name.
+   * <p>
+   * The column must be a required field of a primitive type.
+   * It must exist in the current schema to update, or is added as a part of this update.
+   *
+   * @param columnName name of the column to add as a row identifier
+   * @return this for method chaining
+   */
+  UpdateSchema addRowIdentifier(String columnName);

Review comment:
       Functionally,  there should be no difference between the `addIdentifierField`/`removeIdentifierField ` and `setIdentifierFields`,  but for the perspective of usability,  if people want to reset to use a totally different identifier fields  then he/she will need to: 
   
   ```java
       Schema schema = ...
       Set<Integer> newFieldIds = ...
   
       UpdateSchema update = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID);
       for (Integer identifier : schema.rowIdentifiers()) {
         String columnName = schema.findColumnName(identifier);
         Preconditions.checkNotNull(columnName, "Can not find the column with field id %s", identifier);
   
         update.deleteRowIdentifier(columnName);
       }
   
   
       for (Integer identifier : schema.rowIdentifiers()) {
         String columnName = schema.findColumnName(identifier);
         Preconditions.checkNotNull(columnName, "Can not find the column with field id %s", identifier);
   
         update.addRowIdentifier(columnName);
       }
   
       update.apply();
   ```
   
   It does make it more difficult to use the API.  Besides the `setIdentifierFields`  API,   would you think  it's necessary to introduce a `Set<String> identifierColumns()` in `Schema` ?   I find the schema will always return the `fieldId` while the `UpdateSchema` will always use the `columnName` then we have to do the conversion from `fieldId` to `columnName` everywhere.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#issuecomment-830870113


   Overall this looks correct to me. I think it would be nice to replace the `Set` with `Collection` in the `UpdateSchema` API, but that can be done later. Thanks for building this, @jackye1995!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r612851229



##########
File path: api/src/main/java/org/apache/iceberg/UpdateSchema.java
##########
@@ -384,4 +384,25 @@ default UpdateSchema updateColumn(String name, Type.PrimitiveType newType, Strin
    *                                  with other additions, renames, or updates.
    */
   UpdateSchema unionByNameWith(Schema newSchema);
+
+  /**
+   * Add a new row identifier given the column name.
+   * <p>
+   * The column must be a required field of a primitive type.
+   * It must exist in the current schema to update, or is added as a part of this update.
+   *
+   * @param columnName name of the column to add as a row identifier
+   * @return this for method chaining
+   */
+  UpdateSchema addRowIdentifier(String columnName);

Review comment:
       > I'd use remove rather than delete because it is more clear that the field itself will not be deleted
   
   Sounds good, I also called it `remove` in the doc, `delete` was trying to match `deleteColumn`, but I agree it would cause some confusion.
   
   > What about also adding setIdentifierFields
   
   I thought about `setIdentifierFields`, the reason I did not add it was because the `UpdateSchema` interface seems to be all individual operations instead of bulk operations (except for `unionByNameWith`), and our DDL SQL statements are also not bulk updates. 
   
   > I think that this will tend to be idempotent
   
   I think both approaches are idempotent.  you can write it as:
   
   ```
   update.setIdentifierFields(Sets.newHashSet("profile_id")).apply();
   ```
   
   or 
   
   ```
   update.addIdentifierField("profile_id").removeIdentifierField("account_id").apply();
   ```
   
   I am fine with both approaches. Technically if we go with `setIdentifierFields`, then we do not need `addIdentifierField` and `removeIdentifierField` and I can try to figure out the diff for every update. 
   
   @openinx do you think there is any advantage for any particular approach for the update API?
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r616250744



##########
File path: api/src/main/java/org/apache/iceberg/UpdateSchema.java
##########
@@ -384,4 +384,25 @@ default UpdateSchema updateColumn(String name, Type.PrimitiveType newType, Strin
    *                                  with other additions, renames, or updates.
    */
   UpdateSchema unionByNameWith(Schema newSchema);
+
+  /**
+   * Add a new row identifier given the column name.
+   * <p>
+   * The column must be a required field of a primitive type.
+   * It must exist in the current schema to update, or is added as a part of this update.
+   *
+   * @param columnName name of the column to add as a row identifier
+   * @return this for method chaining
+   */
+  UpdateSchema addRowIdentifier(String columnName);

Review comment:
       Yes it is not idempotent, and using set to write add/remove is easier than using add/remove to write set, I will change to set then, thanks for the feedbacks.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r616256611



##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -51,6 +53,7 @@
   private final TableMetadata base;
   private final Schema schema;
   private final Map<Integer, Integer> idToParent;
+  private final Set<String> identifierFieldNames;

Review comment:
       Yes with the set approach the concern that led me to use name is gone, I will change to use ID.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r620701257



##########
File path: core/src/test/java/org/apache/iceberg/TestSchemaUpdate.java
##########
@@ -1223,4 +1223,232 @@ public void testMoveBetweenStructsFails() {
                 .moveBefore("s2.x", "s1.a")
                 .apply());
   }
+
+  @Test
+  public void testAddExistingIdentifierFields() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("id"))
+        .apply();
+
+    Assert.assertEquals("add an existing field as identifier field should succeed",
+        Sets.newHashSet(newSchema.findField("id").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testAddNewIdentifierFieldColumns() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new_field", Types.StringType.get())
+        .setIdentifierFields(Sets.newHashSet("id", "new_field"))
+        .apply();
+
+    Assert.assertEquals("add column then set as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("id").fieldId(), newSchema.findField("new_field").fieldId()),
+        newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("id", "new_field"))
+        .addColumn("new_field", Types.StringType.get())
+        .apply();
+
+    Assert.assertEquals("set identifier then add column should succeed",
+        Sets.newHashSet(newSchema.findField("id").fieldId(), newSchema.findField("new_field").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testAddNestedIdentifierFieldColumns() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("preferences.feature1"))
+        .apply();
+
+    Assert.assertEquals("set existing nested field as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("preferences.feature1").fieldId()),
+        newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new", Types.StructType.of(
+            Types.NestedField.optional(SCHEMA_LAST_COLUMN_ID + 1, "field", Types.StringType.get())
+        ))
+        .setIdentifierFields(Sets.newHashSet("new.field"))
+        .apply();
+
+    Assert.assertEquals("set newly added nested field as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("new.field").fieldId()),
+        newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new", Types.StructType.of(
+            Types.NestedField.optional(SCHEMA_LAST_COLUMN_ID + 1, "field", Types.StructType.of(
+                Types.NestedField.optional(SCHEMA_LAST_COLUMN_ID + 2, "nested", Types.StringType.get())))))
+        .setIdentifierFields(Sets.newHashSet("new.field.nested"))
+        .apply();
+
+    Assert.assertEquals("set newly added multi-layer nested field as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("new.field.nested").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testAddDottedIdentifierFieldColumns() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn(null, "dot.field", Types.StringType.get())
+        .setIdentifierFields(Sets.newHashSet("id", "dot.field"))
+        .apply();
+
+    Assert.assertEquals("add a field with dot as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("id").fieldId(), newSchema.findField("dot.field").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testRemoveIdentifierFields() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new_field", Types.StringType.get())
+        .addColumn("new_field2", Types.StringType.get())
+        .setIdentifierFields(Sets.newHashSet("id", "new_field", "new_field2"))
+        .apply();
+
+    newSchema = new SchemaUpdate(newSchema, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("new_field", "new_field2"))
+        .apply();
+
+    Assert.assertEquals("remove an identifier field should succeed",
+        Sets.newHashSet(newSchema.findField("new_field").fieldId(), newSchema.findField("new_field2").fieldId()),
+        newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(newSchema, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet())
+        .apply();
+
+    Assert.assertEquals("remove all identifier fields should succeed",
+        Sets.newHashSet(),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testSetIdentifierFieldsFails() {
+    AssertHelpers.assertThrows("add a field with name not exist should fail",
+        IllegalArgumentException.class,
+        "not found in current schema or added columns",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+            .setIdentifierFields(Sets.newHashSet("unknown"))
+            .apply());
+
+    AssertHelpers.assertThrows("add a field of non-primitive type should fail",
+        IllegalArgumentException.class,
+        "not a primitive type field",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+            .setIdentifierFields(Sets.newHashSet("locations"))
+            .apply());
+
+    AssertHelpers.assertThrows("add a nested field in map should fail",
+        IllegalArgumentException.class,
+        "must not be nested in " + SCHEMA.findField("locations"),
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+            .setIdentifierFields(Sets.newHashSet("locations.key.zip"))
+            .apply());
+
+    AssertHelpers.assertThrows("add a nested field in list should fail",
+        IllegalArgumentException.class,
+        "must not be nested in " + SCHEMA.findField("points"),
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+            .setIdentifierFields(Sets.newHashSet("points.element.x"))
+            .apply());
+
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new", Types.StructType.of(
+            Types.NestedField.optional(SCHEMA_LAST_COLUMN_ID + 1, "fields", Types.ListType.ofOptional(
+                SCHEMA_LAST_COLUMN_ID + 2, Types.StructType.of(
+                    Types.NestedField.optional(SCHEMA_LAST_COLUMN_ID + 3, "nested", Types.StringType.get())
+                ))
+            )
+        ))
+        .apply();
+
+    AssertHelpers.assertThrows("add a nested field in struct of a map should fail",
+        IllegalArgumentException.class,
+        "must not be nested in " + newSchema.findField("new.fields"),
+        () -> new SchemaUpdate(newSchema, SCHEMA_LAST_COLUMN_ID + 3)
+            .setIdentifierFields(Sets.newHashSet("new.fields.element.nested"))
+            .apply());
+  }
+
+  @Test
+  public void testDeleteIdentifierFieldColumns() {
+    Schema schemaWithIdentifierFields = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("id"))
+        .apply();
+
+    Assert.assertEquals("delete column and then reset identifier field should succeed",
+        Sets.newHashSet(),
+        new SchemaUpdate(schemaWithIdentifierFields, SCHEMA_LAST_COLUMN_ID)
+            .deleteColumn("id").setIdentifierFields(Sets.newHashSet()).apply()
+            .identifierFieldIds());
+
+    Assert.assertEquals("delete reset identifier field and then delete column should succeed",
+        Sets.newHashSet(),
+        new SchemaUpdate(schemaWithIdentifierFields, SCHEMA_LAST_COLUMN_ID)
+            .setIdentifierFields(Sets.newHashSet()).deleteColumn("id").apply()
+            .identifierFieldIds());
+  }
+
+  @Test
+  public void testDeleteIdentifierFieldColumnsFails() {
+    Schema schemaWithIdentifierFields = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("id"))

Review comment:
       Do any tests use the `String...` overload?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r620719498



##########
File path: core/src/test/java/org/apache/iceberg/TestSchemaUpdate.java
##########
@@ -1223,4 +1223,232 @@ public void testMoveBetweenStructsFails() {
                 .moveBefore("s2.x", "s1.a")
                 .apply());
   }
+
+  @Test
+  public void testAddExistingIdentifierFields() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("id"))
+        .apply();
+
+    Assert.assertEquals("add an existing field as identifier field should succeed",
+        Sets.newHashSet(newSchema.findField("id").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testAddNewIdentifierFieldColumns() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new_field", Types.StringType.get())
+        .setIdentifierFields(Sets.newHashSet("id", "new_field"))
+        .apply();
+
+    Assert.assertEquals("add column then set as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("id").fieldId(), newSchema.findField("new_field").fieldId()),
+        newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("id", "new_field"))
+        .addColumn("new_field", Types.StringType.get())
+        .apply();
+
+    Assert.assertEquals("set identifier then add column should succeed",
+        Sets.newHashSet(newSchema.findField("id").fieldId(), newSchema.findField("new_field").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testAddNestedIdentifierFieldColumns() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("preferences.feature1"))
+        .apply();
+
+    Assert.assertEquals("set existing nested field as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("preferences.feature1").fieldId()),
+        newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new", Types.StructType.of(
+            Types.NestedField.optional(SCHEMA_LAST_COLUMN_ID + 1, "field", Types.StringType.get())
+        ))
+        .setIdentifierFields(Sets.newHashSet("new.field"))
+        .apply();
+
+    Assert.assertEquals("set newly added nested field as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("new.field").fieldId()),
+        newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new", Types.StructType.of(
+            Types.NestedField.optional(SCHEMA_LAST_COLUMN_ID + 1, "field", Types.StructType.of(
+                Types.NestedField.optional(SCHEMA_LAST_COLUMN_ID + 2, "nested", Types.StringType.get())))))
+        .setIdentifierFields(Sets.newHashSet("new.field.nested"))
+        .apply();
+
+    Assert.assertEquals("set newly added multi-layer nested field as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("new.field.nested").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testAddDottedIdentifierFieldColumns() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn(null, "dot.field", Types.StringType.get())
+        .setIdentifierFields(Sets.newHashSet("id", "dot.field"))
+        .apply();
+
+    Assert.assertEquals("add a field with dot as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("id").fieldId(), newSchema.findField("dot.field").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testRemoveIdentifierFields() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new_field", Types.StringType.get())
+        .addColumn("new_field2", Types.StringType.get())
+        .setIdentifierFields(Sets.newHashSet("id", "new_field", "new_field2"))
+        .apply();
+
+    newSchema = new SchemaUpdate(newSchema, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("new_field", "new_field2"))
+        .apply();
+
+    Assert.assertEquals("remove an identifier field should succeed",
+        Sets.newHashSet(newSchema.findField("new_field").fieldId(), newSchema.findField("new_field2").fieldId()),
+        newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(newSchema, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet())
+        .apply();
+
+    Assert.assertEquals("remove all identifier fields should succeed",
+        Sets.newHashSet(),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testSetIdentifierFieldsFails() {
+    AssertHelpers.assertThrows("add a field with name not exist should fail",
+        IllegalArgumentException.class,
+        "not found in current schema or added columns",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+            .setIdentifierFields(Sets.newHashSet("unknown"))
+            .apply());
+
+    AssertHelpers.assertThrows("add a field of non-primitive type should fail",
+        IllegalArgumentException.class,
+        "not a primitive type field",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+            .setIdentifierFields(Sets.newHashSet("locations"))
+            .apply());
+
+    AssertHelpers.assertThrows("add a nested field in map should fail",
+        IllegalArgumentException.class,
+        "must not be nested in " + SCHEMA.findField("locations"),
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+            .setIdentifierFields(Sets.newHashSet("locations.key.zip"))
+            .apply());
+
+    AssertHelpers.assertThrows("add a nested field in list should fail",
+        IllegalArgumentException.class,
+        "must not be nested in " + SCHEMA.findField("points"),
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+            .setIdentifierFields(Sets.newHashSet("points.element.x"))
+            .apply());
+
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new", Types.StructType.of(
+            Types.NestedField.optional(SCHEMA_LAST_COLUMN_ID + 1, "fields", Types.ListType.ofOptional(
+                SCHEMA_LAST_COLUMN_ID + 2, Types.StructType.of(
+                    Types.NestedField.optional(SCHEMA_LAST_COLUMN_ID + 3, "nested", Types.StringType.get())
+                ))
+            )
+        ))
+        .apply();
+
+    AssertHelpers.assertThrows("add a nested field in struct of a map should fail",

Review comment:
       Is this different than the `points.element.x` case?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r615319054



##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -408,11 +450,15 @@ private TableMetadata applyChangesToMapping(TableMetadata metadata) {
   private static Schema applyChanges(Schema schema, List<Integer> deletes,
                                      Map<Integer, Types.NestedField> updates,
                                      Multimap<Integer, Types.NestedField> adds,
-                                     Multimap<Integer, Move> moves) {
+                                     Multimap<Integer, Move> moves,
+                                     Set<String> rowIdentifierNames) {
     Types.StructType struct = TypeUtil
         .visit(schema, new ApplyChanges(deletes, updates, adds, moves))
         .asNestedType().asStructType();
-    return new Schema(struct.fields());
+    Set<Integer> refreshedRowIdentifiers = rowIdentifierNames.stream()
+        .map(name -> struct.field(name).fieldId())

Review comment:
       Yes, we should definitely have a test for this case.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] openinx commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

openinx commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r612209228



##########
File path: core/src/main/java/org/apache/iceberg/util/JsonUtil.java
##########
@@ -116,19 +122,42 @@ public static String getStringOrNull(String property, JsonNode node) {
   }
 
   public static List<String> getStringList(String property, JsonNode node) {
-    Preconditions.checkArgument(node.has(property), "Cannot parse missing list %s", property);
+    ImmutableList.Builder<String> builder = ImmutableList.builder();
+    return getCollection(property, node, builder::build, builder::add, JsonNode::asText,
+        e -> Preconditions.checkArgument(e.isTextual(), "Cannot parse string from non-text value: %s", e),
+        false);
+  }
+
+  public static Set<Integer> getIntegerSetOrNull(String property, JsonNode node) {
+    ImmutableSet.Builder<Integer> builder = ImmutableSet.builder();
+    return getCollection(property, node, builder::build, builder::add, JsonNode::asInt,
+        e -> Preconditions.checkArgument(e.isInt(), "Cannot parse string from non-integer value: %s", e),
+        true);
+  }
+
+  private static <C extends Collection<T>, T> C getCollection(
+      String property, JsonNode node,
+      Supplier<C> builder, Consumer<T> appender,
+      Function<JsonNode, T> getter, Consumer<JsonNode> validator,

Review comment:
       Those `builder`, `appender`, `getter` and `validator` makes the method to be really hard to follow, could we just defines a simple `visitor` interface to visit the `JsonNode` in this collection and collect all the final result to the upper layer method ? 

##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -408,11 +450,15 @@ private TableMetadata applyChangesToMapping(TableMetadata metadata) {
   private static Schema applyChanges(Schema schema, List<Integer> deletes,
                                      Map<Integer, Types.NestedField> updates,
                                      Multimap<Integer, Types.NestedField> adds,
-                                     Multimap<Integer, Move> moves) {
+                                     Multimap<Integer, Move> moves,
+                                     Set<String> rowIdentifierNames) {
     Types.StructType struct = TypeUtil
         .visit(schema, new ApplyChanges(deletes, updates, adds, moves))
         .asNestedType().asStructType();
-    return new Schema(struct.fields());
+    Set<Integer> refreshedRowIdentifiers = rowIdentifierNames.stream()
+        .map(name -> struct.field(name).fieldId())

Review comment:
       If we delete column `b` by using the `SchemaUpdate#deleteColumn` API,  then the `struct` variable which has been applied by those changes  won't include the field that has the colum name `b`, right ?   If sure, then how could we locate the field id for that given column name `b` in this sentence ? 

##########
File path: core/src/main/java/org/apache/iceberg/SchemaParser.java
##########
@@ -158,7 +169,7 @@ public static String toJson(Schema schema, boolean pretty) {
       if (pretty) {
         generator.useDefaultPrettyPrinter();
       }
-      toJson(schema.asStruct(), generator);
+      toJson(schema.asStruct(), schema.schemaId(), schema.rowIdentifiers(), generator);

Review comment:
       I think this [PR](https://github.com/apache/iceberg/pull/2096/files#diff-0fd54fb65a9c5d2fce5bd3386550b2e829e0f70c2992fdad8fdb9bd2d866a74fR161) forget to refresh this `toJson`  method.

##########
File path: api/src/main/java/org/apache/iceberg/Schema.java
##########
@@ -355,9 +387,17 @@ private Schema internalSelect(Collection<String> names, boolean caseSensitive) {
 
   @Override
   public String toString() {
-    return String.format("table {\n%s\n}",
-        NEWLINE.join(struct.fields().stream()
-            .map(f -> "  " + f)
-            .collect(Collectors.toList())));
+    StringBuilder sb = new StringBuilder();
+    sb.append("table {\n");
+    sb.append("  fields {\n");
+    sb.append(NEWLINE.join(struct.fields().stream()
+        .map(f -> "    " + f)
+        .collect(Collectors.toList())));
+    sb.append("\n  }\n  row identifiers { ");
+    sb.append(COMMA.join(Arrays.stream(rowIdentifiers)
+        .mapToObj(id -> String.format("%d", id))
+        .collect(Collectors.toList())));

Review comment:
       How about just simplify this as `sb.append(COMMA.join(lazyRowIdentifiersSet()))` ? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r612860695



##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -408,11 +450,15 @@ private TableMetadata applyChangesToMapping(TableMetadata metadata) {
   private static Schema applyChanges(Schema schema, List<Integer> deletes,
                                      Map<Integer, Types.NestedField> updates,
                                      Multimap<Integer, Types.NestedField> adds,
-                                     Multimap<Integer, Move> moves) {
+                                     Multimap<Integer, Move> moves,
+                                     Set<String> rowIdentifierNames) {
     Types.StructType struct = TypeUtil
         .visit(schema, new ApplyChanges(deletes, updates, adds, moves))
         .asNestedType().asStructType();
-    return new Schema(struct.fields());
+    Set<Integer> refreshedRowIdentifiers = rowIdentifierNames.stream()
+        .map(name -> struct.field(name).fieldId())
+        .collect(Collectors.toSet());
+    return new Schema(struct.fields(), refreshedRowIdentifiers);

Review comment:
       thanks! I completely forgot that part... I added a new method `sameSchema` and now use that to check equality in table metadata.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r617123843



##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -317,6 +320,31 @@ public UpdateSchema unionByNameWith(Schema newSchema) {
     return this;
   }
 
+  @Override
+  public UpdateSchema setIdentifierFields(Set<String> names) {

Review comment:
       I think we may want to also keep `addIdentifierField`. There's still a use case around adding a field at a time. Plus, given that I think we want to be able to specify the parent (see my comment below), this form of the method is a bit awkward because we can't pass pairs of strings, `(parent, name)`.
   
   What about changing this to `resetIdentifierFields`, `addIdentifierField`, and `removeIdentifierField`? I think those get the same job done because `set` is `reset` plus `add` calls.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r617124392



##########
File path: core/src/test/java/org/apache/iceberg/TestSchemaUpdate.java
##########
@@ -1223,4 +1223,173 @@ public void testMoveBetweenStructsFails() {
                 .moveBefore("s2.x", "s1.a")
                 .apply());
   }
+
+  @Test
+  public void testSetExistingIdentifierFields() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("id"))
+        .apply();
+
+    Assert.assertEquals("add an existing field as identifier field should succeed",
+        Sets.newHashSet(newSchema.findField("id").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testSetNewIdentifierFieldColumns() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new_field", Types.StringType.get())
+        .setIdentifierFields(Sets.newHashSet("id", "new_field"))
+        .apply();
+
+    Assert.assertEquals("add a new field as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("id").fieldId(), newSchema.findField("new_field").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testAddDottedIdentifierFieldColumns() {

Review comment:
       +1 for adding this case!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] yyanyy commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

yyanyy commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r612661016



##########
File path: core/src/main/java/org/apache/iceberg/SchemaParser.java
##########
@@ -158,7 +169,7 @@ public static String toJson(Schema schema, boolean pretty) {
       if (pretty) {
         generator.useDefaultPrettyPrinter();
       }
-      toJson(schema.asStruct(), generator);
+      toJson(schema.asStruct(), schema.schemaId(), schema.rowIdentifiers(), generator);

Review comment:
       Haven't reviewed the entire PR yet, just for this specific comment - the reason was mentioned in https://github.com/apache/iceberg/pull/2096#discussion_r575716542. Basically I think there are a lot of places where this method is called for serialization/deserialization of the schema as well as preserve the schema to some file metadata, and at that point sometimes `schema` object is already transformed/projected so it is no longer an existing schema of the table, and thus shouldn't have schema ID. Since schema ID is in theory only useful when retrieved directly from table metadata to get the right schema version for time travel, I think it's better to not explicitly write a schema Id except for writing to table metadata file, to avoid persisting an incorrect version. 
   
   For the case of supporting row identifier to schema, I wonder if we want to do something similar - since row identifier is mostly useful during writing (to check table to get the default row identifier for delete) and not reading (since data/delete file should have this preserved in another place) or any kind of projection (especially that with the current change during projection we will lose row identifier information), to ensure that we don't involve in the complications when working with a mutated schema object I wonder if we want to have a separate `toJson`/`fromJson` to handle the row identifier parsing just for table metadata, and once we translate table metadata file to its model, we read row id from table instead of schema with something like `table.getRowIdentifier` which may help with table serialization/deserialization. But this approach seems to contradict with some aspects in https://github.com/apache/iceberg/pull/2354#issuecomment-816337252. 
   
   @rdblue since you involved in both conversations, could you please take a look? Thank you!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r612011177



##########
File path: core/src/main/java/org/apache/iceberg/SchemaParser.java
##########
@@ -144,7 +155,7 @@ static void toJson(Type type, JsonGenerator generator) throws IOException {
   }
 
   public static void toJson(Schema schema, JsonGenerator generator) throws IOException {
-    toJson(schema.asStruct(), schema.schemaId(), generator);
+    toJson(schema.asStruct(), schema.schemaId(), schema.rowIdentifiers(), generator);

Review comment:
       not sure why this method was not updated in the past, is there any risk to use the v2 json writer?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r617121752



##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -408,11 +436,18 @@ private TableMetadata applyChangesToMapping(TableMetadata metadata) {
   private static Schema applyChanges(Schema schema, List<Integer> deletes,
                                      Map<Integer, Types.NestedField> updates,
                                      Multimap<Integer, Types.NestedField> adds,
-                                     Multimap<Integer, Move> moves) {
+                                     Multimap<Integer, Move> moves,
+                                     Set<Integer> identifierFields) {
     Types.StructType struct = TypeUtil
         .visit(schema, new ApplyChanges(deletes, updates, adds, moves))
         .asNestedType().asStructType();
-    return new Schema(struct.fields());
+
+    // validate latest identifier fields are not deleted
+    identifierFields.forEach(f -> Preconditions.checkArgument(!deletes.contains(f),
+        "Cannot delete identifier field %s. To force deletion, " +
+            "also call setIdentifierFields to reset identifier fields.", schema.findField(f)));

Review comment:
       I think it is more of an "update" than a "reset". Could you change this?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] pvary commented on pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

pvary commented on pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#issuecomment-829117889


   > Also @pvary since hive module uses schema `toJson` indicated in the comment below.
   
   Thanks @yyanyy for the notification.
   
   Checked the Hive related codepaths and I do not see any issue with the new Id which are not covered by the tests.
   We do not persist the schema json. We use the `toJson` to serialize the schema during the create table call, which should be ok I think.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#issuecomment-822935951


   @rdblue @openinx @yyanyy should be ready for another view, thank you!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r620750285



##########
File path: core/src/test/java/org/apache/iceberg/TestSchemaUpdate.java
##########
@@ -1223,4 +1223,232 @@ public void testMoveBetweenStructsFails() {
                 .moveBefore("s2.x", "s1.a")
                 .apply());
   }
+
+  @Test
+  public void testAddExistingIdentifierFields() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("id"))
+        .apply();
+
+    Assert.assertEquals("add an existing field as identifier field should succeed",
+        Sets.newHashSet(newSchema.findField("id").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testAddNewIdentifierFieldColumns() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new_field", Types.StringType.get())
+        .setIdentifierFields(Sets.newHashSet("id", "new_field"))
+        .apply();
+
+    Assert.assertEquals("add column then set as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("id").fieldId(), newSchema.findField("new_field").fieldId()),
+        newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("id", "new_field"))
+        .addColumn("new_field", Types.StringType.get())
+        .apply();
+
+    Assert.assertEquals("set identifier then add column should succeed",
+        Sets.newHashSet(newSchema.findField("id").fieldId(), newSchema.findField("new_field").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testAddNestedIdentifierFieldColumns() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("preferences.feature1"))
+        .apply();
+
+    Assert.assertEquals("set existing nested field as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("preferences.feature1").fieldId()),
+        newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new", Types.StructType.of(
+            Types.NestedField.optional(SCHEMA_LAST_COLUMN_ID + 1, "field", Types.StringType.get())
+        ))
+        .setIdentifierFields(Sets.newHashSet("new.field"))
+        .apply();
+
+    Assert.assertEquals("set newly added nested field as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("new.field").fieldId()),
+        newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new", Types.StructType.of(
+            Types.NestedField.optional(SCHEMA_LAST_COLUMN_ID + 1, "field", Types.StructType.of(
+                Types.NestedField.optional(SCHEMA_LAST_COLUMN_ID + 2, "nested", Types.StringType.get())))))
+        .setIdentifierFields(Sets.newHashSet("new.field.nested"))
+        .apply();
+
+    Assert.assertEquals("set newly added multi-layer nested field as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("new.field.nested").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testAddDottedIdentifierFieldColumns() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn(null, "dot.field", Types.StringType.get())
+        .setIdentifierFields(Sets.newHashSet("id", "dot.field"))
+        .apply();
+
+    Assert.assertEquals("add a field with dot as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("id").fieldId(), newSchema.findField("dot.field").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testRemoveIdentifierFields() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new_field", Types.StringType.get())
+        .addColumn("new_field2", Types.StringType.get())
+        .setIdentifierFields(Sets.newHashSet("id", "new_field", "new_field2"))
+        .apply();
+
+    newSchema = new SchemaUpdate(newSchema, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("new_field", "new_field2"))
+        .apply();
+
+    Assert.assertEquals("remove an identifier field should succeed",
+        Sets.newHashSet(newSchema.findField("new_field").fieldId(), newSchema.findField("new_field2").fieldId()),
+        newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(newSchema, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet())
+        .apply();
+
+    Assert.assertEquals("remove all identifier fields should succeed",
+        Sets.newHashSet(),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testSetIdentifierFieldsFails() {
+    AssertHelpers.assertThrows("add a field with name not exist should fail",
+        IllegalArgumentException.class,
+        "not found in current schema or added columns",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+            .setIdentifierFields(Sets.newHashSet("unknown"))
+            .apply());
+
+    AssertHelpers.assertThrows("add a field of non-primitive type should fail",
+        IllegalArgumentException.class,
+        "not a primitive type field",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+            .setIdentifierFields(Sets.newHashSet("locations"))
+            .apply());
+
+    AssertHelpers.assertThrows("add a nested field in map should fail",

Review comment:
       added




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r615318105



##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -193,6 +204,11 @@ public UpdateSchema renameColumn(String name, String newName) {
       updates.put(fieldId, Types.NestedField.of(fieldId, field.isOptional(), newName, field.type(), field.doc()));
     }
 
+    if (identifierFieldNames.contains(name)) {

Review comment:
       I agree with this. I think it would make sense to check whether `newName` is already a field in the schema. But that is catching a problem with a rename earlier and isn't really related to this PR.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r620719667



##########
File path: core/src/test/java/org/apache/iceberg/TestSchemaUpdate.java
##########
@@ -1223,4 +1223,232 @@ public void testMoveBetweenStructsFails() {
                 .moveBefore("s2.x", "s1.a")
                 .apply());
   }
+
+  @Test
+  public void testAddExistingIdentifierFields() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("id"))
+        .apply();
+
+    Assert.assertEquals("add an existing field as identifier field should succeed",
+        Sets.newHashSet(newSchema.findField("id").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testAddNewIdentifierFieldColumns() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new_field", Types.StringType.get())
+        .setIdentifierFields(Sets.newHashSet("id", "new_field"))
+        .apply();
+
+    Assert.assertEquals("add column then set as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("id").fieldId(), newSchema.findField("new_field").fieldId()),
+        newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("id", "new_field"))
+        .addColumn("new_field", Types.StringType.get())
+        .apply();
+
+    Assert.assertEquals("set identifier then add column should succeed",
+        Sets.newHashSet(newSchema.findField("id").fieldId(), newSchema.findField("new_field").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testAddNestedIdentifierFieldColumns() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("preferences.feature1"))
+        .apply();
+
+    Assert.assertEquals("set existing nested field as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("preferences.feature1").fieldId()),
+        newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new", Types.StructType.of(
+            Types.NestedField.optional(SCHEMA_LAST_COLUMN_ID + 1, "field", Types.StringType.get())
+        ))
+        .setIdentifierFields(Sets.newHashSet("new.field"))
+        .apply();
+
+    Assert.assertEquals("set newly added nested field as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("new.field").fieldId()),
+        newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new", Types.StructType.of(
+            Types.NestedField.optional(SCHEMA_LAST_COLUMN_ID + 1, "field", Types.StructType.of(
+                Types.NestedField.optional(SCHEMA_LAST_COLUMN_ID + 2, "nested", Types.StringType.get())))))
+        .setIdentifierFields(Sets.newHashSet("new.field.nested"))
+        .apply();
+
+    Assert.assertEquals("set newly added multi-layer nested field as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("new.field.nested").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testAddDottedIdentifierFieldColumns() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn(null, "dot.field", Types.StringType.get())
+        .setIdentifierFields(Sets.newHashSet("id", "dot.field"))
+        .apply();
+
+    Assert.assertEquals("add a field with dot as identifier should succeed",
+        Sets.newHashSet(newSchema.findField("id").fieldId(), newSchema.findField("dot.field").fieldId()),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testRemoveIdentifierFields() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .addColumn("new_field", Types.StringType.get())
+        .addColumn("new_field2", Types.StringType.get())
+        .setIdentifierFields(Sets.newHashSet("id", "new_field", "new_field2"))
+        .apply();
+
+    newSchema = new SchemaUpdate(newSchema, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet("new_field", "new_field2"))
+        .apply();
+
+    Assert.assertEquals("remove an identifier field should succeed",
+        Sets.newHashSet(newSchema.findField("new_field").fieldId(), newSchema.findField("new_field2").fieldId()),
+        newSchema.identifierFieldIds());
+
+    newSchema = new SchemaUpdate(newSchema, SCHEMA_LAST_COLUMN_ID)
+        .setIdentifierFields(Sets.newHashSet())
+        .apply();
+
+    Assert.assertEquals("remove all identifier fields should succeed",
+        Sets.newHashSet(),
+        newSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testSetIdentifierFieldsFails() {
+    AssertHelpers.assertThrows("add a field with name not exist should fail",
+        IllegalArgumentException.class,
+        "not found in current schema or added columns",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+            .setIdentifierFields(Sets.newHashSet("unknown"))
+            .apply());
+
+    AssertHelpers.assertThrows("add a field of non-primitive type should fail",
+        IllegalArgumentException.class,
+        "not a primitive type field",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+            .setIdentifierFields(Sets.newHashSet("locations"))
+            .apply());
+
+    AssertHelpers.assertThrows("add a nested field in map should fail",

Review comment:
       Can you add a test for trying to reference a field in a map value struct?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r617124083



##########
File path: api/src/main/java/org/apache/iceberg/UpdateSchema.java
##########
@@ -384,4 +385,21 @@ default UpdateSchema updateColumn(String name, Type.PrimitiveType newType, Strin
    *                                  with other additions, renames, or updates.
    */
   UpdateSchema unionByNameWith(Schema newSchema);
+
+  /**
+   * Set the identifier fields given a set of column names.
+   * <p>
+   * Identifier field is a similar concept as primary key in a relational database system.
+   * A row should be unique based on the values of the identifier fields.
+   * However, unlike a primary key, Iceberg does not enforce the uniqueness of a row based on this information.
+   * It is used for operations like upsert to define the default upsert key.
+   * <p>
+   * A column in the identifier fields must be of a primitive type.
+   * Each column set in this operation must exist in the current schema to update,
+   * or is added as a part of this update.
+   *
+   * @param names names of the columns to set as identifier fields
+   * @return this for method chaining
+   */
+  UpdateSchema setIdentifierFields(Set<String> names);

Review comment:
       Nevermind, I thought about this with the need to pass a parent name and it doesn't make sense. I'm going to resolve this.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r615310922



##########
File path: api/src/main/java/org/apache/iceberg/Schema.java
##########
@@ -331,6 +363,16 @@ public Schema caseInsensitiveSelect(Collection<String> names) {
     return internalSelect(names, false);
   }
 
+  /**
+   * Checks whether this schema is equivalent to another schema while ignoring the schema ID.
+   * @param anotherSchema another schema
+   * @return true if this schema is equivalent to the given schema
+   */
+  public boolean sameSchema(Schema anotherSchema) {
+    return asStruct().equals(anotherSchema.asStruct()) &&
+        identifierFieldIds().equals(anotherSchema.identifierFieldIds());

Review comment:
       Yeah, I think it should be fine to check the sets.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#issuecomment-825419521


   @openinx thanks, fixed based on your comments


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r616256308



##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -317,6 +333,38 @@ public UpdateSchema unionByNameWith(Schema newSchema) {
     return this;
   }
 
+  @Override
+  public UpdateSchema addIdentifierField(String name) {
+    Types.NestedField field = schema.findField(name);
+    if (field == null) {
+      field = adds.get(TABLE_ROOT_ID).stream()
+          .filter(f -> f.name().equals(name))
+          .findAny().orElse(null);

Review comment:
       Combining with your last comment, it seems like the best way to go is to just check the column isn't nested in a map/list, and in that case the root will always be TABLE_ROOT_ID. With that check added, I feel there is a bit redundant to have 2 methods like `addColumn`, because the private `addIdentifierField` method would always use root as parent. Let me know what you think with the new code.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] openinx commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

openinx commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r612910279



##########
File path: api/src/main/java/org/apache/iceberg/Schema.java
##########
@@ -331,6 +363,16 @@ public Schema caseInsensitiveSelect(Collection<String> names) {
     return internalSelect(names, false);
   }
 
+  /**
+   * Checks whether this schema is equivalent to another schema while ignoring the schema ID.
+   * @param anotherSchema another schema
+   * @return true if this schema is equivalent to the given schema
+   */
+  public boolean sameSchema(Schema anotherSchema) {
+    return asStruct().equals(anotherSchema.asStruct()) &&
+        identifierFieldIds().equals(anotherSchema.identifierFieldIds());

Review comment:
       Using the `Set` to check the equality should be correct,  I was thinking that we could use the simple `Arrays.equals(identifierFieldIds, anotherSchema.identifierFieldIds)`, that means the identifier field id's order will matter and that's not what we expected.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r615318441



##########
File path: api/src/main/java/org/apache/iceberg/UpdateSchema.java
##########
@@ -384,4 +384,25 @@ default UpdateSchema updateColumn(String name, Type.PrimitiveType newType, Strin
    *                                  with other additions, renames, or updates.
    */
   UpdateSchema unionByNameWith(Schema newSchema);
+
+  /**
+   * Add a new row identifier given the column name.
+   * <p>
+   * The column must be a required field of a primitive type.
+   * It must exist in the current schema to update, or is added as a part of this update.
+   *
+   * @param columnName name of the column to add as a row identifier
+   * @return this for method chaining
+   */
+  UpdateSchema addRowIdentifier(String columnName);

Review comment:
       @jackye1995, I don't think that the add/remove example you gave is idempotent because of the check `identifierFieldNames.contains(name)` that will result in an error: "Cannot remove column from identifier fields account_id because it is not an identifier field". @openinx gives a good example of why it's awkward to make users fall back to `remove`. And it is easy to make `set` work internally because you just clear the field ID set before calling `add` in a loop.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] yyanyy commented on pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

yyanyy commented on pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#issuecomment-828901103


   LGTM overall, just want to bring the below comment (that's not really related to the focus of this PR itself) I had originally [here](https://github.com/apache/iceberg/pull/2465#discussion_r616222010) to people's attention and make sure we are all aligned, since this PR does change the behavior of what we preserve in metadata fields. If we do go with the current approach, I'll raise a separate PR to mention the usage of schema ID and the caveat in spec. 
   
   Also @pvary since hive module uses schema `toJson` indicated in the comment below. 
   
   > Basically I think there are a lot of places where this method is called for serialization/deserialization of the schema as well as preserve the schema to some file metadata, and at that point sometimes schema object is already transformed/projected so it is no longer an existing schema of the table, and thus shouldn't have schema ID. Since schema ID is in theory only useful when retrieved directly from table metadata to get the right schema version for time travel, I think it's better to not explicitly write a schema Id except for writing to table metadata file, to avoid persisting an incorrect version.
   > ... 
   >Thanks for the response @rdblue ! Just want to mention an additional case: in addition to the possibility of writing incorrect ID for a table schema, writing id here also makes us persisting id for things not related to the table schema itself, e.g. [manifest list](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/ManifestListWriter.java#L94-L98) and [delete files](https://github.com/apache/iceberg/blob/master/parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java#L387-L414). But if we are fine with writing incorrect ID, we probably will be fine with writing ID for non-schema structs as well...
   > 
   > I guess essentially we are choosing between 1) simpler code with potential confusing IDs and 2) code complexities (due to avoid writing incorrect/unuseful IDs); I personally would prefer the latter but I don't strongly oppose the former if we think it's easier to reason about in code. Also we might want to involve people from Hive as part of this discussion since they are using this method as well to preserve the schema ([example](https://github.com/apache/iceberg/blob/master/mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java#L119))
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#issuecomment-821907154


   Thanks, @jackye1995! I think overall the direction is looking right with `identifierFieldIds`. I just noted a few things to fix:
   * Need to add `setIdentifierFields`
   * Needs some implementation fixes (track by ID, etc.)
   * Need to refactor tests into separate cases


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] openinx commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

openinx commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r618917879



##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -408,11 +427,55 @@ private TableMetadata applyChangesToMapping(TableMetadata metadata) {
   private static Schema applyChanges(Schema schema, List<Integer> deletes,
                                      Map<Integer, Types.NestedField> updates,
                                      Multimap<Integer, Types.NestedField> adds,
-                                     Multimap<Integer, Move> moves) {
+                                     Multimap<Integer, Move> moves,
+                                     Set<String> identifierNames) {
+    // validate existing identifier fields are not deleted
+    for (String name : identifierNames) {
+      Types.NestedField field = schema.findField(name);

Review comment:
       I listed the steps because I wanna to confirm that others are feeling good for this (I think it's OK). Seems like every one is OK, then I will mark this conversation resolved.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r617117855



##########
File path: api/src/main/java/org/apache/iceberg/UpdateSchema.java
##########
@@ -384,4 +385,21 @@ default UpdateSchema updateColumn(String name, Type.PrimitiveType newType, Strin
    *                                  with other additions, renames, or updates.
    */
   UpdateSchema unionByNameWith(Schema newSchema);
+
+  /**
+   * Set the identifier fields given a set of column names.
+   * <p>
+   * Identifier field is a similar concept as primary key in a relational database system.
+   * A row should be unique based on the values of the identifier fields.
+   * However, unlike a primary key, Iceberg does not enforce the uniqueness of a row based on this information.
+   * It is used for operations like upsert to define the default upsert key.
+   * <p>
+   * A column in the identifier fields must be of a primitive type.
+   * Each column set in this operation must exist in the current schema to update,
+   * or is added as a part of this update.

Review comment:
       Minor: I think this should be: "Each column must exist in the current schema or have been added in this update".




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] openinx commented on pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

openinx commented on pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#issuecomment-826536901


   @rdblue ,  would you have any other concern ?  If no,  I will plan to get this merged.  Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r617121331



##########
File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
##########
@@ -317,6 +320,31 @@ public UpdateSchema unionByNameWith(Schema newSchema) {
     return this;
   }
 
+  @Override
+  public UpdateSchema setIdentifierFields(Set<String> names) {
+    identifierFields = Sets.newHashSet();
+    names.forEach(this::addIdentifierField);
+    return this;
+  }
+
+  private void addIdentifierField(String name) {
+    Types.NestedField field = schema.findField(name);
+    if (field == null) {
+      field = adds.get(TABLE_ROOT_ID).stream()
+          .filter(f -> f.name().equals(name))
+          .findAny().orElse(null);
+    } else {
+      Preconditions.checkArgument(idToParent.get(field.fieldId()) == null,
+          "Cannot add column %s as an identifier: must not be nested in a map or list", name);

Review comment:
       [`IndexParents` supports maps and lists](https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/types/IndexParents.java#L67), so I think you need to actually look up the parent field and check its type. You'd also need to do that all the way up to the root to ensure that this isn't wrong because of multiple levels of nesting, like a struct as a map value.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2465: Core: add row identifier to schema

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r615319323



##########
File path: core/src/test/java/org/apache/iceberg/TestSchemaUpdate.java
##########
@@ -1223,4 +1223,98 @@ public void testMoveBetweenStructsFails() {
                 .moveBefore("s2.x", "s1.a")
                 .apply());
   }
+
+  @Test
+  public void testAddIdentifierField() {
+    Schema newSchema = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .allowIncompatibleChanges()
+        .addIdentifierField("id")
+        .addRequiredColumn("id2", Types.StringType.get())
+        .addIdentifierField("id2")
+        .apply();
+    Assert.assertEquals(Sets.newHashSet(1, SCHEMA_LAST_COLUMN_ID + 1), newSchema.identifierFieldIds());
+
+    AssertHelpers.assertThrows("Should fail if column with name not exist",
+        IllegalArgumentException.class,
+        "Cannot find column of the given name in existing or newly added columns",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID).addIdentifierField("unknown").apply());
+
+    AssertHelpers.assertThrows("Should fail if column with name not exist",
+        IllegalArgumentException.class,
+        "it is not a required field",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID).addIdentifierField("data").apply());
+
+    AssertHelpers.assertThrows("Should fail if column with name not exist",
+        IllegalArgumentException.class,
+        "it is not a primitive field",
+        () -> new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID).addIdentifierField("locations").apply());
+  }
+
+  @Test
+  public void testRenameIdentifierField() {
+    Schema schemaWithIdentifierFields = new SchemaUpdate(SCHEMA, SCHEMA_LAST_COLUMN_ID)
+        .allowIncompatibleChanges()

Review comment:
       Why does this call `allowIncompatibleChanges`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org