You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/03/21 02:45:14 UTC

[GitHub] [iceberg] jackye1995 opened a new pull request #2354: Core: add row identifier to format v2

jackye1995 opened a new pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354


   This is the continuation for #2010 for adding a concept that describes how a row in a table should be uniquely identified. I have the implementation ready up to the Spark SQL extension to update the row identifier, and will separate them into multiple PRs for review. This PR should have the same amount of content as what openInx had in the old PR.
   
   This PR adds row identifier to the Table and TableMetadata API, and writes the metadata information as something like:
   
   ```json
     ...
     "default-row-id-version": 1,
     "row-ids": [
       {
         "row-id-version": 3,
         "fields": [
           {
             "source-id": 1
           },
           {
             "source-id": 3
           }
         ]
       }
     ],
     ...
   ```
   
   I will add reasons behind the namings inline.
   
   @openinx @rdblue @aokolnychyi 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#issuecomment-812076981


   @openinx, what I meant is that we can figure out the upsert columns from the ON condition of MERGE INTO.
   
   ```
   MERGE INTO t USING s
   ON t.id = s.id
   ....
   ```
   
   ```
   MERGE INTO t USING s
   ON t.id = s.id AND t.account_id = s.account_id
   ....
   ```
   
   That's different from upsert use cases where we don't have the upsert columns in the command itself.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#issuecomment-810410893


   @openinx, I think understand what you're saying: that if we track the schema for a snapshot then why not track the partition spec, sort order, and row identifier information?
   
   The reason is that the partition spec, sort order, and row identifier do not affect existing snapshots. Those are used when writing, not when reading. When reading, we always use whatever the data files are written with.
   
   A good example of this is partition spec. The table has a "default" partition spec that is used when writing new data, but existing data in the table was already written with one or more specs that don't change. Once a manifest is written using some partition spec, that spec is fixed and part of the metadata tree. We don't need to keep around the spec that was used to produce a given snapshot because it isn't useful. What is useful is what spec was used to write any given manifest, and that's encoded in the manifest's metadata.
   
   Sort order and row identifier fields are the same way. Once we've written data in some order, we attach the order and we don't need to know what the table's order was anymore. For delete file fields, we can use the table's row identifier fields but once we've done that we don't need to know what the table's configuration was any more.
   
   Schema is a bit different because we need to know what the "current" schema was in order to time-travel because we can change the schema and continue reading the same data files afterward.
   
   I think that we should not attach the row identifier fields to the schema, and should instead consider them table metadata that gets used on write, but is not needed for time travel.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#issuecomment-811571178


   I support the idea of a row identifier as long as Iceberg does not enforce it. I see its primary usage in `UPSERT` statements where we don't know the upsert columns unless they are provided in the command.
   
   I also think it is important not to limit equality deletes to row identifier alone, which is currently handled by the spec as each delete file is associated with arbitrary column ids. We plan to leverage it in some MERGE INTO use cases, where the we can derive the delete column from the ON clause and merge columns can vary from operation to operation.
   
   W.r.t. versioning, I'd go simple. I think the current rollback semantics applies only to snapshots. We don't revert table properties or sort order. I believe we should treat row identifiers in the same way.
   
   That said, @openinx's use case is also valid. I have seen scenarios when users want to rollback the table state completely rather the current snapshot. I think that should be done by replacing the current pointer in the catalog to an old JSON file rather than by calling the table rollback API. Do we want to expose ways for rolling back table state to the users? I think that may be handy and should cover the use case that @openinx brought up.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r605850824



##########
File path: api/src/main/java/org/apache/iceberg/RowKey.java
##########
@@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.List;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.types.Types;
+
+/**
+ * Row key of a table.
+ * <p>
+ * Row key is a definition of table row uniqueness,
+ * similar to the concept of primary key in a relational database system.
+ * A row should be unique in a table based on the values of each {@link RowKeyField}.
+ * Iceberg itself does not enforce row uniqueness based on this key.
+ * It is leveraged by operations such as streaming upsert.
+ */
+public class RowKey implements Serializable {

Review comment:
       I agree with Jack's logic that "id" is typically used in Iceberg to refer to a numeric identifier. It would be odd to use `RowId`, especially given the overlap with the JDBC one. But, we have had a significant number of people that find "key" confusing when it is a non-unique "key".
   
   What about shifting the focus from the "key" or "identifier" to the fields? We could use `identifier-field-ids` to hold the collection and add `identifierFieldIds` to APIs.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r605377769



##########
File path: api/src/main/java/org/apache/iceberg/RowKey.java
##########
@@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.List;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.types.Types;
+
+/**
+ * Row key of a table.
+ * <p>
+ * Row key is a definition of table row uniqueness,
+ * similar to the concept of primary key in a relational database system.
+ * A row should be unique in a table based on the values of each {@link RowKeyField}.
+ * Iceberg itself does not enforce row uniqueness based on this key.
+ * It is leveraged by operations such as streaming upsert.
+ */
+public class RowKey implements Serializable {

Review comment:
       I think we have gone back and forth on this naming, and for now I would prefer the `Key` case because `Id` is heavily used in table metadata to mean concepts such as `spec-id`, `schema-id`, `order-id`, etc. which are the increasing ID of different specs. Using a different keyword `Key` would provide more clarity in the table metadata. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
openinx commented on pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#issuecomment-810708037


   @rdblue Yes, if we only consider the records time-travel among different snapshots,  the old versions of `partitionSpec`, `sortOrder`, `rowIdentifier` are indeed useless. Maybe I'm considering a slight different case (Still use the same example): 
   
   `t1`: User defines the account_id as the row identifier;
   `t2`: Write few records into the table;
   `t3`: Write few equality deletions (by account_id) into table;
   `t4`: Adding profile_id to row identifier, now the identifier is account_id & profile_id;
   `t5`: Write few equality deletions ( by account_id & profile_id) into table;
   
   People modified the `RowIdentifier` from `account_id` to `(account_id, profile_id)` at timestamp `t4`, after writing few records at timestamp `t5`.  He find that the `(account_id, profile_id)` deletions will lead to business error,  so he plan to revert this iceberg table to `t3` and replaying all deletions whose identifier are `(account_id)`. 
   
   My question is:  after reverting the table to `t3`,  should people still see the incorrect row identifier `(account_id, profile_id)` by default or people should see the correct row identifier `(account_id)` by default ?  
   
   Currently,   we implementation is the first one, that means people will need to  manually change the `(account_id, profile_id)` to `(account_id)` .  If we are very clear that we will continue to use this behavior in the future, then we really do not need to maintain multiple versions of the row key. Otherwise, maintaining multiple versions is necessary. From my understanding, the second behavior is actually more user-friendly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2354: Core: add row identifier to format v2

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r598210919



##########
File path: api/src/main/java/org/apache/iceberg/RowIdentifier.java
##########
@@ -0,0 +1,199 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.List;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.types.Types;
+
+/**
+ * Row identifier of a table.
+ * <p>
+ * Row identifier is a definition of table row uniqueness,
+ * similar to the concept of primary key in a relational database system.
+ * A row should be unique in a table based on the values of each {@link RowIdentifyField}.
+ * Iceberg itself does not enforce row uniqueness based on this identifier.
+ * It is leveraged by operations such as streaming upsert.
+ */
+public class RowIdentifier implements Serializable {

Review comment:
       I have considered some alternatives for the name, and discarded the following:
     1. primary key: used in db systems and implies enforcing uniqueness, which does not fit Iceberg use case
     2. upsert id: too specific for the upsert use case
     3. row id: conflicts with `java.sql.RowId`, and gives a feeling of referring to a specific row
     4. default row id: defaultXxx is used a lot in metadata, having a class with a default in prefix makes code in table metadata confusing. 
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r603455719



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -573,8 +610,43 @@ public TableMetadata replaceSortOrder(SortOrder newOrder) {
 
     return new TableMetadata(null, formatVersion, uuid, location,
         lastSequenceNumber, System.currentTimeMillis(), lastColumnId, currentSchemaId, schemas, defaultSpecId, specs,
-        lastAssignedPartitionId, newOrderId, builder.build(), properties, currentSnapshotId, snapshots, snapshotLog,
-        addPreviousFile(file, lastUpdatedMillis));
+        lastAssignedPartitionId, newOrderId, builder.build(), defaultRowKeyId, rowKeys,
+        properties, currentSnapshotId, snapshots, snapshotLog, addPreviousFile(file, lastUpdatedMillis));
+  }
+
+  public TableMetadata updateRowKey(RowKey newKey) {

Review comment:
       Yes, I have a separate PR for that after this is out.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r606060794



##########
File path: api/src/main/java/org/apache/iceberg/RowKey.java
##########
@@ -0,0 +1,179 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.Set;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.apache.iceberg.types.Types;
+
+/**
+ * Row key of a table.
+ * <p>
+ * Row key is a definition of table row uniqueness,
+ * similar to the concept of primary key in a relational database system.
+ * A row should be unique in a table based on the values of each {@link RowKeyIdentifierField}.
+ * Iceberg itself does not enforce row uniqueness based on this key.
+ * It is leveraged by operations such as streaming upsert.
+ */
+public class RowKey implements Serializable {
+
+  private static final RowKey NOT_IDENTIFIED = new RowKey(new Schema(), Sets.newHashSet());
+
+  private final Schema schema;
+  private final RowKeyIdentifierField[] identifierFields;
+
+  private transient volatile Set<RowKeyIdentifierField> identifierFieldSet;
+
+  private RowKey(Schema schema, Set<RowKeyIdentifierField> identifierFields) {
+    this.schema = schema;
+    this.identifierFields = identifierFields.toArray(new RowKeyIdentifierField[0]);
+  }
+
+  /**
+   * Returns the {@link Schema} referenced by the row key
+   */
+  public Schema schema() {
+    return schema;
+  }
+
+  /**
+   * Return the set of {@link RowKeyIdentifierField} in the row key
+   * <p>
+   * @return the set of fields in the row key
+   */
+  public Set<RowKeyIdentifierField> identifierFields() {
+    return lazyIdentifierFieldSet();
+  }
+
+  private Set<RowKeyIdentifierField> lazyIdentifierFieldSet() {
+    if (identifierFieldSet == null) {
+      synchronized (this) {
+        if (identifierFieldSet == null) {
+          identifierFieldSet = ImmutableSet.copyOf(identifierFields);
+        }
+      }
+    }
+
+    return identifierFieldSet;
+  }
+
+  /**
+   * Returns the default row key that has no field
+   */
+  public static RowKey notIdentified() {
+    return NOT_IDENTIFIED;
+  }
+
+  /**
+   * Returns true if the row key is the default one with no field
+   */
+  public boolean isNotIdentified() {
+    return identifierFields.length < 1;
+  }
+
+  @Override
+  public boolean equals(Object other) {
+    if (this == other) {
+      return true;
+    } else if (other == null || getClass() != other.getClass()) {
+      return false;
+    }
+
+    RowKey that = (RowKey) other;
+    return identifierFields().equals(that.identifierFields());
+  }
+
+  @Override
+  public int hashCode() {
+    return Arrays.hashCode(identifierFields);

Review comment:
       Is this correct ?  Here different fields order will produce different hash code ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r606061307



##########
File path: api/src/main/java/org/apache/iceberg/RowKey.java
##########
@@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.List;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.types.Types;
+
+/**
+ * Row key of a table.
+ * <p>
+ * Row key is a definition of table row uniqueness,
+ * similar to the concept of primary key in a relational database system.
+ * A row should be unique in a table based on the values of each {@link RowKeyField}.
+ * Iceberg itself does not enforce row uniqueness based on this key.
+ * It is leveraged by operations such as streaming upsert.
+ */
+public class RowKey implements Serializable {

Review comment:
       Personally, I don't have strong opinion about the `RowKeyField` or `RowKeyIdentifierField`.  I'm okay if you think it's good for one of them.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r605813283



##########
File path: api/src/main/java/org/apache/iceberg/catalog/Catalog.java
##########
@@ -22,6 +22,7 @@
 import java.util.List;
 import java.util.Map;
 import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.RowKey;

Review comment:
       That's a good point, I thought about that, and there were 2 reasons that let me decide to not add it:
   
   1. currently both Spark and Hive SQL specification does not allow a primary key clause. I would expect the user to run something like `ALTER TABLE ... ADD ROW KEY` (actual syntax TBD) to add this row key in Spark.
   
   2. it's now becoming hard to iterate through all the combinations of parameters in `createTable`, we do not have the methods with SortOrder in parameter yet, although it can be mapped to `SORTED BY` clause. I would prefer us to switch to use the table builder if possible instead  of adding all those overloading methods.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r610211043



##########
File path: api/src/main/java/org/apache/iceberg/RowKey.java
##########
@@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.List;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.types.Types;
+
+/**
+ * Row key of a table.
+ * <p>
+ * Row key is a definition of table row uniqueness,
+ * similar to the concept of primary key in a relational database system.
+ * A row should be unique in a table based on the values of each {@link RowKeyField}.
+ * Iceberg itself does not enforce row uniqueness based on this key.
+ * It is leveraged by operations such as streaming upsert.
+ */
+public class RowKey implements Serializable {

Review comment:
       To be more clear, I don't think we should ignore the near consensus from our sync discussion that "key" is misleading. I think we should instead call this class `IdentityFields` (or something similar) and store `identifier-field-ids` in table metadata.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 closed pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
jackye1995 closed pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2354: Core: add row identifier to format v2

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r598211085



##########
File path: api/src/main/java/org/apache/iceberg/RowIdentifyField.java
##########
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Objects;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.types.Types;
+
+/**
+ * A field in {@link RowIdentifier}
+ * <p>
+ * The field must be:
+ *  1. a required column in the table schema
+ *  2. a primitive type column
+ */
+public class RowIdentifyField implements Serializable {

Review comment:
       the field only contains the source column id. Technically we can directly make `fields()` a `Integer[]`, but because this will be a public API, it will be hard to change in case we want to amend it in the future, so I decide to still have this separated class for the individual field information. This also aligns better with the structure of partition spec and sort order.
   
   In `SortOrder`, I see transform is also added as a part of the field, but it seems that only identity transform is permitted. It is the same situation here, so maybe we can also add transform here if anyone thinks there might be a use case of it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on pull request #2354: Core: add row identifier to format v2

Posted by GitBox <gi...@apache.org>.
openinx commented on pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#issuecomment-806419842


   I like this PR ,  just left several comments. We may need to change this title to `Core: Add RowKey Specification in format v2`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r605806275



##########
File path: api/src/main/java/org/apache/iceberg/RowKey.java
##########
@@ -0,0 +1,178 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.Set;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.apache.iceberg.types.Types;
+
+/**
+ * Row key of a table.
+ * <p>
+ * Row key is a definition of table row uniqueness,
+ * similar to the concept of primary key in a relational database system.
+ * A row should be unique in a table based on the values of each {@link RowKeyField}.
+ * Iceberg itself does not enforce row uniqueness based on this key.
+ * It is leveraged by operations such as streaming upsert.
+ */
+public class RowKey implements Serializable {
+
+  private static final RowKey NOT_IDENTIFIED = new RowKey(null, Sets.newHashSet());
+
+  private final Schema schema;
+  private final RowKeyField[] fields;
+
+  private transient volatile Set<RowKeyField> fieldSet;
+
+  private RowKey(Schema schema, Set<RowKeyField> fields) {
+    this.schema = schema;
+    this.fields = fields.toArray(new RowKeyField[0]);
+  }
+
+  /**
+   * Returns the {@link Schema} referenced by the row key
+   */
+  public Schema schema() {
+    return schema;
+  }
+
+  /**
+   * Return the set of {@link RowKeyField} in the row key
+   * <p>
+   * @return the set of fields in the row key
+   */
+  public Set<RowKeyField> fields() {
+    return lazyFieldSet();
+  }
+
+  private Set<RowKeyField> lazyFieldSet() {
+    if (fieldSet == null) {
+      synchronized (this) {
+        if (fieldSet == null) {
+          fieldSet = Sets.newHashSet(Arrays.asList(fields));

Review comment:
       Yeah good point




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
openinx commented on pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#issuecomment-811609723


   Okay,  I think everyone has reached a consensus on this issue `Keeping table metadata and data separate (and only versioning data) is the right behavior`.   Then let's keep this consensus.  @aokolnychyi 's suggestion about `replacing the current pointer in the catalog to an old JSON file rather than by calling the table rollback API.`  looks good to me, I think this way we can also achieve the rollback of the table metadata (for now, this priority does not sound that high because people could change table state as they want by calling table API).
   
   > I support the idea of a row identifier as long as Iceberg does not enforce it
   
   As a common iceberg table specification,  the row identifier don't have to be enforced. (I've left a comment [here](https://github.com/apache/iceberg/pull/2010#issuecomment-800769586)).
   
   > We plan to leverage it in some MERGE INTO use cases, where the we can derive the delete column from the ON clause and merge columns can vary from operation to operation.
   
   I don't know much about this point, I guess you may want to use row identifier to achieve some optimizations at the spark engine level. Can you provide more information?
   
   @jackye1995 ,  I think we could update this PR now, thanks for the great work
   
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r605850824



##########
File path: api/src/main/java/org/apache/iceberg/RowKey.java
##########
@@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.List;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.types.Types;
+
+/**
+ * Row key of a table.
+ * <p>
+ * Row key is a definition of table row uniqueness,
+ * similar to the concept of primary key in a relational database system.
+ * A row should be unique in a table based on the values of each {@link RowKeyField}.
+ * Iceberg itself does not enforce row uniqueness based on this key.
+ * It is leveraged by operations such as streaming upsert.
+ */
+public class RowKey implements Serializable {

Review comment:
       I agree with Jack's logic that "id" is typically used in Iceberg to refer to a numeric identifier. It would be odd to use `RowId`, especially given the overlap with the JDBC one. But, we have had a significant number of people that find "key" confusing when it is a non-unique "key".
   
   What about shifting the focus from the "key" or "identifier" to the fields? We could use `identity-field-ids` to hold the collection and add `identityFieldIds` to APIs.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r605939943



##########
File path: api/src/main/java/org/apache/iceberg/RowKey.java
##########
@@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.List;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.types.Types;
+
+/**
+ * Row key of a table.
+ * <p>
+ * Row key is a definition of table row uniqueness,
+ * similar to the concept of primary key in a relational database system.
+ * A row should be unique in a table based on the values of each {@link RowKeyField}.
+ * Iceberg itself does not enforce row uniqueness based on this key.
+ * It is leveraged by operations such as streaming upsert.
+ */
+public class RowKey implements Serializable {

Review comment:
       I renamed class `RowKeyField` to `RowKeyIdentifierField`, and `fields` to `identifier-fields` in metadata. Please let me know if that feels better.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2354: Core: add row identifier to format v2

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r601031583



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -877,6 +986,26 @@ private static SortOrder freshSortOrder(int orderId, Schema schema, SortOrder so
     return builder.build();
   }
 
+  private static RowKey freshRowKey(int keyId, Schema schema, RowKey rowKey) {
+    RowKey.Builder builder = RowKey.builderFor(schema).withKeyId(keyId);
+
+    for (RowKeyField field : rowKey.fields()) {
+      // look up the name of the source field in the old schema to get the new schema's id
+      String columnName = rowKey.schema().findColumnName(field.sourceId());
+      Preconditions.checkNotNull(columnName,
+          "Cannot find column in the row key's schema. id: %s, schema: %s",
+          field.sourceId(), rowKey.schema());
+
+      // reassign all row keys with fresh column IDs.
+      Types.NestedField column = schema.findField(columnName);
+      Preconditions.checkNotNull(column,
+          "Cannot find column in the fresh schema. name: %s, schema: %s", columnName, schema);
+      builder.addField(column.fieldId());

Review comment:
       Why not just use `builder.addField(columnName)` here ?   The builder#addField will validate the existence of the column inside, so we don't have to check it again here. 

##########
File path: api/src/main/java/org/apache/iceberg/RowKey.java
##########
@@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.List;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.types.Types;
+
+/**
+ * Row key of a table.
+ * <p>
+ * Row key is a definition of table row uniqueness,
+ * similar to the concept of primary key in a relational database system.
+ * A row should be unique in a table based on the values of each {@link RowKeyField}.
+ * Iceberg itself does not enforce row uniqueness based on this key.
+ * It is leveraged by operations such as streaming upsert.
+ */
+public class RowKey implements Serializable {
+
+  private static final RowKey NOT_IDENTIFIED = new RowKey(null, 0, ImmutableList.of());
+
+  private final Schema schema;
+  private final int keyId;
+  private final RowKeyField[] fields;
+
+  private transient volatile List<RowKeyField> fieldList;
+
+  private RowKey(Schema schema, int keyId, List<RowKeyField> fields) {
+    this.schema = schema;
+    this.keyId = keyId;
+    this.fields = fields.toArray(new RowKeyField[0]);
+  }
+
+  /**
+   * Returns the {@link Schema} referenced by the row key
+   */
+  public Schema schema() {
+    return schema;
+  }
+
+  /**
+   * Returns the ID of the row key
+   */
+  public int keyId() {
+    return keyId;
+  }
+
+  /**
+   * Return the list of {@link RowKeyField} in the row key
+   * <p>
+   * Notice that the order of each field matters.
+   * 2 keys with the same set of fields but different order are viewed as different.
+   * The fields of the key should ideally be ordered based on the importance of each field
+   * to be leveraged by features like secondary index.

Review comment:
       This comment looks great, thanks for the doc.

##########
File path: api/src/main/java/org/apache/iceberg/RowKey.java
##########
@@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.List;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.types.Types;
+
+/**
+ * Row key of a table.
+ * <p>
+ * Row key is a definition of table row uniqueness,
+ * similar to the concept of primary key in a relational database system.
+ * A row should be unique in a table based on the values of each {@link RowKeyField}.
+ * Iceberg itself does not enforce row uniqueness based on this key.
+ * It is leveraged by operations such as streaming upsert.
+ */
+public class RowKey implements Serializable {
+
+  private static final RowKey NOT_IDENTIFIED = new RowKey(null, 0, ImmutableList.of());
+
+  private final Schema schema;
+  private final int keyId;
+  private final RowKeyField[] fields;
+
+  private transient volatile List<RowKeyField> fieldList;
+
+  private RowKey(Schema schema, int keyId, List<RowKeyField> fields) {
+    this.schema = schema;
+    this.keyId = keyId;
+    this.fields = fields.toArray(new RowKeyField[0]);
+  }
+
+  /**
+   * Returns the {@link Schema} referenced by the row key
+   */
+  public Schema schema() {
+    return schema;
+  }
+
+  /**
+   * Returns the ID of the row key
+   */
+  public int keyId() {
+    return keyId;
+  }
+
+  /**
+   * Return the list of {@link RowKeyField} in the row key
+   * <p>
+   * Notice that the order of each field matters.
+   * 2 keys with the same set of fields but different order are viewed as different.
+   * The fields of the key should ideally be ordered based on the importance of each field
+   * to be leveraged by features like secondary index.
+   *
+   * @return the list of fields in the row key
+   */
+  public List<RowKeyField> fields() {
+    return lazyFieldList();
+  }
+
+  private List<RowKeyField> lazyFieldList() {
+    if (fieldList == null) {
+      synchronized (this) {
+        if (fieldList == null) {
+          this.fieldList = ImmutableList.copyOf(fields);
+        }
+      }
+    }
+
+    return fieldList;
+  }
+
+  /**
+   * Checks whether this row key is equivalent to another ignoring the key ID.
+   *
+   * @param another a different row key
+   * @return true if this row key is equivalent to the given one
+   */
+  public boolean sameRowKey(RowKey another) {
+    return Arrays.equals(fields, another.fields);
+  }
+
+  /**
+   * Returns the initial default row key that has no field
+   */
+  public static RowKey notIdentified() {
+    return NOT_IDENTIFIED;
+  }
+
+  /**
+   * Returns true if the row key is the default one with no field
+   */
+  public boolean isNotIdentified() {
+    return fields.length < 1;
+  }
+
+  @Override
+  public boolean equals(Object other) {
+    if (this == other) {
+      return true;
+    } else if (other == null || getClass() != other.getClass()) {
+      return false;
+    }
+
+    RowKey that = (RowKey) other;
+    return this.keyId == that.keyId && sameRowKey(that);
+  }
+
+  @Override
+  public int hashCode() {
+    return 31 * Integer.hashCode(keyId) + Arrays.hashCode(fields);
+  }
+
+  @Override
+  public String toString() {
+    StringBuilder sb = new StringBuilder();
+    sb.append("[");
+    for (RowKeyField field : fields) {
+      sb.append("\n");
+      sb.append("  ").append(field);
+    }
+    if (fields.length > 0) {
+      sb.append("\n");
+    }
+    sb.append("]");
+    return sb.toString();
+  }
+
+  /**
+   * Creates a new {@link Builder row key builder} for the given {@link Schema}.
+   *
+   * @param schema a schema
+   * @return a row key builder for the given schema.
+   */
+  public static Builder builderFor(Schema schema) {
+    return new Builder(schema);
+  }
+
+  /**
+   * A builder to create valid {@link RowKey row key}.
+   * <p>
+   * Call {@link #builderFor(Schema)} to create a new builder.
+   */
+  public static class Builder {
+    private final Schema schema;
+    private final List<RowKeyField> fields = Lists.newArrayList();
+    // Default key ID is 1 because 0 is reserved for default
+    private int keyId = 1;
+
+    private Builder(Schema schema) {
+      this.schema = schema;
+    }
+
+    public Builder withKeyId(int id) {
+      ValidationException.check(id >= 0, "Row key id must not be less than 0");
+      this.keyId = id;
+      return this;
+    }
+
+    public Builder addField(String name) {
+      Types.NestedField column = schema.findField(name);
+      ValidationException.check(column != null, "Cannot find column with name %s in schema", name);

Review comment:
       How about appending the schema string at the end of error message so that we could easily find out what's wrong when encountered the validation exception ?

##########
File path: hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveCatalog.java
##########
@@ -235,6 +236,62 @@ public void testCreateTableCustomSortOrder() {
     }
   }
 
+  @Test
+  public void testCreateTableDefaultRowKey() {
+    Schema schema = new Schema(
+        required(1, "id", Types.IntegerType.get(), "unique ID"),
+        required(2, "data", Types.StringType.get())
+    );
+    PartitionSpec spec = PartitionSpec.builderFor(schema)
+        .bucket("data", 16)
+        .build();

Review comment:
       Nit:   seems we don't have to specify the partitionSpec to test the default `RowKey` case.

##########
File path: api/src/main/java/org/apache/iceberg/Table.java
##########
@@ -88,6 +88,20 @@ default String name() {
    */
   Map<Integer, SortOrder> sortOrders();
 
+  /**
+   * Return the {@link RowKey row key} for this table.
+   *
+   * @return this table's row key.
+   */
+  RowKey rowKey();
+
+  /**
+   * Return a map of row key versions to {@link RowKey row key} for this table.
+   *
+   * @return this table's row key map.

Review comment:
       Nit:  this table's row key map -> this table's row keys map

##########
File path: hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveCatalog.java
##########
@@ -235,6 +236,62 @@ public void testCreateTableCustomSortOrder() {
     }
   }
 
+  @Test
+  public void testCreateTableDefaultRowKey() {
+    Schema schema = new Schema(
+        required(1, "id", Types.IntegerType.get(), "unique ID"),
+        required(2, "data", Types.StringType.get())
+    );
+    PartitionSpec spec = PartitionSpec.builderFor(schema)
+        .bucket("data", 16)
+        .build();
+    TableIdentifier tableIdent = TableIdentifier.of(DB_NAME, "tbl");
+
+    try {
+      Table table = catalog.createTable(tableIdent, schema, spec);
+      RowKey rowKey = table.rowKey();
+      Assert.assertEquals("Row key ID must match", 0, rowKey.keyId());
+      Assert.assertTrue("Row key must be default", rowKey.isNotIdentified());
+    } finally {
+      catalog.dropTable(tableIdent);
+    }
+  }
+
+  @Test
+  public void testCreateTableCustomRowKey() {
+    Schema schema = new Schema(
+        required(1, "id", Types.IntegerType.get(), "unique ID"),
+        required(2, "data", Types.StringType.get())
+    );
+    PartitionSpec spec = PartitionSpec.builderFor(schema)
+        .bucket("data", 16)
+        .build();
+    RowKey key = RowKey.builderFor(schema)
+        .withKeyId(1)
+        .addField("id")
+        .addField("data")
+        .build();
+    TableIdentifier tableIdent = TableIdentifier.of(DB_NAME, "tbl");
+
+    try {
+      Table table = catalog.buildTable(tableIdent, schema)
+          .withPartitionSpec(spec)
+          .withRowKey(key)
+          .create();
+
+      RowKey actualKey = table.rowKey();
+      Assert.assertEquals("Row key ID must match", 1, actualKey.keyId());
+      Assert.assertEquals("Row key must have 2 field", 2, actualKey.fields().size());
+      Assert.assertEquals("Row key must have the expected field",

Review comment:
       To test the logic that refreshing the field id of `RowKey` for keeping consistency with persisted schema when creating table,  I will suggest to use a different field id in the original [schema](https://github.com/apache/iceberg/pull/2354/files#diff-3fd28ad4ce0f08ba3a14e230c3b172af90b4bf8e28b8948d12c61e1807e22105R262-R265).  By default the newly created table's schema will be starting from 1, then we don't know whether those field ids in RowKey has been refreshed or not.

##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -253,13 +267,17 @@ public String toString() {
                 int lastAssignedPartitionId,
                 int defaultSortOrderId,
                 List<SortOrder> sortOrders,
+                int defaultRowKeyId,
+                List<RowKey> rowKeys,
                 Map<String, String> properties,
                 long currentSnapshotId,
                 List<Snapshot> snapshots,
                 List<HistoryEntry> snapshotLog,
                 List<MetadataLogEntry> previousFiles) {
     Preconditions.checkArgument(specs != null && !specs.isEmpty(), "Partition specs cannot be null or empty");
     Preconditions.checkArgument(sortOrders != null && !sortOrders.isEmpty(), "Sort orders cannot be null or empty");
+    Preconditions.checkArgument(rowKeys != null && !rowKeys.isEmpty(),
+        "Row keys cannot be null or empty");

Review comment:
       Nit:  this don't need to change to a new line.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2354: Core: add row identifier to format v2

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r598212132



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadataParser.java
##########
@@ -152,6 +154,7 @@ public static String toJson(TableMetadata metadata) {
     }
   }
 
+  @SuppressWarnings("checkstyle:CyclomaticComplexity")

Review comment:
       I will do another PR to fix all the code complexity complaints in this class after this is merged.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2354: Core: add row identifier to format v2

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r598211085



##########
File path: api/src/main/java/org/apache/iceberg/RowIdentifyField.java
##########
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Objects;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.types.Types;
+
+/**
+ * A field in {@link RowIdentifier}
+ * <p>
+ * The field must be:
+ *  1. a required column in the table schema
+ *  2. a primitive type column
+ */
+public class RowIdentifyField implements Serializable {

Review comment:
       the field only contains the source column id. Technically we can directly make `fields()` a `Integer[]`, but because this will be a public API, it will be hard to change in case we want to amend it in the future, so I decide to still have this separated class for the individual field information. This also aligns better with the structure of partition spec and sort order.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r605376758



##########
File path: api/src/main/java/org/apache/iceberg/RowKey.java
##########
@@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.List;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.types.Types;
+
+/**
+ * Row key of a table.
+ * <p>
+ * Row key is a definition of table row uniqueness,
+ * similar to the concept of primary key in a relational database system.
+ * A row should be unique in a table based on the values of each {@link RowKeyField}.
+ * Iceberg itself does not enforce row uniqueness based on this key.
+ * It is leveraged by operations such as streaming upsert.
+ */
+public class RowKey implements Serializable {
+
+  private static final RowKey NOT_IDENTIFIED = new RowKey(null, 0, ImmutableList.of());

Review comment:
       cool, will do that




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2354: Core: add row identifier to format v2

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r598210975



##########
File path: api/src/main/java/org/apache/iceberg/RowIdentifier.java
##########
@@ -0,0 +1,199 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.List;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.types.Types;
+
+/**
+ * Row identifier of a table.
+ * <p>
+ * Row identifier is a definition of table row uniqueness,
+ * similar to the concept of primary key in a relational database system.
+ * A row should be unique in a table based on the values of each {@link RowIdentifyField}.
+ * Iceberg itself does not enforce row uniqueness based on this identifier.
+ * It is leveraged by operations such as streaming upsert.
+ */
+public class RowIdentifier implements Serializable {
+
+  private static final RowIdentifier NOT_IDENTIFIED = new RowIdentifier(null, 0, ImmutableList.of());
+
+  private final Schema schema;
+  private final int rowIdVersion;

Review comment:
       usually this is called xxxId, but that will make this filed like `rowIdId` or `rowIdentifierId` which sounds awkward, so I used version instead of id
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#issuecomment-813535832


   @rdblue I think all other people are fine with the PR, please let me know if there are any other comments.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2354: Core: add row identifier to format v2

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r600946368



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -111,6 +117,11 @@ static TableMetadata newTableMetadata(Schema schema,
     int freshSortOrderId = sortOrder.isUnsorted() ? sortOrder.orderId() : INITIAL_SORT_ORDER_ID;
     SortOrder freshSortOrder = freshSortOrder(freshSortOrderId, freshSchema, sortOrder);
 
+    // rebuild the row identifier using the new column ids
+    int freshRowIdentifierVersion = rowIdentifier.isNotIdentified() ?
+        rowIdentifier.rowIdVersion() : INITIAL_ROW_IDENTIFIER_VERSION;

Review comment:
       Yes, because not identified is default that has ID 0. Otherwise the row key should have ID 1 in this method, and then get merged with current metadata in create/replace table transaction.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r605353346



##########
File path: api/src/main/java/org/apache/iceberg/RowKey.java
##########
@@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.List;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.types.Types;
+
+/**
+ * Row key of a table.
+ * <p>
+ * Row key is a definition of table row uniqueness,
+ * similar to the concept of primary key in a relational database system.
+ * A row should be unique in a table based on the values of each {@link RowKeyField}.
+ * Iceberg itself does not enforce row uniqueness based on this key.
+ * It is leveraged by operations such as streaming upsert.
+ */
+public class RowKey implements Serializable {

Review comment:
       The `key` does not always means `uniqueness` in my mind.  from the MySQL [document](https://dev.mysql.com/doc/refman/8.0/en/create-index.html),  `index` could be created on top of key columns and the `index` could choose to be unique or non-unique.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#issuecomment-809758474


   @rdblue sounds good to me, I will make the update, thank you!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2354: Core: add row identifier to format v2

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r600134551



##########
File path: api/src/main/java/org/apache/iceberg/RowIdentifyField.java
##########
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Objects;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.types.Types;
+
+/**
+ * A field in {@link RowIdentifier}
+ * <p>
+ * The field must be:
+ *  1. a required column in the table schema
+ *  2. a primitive type column
+ */
+public class RowIdentifyField implements Serializable {
+
+  private final int sourceId;
+
+  RowIdentifyField(int sourceId) {
+    this.sourceId = sourceId;
+  }
+
+  RowIdentifyField(Types.NestedField column) {
+    ValidationException.check(column.isRequired(),
+        "Cannot add column %s to row identifier because it is not a required column", column);
+    ValidationException.check(column.type().isPrimitiveType(),
+        "Cannot add column %s to row identifier because it is not of a primitive data type", column);

Review comment:
       Nit:  should we remove the `of` in the error message ? 

##########
File path: api/src/main/java/org/apache/iceberg/RowIdentifyField.java
##########
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Objects;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.types.Types;
+
+/**
+ * A field in {@link RowIdentifier}
+ * <p>
+ * The field must be:
+ *  1. a required column in the table schema
+ *  2. a primitive type column
+ */
+public class RowIdentifyField implements Serializable {
+
+  private final int sourceId;
+
+  RowIdentifyField(int sourceId) {
+    this.sourceId = sourceId;
+  }
+
+  RowIdentifyField(Types.NestedField column) {
+    ValidationException.check(column.isRequired(),
+        "Cannot add column %s to row identifier because it is not a required column", column);

Review comment:
       Nit:  In this `RowIdentifyField` constructor,  we will throw the exception with message saying `Cannot add column ...` , that seems unreasonable because we are mixed the column adding and instance constructing together.   How about moving those validation check out of this constructor ?

##########
File path: api/src/main/java/org/apache/iceberg/RowIdentifier.java
##########
@@ -0,0 +1,199 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.List;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.types.Types;
+
+/**
+ * Row identifier of a table.
+ * <p>
+ * Row identifier is a definition of table row uniqueness,
+ * similar to the concept of primary key in a relational database system.
+ * A row should be unique in a table based on the values of each {@link RowIdentifyField}.
+ * Iceberg itself does not enforce row uniqueness based on this identifier.
+ * It is leveraged by operations such as streaming upsert.
+ */
+public class RowIdentifier implements Serializable {

Review comment:
       Yeah,  the `Row identifier` is the correct semantic to express. On minor thing for me is : the name seems a bit long as a part of table spec,  how about using `RowKey` as the name ?  

##########
File path: api/src/main/java/org/apache/iceberg/RowIdentifier.java
##########
@@ -0,0 +1,199 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.List;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.types.Types;
+
+/**
+ * Row identifier of a table.
+ * <p>
+ * Row identifier is a definition of table row uniqueness,
+ * similar to the concept of primary key in a relational database system.
+ * A row should be unique in a table based on the values of each {@link RowIdentifyField}.
+ * Iceberg itself does not enforce row uniqueness based on this identifier.
+ * It is leveraged by operations such as streaming upsert.
+ */
+public class RowIdentifier implements Serializable {
+
+  private static final RowIdentifier NOT_IDENTIFIED = new RowIdentifier(null, 0, ImmutableList.of());
+
+  private final Schema schema;
+  private final int rowIdVersion;

Review comment:
       If we plan to change the name as `RowKey`, then we could use the `rowKeyId` here. 

##########
File path: api/src/main/java/org/apache/iceberg/RowIdentifyField.java
##########
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Objects;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.types.Types;
+
+/**
+ * A field in {@link RowIdentifier}
+ * <p>
+ * The field must be:
+ *  1. a required column in the table schema
+ *  2. a primitive type column
+ */
+public class RowIdentifyField implements Serializable {

Review comment:
       I agreed that it's extensible to introduce a separate `RowIdentifyField` class here. For the `transform` function, currently I don't see there's any requirement that we will need it for this row identifier fields, but I'm OK to make it extensible for future usage.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r602005069



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -877,6 +985,26 @@ private static SortOrder freshSortOrder(int orderId, Schema schema, SortOrder so
     return builder.build();
   }
 
+  private static RowKey freshRowKey(int keyId, Schema schema, RowKey rowKey) {
+    RowKey.Builder builder = RowKey.builderFor(schema).withKeyId(keyId);
+
+    for (RowKeyField field : rowKey.fields()) {
+      // look up the name of the source field in the old schema to get the new schema's id
+      String columnName = rowKey.schema().findColumnName(field.sourceId());
+      Preconditions.checkNotNull(columnName,
+          "Cannot find column in the row key's schema. id: %s, schema: %s",
+          field.sourceId(), rowKey.schema());
+
+      // reassign all row keys with fresh column IDs.
+      Types.NestedField column = schema.findField(columnName);
+      Preconditions.checkNotNull(column,
+          "Cannot find column in the fresh schema. name: %s, schema: %s", columnName, schema);

Review comment:
       We don't have to do the nullable check here, because the `builder.addField(columnName)` will do the check inside its implementation.  (I'm sorry I did not describe this clearly in the last comment).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#issuecomment-809749423


   @jackye1995, I don't think that the row identifier fields are strongly related to a clustering index. Records in an Iceberg table are physically sorted according to some order used at write time, and also distributed by a hash or ordered distribution. That logic is independent. And a future index type that has some order for its contents would probably be separately configured and would also not use the row identifier fields.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#issuecomment-811486171


   > My question is: after reverting the table to t3, should people still see the incorrect row identifier (account_id, profile_id) by default or people should see the correct row identifier (account_id) by default ?
   
   I think I see the miscommunication. I don't think there is a way to roll back to t3. There is a snapshot created at t2, t3, and t5. Those snapshots are accessible via time travel and rollback. The rest of the table metadata is independent so rolling back doesn't change it. To revert both the bad write and the configuration change, the user should roll back and then set the row identifier fields to just `account_id`.
   
   Keeping table metadata and data separate (and only versioning data) is the right behavior, I think. Data is constantly evolving and we don't want to accidentally revert metadata changes -- like updating table properties -- when the data snapshot is rolled back.
   
   Consider a slightly different scenario where the rollback to t3 was needed because the source was producing bad data. Why should the `profile_id` be removed from the row identifier in that case? If Iceberg did that implicitly, then after the corrected data is turned back on, Iceberg would start deleting rows incorrectly using the wrong key.
   
   I think the right approach is to keep data a separate dimension. Since we want Iceberg to be a coordination layer between multiple services that don't know about one another,  I think it would be bad for actions that fix data to also make possibly unknown changes to metadata.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#issuecomment-808606039


   @jackye1995 and @openinx, I have a few questions about this before I'm comfortable merging it. Thanks for working on this so far!
   
   Why do we need to track multiple versions of the row identifier like we do for schema, partition spec, and sort order? I think of this as the "fields that identify a row". Is it helpful to have more than one view of how rows are identified?
   
   To answer that, we need to consider whether two versions are ever valid at the same time, and how row IDs are going to evolve over time:
   * Row identifier columns may be set, either to initialize or to fix a mistake (e.g., used account_id instead of profile_id)
   * Row identifier columns may be added, when a new identifying column is added to the schema (e.g., adding profile_id to a table previously identified by only account_id)
   
   I think both of those operations only require setting the current way of identifying rows, not keeping track of the previous ways. I'm interested to hear what everyone thinks about that and whether there is agreement.
   
   If I'm correct, then I would probably not keep track of multiple versions here. If I'm not, then I think we should ask whether the row ID columns should be tracked in the schema itself rather than separately versioned, since they will probably change at the same time the schema does -- when adding a new column that is now part of the identifier.
   
   It would be great to hear from @aokolnychyi on this as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
openinx commented on pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#issuecomment-809033279


   I think we should keep trace of multiple version in apache iceberg schema, let's discuss the case you described: adding profile_id to table previously identified by only account_id. 
   
   t1:   User defines the `account_id` as the row identifier; 
   t2:   Write few records into the table; 
   t3:   Write few equality deletions (by `account_id`) into table; 
   t4:   Adding `profile_id` to row identifier, now the identifier is `account_id` & `profile_id`; 
   t5:   Write few equality deletions ( by `account_id` & `profile_id`) into table;
   
   In my option,  the iceberg table format's row identifier specification is introduced because we expect the standard SQL's `PRIMARY KEY` could be mapped to those row identifier columns automatically 
   
   (
   if we don't have the row identifier spec then we don't know how to track those keys when create table like: 
   
   ```sql
   CREATE TABLE sample(id INT, data STRING,  PRIMARY KEY (id) NOT ENFORCED);
   ```
   )
   
   Back to the above case,  at the timestamp `t4` & `t5`,  the table's row identifier is `account_id` & `profile_id`.  If people want to read the snapshot at timestamp `t3`, then we should use the row identifier `account_id`.  So if we don't track the multiple version of identifier,  How could we read the row identifier from old snapshots ?  If use the latest `account_id` & `profile_id`,  that seems confuse people a lot because those rows are deleted only by field `account_id` actually.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2354: Core: add row identifier to format v2

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#issuecomment-804391687


   Thanks for working on this, @jackye1995! I'll take a look.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#issuecomment-811714767


   @openinx @rdblue should be ready for review


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r605483515



##########
File path: api/src/main/java/org/apache/iceberg/RowKey.java
##########
@@ -0,0 +1,178 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.Set;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.apache.iceberg.types.Types;
+
+/**
+ * Row key of a table.
+ * <p>
+ * Row key is a definition of table row uniqueness,
+ * similar to the concept of primary key in a relational database system.
+ * A row should be unique in a table based on the values of each {@link RowKeyField}.
+ * Iceberg itself does not enforce row uniqueness based on this key.
+ * It is leveraged by operations such as streaming upsert.
+ */
+public class RowKey implements Serializable {
+
+  private static final RowKey NOT_IDENTIFIED = new RowKey(null, Sets.newHashSet());
+
+  private final Schema schema;
+  private final RowKeyField[] fields;
+
+  private transient volatile Set<RowKeyField> fieldSet;
+
+  private RowKey(Schema schema, Set<RowKeyField> fields) {
+    this.schema = schema;
+    this.fields = fields.toArray(new RowKeyField[0]);
+  }
+
+  /**
+   * Returns the {@link Schema} referenced by the row key
+   */
+  public Schema schema() {
+    return schema;
+  }
+
+  /**
+   * Return the set of {@link RowKeyField} in the row key
+   * <p>
+   * @return the set of fields in the row key
+   */
+  public Set<RowKeyField> fields() {
+    return lazyFieldSet();
+  }
+
+  private Set<RowKeyField> lazyFieldSet() {
+    if (fieldSet == null) {
+      synchronized (this) {
+        if (fieldSet == null) {
+          fieldSet = Sets.newHashSet(Arrays.asList(fields));

Review comment:
       Nit:   Here we don't need to create a list then put all elements into a newly created `HashSet`. Since the `fieldSet` is a `transient` variable (which don't have to be serialized), so we could just use `fieldSet=ImmutableSet.copyOf(fields)`.

##########
File path: api/src/main/java/org/apache/iceberg/RowKey.java
##########
@@ -0,0 +1,178 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.Set;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.apache.iceberg.types.Types;
+
+/**
+ * Row key of a table.
+ * <p>
+ * Row key is a definition of table row uniqueness,
+ * similar to the concept of primary key in a relational database system.
+ * A row should be unique in a table based on the values of each {@link RowKeyField}.
+ * Iceberg itself does not enforce row uniqueness based on this key.
+ * It is leveraged by operations such as streaming upsert.
+ */
+public class RowKey implements Serializable {
+
+  private static final RowKey NOT_IDENTIFIED = new RowKey(null, Sets.newHashSet());

Review comment:
       The `schema` should be `null` or `new Schema()` ?  Seems `new Schema()` is more friendly because people could get rid of nullable checking case by case.

##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -65,27 +65,31 @@ public static TableMetadata newTableMetadata(TableOperations ops,
                                                PartitionSpec spec,
                                                String location,
                                                Map<String, String> properties) {
-    return newTableMetadata(schema, spec, SortOrder.unsorted(), location, properties, DEFAULT_TABLE_FORMAT_VERSION);
+    return newTableMetadata(schema, spec, SortOrder.unsorted(), RowKey.notIdentified(), location,
+        properties, DEFAULT_TABLE_FORMAT_VERSION);
   }
 
   public static TableMetadata newTableMetadata(Schema schema,
                                                PartitionSpec spec,
                                                SortOrder sortOrder,
+                                               RowKey rowKey,

Review comment:
       Q:  Do we need to maintain the compatibility for this `public` method: 
   
   ```java
     public static TableMetadata newTableMetadata(Schema schema,
                                                  PartitionSpec spec,
                                                  SortOrder sortOrder,
                                                  String location,
                                                  Map<String, String> properties) {
        ....
     }
   ```

##########
File path: api/src/main/java/org/apache/iceberg/catalog/Catalog.java
##########
@@ -22,6 +22,7 @@
 import java.util.List;
 import java.util.Map;
 import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.RowKey;

Review comment:
       Do we need to introduce a createTable method that could accept `RowKey` in this interface ?  I rise this question because when I  support [flink SQL primary key](https://github.com/openinx/incubator-iceberg/pull/4/files#diff-45ca81857734430272d8132ed4008d1ce97560168a6d2c2d660295cbf265f4f9R57-R77) , I find it's necessary method.  Of course,  we could publish separate  PR for this if possible.

##########
File path: api/src/main/java/org/apache/iceberg/catalog/Catalog.java
##########
@@ -22,6 +22,7 @@
 import java.util.List;
 import java.util.Map;
 import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.RowKey;

Review comment:
       Opening a separate issue for this is good enough for me now.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r610211685



##########
File path: api/src/main/java/org/apache/iceberg/RowKeyIdentifierField.java
##########
@@ -0,0 +1,65 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Objects;
+
+/**
+ * An identifier field in {@link RowKey}
+ * <p>
+ * The field must be:
+ *  1. a required column in the table schema
+ *  2. a primitive type column
+ */
+public class RowKeyIdentifierField implements Serializable {

Review comment:
       This seems to be a class that wraps a single ID. Could we get rid of it? Instead, `IdentifierFields` could expose `idSet` or `ids`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
openinx commented on pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#issuecomment-809869373


   > basically the row key information at that point of time is already in the delete file written at that time.  
   
   @jackye1995 @rdblue  Yes,  I can understand that we've maintained the equality field ids inside each equality delete files and iterating records from an old snapshot should be correct because we've considered [applying process](https://github.com/apache/iceberg/blob/90225d6c9413016d611e2ce5eff37db1bc1b4fc5/data/src/main/java/org/apache/iceberg/data/DeleteFilter.java#L126) for different equality fields ids. 
   
   But in this [comment](https://github.com/apache/iceberg/pull/2354#issuecomment-809033279), I'm not discussing the data correctness (the data correctness has no problem).  I mean people should be able to read the old `RowKey` from an older snapshot. In this issue https://github.com/apache/iceberg/issues/1029,  we are proposing to attach the `schema` in snapshot so that users could create a complete table based on an existing snapshot by using CTAS or RTAS, in this scenarios I think we also need to use the old `PartitionSpec`, `SortOrder`, `RowKey` in the newly created table.  Then at the iceberg table level,  people could see the consistent table metadata.
   
   Another case is rollback to an older snapshot by replacing the latest snapshot,  though we currently do not support replacing schema/partition-spec/sort-order with the old one,  but in my mind I think we'd better to because providing an uniformed view of data files, schema and other table metadata at the old timestamp `t3` matches the expectations from end users. 
   
   That's why I recommended to track the multiple versions of `RowKey` in iceberg table metadata. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#issuecomment-809745464


   @rdblue for the ordering, I was thinking about potential features like clustering index, where the order of primary key matters for performance of the index. But I am okay to start simple by removing versioning and making the fields unordered for now. 
   
   @openinx are you okay with those changes?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r605354220



##########
File path: api/src/main/java/org/apache/iceberg/RowKey.java
##########
@@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.List;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.types.Types;
+
+/**
+ * Row key of a table.
+ * <p>
+ * Row key is a definition of table row uniqueness,
+ * similar to the concept of primary key in a relational database system.
+ * A row should be unique in a table based on the values of each {@link RowKeyField}.
+ * Iceberg itself does not enforce row uniqueness based on this key.
+ * It is leveraged by operations such as streaming upsert.
+ */
+public class RowKey implements Serializable {
+
+  private static final RowKey NOT_IDENTIFIED = new RowKey(null, 0, ImmutableList.of());

Review comment:
       That's a great point,  because we've suffered from the kryo serialization issues in here https://github.com/apache/iceberg/pull/2343/files.  Maybe we could provide an unit test to cover the `RowKey` kryo serialization cases ( Similar to [TestDataFileSerialization](https://github.com/apache/iceberg/pull/2343/files#diff-c89bc48a08a007c0b2ef2b037b971ef0bcdde6ccc7d54850b963b566c006f6f1R51)).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2354: Core: add row identifier to format v2

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r598211147



##########
File path: core/src/main/java/org/apache/iceberg/RowIdentifierParser.java
##########
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import com.fasterxml.jackson.core.JsonGenerator;
+import com.fasterxml.jackson.databind.JsonNode;
+import java.io.IOException;
+import java.io.StringWriter;
+import java.io.UncheckedIOException;
+import java.util.Iterator;
+import org.apache.iceberg.exceptions.RuntimeIOException;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.util.JsonUtil;
+
+public class RowIdentifierParser {
+  private static final String ROW_IDENTIFIER_VERSION = "row-id-version";

Review comment:
       `row-id-xxx` is used instead of the full `row-identifier-xxx` in the serialized metadata to make the object key shorter.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r605993041



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -482,6 +506,7 @@ public TableMetadata updateSchema(Schema newSchema, int newLastColumnId) {
     // rebuild all of the partition specs and sort orders for the new current schema
     List<PartitionSpec> updatedSpecs = Lists.transform(specs, spec -> updateSpecSchema(newSchema, spec));
     List<SortOrder> updatedSortOrders = Lists.transform(sortOrders, order -> updateSortOrderSchema(newSchema, order));
+    RowKey updatedRowKey = updateRowKeySchema(newSchema, rowKey);

Review comment:
       That's a good point, I was thinking about having another PR for blocking column drop for column in row key after this one, do you think it's better to have it here?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r602002513



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -573,8 +610,43 @@ public TableMetadata replaceSortOrder(SortOrder newOrder) {
 
     return new TableMetadata(null, formatVersion, uuid, location,
         lastSequenceNumber, System.currentTimeMillis(), lastColumnId, currentSchemaId, schemas, defaultSpecId, specs,
-        lastAssignedPartitionId, newOrderId, builder.build(), properties, currentSnapshotId, snapshots, snapshotLog,
-        addPreviousFile(file, lastUpdatedMillis));
+        lastAssignedPartitionId, newOrderId, builder.build(), defaultRowKeyId, rowKeys,
+        properties, currentSnapshotId, snapshots, snapshotLog, addPreviousFile(file, lastUpdatedMillis));
+  }
+
+  public TableMetadata updateRowKey(RowKey newKey) {

Review comment:
       We will need to introduce a public iceberg table API to update the RowKey specification, right ?  Maybe we could file a separate issue to address this, we can publish the interface in that issue. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r605319429



##########
File path: api/src/main/java/org/apache/iceberg/RowKey.java
##########
@@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.List;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.types.Types;
+
+/**
+ * Row key of a table.
+ * <p>
+ * Row key is a definition of table row uniqueness,
+ * similar to the concept of primary key in a relational database system.
+ * A row should be unique in a table based on the values of each {@link RowKeyField}.
+ * Iceberg itself does not enforce row uniqueness based on this key.
+ * It is leveraged by operations such as streaming upsert.
+ */
+public class RowKey implements Serializable {
+
+  private static final RowKey NOT_IDENTIFIED = new RowKey(null, 0, ImmutableList.of());

Review comment:
       Guava collections are not Kryo friendly. We have to be careful.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#issuecomment-816337252


   @jackye1995, looking at this, I think that we can simplify it quite a bit since we are no longer tracking multiple versions of the identifier fields.
   
   I started by looking at `RowKeyIdentifierField`, which is basically an `Integer`, then at `RowKey`, which is basically a `Collection<Integer>` or `Set<Integer>`. I think that we can remove those two classes fairly easily. Next, this requires quite a few API changes to expose the row key along side a schema, when I think most SQL tables consider those part of the same thing. If we also consider the identifier fields to be part of the schema, then we no longer need so many changes.
   
   For example, we can reuse the `UpdateSchema` operation to set the schema's identifier fields, `setIdentifierFields(String...)`. We can also expose `Schema.identifierFieldIds()` to access them. Not requiring a new operation on `Table` or modifying `TableMetadata` is a plus. But it also makes sense to do this: then we could have a single operation that adds `profile_id` and also adds it to the table's identifier fields at the same time.
   
   Combining the identifier fields with schema also causes them to be versioned with schema -- we'd need to update `Schema` so that the identifier fields are part of equality -- but then we would be able to detect cases like @openinx raised. While we wouldn't roll back the schema/identifier update at the same time as the snapshot without a transaction, this would allow us to easily detect that the bad snapshot used the updated identifier, or detect which snapshots in a table used an updated identifier.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r605980693



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -482,6 +506,7 @@ public TableMetadata updateSchema(Schema newSchema, int newLastColumnId) {
     // rebuild all of the partition specs and sort orders for the new current schema
     List<PartitionSpec> updatedSpecs = Lists.transform(specs, spec -> updateSpecSchema(newSchema, spec));
     List<SortOrder> updatedSortOrders = Lists.transform(sortOrders, order -> updateSortOrderSchema(newSchema, order));
+    RowKey updatedRowKey = updateRowKeySchema(newSchema, rowKey);

Review comment:
       should we fail if schema update drops column defined in the row key? probably want to add test for that too

##########
File path: core/src/test/java/org/apache/iceberg/TestTableMetadata.java
##########
@@ -646,6 +659,55 @@ public void testUpdateSortOrder() {
         NullOrder.NULLS_FIRST, sortedByX.sortOrder().fields().get(0).nullOrder());
   }
 
+  @Test
+  public void testRowKey() {
+    Schema schema = new Schema(
+        Types.NestedField.required(10, "x", Types.StringType.get())
+    );
+
+    TableMetadata meta = TableMetadata.newTableMetadata(
+        schema, PartitionSpec.unpartitioned(), null, ImmutableMap.of());
+    RowKey rowKey = meta.rowKey();
+    Assert.assertTrue("Row key must be default", rowKey.isNotIdentified());
+  }
+
+  @Test
+  public void testUpdateRowKey() {
+    Schema schema = new Schema(
+        Types.NestedField.required(10, "x", Types.StringType.get()),
+        Types.NestedField.required(11, "y", Types.StringType.get())
+    );
+
+    RowKey id = RowKey.builderFor(schema).addField("x").build();
+
+    TableMetadata identifiedByX = TableMetadata.newTableMetadata(
+        schema, PartitionSpec.unpartitioned(), SortOrder.unsorted(), id, null, ImmutableMap.of());
+    RowKey actualKey = identifiedByX.rowKey();
+    Assert.assertEquals("Row key must have 1 field", 1, actualKey.identifierFields().size());
+    Assert.assertEquals("Row key must have the expected field",
+        org.apache.commons.compress.utils.Sets.newHashSet(1),

Review comment:
       nit: wrong import? Also may worth confirming "x" in schema is indeed assigned as id 1




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#issuecomment-809539515


   @openinx for your example, my understanding is that at t3, the equality delete file would have a row like `account_id=xxx`, and at t5, the new equality file would have a row like `account_id=yyy, profile_id=zzz`, basically the row key information at that point of time is already in the delete file written at that time. When time traveling to t4, the delete file written at that time can still work without the need to consult the latest row key.
   
   @rdblue I actually mostly agree with what you mention, as I don't see why the example mentioned by openinx would not work, but maybe I missed something there. But I decided to go with the versioned approach because I think it can potentially be used to provide some uniqueness guarantee at read time in the future by merging rows, given the fact that now we basically have a primary key concept through RowKey and a sort key concept through SortOrder. And at that time, we will need this information to be present in the specific snapshot that we time travel to.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#issuecomment-811641120


   >  I have seen scenarios when users want to rollback the table state completely rather the current snapshot. I think that should be done by replacing the current pointer in the catalog to an old JSON file rather than by calling the table rollback API.
   
   +1 on this
   
   > @jackye1995 , I think we could update this PR now, thanks for the great work
   
   sounds good, will do it now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r606003885



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -482,6 +506,7 @@ public TableMetadata updateSchema(Schema newSchema, int newLastColumnId) {
     // rebuild all of the partition specs and sort orders for the new current schema
     List<PartitionSpec> updatedSpecs = Lists.transform(specs, spec -> updateSpecSchema(newSchema, spec));
     List<SortOrder> updatedSortOrders = Lists.transform(sortOrders, order -> updateSortOrderSchema(newSchema, order));
+    RowKey updatedRowKey = updateRowKeySchema(newSchema, rowKey);

Review comment:
       I was thinking about the following: 1 PR for the update related changes including update row key API, implementation, and this fix in schema update; 1PR for work in Spark for SQL extension; 1PR in Flink for primary key clause (if no one did it yet)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#issuecomment-812077531


   I think this is now unrelated to this PR, but since discussion is happening here I want to mention it:
   
   > I have seen scenarios when users want to rollback the table state completely rather the current snapshot. I think that should be done by replacing the current pointer in the catalog to an old JSON file rather than by calling the table rollback API.
   
   I don't agree with the direction of adding an API to roll back the JSON file itself. That approach discards relevant history, like the fact that after `t5`, the table had a bad snapshot. That history is relevant and valuable. Here are a few examples:
   
   1. Users can see how long the bad snapshot was the current
   2. Users can see what the snapshot ID was and find out which jobs read the bad snapshot
   3. Users can see what other changes were rolled back at the same time (e.g. a compaction was also rolled back)
   
   If we want to support this use case, then I think we need to make an API that will roll back a table to some point in time. That would roll back the snapshot (preserving the snapshot log) and revert metadata changes. We could do this by having a rollback API that actually uses a transaction to make multiple different changes. I've been thinking about updating the `SnapshotManager` to use a transaction as well, which would allow cherry-picking multiple commits as a single operation.
   
   There's a big more discussion that should happen here, but I'm open to the transaction approach. I just don't think rolling back the metadata file instead of moving forward and keeping history is a good idea.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2354: Core: add row identifier to format v2

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r600852092



##########
File path: api/src/main/java/org/apache/iceberg/RowIdentifier.java
##########
@@ -0,0 +1,199 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.List;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.types.Types;
+
+/**
+ * Row identifier of a table.
+ * <p>
+ * Row identifier is a definition of table row uniqueness,
+ * similar to the concept of primary key in a relational database system.
+ * A row should be unique in a table based on the values of each {@link RowIdentifyField}.
+ * Iceberg itself does not enforce row uniqueness based on this identifier.
+ * It is leveraged by operations such as streaming upsert.
+ */
+public class RowIdentifier implements Serializable {

Review comment:
       `RowKey` sounds good to me, I will update based on that!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2354: Core: add row identifier to format v2

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r600949014



##########
File path: core/src/test/java/org/apache/iceberg/TestRowIdentifierParser.java
##########
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import org.junit.Assert;
+import org.junit.Test;
+import org.junit.runner.RunWith;
+import org.junit.runners.Parameterized;
+
+@RunWith(Parameterized.class)
+public class TestRowIdentifierParser extends TableTestBase {
+
+  @Parameterized.Parameters(name = "format = {0}")
+  public static Object[][] parameters() {
+    return new Object[][] {
+        {1},
+        {2}
+    };
+  }
+
+  public TestRowIdentifierParser(int formatVersion) {
+    super(formatVersion);
+  }
+
+  @Test
+  public void testToJson() {
+    String expected = "{\"row-id-version\":0,\"fields\":[]}";
+    Assert.assertEquals(expected, RowIdentifierParser.toJson(table.rowIdentifier(), false));
+
+    RowIdentifier id = RowIdentifier.builderFor(table.schema())
+        .addField("id")
+        .addField("data")
+        .build();
+
+    table.ops().commit(table.ops().current(), table.ops().current().updateRowIdentifier(id));
+
+    expected = "{\n" +
+        "  \"row-id-version\" : 1,\n" +
+        "  \"fields\" : [ {\n" +
+        "    \"source-id\" : 1\n" +
+        "  }, {\n" +
+        "    \"source-id\" : 2\n" +
+        "  } ]\n" +
+        "}";
+    Assert.assertEquals(expected, RowIdentifierParser.toJson(id, true));
+  }
+
+  @Test
+  public void testFromJson() {
+    String expected = "{\n" +
+        "  \"row-id-version\" : 1,\n" +
+        "  \"fields\" : [ {\n" +
+        "    \"source-id\" : 1\n" +
+        "  } ]\n" +
+        "}";
+
+    RowIdentifier expectedId = RowIdentifier.builderFor(table.schema())
+        .addField("id")
+        .build();
+
+    RowIdentifier id = RowIdentifierParser.fromJson(table.schema(), expected);
+    Assert.assertEquals(expectedId, id);

Review comment:
       added




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#issuecomment-811532551


   Let me catch up with the discussion today.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
openinx commented on pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#issuecomment-817560130


   Thanks @rdblue & @jackye1995's comment, I'm okay about the simplest way to implement the row identifier specification. Let's push this work forward. Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r601879489



##########
File path: api/src/main/java/org/apache/iceberg/RowKey.java
##########
@@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.List;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.types.Types;
+
+/**
+ * Row key of a table.
+ * <p>
+ * Row key is a definition of table row uniqueness,
+ * similar to the concept of primary key in a relational database system.
+ * A row should be unique in a table based on the values of each {@link RowKeyField}.
+ * Iceberg itself does not enforce row uniqueness based on this key.
+ * It is leveraged by operations such as streaming upsert.
+ */
+public class RowKey implements Serializable {
+
+  private static final RowKey NOT_IDENTIFIED = new RowKey(null, 0, ImmutableList.of());
+
+  private final Schema schema;
+  private final int keyId;
+  private final RowKeyField[] fields;
+
+  private transient volatile List<RowKeyField> fieldList;
+
+  private RowKey(Schema schema, int keyId, List<RowKeyField> fields) {
+    this.schema = schema;
+    this.keyId = keyId;
+    this.fields = fields.toArray(new RowKeyField[0]);
+  }
+
+  /**
+   * Returns the {@link Schema} referenced by the row key
+   */
+  public Schema schema() {
+    return schema;
+  }
+
+  /**
+   * Returns the ID of the row key
+   */
+  public int keyId() {
+    return keyId;
+  }
+
+  /**
+   * Return the list of {@link RowKeyField} in the row key
+   * <p>
+   * Notice that the order of each field matters.
+   * 2 keys with the same set of fields but different order are viewed as different.
+   * The fields of the key should ideally be ordered based on the importance of each field
+   * to be leveraged by features like secondary index.
+   *
+   * @return the list of fields in the row key
+   */
+  public List<RowKeyField> fields() {
+    return lazyFieldList();
+  }
+
+  private List<RowKeyField> lazyFieldList() {
+    if (fieldList == null) {
+      synchronized (this) {
+        if (fieldList == null) {
+          this.fieldList = ImmutableList.copyOf(fields);
+        }
+      }
+    }
+
+    return fieldList;
+  }
+
+  /**
+   * Checks whether this row key is equivalent to another ignoring the key ID.
+   *
+   * @param another a different row key
+   * @return true if this row key is equivalent to the given one
+   */
+  public boolean sameRowKey(RowKey another) {
+    return Arrays.equals(fields, another.fields);
+  }
+
+  /**
+   * Returns the initial default row key that has no field
+   */
+  public static RowKey notIdentified() {
+    return NOT_IDENTIFIED;
+  }
+
+  /**
+   * Returns true if the row key is the default one with no field
+   */
+  public boolean isNotIdentified() {
+    return fields.length < 1;
+  }
+
+  @Override
+  public boolean equals(Object other) {
+    if (this == other) {
+      return true;
+    } else if (other == null || getClass() != other.getClass()) {
+      return false;
+    }
+
+    RowKey that = (RowKey) other;
+    return this.keyId == that.keyId && sameRowKey(that);
+  }
+
+  @Override
+  public int hashCode() {
+    return 31 * Integer.hashCode(keyId) + Arrays.hashCode(fields);
+  }
+
+  @Override
+  public String toString() {
+    StringBuilder sb = new StringBuilder();
+    sb.append("[");
+    for (RowKeyField field : fields) {
+      sb.append("\n");
+      sb.append("  ").append(field);
+    }
+    if (fields.length > 0) {
+      sb.append("\n");
+    }
+    sb.append("]");
+    return sb.toString();
+  }
+
+  /**
+   * Creates a new {@link Builder row key builder} for the given {@link Schema}.
+   *
+   * @param schema a schema
+   * @return a row key builder for the given schema.
+   */
+  public static Builder builderFor(Schema schema) {
+    return new Builder(schema);
+  }
+
+  /**
+   * A builder to create valid {@link RowKey row key}.
+   * <p>
+   * Call {@link #builderFor(Schema)} to create a new builder.
+   */
+  public static class Builder {
+    private final Schema schema;
+    private final List<RowKeyField> fields = Lists.newArrayList();
+    // Default key ID is 1 because 0 is reserved for default
+    private int keyId = 1;
+
+    private Builder(Schema schema) {
+      this.schema = schema;
+    }
+
+    public Builder withKeyId(int id) {
+      ValidationException.check(id >= 0, "Row key id must not be less than 0");
+      this.keyId = id;
+      return this;
+    }
+
+    public Builder addField(String name) {
+      Types.NestedField column = schema.findField(name);
+      ValidationException.check(column != null, "Cannot find column with name %s in schema", name);

Review comment:
       sounds good to me




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r605994202



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -482,6 +506,7 @@ public TableMetadata updateSchema(Schema newSchema, int newLastColumnId) {
     // rebuild all of the partition specs and sort orders for the new current schema
     List<PartitionSpec> updatedSpecs = Lists.transform(specs, spec -> updateSpecSchema(newSchema, spec));
     List<SortOrder> updatedSortOrders = Lists.transform(sortOrders, order -> updateSortOrderSchema(newSchema, order));
+    RowKey updatedRowKey = updateRowKeySchema(newSchema, rowKey);

Review comment:
       I think it would be a relatively small code change in this class and some testing which might be easy to include in this same PR, unless you are thinking about something different? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
openinx commented on pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#issuecomment-807982041


   I checked this RowKey specification twice again,  in my opinion  I think it's good enough to get this merged,  but since it's an  basic specification for iceberg table format.  I'd like to invite @rdblue and @aokolnychyi to do the double-check,  will be so appreciate if you two get this check when you have time, thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#issuecomment-809718836


   @openinx, I think that @jackye1995 is right about how the case you described would be encoded. The delete files themselves always encode what columns are used for the equality delete.
   
   There is no requirement that a delete file's delete columns match the table's row identifier fields. That's one reason why we can encode deletes right now, before we've added the row identifier tracking. That also enables deleting rows by different fields than the row identifier fields, which is what makes the evolution case possible.
   
   The row identifier fields are related to deletes only in that in situations where we don't have explicit delete columns in the operation, we can default the delete columns to the row identifier fields. That's to support the `UPSERT` case, where we define the identity fields in table metadata rather than in the sink configuration.
   
   From @jackye1995's second comment, I think there is at least some agreement that the row identifier columns don't need to be tracked over time. That's because there is no way to go back to an older snapshot and then manipulate that data. Time travel is read-only and data manipulation is always applied to the current snapshot, so it is reasonable that there is only ever one version of the row identifier that matters: the one that is configured at the start of the operation.
   
   Before moving ahead with this, I think we should simplify it and remove the versioning.
   
   I'm also wondering about the field ordering mentioned in the code. Is that relevant? I think of the row identifier fields as unordered and simply used to produce a projection of the table schema that is a row identifier, in whatever field order the schema had. So I would model this as an unordered set of IDs rather than as an ordered collection.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#issuecomment-816373542


   Thanks for the comment Ryan, I agree that this can be simplified into a part of the schema itself, something like:
   
   ```
   {
         "type": "struct",
         "schema-id": 0,
         "row-identifiers": [
           1
         ],
         "fields": [
           {
             "id": 1,
             "name": "x",
             "required": true,
             "type": "long"
           }
         ]
       }
   
   ```
   
   Originally we were trying to add this as a component of table metadata because of (1) benefit of versioning, (2) possibility to extend each row-identifying field for more potential use cases. Since we all agree that (2) is not likely, and we can also get versioning by adding it into schema, I think this is a good approach to go.
   
   I will update this PR tomorrow, @openinx meanwhile please let me know if you have any concern with the new approach.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r605320534



##########
File path: api/src/main/java/org/apache/iceberg/RowKey.java
##########
@@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.List;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.types.Types;
+
+/**
+ * Row key of a table.
+ * <p>
+ * Row key is a definition of table row uniqueness,
+ * similar to the concept of primary key in a relational database system.
+ * A row should be unique in a table based on the values of each {@link RowKeyField}.
+ * Iceberg itself does not enforce row uniqueness based on this key.
+ * It is leveraged by operations such as streaming upsert.
+ */
+public class RowKey implements Serializable {

Review comment:
       I don't have a strong opinion but I'd go for `RowIdentifier` or `RowId`. I think `Key` means uniqueness but I'll be fine this way too as long as we agree Iceberg does not ensure uniqueness.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r610211685



##########
File path: api/src/main/java/org/apache/iceberg/RowKeyIdentifierField.java
##########
@@ -0,0 +1,65 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Objects;
+
+/**
+ * An identifier field in {@link RowKey}
+ * <p>
+ * The field must be:
+ *  1. a required column in the table schema
+ *  2. a primitive type column
+ */
+public class RowKeyIdentifierField implements Serializable {

Review comment:
       This seems to be a class that wraps a single ID. Could we get rid of it? Instead, `IdentifierFields` could expose `idSet` or `ids` that returns `Collection<Integer>`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r601875062



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -877,6 +986,26 @@ private static SortOrder freshSortOrder(int orderId, Schema schema, SortOrder so
     return builder.build();
   }
 
+  private static RowKey freshRowKey(int keyId, Schema schema, RowKey rowKey) {
+    RowKey.Builder builder = RowKey.builderFor(schema).withKeyId(keyId);
+
+    for (RowKeyField field : rowKey.fields()) {
+      // look up the name of the source field in the old schema to get the new schema's id
+      String columnName = rowKey.schema().findColumnName(field.sourceId());
+      Preconditions.checkNotNull(columnName,
+          "Cannot find column in the row key's schema. id: %s, schema: %s",
+          field.sourceId(), rowKey.schema());
+
+      // reassign all row keys with fresh column IDs.
+      Types.NestedField column = schema.findField(columnName);
+      Preconditions.checkNotNull(column,
+          "Cannot find column in the fresh schema. name: %s, schema: %s", columnName, schema);
+      builder.addField(column.fieldId());

Review comment:
       good point, updated




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r606062513



##########
File path: api/src/main/java/org/apache/iceberg/catalog/Catalog.java
##########
@@ -22,6 +22,7 @@
 import java.util.List;
 import java.util.Map;
 import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.RowKey;

Review comment:
       Okay,  I agree it's good to use `TableBuilder` to `createTable` in future usage, as we are introducing more and more arguments when creating table.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2354: Core: add row identifier to format v2

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r600214178



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -111,6 +117,11 @@ static TableMetadata newTableMetadata(Schema schema,
     int freshSortOrderId = sortOrder.isUnsorted() ? sortOrder.orderId() : INITIAL_SORT_ORDER_ID;
     SortOrder freshSortOrder = freshSortOrder(freshSortOrderId, freshSchema, sortOrder);
 
+    // rebuild the row identifier using the new column ids
+    int freshRowIdentifierVersion = rowIdentifier.isNotIdentified() ?
+        rowIdentifier.rowIdVersion() : INITIAL_ROW_IDENTIFIER_VERSION;

Review comment:
       If the `RowIdentifier` is not identified, then the row-id-version should be 0 rather than 1 ? 

##########
File path: core/src/test/java/org/apache/iceberg/TestRowIdentifierParser.java
##########
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import org.junit.Assert;
+import org.junit.Test;
+import org.junit.runner.RunWith;
+import org.junit.runners.Parameterized;
+
+@RunWith(Parameterized.class)
+public class TestRowIdentifierParser extends TableTestBase {
+
+  @Parameterized.Parameters(name = "format = {0}")
+  public static Object[][] parameters() {
+    return new Object[][] {
+        {1},
+        {2}
+    };
+  }
+
+  public TestRowIdentifierParser(int formatVersion) {
+    super(formatVersion);
+  }
+
+  @Test
+  public void testToJson() {
+    String expected = "{\"row-id-version\":0,\"fields\":[]}";
+    Assert.assertEquals(expected, RowIdentifierParser.toJson(table.rowIdentifier(), false));
+
+    RowIdentifier id = RowIdentifier.builderFor(table.schema())
+        .addField("id")
+        .addField("data")
+        .build();
+
+    table.ops().commit(table.ops().current(), table.ops().current().updateRowIdentifier(id));
+
+    expected = "{\n" +
+        "  \"row-id-version\" : 1,\n" +
+        "  \"fields\" : [ {\n" +
+        "    \"source-id\" : 1\n" +
+        "  }, {\n" +
+        "    \"source-id\" : 2\n" +
+        "  } ]\n" +
+        "}";
+    Assert.assertEquals(expected, RowIdentifierParser.toJson(id, true));
+  }
+
+  @Test
+  public void testFromJson() {
+    String expected = "{\n" +
+        "  \"row-id-version\" : 1,\n" +
+        "  \"fields\" : [ {\n" +
+        "    \"source-id\" : 1\n" +
+        "  } ]\n" +
+        "}";
+
+    RowIdentifier expectedId = RowIdentifier.builderFor(table.schema())
+        .addField("id")
+        .build();
+
+    RowIdentifier id = RowIdentifierParser.fromJson(table.schema(), expected);
+    Assert.assertEquals(expectedId, id);

Review comment:
       Nit: do we also need to add an unit test for `NOT_IDENTIFIED` ? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
openinx commented on pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#issuecomment-814645938


   @rdblue ,  Pls take a final look at this specification if you have a chance.  It will be great if we could reach consistence on this PR and merge this into master branch.  (Some other PRs are blocked by this PR).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2354: Core: add row identifier to format v2

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r600946710



##########
File path: api/src/main/java/org/apache/iceberg/RowIdentifyField.java
##########
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Objects;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.types.Types;
+
+/**
+ * A field in {@link RowIdentifier}
+ * <p>
+ * The field must be:
+ *  1. a required column in the table schema
+ *  2. a primitive type column
+ */
+public class RowIdentifyField implements Serializable {
+
+  private final int sourceId;
+
+  RowIdentifyField(int sourceId) {
+    this.sourceId = sourceId;
+  }
+
+  RowIdentifyField(Types.NestedField column) {
+    ValidationException.check(column.isRequired(),
+        "Cannot add column %s to row identifier because it is not a required column", column);

Review comment:
       makes sense, will update




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r602006290



##########
File path: core/src/main/java/org/apache/iceberg/RowKeyParser.java
##########
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import com.fasterxml.jackson.core.JsonGenerator;
+import com.fasterxml.jackson.databind.JsonNode;
+import java.io.IOException;
+import java.io.StringWriter;
+import java.io.UncheckedIOException;
+import java.util.Iterator;
+import org.apache.iceberg.exceptions.RuntimeIOException;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.util.JsonUtil;
+
+public class RowKeyParser {
+  private static final String ROW_KEY_ID = "key-id";
+  private static final String FIELDS = "fields";
+  private static final String SOURCE_ID = "source-id";
+
+  private RowKeyParser() {
+  }
+
+  public static void toJson(RowKey rowKey, JsonGenerator generator) throws IOException {
+    generator.writeStartObject();
+    generator.writeNumberField(ROW_KEY_ID, rowKey.keyId());
+    generator.writeFieldName(FIELDS);
+    toJsonFields(rowKey, generator);
+    generator.writeEndObject();
+  }
+
+  public static String toJson(RowKey rowKey) {
+    return toJson(rowKey, false);
+  }
+
+  public static String toJson(RowKey rowKey, boolean pretty) {
+    try {
+      StringWriter writer = new StringWriter();
+      JsonGenerator generator = JsonUtil.factory().createGenerator(writer);
+      if (pretty) {
+        generator.useDefaultPrettyPrinter();
+      }
+      toJson(rowKey, generator);
+      generator.flush();
+      return writer.toString();
+
+    } catch (IOException e) {
+      throw new RuntimeIOException(e);

Review comment:
       We iceberg prefer to use the `UncheckedIOException`  rather than the deprecated `RuntimeIOException`, right ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r606065181



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -482,6 +506,7 @@ public TableMetadata updateSchema(Schema newSchema, int newLastColumnId) {
     // rebuild all of the partition specs and sort orders for the new current schema
     List<PartitionSpec> updatedSpecs = Lists.transform(specs, spec -> updateSpecSchema(newSchema, spec));
     List<SortOrder> updatedSortOrders = Lists.transform(sortOrders, order -> updateSortOrderSchema(newSchema, order));
+    RowKey updatedRowKey = updateRowKeySchema(newSchema, rowKey);

Review comment:
       @jackye1995  your plan looks great !




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r605992816



##########
File path: core/src/test/java/org/apache/iceberg/TestTableMetadata.java
##########
@@ -646,6 +659,55 @@ public void testUpdateSortOrder() {
         NullOrder.NULLS_FIRST, sortedByX.sortOrder().fields().get(0).nullOrder());
   }
 
+  @Test
+  public void testRowKey() {
+    Schema schema = new Schema(
+        Types.NestedField.required(10, "x", Types.StringType.get())
+    );
+
+    TableMetadata meta = TableMetadata.newTableMetadata(
+        schema, PartitionSpec.unpartitioned(), null, ImmutableMap.of());
+    RowKey rowKey = meta.rowKey();
+    Assert.assertTrue("Row key must be default", rowKey.isNotIdentified());
+  }
+
+  @Test
+  public void testUpdateRowKey() {
+    Schema schema = new Schema(
+        Types.NestedField.required(10, "x", Types.StringType.get()),
+        Types.NestedField.required(11, "y", Types.StringType.get())
+    );
+
+    RowKey id = RowKey.builderFor(schema).addField("x").build();
+
+    TableMetadata identifiedByX = TableMetadata.newTableMetadata(
+        schema, PartitionSpec.unpartitioned(), SortOrder.unsorted(), id, null, ImmutableMap.of());
+    RowKey actualKey = identifiedByX.rowKey();
+    Assert.assertEquals("Row key must have 1 field", 1, actualKey.identifierFields().size());
+    Assert.assertEquals("Row key must have the expected field",
+        org.apache.commons.compress.utils.Sets.newHashSet(1),

Review comment:
       oh yeah, Intellij keeps giving me this, I fixed a few and missed this one, thanks




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2354: Core: add row key to format v2

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#discussion_r606003885



##########
File path: core/src/main/java/org/apache/iceberg/TableMetadata.java
##########
@@ -482,6 +506,7 @@ public TableMetadata updateSchema(Schema newSchema, int newLastColumnId) {
     // rebuild all of the partition specs and sort orders for the new current schema
     List<PartitionSpec> updatedSpecs = Lists.transform(specs, spec -> updateSpecSchema(newSchema, spec));
     List<SortOrder> updatedSortOrders = Lists.transform(sortOrders, order -> updateSortOrderSchema(newSchema, order));
+    RowKey updatedRowKey = updateRowKeySchema(newSchema, rowKey);

Review comment:
       I was thinking about the following: 1 PR for the update related changes including update row key API, implementation, and this one; 1PR for work in Spark for SQL extension; 1PR in Flink for primary key clause (if no one did it yet)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org