You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/05/06 06:40:57 UTC

[GitHub] [iceberg] jackye1995 opened a new pull request #2556: Core: update schema constructor callers to include fresh identifier

jackye1995 opened a new pull request #2556:
URL: https://github.com/apache/iceberg/pull/2556


   fixes #2528 
   
   Allow callers of the schema constructor to take in identifier information and refresh the identifier field IDs if necessary.
   
   This mostly applies to schema update related code paths. The following 2 cases will continue to use the existing constructor without identifier:
   
   1. callers that are coverting from a file format schema or an engine schema except Flink, because they do not have a primary key-like concept.
       1. I assume @openinx will do all necessary changes in Flink through #2410. 
       2. I will do another PR for Spark with SQL extension added.
   2. callers that are converting schema for query transformation purpose such as `select`, `join`, file projection, etc., because identifier is not well-defined for those transformed schemas.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2556: Core: update schema constructor callers to include fresh identifier

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2556:
URL: https://github.com/apache/iceberg/pull/2556#discussion_r629064200



##########
File path: api/src/main/java/org/apache/iceberg/types/TypeUtil.java
##########
@@ -181,10 +183,24 @@ public static Schema assignFreshIds(int schemaId, Schema schema, NextID nextId)
    * @return a structurally identical schema with new ids assigned by the nextId function
    */
   public static Schema assignFreshIds(Schema schema, Schema baseSchema, NextID nextId) {
-    return new Schema(TypeUtil
-        .visit(schema.asStruct(), new AssignFreshIds(schema, baseSchema, nextId))
-        .asNestedType()
-        .fields());
+    Types.StructType freshSchemaStruct = TypeUtil

Review comment:
       @openinx so after reading the code a bit more, I think it does not make sense to convert alias in the methods. The reason is that, as the documentation suggests:
   
   > Alias maps are created when translating an external schema, like an Avro Schema, to this format. The original column names can be provided in a Map when constructing this Schema.
   
   So this happens in methods such as `AvroSchemaUtil.toIceberg`, `ParquetSchemaUtil.convert`, etc. However, these alias are never persisted in the actual table metadata. As a proof, the `TypeUtil.assignFreshIds` is called for every table metadata replacement, but the alias is never passed in. So changing the method to pass in the alias is a change of behavior and we should not do that. So I think the current implementation should be good enough, I will add a few tests based on what you and Yan suggested.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2556: Core: update schema constructor callers to include fresh identifier

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2556:
URL: https://github.com/apache/iceberg/pull/2556#discussion_r628560818



##########
File path: api/src/main/java/org/apache/iceberg/types/TypeUtil.java
##########
@@ -181,10 +183,24 @@ public static Schema assignFreshIds(int schemaId, Schema schema, NextID nextId)
    * @return a structurally identical schema with new ids assigned by the nextId function
    */
   public static Schema assignFreshIds(Schema schema, Schema baseSchema, NextID nextId) {
-    return new Schema(TypeUtil
-        .visit(schema.asStruct(), new AssignFreshIds(schema, baseSchema, nextId))
-        .asNestedType()
-        .fields());
+    Types.StructType freshSchemaStruct = TypeUtil

Review comment:
       yes practically speaking the alias is not used in any code path related, that was why I did not care about that when making the change. But from correctness perspective I agree with @openinx that if the alias exists, we should do the conversion just in case it is somehow used somewhere for that purpose in the future.
   
   What I am trying out is to change the `AssignFreshIds` visitor so that it can update the id along the way for `alias` and `identifier`. Will update the PR after completing the implementation.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2556: Core: update schema constructor callers to include fresh identifier

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2556:
URL: https://github.com/apache/iceberg/pull/2556#discussion_r627898753



##########
File path: api/src/main/java/org/apache/iceberg/types/TypeUtil.java
##########
@@ -181,10 +183,24 @@ public static Schema assignFreshIds(int schemaId, Schema schema, NextID nextId)
    * @return a structurally identical schema with new ids assigned by the nextId function
    */
   public static Schema assignFreshIds(Schema schema, Schema baseSchema, NextID nextId) {
-    return new Schema(TypeUtil
-        .visit(schema.asStruct(), new AssignFreshIds(schema, baseSchema, nextId))
-        .asNestedType()
-        .fields());
+    Types.StructType freshSchemaStruct = TypeUtil

Review comment:
       sorry forgot those methods, updated. So far I don't see a place those methods are called and need to rely on the alias, I will continue to look, if there is an exception I will put it in another PR.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2556: Core: update schema constructor callers to include fresh identifier

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2556:
URL: https://github.com/apache/iceberg/pull/2556#discussion_r627360225



##########
File path: api/src/main/java/org/apache/iceberg/types/TypeUtil.java
##########
@@ -181,10 +183,24 @@ public static Schema assignFreshIds(int schemaId, Schema schema, NextID nextId)
    * @return a structurally identical schema with new ids assigned by the nextId function
    */
   public static Schema assignFreshIds(Schema schema, Schema baseSchema, NextID nextId) {
-    return new Schema(TypeUtil
-        .visit(schema.asStruct(), new AssignFreshIds(schema, baseSchema, nextId))
-        .asNestedType()
-        .fields());
+    Types.StructType freshSchemaStruct = TypeUtil

Review comment:
       Except this `assignFreshIds`,   other methods that have the same mehold name should also refresh its identified field id list , right ?  I also think we will need more unit tests to address this changes.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2556: Core: update schema constructor callers to include fresh identifier

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2556:
URL: https://github.com/apache/iceberg/pull/2556#discussion_r627964098



##########
File path: api/src/main/java/org/apache/iceberg/types/TypeUtil.java
##########
@@ -181,10 +178,24 @@ public static Schema assignFreshIds(int schemaId, Schema schema, NextID nextId)
    * @return a structurally identical schema with new ids assigned by the nextId function
    */
   public static Schema assignFreshIds(Schema schema, Schema baseSchema, NextID nextId) {
-    return new Schema(TypeUtil
-        .visit(schema.asStruct(), new AssignFreshIds(schema, baseSchema, nextId))
-        .asNestedType()
-        .fields());
+    Types.StructType struct = TypeUtil
+            .visit(schema.asStruct(), new AssignFreshIds(schema, baseSchema, nextId))
+            .asStructType();
+    return new Schema(struct.fields(), refreshIdentifierFields(struct, schema));
+  }
+
+  /**
+   * Get the identifier fields in the fresh schema based on the identifier fields in the base schema.
+   * @param freshSchema fresh schema
+   * @param baseSchema base schema
+   * @return idnetifier fields in the fresh schema

Review comment:
       Nit: `idnetifier` -> `identifier`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2556: Core: update schema constructor callers to include fresh identifier

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2556:
URL: https://github.com/apache/iceberg/pull/2556#discussion_r628669046



##########
File path: api/src/main/java/org/apache/iceberg/types/TypeUtil.java
##########
@@ -213,7 +224,7 @@ public static Schema assignIncreasingFreshIds(Schema schema) {
    */
   public static Schema reassignIds(Schema schema, Schema idSourceSchema) {
     Types.StructType struct = visit(schema, new ReassignIds(idSourceSchema)).asStructType();
-    return new Schema(struct.fields());
+    return new Schema(struct.fields(), refreshIdentifierFields(struct, schema));

Review comment:
       I would prefer to call `refreshIdentifierFields` here to get an consistent view of identifier field id list for the `schema`,   without caring about the idsourceSchema’s own identifier field id list.  This makes the `reassignIds` method looks more general. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on pull request #2556: Core: update schema constructor callers to include fresh identifier

Posted by GitBox <gi...@apache.org>.
openinx commented on pull request #2556:
URL: https://github.com/apache/iceberg/pull/2556#issuecomment-836254599


   Looks like we are encountering the flaky unit test. Let's retry the travis ci !
   
   ```
   org.apache.iceberg.flink.TestFlinkTableSink > testHashDistributeMode[catalogName=testhive, baseNamespace=, format=AVRO, isStreaming=true] FAILED
       java.lang.AssertionError: There should be only 1 data file in partition 'aaa' expected:<1> but was:<2>
           at org.junit.Assert.fail(Assert.java:88)
           at org.junit.Assert.failNotEquals(Assert.java:834)
           at org.junit.Assert.assertEquals(Assert.java:645)
           at org.apache.iceberg.flink.TestFlinkTableSink.testHashDistributeMode(TestFlinkTableSink.java:274)
   
       org.apache.flink.table.api.ValidationException: Could not execute DROP DATABASE IF EXISTS  testhive.db RESTRICT
           at org.apache.flink.table.api.internal.TableEnvironmentImpl.executeOperation(TableEnvironmentImpl.java:989)
           at org.apache.flink.table.api.internal.TableEnvironmentImpl.executeSql(TableEnvironmentImpl.java:666)
           at org.apache.iceberg.flink.FlinkTestBase.exec(FlinkTestBase.java:91)
           at org.apache.iceberg.flink.FlinkTestBase.exec(FlinkTestBase.java:95)
           at org.apache.iceberg.flink.FlinkTestBase.sql(FlinkTestBase.java:99)
           at org.apache.iceberg.flink.TestFlinkTableSink.clean(TestFlinkTableSink.java:126)
   
           Caused by:
           org.apache.flink.table.catalog.exceptions.DatabaseNotEmptyException: Database db in catalog testhive is not empty.
               at org.apache.iceberg.flink.FlinkCatalog.dropDatabase(FlinkCatalog.java:240)
               at org.apache.flink.table.api.internal.TableEnvironmentImpl.executeOperation(TableEnvironmentImpl.java:983)
               ... 5 more
   
               Caused by:
               org.apache.iceberg.exceptions.NamespaceNotEmptyException: Namespace db is not empty. One or more tables exist.
                   at org.apache.iceberg.hive.HiveCatalog.dropNamespace(HiveCatalog.java:307)
                   at org.apache.iceberg.flink.FlinkCatalog.dropDatabase(FlinkCatalog.java:231)
                   ... 6 more
   
                   Caused by:
                   InvalidOperationException(message:Database db is not empty. One or more tables exist.)
                       at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$drop_database_result$drop_database_resultStandardScheme.read(ThriftHiveMetastore.java:28714)
                       at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$drop_database_result$drop_database_resultStandardScheme.read(ThriftHiveMetastore.java:28691)
                       at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$drop_database_result.read(ThriftHiveMetastore.java:28625)
                       at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:86)
                       at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_drop_database(ThriftHiveMetastore.java:813)
                       at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.drop_database(ThriftHiveMetastore.java:798)
                       at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.dropDatabase(HiveMetaStoreClient.java:868)
                       at org.apache.iceberg.hive.HiveCatalog.lambda$dropNamespace$9(HiveCatalog.java:296)
                       at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:51)
                       at org.apache.iceberg.hive.CachedClientPool.run(CachedClientPool.java:77)
                       at org.apache.iceberg.hive.HiveCatalog.dropNamespace(HiveCatalog.java:295)
                       ... 7 more
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2556: Core: update schema constructor callers to include fresh identifier

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2556:
URL: https://github.com/apache/iceberg/pull/2556#discussion_r627885078



##########
File path: api/src/main/java/org/apache/iceberg/types/TypeUtil.java
##########
@@ -181,10 +183,24 @@ public static Schema assignFreshIds(int schemaId, Schema schema, NextID nextId)
    * @return a structurally identical schema with new ids assigned by the nextId function
    */
   public static Schema assignFreshIds(Schema schema, Schema baseSchema, NextID nextId) {
-    return new Schema(TypeUtil
-        .visit(schema.asStruct(), new AssignFreshIds(schema, baseSchema, nextId))
-        .asNestedType()
-        .fields());
+    Types.StructType freshSchemaStruct = TypeUtil
+            .visit(schema.asStruct(), new AssignFreshIds(schema, baseSchema, nextId))
+            .asStructType();
+    return new Schema(freshSchemaStruct.fields(), refreshIdentifierFields(freshSchemaStruct, schema));
+  }
+
+  /**
+   * Get the identifier fields in the fresh schema based on the identifier fields in the base schema.
+   * @param freshSchema fresh schema
+   * @param baseSchema base schema
+   * @return idnetifier fields in the fresh schema
+   */
+  public static Set<Integer> refreshIdentifierFields(Types.StructType freshSchema, Schema baseSchema) {
+    Map<String, Integer> nameToId = TypeUtil.indexByName(freshSchema);
+    return baseSchema.identifierFieldNames().stream()
+            .map(nameToId::get)
+            .filter(Objects::nonNull)

Review comment:
       I chose that approach for use case like select, but later realized we should not support that, so throwing exception does make more sense, will update.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2556: Core: update schema constructor callers to include fresh identifier

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2556:
URL: https://github.com/apache/iceberg/pull/2556#discussion_r627973307



##########
File path: api/src/main/java/org/apache/iceberg/types/TypeUtil.java
##########
@@ -151,10 +152,8 @@ public static Type assignFreshIds(Type type, NextID nextId) {
    * @return a structurally identical schema with new ids assigned by the nextId function
    */
   public static Schema assignFreshIds(Schema schema, NextID nextId) {
-    return new Schema(TypeUtil
-        .visit(schema.asStruct(), new AssignFreshIds(nextId))
-        .asNestedType()
-        .fields());
+    Types.StructType struct = TypeUtil.visit(schema.asStruct(), new AssignFreshIds(nextId)).asStructType();

Review comment:
       I have another question about the `Schema#select`,  should we also project the field ids on top of `identifierFieldIds` ? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #2556: Core: update schema constructor callers to include fresh identifier

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #2556:
URL: https://github.com/apache/iceberg/pull/2556#discussion_r628525005



##########
File path: api/src/main/java/org/apache/iceberg/types/TypeUtil.java
##########
@@ -181,10 +183,24 @@ public static Schema assignFreshIds(int schemaId, Schema schema, NextID nextId)
    * @return a structurally identical schema with new ids assigned by the nextId function
    */
   public static Schema assignFreshIds(Schema schema, Schema baseSchema, NextID nextId) {
-    return new Schema(TypeUtil
-        .visit(schema.asStruct(), new AssignFreshIds(schema, baseSchema, nextId))
-        .asNestedType()
-        .fields());
+    Types.StructType freshSchemaStruct = TypeUtil

Review comment:
       I think alias is a bit different here, I think alias is used mostly for integration purpose from converting a file schema to iceberg schema for easier lookup (i.e. as a form of helper method), and isn't part of the the iceberg schema itself and isn't written to table metadata; so I think they may not be strictly required when constructing new schemas from this class. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx closed pull request #2556: Core: update schema constructor callers to include fresh identifier

Posted by GitBox <gi...@apache.org>.
openinx closed pull request #2556:
URL: https://github.com/apache/iceberg/pull/2556


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on pull request #2556: Core: update schema constructor callers to include fresh identifier

Posted by GitBox <gi...@apache.org>.
openinx commented on pull request #2556:
URL: https://github.com/apache/iceberg/pull/2556#issuecomment-837547761


   Looks good to me now.  I will get this merged !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2556: Core: update schema constructor callers to include fresh identifier

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2556:
URL: https://github.com/apache/iceberg/pull/2556#discussion_r627358314



##########
File path: api/src/main/java/org/apache/iceberg/types/TypeUtil.java
##########
@@ -181,10 +183,24 @@ public static Schema assignFreshIds(int schemaId, Schema schema, NextID nextId)
    * @return a structurally identical schema with new ids assigned by the nextId function
    */
   public static Schema assignFreshIds(Schema schema, Schema baseSchema, NextID nextId) {
-    return new Schema(TypeUtil
-        .visit(schema.asStruct(), new AssignFreshIds(schema, baseSchema, nextId))
-        .asNestedType()
-        .fields());
+    Types.StructType freshSchemaStruct = TypeUtil
+            .visit(schema.asStruct(), new AssignFreshIds(schema, baseSchema, nextId))
+            .asStructType();
+    return new Schema(freshSchemaStruct.fields(), refreshIdentifierFields(freshSchemaStruct, schema));
+  }
+
+  /**
+   * Get the identifier fields in the fresh schema based on the identifier fields in the base schema.
+   * @param freshSchema fresh schema
+   * @param baseSchema base schema
+   * @return idnetifier fields in the fresh schema
+   */
+  public static Set<Integer> refreshIdentifierFields(Types.StructType freshSchema, Schema baseSchema) {
+    Map<String, Integer> nameToId = TypeUtil.indexByName(freshSchema);
+    return baseSchema.identifierFieldNames().stream()
+            .map(nameToId::get)
+            .filter(Objects::nonNull)

Review comment:
       We couldn't just filter the field which couldn't locate a correct field id by the given column name,  instead we should throw an NullPointerException to say that the refresh has been failure.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2556: Core: update schema constructor callers to include fresh identifier

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2556:
URL: https://github.com/apache/iceberg/pull/2556#discussion_r628665313



##########
File path: api/src/main/java/org/apache/iceberg/types/TypeUtil.java
##########
@@ -151,10 +152,8 @@ public static Type assignFreshIds(Type type, NextID nextId) {
    * @return a structurally identical schema with new ids assigned by the nextId function
    */
   public static Schema assignFreshIds(Schema schema, NextID nextId) {
-    return new Schema(TypeUtil
-        .visit(schema.asStruct(), new AssignFreshIds(nextId))
-        .asNestedType()
-        .fields());
+    Types.StructType struct = TypeUtil.visit(schema.asStruct(), new AssignFreshIds(nextId)).asStructType();

Review comment:
       Yeah,   providing a projected identifier field ids looks very strange for people.  I think it's okay to remove the identifier fields when projecting.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2556: Core: update schema constructor callers to include fresh identifier

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2556:
URL: https://github.com/apache/iceberg/pull/2556#discussion_r627975342



##########
File path: api/src/main/java/org/apache/iceberg/types/TypeUtil.java
##########
@@ -181,10 +183,24 @@ public static Schema assignFreshIds(int schemaId, Schema schema, NextID nextId)
    * @return a structurally identical schema with new ids assigned by the nextId function
    */
   public static Schema assignFreshIds(Schema schema, Schema baseSchema, NextID nextId) {
-    return new Schema(TypeUtil
-        .visit(schema.asStruct(), new AssignFreshIds(schema, baseSchema, nextId))
-        .asNestedType()
-        .fields());
+    Types.StructType freshSchemaStruct = TypeUtil

Review comment:
       In theory, a `schema` is composed by `schemaId`, `fields`,  `aliases`, `identifierFieldIds` . We will need to maintain all those members when refresh or reassign field ids based on the old schema,  by default we should pass the old schema's  info to the fresh schema if people don't provide a necessary info to fill.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2556: Core: update schema constructor callers to include fresh identifier

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2556:
URL: https://github.com/apache/iceberg/pull/2556#discussion_r628558710



##########
File path: api/src/main/java/org/apache/iceberg/types/TypeUtil.java
##########
@@ -151,10 +152,8 @@ public static Type assignFreshIds(Type type, NextID nextId) {
    * @return a structurally identical schema with new ids assigned by the nextId function
    */
   public static Schema assignFreshIds(Schema schema, NextID nextId) {
-    return new Schema(TypeUtil
-        .visit(schema.asStruct(), new AssignFreshIds(nextId))
-        .asNestedType()
-        .fields());
+    Types.StructType struct = TypeUtil.visit(schema.asStruct(), new AssignFreshIds(nextId)).asStructType();

Review comment:
       Yes, that was why I chose to just remove the fields not matched instead of throwing NPE,. but then I realized it dies not make sense. 
   
   For example, for a table schema with identifier `(id, catalog)`,  it does not really make sense to say that, when selecting the `id` field, the new schema has identifier `(id)`. So I don't think it the identifier fields should be added there.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #2556: Core: update schema constructor callers to include fresh identifier

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #2556:
URL: https://github.com/apache/iceberg/pull/2556#discussion_r628519164



##########
File path: api/src/main/java/org/apache/iceberg/types/TypeUtil.java
##########
@@ -213,7 +224,7 @@ public static Schema assignIncreasingFreshIds(Schema schema) {
    */
   public static Schema reassignIds(Schema schema, Schema idSourceSchema) {
     Types.StructType struct = visit(schema, new ReassignIds(idSourceSchema)).asStructType();
-    return new Schema(struct.fields());
+    return new Schema(struct.fields(), refreshIdentifierFields(struct, schema));

Review comment:
       since `schema` passed in to this method is almost always a schema just got constructed (and thus have to call this method to assign the right ids), I think `refreshIdentifierFields(struct, schema)` here will almost always be a no-op? 
   
   I wonder if we want to use `idSourceSchema` to get the identifier fields, although that might have a different problem that the input `schema` could be a subset of `idSourceSchema` and thus doesn't include all identifier fields. Though in this case identifier fields won't be useful and we may skip it, and we can visit two schema and verify they have the same number of columns to identify this case. I'm not sure about the use case of this method and if we really need to assign identifier fields here though. 

##########
File path: api/src/test/java/org/apache/iceberg/types/TestTypeUtil.java
##########
@@ -42,6 +44,44 @@ public void testReassignIdsDuplicateColumns() {
     Assert.assertEquals(sourceSchema.asStruct(), actualSchema.asStruct());
   }
 
+  @Test
+  public void testReassignIdsWithIdentifier() {
+    Schema schema = new Schema(
+        Lists.newArrayList(
+            required(0, "a", Types.IntegerType.get()),
+            required(1, "A", Types.IntegerType.get())),
+        Sets.newHashSet(0)
+    );
+    Schema sourceSchema = new Schema(
+        Lists.newArrayList(
+            required(1, "a", Types.IntegerType.get()),
+            required(2, "A", Types.IntegerType.get())),
+        Sets.newHashSet(1)
+    );
+    final Schema actualSchema = TypeUtil.reassignIds(schema, sourceSchema);
+    Assert.assertEquals(sourceSchema.asStruct(), actualSchema.asStruct());
+    Assert.assertEquals(sourceSchema.identifierFieldIds(), actualSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testAssignIncreasingFreshIdWithIdentifier() {
+    Schema schema = new Schema(
+        Lists.newArrayList(
+            required(10, "a", Types.IntegerType.get()),
+            required(11, "A", Types.IntegerType.get())),
+        Sets.newHashSet(10)
+    );
+    Schema sourceSchema = new Schema(
+        Lists.newArrayList(
+            required(1, "a", Types.IntegerType.get()),
+            required(2, "A", Types.IntegerType.get())),
+        Sets.newHashSet(1)
+    );
+    final Schema actualSchema = TypeUtil.assignIncreasingFreshIds(schema);
+    Assert.assertEquals(sourceSchema.asStruct(), actualSchema.asStruct());
+    Assert.assertEquals(sourceSchema.identifierFieldIds(), actualSchema.identifierFieldIds());
+  }
+

Review comment:
       do we want to add a case to call `assignFreshIds(Schema schema, Schema baseSchema, NextID nextId)` with `schema` doesn't have identifier fields but `baseSchema` has, and the output schema will have identifier fields? I think it's an existing use case we have in TableMetadata that may worth explicit testing. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2556: Core: update schema constructor callers to include fresh identifier

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2556:
URL: https://github.com/apache/iceberg/pull/2556#discussion_r627362642



##########
File path: api/src/main/java/org/apache/iceberg/types/TypeUtil.java
##########
@@ -181,10 +183,24 @@ public static Schema assignFreshIds(int schemaId, Schema schema, NextID nextId)
    * @return a structurally identical schema with new ids assigned by the nextId function
    */
   public static Schema assignFreshIds(Schema schema, Schema baseSchema, NextID nextId) {
-    return new Schema(TypeUtil
-        .visit(schema.asStruct(), new AssignFreshIds(schema, baseSchema, nextId))
-        .asNestedType()
-        .fields());
+    Types.StructType freshSchemaStruct = TypeUtil

Review comment:
       Another thing (unrelated to this PR but I think it's important):   when we reconstruct the `Schema` in `assignFreshIds`,   looks like we've ignored the `Map<String, Integer> aliases`,  that not seems the correct behavior, right ? I mean we should use the existing `aliases` from the old schema to construct the refreshed new schema .




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2556: Core: update schema constructor callers to include fresh identifier

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2556:
URL: https://github.com/apache/iceberg/pull/2556#discussion_r627975342



##########
File path: api/src/main/java/org/apache/iceberg/types/TypeUtil.java
##########
@@ -181,10 +183,24 @@ public static Schema assignFreshIds(int schemaId, Schema schema, NextID nextId)
    * @return a structurally identical schema with new ids assigned by the nextId function
    */
   public static Schema assignFreshIds(Schema schema, Schema baseSchema, NextID nextId) {
-    return new Schema(TypeUtil
-        .visit(schema.asStruct(), new AssignFreshIds(schema, baseSchema, nextId))
-        .asNestedType()
-        .fields());
+    Types.StructType freshSchemaStruct = TypeUtil

Review comment:
       In theory, a `schema` is composed by `schemaId`, `fields`,  `aliases`, `identifierFieldIds` . We will need to maintain all those members when refresh or reassign field ids based on the old schema,  by default we should pass the old schema's  info to the fresh schema if people don't provide a necessary info to fill.
   
   Getting this work into a new PR looks good to me.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx merged pull request #2556: Core: update schema constructor callers to include fresh identifier

Posted by GitBox <gi...@apache.org>.
openinx merged pull request #2556:
URL: https://github.com/apache/iceberg/pull/2556


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #2556: Core: update schema constructor callers to include fresh identifier

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #2556:
URL: https://github.com/apache/iceberg/pull/2556#discussion_r628527215



##########
File path: api/src/main/java/org/apache/iceberg/types/TypeUtil.java
##########
@@ -151,10 +152,8 @@ public static Type assignFreshIds(Type type, NextID nextId) {
    * @return a structurally identical schema with new ids assigned by the nextId function
    */
   public static Schema assignFreshIds(Schema schema, NextID nextId) {
-    return new Schema(TypeUtil
-        .visit(schema.asStruct(), new AssignFreshIds(nextId))
-        .asNestedType()
-        .fields());
+    Types.StructType struct = TypeUtil.visit(schema.asStruct(), new AssignFreshIds(nextId)).asStructType();

Review comment:
       I think `Schema#select` is mostly used for schema projection as a form of schema transformation helper method I mentioned in earlier PRs, so I think missing `identifierFieldIds` might not be a problem. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org