You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/01/17 21:17:41 UTC

[GitHub] [iceberg] humengyu2012 opened a new pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

humengyu2012 opened a new pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912


   
   Creating a iceberg table in hive like: 
   ```sql
   CREATE TABLE test_iceberg (
     c1 int,
     c2 int,
     c3 string
   ) 
   PARTITIONED BY (c4 string)
   STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
   TBLPROPERTIES (
     'iceberg.identifier-fields' = 'c1,c2'
   );
   ```
   
   We could obtain an iceberg table:
   ```json
   {
     "format-version" : 1,
     "table-uuid" : "7b05fa2b-9504-4ea2-8e24-5f760a101f2f",
     "location" : "hdfs://xxx:8020/user/hive/warehouse/test_iceberg",
     "last-updated-ms" : 1642454127758,
     "last-column-id" : 4,
     "schema" : {
       "type" : "struct",
       "schema-id" : 0,
       "identifier-field-ids" : [ 1, 2 ],
       "fields" : [ {
         "id" : 1,
         "name" : "c1",
         "required" : true,
         "type" : "int"
       }, {
         "id" : 2,
         "name" : "c2",
         "required" : true,
         "type" : "int"
       }, {
         "id" : 3,
         "name" : "c3",
         "required" : false,
         "type" : "string"
       }, {
         "id" : 4,
         "name" : "c4",
         "required" : false,
         "type" : "string"
       } ]
     },
     "current-schema-id" : 0,
     "schemas" : [ {
       "type" : "struct",
       "schema-id" : 0,
       "identifier-field-ids" : [ 1, 2 ],
       "fields" : [ {
         "id" : 1,
         "name" : "c1",
         "required" : true,
         "type" : "int"
       }, {
         "id" : 2,
         "name" : "c2",
         "required" : true,
         "type" : "int"
       }, {
         "id" : 3,
         "name" : "c3",
         "required" : false,
         "type" : "string"
       }, {
         "id" : 4,
         "name" : "c4",
         "required" : false,
         "type" : "string"
       } ]
     } ],
     "partition-spec" : [ {
       "name" : "c4",
       "transform" : "identity",
       "source-id" : 4,
       "field-id" : 1000
     } ],
     "default-spec-id" : 0,
     "partition-specs" : [ {
       "spec-id" : 0,
       "fields" : [ {
         "name" : "c4",
         "transform" : "identity",
         "source-id" : 4,
         "field-id" : 1000
       } ]
     } ],
     "last-partition-id" : 1000,
     "default-sort-order-id" : 0,
     "sort-orders" : [ {
       "order-id" : 0,
       "fields" : [ ]
     } ],
     "properties" : {
       "engine.hive.enabled" : "true",
       "storage_handler" : "org.apache.iceberg.mr.hive.HiveIcebergStorageHandler"
     },
     "current-snapshot-id" : -1,
     "snapshots" : [ ],
     "snapshot-log" : [ ],
     "metadata-log" : [ ]
   }
   ```
   Although this feature is supported with json, it is too difficult for most people to write a json string in hive sql.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r790622663



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveSchemaUtil.java
##########
@@ -75,7 +83,42 @@ public static Schema convert(List<FieldSchema> fieldSchemas, boolean autoConvert
       typeInfos.add(TypeInfoUtils.getTypeInfoFromTypeString(col.getType()));
       comments.add(col.getComment());
     }
-    return HiveSchemaConverter.convert(names, typeInfos, comments, autoConvert);
+    Schema schema = HiveSchemaConverter.convert(names, typeInfos, comments, autoConvert);
+    return rebuildSchemaWithIdentifierFieldIds(schema, identifierFieldNames);
+  }
+
+  /**
+   * Rebuild a schema with given schema and identifierFieldNames
+   * @param schema The origin schema.
+   * @param identifierFieldNames The identifierFieldNames.
+   * @return New schema with IdentifierFieldIds.
+   */
+  public static Schema rebuildSchemaWithIdentifierFieldIds(Schema schema, Set<String> identifierFieldNames) {
+    // Identifier fields in nested field are not supported, so we just check the first level columns.
+    Map<String, Types.NestedField> columnsMap = schema.columns().stream()
+            .collect(Collectors.toMap(Types.NestedField::name, field -> field));
+    Set<Integer> identifierFieldIds = identifierFieldNames.stream()
+            .map(name -> {
+              Types.NestedField field = columnsMap.get(name);
+              if (field == null) {
+                // Does not exist or in nested field.
+                throw new IllegalArgumentException(
+                        String.format("Cannot add field `%s` as an identifier field: " +
+                                "must not in nested field and exist", name));
+              }
+              if (field.type().isNestedType()) {
+                // Field is nested.
+                throw new IllegalArgumentException(
+                        String.format("Cannot add field `%s` as an identifier field: " +

Review comment:
       I am not a native English speaker, but maybe this is better:
   ```
   Cannot set field `%s` as an identifier field: only primitive fields are allowed
   ```
   Or something better 😄




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] humengyu2012 commented on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
humengyu2012 commented on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1014947194


   I have tried to remove `id++`, but it will change the id generation order, and lead to some test fail. So I think it shouldn't be changed in this pr. I can do this in the next pr if we need it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1018609222


   @humengyu2012: Usually we stick to the PR, so the reviewers have easier time to follow what happens.
   You can have multiple changes (reverts etc) in a PR, so feel free to do as many as need.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r790765791



##########
File path: mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergSerDe.java
##########
@@ -176,12 +179,22 @@ private static Schema hiveSchemaOrThrow(Properties serDeProperties, Exception pr
       if (columnComments != null) {
         Collections.addAll(comments, columnComments.split(Character.toString(Character.MIN_VALUE)));
       }
+      String identifierFields = serDeProperties.getProperty(InputFormatConfig.IDENTIFIER_FIELD_NAMES);
+      Set<String> identifierFieldNames = getIdentifierFieldSet(identifierFields);
       Schema hiveSchema = HiveSchemaUtil.convert(names, TypeInfoUtils.getTypeInfosFromTypeString(columnTypes),
-              comments, autoConversion);
+              comments, autoConversion, identifierFieldNames);
       LOG.info("Using hive schema {}", SchemaParser.toJson(hiveSchema));
       return hiveSchema;
     } else {
       throw new SerDeException("Please provide an existing table or a valid schema", previousException);
     }
   }
+
+  public static Set<String> getIdentifierFieldSet(@Nullable String identifierFields) {

Review comment:
       Could we create a general method here instead, which encapsulates getting the list from the map too? Something like:
   ```
   public static Set<String> getIdentifierFieldSet(Map properties) {
   }
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] humengyu2012 commented on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
humengyu2012 commented on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1018677873


   Thanks, I will reopen this PR and close another one.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r790756310



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveSchemaUtil.java
##########
@@ -75,7 +83,42 @@ public static Schema convert(List<FieldSchema> fieldSchemas, boolean autoConvert
       typeInfos.add(TypeInfoUtils.getTypeInfoFromTypeString(col.getType()));
       comments.add(col.getComment());
     }
-    return HiveSchemaConverter.convert(names, typeInfos, comments, autoConvert);
+    Schema schema = HiveSchemaConverter.convert(names, typeInfos, comments, autoConvert);
+    return rebuildSchemaWithIdentifierFieldIds(schema, identifierFieldNames);
+  }
+
+  /**
+   * Rebuild a schema with given schema and identifierFieldNames
+   * @param schema The origin schema.
+   * @param identifierFieldNames The identifierFieldNames.
+   * @return New schema with IdentifierFieldIds.
+   */
+  public static Schema rebuildSchemaWithIdentifierFieldIds(Schema schema, Set<String> identifierFieldNames) {

Review comment:
       Could this be private, or at least package private and `@VisibleForTesting`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r794086057



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveSchemaUtil.java
##########
@@ -75,7 +91,46 @@ public static Schema convert(List<FieldSchema> fieldSchemas, boolean autoConvert
       typeInfos.add(TypeInfoUtils.getTypeInfoFromTypeString(col.getType()));
       comments.add(col.getComment());
     }
-    return HiveSchemaConverter.convert(names, typeInfos, comments, autoConvert);
+    Schema schema = HiveSchemaConverter.convert(names, typeInfos, comments, autoConvert);
+    return rebuildSchemaWithIdentifierFieldIds(schema, identifierFieldNames);
+  }
+
+  /**
+   * Rebuild a schema with given schema and identifierFieldNames
+   * @param schema The origin schema.
+   * @param identifierFieldNames The identifierFieldNames.
+   * @return New schema with IdentifierFieldIds.
+   */
+  @VisibleForTesting
+  static Schema rebuildSchemaWithIdentifierFieldIds(Schema schema, Set<String> identifierFieldNames) {
+    if (identifierFieldNames.size() == 0) {
+      return schema;
+    }
+    // Identifier fields in nested field are not supported, so we just check the first level columns.

Review comment:
       Can't we give a better error message by checking if the request has a nested field and rejecting it, rather than only checking for top-level columns?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] humengyu2012 commented on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
humengyu2012 commented on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1016111125


   https://github.com/apache/iceberg/issues/3920


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] humengyu2012 closed pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
humengyu2012 closed pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r791383079



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveSchemaUtil.java
##########
@@ -75,7 +84,43 @@ public static Schema convert(List<FieldSchema> fieldSchemas, boolean autoConvert
       typeInfos.add(TypeInfoUtils.getTypeInfoFromTypeString(col.getType()));
       comments.add(col.getComment());
     }
-    return HiveSchemaConverter.convert(names, typeInfos, comments, autoConvert);
+    Schema schema = HiveSchemaConverter.convert(names, typeInfos, comments, autoConvert);
+    return rebuildSchemaWithIdentifierFieldIds(schema, identifierFieldNames);

Review comment:
       Nit: one more optimization: do not rebuild the schema, if there is no identifer

##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveSchemaUtil.java
##########
@@ -75,7 +84,43 @@ public static Schema convert(List<FieldSchema> fieldSchemas, boolean autoConvert
       typeInfos.add(TypeInfoUtils.getTypeInfoFromTypeString(col.getType()));
       comments.add(col.getComment());
     }
-    return HiveSchemaConverter.convert(names, typeInfos, comments, autoConvert);
+    Schema schema = HiveSchemaConverter.convert(names, typeInfos, comments, autoConvert);
+    return rebuildSchemaWithIdentifierFieldIds(schema, identifierFieldNames);

Review comment:
       Maybe a quick return inside of the method 

##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveSchemaUtil.java
##########
@@ -54,7 +58,11 @@ private HiveSchemaUtil() {
    * @return An equivalent Iceberg Schema
    */
   public static Schema convert(List<FieldSchema> fieldSchemas) {
-    return convert(fieldSchemas, false);
+    return convert(fieldSchemas, false, Collections.emptySet());
+  }
+
+  public static Schema convert(List<FieldSchema> fieldSchemas, Set<String> identifierFieldNames) {

Review comment:
       Nit: add javadoc please 

##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveSchemaUtil.java
##########
@@ -112,8 +157,10 @@ public static Schema convert(List<String> names, List<TypeInfo> types, List<Stri
    *                    thrown.
    * @return The Iceberg schema
    */
-  public static Schema convert(List<String> names, List<TypeInfo> types, List<String> comments, boolean autoConvert) {
-    return HiveSchemaConverter.convert(names, types, comments, autoConvert);
+  public static Schema convert(List<String> names, List<TypeInfo> types, List<String> comments, boolean autoConvert,

Review comment:
       Fix javadoc 

##########
File path: mr/src/main/java/org/apache/iceberg/mr/InputFormatConfig.java
##########
@@ -49,6 +49,9 @@ private InputFormatConfig() {
   public static final String SERIALIZED_TABLE_PREFIX = "iceberg.mr.serialized.table.";
   public static final String TABLE_CATALOG_PREFIX = "iceberg.mr.table.catalog.";
   public static final String LOCALITY = "iceberg.mr.locality";
+  public static final String IDENTIFIER_FIELD_NAMES = "iceberg.identifier-field-names";
+  // Usually, column names containing ',' are not supported by hive, so we can use ',' as separator.

Review comment:
       AFAIK with the correct escaping, you can have anything in the column names. Could you please try that out?
   Thanks, Peter 

##########
File path: mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerNoScan.java
##########
@@ -773,4 +773,52 @@ public void testDropHiveTableWithoutUnderlyingTable() throws IOException {
   private String getCurrentSnapshotForHiveCatalogTable(org.apache.iceberg.Table icebergTable) {
     return ((BaseMetastoreTableOperations) ((BaseTable) icebergTable).operations()).currentMetadataLocation();
   }
+
+  @Test
+  public void testCreateTableWithIdentifierIds() {
+    TableIdentifier tableIdentifier = TableIdentifier.of("default", "customers");
+    shell.executeStatement(

Review comment:
       You might have missed this comment earlier 

##########
File path: mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerNoScan.java
##########
@@ -773,4 +773,52 @@ public void testDropHiveTableWithoutUnderlyingTable() throws IOException {
   private String getCurrentSnapshotForHiveCatalogTable(org.apache.iceberg.Table icebergTable) {
     return ((BaseMetastoreTableOperations) ((BaseTable) icebergTable).operations()).currentMetadataLocation();
   }
+
+  @Test
+  public void testCreateTableWithIdentifierIds() {
+    TableIdentifier tableIdentifier = TableIdentifier.of("default", "customers");
+    shell.executeStatement(
+        String.format("CREATE EXTERNAL TABLE %s ( " +
+                "  c1 INT, " +
+                "  c2 STRING, " +
+                "  c3 STRUCT<c4:STRING, c5:STRING> " +
+                ") " +
+                "PARTITIONED BY (c6 STRING) " +
+                "STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' %s " +
+                "TBLPROPERTIES (" +
+                "  '%s' = '%s'," +
+                "  '%s' = '%s'" +
+                ")",
+            tableIdentifier,
+            testTables.locationForCreateTableSQL(tableIdentifier),
+            InputFormatConfig.IDENTIFIER_FIELD_NAMES, "c1,c2",
+            InputFormatConfig.CATALOG_NAME, testTables.catalogName()));
+
+    org.apache.iceberg.Table table = testTables.loadTable(tableIdentifier);
+    Assert.assertEquals(ImmutableSet.of(1, 2), table.schema().identifierFieldIds());
+  }
+
+  @Test
+  public void testCreateTableWithIdentifierIdsError() {
+    TableIdentifier tableIdentifier = TableIdentifier.of("default", "customers");
+    Assert.assertThrows(
+        "Cannot add field c4 as an identifier field: must not in nested field",
+        IllegalArgumentException.class,
+        () -> shell.executeStatement(
+            String.format("CREATE EXTERNAL TABLE %s ( " +

Review comment:
       And this too 

##########
File path: mr/src/main/java/org/apache/iceberg/mr/InputFormatConfig.java
##########
@@ -49,6 +49,9 @@ private InputFormatConfig() {
   public static final String SERIALIZED_TABLE_PREFIX = "iceberg.mr.serialized.table.";
   public static final String TABLE_CATALOG_PREFIX = "iceberg.mr.table.catalog.";
   public static final String LOCALITY = "iceberg.mr.locality";
+  public static final String IDENTIFIER_FIELD_NAMES = "iceberg.identifier-field-names";
+  // Usually, column names containing ',' are not supported by hive, so we can use ',' as separator.

Review comment:
       We work around this when pushing column names with another config which allows us to provide a separator too. This could default to `, `,  so it is rarely used, but allows the user to fix the schema, if needed 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r791385061



##########
File path: mr/src/main/java/org/apache/iceberg/mr/InputFormatConfig.java
##########
@@ -49,6 +49,9 @@ private InputFormatConfig() {
   public static final String SERIALIZED_TABLE_PREFIX = "iceberg.mr.serialized.table.";
   public static final String TABLE_CATALOG_PREFIX = "iceberg.mr.table.catalog.";
   public static final String LOCALITY = "iceberg.mr.locality";
+  public static final String IDENTIFIER_FIELD_NAMES = "iceberg.identifier-field-names";
+  // Usually, column names containing ',' are not supported by hive, so we can use ',' as separator.

Review comment:
       AFAIK with the correct escaping, you can have anything in the column names. Could you please try that out?
   Thanks, Peter 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r790107120



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveSchemaUtil.java
##########
@@ -75,7 +84,36 @@ public static Schema convert(List<FieldSchema> fieldSchemas, boolean autoConvert
       typeInfos.add(TypeInfoUtils.getTypeInfoFromTypeString(col.getType()));
       comments.add(col.getComment());
     }
-    return HiveSchemaConverter.convert(names, typeInfos, comments, autoConvert);
+    Schema schema = HiveSchemaConverter.convert(names, typeInfos, comments, autoConvert);
+    return rebuildSchemaWithIdentifierFieldIds(schema, identifierFieldNames);
+  }
+
+  /**
+   * Rebuild a schema with given schema and identifierFieldNames
+   * @param schema The origin schema.
+   * @param identifierFieldNames The identifierFieldNames.
+   * @return New schema with IdentifierFieldIds.
+   */
+  public static Schema rebuildSchemaWithIdentifierFieldIds(Schema schema, Set<String> identifierFieldNames) {
+    Map<Integer, Integer> indexParents = TypeUtil.indexParents(schema.asStruct());
+    Set<Integer> identifierFieldIds = identifierFieldNames.stream()
+        .map(name -> {
+          Types.NestedField field = schema.findField(name);

Review comment:
       i think it would be better to simply just check the first level fields instead of a recursive check, and then throwing an exception if it is a nested field. Also it might be possible to have a nested field with the same name as a column.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] humengyu2012 edited a comment on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
humengyu2012 edited a comment on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1023914271


   @rdblue Do you mean the method should like this:
   ```java
      static Schema rebuildSchemaWithIdentifierFieldIds(Schema schema, Set<String> identifierFieldNames) {
       if (identifierFieldNames.size() == 0) {
         return schema;
       }
   
       Map<Integer, Integer> indexParents = TypeUtil.indexParents(schema.asStruct());
       Set<Integer> identifierFieldIds = identifierFieldNames.stream()
           .map(name -> {
             Types.NestedField field = schema.findField(name);
             Preconditions.checkArgument(field != null, "Column %s does not exist", name);
             
             int id = field.fieldId();
             Optional<Types.NestedField> parent = Optional.ofNullable(indexParents.get(id)).map(schema::findField);
             parent.ifPresent(parentField -> Preconditions.checkArgument(parentField.type().isStructType(),
                 "Cannot add field %s as an identifier field: must not be nested in %s", field.name(), parent));
             
             Preconditions.checkArgument(field.type().isPrimitiveType(),
                 "Cannot add field %s as an identifier field: not a primitive type field", field.name());
             return id;
           }).collect(Collectors.toSet());
   
       List<Types.NestedField> columns = schema.columns().stream()
           .map(column -> identifierFieldIds.contains(column.fieldId()) ? column.asRequired() : column)
           .collect(Collectors.toList());
       return new Schema(columns, identifierFieldIds);
     }
   ```
   The original implementation of this method looked like this, but @pvary and me both think it would be better to simply just check the first level fields instead of a recursive check. We just want to simplify the calculation for finding fields. I would like to know your suggestions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r792398300



##########
File path: mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerNoScan.java
##########
@@ -773,4 +773,53 @@ public void testDropHiveTableWithoutUnderlyingTable() throws IOException {
   private String getCurrentSnapshotForHiveCatalogTable(org.apache.iceberg.Table icebergTable) {
     return ((BaseMetastoreTableOperations) ((BaseTable) icebergTable).operations()).currentMetadataLocation();
   }
+
+  private String getCreateHiveTableSqlWithFixedColumns(

Review comment:
       nit: We do not break the methods at every parameter, just at 120 char, like:
   ```
   private String getCreateHiveTableSqlWithFixedColumns(TableIdentifier tableIdentifier,
         Map<String, String> tableProperties) {
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r793699480



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveSchemaUtil.java
##########
@@ -63,9 +78,10 @@ public static Schema convert(List<FieldSchema> fieldSchemas) {
    * @param autoConvert If <code>true</code> then TINYINT and SMALLINT is converted to INTEGER and VARCHAR and CHAR is
    *                    converted to STRING. Otherwise if these types are used in the Hive schema then exception is
    *                    thrown.
+   * @param identifierFieldNames The identifierFieldSet which corresponds to identifierFieldIdSet.

Review comment:
       nit: Maybe: `The names of the identifier fields for this schema`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1036236383


   @humengyu2012: Since @rdblue requested changes, I would ask his opinion before merging anything.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1018282278


   @humengyu2012: Why did you closed the PR, and created a new one?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] humengyu2012 edited a comment on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
humengyu2012 edited a comment on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1016111125


   issue https://github.com/apache/iceberg/issues/3920


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] humengyu2012 removed a comment on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
humengyu2012 removed a comment on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1014941681


   I modified my code according to your suggestion, can you help me look at it again?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] humengyu2012 removed a comment on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
humengyu2012 removed a comment on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1016635114


   New pr is https://github.com/apache/iceberg/pull/3929.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r790761416



##########
File path: mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java
##########
@@ -103,11 +105,13 @@ public void preCreateTable(org.apache.hadoop.hive.metastore.api.Table hmsTable)
       }
     }
 
+    Set<String> identifierFieldNames = HiveIcebergSerDe.getIdentifierFieldSet(
+        hmsTable.getParameters().remove(InputFormatConfig.IDENTIFIER_FIELD_NAMES));

Review comment:
       Could we use just a `get` here, and remove the parameter using the `PARAMETERS_TO_REMOVE`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r791383079



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveSchemaUtil.java
##########
@@ -75,7 +84,43 @@ public static Schema convert(List<FieldSchema> fieldSchemas, boolean autoConvert
       typeInfos.add(TypeInfoUtils.getTypeInfoFromTypeString(col.getType()));
       comments.add(col.getComment());
     }
-    return HiveSchemaConverter.convert(names, typeInfos, comments, autoConvert);
+    Schema schema = HiveSchemaConverter.convert(names, typeInfos, comments, autoConvert);
+    return rebuildSchemaWithIdentifierFieldIds(schema, identifierFieldNames);

Review comment:
       Nit: one more optimization: do not rebuild the schema, if there is no identifer




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1020940450


   
   
   
   
   > > Could TestTables.createHiveTableSQL help here?
   > 
   > ![image](https://user-images.githubusercontent.com/45056332/150929210-e73844bc-d955-4cf6-abab-978f8d43328d.png) @pvary TestTables.createHiveTableSQL can not set columns, but I need some hive columns to test. Could I continue to use sql?
   
   Good point. We might want to add a new method to the TestTables then, but if handling column types is too complicated, then at least create a local test method, so we do not duplicate the code.
   We might also want to test having column names matching nested field names too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1022027147


   > @pvary Is the format OK? My code is always formatted automatically because I have imported the format profile [intellij-java-palantir-style.xml](https://github.com/apache/iceberg/blob/master/.baseline/idea/intellij-java-palantir-style.xml).
   
   My main problem was that the `HiveIcebergMetaHook.PARAMETERS_TO_REMOVE`, and `HiveIcebergMetaHook.PROPERTIES_TO_REMOVE` was formatted differently (maybe even before your patch)
   
   Otherwise it is fine with me


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1021968690


   Only formatting changes left from my side.
   Let's see how the tests are going.
   
   Thanks for the cahnges @humengyu2012. Well done! 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] humengyu2012 commented on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
humengyu2012 commented on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1018577609


   1. I think my previous implementation was so bad and had too many useless commits, so I made some big changes;
   2. The hive table property 'iceberg.identifier-fields' is not good, I rename it to 'iceberg.identifier-field-names' and change my description in new PR. 
   Too many things I changed , so I checkout a new branch. I'm sorry to confuse you, maybe I should continue here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r790107509



##########
File path: mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerNoScan.java
##########
@@ -773,4 +773,52 @@ public void testDropHiveTableWithoutUnderlyingTable() throws IOException {
   private String getCurrentSnapshotForHiveCatalogTable(org.apache.iceberg.Table icebergTable) {
     return ((BaseMetastoreTableOperations) ((BaseTable) icebergTable).operations()).currentMetadataLocation();
   }
+
+  @Test
+  public void testCreateTableWithIdentifierIds() {
+    TableIdentifier tableIdentifier = TableIdentifier.of("default", "customers");
+    shell.executeStatement(
+        String.format("CREATE EXTERNAL TABLE %s ( " +
+                "  c1 INT, " +
+                "  c2 STRING, " +
+                "  c3 STRUCT<c4:STRING, c5:STRING> " +
+                ") " +
+                "PARTITIONED BY (c6 STRING) " +
+                "STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' %s " +
+                "TBLPROPERTIES (" +
+                "  '%s' = '%s'," +
+                "  '%s' = '%s'" +
+                ")",
+            tableIdentifier,
+            testTables.locationForCreateTableSQL(tableIdentifier),
+            InputFormatConfig.IDENTIFIER_FIELD_NAMES, "c1,c2",
+            InputFormatConfig.CATALOG_NAME, testTables.catalogName()));
+
+    org.apache.iceberg.Table table = testTables.loadTable(tableIdentifier);
+    Assert.assertEquals(ImmutableSet.of(1, 2), table.schema().identifierFieldIds());
+  }
+
+  @Test
+  public void testCreateTableWithIdentifierIdsError() {
+    TableIdentifier tableIdentifier = TableIdentifier.of("default", "customers");
+    Assert.assertThrows(
+        "Cannot add field c4 as an identifier field: must not in nested field",
+        IllegalArgumentException.class,
+        () -> shell.executeStatement(
+            String.format("CREATE EXTERNAL TABLE %s ( " +

Review comment:
       Same as above




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r791384092



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveSchemaUtil.java
##########
@@ -54,7 +58,11 @@ private HiveSchemaUtil() {
    * @return An equivalent Iceberg Schema
    */
   public static Schema convert(List<FieldSchema> fieldSchemas) {
-    return convert(fieldSchemas, false);
+    return convert(fieldSchemas, false, Collections.emptySet());
+  }
+
+  public static Schema convert(List<FieldSchema> fieldSchemas, Set<String> identifierFieldNames) {

Review comment:
       Nit: add javadoc please 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] humengyu2012 commented on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
humengyu2012 commented on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1022002048


   @pvary Is the format OK? My format always auto format because I have import the format profile (checkstyle.xml)[https://github.com/apache/iceberg/blob/master/.baseline/checkstyle/checkstyle.xml].


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r790766335



##########
File path: mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergSerDe.java
##########
@@ -176,12 +179,22 @@ private static Schema hiveSchemaOrThrow(Properties serDeProperties, Exception pr
       if (columnComments != null) {
         Collections.addAll(comments, columnComments.split(Character.toString(Character.MIN_VALUE)));
       }
+      String identifierFields = serDeProperties.getProperty(InputFormatConfig.IDENTIFIER_FIELD_NAMES);
+      Set<String> identifierFieldNames = getIdentifierFieldSet(identifierFields);
       Schema hiveSchema = HiveSchemaUtil.convert(names, TypeInfoUtils.getTypeInfosFromTypeString(columnTypes),
-              comments, autoConversion);
+              comments, autoConversion, identifierFieldNames);
       LOG.info("Using hive schema {}", SchemaParser.toJson(hiveSchema));
       return hiveSchema;
     } else {
       throw new SerDeException("Please provide an existing table or a valid schema", previousException);
     }
   }
+
+  public static Set<String> getIdentifierFieldSet(@Nullable String identifierFields) {
+    return Optional.ofNullable(identifierFields)
+        .filter(s -> !s.isEmpty())
+        .map(value -> Arrays.asList((value).split(InputFormatConfig.IDENTIFIER_FIELDS_SEPARATOR)))

Review comment:
       What happens when we have a name which contains `,`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r790754822



##########
File path: hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveSchemaUtil.java
##########
@@ -170,6 +172,38 @@ public void testConversionWithoutLastComment() {
     Assert.assertEquals(expected.asStruct(), schema.asStruct());
   }
 
+  @Test
+  public void testRebuildSchemaWithIdentifierFieldIds() {
+    Schema actualSchema = HiveSchemaUtil.rebuildSchemaWithIdentifierFieldIds(COMPLEX_ICEBERG_SCHEMA,
+            ImmutableSet.of("id", "name"));
+
+    List<Types.NestedField> columns = Lists.newArrayList(COMPLEX_ICEBERG_SCHEMA.columns());
+    columns.set(0, columns.get(0).asRequired());
+    columns.set(1, columns.get(1).asRequired());
+    Schema expectedSchema = new Schema(columns, ImmutableSet.of(0, 1));
+
+    Assert.assertEquals(SchemaParser.toJson(expectedSchema), SchemaParser.toJson(actualSchema));
+  }
+
+  @Test
+  public void testRebuildSchemaWithIdentifierFieldIdsError() {
+    Assert.assertThrows("Cannot add field `not_exist_column` as an identifier field: " +
+                    "the field must exist on the root level",

Review comment:
       nit: Please fix the indentation. Continuation is 4 spaces




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r790758387



##########
File path: mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergSerDe.java
##########
@@ -176,12 +179,22 @@ private static Schema hiveSchemaOrThrow(Properties serDeProperties, Exception pr
       if (columnComments != null) {
         Collections.addAll(comments, columnComments.split(Character.toString(Character.MIN_VALUE)));
       }
+      String identifierFields = serDeProperties.getProperty(InputFormatConfig.IDENTIFIER_FIELD_NAMES);
+      Set<String> identifierFieldNames = getIdentifierFieldSet(identifierFields);
       Schema hiveSchema = HiveSchemaUtil.convert(names, TypeInfoUtils.getTypeInfosFromTypeString(columnTypes),
-              comments, autoConversion);
+              comments, autoConversion, identifierFieldNames);

Review comment:
       nit: indentation




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r791386170



##########
File path: mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerNoScan.java
##########
@@ -773,4 +773,52 @@ public void testDropHiveTableWithoutUnderlyingTable() throws IOException {
   private String getCurrentSnapshotForHiveCatalogTable(org.apache.iceberg.Table icebergTable) {
     return ((BaseMetastoreTableOperations) ((BaseTable) icebergTable).operations()).currentMetadataLocation();
   }
+
+  @Test
+  public void testCreateTableWithIdentifierIds() {
+    TableIdentifier tableIdentifier = TableIdentifier.of("default", "customers");
+    shell.executeStatement(

Review comment:
       You might have missed this comment earlier 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] humengyu2012 edited a comment on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
humengyu2012 edited a comment on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1022002048


   @pvary Is the format OK? My code is always formatted automatically because I have import the format profile (checkstyle.xml)[https://github.com/apache/iceberg/blob/master/.baseline/checkstyle/checkstyle.xml].


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] humengyu2012 commented on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
humengyu2012 commented on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1014941681


   I modified my code according to your suggestion, can you help me look at it again?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] humengyu2012 commented on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
humengyu2012 commented on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1035117292


   Could anyone tell me what to do next? Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r790753374



##########
File path: hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveSchemaUtil.java
##########
@@ -170,6 +172,38 @@ public void testConversionWithoutLastComment() {
     Assert.assertEquals(expected.asStruct(), schema.asStruct());
   }
 
+  @Test
+  public void testRebuildSchemaWithIdentifierFieldIds() {
+    Schema actualSchema = HiveSchemaUtil.rebuildSchemaWithIdentifierFieldIds(COMPLEX_ICEBERG_SCHEMA,
+            ImmutableSet.of("id", "name"));

Review comment:
       nit: Please fix the indentation. Continuation is 4 spaces




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r791388529



##########
File path: mr/src/main/java/org/apache/iceberg/mr/InputFormatConfig.java
##########
@@ -49,6 +49,9 @@ private InputFormatConfig() {
   public static final String SERIALIZED_TABLE_PREFIX = "iceberg.mr.serialized.table.";
   public static final String TABLE_CATALOG_PREFIX = "iceberg.mr.table.catalog.";
   public static final String LOCALITY = "iceberg.mr.locality";
+  public static final String IDENTIFIER_FIELD_NAMES = "iceberg.identifier-field-names";
+  // Usually, column names containing ',' are not supported by hive, so we can use ',' as separator.

Review comment:
       We work around this when pushing column names with another config which allows us to provide a separator too. This could default to `, `,  so it is rarely used, but allows the user to fix the schema, if needed 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r790762525



##########
File path: mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java
##########
@@ -103,11 +105,13 @@ public void preCreateTable(org.apache.hadoop.hive.metastore.api.Table hmsTable)
       }
     }
 
+    Set<String> identifierFieldNames = HiveIcebergSerDe.getIdentifierFieldSet(
+        hmsTable.getParameters().remove(InputFormatConfig.IDENTIFIER_FIELD_NAMES));

Review comment:
       Also we might want to move this into the `schema` method




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r791384304



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveSchemaUtil.java
##########
@@ -112,8 +157,10 @@ public static Schema convert(List<String> names, List<TypeInfo> types, List<Stri
    *                    thrown.
    * @return The Iceberg schema
    */
-  public static Schema convert(List<String> names, List<TypeInfo> types, List<String> comments, boolean autoConvert) {
-    return HiveSchemaConverter.convert(names, types, comments, autoConvert);
+  public static Schema convert(List<String> names, List<TypeInfo> types, List<String> comments, boolean autoConvert,

Review comment:
       Fix javadoc 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] humengyu2012 commented on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
humengyu2012 commented on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1020904201


   
   > Could TestTables.createHiveTableSQL help here?
   
   ![image](https://user-images.githubusercontent.com/45056332/150929210-e73844bc-d955-4cf6-abab-978f8d43328d.png)
   @pvary TestTables.createHiveTableSQL can not set columns, but I need some hive columns to test. Could I continue to use sql?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r786301689



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveSchemaUtil.java
##########
@@ -75,7 +79,18 @@ public static Schema convert(List<FieldSchema> fieldSchemas, boolean autoConvert
       typeInfos.add(TypeInfoUtils.getTypeInfoFromTypeString(col.getType()));
       comments.add(col.getComment());
     }
-    return HiveSchemaConverter.convert(names, typeInfos, comments, autoConvert);
+    Set<Integer> identifierFieldIdSet = identifierFieldIdSetOrThrow(names, identifierFieldSet);
+    return HiveSchemaConverter.convert(names, typeInfos, comments, autoConvert, identifierFieldIdSet);
+  }
+
+  public static Set<Integer> identifierFieldIdSetOrThrow(List<String> names, Set<String> identifierFieldSet) {
+    return identifierFieldSet.stream()
+        .map(field -> {
+          int index = names.indexOf(field);

Review comment:
       Field IDs are not indexes, they are the integer identifiers embedded in an Iceberg schema.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r790107477



##########
File path: mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerNoScan.java
##########
@@ -773,4 +773,52 @@ public void testDropHiveTableWithoutUnderlyingTable() throws IOException {
   private String getCurrentSnapshotForHiveCatalogTable(org.apache.iceberg.Table icebergTable) {
     return ((BaseMetastoreTableOperations) ((BaseTable) icebergTable).operations()).currentMetadataLocation();
   }
+
+  @Test
+  public void testCreateTableWithIdentifierIds() {
+    TableIdentifier tableIdentifier = TableIdentifier.of("default", "customers");
+    shell.executeStatement(

Review comment:
       Could TestTables.createHiveTableSQL help here?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] humengyu2012 edited a comment on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
humengyu2012 edited a comment on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1022002048


   @pvary Is the format OK? My code is always formatted automatically because I have import the format profile [intellij-java-palantir-style.xml](https://github.com/apache/iceberg/blob/master/.baseline/idea/intellij-java-palantir-style.xml).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r790621850



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveSchemaUtil.java
##########
@@ -75,7 +83,42 @@ public static Schema convert(List<FieldSchema> fieldSchemas, boolean autoConvert
       typeInfos.add(TypeInfoUtils.getTypeInfoFromTypeString(col.getType()));
       comments.add(col.getComment());
     }
-    return HiveSchemaConverter.convert(names, typeInfos, comments, autoConvert);
+    Schema schema = HiveSchemaConverter.convert(names, typeInfos, comments, autoConvert);
+    return rebuildSchemaWithIdentifierFieldIds(schema, identifierFieldNames);
+  }
+
+  /**
+   * Rebuild a schema with given schema and identifierFieldNames
+   * @param schema The origin schema.
+   * @param identifierFieldNames The identifierFieldNames.
+   * @return New schema with IdentifierFieldIds.
+   */
+  public static Schema rebuildSchemaWithIdentifierFieldIds(Schema schema, Set<String> identifierFieldNames) {
+    // Identifier fields in nested field are not supported, so we just check the first level columns.
+    Map<String, Types.NestedField> columnsMap = schema.columns().stream()
+            .collect(Collectors.toMap(Types.NestedField::name, field -> field));
+    Set<Integer> identifierFieldIds = identifierFieldNames.stream()
+            .map(name -> {
+              Types.NestedField field = columnsMap.get(name);
+              if (field == null) {
+                // Does not exist or in nested field.
+                throw new IllegalArgumentException(
+                        String.format("Cannot add field `%s` as an identifier field: " +
+                                "must not in nested field and exist", name));

Review comment:
       I am not a native English speaker, but maybe this is better:
   ```
   Cannot set field `%s` as an identifier field: the field must exist on the root level
   ```
   
   Or something better 😄 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r790618080



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveSchemaUtil.java
##########
@@ -75,7 +83,42 @@ public static Schema convert(List<FieldSchema> fieldSchemas, boolean autoConvert
       typeInfos.add(TypeInfoUtils.getTypeInfoFromTypeString(col.getType()));
       comments.add(col.getComment());
     }
-    return HiveSchemaConverter.convert(names, typeInfos, comments, autoConvert);
+    Schema schema = HiveSchemaConverter.convert(names, typeInfos, comments, autoConvert);
+    return rebuildSchemaWithIdentifierFieldIds(schema, identifierFieldNames);
+  }
+
+  /**
+   * Rebuild a schema with given schema and identifierFieldNames
+   * @param schema The origin schema.
+   * @param identifierFieldNames The identifierFieldNames.
+   * @return New schema with IdentifierFieldIds.
+   */
+  public static Schema rebuildSchemaWithIdentifierFieldIds(Schema schema, Set<String> identifierFieldNames) {
+    // Identifier fields in nested field are not supported, so we just check the first level columns.
+    Map<String, Types.NestedField> columnsMap = schema.columns().stream()
+            .collect(Collectors.toMap(Types.NestedField::name, field -> field));
+    Set<Integer> identifierFieldIds = identifierFieldNames.stream()
+            .map(name -> {
+              Types.NestedField field = columnsMap.get(name);
+              if (field == null) {
+                // Does not exist or in nested field.
+                throw new IllegalArgumentException(
+                        String.format("Cannot add field `%s` as an identifier field: " +
+                                "must not in nested field and exist", name));
+              }
+              if (field.type().isNestedType()) {

Review comment:
       I think this should be `field.type().isPrimitiveType()`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r786301646



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveSchemaConverter.java
##########
@@ -50,21 +52,27 @@ private HiveSchemaConverter(boolean autoConvert) {
     this.id = 0;
   }
 
-  static Schema convert(List<String> names, List<TypeInfo> typeInfos, List<String> comments, boolean autoConvert) {
+  static Schema convert(List<String> names, List<TypeInfo> typeInfos, List<String> comments, boolean autoConvert,
+                        Set<Integer> identifierFieldSet) {
     HiveSchemaConverter converter = new HiveSchemaConverter(autoConvert);
-    return new Schema(converter.convertInternal(names, typeInfos, comments));
+    return new Schema(converter.convertInternal(names, typeInfos, comments, identifierFieldSet), identifierFieldSet);
   }
 
   static Type convert(TypeInfo typeInfo, boolean autoConvert) {
     HiveSchemaConverter converter = new HiveSchemaConverter(autoConvert);
     return converter.convertType(typeInfo);
   }
 
-  List<Types.NestedField> convertInternal(List<String> names, List<TypeInfo> typeInfos, List<String> comments) {
+  List<Types.NestedField> convertInternal(List<String> names, List<TypeInfo> typeInfos, List<String> comments,
+                                          Set<Integer> identifierFieldSet) {
     List<Types.NestedField> result = Lists.newArrayListWithExpectedSize(names.size());
     for (int i = 0; i < names.size(); ++i) {
-      result.add(Types.NestedField.optional(id++, names.get(i), convertType(typeInfos.get(i)),
-          (comments.isEmpty() || i >= comments.size()) ? null : comments.get(i)));
+      String doc = (comments.isEmpty() || i >= comments.size()) ? null : comments.get(i);
+      if (identifierFieldSet.contains(i)) {
+        result.add(Types.NestedField.required(id++, names.get(i), convertType(typeInfos.get(i)), doc));

Review comment:
       Iceberg code does not use the return value of `++` expressions because it tends to be confusing.
   
   Also, the identifier handling looks incorrect. The IDs used to create fields here determine the set of field IDs passed as the identifier field set. You can't have a set of IDs before those IDs are assigned. I suspect that this set should be a set of column names.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r786924755



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveSchemaConverter.java
##########
@@ -50,21 +52,32 @@ private HiveSchemaConverter(boolean autoConvert) {
     this.id = 0;
   }
 
-  static Schema convert(List<String> names, List<TypeInfo> typeInfos, List<String> comments, boolean autoConvert) {
+  static Schema convert(List<String> names, List<TypeInfo> typeInfos, List<String> comments, boolean autoConvert,
+                        Set<String> identifierFieldSet) {
     HiveSchemaConverter converter = new HiveSchemaConverter(autoConvert);
-    return new Schema(converter.convertInternal(names, typeInfos, comments));
+    List<Types.NestedField> fields = converter.convertInternal(names, typeInfos, comments, identifierFieldSet);
+    Set<Integer> identifierIdSet = HiveSchemaUtil.identifierFieldIdSetOrThrow(fields, identifierFieldSet);
+    return new Schema(fields, identifierIdSet);
   }
 
   static Type convert(TypeInfo typeInfo, boolean autoConvert) {
     HiveSchemaConverter converter = new HiveSchemaConverter(autoConvert);
     return converter.convertType(typeInfo);
   }
 
-  List<Types.NestedField> convertInternal(List<String> names, List<TypeInfo> typeInfos, List<String> comments) {
+  List<Types.NestedField> convertInternal(List<String> names, List<TypeInfo> typeInfos, List<String> comments,
+                                          Set<String> identifierFieldSet) {
     List<Types.NestedField> result = Lists.newArrayListWithExpectedSize(names.size());
     for (int i = 0; i < names.size(); ++i) {
-      result.add(Types.NestedField.optional(id++, names.get(i), convertType(typeInfos.get(i)),
-          (comments.isEmpty() || i >= comments.size()) ? null : comments.get(i)));
+      int fieldId = id;
+      id = id + 1;

Review comment:
       I think it is more direct to use id, but increment it at the end of the for block.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1020940450


   
   
   
   
   > > Could TestTables.createHiveTableSQL help here?
   > 
   > ![image](https://user-images.githubusercontent.com/45056332/150929210-e73844bc-d955-4cf6-abab-978f8d43328d.png) @pvary TestTables.createHiveTableSQL can not set columns, but I need some hive columns to test. Could I continue to use sql?
   
   Good point. We might want to add a new method to the TestTables then, but if handling column types is too complicated, then at least create a local test method, so we do not duplicate the code.
   We might also want to test having column names matching nested field names too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] humengyu2012 commented on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
humengyu2012 commented on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1021895143


   @pvary I added a new test method to fix duplicated sql, what do you think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r793701901



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveSchemaUtil.java
##########
@@ -110,10 +165,13 @@ public static Schema convert(List<String> names, List<TypeInfo> types, List<Stri
    * @param autoConvert If <code>true</code> then TINYINT and SMALLINT is converted to INTEGER and VARCHAR and CHAR is
    *                    converted to STRING. Otherwise if these types are used in the Hive schema then exception is
    *                    thrown.
+   * @param identifierFieldNames The identifierFieldSet which corresponds to identifierFieldIdSet.

Review comment:
       The names of the identifier fields for the schema

##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveSchemaUtil.java
##########
@@ -63,9 +78,10 @@ public static Schema convert(List<FieldSchema> fieldSchemas) {
    * @param autoConvert If <code>true</code> then TINYINT and SMALLINT is converted to INTEGER and VARCHAR and CHAR is
    *                    converted to STRING. Otherwise if these types are used in the Hive schema then exception is
    *                    thrown.
+   * @param identifierFieldNames The identifierFieldSet which corresponds to identifierFieldIdSet.

Review comment:
       nit: Maybe: `The names of the identifier fields for the schema`

##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveSchemaUtil.java
##########
@@ -54,7 +58,18 @@ private HiveSchemaUtil() {
    * @return An equivalent Iceberg Schema
    */
   public static Schema convert(List<FieldSchema> fieldSchemas) {
-    return convert(fieldSchemas, false);
+    return convert(fieldSchemas, false, Collections.emptySet());
+  }
+
+  /**
+   * Converts a Hive schema (list of FieldSchema objects) to an Iceberg schema. If some of the types are not convertible
+   * then exception is thrown.
+   * @param fieldSchemas The list of the columns.
+   * @param identifierFieldNames The identifierFieldSet which corresponds to identifierFieldIdSet.

Review comment:
       nit: Maybe: `The names of the identifier fields for the schema`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] humengyu2012 commented on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
humengyu2012 commented on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1023526290


   The javadoc has been fixed. Thanks for your review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] humengyu2012 commented on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
humengyu2012 commented on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1023914271


   @rdblue Do you mean the method should like this:
   ```java
      static Schema rebuildSchemaWithIdentifierFieldIds(Schema schema, Set<String> identifierFieldNames) {
       if (identifierFieldNames.size() == 0) {
         return schema;
       }
   
       Map<Integer, Integer> indexParents = TypeUtil.indexParents(schema.asStruct());
       Set<Integer> identifierFieldIds = identifierFieldNames.stream()
           .map(name -> {
             Types.NestedField field = schema.findField(name);
             Preconditions.checkArgument(field != null, "Column %s does not exist", name);
             
             int id = field.fieldId();
             Optional<Types.NestedField> parent = Optional.ofNullable(indexParents.get(id)).map(schema::findField);
             parent.ifPresent(parentField -> Preconditions.checkArgument(parentField.type().isStructType(),
                 "Cannot add field %s as an identifier field: must not be nested in %s", field.name(), parent));
             
             Preconditions.checkArgument(field.type().isPrimitiveType(),
                 "Cannot add field %s as an identifier field: not a primitive type field", field.name());
             return id;
           }).collect(Collectors.toSet());
   
       List<Types.NestedField> columns = schema.columns().stream()
           .map(column -> identifierFieldIds.contains(column.fieldId()) ? column.asRequired() : column)
           .collect(Collectors.toList());
       return new Schema(columns, identifierFieldIds);
     }
   ```
   The original implementation of this method looked like this, but @pvary and me both think it it would be better to simply just check the first level fields instead of a recursive check. We just want to simplify the calculation for finding fields. I would like to know your suggestions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] humengyu2012 commented on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
humengyu2012 commented on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1015086881


   Thanks for your guidance, I have modified my code according to your suggestion, can you help me look at it again?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] humengyu2012 commented on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
humengyu2012 commented on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1016635114


   New pr is https://github.com/apache/iceberg/pull/3929.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r793700820



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveSchemaUtil.java
##########
@@ -75,7 +91,46 @@ public static Schema convert(List<FieldSchema> fieldSchemas, boolean autoConvert
       typeInfos.add(TypeInfoUtils.getTypeInfoFromTypeString(col.getType()));
       comments.add(col.getComment());
     }
-    return HiveSchemaConverter.convert(names, typeInfos, comments, autoConvert);
+    Schema schema = HiveSchemaConverter.convert(names, typeInfos, comments, autoConvert);
+    return rebuildSchemaWithIdentifierFieldIds(schema, identifierFieldNames);
+  }
+
+  /**
+   * Rebuild a schema with given schema and identifierFieldNames
+   * @param schema The origin schema.
+   * @param identifierFieldNames The identifierFieldNames.

Review comment:
       `The names of the identifier fields in the new schema`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1022142871


   > I'm not sure if this is a good idea and would like to ask for your advice. @pvary
   
   We are adding multiple values to the sets, so we have to change them anyway. It is good if we have the same formatting minimally in file level. So this is fine by me. Thanks
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r790760598



##########
File path: hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveSchemaUtil.java
##########
@@ -170,6 +172,38 @@ public void testConversionWithoutLastComment() {
     Assert.assertEquals(expected.asStruct(), schema.asStruct());
   }
 
+  @Test
+  public void testRebuildSchemaWithIdentifierFieldIds() {
+    Schema actualSchema = HiveSchemaUtil.rebuildSchemaWithIdentifierFieldIds(COMPLEX_ICEBERG_SCHEMA,
+            ImmutableSet.of("id", "name"));
+
+    List<Types.NestedField> columns = Lists.newArrayList(COMPLEX_ICEBERG_SCHEMA.columns());
+    columns.set(0, columns.get(0).asRequired());
+    columns.set(1, columns.get(1).asRequired());
+    Schema expectedSchema = new Schema(columns, ImmutableSet.of(0, 1));
+
+    Assert.assertEquals(SchemaParser.toJson(expectedSchema), SchemaParser.toJson(actualSchema));
+  }
+
+  @Test
+  public void testRebuildSchemaWithIdentifierFieldIdsError() {
+    Assert.assertThrows("Cannot add field `not_exist_column` as an identifier field: " +
+                    "the field must exist on the root level",
+            IllegalArgumentException.class,
+            () -> HiveSchemaUtil.rebuildSchemaWithIdentifierFieldIds(COMPLEX_ICEBERG_SCHEMA,
+                    ImmutableSet.of("id", "not_exist_column")));
+    Assert.assertThrows("Cannot add field `employee_info.id` as an identifier field: " +
+                    "the field must exist on the root level",
+            IllegalArgumentException.class,
+            () -> HiveSchemaUtil.rebuildSchemaWithIdentifierFieldIds(COMPLEX_ICEBERG_SCHEMA,
+                    ImmutableSet.of("id", "employee_info.id")));
+    Assert.assertThrows("Cannot add field `employee_info` as an identifier field: " +
+                    "only primitive fields are allowed",
+            IllegalArgumentException.class,
+            () -> HiveSchemaUtil.rebuildSchemaWithIdentifierFieldIds(COMPLEX_ICEBERG_SCHEMA,
+                    ImmutableSet.of("id", "employee_info")));
+  }
+

Review comment:
       Can we add tests for the new `convert` methods where we check the identifiers?

##########
File path: hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveSchemaUtil.java
##########
@@ -170,6 +172,38 @@ public void testConversionWithoutLastComment() {
     Assert.assertEquals(expected.asStruct(), schema.asStruct());
   }
 
+  @Test
+  public void testRebuildSchemaWithIdentifierFieldIds() {
+    Schema actualSchema = HiveSchemaUtil.rebuildSchemaWithIdentifierFieldIds(COMPLEX_ICEBERG_SCHEMA,
+            ImmutableSet.of("id", "name"));
+
+    List<Types.NestedField> columns = Lists.newArrayList(COMPLEX_ICEBERG_SCHEMA.columns());
+    columns.set(0, columns.get(0).asRequired());
+    columns.set(1, columns.get(1).asRequired());
+    Schema expectedSchema = new Schema(columns, ImmutableSet.of(0, 1));
+
+    Assert.assertEquals(SchemaParser.toJson(expectedSchema), SchemaParser.toJson(actualSchema));
+  }
+
+  @Test
+  public void testRebuildSchemaWithIdentifierFieldIdsError() {
+    Assert.assertThrows("Cannot add field `not_exist_column` as an identifier field: " +
+                    "the field must exist on the root level",
+            IllegalArgumentException.class,
+            () -> HiveSchemaUtil.rebuildSchemaWithIdentifierFieldIds(COMPLEX_ICEBERG_SCHEMA,
+                    ImmutableSet.of("id", "not_exist_column")));
+    Assert.assertThrows("Cannot add field `employee_info.id` as an identifier field: " +
+                    "the field must exist on the root level",
+            IllegalArgumentException.class,
+            () -> HiveSchemaUtil.rebuildSchemaWithIdentifierFieldIds(COMPLEX_ICEBERG_SCHEMA,
+                    ImmutableSet.of("id", "employee_info.id")));
+    Assert.assertThrows("Cannot add field `employee_info` as an identifier field: " +
+                    "only primitive fields are allowed",
+            IllegalArgumentException.class,
+            () -> HiveSchemaUtil.rebuildSchemaWithIdentifierFieldIds(COMPLEX_ICEBERG_SCHEMA,
+                    ImmutableSet.of("id", "employee_info")));
+  }
+

Review comment:
       Could we add tests for the new `convert` methods where we check the identifiers please?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r791383444



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveSchemaUtil.java
##########
@@ -75,7 +84,43 @@ public static Schema convert(List<FieldSchema> fieldSchemas, boolean autoConvert
       typeInfos.add(TypeInfoUtils.getTypeInfoFromTypeString(col.getType()));
       comments.add(col.getComment());
     }
-    return HiveSchemaConverter.convert(names, typeInfos, comments, autoConvert);
+    Schema schema = HiveSchemaConverter.convert(names, typeInfos, comments, autoConvert);
+    return rebuildSchemaWithIdentifierFieldIds(schema, identifierFieldNames);

Review comment:
       Maybe a quick return inside of the method 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] humengyu2012 edited a comment on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
humengyu2012 edited a comment on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1022002048


   @pvary Is the format OK? My code is always formatted automatically because I have imported the format profile [intellij-java-palantir-style.xml](https://github.com/apache/iceberg/blob/master/.baseline/idea/intellij-java-palantir-style.xml).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r793701164



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveSchemaUtil.java
##########
@@ -75,7 +91,46 @@ public static Schema convert(List<FieldSchema> fieldSchemas, boolean autoConvert
       typeInfos.add(TypeInfoUtils.getTypeInfoFromTypeString(col.getType()));
       comments.add(col.getComment());
     }
-    return HiveSchemaConverter.convert(names, typeInfos, comments, autoConvert);
+    Schema schema = HiveSchemaConverter.convert(names, typeInfos, comments, autoConvert);
+    return rebuildSchemaWithIdentifierFieldIds(schema, identifierFieldNames);
+  }
+
+  /**
+   * Rebuild a schema with given schema and identifierFieldNames
+   * @param schema The origin schema.
+   * @param identifierFieldNames The identifierFieldNames.
+   * @return New schema with IdentifierFieldIds.
+   */
+  @VisibleForTesting
+  static Schema rebuildSchemaWithIdentifierFieldIds(Schema schema, Set<String> identifierFieldNames) {
+    if (identifierFieldNames.size() == 0) {
+      return schema;
+    }
+    // Identifier fields in nested field are not supported, so we just check the first level columns.

Review comment:
       nit: extra new line




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r793699177



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveSchemaUtil.java
##########
@@ -54,7 +58,18 @@ private HiveSchemaUtil() {
    * @return An equivalent Iceberg Schema
    */
   public static Schema convert(List<FieldSchema> fieldSchemas) {
-    return convert(fieldSchemas, false);
+    return convert(fieldSchemas, false, Collections.emptySet());
+  }
+
+  /**
+   * Converts a Hive schema (list of FieldSchema objects) to an Iceberg schema. If some of the types are not convertible
+   * then exception is thrown.
+   * @param fieldSchemas The list of the columns.
+   * @param identifierFieldNames The identifierFieldSet which corresponds to identifierFieldIdSet.

Review comment:
       nit: Maybe: `The names of the identifier fields for this schema`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] humengyu2012 removed a comment on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
humengyu2012 removed a comment on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1022140752


   I reformatted `HiveIcebergMetaHook.PARAMETERS_TO_REMOVE` and `HiveIcebergMetaHook.PROPERTIES_TO_REMOVE` with the following reference code
   `org.apache.iceberg.MetadataColumns#META_IDS`
   ![image](https://user-images.githubusercontent.com/45056332/151159926-309dccc3-456b-4b0b-b8d1-795f7358cf7a.png)
   
   I'm not sure if this is a good idea and would like to ask for your advice.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] humengyu2012 commented on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
humengyu2012 commented on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1022140752


   I reformatted `HiveIcebergMetaHook.PARAMETERS_TO_REMOVE` and `HiveIcebergMetaHook.PROPERTIES_TO_REMOVE` with the following reference code
   `org.apache.iceberg.MetadataColumns#META_IDS`
   ![image](https://user-images.githubusercontent.com/45056332/151159926-309dccc3-456b-4b0b-b8d1-795f7358cf7a.png)
   
   I'm not sure if this is a good idea and would like to ask for your advice.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r792394985



##########
File path: mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java
##########
@@ -51,15 +51,22 @@
 public class HiveIcebergMetaHook implements HiveMetaHook {
   private static final Logger LOG = LoggerFactory.getLogger(HiveIcebergMetaHook.class);
   private static final Set<String> PARAMETERS_TO_REMOVE = ImmutableSet
-      .of(InputFormatConfig.TABLE_SCHEMA, Catalogs.LOCATION, Catalogs.NAME);
+      .of(InputFormatConfig.TABLE_SCHEMA,
+          Catalogs.LOCATION,
+          Catalogs.NAME,
+          InputFormatConfig.IDENTIFIER_FIELD_NAMES,
+          InputFormatConfig.IDENTIFIER_FIELDS_SEPARATOR);
   private static final Set<String> PROPERTIES_TO_REMOVE = ImmutableSet
       // We don't want to push down the metadata location props to Iceberg from HMS,
       // since the snapshot pointer in HMS would always be one step ahead
       .of(BaseMetastoreTableOperations.METADATA_LOCATION_PROP,
       BaseMetastoreTableOperations.PREVIOUS_METADATA_LOCATION_PROP,

Review comment:
       nit: Could we please fix this formatting too?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] humengyu2012 edited a comment on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
humengyu2012 edited a comment on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1022002048


   @pvary Is the format OK? My code is always formatted automatically because I have import the format profile [checkstyle.xml](https://github.com/apache/iceberg/blob/master/.baseline/checkstyle/checkstyle.xml).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] humengyu2012 edited a comment on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
humengyu2012 edited a comment on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1014947194


   I have tried to remove `id++`, but it will change the id generation order, and lead to some test fail. Because the method `convertType()` will call method `convert()` recursively when struct type occurs, id will be added. So I think it shouldn't be changed in this pr. I can do this in the next pr if we need it. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] humengyu2012 removed a comment on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
humengyu2012 removed a comment on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1014947194


   I have tried to remove `id++`, but it will change the id generation order, and lead to some test fail. Because the method `convertType()` will call method `convert()` recursively when struct type occurs, id will be added. So I think it shouldn't be changed in this pr. I can do this in the next pr if we need it. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r786925851



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveSchemaConverter.java
##########
@@ -50,21 +52,32 @@ private HiveSchemaConverter(boolean autoConvert) {
     this.id = 0;
   }
 
-  static Schema convert(List<String> names, List<TypeInfo> typeInfos, List<String> comments, boolean autoConvert) {
+  static Schema convert(List<String> names, List<TypeInfo> typeInfos, List<String> comments, boolean autoConvert,
+                        Set<String> identifierFieldSet) {
     HiveSchemaConverter converter = new HiveSchemaConverter(autoConvert);
-    return new Schema(converter.convertInternal(names, typeInfos, comments));
+    List<Types.NestedField> fields = converter.convertInternal(names, typeInfos, comments, identifierFieldSet);
+    Set<Integer> identifierIdSet = HiveSchemaUtil.identifierFieldIdSetOrThrow(fields, identifierFieldSet);
+    return new Schema(fields, identifierIdSet);
   }
 
   static Type convert(TypeInfo typeInfo, boolean autoConvert) {
     HiveSchemaConverter converter = new HiveSchemaConverter(autoConvert);
     return converter.convertType(typeInfo);
   }
 
-  List<Types.NestedField> convertInternal(List<String> names, List<TypeInfo> typeInfos, List<String> comments) {
+  List<Types.NestedField> convertInternal(List<String> names, List<TypeInfo> typeInfos, List<String> comments,
+                                          Set<String> identifierFieldSet) {
     List<Types.NestedField> result = Lists.newArrayListWithExpectedSize(names.size());
     for (int i = 0; i < names.size(); ++i) {
-      result.add(Types.NestedField.optional(id++, names.get(i), convertType(typeInfos.get(i)),
-          (comments.isEmpty() || i >= comments.size()) ? null : comments.get(i)));
+      int fieldId = id;
+      id = id + 1;
+      String name = names.get(i);
+      String doc = (comments.isEmpty() || i >= comments.size()) ? null : comments.get(i);
+      if (identifierFieldSet.contains(name)) {

Review comment:
       I don't think this logic would work with nested identifier field names. For example, an integer field, `response.status` could be in the set of names, but this would not catch it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r786923916



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveSchemaConverter.java
##########
@@ -50,21 +52,32 @@ private HiveSchemaConverter(boolean autoConvert) {
     this.id = 0;
   }
 
-  static Schema convert(List<String> names, List<TypeInfo> typeInfos, List<String> comments, boolean autoConvert) {
+  static Schema convert(List<String> names, List<TypeInfo> typeInfos, List<String> comments, boolean autoConvert,
+                        Set<String> identifierFieldSet) {
     HiveSchemaConverter converter = new HiveSchemaConverter(autoConvert);
-    return new Schema(converter.convertInternal(names, typeInfos, comments));
+    List<Types.NestedField> fields = converter.convertInternal(names, typeInfos, comments, identifierFieldSet);
+    Set<Integer> identifierIdSet = HiveSchemaUtil.identifierFieldIdSetOrThrow(fields, identifierFieldSet);
+    return new Schema(fields, identifierIdSet);
   }
 
   static Type convert(TypeInfo typeInfo, boolean autoConvert) {
     HiveSchemaConverter converter = new HiveSchemaConverter(autoConvert);
     return converter.convertType(typeInfo);
   }
 
-  List<Types.NestedField> convertInternal(List<String> names, List<TypeInfo> typeInfos, List<String> comments) {
+  List<Types.NestedField> convertInternal(List<String> names, List<TypeInfo> typeInfos, List<String> comments,

Review comment:
       Why not create a Schema here that has the identifier fields already converted, rather than recovering them by checking field names again?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r786926261



##########
File path: hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveSchemaUtil.java
##########
@@ -170,6 +172,28 @@ public void testConversionWithoutLastComment() {
     Assert.assertEquals(expected.asStruct(), schema.asStruct());
   }
 
+  @Test
+  public void testIdentifierFieldIdSetOrThrow() {
+    ImmutableList<Types.NestedField> fields = ImmutableList.of(
+            optional(0, "c1", Types.LongType.get(), ""),
+            optional(1, "c2", Types.StringType.get(), ""),
+            optional(2, "c3", Types.StringType.get(), ""));

Review comment:
       Please remove the empty strings. If a comment is empty, it should be left null.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r790756748



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveSchemaUtil.java
##########
@@ -75,7 +83,42 @@ public static Schema convert(List<FieldSchema> fieldSchemas, boolean autoConvert
       typeInfos.add(TypeInfoUtils.getTypeInfoFromTypeString(col.getType()));
       comments.add(col.getComment());
     }
-    return HiveSchemaConverter.convert(names, typeInfos, comments, autoConvert);
+    Schema schema = HiveSchemaConverter.convert(names, typeInfos, comments, autoConvert);
+    return rebuildSchemaWithIdentifierFieldIds(schema, identifierFieldNames);
+  }
+
+  /**
+   * Rebuild a schema with given schema and identifierFieldNames
+   * @param schema The origin schema.
+   * @param identifierFieldNames The identifierFieldNames.
+   * @return New schema with IdentifierFieldIds.
+   */
+  public static Schema rebuildSchemaWithIdentifierFieldIds(Schema schema, Set<String> identifierFieldNames) {
+    // Identifier fields in nested field are not supported, so we just check the first level columns.
+    Map<String, Types.NestedField> columnsMap = schema.columns().stream()
+            .collect(Collectors.toMap(Types.NestedField::name, field -> field));

Review comment:
       nit: Please fix the indentation. Continuation is 4 spaces




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r791386294



##########
File path: mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerNoScan.java
##########
@@ -773,4 +773,52 @@ public void testDropHiveTableWithoutUnderlyingTable() throws IOException {
   private String getCurrentSnapshotForHiveCatalogTable(org.apache.iceberg.Table icebergTable) {
     return ((BaseMetastoreTableOperations) ((BaseTable) icebergTable).operations()).currentMetadataLocation();
   }
+
+  @Test
+  public void testCreateTableWithIdentifierIds() {
+    TableIdentifier tableIdentifier = TableIdentifier.of("default", "customers");
+    shell.executeStatement(
+        String.format("CREATE EXTERNAL TABLE %s ( " +
+                "  c1 INT, " +
+                "  c2 STRING, " +
+                "  c3 STRUCT<c4:STRING, c5:STRING> " +
+                ") " +
+                "PARTITIONED BY (c6 STRING) " +
+                "STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' %s " +
+                "TBLPROPERTIES (" +
+                "  '%s' = '%s'," +
+                "  '%s' = '%s'" +
+                ")",
+            tableIdentifier,
+            testTables.locationForCreateTableSQL(tableIdentifier),
+            InputFormatConfig.IDENTIFIER_FIELD_NAMES, "c1,c2",
+            InputFormatConfig.CATALOG_NAME, testTables.catalogName()));
+
+    org.apache.iceberg.Table table = testTables.loadTable(tableIdentifier);
+    Assert.assertEquals(ImmutableSet.of(1, 2), table.schema().identifierFieldIds());
+  }
+
+  @Test
+  public void testCreateTableWithIdentifierIdsError() {
+    TableIdentifier tableIdentifier = TableIdentifier.of("default", "customers");
+    Assert.assertThrows(
+        "Cannot add field c4 as an identifier field: must not in nested field",
+        IllegalArgumentException.class,
+        () -> shell.executeStatement(
+            String.format("CREATE EXTERNAL TABLE %s ( " +

Review comment:
       And this too 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] humengyu2012 commented on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
humengyu2012 commented on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1018687595


   @pvary I revert all of my previous commits, submit some new code. Thanks for your guidance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] humengyu2012 commented on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
humengyu2012 commented on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1020904201


   
   > Could TestTables.createHiveTableSQL help here?
   
   ![image](https://user-images.githubusercontent.com/45056332/150929210-e73844bc-d955-4cf6-abab-978f8d43328d.png)
   @pvary TestTables.createHiveTableSQL can not set columns, but I need some hive columns to test. Could I continue to use sql?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] humengyu2012 commented on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
humengyu2012 commented on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1022141532


   I reformatted `HiveIcebergMetaHook.PARAMETERS_TO_REMOVE` and `HiveIcebergMetaHook.PROPERTIES_TO_REMOVE` with the following reference code
   `org.apache.iceberg.MetadataColumns#META_IDS`
   ![image](https://user-images.githubusercontent.com/45056332/151159926-309dccc3-456b-4b0b-b8d1-795f7358cf7a.png)
   
   I'm not sure if this is a good idea and would like to ask for your advice.
   @pvary 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on a change in pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#discussion_r793700191



##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveSchemaUtil.java
##########
@@ -75,7 +91,46 @@ public static Schema convert(List<FieldSchema> fieldSchemas, boolean autoConvert
       typeInfos.add(TypeInfoUtils.getTypeInfoFromTypeString(col.getType()));
       comments.add(col.getComment());
     }
-    return HiveSchemaConverter.convert(names, typeInfos, comments, autoConvert);
+    Schema schema = HiveSchemaConverter.convert(names, typeInfos, comments, autoConvert);
+    return rebuildSchemaWithIdentifierFieldIds(schema, identifierFieldNames);
+  }
+
+  /**
+   * Rebuild a schema with given schema and identifierFieldNames

Review comment:
       `Rebuild a schema with given schema and identifier fields.`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] humengyu2012 edited a comment on pull request #3912: Hive: Support 'identifier-field-ids' when creating table in hive

Posted by GitBox <gi...@apache.org>.
humengyu2012 edited a comment on pull request #3912:
URL: https://github.com/apache/iceberg/pull/3912#issuecomment-1035117292


   Can anyone tell me what to do next? Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org