Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/04/26 13:23:25 UTC

[GitHub] [hudi] onlywangyh commented on a diff in pull request #5434: [HUDI-3978] Fix use of partition path field as hive partition field in flink

onlywangyh commented on code in PR #5434:
URL: https://github.com/apache/hudi/pull/5434#discussion_r858555712


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/FilePathUtils.java:
##########
@@ -420,9 +420,9 @@ public static org.apache.flink.core.fs.Path toFlinkPath(Path path) {
    * @return array of the partition fields
    */
   public static String[] extractPartitionKeys(org.apache.flink.configuration.Configuration conf) {
-    if (FlinkOptions.isDefaultValueDefined(conf, FlinkOptions.PARTITION_PATH_FIELD)) {
+    if (FlinkOptions.isDefaultValueDefined(conf, FlinkOptions.HIVE_SYNC_PARTITION_FIELDS)) {
       return new String[0];
     }
-    return conf.getString(FlinkOptions.PARTITION_PATH_FIELD).split(",");
+    return conf.getString(FlinkOptions.HIVE_SYNC_PARTITION_FIELDS).split(",");
   }

Review Comment:
   In HiveSyncContext, `PARTITION_PATH_FIELD` is assigned to the Hive sync partition fields. I think these two parameters, `PARTITION_PATH_FIELD` and `HIVE_SYNC_PARTITION_FIELDS`,
   have different meanings in Hudi:
   `PARTITION_PATH_FIELD` is used by the Hudi KeyGenerator to build the partition path.
   `HIVE_SYNC_PARTITION_FIELDS` is used by Hive to define the partition fields.
   
   The function _extractPartitionKeys_ should return the Hive partition field keys rather than the Hudi partition path field. Confusing the two values can cause errors.
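   
   A minimal sketch of the distinction, assuming a plain `Map` in place of the Flink `Configuration`. The class name, the method name `extractHivePartitionKeys`, and the map keys are hypothetical placeholders, not the real Hudi option keys or API:
   
   ```java
   import java.util.HashMap;
   import java.util.Map;
   
   final class PartitionKeySketch {
     // Placeholder keys for illustration only; NOT the real Hudi option keys.
     static final String PARTITION_PATH_FIELD = "PARTITION_PATH_FIELD";
     static final String HIVE_SYNC_PARTITION_FIELDS = "HIVE_SYNC_PARTITION_FIELDS";
   
     // Reads only the Hive sync setting; the key generator's partition path
     // field is deliberately ignored here.
     static String[] extractHivePartitionKeys(Map<String, String> conf) {
       String value = conf.get(HIVE_SYNC_PARTITION_FIELDS);
       if (value == null || value.trim().isEmpty()) {
         return new String[0];
       }
       // Split on commas, tolerating whitespace around each field name.
       return value.trim().split("\\s*,\\s*");
     }
   
     public static void main(String[] args) {
       Map<String, String> conf = new HashMap<>();
       conf.put(PARTITION_PATH_FIELD, "datetime");
       conf.put(HIVE_SYNC_PARTITION_FIELDS, "inc_day");
       // Prints the Hive partition field, not the Hudi partition path field.
       System.out.println(String.join(",", extractHivePartitionKeys(conf)));
     }
   }
   ```
   
   With these two values the sketch returns only `inc_day`, so the Hive partition column no longer depends on what the key generator reads.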
   
   
   For example, suppose we use TimestampBasedAvroKeyGenerator and set the Hudi partition path field to the same value as the Hive partition fields. This causes problems:
   `
   PARTITION_PATH_FIELD=datetime
   HIVE_SYNC_PARTITION_FIELDS=datetime
   `
   
   **In Hudi:** the _1596074902000L_ value is read and converted into a string partition path such as _2020-07-30_.  
   **In Hive:** the table is created like this:
   ```
    CREATE EXTERNAL TABLE `testTable`(
      `_hoodie_commit_time` string COMMENT '',       
      `_hoodie_commit_seqno` string COMMENT '',          
      `_hoodie_record_key` string COMMENT '',
      `_hoodie_partition_path` string COMMENT '',
      `_hoodie_file_name` string COMMENT '',               
      `id` int COMMENT '',
      `datetime` bigint COMMENT ''        
      )
    PARTITIONED BY (`datetime` string COMMENT '')...
   ```
   The partition value _datetime=2020-07-30_ is also added to Hive. The name `datetime` now refers to both a bigint column and a string partition field, so the two conflict: we can't read the datetime value from this Hive table, and the table's partitioning is broken.
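   
   To make the type mismatch concrete, here is a rough sketch of the conversion described above, using `java.time`. It is not the TimestampBasedAvroKeyGenerator implementation; the class and method names are hypothetical, and the UTC time zone and `yyyy-MM-dd` output format are assumptions:
   
   ```java
   import java.time.Instant;
   import java.time.ZoneOffset;
   import java.time.format.DateTimeFormatter;
   
   final class TimestampPartitionSketch {
     // Assumed output format and time zone; the real values come from the
     // key generator configuration.
     private static final DateTimeFormatter DAY =
         DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC);
   
     // Turns an epoch-millis column value into a day-level partition path.
     static String toPartitionPath(long epochMillis) {
       return DAY.format(Instant.ofEpochMilli(epochMillis));
     }
   
     public static void main(String[] args) {
       long datetime = 1596074902000L;                 // bigint stored in the data column
       String partitionPath = toPartitionPath(datetime); // string used as the partition value
       System.out.println(datetime + " -> " + partitionPath); // 1596074902000 -> 2020-07-30
     }
   }
   ```
   
   The data column keeps the bigint, while the partition carries the formatted string, which is why a single name cannot serve both roles.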
   
   When `PARTITION_PATH_FIELD` is set to a different value than `HIVE_SYNC_PARTITION_FIELDS`, like this:
   `
   PARTITION_PATH_FIELD=datetime
   HIVE_SYNC_PARTITION_FIELDS=inc_day
   `
   We get a table like this:
   ```
   CREATE EXTERNAL TABLE `testTable`(
      `_hoodie_commit_time` string COMMENT '',       
      `_hoodie_commit_seqno` string COMMENT '',          
      `_hoodie_record_key` string COMMENT '',
      `_hoodie_partition_path` string COMMENT '',
      `_hoodie_file_name` string COMMENT '',               
      `id` int COMMENT '',
      `datetime` bigint COMMENT ''        
      )
    PARTITIONED BY (`inc_day` string COMMENT '')...
   ```
   In this case the _datetime_ value can be read normally, and _inc_day_ also works as a partition field.


