Posted to commits@hudi.apache.org by "lamber-ken (Jira)" <ji...@apache.org> on 2019/12/09 16:18:00 UTC

[jira] [Closed] (HUDI-325) Unable to query by Hive after updating HDFS Hudi table

     [ https://issues.apache.org/jira/browse/HUDI-325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lamber-ken closed HUDI-325.
---------------------------
    Resolution: Fixed

Fixed via master: d6e83e8f49828940159cd34711cc88ee7b42dc1c

> Unable to query by Hive after updating HDFS Hudi table
> ------------------------------------------------------
>
>                 Key: HUDI-325
>                 URL: https://issues.apache.org/jira/browse/HUDI-325
>             Project: Apache Hudi (incubating)
>          Issue Type: Bug
>            Reporter: Wenning Ding
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> h3. Description
> While doing internal testing on EMR, we found that if the Hudi table path follows a format like hdfs:///user/... or hdfs:/user/..., then the Hudi table can no longer be queried by Hive after an update.
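> A minimal sketch of why this path format matters (illustrative only, assuming Hadoop's org.apache.hadoop.fs.Path behaviour; not part of the original report): a URI like hdfs:///user/... has an empty authority, and round-tripping it through Path drops the host:port entirely, which is how locations such as hdfs:/user/... can end up in the metastore.
> {code:java}
> // Illustrative sketch, not part of the original report.
> import org.apache.hadoop.fs.Path
> 
> val p = new Path("hdfs:///user/hadoop/hudi_test")
> // The empty authority is dropped, so this typically prints:
> //   hdfs:/user/hadoop/hudi_test
> println(p.toString)
> 
> // A fully qualified location keeps the NameNode authority, e.g.
> //   hdfs://ip-172-30-6-236.ec2.internal:8020/user/hadoop/hudi_test
> {code}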
> h3. Reproduction
> {code:java}
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.config.HoodieWriteConfig
> import org.apache.spark.sql.SaveMode
> 
> val df = Seq(
>   (100, "event_name_900", "2015-01-01T13:51:39.340396Z", "type1"),
>   (101, "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
>   (104, "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
>   (105, "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
>   ).toDF("event_id", "event_name", "event_ts", "event_type")
> var tableName = "hudi_test"
> var tablePath = "hdfs:///user/hadoop/" + tableName
> // write hudi dataset
> df.write.format("org.apache.hudi")
>   .option("hoodie.upsert.shuffle.parallelism", "2")
>   .option(HoodieWriteConfig.TABLE_NAME, tableName)
>   .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
>   .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
>   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type")
>   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
>   .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>   .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
>   .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
>   .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
>   .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
>   .mode(SaveMode.Overwrite)
>   .save(tablePath)
> // update hudi dataset
> val df2 = Seq(
>   (100, "event_name_11111", "2015-01-01T13:51:39.340396Z", "type1"),
>   (107, "event_name_578", "2015-01-01T13:51:42.248818Z", "type3")
>   ).toDF("event_id", "event_name", "event_ts", "event_type")
> df2.write.format("org.apache.hudi")
>    .option("hoodie.upsert.shuffle.parallelism", "2")
>    .option(HoodieWriteConfig.TABLE_NAME, tableName)
>    .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
>    .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
>    .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
>    .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type")
>    .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
>    .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>    .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
>    .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
>    .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
>    .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
>    .mode(SaveMode.Append)
>    .save(tablePath)
> {code}
> Then run a query in Hive:
> {code:java}
> select count(*) from hudi_test;
> {code}
> It fails with:
> {code:java}
> java.io.IOException: cannot find dir = hdfs://ip-172-30-6-236.ec2.internal:8020/user/hadoop/elb_logs_hudi_cow_8/2015-01-01/cb7531ac-dadf-4118-b722-55cb34bc66f2-0_34-7-336_20191104223321.parquet in pathToPartitionInfo: [hdfs:/user/hadoop/elb_logs_hudi_cow_8/2015-01-01]
>     at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:394)
>     at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:357)
>     at org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.schemaEvolved(SplitGrouper.java:284)
>     at org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.generateGroupedSplits(SplitGrouper.java:184)
>     at org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.generateGroupedSplits(SplitGrouper.java:161)
>     at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:207)
>     at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
>     at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
>     at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
>     at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)     
> {code}
> h3. Investigations
> From my investigation, when updating the Hudi table, Hudi uses this command:
> {code:java}
> hive> ALTER TABLE hudi_test PARTITION (partition_key='event_type') 
>     > SET LOCATION 'hdfs:/user/hadoop/hudi_test/event_type';
> {code}
> This Hive command directly sets the partition location to 'hdfs:/user/hadoop/hudi_test/event_type' in the Hive Metastore.
> The problem is that Hive can only recognize a fully qualified URI such as 'hdfs://ip-172-30-6-236.ec2.internal:8020/user/hadoop/hudi_test/event_type', because Hive uses string matching to map a data file back to its partition path.
> That is why the Hive query breaks and reports that it cannot find the directory in pathToPartitionInfo.
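> A minimal sketch of the direction a fix can take (assuming access to the table's Hadoop Configuration; the helper below is hypothetical and this is not the actual Hudi patch): qualify the partition location against the FileSystem before handing it to the metastore, so the ALTER TABLE ... SET LOCATION carries the full hdfs://host:port prefix.
> {code:java}
> // Illustrative sketch only: fully qualify a partition path before syncing it to Hive.
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> 
> // Hypothetical helper for illustration, not Hudi's actual API.
> def toFullyQualifiedPath(location: String, conf: Configuration): String = {
>   val path = new Path(location)
>   // Resolves to the default FS (e.g. hdfs://<namenode>:8020) when the
>   // location has the hdfs scheme but no authority.
>   val fs = path.getFileSystem(conf)
>   // makeQualified fills in the missing scheme/authority from the FileSystem,
>   // turning hdfs:/user/... into hdfs://<namenode>:8020/user/...
>   fs.makeQualified(path).toString
> }
> 
> // e.g. toFullyQualifiedPath("hdfs:/user/hadoop/hudi_test/type1", new Configuration())
> {code}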



--
This message was sent by Atlassian Jira
(v8.3.4#803005)