Posted to commits@hudi.apache.org by "lamber-ken (Jira)" <ji...@apache.org> on 2019/12/09 16:18:00 UTC
[jira] [Closed] (HUDI-325) Unable to query by Hive after updating HDFS Hudi table
[ https://issues.apache.org/jira/browse/HUDI-325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
lamber-ken closed HUDI-325.
---------------------------
Resolution: Fixed
Fixed via master: d6e83e8f49828940159cd34711cc88ee7b42dc1c
> Unable to query by Hive after updating HDFS Hudi table
> ------------------------------------------------------
>
> Key: HUDI-325
> URL: https://issues.apache.org/jira/browse/HUDI-325
> Project: Apache Hudi (incubating)
> Issue Type: Bug
> Reporter: Wenning Ding
> Priority: Major
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> h3. Description
> While doing internal testing on EMR, we found that if the Hudi table path follows a format like hdfs:///user/... or hdfs:/user/..., then the Hudi table cannot be queried by Hive after an update.
> h3. Reproduction
> {code:java}
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.config.HoodieWriteConfig
> import org.apache.spark.sql.SaveMode
>
> val df = Seq(
>   (100, "event_name_900", "2015-01-01T13:51:39.340396Z", "type1"),
>   (101, "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
>   (104, "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
>   (105, "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
> ).toDF("event_id", "event_name", "event_ts", "event_type")
>
> val tableName = "hudi_test"
> val tablePath = "hdfs:///user/hadoop/" + tableName
> // write hudi dataset
> df.write.format("org.apache.hudi")
>   .option("hoodie.upsert.shuffle.parallelism", "2")
>   .option(HoodieWriteConfig.TABLE_NAME, tableName)
>   .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
>   .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
>   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type")
>   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
>   .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>   .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
>   .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
>   .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
>   .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
>   .mode(SaveMode.Overwrite)
>   .save(tablePath)
> // update hudi dataset
> val df2 = Seq(
>   (100, "event_name_11111", "2015-01-01T13:51:39.340396Z", "type1"),
>   (107, "event_name_578", "2015-01-01T13:51:42.248818Z", "type3")
> ).toDF("event_id", "event_name", "event_ts", "event_type")
>
> df2.write.format("org.apache.hudi")
>   .option("hoodie.upsert.shuffle.parallelism", "2")
>   .option(HoodieWriteConfig.TABLE_NAME, tableName)
>   .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
>   .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
>   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type")
>   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
>   .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>   .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
>   .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
>   .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
>   .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
>   .mode(SaveMode.Append)
>   .save(tablePath)
> {code}
> Then query the table in Hive:
> {code:java}
> select count(*) from hudi_test;
> {code}
> The query fails with:
> {code:java}
> java.io.IOException: cannot find dir = hdfs://ip-172-30-6-236.ec2.internal:8020/user/hadoop/elb_logs_hudi_cow_8/2015-01-01/cb7531ac-dadf-4118-b722-55cb34bc66f2-0_34-7-336_20191104223321.parquet in pathToPartitionInfo: [hdfs:/user/hadoop/elb_logs_hudi_cow_8/2015-01-01]
> at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:394)
> at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:357)
> at org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.schemaEvolved(SplitGrouper.java:284)
> at org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.generateGroupedSplits(SplitGrouper.java:184)
> at org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.generateGroupedSplits(SplitGrouper.java:161)
> at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:207)
> at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
> at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
> at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
> at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> h3. Investigations
> From my investigation, when updating a Hudi table, Hudi syncs the partition location to the Hive Metastore with a command like:
> {code:java}
> hive> ALTER TABLE hudi_test PARTITION (partition_key='event_type')
> > SET LOCATION 'hdfs:/user/hadoop/hudi_test/event_type';
> {code}
> This Hive command directly updates the partition path to 'hdfs:/user/hadoop/hudi_test/event_type' in the Hive Metastore.
> The problem is that Hive only recognizes a fully qualified URI such as 'hdfs://ip-172-30-6-236.ec2.internal:8020/user/hadoop/hudi_test/event_type', because Hive uses string matching to find the partition path for a data file.
> That is why the Hive query breaks with the "cannot find dir ... in pathToPartitionInfo" error shown above.
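The string-matching mismatch described in the issue can be illustrated in plain Java, outside Hudi and Hive. This is only a sketch of the failure mode: the host name `namenode` is hypothetical, and Hive's actual comparison happens inside `HiveFileFormatUtils.getPartitionDescFromPathRecursively` (visible in the stack trace), not via `java.net.URI`.

```java
import java.net.URI;

// Sketch: a scheme-only location (as registered by Hudi's ALTER TABLE ...
// SET LOCATION) never string-matches the fully qualified location Hive
// resolves for a data file's directory, even though both name the same path.
public class PartitionPathMismatch {
    public static void main(String[] args) {
        // Location as registered in the Hive Metastore (no authority part)
        URI registered = URI.create("hdfs:/user/hadoop/hudi_test/type1");
        // Fully qualified location Hive resolves for the data file's directory
        URI resolved = URI.create("hdfs://namenode:8020/user/hadoop/hudi_test/type1");

        // String comparison (what Hive effectively does) -> no match
        System.out.println(registered.toString().equals(resolved.toString())); // false
        // The path components alone are identical
        System.out.println(registered.getPath().equals(resolved.getPath()));   // true
        // The registered URI carries no authority at all
        System.out.println(registered.getAuthority());                         // null
    }
}
```

A natural remedy, and presumably the direction of the linked fix, is to fully qualify the location (for example via Hadoop's `FileSystem#makeQualified`) before registering it with the metastore; that detail is an assumption here, not taken from this issue.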
--
This message was sent by Atlassian Jira
(v8.3.4#803005)