Posted to commits@hudi.apache.org by "Wenning Ding (Jira)" <ji...@apache.org> on 2019/11/07 19:40:00 UTC
[jira] [Created] (HUDI-325) Unable to query by Hive after updating HDFS Hudi table
Wenning Ding created HUDI-325:
---------------------------------
Summary: Unable to query by Hive after updating HDFS Hudi table
Key: HUDI-325
URL: https://issues.apache.org/jira/browse/HUDI-325
Project: Apache Hudi (incubating)
Issue Type: Bug
Reporter: Wenning Ding
h3. Description
While doing internal testing in EMR, we found that if the Hudi table path follows a format like hdfs:///user/... or hdfs:/user/..., then the Hudi table cannot be queried by Hive after an update.
h3. Reproduction
{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

val df = Seq(
(100, "event_name_900", "2015-01-01T13:51:39.340396Z", "type1"),
(101, "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
(104, "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
(105, "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
).toDF("event_id", "event_name", "event_ts", "event_type")
val tableName = "hudi_test"
val tablePath = "hdfs:///user/hadoop/" + tableName
// write hudi dataset
df.write.format("org.apache.hudi")
.option("hoodie.upsert.shuffle.parallelism", "2")
.option(HoodieWriteConfig.TABLE_NAME, tableName)
.option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
.option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type")
.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
.option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
.option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
.option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
.option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
.option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
.mode(SaveMode.Overwrite)
.save(tablePath)
// update hudi dataset
val df2 = Seq(
(100, "event_name_11111", "2015-01-01T13:51:39.340396Z", "type1"),
(107, "event_name_578", "2015-01-01T13:51:42.248818Z", "type3")
).toDF("event_id", "event_name", "event_ts", "event_type")
df2.write.format("org.apache.hudi")
.option("hoodie.upsert.shuffle.parallelism", "2")
.option(HoodieWriteConfig.TABLE_NAME, tableName)
.option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
.option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type")
.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
.option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
.option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
.option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
.option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
.option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
.mode(SaveMode.Append)
.save(tablePath)
{code}
Then query the table in Hive:
{code:java}
select count(*) from hudi_test;
{code}
The query fails with:
{code:java}
java.io.IOException: cannot find dir = hdfs://ip-172-30-6-236.ec2.internal:8020/user/hadoop/elb_logs_hudi_cow_8/2015-01-01/cb7531ac-dadf-4118-b722-55cb34bc66f2-0_34-7-336_20191104223321.parquet in pathToPartitionInfo: [hdfs:/user/hadoop/elb_logs_hudi_cow_8/2015-01-01]
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:394)
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:357)
at org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.schemaEvolved(SplitGrouper.java:284)
at org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.generateGroupedSplits(SplitGrouper.java:184)
at org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.generateGroupedSplits(SplitGrouper.java:161)
at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:207)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{code}
h3. Investigations
From my investigation, when updating Hudi table, Hudi uses this command:
{code:java}
hive> ALTER TABLE hudi_test PARTITION (partition_key='event_type')
> SET LOCATION 'hdfs:/user/hadoop/hudi_test/event_type';
{code}
This Hive command directly sets the partition location to 'hdfs:/user/hadoop/hudi_test/event_type' in the Hive Metastore.
The problem is that Hive can only recognize a fully qualified URI such as 'hdfs://ip-172-30-6-236.ec2.internal:8020/user/hadoop/hudi_test/event_type'.
That is why the Hive query breaks with the "cannot find dir ... in pathToPartitionInfo" error shown above.
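The mismatch can be illustrated with plain java.net.URI (the host name and the 'type1' partition value below are taken from the repro above for illustration; this sketches the comparison problem, not Hive's actual lookup code):
{code:java}
import java.net.URI

// Partition location as synced by Hudi: scheme present, but no authority (host:port)
val syncedLocation = new URI("hdfs:/user/hadoop/hudi_test/type1")

// Fully qualified location that Hive derives from the actual data files
val splitLocation =
  new URI("hdfs://ip-172-30-6-236.ec2.internal:8020/user/hadoop/hudi_test/type1")

// Both URIs point at the same HDFS directory...
println(syncedLocation.getPath == splitLocation.getPath) // true

// ...but they are not equal as URIs, so a lookup keyed by one form
// cannot be satisfied by the other -- the pathToPartitionInfo failure
println(syncedLocation == splitLocation) // false
println(syncedLocation.getAuthority)     // null (no host:port)
{code}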
--
This message was sent by Atlassian Jira
(v8.3.4#803005)