Posted to commits@hudi.apache.org by "Wenning Ding (Jira)" <ji...@apache.org> on 2019/11/07 19:40:00 UTC
[jira] [Created] (HUDI-325) Unable to query by Hive after updating HDFS Hudi table
Wenning Ding created HUDI-325:
---------------------------------
Summary: Unable to query by Hive after updating HDFS Hudi table
Key: HUDI-325
URL: https://issues.apache.org/jira/browse/HUDI-325
Project: Apache Hudi (incubating)
Issue Type: Bug
Reporter: Wenning Ding
h3. Description
While doing internal testing in EMR, we found that if the Hudi table path follows a format like hdfs:///user/... or hdfs:/user/..., then the Hudi table cannot be queried by Hive after an update.
h3. Reproduction
{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

val df = Seq(
(100, "event_name_900", "2015-01-01T13:51:39.340396Z", "type1"),
(101, "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
(104, "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
(105, "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
).toDF("event_id", "event_name", "event_ts", "event_type")
val tableName = "hudi_test"
val tablePath = "hdfs:///user/hadoop/" + tableName
// write hudi dataset
df.write.format("org.apache.hudi")
.option("hoodie.upsert.shuffle.parallelism", "2")
.option(HoodieWriteConfig.TABLE_NAME, tableName)
.option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
.option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type")
.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
.option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
.option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
.option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
.option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
.option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
.mode(SaveMode.Overwrite)
.save(tablePath)
// update hudi dataset
val df2 = Seq(
(100, "event_name_11111", "2015-01-01T13:51:39.340396Z", "type1"),
(107, "event_name_578", "2015-01-01T13:51:42.248818Z", "type3")
).toDF("event_id", "event_name", "event_ts", "event_type")
df2.write.format("org.apache.hudi")
.option("hoodie.upsert.shuffle.parallelism", "2")
.option(HoodieWriteConfig.TABLE_NAME, tableName)
.option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
.option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type")
.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
.option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
.option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
.option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
.option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
.option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
.mode(SaveMode.Append)
.save(tablePath)
{code}
Then query the table in Hive:
{code:java}
select count(*) from hudi_test;
{code}
The query fails with:
{code:java}
java.io.IOException: cannot find dir = hdfs://ip-172-30-6-236.ec2.internal:8020/user/hadoop/elb_logs_hudi_cow_8/2015-01-01/cb7531ac-dadf-4118-b722-55cb34bc66f2-0_34-7-336_20191104223321.parquet in pathToPartitionInfo: [hdfs:/user/hadoop/elb_logs_hudi_cow_8/2015-01-01]
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:394)
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:357)
at org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.schemaEvolved(SplitGrouper.java:284)
at org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.generateGroupedSplits(SplitGrouper.java:184)
at org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.generateGroupedSplits(SplitGrouper.java:161)
at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:207)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{code}
h3. Investigations
From my investigation, when updating Hudi table, Hudi uses this command:
{code:java}
hive> ALTER TABLE hudi_test PARTITION (partition_key='event_type')
> SET LOCATION 'hdfs:/user/hadoop/hudi_test/event_type';
{code}
This Hive command directly sets the partition location to 'hdfs:/user/hadoop/hudi_test/event_type' in the Hive Metastore.
The problem is that Hive can only recognize a fully qualified URI such as 'hdfs://ip-172-30-6-236.ec2.internal:8020/user/hadoop/hudi_test/event_type'.
That is why the Hive query breaks with the "cannot find dir ... in pathToPartitionInfo" error shown above.
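The mismatch can be illustrated with plain java.net.URI (the host name and the 'type1' partition value below are taken from the repro above for illustration; this sketches the comparison problem, not Hive's actual lookup code):
{code:java}
import java.net.URI

// Partition location as synced by Hudi: scheme present, but no authority (host:port)
val syncedLocation = new URI("hdfs:/user/hadoop/hudi_test/type1")

// Fully qualified location that Hive derives from the actual data files
val splitLocation =
  new URI("hdfs://ip-172-30-6-236.ec2.internal:8020/user/hadoop/hudi_test/type1")

// Both URIs point at the same HDFS directory...
println(syncedLocation.getPath == splitLocation.getPath) // true

// ...but they are not equal as URIs, so a lookup keyed by one form
// cannot be satisfied by the other -- the pathToPartitionInfo failure
println(syncedLocation == splitLocation) // false
println(syncedLocation.getAuthority)     // null (no host:port)
{code}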
--
This message was sent by Atlassian Jira
(v8.3.4#803005)