Posted to commits@hudi.apache.org by "Wenning Ding (Jira)" <ji...@apache.org> on 2019/11/07 19:41:00 UTC

[jira] [Updated] (HUDI-325) Unable to query by Hive after updating HDFS Hudi table

     [ https://issues.apache.org/jira/browse/HUDI-325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenning Ding updated HUDI-325:
------------------------------
    Description: 
h3. Description

While doing internal testing in EMR, we found that if the Hudi table path uses a format like hdfs:///user/... or hdfs:/user/..., the table can no longer be queried by Hive after an update.
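Note that both problem forms parse to a Path whose URI has a scheme but no authority (the namenode host:port), unlike the fully qualified form. A quick illustration with plain Hadoop Path parsing (nn-host below is only a placeholder):
{code:java}
import org.apache.hadoop.fs.Path

// Both problem forms carry a scheme but no authority (namenode host:port);
// "nn-host" is just a placeholder for a real namenode address.
println(new Path("hdfs:///user/hadoop/hudi_test").toUri.getAuthority)             // null
println(new Path("hdfs:/user/hadoop/hudi_test").toUri.getAuthority)               // null
println(new Path("hdfs://nn-host:8020/user/hadoop/hudi_test").toUri.getAuthority) // nn-host:8020
{code}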
h3. Reproduction
{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

val df = Seq(
  (100, "event_name_900", "2015-01-01T13:51:39.340396Z", "type1"),
  (101, "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
  (104, "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
  (105, "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
  ).toDF("event_id", "event_name", "event_ts", "event_type")

var tableName = "hudi_test"
var tablePath = "hdfs:///user/hadoop/" + tableName

// write hudi dataset
df.write.format("org.apache.hudi")
  .option("hoodie.upsert.shuffle.parallelism", "2")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .mode(SaveMode.Overwrite)
  .save(tablePath)

// update hudi dataset
val df2 = Seq(
  (100, "event_name_11111", "2015-01-01T13:51:39.340396Z", "type1"),
  (107, "event_name_578", "2015-01-01T13:51:42.248818Z", "type3")
  ).toDF("event_id", "event_name", "event_ts", "event_type")

df2.write.format("org.apache.hudi")
   .option("hoodie.upsert.shuffle.parallelism", "2")
   .option(HoodieWriteConfig.TABLE_NAME, tableName)
   .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
   .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type")
   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
   .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
   .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
   .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
   .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
   .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
   .mode(SaveMode.Append)
   .save(tablePath)
{code}
Then query the table in Hive:
{code:java}
select count(*) from hudi_test;
{code}
It returns: 
{code:java}
java.io.IOException: cannot find dir = hdfs://ip-172-30-6-236.ec2.internal:8020/user/hadoop/elb_logs_hudi_cow_8/2015-01-01/cb7531ac-dadf-4118-b722-55cb34bc66f2-0_34-7-336_20191104223321.parquet in pathToPartitionInfo: [hdfs:/user/hadoop/elb_logs_hudi_cow_8/2015-01-01]
    at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:394)
    at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:357)
    at org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.schemaEvolved(SplitGrouper.java:284)
    at org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.generateGroupedSplits(SplitGrouper.java:184)
    at org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.generateGroupedSplits(SplitGrouper.java:161)
    at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:207)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)     
{code}
h3. Investigations

From my investigation, when updating a Hudi table, Hudi issues this Hive command:
{code:java}
hive> ALTER TABLE hudi_test PARTITION (partition_key='event_type') 
    > SET LOCATION 'hdfs:/user/hadoop/hudi_test/event_type';
{code}
This Hive command directly updates the partition location to 'hdfs:/user/hadoop/hudi_test/event_type' in the Hive Metastore.
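One way to confirm what actually landed in the metastore is to describe the partition from a Hive-enabled SparkSession (a hypothetical check; it assumes the hudi_test table and the event_type partition values from the reproduction above):
{code:java}
// Hypothetical check from a Hive-enabled SparkSession named `spark`:
// print the location the metastore holds for one partition.
spark.sql("DESCRIBE FORMATTED hudi_test PARTITION (event_type='type1')")
  .filter("col_name = 'Location'")
  .show(false)
// Expect a scheme-only value such as hdfs:/user/hadoop/hudi_test/type1
{code}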

The problem is that Hive can only recognize a fully qualified URI such as 'hdfs://ip-172-30-6-236.ec2.internal:8020/user/hadoop/hudi_test/event_type'.

That is why the Hive query breaks with the pathToPartitionInfo error shown above.
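A minimal sketch of the fix direction this points to (my assumption, not Hudi's actual sync code): qualify the partition path against fs.defaultFS before registering it, so the missing authority gets filled in:
{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only (not Hudi's actual sync code): fill in the missing
// authority from fs.defaultFS before running the ALTER TABLE.
val conf = new Configuration() // picks up fs.defaultFS from core-site.xml
val fs = FileSystem.get(conf)
val qualified = fs.makeQualified(new Path("hdfs:///user/hadoop/hudi_test/type1"))
println(qualified) // e.g. hdfs://ip-172-30-6-236.ec2.internal:8020/user/hadoop/hudi_test/type1
{code}
Until this is fixed, writing with a fully qualified tablePath (scheme plus namenode host:port) should avoid the issue.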

> Unable to query by Hive after updating HDFS Hudi table
> ------------------------------------------------------
>
>                 Key: HUDI-325
>                 URL: https://issues.apache.org/jira/browse/HUDI-325
>             Project: Apache Hudi (incubating)
>          Issue Type: Bug
>            Reporter: Wenning Ding
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)