Posted to commits@hudi.apache.org by "Wenning Ding (Jira)" <ji...@apache.org> on 2019/11/20 19:10:00 UTC

[jira] [Created] (HUDI-353) Add support for Hive style partitioning path

Wenning Ding created HUDI-353:
---------------------------------

             Summary: Add support for Hive style partitioning path
                 Key: HUDI-353
                 URL: https://issues.apache.org/jira/browse/HUDI-353
             Project: Apache Hudi (incubating)
          Issue Type: Improvement
            Reporter: Wenning Ding


In Hive, the partition folder name follows this format: <partition_column_name>=<partition_value>.
But in Hudi, the name of its partition folder is <partition_value>.

For example, consider a dataset partitioned by three columns: year, month and day.
In Hive, the data is saved in: {{.../<table_name>/year=2019/month=05/day=01/xxx.parquet}}
In Hudi, the data is saved in: {{.../<table_name>/2019/05/01/xxx.parquet}}
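
To make the difference between the two layouts concrete, here is a minimal Python sketch that builds the relative partition path both ways. The function name {{partition_path}} and its parameters are hypothetical and are used only for illustration; this is not Hudi's actual implementation.

```python
def partition_path(columns, values, hive_style=False):
    """Build a relative partition path from column names and their values.

    Hypothetical helper for illustration only, not part of Hudi.
    """
    if len(columns) != len(values):
        raise ValueError("columns and values must have the same length")
    if hive_style:
        # Hive layout: each folder is <partition_column_name>=<partition_value>
        parts = [f"{column}={value}" for column, value in zip(columns, values)]
    else:
        # Layout described above for Hudi: each folder is just <partition_value>
        parts = [str(value) for value in values]
    return "/".join(parts)


columns = ["year", "month", "day"]
values = ["2019", "05", "01"]
print(partition_path(columns, values, hive_style=True))
print(partition_path(columns, values, hive_style=False))
```

The first call produces the Hive style path from the example and the second produces the value-only path.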

I added a new Spark datasource option, {{HIVE_STYLE_PARTITIONING_FILED_OPT_KEY}}, which indicates whether to use Hive style partitioning. By default this option is false (not used).

Also, with Hive style partitioning, instead of scanning the dataset and manually adding or updating every partition, we can run "MSCK REPAIR TABLE <table_name>" to automatically sync all the partition info with the Hive Metastore.
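
The reason this works is that a Hive style path carries the partition column names, so the partition spec can be recovered from the folder names alone. The Python sketch below illustrates that idea; {{parse_hive_partition_path}} is a hypothetical helper written for this example and does not reflect how Hive or Hudi implement partition discovery.

```python
def parse_hive_partition_path(relative_path):
    """Recover {column: value} from a path like year=2019/month=05/day=01.

    Hypothetical helper for illustration only.
    """
    spec = {}
    for folder in relative_path.strip("/").split("/"):
        name, separator, value = folder.partition("=")
        if not separator or not name:
            # A value-only folder such as "2019" does not say which column it belongs to
            raise ValueError(f"not a Hive style partition folder: {folder!r}")
        spec[name] = value
    return spec


print(parse_hive_partition_path("year=2019/month=05/day=01"))
```

A value-only path such as {{2019/05/01}} raises a ValueError in this sketch, because nothing in the path says which column each value belongs to.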
--
This message was sent by Atlassian Jira
(v8.3.4#803005)