Posted to commits@hudi.apache.org by "sivabalan narayanan (Jira)" <ji...@apache.org> on 2023/02/21 23:08:00 UTC

[jira] [Comment Edited] (HUDI-5828) Support df.write.format("hudi") without any additional options

    [ https://issues.apache.org/jira/browse/HUDI-5828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691819#comment-17691819 ] 

sivabalan narayanan edited comment on HUDI-5828 at 2/21/23 11:07 PM:
---------------------------------------------------------------------

As per our quick start guide, we have 5 configs that are required (see the sketch after this list):

1. shuffle parallelism
2. record key
3. partition path
4. precombine
5. table name
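
For reference, a minimal sketch of a quick-start style write that supplies all 5 explicitly; the field names ("uuid", "ts", "partitionpath") and the table name are illustrative:

{code:scala}
// Quick-start style write: all 5 currently-required configs supplied explicitly.
df.write.format("hudi").
  option("hoodie.upsert.shuffle.parallelism", "2").                       // 1. shuffle parallelism
  option("hoodie.datasource.write.recordkey.field", "uuid").              // 2. record key
  option("hoodie.datasource.write.partitionpath.field", "partitionpath"). // 3. partition path
  option("hoodie.datasource.write.precombine.field", "ts").               // 4. precombine
  option("hoodie.table.name", "my_hudi_table").                           // 5. table name
  mode("append").
  save(basePath)
{code}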

 

1: Shuffle parallelism: with 0.13.0, this has already been relaxed and is not a mandatory field. It wasn't strictly mandatory even before, but with 0.13.0, the parallelism is dynamically derived from the incoming df.

2: Record key: with support for auto-generating record keys, we should be able to relax this constraint.

3: Partition path: we are adding support to infer the partition from the incoming df with https://issues.apache.org/jira/browse/HUDI-5796, so that's taken care of. Some follow-up is required, though: for a non-partitioned dataset, we need to infer that the incoming df is non-partitioned and choose NonPartitioned as the key gen class; otherwise, the default key gen class is SimpleKeyGen. That said, a simple partition path might work without any additional fixes.

4: Precombine: this is already an optional field, and users don't need to supply it.

5: Table name: this is somewhat tricky.

We can auto-generate a hudi table name, but when hive sync is enabled we should not generate it automatically: with external metastores, no two tables can have the same name and the names should be meaningful, so we can't auto-generate there. Otherwise, the table names would look like hudi_12313, hudi_e5e44, hudi_45sadf, etc. So, here is what we can do.

 

User flow 1:

A user who uses just the Spark datasource to write and read (see the sketch below).

a. Auto-generate hoodie.table.name if the user does not supply one. The auto-generated table name will get serialized into hoodie.properties.
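
A minimal sketch of this flow, using the recordkey autogen config proposed in this ticket; no table name is supplied:

{code:scala}
// User flow 1: write and read purely via the Spark datasource.
// hoodie.table.name is deliberately omitted; a name would be auto-generated
// and serialized into hoodie.properties on the first write.
df.write.format("hudi").
  option("hoodie.datasource.write.recordkey.autogen", "true").
  save(basePath)

val readBack = spark.read.format("hudi").load(basePath)
{code}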

 

User flow 2:

A user who writes via Spark and syncs to hive on every commit (see the sketch below).

The user does not need to supply hoodie.table.name, but is expected to set an explicit value for "hoodie.datasource.hive_sync.table". So, the auto-generated table name will get serialized into hoodie.properties, but for hive sync purposes we will choose whatever the user explicitly set for the corresponding config.
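
A sketch of this flow; sync configs other than the table name (the enable flag, database) are illustrative:

{code:scala}
// User flow 2: hive sync runs on every commit. The hive sync table name is
// explicit; hoodie.table.name is omitted and auto-generated into hoodie.properties.
df.write.format("hudi").
  option("hoodie.datasource.write.recordkey.autogen", "true").
  option("hoodie.datasource.hive_sync.enable", "true").
  option("hoodie.datasource.hive_sync.database", "default").
  option("hoodie.datasource.hive_sync.table", "my_meaningful_table").
  mode("append").
  save(basePath)
{code}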

 

User flow 3:

Similar to flow 2.

A user who writes via Spark and syncs to hive in a standalone manner, not with every write.

Regular writes will proceed as usual, and we will generate the hudi table name automatically on the first write.

When syncing to the external metastore, the user has to explicitly set a value for "hoodie.datasource.hive_sync.table".

 

Note, for flows 2 and 3:

If the user explicitly sets a value for "hoodie.table.name", we should automatically infer "hoodie.datasource.hive_sync.table" from it (see the sketch below). Only if the user has not explicitly set "hoodie.table.name", and the name was programmatically auto-generated, does the user have to explicitly set a value for "hoodie.datasource.hive_sync.table".
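
A minimal sketch of that inference; the config keys are real Hudi configs, but the helper itself is illustrative:

{code:scala}
// Illustrative resolution order for the hive sync table name:
// 1. an explicitly set hoodie.datasource.hive_sync.table wins;
// 2. else an explicitly set hoodie.table.name carries over;
// 3. else (auto-generated name) the user must set the hive sync table name.
def resolveHiveSyncTableName(userOpts: Map[String, String]): Option[String] =
  userOpts.get("hoodie.datasource.hive_sync.table")
    .orElse(userOpts.get("hoodie.table.name"))
{code}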

 

Format for the auto-generated hoodie table name:

hoodie_table_{ts}_{randomInt}

where ts is the current timestamp, and randomInt is a random integer generated to accommodate any concurrent writers.
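
A minimal sketch of generating a name in that format; the helper is illustrative, not an existing Hudi API:

{code:scala}
// Illustrative only: one way to produce hoodie_table_{ts}_{randomInt}.
// The random suffix guards against two concurrent writers picking the same name.
def autoGenerateTableName(): String =
  s"hoodie_table_${System.currentTimeMillis()}_${scala.util.Random.nextInt(Int.MaxValue)}"
{code}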

 

 

Summary:

Putting all of this together, here is where we will stand:

{code:scala}
df.write.format("hudi").
  option("hoodie.datasource.write.recordkey.autogen", "true").
  save(path)
{code}

 

Special handling:

We could simplify even further if need be (see the sketch below).

We can detect that the user has not provided any configs (0 user-supplied configs), and in such cases choose the default value of "hoodie.datasource.write.recordkey.autogen" as true and proceed instead of failing. This is somewhat analogous to how we might default the key gen type to Simple or NonPartitioned.
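
A sketch of that end state, assuming the zero-config default described above:

{code:scala}
// Hypothetical end state: zero user-supplied configs. Instead of failing,
// the write proceeds as if recordkey autogen had been set to true.
df.write.format("hudi").save(path)
{code}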


> Support df.write.format("hudi") without any additional options
> --------------------------------------------------------------
>
>                 Key: HUDI-5828
>                 URL: https://issues.apache.org/jira/browse/HUDI-5828
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: writer-core
>            Reporter: sivabalan narayanan
>            Priority: Major
>
> With regard to simplifying the usage of hudi for more users, we should try to see if we can support writing to hudi without any options during write. 
>  
> For example, we can do the following with parquet writes. 
> {code:java}
> df.write.format("parquet").save(path)
> {code}
>  
> So, for a non-partitioned dataset, we should see if we can support this usability. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)