Posted to commits@hudi.apache.org by "Raymond Xu (Jira)" <ji...@apache.org> on 2022/06/02 00:50:00 UTC

[jira] [Updated] (HUDI-4071) Better Spark Datasource default configs

     [ https://issues.apache.org/jira/browse/HUDI-4071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-4071:
-----------------------------
    Sprint: 2022/05/16, 2022/05/17  (was: 2022/05/16)

> Better Spark Datasource default configs
> ---------------------------------------
>
>                 Key: HUDI-4071
>                 URL: https://issues.apache.org/jira/browse/HUDI-4071
>             Project: Apache Hudi
>          Issue Type: Task
>            Reporter: Sagar Sumit
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.12.0
>
>
> Default configs should be:
>  # Optimized for insert/bulk_insert. For example, if the default sort mode is NONE, a write is about as good as a plain parquet write, with some additional work for meta columns. An extension of this is to keep a map of minimal optimized configs per operation type. This is partly related to the more performant configs tracked in HUDI-2151.
>  # Based on reasonable assumptions. For example, for index type, the bloom filter index does not rely on any external system, so it is a better default candidate than, say, the HBase index.
>  # Scout all configs with noDefaultValue and assign a default if necessary.
>  # Keep spark-sql and Spark datasource config keys the same as much as possible; otherwise it is operationally difficult for the user. Rename/reuse existing datasource keys that are meant for the same purpose. This is related to HUDI-4070 as well.
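
The "map of minimal optimized configs per operation type" mentioned in the first item could look roughly like the sketch below. This is only an illustration, not Hudi code: the config keys (hoodie.datasource.write.operation, hoodie.bulkinsert.sort.mode, hoodie.index.type) are existing Hudi keys, but the names BASE_DEFAULTS, PER_OPERATION_DEFAULTS and resolve_write_options, and the values chosen, are assumptions made for the example.

```python
# Hypothetical sketch of per-operation default configs.
# The dictionaries and the function below are illustrative assumptions;
# they are not part of Hudi and do not reflect its actual defaults.

# Defaults shared by every operation. BLOOM needs no external system.
BASE_DEFAULTS = {
    "hoodie.index.type": "BLOOM",
}

# Minimal overrides keyed by write operation.
PER_OPERATION_DEFAULTS = {
    # NONE skips sorting, so the write stays close to a plain parquet write.
    "bulk_insert": {"hoodie.bulkinsert.sort.mode": "NONE"},
    "insert": {},
    "upsert": {},
}


def resolve_write_options(operation, user_options=None):
    """Merge base defaults, per-operation defaults, then user options.

    Later layers win, so anything the user sets explicitly is kept.
    """
    if operation not in PER_OPERATION_DEFAULTS:
        raise ValueError(f"unsupported operation: {operation}")
    options = dict(BASE_DEFAULTS)
    options.update(PER_OPERATION_DEFAULTS[operation])
    options.update(user_options or {})
    # Set the operation last so the result always matches the argument.
    options["hoodie.datasource.write.operation"] = operation
    return options
```

The resulting dictionary could then be passed to a Spark datasource write, for example via df.write.format("hudi").options(**opts), alongside the table-specific options a real write also needs (table name, record key, and so on).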



--
This message was sent by Atlassian Jira
(v8.20.7#820007)