Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/05/24 15:19:26 UTC

[GitHub] [hudi] nsivabalan commented on a diff in pull request #5667: [HUDI-4142] RFC for new table APIs and config changes

nsivabalan commented on code in PR #5667:
URL: https://github.com/apache/hudi/pull/5667#discussion_r880650541


##########
rfc/rfc-54/rfc-54.md:
##########
@@ -0,0 +1,175 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+# RFC-54: New Table APIs and Streamline Hudi Configs
+
+## Proposers
+
+- @codope
+
+## Approvers
+
+- @xushiyan
+- @vinothchandar
+
+## Status
+
+JIRA: [HUDI-4141](https://issues.apache.org/jira/browse/HUDI-4141)
+
+## Abstract
+
+Users configure jobs to write Hudi tables and control the behaviour of their
+jobs at different levels such as table, write client, datasource, record
+payload, etc. On one hand, this flexibility is a true strength of Hudi: it
+makes Hudi suitable for many use cases and offers users a way to navigate the
+tradeoffs encountered in data systems. On the other hand, it has also made the
+learning curve steeper for new users. In this RFC, we propose to streamline
+some of these configurations. Additionally, we propose a few table-level APIs
+to create or update Hudi tables programmatically. Together, these changes
+would make onboarding smoother and increase the usability of Hudi. They would
+also help existing users through better configuration maintenance.
+
+## Background
+
+Currently, users can create and update Hudi tables in three different
+ways: [Spark datasource](https://hudi.apache.org/docs/writing_data),
+[SQL](https://hudi.apache.org/docs/table_management)
+and [DeltaStreamer](https://hudi.apache.org/docs/hoodie_deltastreamer). Each of
+these paths is set up through a number
+of [configurations](https://hudi.apache.org/docs/configurations), which have
+grown over the years as new features have been added. Imagine yourself as a
+data engineer who has been using Spark to write parquet tables. You want to
+try out Hudi and land on
+the [quickstart](https://hudi.apache.org/docs/quick-start-guide) page. You see
+a bunch of configurations (precombine field, record key, partition path) to be
+set and wonder why you can't just do `spark.write.format("hudi").save()`.
+Apart from configurations, there is no first-class support for table
+management APIs such as creating or dropping a table. The implementation
+section below presents proposals to fill these gaps.
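+
+To make the contrast concrete, here is a rough sketch (for illustration only,
+assuming a DataFrame `df` and a target `basePath`; the option keys are today's
+Spark datasource keys and the field values are made up):
+
+```scala
+// Today's quickstart: several options must be supplied up front.
+df.write.format("hudi").
+  option("hoodie.datasource.write.recordkey.field", "uuid").
+  option("hoodie.datasource.write.precombine.field", "ts").
+  option("hoodie.datasource.write.partitionpath.field", "region").
+  option("hoodie.table.name", "trips").
+  mode("overwrite").
+  save(basePath)
+
+// The minimal write a user coming from plain parquet tables expects.
+df.write.format("hudi").save(basePath)
+```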
+
+## Implementation
+
+Implementation can be split into two independent changes: streamlining
+configuration and adding new table APIs.
+
+### Streamline Configuration
+
+#### Minimal set of quickstart configurations
+
+* Users should be able to simply write a Hudi table
+  using `spark.write.format("hudi")`. If no record key or precombine field is
+  provided, then assume an append-only workload and avoid index lookup and
+  merging.
+* Hudi should infer the partition field if users provide it
+  via `spark.write.format("hudi").partitionBy(field)`.
+* Users need not pass all the configurations in each write operation. Once the
+  table has been created, most table configs do not change, e.g. the table
+  name currently needs to be passed in every write, even though it is only
+  needed the first time. Hudi should fall back to the table configs when
+  options are not provided by the user. A sketch of the intended experience
+  follows this list.
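+
+A hypothetical sketch of the proposed behaviour (none of this is existing
+behaviour today; `df` and `basePath` are as in the earlier example):
+
+```scala
+// No record key or precombine field supplied: assume append-only and skip
+// index lookup and merging.
+df.write.format("hudi").save(basePath)
+
+// Partition field inferred from the standard Spark partitionBy() call.
+df.write.format("hudi").partitionBy("region").save(basePath)
+
+// Subsequent writes fall back to the table configs persisted at creation
+// time (e.g. the table name) instead of requiring them on every write.
+df.write.format("hudi").mode("append").save(basePath)
+```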
+
+#### Good defaults
+
+* Default values for configurations should be optimized for the simple bulk
+  load scenario, e.g. with the NONE sort mode as the default, a bulk insert is
+  essentially a parquet write plus some additional work for the meta columns
+  (see the sketch after this list).
+* Make reasonable assumptions, such as not relying on any external system
+  (e.g. HBase) by default. As another example, reconcile schemas by default
+  instead of failing writes.
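+
+A minimal sketch of the bulk load case, assuming today's option keys
+(`hoodie.datasource.write.operation` and `hoodie.bulkinsert.sort.mode` are
+existing configs; the proposal is that the NONE sort mode becomes the default
+rather than something to spell out):
+
+```scala
+// With the NONE sort mode, a bulk insert skips sorting and behaves close to
+// a plain parquet write, aside from populating the Hudi meta columns.
+df.write.format("hudi").
+  option("hoodie.datasource.write.operation", "bulk_insert").
+  option("hoodie.bulkinsert.sort.mode", "NONE").
+  save(basePath)
+```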
+
+#### Reuse and consistency
+
+* Keep Spark SQL, Spark datasource and DeltaStreamer configs in sync as much
+  as possible. Document the exceptions, e.g. the default key generator for SQL
+  is ComplexKeyGenerator while for the datasource it is SimpleKeyGenerator.
+* Rename/reuse existing datasource keys that are meant for the same purpose.
+* In all these changes, we should maintain backward compatibility.
+
+#### Refactor Hive Sync
+
+* Reduce the number of configs needed for Hive sync, e.g. the table name
+  provided at the time of the first write can be reused as the Hive sync table
+  name config as well (see the sketch after this list).
+* Revisit the class hierarchy and refactor if needed.
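+
+For illustration, a sketch using existing Hive sync option keys; the point is
+that `hoodie.datasource.hive_sync.table` repeats information Hudi already has
+from `hoodie.table.name` and could be inferred:
+
+```scala
+df.write.format("hudi").
+  option("hoodie.table.name", "trips").
+  option("hoodie.datasource.hive_sync.enable", "true").
+  // Today the sync table name has to be repeated; it could default to the
+  // table name supplied at creation time.
+  option("hoodie.datasource.hive_sync.table", "trips").
+  save(basePath)
+```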
+
+#### Configuration Builders
+
+* Users should be able to use the config builders instead of specifying raw
+  config keys, e.g.
+  `spark.write.format("hudi").options(HoodieClusteringConfig.Builder().withXYZ().build())`
+  (a fuller sketch follows this list).
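+
+A hypothetical sketch of the intended usage; the `options(...)` overload that
+accepts a built Hudi config object does not exist today, and the builder
+method names shown are illustrative:
+
+```scala
+import org.apache.hudi.config.HoodieClusteringConfig
+
+val clusteringConfig = HoodieClusteringConfig.newBuilder()
+  .withInlineClustering(true)           // illustrative builder calls
+  .withInlineClusteringNumCommits(4)
+  .build()
+
+// Proposed: pass the typed config directly instead of raw string keys.
+df.write.format("hudi").options(clusteringConfig).save(basePath)
+```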
+
+### Table APIs
+
+These APIs are meant for programmatically interacting with Hudi tables. Users
+should be able to create or update tables using static methods. A sketch of
+how these might be invoked follows the table below.
+
+| Method Name   | Description   |
+| ------------- | ------------- |
+| bootstrap     | Create a Hudi table from the given parquet table.  |
+| create        | Create a Hudi table with the given configs if it does not exist.   |
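+
+A purely hypothetical sketch of how such static methods might be invoked; the
+`HudiTable` entry point, method signatures, and paths below are placeholders,
+not a finalized API:
+
+```scala
+// Create a Hudi table with the given configs if it does not exist.
+val table = HudiTable.create(
+  spark,
+  "s3://bucket/warehouse/trips",            // base path (placeholder)
+  Map("hoodie.table.name" -> "trips"))      // table configs
+
+// Bootstrap a Hudi table from an existing parquet table.
+val bootstrapped = HudiTable.bootstrap(
+  spark,
+  "s3://bucket/warehouse/trips_parquet",    // source parquet table
+  "s3://bucket/warehouse/trips")            // target Hudi table base path
+```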

Review Comment:
   Is there a `load` API to instantiate a HudiTable for an already existing Hudi table, or will it be the same `create`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org