You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/06/02 21:24:27 UTC

[GitHub] [hudi] alvarolemos commented on a diff in pull request #5667: [HUDI-4142] [RFC-54] New Table APIs and streamline Hudi configs

alvarolemos commented on code in PR #5667:
URL: https://github.com/apache/hudi/pull/5667#discussion_r888425567


##########
rfc/rfc-54/rfc-54.md:
##########
@@ -0,0 +1,183 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+# RFC-54: New Table APIs and Streamline Hudi Configs
+
+## Proposers
+
+- @codope
+
+## Approvers
+
+- @xushiyan
+- @vinothchandar
+
+## Status
+
+JIRA: [HUDI-4141](https://issues.apache.org/jira/browse/HUDI-4141)
+
+## Abstract
+
+Users configure jobs to write Hudi tables and control the behaviour of their
+jobs at different levels such as table, write client, datasource, record
+payload, etc. On one hand, this is the true strength of Hudi which makes it
+suitable for many use cases and offers the users a solution to the tradeoffs
+encountered in data systems. On the other, it has also resulted in the learning
+curve for new users to be steeper. In this RFC, we propose to streamline some of
+these configurations. Additionally, we propose a few table level APIs to create
+or update Hudi table programmatically. Together, they would help in a smoother
+onboarding experience and increase the usability of Hudi. It would also help
+existing users through better configuration maintenance.
+
+## Background
+
+Currently, users can create and update Hudi Table using three different
+ways: [Spark datasource](https://hudi.apache.org/docs/writing_data),
+[SQL](https://hudi.apache.org/docs/table_management)
+and [DeltaStreamer](https://hudi.apache.org/docs/hoodie_deltastreamer). Each one
+of these ways is setup using a bunch
+of [configurations](https://hudi.apache.org/docs/configurations), which has
+grown over the years as new features have beed added. Imagine yourself as a data
+engineer who has been using Spark to write parquet tables. You want to try out
+Hudi and land on
+the [quickstart](https://hudi.apache.org/docs/quick-start-guide) page. You see a
+bunch of configurations (precombine field, record key, partition path) to be set
+and wonder why can't I just do `spark.write.format("hudi").save()`. Apart from
+configurations, there is no first-class support for table management APIs such
+as to create or drop table. The implementation section below presents the
+proposals to fill such gaps.
+
+## Implementation
+
+Implementation can be split into two independent changes: streamline
+configuration and new table APIs.
+
+### Streamline Configuration
+
+#### Minimal set of quickstart configurations
+
+* Users should be able to simply write Hudi table
+  using `spark.write.format("hudi")`. If no record key and precombine field is
+  provided, then assume append only and avoid index lookup and merging.
+* Hudi should infer partition field if users provide
+  as `spark.write.format("hudi").partitionBy(field)`.
+* Users need not pass all the configurations in each write operation. Once the
+  table has been created, most table configs do not change, e.g. table name
+  needs to be passed in every write, even though its only needed first time.
+  Hudi should fetch from table configs when options are not provided by the
+  user.
+
+#### Better defaults
+
+* Default values for configurations should be optimized for simple bulk load
+  scenario e.g. by default if we have NONE sort mode then it's as good as
+  parquet writes with some additional work for meta columns.
+* Make reasonable assumptions, such as do not rely on any external system (e.g.
+  hbase) for default. As another example, enable schema reconciliation by
+  default instead of failing writes.
+
+#### Consistency across write paths
+
+* Keep configs for Spark SQL, Spark DataSource and HoodieDeltaStreamer in sync as much
+  as possible. Document exceptions, e.g. key generator for sql is
+  ComplexKeyGenerator while for datasource it is SimpleKeyGenerator.
+* Rename/reuse existing datasource keys that are meant for the same purpose.
+* In all these changes, we should support backward compatibility.
+
+#### Refactor Meta Sync ([RFC-55](/rfc/rfc-55/rfc-55.md))
+
+* Reduce the number of configs needed for Hive sync, e.g. table name once
+  provided at the time of first write can be reused for hive sync table name
+  config as well.
+* Refactor the class hierarchy and APIs.
+
+#### Support `HoodieConfig` API
+
+* Users should be able to use the config builders instead of specifying config
+  keys,
+  e.g. `spark.write.format("hudi").options(HoodieClusteringConfig.Builder().withXYZ().build())`
+
+### Table APIs
+

Review Comment:
   This is a great idea! Many people used to other frameworks (like DeltaLake) would onboard easily. As a user, I just have a concern: are you planning on creating SDKs for other languages supported by Spark, especially Python? Asking that because at my company we use Hudi successfully with PySpark (even though the Hudi project doesn't have a single line of Python) because of the way it works through configuration. I believe that there are many other users that have successfully used Hudi with PySpark for that same reason, so I would think about that and maybe add that support in the roadmap



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org