You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by co...@apache.org on 2022/07/10 06:12:42 UTC
[hudi] branch master updated: [HUDI-3730][RFC-55] Improve hudi-sync classes design and simplify configs (#5695)
This is an automated email from the ASF dual-hosted git repository.
codope pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/master by this push:
new 63f95ab801 [HUDI-3730][RFC-55] Improve hudi-sync classes design and simplify configs (#5695)
63f95ab801 is described below
commit 63f95ab801c0266bd4f03762b7a353f831b61784
Author: 冯健 <fe...@gmail.com>
AuthorDate: Sun Jul 10 14:12:34 2022 +0800
[HUDI-3730][RFC-55] Improve hudi-sync classes design and simplify configs (#5695)
* [HUDI-4146] RFC for Improve Hive/Meta sync class design and hierarchies
Co-authored-by: jian.feng <ji...@shopee.com>
Co-authored-by: Raymond Xu <27...@users.noreply.github.com>
---
rfc/rfc-55/hudi-sync-class-diagram.png | Bin 0 -> 418052 bytes
rfc/rfc-55/hudi-sync-flows.png | Bin 0 -> 139211 bytes
rfc/rfc-55/rfc-55.md | 157 +++++++++++++++++++++++++++++++++
3 files changed, 157 insertions(+)
diff --git a/rfc/rfc-55/hudi-sync-class-diagram.png b/rfc/rfc-55/hudi-sync-class-diagram.png
new file mode 100644
index 0000000000..11acf4bffc
Binary files /dev/null and b/rfc/rfc-55/hudi-sync-class-diagram.png differ
diff --git a/rfc/rfc-55/hudi-sync-flows.png b/rfc/rfc-55/hudi-sync-flows.png
new file mode 100644
index 0000000000..068adcc6a5
Binary files /dev/null and b/rfc/rfc-55/hudi-sync-flows.png differ
diff --git a/rfc/rfc-55/rfc-55.md b/rfc/rfc-55/rfc-55.md
new file mode 100644
index 0000000000..3e8d130216
--- /dev/null
+++ b/rfc/rfc-55/rfc-55.md
@@ -0,0 +1,157 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+# RFC-55: Improve hudi-sync classes design and simplify configs
+
+## Proposers
+
+- @<proposer1 @fengjian428>
+- @<proposer2 @xushiyan>
+
+## Approvers
+
+ - @<approver1 @vinothchandar>
+ - @<approver2 @codope>
+
+## Status
+
+JIRA: [HUDI-3730](https://issues.apache.org/jira/browse/HUDI-3730)
+
+## Abstract
+
+![hudi-sync-flows.png](hudi-sync-flows.png)
+
+Hudi support sync to various metastores via different processing framework like Spark, Flink, and Kafka connect.
+
+There are some room for improvement
+
+* The way to generate Sync configs are inconsistent in different framework
+* The abstraction of SyncClasses was designed for HiveSync, hence there are duplicated code, unused method, and parameters.
+
+We need a standard way to run hudi sync. We also need a unified abstraction of XXXSyncTool , XXXSyncClient and XXXSyncConfig to handle supported metastores, like hive metastore, bigquery, datahub, etc.
+
+## Classes design
+
+![hudi-sync-class-diagram.png](hudi-sync-class-diagram.png)
+
+Below are the proposed key classes to handle the main sync logic. They are extensible for different metastores.
+
+### `HoodieSyncTool`
+
+*Renamed from `AbstractSyncTool`.*
+
+```java
+public abstract class HoodieSyncTool implements AutoCloseable {
+
+ protected HoodieSyncClient syncClient;
+
+ /**
+ * Sync tool class is the entrypoint to run meta sync.
+ *
+ * @param props A bag of properties passed by users. It can contain all hoodie.* and any other config.
+ * @param hadoopConf Hadoop specific configs.
+ */
+ public HoodieSyncTool(Properties props, Configuration hadoopConf);
+
+ public abstract void syncHoodieTable();
+
+ public static void main(String[] args) {
+ // instantiate HoodieSyncConfig and concrete sync tool, and run sync.
+ }
+}
+```
+
+### `HoodieSyncConfig`
+
+```java
+public class HoodieSyncConfig extends HoodieConfig {
+
+ public static class HoodieSyncConfigParams {
+ // POJO class to take command line parameters
+ @Parameter()
+ private String basePath; // common essential parameters
+
+ public Properties toProps();
+ }
+
+ /**
+ * XXXSyncConfig is meant to be created and used by XXXSyncTool exclusively and internally.
+ *
+ * @param props passed from XXXSyncTool.
+ * @param hadoopConf passed from XXXSyncTool.
+ */
+ public HoodieSyncConfig(Properties props, Configuration hadoopConf);
+}
+
+public class HiveSyncConfig extends HoodieSyncConfig {
+
+ public static class HiveSyncConfigParams {
+
+ @Parameter()
+ private String syncMode;
+
+ // delegate common parameters to other XXXParams class
+ // this overcomes single-inheritance's inconvenience
+ // see https://jcommander.org/#_parameter_delegates
+ @ParametersDelegate()
+ private HoodieSyncConfigParams hoodieSyncConfigParams = new HoodieSyncConfigParams();
+
+ public Properties toProps();
+ }
+
+ public HoodieSyncConfig(Properties props);
+}
+```
+
+### `HoodieSyncClient`
+
+*Renamed from `AbstractSyncHoodieClient`.*
+
+```java
+public abstract class HoodieSyncClient implements AutoCloseable {
+ // metastore-agnostic APIs
+}
+```
+
+## Config simplification
+
+- rename all sync related configs to suffix as `hoodie.sync.*`
+ - no more `hoodie.meta.sync.*` or `hoodie.meta_sync.*`
+ - no more variable name or class name like `metaSyncEnabled` or `metaSyncTool`; standardize as `hoodieSync*` to align with module name `hudi-sync`
+- remove all sync related option constants from `DataSourceOptions`
+- `database` and `table` should not be required by sync tool; they should be inferred from table properties
+- users should not need to set PartitionValueExtractor; partition values should be inferred automatically
+- remove `USE_JDBC` and fully adopt `SYNC_MODE`
+- remove `HIVE_SYNC_ENDABLED` and related arguments from sync tools and delta streamers. Use `SYNC_ENABLED`
+- migrate repeated sync config to original config
+ - `META_SYNC_BASE_FILE_FORMAT` -> `org.apache.hudi.common.table.HoodieTableConfig.BASE_FILE_FORMAT`
+ - `META_SYNC_PARTITION_FIELDS` -> `org.apache.hudi.common.table.HoodieTableConfig.PARTITION_FIELDS`
+ - `META_SYNC_ASSUME_DATE_PARTITION` -> `org.apache.hudi.common.config.HoodieMetadataConfig.ASSUME_DATE_PARTITIONING`
+ - `META_SYNC_DECODE_PARTITION` -> `org.apache.hudi.common.table.HoodieTableConfig.URL_ENCODE_PARTITIONING`
+ - `META_SYNC_USE_FILE_LISTING_FROM_METADATA` -> `org.apache.hudi.common.config.HoodieMetadataConfig.ENABLE`
+
+## Rollout/Adoption Plan
+
+- Users who set `USE_JDBC` will need to change to set `SYNC_MODE=jdbc`
+- Users who set `--enable-hive-sync` or `HIVE_SYNC_ENABLED` will need to drop the argument or config and change to `--enable-sync` or `SYNC_ENABLED`.
+- Users who import from `DataSourceOptions` for meta sync constants will need to import relevant configs from `HoodieSyncConfig` and subclasses.
+- Users who set `AwsGlueCatalogSyncTool` as sync tool class need to update the class name to `AWSGlueCatalogSyncTool`
+
+## Test Plan
+
+- CI covers most operations for Hive sync with HMS
+- end-to-end testing with setup for Glue Catalog, BigQuery, DataHub instance
+- manual testing with partitions added and removed