Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/05/26 12:12:08 UTC

[GitHub] [hudi] fengjian428 opened a new pull request, #5695: [HUDI-4146] RFC for Improve Hive/Meta sync class design and hierachies

fengjian428 opened a new pull request, #5695:
URL: https://github.com/apache/hudi/pull/5695

   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #5695: [HUDI-4146][RFC-55] for Improve Hive/Meta sync class design and hierachies

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5695:
URL: https://github.com/apache/hudi/pull/5695#issuecomment-1140435149

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "1c009f25c98856ec6fba2fce3a3f9dba4d2620b1",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8946",
       "triggerID" : "1c009f25c98856ec6fba2fce3a3f9dba4d2620b1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7b73ba95a1615452e9ea96b911cea887fbdabe8e",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9000",
       "triggerID" : "7b73ba95a1615452e9ea96b911cea887fbdabe8e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 7b73ba95a1615452e9ea96b911cea887fbdabe8e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9000) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #5695: [HUDI-3730][RFC-55] Improve hudi-sync classes design and simplify configs

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5695:
URL: https://github.com/apache/hudi/pull/5695#issuecomment-1179655288

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "7b73ba95a1615452e9ea96b911cea887fbdabe8e",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9000",
       "triggerID" : "7b73ba95a1615452e9ea96b911cea887fbdabe8e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "015ee83641033b3e3e1ef1ab907013933e27f31d",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9807",
       "triggerID" : "015ee83641033b3e3e1ef1ab907013933e27f31d",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 7b73ba95a1615452e9ea96b911cea887fbdabe8e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9000) 
   * 015ee83641033b3e3e1ef1ab907013933e27f31d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9807) 
   


[GitHub] [hudi] fengjian428 commented on pull request #5695: [HUDI-4146][RFC-55] for Improve Hive/Meta sync class design and hierachies

Posted by GitBox <gi...@apache.org>.
fengjian428 commented on PR #5695:
URL: https://github.com/apache/hudi/pull/5695#issuecomment-1147643205

   > With Spark, could we directly call the `spark.sharedState.externalCatalog` API to sync metadata and avoid creating an HMS client?
   
   This is not only used for Spark; we also need to support Flink, Kafka Connect, and running the sync tools standalone.
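
The engine-agnostic requirement can be pictured as a small interface that any writer (Spark, Flink, Kafka Connect, or a standalone tool) implements. The following is a minimal sketch, not Hudi code: `SupportMetaSync` and `runMetaSync()` are names taken from the RFC, while `SyncTool`, `MetaSyncRunner`, the `meta.sync.tools` key, and the factory registry are hypothetical stand-ins introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.function.Function;

// Demo entry point: registers two fake tools and runs them in configured order.
class MetaSyncDemo {
  static List<String> run(String tools) {
    List<String> log = new ArrayList<>();
    Map<String, Function<Properties, SyncTool>> registry = new LinkedHashMap<>();
    for (String catalog : new String[] {"hive", "bigquery"}) {
      registry.put(catalog, p -> new SyncTool() {
        @Override public void syncHoodieTable() { log.add("sync:" + catalog); }
        @Override public void close() { log.add("close:" + catalog); }
      });
    }
    Properties props = new Properties();
    props.setProperty(MetaSyncRunner.TOOLS_KEY, tools);
    new MetaSyncRunner(props, registry).runMetaSync();
    return log;
  }

  public static void main(String[] args) {
    System.out.println(run("hive, bigquery"));
  }
}

// Hypothetical stand-in for a catalog-specific sync tool.
interface SyncTool extends AutoCloseable {
  void syncHoodieTable();

  @Override
  void close(); // narrowed so callers need not handle a checked exception
}

// Name from the RFC; the single-method shape is an assumption.
interface SupportMetaSync {
  void runMetaSync();
}

// Hypothetical engine-side class; nothing here depends on a specific engine.
class MetaSyncRunner implements SupportMetaSync {
  // Assumed key holding a comma-separated list of tool names (not a real Hudi config).
  static final String TOOLS_KEY = "meta.sync.tools";

  private final Properties props;
  private final Map<String, Function<Properties, SyncTool>> registry;

  MetaSyncRunner(Properties props, Map<String, Function<Properties, SyncTool>> registry) {
    this.props = props;
    this.registry = registry;
  }

  @Override
  public void runMetaSync() {
    for (String name : props.getProperty(TOOLS_KEY, "").split(",")) {
      String trimmed = name.trim();
      if (trimmed.isEmpty()) {
        continue; // no tool configured, or a stray comma
      }
      Function<Properties, SyncTool> factory = registry.get(trimmed);
      if (factory == null) {
        throw new IllegalArgumentException("Unknown sync tool: " + trimmed);
      }
      // Each tool is created, run, and closed independently of the others.
      try (SyncTool tool = factory.apply(props)) {
        tool.syncHoodieTable();
      }
    }
  }
}
```

Because the runner only sees `Properties` and the `SyncTool` abstraction, the same call works no matter which engine produced the commit.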




[GitHub] [hudi] hudi-bot commented on pull request #5695: [HUDI-4146][RFC-55] for Improve Hive/Meta sync class design and hierachies

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5695:
URL: https://github.com/apache/hudi/pull/5695#issuecomment-1138543805

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "1c009f25c98856ec6fba2fce3a3f9dba4d2620b1",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8946",
       "triggerID" : "1c009f25c98856ec6fba2fce3a3f9dba4d2620b1",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 1c009f25c98856ec6fba2fce3a3f9dba4d2620b1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8946) 
   


[GitHub] [hudi] xushiyan commented on a diff in pull request #5695: [HUDI-3730][RFC-55] Improve metasync class design and simplify configs

Posted by GitBox <gi...@apache.org>.
xushiyan commented on code in PR #5695:
URL: https://github.com/apache/hudi/pull/5695#discussion_r913061159


##########
rfc/rfc-55/rfc-55.md:
##########
@@ -0,0 +1,148 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-55: Improve metasync class design and simplify configs
+
+## Proposers
+
+- @<proposer1 @fengjian428>
+- @<proposer2 @xushiyan>
+
+## Approvers
+
+ - @<approver1 @vinothchandar>
+ - @<approver2 @codope>
+
+## Status
+
+JIRA: [HUDI-3730](https://issues.apache.org/jira/browse/HUDI-3730)
+
+## Abstract
+
+![ArchitectureMetaSync.png](ArchitectureMetaSync.png)
+
+Hudi can sync table metadata to various catalogs, and users can run meta sync from different frameworks such as Spark, Flink, and Kafka Connect.
+The current situation is:
+
+* The way sync configs are generated is inconsistent across frameworks.
+* The sync class abstractions were designed for HiveSync, so new catalogs inherit duplicated code and unused methods, parameters, and configs. This needs to be improved.
+
+That being said, we need a standard way to call meta sync. We also need a unified abstraction of XXXSyncTool, XXXSyncClient, and XXXSyncConfig to handle all supported meta sync targets, including HMS, BigQuery, DataHub, etc.
+
+## Classes design
+
+![classDesign.png](classDesign.png)
+
+* Engine-side classes that need meta sync, such as DeltaSync and KafkaConnectTransactionServices, should implement _SupportMetaSync_. Calling `runMetaSync();` then syncs metadata through every sync tool class listed in the config.
+* Redesign AbstractSyncClient and AbstractSyncTool and add a Catalog interface, so that the class hierarchy is clearer and more precise.
+* Unify the way SyncConfig is generated and the way SyncTool is called, and remove unused parameters.
+
+### `HoodieSyncTool`
+
+*Renamed from `AbstractSyncTool`.*
+
+```java
+public abstract class HoodieSyncTool implements AutoCloseable {
+
+  protected HoodieSyncClient syncClient;
+
+  /**
+   * Sync tool class is the entrypoint to run meta sync.
+   *
+   * @param props A bag of properties passed by users. It can contain all hoodie.* and any other config.
+   * @param hadoopConf Hadoop specific configs.
+   */
+  public HoodieSyncTool(Properties props, Configuration hadoopConf);
+
+  public abstract void syncHoodieTable();
+
+  public static void main(String[] args) {
+     // instantiate HoodieSyncConfig and concrete sync tool, and run sync.
+  }
+}
+```
+
+### `HoodieSyncConfig`
+
+```java
+public class HoodieSyncConfig extends HoodieConfig {
+
+  public static class HoodieSyncConfigParams {
+    // POJO class to take command line parameters
+    @Parameter()
+    private String basePath; // common essential parameters
+
+    public Properties toProps();
+  }
+
+   /**
+    * XXXSyncConfig is meant to be created and used by XXXSyncTool exclusively and internally.
+    * 
+    * @param props passed from XXXSyncTool.
+    * @param hadoopConf passed from XXXSyncTool.
+    */
+  public HoodieSyncConfig(Properties props, Configuration hadoopConf);
+}
+
+public class HiveSyncConfig extends HoodieSyncConfig {
+
+  public static class HiveSyncConfigParams {
+
+    @Parameter()
+    private String syncMode;
+
+    // delegate common parameters to other XXXParams class
+    // this overcomes single-inheritance's inconvenience
+    // see https://jcommander.org/#_parameter_delegates
+    @ParametersDelegate()
+    private HoodieSyncConfigParams hoodieSyncConfigParams = new HoodieSyncConfigParams();
+
+    public Properties toProps();
+  }
+
+  public HiveSyncConfig(Properties props);
+}
+```
+
+### `HoodieSyncClient`
+
+*Renamed from `AbstractSyncHoodieClient`.*
+
+```java
+public abstract class HoodieSyncClient implements AutoCloseable {
+  // metastore-agnostic APIs
+}
+```
+
+## Config simplification
+
+- users should not need to set additional table name
+- users should not need to set PartitionValueExtractor; partition values should be inferred automatically
+- remove `USE_JDBC` and fully adopt `SYNC_MODE`
+
+(more to be added)

Review Comment:
   These will be covered in separate PRs. #5854 covers just the refactoring part.
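
The `USE_JDBC` to `SYNC_MODE` item can be kept backward compatible by resolving the new key first and falling back to the legacy flag. The sketch below is illustrative only: the two property keys, the mode names (`hms`, `jdbc`, `hiveql`), the `true` to `jdbc` / `false` to `hiveql` mapping, and the default are assumptions that should be checked against the Hudi release in use, and `SyncModeResolver` is a hypothetical helper, not a class from the PR.

```java
import java.util.Locale;
import java.util.Properties;

class SyncModeResolver {
  // Assumed config keys; verify against the Hudi version in use.
  static final String USE_JDBC = "hoodie.datasource.hive_sync.use_jdbc";
  static final String SYNC_MODE = "hoodie.datasource.hive_sync.mode";

  /** Returns the effective sync mode, preferring the new key over the legacy flag. */
  static String resolve(Properties props) {
    String mode = props.getProperty(SYNC_MODE);
    if (mode != null && !mode.trim().isEmpty()) {
      return mode.trim().toLowerCase(Locale.ROOT);
    }
    String useJdbc = props.getProperty(USE_JDBC);
    if (useJdbc != null) {
      // Assumed legacy mapping: true -> jdbc, false -> hiveql.
      return Boolean.parseBoolean(useJdbc.trim()) ? "jdbc" : "hiveql";
    }
    return "jdbc"; // assumed default when neither key is set
  }

  public static void main(String[] args) {
    Properties legacy = new Properties();
    legacy.setProperty(USE_JDBC, "false");
    System.out.println(resolve(legacy));
  }
}
```

With this shape, existing jobs that only set the legacy flag keep their behavior, while new jobs set the mode key alone.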





[GitHub] [hudi] hudi-bot commented on pull request #5695: [HUDI-3730][RFC-55] Improve metasync class design and simplify configs

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5695:
URL: https://github.com/apache/hudi/pull/5695#issuecomment-1166566595

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "7b73ba95a1615452e9ea96b911cea887fbdabe8e",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "7b73ba95a1615452e9ea96b911cea887fbdabe8e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 7b73ba95a1615452e9ea96b911cea887fbdabe8e UNKNOWN
   


[GitHub] [hudi] melin commented on pull request #5695: [HUDI-4146][RFC-55] for Improve Hive/Meta sync class design and hierachies

Posted by GitBox <gi...@apache.org>.
melin commented on PR #5695:
URL: https://github.com/apache/hudi/pull/5695#issuecomment-1140601173

   With Spark, could we directly call the `spark.sharedState.externalCatalog` API to sync metadata and avoid creating an HMS client?




[GitHub] [hudi] hudi-bot commented on pull request #5695: [HUDI-4146] draft: RFC for Improve Hive/Meta sync class design and hierachies

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5695:
URL: https://github.com/apache/hudi/pull/5695#issuecomment-1138505693

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "1c009f25c98856ec6fba2fce3a3f9dba4d2620b1",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "1c009f25c98856ec6fba2fce3a3f9dba4d2620b1",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 1c009f25c98856ec6fba2fce3a3f9dba4d2620b1 UNKNOWN
   


[GitHub] [hudi] xushiyan commented on a diff in pull request #5695: [HUDI-3730][RFC-55] Improve metasync class design and simplify configs

Posted by GitBox <gi...@apache.org>.
xushiyan commented on code in PR #5695:
URL: https://github.com/apache/hudi/pull/5695#discussion_r913061498


##########
rfc/rfc-55/rfc-55.md:
##########
@@ -0,0 +1,148 @@
+## Rollout/Adoption Plan
+
+ - What impact (if any) will there be on existing users? 
+   - No impact. Any config changes should stay backward compatible with the existing configs.
+ - If we are changing behavior, how will we phase out the older behavior?
+ - If we need special migration tools, describe them here.
+ - When will we remove the existing behavior?
+
+## Test Plan
+
+Describe in a few sentences how the RFC will be tested. How will we know that the implementation works as expected? How will we know nothing broke?

Review Comment:
   Yes, will update the content here.





[GitHub] [hudi] hudi-bot commented on pull request #5695: [HUDI-3730][RFC-55] Improve hudi-sync classes design and simplify configs

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5695:
URL: https://github.com/apache/hudi/pull/5695#issuecomment-1179654545

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "7b73ba95a1615452e9ea96b911cea887fbdabe8e",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "7b73ba95a1615452e9ea96b911cea887fbdabe8e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 7b73ba95a1615452e9ea96b911cea887fbdabe8e UNKNOWN
   


[GitHub] [hudi] hudi-bot commented on pull request #5695: [HUDI-4146][RFC-55] for Improve Hive/Meta sync class design and hierachies

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5695:
URL: https://github.com/apache/hudi/pull/5695#issuecomment-1140425698

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "1c009f25c98856ec6fba2fce3a3f9dba4d2620b1",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8946",
       "triggerID" : "1c009f25c98856ec6fba2fce3a3f9dba4d2620b1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7b73ba95a1615452e9ea96b911cea887fbdabe8e",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "7b73ba95a1615452e9ea96b911cea887fbdabe8e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 1c009f25c98856ec6fba2fce3a3f9dba4d2620b1 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8946) 
   * 7b73ba95a1615452e9ea96b911cea887fbdabe8e UNKNOWN
   


[GitHub] [hudi] hudi-bot commented on pull request #5695: [HUDI-3730][RFC-55] Improve hudi-sync classes design and simplify configs

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5695:
URL: https://github.com/apache/hudi/pull/5695#issuecomment-1179654924

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "7b73ba95a1615452e9ea96b911cea887fbdabe8e",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9000",
       "triggerID" : "7b73ba95a1615452e9ea96b911cea887fbdabe8e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "015ee83641033b3e3e1ef1ab907013933e27f31d",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "015ee83641033b3e3e1ef1ab907013933e27f31d",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 7b73ba95a1615452e9ea96b911cea887fbdabe8e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9000) 
   * 015ee83641033b3e3e1ef1ab907013933e27f31d UNKNOWN
   


[GitHub] [hudi] codope merged pull request #5695: [HUDI-3730][RFC-55] Improve hudi-sync classes design and simplify configs

Posted by GitBox <gi...@apache.org>.
codope merged PR #5695:
URL: https://github.com/apache/hudi/pull/5695




[GitHub] [hudi] hudi-bot commented on pull request #5695: [HUDI-3730][RFC-55] Improve metasync class design and simplify configs

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5695:
URL: https://github.com/apache/hudi/pull/5695#issuecomment-1166567565

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "7b73ba95a1615452e9ea96b911cea887fbdabe8e",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9000",
       "triggerID" : "7b73ba95a1615452e9ea96b911cea887fbdabe8e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 7b73ba95a1615452e9ea96b911cea887fbdabe8e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9000) 
   


[GitHub] [hudi] codope commented on a diff in pull request #5695: [HUDI-3730][RFC-55] Improve metasync class design and simplify configs

Posted by GitBox <gi...@apache.org>.
codope commented on code in PR #5695:
URL: https://github.com/apache/hudi/pull/5695#discussion_r911823576


##########
rfc/rfc-55/rfc-55.md:
##########
@@ -0,0 +1,148 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-55: Improve metasync class design and simplify configs
+
+## Proposers
+
+- @<proposer1 @fengjian428>
+- @<proposer2 @xushiyan>
+
+## Approvers
+
+ - @<approver1 @vinothchandar>
+ - @<approver2 @codope>
+
+## Status
+
+JIRA: [HUDI-3730](https://issues.apache.org/jira/browse/HUDI-3730)
+
+## Abstract
+
+![ArchitectureMetaSync.png](ArchitectureMetaSync.png)
+
+Hudi now can sync meta to various Catalogs if user has need, and user can sync meta in different framework such as Spark, Flink, and Kafka connect. 
+The current situation is:
+
+* The way to generate Sync configs are inconsistent in different framework;
+* The abstraction of SyncClasses was designed for HiveSync, there are some duplicated code, useless method, parameters and config for new Catalogs, it needs to be improved. 
+ 
+That being said, we need a standard way to call meta sync. We also need a unified abstraction of XXXSyncTool , XXXSyncClient and XXXSyncConfig to handle all supported meta sync, including hms, bigquery, datahub, etc
+
+## Classes design
+
+![classDesign.png](classDesign.png)
+
+* for the engines which need use MetaSync, should implement _SupportMetaSync_ on the sync classes, such as DeltaSync, KafkaConnectTransactionServices and etc. for example: `runMetaSync();` then will sync metadata by every SyncToolClasses which indicated in config
+* redesign AbstractSyncClient and AbstractSyncTool, add Catalog Interface. make the hierarchy of classes more clearly and more precisely 
+* unify the way to generate SyncConfig and the way to call SyncTool,remove some useless parameters
+
+### `HoodieSyncTool`
+
+*Renamed from `AbstractSyncTool`.*
+
+```java
+public abstract class HoodieSyncTool implements AutoCloseable {
+
+  protected HoodieSyncClient syncClient;
+
+  /**
+   * Sync tool class is the entrypoint to run meta sync.
+   *
+   * @param props A bag of properties passed by users. It can contain all hoodie.* and any other config.
+   * @param hadoopConf Hadoop specific configs.
+   */
+  public HoodieSyncTool(Properties props, Configuration hadoopConf);
+
+  public abstract void syncHoodieTable();
+
+  public static void main(String[] args) {
+     // instantiate HoodieSyncConfig and concrete sync tool, and run sync.
+  }
+}
+```
+
+### `HoodieSyncConfig`
+
+```java
+public class HoodieSyncConfig extends HoodieConfig {
+
+  public static class HoodieSyncConfigParams {
+    // POJO class to take command line parameters
+    @Parameter()
+    private String basePath; // common essential parameters
+
+    public Properties toProps();
+  }
+
+  /**
+   * XXXSyncConfig is meant to be created and used by XXXSyncTool exclusively and internally.
+   *
+   * @param props passed from XXXSyncTool.
+   * @param hadoopConf passed from XXXSyncTool.
+   */
+  public HoodieSyncConfig(Properties props, Configuration hadoopConf);
+}
+
+public class HiveSyncConfig extends HoodieSyncConfig {
+
+  public static class HiveSyncConfigParams {
+
+    @Parameter()
+    private String syncMode;
+
+    // delegate common parameters to other XXXParams class
+    // this overcomes single-inheritance's inconvenience
+    // see https://jcommander.org/#_parameter_delegates
+    @ParametersDelegate()
+    private HoodieSyncConfigParams hoodieSyncConfigParams = new HoodieSyncConfigParams();
+
+    public Properties toProps();
+  }
+
+  public HiveSyncConfig(Properties props);
+}
+```
+
+### `HoodieSyncClient`
+
+*Renamed from `AbstractSyncHoodieClient`.*
+
+```java
+public abstract class HoodieSyncClient implements AutoCloseable {
+  // metastore-agnostic APIs
+}
+```
+
+## Config simplification
+
+- users should not need to set an additional table name
+- users should not need to set PartitionValueExtractor; partition values should be inferred automatically
+- remove `USE_JDBC` and fully adopt `SYNC_MODE`
+
+(more to be added)
+
+## Rollout/Adoption Plan
+
+ - What impact (if any) will there be on existing users? 
+   - No impact. Any config changes should be backward compatible with the existing configs.
+ - If we are changing behavior how will we phase out the older behavior?
+ - If we need special migration tools, describe them here.
+ - When will we remove the existing behavior?
+
+## Test Plan
+
+Describe in a few sentences how the RFC will be tested. How will we know that the implementation works as expected? How will we know nothing broke?

Review Comment:
   I think we should test this manually for the Glue catalog and BigQuery end to end. Integration tests already cover Hive.
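
The class design bullets in the quoted RFC hunk describe an engine-side class implementing `SupportMetaSync`, with `runMetaSync()` running every sync tool class named in the config. A minimal Java sketch of that flow follows. It is an illustration under stated assumptions, not the RFC's implementation: the config keys `sync.tool.classes` and `table.base.path`, the `syncProps()` accessor, and `RecordingSyncTool` are hypothetical, and the Hadoop `Configuration` constructor argument from the RFC is left out so the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Demo entry point; every name here that is not in the RFC is hypothetical.
class MetaSyncDemo {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("table.base.path", "/tmp/hudi/trips");
    props.setProperty(SupportMetaSync.SYNC_TOOL_CLASSES, RecordingSyncTool.class.getName());

    SupportMetaSync engine = () -> props; // stands in for DeltaSync and similar classes
    engine.runMetaSync();
    System.out.println(RecordingSyncTool.SYNCED);
  }
}

// Mirrors the RFC's HoodieSyncTool, minus the Hadoop Configuration argument.
abstract class HoodieSyncTool implements AutoCloseable {
  protected final Properties props;

  HoodieSyncTool(Properties props) {
    this.props = props;
  }

  public abstract void syncHoodieTable();

  @Override
  public void close() {
    // Nothing to release in this sketch.
  }
}

// Hypothetical tool that records each sync instead of calling a real catalog.
class RecordingSyncTool extends HoodieSyncTool {
  static final List<String> SYNCED = new ArrayList<>();

  public RecordingSyncTool(Properties props) {
    super(props);
  }

  @Override
  public void syncHoodieTable() {
    SYNCED.add(props.getProperty("table.base.path"));
  }
}

interface SupportMetaSync {
  // Hypothetical key: comma-separated list of sync tool class names.
  String SYNC_TOOL_CLASSES = "sync.tool.classes";

  // Hypothetical accessor for the properties handed to each tool.
  Properties syncProps();

  // Instantiates each configured tool reflectively, runs it, then closes it.
  default void runMetaSync() {
    Properties props = syncProps();
    for (String name : props.getProperty(SYNC_TOOL_CLASSES, "").split(",")) {
      String className = name.trim();
      if (className.isEmpty()) {
        continue;
      }
      try (HoodieSyncTool tool = (HoodieSyncTool) Class.forName(className)
          .getConstructor(Properties.class)
          .newInstance(props)) {
        tool.syncHoodieTable();
      } catch (ReflectiveOperationException e) {
        throw new IllegalStateException("Cannot run sync tool " + className, e);
      }
    }
  }
}
```

Because every tool shares one constructor shape in this sketch, the engine never needs catalog-specific code, which is the point of unifying how sync tools are created and called.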



##########
rfc/rfc-55/rfc-55.md:
##########
@@ -0,0 +1,148 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-55: Improve metasync class design and simplify configs
+
+## Proposers
+
+- @fengjian428
+- @xushiyan
+
+## Approvers
+
+ - @vinothchandar
+ - @codope
+
+## Status
+
+JIRA: [HUDI-3730](https://issues.apache.org/jira/browse/HUDI-3730)
+
+## Abstract
+
+![ArchitectureMetaSync.png](ArchitectureMetaSync.png)
+
+Hudi can now sync table metadata to various catalogs when users need it, and users can run meta sync from different frameworks such as Spark, Flink, and Kafka Connect.
+The current situation is:
+
+* Sync configs are generated inconsistently across frameworks.
+* The sync class abstractions were designed for HiveSync. For new catalogs they carry duplicated code as well as unused methods, parameters, and configs, so they need to be improved.
+ 
+That being said, we need a standard way to call meta sync. We also need a unified abstraction of XXXSyncTool, XXXSyncClient, and XXXSyncConfig to handle all supported meta syncs, including HMS, BigQuery, DataHub, etc.
+
+## Classes design
+
+![classDesign.png](classDesign.png)
+
+* Engine classes that need meta sync, such as DeltaSync and KafkaConnectTransactionServices, should implement _SupportMetaSync_. Calling, for example, `runMetaSync();` then syncs metadata through every sync tool class specified in the config.
+* Redesign AbstractSyncClient and AbstractSyncTool and add a Catalog interface, to make the class hierarchy clearer and more precise.
+* Unify the way SyncConfig is generated and the way SyncTool is called, and remove unused parameters.
+
+### `HoodieSyncTool`
+
+*Renamed from `AbstractSyncTool`.*
+
+```java
+public abstract class HoodieSyncTool implements AutoCloseable {
+
+  protected HoodieSyncClient syncClient;
+
+  /**
+   * Sync tool class is the entrypoint to run meta sync.
+   *
+   * @param props A bag of properties passed by users. It can contain all hoodie.* and any other config.
+   * @param hadoopConf Hadoop specific configs.
+   */
+  public HoodieSyncTool(Properties props, Configuration hadoopConf);
+
+  public abstract void syncHoodieTable();
+
+  public static void main(String[] args) {
+     // instantiate HoodieSyncConfig and concrete sync tool, and run sync.
+  }
+}
+```
+
+### `HoodieSyncConfig`
+
+```java
+public class HoodieSyncConfig extends HoodieConfig {
+
+  public static class HoodieSyncConfigParams {
+    // POJO class to take command line parameters
+    @Parameter()
+    private String basePath; // common essential parameters
+
+    public Properties toProps();
+  }
+
+  /**
+   * XXXSyncConfig is meant to be created and used by XXXSyncTool exclusively and internally.
+   *
+   * @param props passed from XXXSyncTool.
+   * @param hadoopConf passed from XXXSyncTool.
+   */
+  public HoodieSyncConfig(Properties props, Configuration hadoopConf);
+}
+
+public class HiveSyncConfig extends HoodieSyncConfig {
+
+  public static class HiveSyncConfigParams {
+
+    @Parameter()
+    private String syncMode;
+
+    // delegate common parameters to other XXXParams class
+    // this overcomes single-inheritance's inconvenience
+    // see https://jcommander.org/#_parameter_delegates
+    @ParametersDelegate()
+    private HoodieSyncConfigParams hoodieSyncConfigParams = new HoodieSyncConfigParams();
+
+    public Properties toProps();
+  }
+
+  public HiveSyncConfig(Properties props);
+}
+```
+
+### `HoodieSyncClient`
+
+*Renamed from `AbstractSyncHoodieClient`.*
+
+```java
+public abstract class HoodieSyncClient implements AutoCloseable {
+  // metastore-agnostic APIs
+}
+```
+
+## Config simplification
+
+- users should not need to set an additional table name
+- users should not need to set PartitionValueExtractor; partition values should be inferred automatically
+- remove `USE_JDBC` and fully adopt `SYNC_MODE`
+
+(more to be added)

Review Comment:
   @xushiyan @fengjian428 Are there any more config proposals to be added here? I think #5854 covers the first three. Please update the RFC if there are more.
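
One way to read the third item in the quoted list (remove `USE_JDBC` and fully adopt `SYNC_MODE`) while keeping old configs working is a resolver that prefers the new setting and falls back to the legacy flag only when the new one is absent. The Java sketch below illustrates that idea. The property key strings, the mode names `jdbc` and `hiveql`, and the default of `true` for the legacy flag are assumptions made for illustration; none of them is stated in the RFC text above, so check them against the Hudi version in use.

```java
import java.util.Locale;
import java.util.Properties;

final class SyncModeResolver {
  // Assumed key names; verify against the Hudi release you target.
  static final String SYNC_MODE = "hoodie.datasource.hive_sync.mode";
  static final String USE_JDBC = "hoodie.datasource.hive_sync.use_jdbc";

  private SyncModeResolver() {
  }

  // Returns the effective sync mode as a lower-case string.
  static String resolve(Properties props) {
    String mode = props.getProperty(SYNC_MODE, "").trim();
    if (!mode.isEmpty()) {
      // The new setting wins whenever it is present.
      return mode.toLowerCase(Locale.ROOT);
    }
    // Legacy fallback: map the boolean flag onto a mode name (assumed mapping and default).
    boolean useJdbc = Boolean.parseBoolean(props.getProperty(USE_JDBC, "true"));
    return useJdbc ? "jdbc" : "hiveql";
  }

  public static void main(String[] args) {
    Properties legacy = new Properties();
    legacy.setProperty(USE_JDBC, "false");
    System.out.println(resolve(legacy));
  }
}
```

Keeping the fallback in one place means callers only ever see a sync mode, so the legacy flag can later be dropped without touching them.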





[GitHub] [hudi] hudi-bot commented on pull request #5695: [HUDI-4146][RFC-55] for Improve Hive/Meta sync class design and hierachies

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5695:
URL: https://github.com/apache/hudi/pull/5695#issuecomment-1140426203

   ## CI report:
   
   * 1c009f25c98856ec6fba2fce3a3f9dba4d2620b1 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8946) 
   * 7b73ba95a1615452e9ea96b911cea887fbdabe8e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9000) 
   




[GitHub] [hudi] hudi-bot commented on pull request #5695: [HUDI-4146][RFC-55] for Improve Hive/Meta sync class design and hierachies

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #5695:
URL: https://github.com/apache/hudi/pull/5695#issuecomment-1138549273

   ## CI report:
   
   * 1c009f25c98856ec6fba2fce3a3f9dba4d2620b1 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8946) 
   

