Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/07/18 09:10:17 UTC

[GitHub] [hudi] wzx140 opened a new pull request, #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

wzx140 opened a new pull request, #6132:
URL: https://github.com/apache/hudi/pull/6132

   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
     - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
     - *Added integration tests for end-to-end.*
     - *Added HoodieClientWriteTest to verify the change.*
     - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] wzx140 commented on a diff in pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
wzx140 commented on code in PR #6132:
URL: https://github.com/apache/hudi/pull/6132#discussion_r925182441


##########
rfc/rfc-46/rfc-46.md:
##########
@@ -84,59 +84,90 @@ is known to have poor performance (compared to non-reflection based instantiatio
 
 #### Record Merge API
 
-Stateless component interface providing for API Combining Records will look like following:
+`combineAndGetUpdateValue` and `preCombine` will converge into a single API. The stateless component interface providing the Record Merge API will look like the following:
 
 ```java
-interface HoodieMerge {
-   HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer);
-
-   Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException;
+interface HoodieRecordMerger {
+   // combineAndGetUpdateValue and precombine
+   Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException;
+   
+   // The record type handled by the current merger
+   // SPARK, AVRO, FLINK
+   HoodieRecordType getRecordType();
 }
 
-   /**
-    * Spark-specific implementation 
-    */
-   class HoodieSparkRecordMerge implements HoodieMerge {
+/**
+ * Spark-specific implementation 
+ */
+class HoodieSparkRecordMerger implements HoodieRecordMerger {
 
-      @Override
-      public HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer) {
-        // HoodieSparkRecords preCombine
-      }
+   @Override
+   Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException {
+     // HoodieSparkRecord precombine and combineAndGetUpdateValue 
+   }
 
-      @Override
-      public Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) {
-         // HoodieSparkRecord combineAndGetUpdateValue
-      }
+   @Override
+   HoodieRecordType getRecordType() {
+     return HoodieRecordType.SPARK;
    }
+}
    
-   /**
-    * Flink-specific implementation 
-    */
-   class HoodieFlinkRecordMerge implements HoodieMerge {
-
-      @Override
-      public HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer) {
-        // HoodieFlinkRecord preCombine
-      }
+/**
+ * Flink-specific implementation 
+ */
+class HoodieFlinkRecordMerger implements HoodieRecordMerger {
+
+   @Override
+   Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException {
+      // HoodieFlinkRecord precombine and combineAndGetUpdateValue 
+   }
 
-      @Override
-      public Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) {
-         // HoodieFlinkRecord combineAndGetUpdateValue
-      }
+   @Override
+   HoodieRecordType getRecordType() {
+      return HoodieRecordType.FLINK;
    }
+}
 ```
 Where user can provide their own subclass implementing such interface for the engines of interest.
 
-#### Migration from `HoodieRecordPayload` to `HoodieMerge`
+#### Migration from `HoodieRecordPayload` to `HoodieRecordMerger`
 
 To warrant backward-compatibility (BWC) on the code-level with already created subclasses of `HoodieRecordPayload` currently
-already used in production by Hudi users, we will provide a BWC-bridge in the form of instance of `HoodieMerge`, that will 
+already used in production by Hudi users, we will provide a BWC-bridge in the form of an instance of `HoodieRecordMerger` called `HoodieAvroRecordMerger`, which will 
 be using user-defined subclass of `HoodieRecordPayload` to combine the records.
 
 Leveraging such a bridge will provide for a seamless BWC migration to the 0.11 release; however, it will remove the performance 
 benefit of this refactoring, since it would unavoidably have to perform conversion to intermediate representation (Avro). To realize
 full-suite of benefits of this refactoring, users will have to migrate their merging logic out of `HoodieRecordPayload` subclass and into
-new `HoodieMerge` implementation.
+new `HoodieRecordMerger` implementation.
+
+`preCombine` is used to merge records from log files or incoming records; `combineAndGetUpdateValue` is used to merge a record from a log file with a record from a base file.
+These two merge logics are not exactly the same for some record payloads, such as `OverwriteWithLatestAvroPayload`.
+We add an enum in `HoodieRecord` to mark where the record comes from (BASE, LOG, or WRITE). `HoodieAvroRecordMerger`'s API will look like the following:

Review Comment:
   For backward compatibility, I think it is necessary to maintain this difference in OverwriteWithLatestAvroPayload. But in a future implementation based on HoodieRecordMerger, these semantics can be unified.
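To illustrate the source-aware merging discussed above, here is a minimal, self-contained sketch. The types (`RecordSource`, `Rec`, `BridgeMerger`) are simplified stand-ins, not the actual Hudi classes: it only models how a BWC-bridge merger could dispatch between the two legacy semantics based on where each record came from.

```java
import java.util.Optional;

// Hypothetical, simplified model of the BWC bridge described above.
enum RecordSource { BASE, LOG, WRITE }

class Rec {
    final RecordSource source;
    final long orderingValue;
    final String payload;

    Rec(RecordSource source, long orderingValue, String payload) {
        this.source = source;
        this.orderingValue = orderingValue;
        this.payload = payload;
    }
}

class BridgeMerger {
    // Pick the legacy semantics by record origin:
    // - older record from BASE file -> combineAndGetUpdateValue semantics
    //   (newer record wins, as in OverwriteWithLatestAvroPayload);
    // - both from LOG/WRITE -> preCombine semantics
    //   (record with the higher ordering value wins).
    static Optional<Rec> merge(Rec older, Rec newer) {
        if (older.source == RecordSource.BASE) {
            return Optional.of(newer);
        }
        return Optional.of(newer.orderingValue >= older.orderingValue ? newer : older);
    }
}
```

In a unified `HoodieRecordMerger`, this branch is an internal detail of the bridge implementation rather than two separate user-facing methods.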





[GitHub] [hudi] hudi-bot commented on pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6132:
URL: https://github.com/apache/hudi/pull/6132#issuecomment-1190051367

   ## CI report:
   
   * 9ce2c608b05103e2cf7123504252fa021b7bfcf6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10032) 
   * 63384a9ee861c2fb14da988d404382ced69f733d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10086) 
   * 011f79d738f845e2ae9224529f7ad5fa8250f4f2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10088) 
   * b39ec4efe4a084dcd9ca6ffd56bb55b4d76d5ac7 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10093) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] wzx140 commented on a diff in pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
wzx140 commented on code in PR #6132:
URL: https://github.com/apache/hudi/pull/6132#discussion_r925101281


##########
rfc/rfc-46/rfc-46.md:
##########
@@ -156,13 +187,76 @@ Following major components will be refactored:
 3. `HoodieRealtimeRecordReader`s 
    1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default is chosen according to your engine type.
+
+### Public API in HoodieRecord
+Because we implement different types of records, we need to move some functions from AvroUtils into HoodieRecord for the different data types (avro, InternalRow, RowData).
+Its public API will look like the following:
+
+```java
+class HoodieRecord {
+
+   /**
+    * Get column in record to support RDDCustomColumnsSortPartitioner
+    */
+   Object getRecordColumnValues(Schema recordSchema, String[] columns, boolean consistentLogicalTimestampEnabled);
+
+   /**
+    * Support bootstrap.
+    */
+   HoodieRecord mergeWith(HoodieRecord other) throws IOException;
+
+   /**
+    * Rewrite record into new schema(add meta columns)
+    */
+   HoodieRecord rewriteRecord(Schema recordSchema, Properties props, Schema targetSchema) throws IOException;
+
+   /**
+    * Support schema evolution.
+    */
+   HoodieRecord rewriteRecordWithNewSchema(Schema recordSchema, Properties props, Schema newSchema, Map<String, String> renameCols) throws IOException;

Review Comment:
   This is strongly related to the data type (avro, InternalRow). The implementation for each kind of data is different, so it is better to put it in HoodieRecord. Otherwise we would need to use different utilities depending on the HoodieRecordType.
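The design point in this comment can be sketched with simplified stand-in classes (hypothetical names, not the Hudi types): putting the rewrite on the record class lets each engine-specific subclass supply its own implementation, so callers never have to switch on a record-type enum.

```java
// Simplified stand-ins: each record flavor implements its own rewrite,
// so call sites dispatch polymorphically instead of branching on a type tag.
abstract class AbstractRecord {
    abstract String rewriteWithNewSchema(String newSchema);
}

class AvroLikeRecord extends AbstractRecord {
    @Override
    String rewriteWithNewSchema(String newSchema) {
        return "avro:" + newSchema; // an Avro-specific rewrite would go here
    }
}

class RowLikeRecord extends AbstractRecord {
    @Override
    String rewriteWithNewSchema(String newSchema) {
        return "row:" + newSchema;  // an InternalRow-specific rewrite would go here
    }
}
```

A caller holding an `AbstractRecord` reference gets the right rewrite without consulting a `HoodieRecordType`-style enum, which is the alternative the comment argues against.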





[GitHub] [hudi] hudi-bot commented on pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6132:
URL: https://github.com/apache/hudi/pull/6132#issuecomment-1189867251

   ## CI report:
   
   * 9ce2c608b05103e2cf7123504252fa021b7bfcf6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10032) 
   * 63384a9ee861c2fb14da988d404382ced69f733d UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] alexeykudinkin commented on pull request #6132: [RFC-46][HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on PR #6132:
URL: https://github.com/apache/hudi/pull/6132#issuecomment-1205688069

   @wzx140 seems like this PR now carries all the changes related to RFC-46. 




[GitHub] [hudi] wzx140 commented on a diff in pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
wzx140 commented on code in PR #6132:
URL: https://github.com/apache/hudi/pull/6132#discussion_r925185490


##########
rfc/rfc-46/rfc-46.md:
##########
@@ -156,13 +187,76 @@ Following major components will be refactored:
 3. `HoodieRealtimeRecordReader`s 
    1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default is chosen according to your engine type.
+
+### Public API in HoodieRecord
+Because we implement different types of records, we need to move some functions from AvroUtils into HoodieRecord for the different data types (avro, InternalRow, RowData).
+Its public API will look like the following:
+
+```java
+class HoodieRecord {
+
+   /**
+    * Get column in record to support RDDCustomColumnsSortPartitioner
+    */
+   Object getRecordColumnValues(Schema recordSchema, String[] columns, boolean consistentLogicalTimestampEnabled);
+
+   /**
+    * Support bootstrap.
+    */
+   HoodieRecord mergeWith(HoodieRecord other) throws IOException;

Review Comment:
   I think the logic of this bootstrap merge is fixed and should not be exposed to users in HoodieRecordMerger.
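Assuming the bootstrap merge is essentially a column-wise stitch of the skeleton record (Hudi meta columns) with the raw data record, it can be sketched as follows. `BootstrapMerge` and the map representation are illustrative stand-ins, not the Hudi implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

class BootstrapMerge {
    // Hypothetical sketch: a bootstrap mergeWith stitches the skeleton record
    // (meta columns) together with the data record. The logic is fixed,
    // which is why the reviewer argues it need not be user-pluggable.
    static Map<String, Object> mergeWith(Map<String, Object> skeleton, Map<String, Object> data) {
        Map<String, Object> merged = new LinkedHashMap<>(skeleton); // meta columns first
        merged.putAll(data);                                        // then data columns
        return merged;
    }
}
```

Since there is no user-definable decision in this stitch, keeping it on `HoodieRecord` rather than in `HoodieRecordMerger` avoids implying it can be customized.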





[GitHub] [hudi] hudi-bot commented on pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6132:
URL: https://github.com/apache/hudi/pull/6132#issuecomment-1190107408

   ## CI report:
   
   * 63384a9ee861c2fb14da988d404382ced69f733d Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10086) 
   * 011f79d738f845e2ae9224529f7ad5fa8250f4f2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10088) 
   * b39ec4efe4a084dcd9ca6ffd56bb55b4d76d5ac7 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10093) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6132:
URL: https://github.com/apache/hudi/pull/6132#issuecomment-1190391346

   ## CI report:
   
   * b39ec4efe4a084dcd9ca6ffd56bb55b4d76d5ac7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10093) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6132:
URL: https://github.com/apache/hudi/pull/6132#issuecomment-1187714025

   ## CI report:
   
   * 9ce2c608b05103e2cf7123504252fa021b7bfcf6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10032) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] wzx140 commented on a diff in pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
wzx140 commented on code in PR #6132:
URL: https://github.com/apache/hudi/pull/6132#discussion_r925098172


##########
rfc/rfc-46/rfc-46.md:
##########
@@ -156,13 +187,76 @@ Following major components will be refactored:
 3. `HoodieRealtimeRecordReader`s 
    1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default is chosen according to your engine type.
+
+### Public API in HoodieRecord
+Because we implement different types of records, we need to move some functions from AvroUtils into HoodieRecord for the different data types (avro, InternalRow, RowData).
+Its public API will look like the following:
+
+```java
+class HoodieRecord {
+
+   /**
+    * Get column in record to support RDDCustomColumnsSortPartitioner
+    */
+   Object getRecordColumnValues(Schema recordSchema, String[] columns, boolean consistentLogicalTimestampEnabled);

Review Comment:
   HoodieAvroRecord (HoodieRecordPayload) should not hold the schema, since that would add memory overhead for each record. So the recordSchema is passed in as a parameter for HoodieRecordPayload to use.
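The memory argument in this comment can be illustrated with a toy stand-in (hypothetical `LeanRecord` class, schema modeled as a plain string): the record stores only its payload, and the shared schema is supplied at call time rather than held as a per-record field.

```java
class LeanRecord {
    // The record holds only its payload; there is no per-record schema
    // reference, so N records do not carry N schema pointers.
    private final byte[] payload;

    LeanRecord(byte[] payload) {
        this.payload = payload;
    }

    // The (shared) schema is passed in per call instead of being a field.
    // Splitting a comma-separated string is a toy stand-in for real decoding.
    int fieldCount(String recordSchema) {
        return recordSchema.split(",").length;
    }

    int payloadSize() {
        return payload.length;
    }
}
```

One shared schema object can then serve millions of records, which is the overhead the comment is guarding against.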





[GitHub] [hudi] wzx140 commented on a diff in pull request #6132: [RFC-46][HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
wzx140 commented on code in PR #6132:
URL: https://github.com/apache/hudi/pull/6132#discussion_r973951218


##########
rfc/rfc-46/rfc-46.md:
##########
@@ -128,21 +173,88 @@ Following major components will be refactored:
 
 1. `HoodieWriteHandle`s will be  
    1. Accepting `HoodieRecord` instead of raw Avro payload (avoiding Avro conversion)
-   2. Using Combining API engine to merge records (when necessary) 
+   2. Using Record Merge API to merge records (when necessary) 
    3. Passes `HoodieRecord` as is to `FileWriter`
 2. `HoodieFileWriter`s will be 
    1. Accepting `HoodieRecord`
    2. Will be engine-specific (so that they're able to handle internal record representation)
 3. `HoodieRealtimeRecordReader`s 
    1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default is chosen according to your engine type.
+
+### Public API in HoodieRecord
+Because we implement different types of records, we need to implement functionality similar to AvroUtils in HoodieRecord for the different data types (avro, InternalRow, RowData).
+Its public API will look like the following:
+
+```java
+import java.util.Properties;
+
+class HoodieRecord {
+
+   /**
+    * Get column in record to support RDDCustomColumnsSortPartitioner
+    */
+   Object getRecordColumnValues(Schema recordSchema, String[] columns,

Review Comment:
   This is consistent with HoodieAvroUtils#getRecordColumnValues: the column value if a single column is given, or the String values concatenated by commas otherwise.
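The semantics described in this comment can be sketched as follows. This is an illustrative stand-in, assuming a plain `Map` for field lookup rather than the real record and schema types:

```java
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ColumnValues {
    // Mirrors the behavior attributed to HoodieAvroUtils#getRecordColumnValues:
    // one column returns its value directly; multiple columns return their
    // String values concatenated with commas.
    static Object getRecordColumnValues(Map<String, Object> record, String[] columns) {
        if (columns.length == 1) {
            return record.get(columns[0]);
        }
        return Stream.of(columns)
                .map(c -> String.valueOf(record.get(c)))
                .collect(Collectors.joining(","));
    }
}
```

Note the asymmetry: a single column keeps its original type (useful for sorting), while multiple columns collapse to a single comparable String key.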





[GitHub] [hudi] wzx140 commented on a diff in pull request #6132: [RFC-46][HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
wzx140 commented on code in PR #6132:
URL: https://github.com/apache/hudi/pull/6132#discussion_r973952869


##########
rfc/rfc-46/rfc-46.md:
##########
@@ -128,21 +173,88 @@ Following major components will be refactored:
 
 1. `HoodieWriteHandle`s will be  
    1. Accepting `HoodieRecord` instead of raw Avro payload (avoiding Avro conversion)
-   2. Using Combining API engine to merge records (when necessary) 
+   2. Using Record Merge API to merge records (when necessary) 
    3. Passes `HoodieRecord` as is to `FileWriter`
 2. `HoodieFileWriter`s will be 
    1. Accepting `HoodieRecord`
    2. Will be engine-specific (so that they're able to handle internal record representation)
 3. `HoodieRealtimeRecordReader`s 
    1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default is chosen according to your engine type.
+
+### Public API in HoodieRecord
+Because we implement different types of records, we need to implement functionality similar to AvroUtils in HoodieRecord for the different data types (avro, InternalRow, RowData).
+Its public API will look like the following:
+
+```java
+import java.util.Properties;
+
+class HoodieRecord {
+
+   /**
+    * Get column in record to support RDDCustomColumnsSortPartitioner
+    */
+   Object getRecordColumnValues(Schema recordSchema, String[] columns,
+           boolean consistentLogicalTimestampEnabled);
+
+   /**
+    * Support bootstrap.
+    */
+   HoodieRecord mergeWith(HoodieRecord other, Schema targetSchema) throws IOException;
+
+   /**
+    * Rewrite record into new schema(add meta columns)
+    */
+   HoodieRecord rewriteRecord(Schema recordSchema, Properties props, Schema targetSchema)
+           throws IOException;
+
+   /**
+    * Support schema evolution.
+    */
+   HoodieRecord rewriteRecordWithNewSchema(Schema recordSchema, Properties props, Schema newSchema,
+           Map<String, String> renameCols) throws IOException;
+
+   HoodieRecord updateValues(Schema recordSchema, Properties props,

Review Comment:
   This is also used in HoodieHFileDataBlock#serializeRecord to update the recordKey field.
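The use case in this comment, overwriting a field such as the record key just before serialization, can be sketched with an illustrative stand-in (`UpdateValues` and the map representation are hypothetical, not the Hudi types):

```java
import java.util.HashMap;
import java.util.Map;

class UpdateValues {
    // Hypothetical sketch of an updateValues-style call: return a copy of the
    // record with the given fields overwritten (e.g. re-setting the recordKey
    // before serialization), leaving the original record untouched.
    static Map<String, Object> updateValues(Map<String, Object> record, Map<String, Object> updates) {
        Map<String, Object> copy = new HashMap<>(record);
        copy.putAll(updates);
        return copy;
    }
}
```

Returning a copy rather than mutating in place keeps the original record reusable by other writers or readers in the same pipeline.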





[GitHub] [hudi] hudi-bot commented on pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6132:
URL: https://github.com/apache/hudi/pull/6132#issuecomment-1187045311

   ## CI report:
   
   * 1f59ce7aa51032d91b2fe6748c310fa03c0b1c1d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10030) 
   * 9ce2c608b05103e2cf7123504252fa021b7bfcf6 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] wzx140 commented on a diff in pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
wzx140 commented on code in PR #6132:
URL: https://github.com/apache/hudi/pull/6132#discussion_r925186261


##########
rfc/rfc-46/rfc-46.md:
##########
@@ -156,13 +187,76 @@ Following major components will be refactored:
 3. `HoodieRealtimeRecordReader`s 
    1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default merger will be chosen according to your engine type.
+
+### Public Api in HoodieRecord
+Because we implement different record types, we need to move some functions from AvroUtils into HoodieRecord so that each data representation (Avro, InternalRow, RowData) can provide its own implementation.
+Its public API will look like the following:
+
+```java
+class HoodieRecord {
+
+   /**
+    * Get column values from the record to support RDDCustomColumnsSortPartitioner.
+    */
+   Object getRecordColumnValues(Schema recordSchema, String[] columns, boolean consistentLogicalTimestampEnabled);
+
+   /**
+    * Support bootstrap.
+    */
+   HoodieRecord mergeWith(HoodieRecord other) throws IOException;
+
+   /**
+    * Rewrite the record into a new schema (adding meta columns).
+    */
+   HoodieRecord rewriteRecord(Schema recordSchema, Properties props, Schema targetSchema) throws IOException;
+
+   /**
+    * Support schema evolution.
+    */
+   HoodieRecord rewriteRecordWithNewSchema(Schema recordSchema, Properties props, Schema newSchema, Map<String, String> renameCols) throws IOException;
+
+   HoodieRecord addMetadataValues(Schema recordSchema, Properties props, Map<HoodieMetadataField, String> metadataValues) throws IOException;
+
+   /**
+    * Is deleted.
+    */
+   boolean isPresent(Schema recordSchema, Properties props) throws IOException;

Review Comment:
   fixed





[GitHub] [hudi] wzx140 commented on pull request #6132: [RFC-46][HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
wzx140 commented on PR #6132:
URL: https://github.com/apache/hudi/pull/6132#issuecomment-1251754051

   @alexeykudinkin Thank you for your suggestion. This will be fixed soon.




[GitHub] [hudi] wzx140 commented on pull request #6132: [RFC-46][HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
wzx140 commented on PR #6132:
URL: https://github.com/apache/hudi/pull/6132#issuecomment-1250682959

   @alexeykudinkin Sorry for not watching this PR for a long time. Most of the comments have been discussed in Slack; is there anything else that needs to be changed?




[GitHub] [hudi] yuzhaojing merged pull request #6132: [RFC-46][HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
yuzhaojing merged PR #6132:
URL: https://github.com/apache/hudi/pull/6132




[GitHub] [hudi] hudi-bot commented on pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6132:
URL: https://github.com/apache/hudi/pull/6132#issuecomment-1189870174

   ## CI report:
   
   * 9ce2c608b05103e2cf7123504252fa021b7bfcf6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10032) 
   * 63384a9ee861c2fb14da988d404382ced69f733d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10086) 
   * 011f79d738f845e2ae9224529f7ad5fa8250f4f2 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] wzx140 commented on a diff in pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
wzx140 commented on code in PR #6132:
URL: https://github.com/apache/hudi/pull/6132#discussion_r925105200


##########
rfc/rfc-46/rfc-46.md:
##########
@@ -156,13 +187,76 @@ Following major components will be refactored:
 3. `HoodieRealtimeRecordReader`s 
    1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default merger will be chosen according to your engine type.
+
+### Public Api in HoodieRecord
+Because we implement different record types, we need to move some functions from AvroUtils into HoodieRecord so that each data representation (Avro, InternalRow, RowData) can provide its own implementation.
+Its public API will look like the following:
+
+```java
+class HoodieRecord {
+
+   /**
+    * Get column values from the record to support RDDCustomColumnsSortPartitioner.
+    */
+   Object getRecordColumnValues(Schema recordSchema, String[] columns, boolean consistentLogicalTimestampEnabled);
+
+   /**
+    * Support bootstrap.
+    */
+   HoodieRecord mergeWith(HoodieRecord other) throws IOException;
+
+   /**
+    * Rewrite the record into a new schema (adding meta columns).
+    */
+   HoodieRecord rewriteRecord(Schema recordSchema, Properties props, Schema targetSchema) throws IOException;
+
+   /**
+    * Support schema evolution.
+    */
+   HoodieRecord rewriteRecordWithNewSchema(Schema recordSchema, Properties props, Schema newSchema, Map<String, String> renameCols) throws IOException;
+
+   HoodieRecord addMetadataValues(Schema recordSchema, Properties props, Map<HoodieMetadataField, String> metadataValues) throws IOException;
+
+   /**
+    * Is deleted.
+    */
+   boolean isPresent(Schema recordSchema, Properties props) throws IOException;
+
+   /**
+    * Is EmptyRecord. Generated by ExpressionPayload.
+    */
+   boolean shouldIgnore(Schema recordSchema, Properties props) throws IOException;
+
+   /**
+    * This method is used to extract the HoodieKey without going through the keyGenerator.
+    */
+   HoodieRecord expansion(

Review Comment:
   HoodieAvroRecord needs to use preCombineField to extract the orderingVal. Would it be better to put payloadClass and preCombineField into the props?
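To make the suggestion concrete, here is a minimal sketch (the property keys and class names below are illustrative, not real Hudi config keys) of stashing payloadClass and preCombineField in the `Properties` object that the record APIs already receive, instead of threading them through method signatures:

```java
import java.util.Properties;

// Hypothetical sketch of the suggestion above: the caller stores payloadClass
// and preCombineField in the Properties object that every HoodieRecord API
// already takes, and the merger side reads them back out.
class MergePropsExample {
    // Illustrative keys, not actual Hudi configuration names.
    static final String PAYLOAD_CLASS_KEY = "hoodie.example.payload.class";
    static final String PRECOMBINE_FIELD_KEY = "hoodie.example.precombine.field";

    static Properties buildProps(String payloadClass, String preCombineField) {
        Properties props = new Properties();
        props.setProperty(PAYLOAD_CLASS_KEY, payloadClass);
        props.setProperty(PRECOMBINE_FIELD_KEY, preCombineField);
        return props;
    }

    // The merger-side helper reads the ordering field out of props instead of
    // taking it as an explicit argument; "ts" is an assumed fallback.
    static String preCombineField(Properties props) {
        return props.getProperty(PRECOMBINE_FIELD_KEY, "ts");
    }
}
```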





[GitHub] [hudi] wzx140 commented on a diff in pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
wzx140 commented on code in PR #6132:
URL: https://github.com/apache/hudi/pull/6132#discussion_r925103047


##########
rfc/rfc-46/rfc-46.md:
##########
@@ -156,13 +187,76 @@ Following major components will be refactored:
 3. `HoodieRealtimeRecordReader`s 
    1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default merger will be chosen according to your engine type.
+
+### Public Api in HoodieRecord
+Because we implement different record types, we need to move some functions from AvroUtils into HoodieRecord so that each data representation (Avro, InternalRow, RowData) can provide its own implementation.
+Its public API will look like the following:
+
+```java
+class HoodieRecord {
+
+   /**
+    * Get column values from the record to support RDDCustomColumnsSortPartitioner.
+    */
+   Object getRecordColumnValues(Schema recordSchema, String[] columns, boolean consistentLogicalTimestampEnabled);
+
+   /**
+    * Support bootstrap.
+    */
+   HoodieRecord mergeWith(HoodieRecord other) throws IOException;
+
+   /**
+    * Rewrite the record into a new schema (adding meta columns).
+    */
+   HoodieRecord rewriteRecord(Schema recordSchema, Properties props, Schema targetSchema) throws IOException;
+
+   /**
+    * Support schema evolution.
+    */
+   HoodieRecord rewriteRecordWithNewSchema(Schema recordSchema, Properties props, Schema newSchema, Map<String, String> renameCols) throws IOException;
+
+   HoodieRecord addMetadataValues(Schema recordSchema, Properties props, Map<HoodieMetadataField, String> metadataValues) throws IOException;
+
+   /**
+    * Is deleted.
+    */
+   boolean isPresent(Schema recordSchema, Properties props) throws IOException;
+
+   /**
+    * Is EmptyRecord. Generated by ExpressionPayload.
+    */
+   boolean shouldIgnore(Schema recordSchema, Properties props) throws IOException;
+
+   /**
+    * This method is used to extract the HoodieKey without going through the keyGenerator.
+    */
+   HoodieRecord expansion(

Review Comment:
   This method is used in the LogScanner. It expands the raw data (Avro, InternalRow) into a HoodieRecord that carries a HoodieKey.
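A toy sketch of that expansion step (the classes here are hypothetical; only the meta-column names mirror Hudi's): the raw row scanned from a log already carries populated meta columns, so the key is read straight out of it rather than re-derived by a key generator:

```java
import java.util.Map;

// Toy illustration of "expansion": a raw engine row (here just a Map) already
// holds the Hudi meta columns, so a keyed record can be built directly from it.
class KeyedRecord {
    final String recordKey;
    final String partitionPath;
    final Map<String, Object> data;

    KeyedRecord(String recordKey, String partitionPath, Map<String, Object> data) {
        this.recordKey = recordKey;
        this.partitionPath = partitionPath;
        this.data = data;
    }

    // expansion: wrap the raw row into a record that knows its HoodieKey,
    // without invoking any key generator.
    static KeyedRecord expansion(Map<String, Object> rawRow) {
        return new KeyedRecord(
            (String) rawRow.get("_hoodie_record_key"),
            (String) rawRow.get("_hoodie_partition_path"),
            rawRow);
    }
}
```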





[GitHub] [hudi] wzx140 commented on a diff in pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
wzx140 commented on code in PR #6132:
URL: https://github.com/apache/hudi/pull/6132#discussion_r927337196


##########
rfc/rfc-46/rfc-46.md:
##########
@@ -84,59 +84,90 @@ is known to have poor performance (compared to non-reflection based instantiatio
 
 #### Record Merge API
 
-Stateless component interface providing for API Combining Records will look like following:
+CombineAndGetUpdateValue and Precombine will converge into one API. The stateless component interface providing the record-combining API will look like the following:
 
 ```java
-interface HoodieMerge {
-   HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer);
-
-   Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException;
+interface HoodieRecordMerger {
+   // combineAndGetUpdateValue and precombine
+   Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException;
+   
+   // The record type handled by the current merger
+   // SPARK, AVRO, FLINK
+   HoodieRecordType getRecordType();
 }
 
-   /**
-    * Spark-specific implementation 
-    */
-   class HoodieSparkRecordMerge implements HoodieMerge {
+/**
+ * Spark-specific implementation 
+ */
+class HoodieSparkRecordMerger implements HoodieRecordMerger {
 
-      @Override
-      public HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer) {
-        // HoodieSparkRecords preCombine
-      }
+   @Override
+   Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException {
+     // HoodieSparkRecord precombine and combineAndGetUpdateValue 
+   }
 
-      @Override
-      public Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) {
-         // HoodieSparkRecord combineAndGetUpdateValue
-      }
+   @Override
+   HoodieRecordType getRecordType() {
+     return HoodieRecordType.SPARK;
    }
+}
    
-   /**
-    * Flink-specific implementation 
-    */
-   class HoodieFlinkRecordMerge implements HoodieMerge {
-
-      @Override
-      public HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer) {
-        // HoodieFlinkRecord preCombine
-      }
+/**
+ * Flink-specific implementation 
+ */
+class HoodieFlinkRecordMerger implements HoodieRecordMerger {
+
+   @Override
+   Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException {
+      // HoodieFlinkRecord precombine and combineAndGetUpdateValue 
+   }
 
-      @Override
-      public Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) {
-         // HoodieFlinkRecord combineAndGetUpdateValue
-      }
+   @Override
+   HoodieRecordType getRecordType() {
+      return HoodieRecordType.FLINK;
    }
+}
 ```
 Where user can provide their own subclass implementing such interface for the engines of interest.
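A minimal, self-contained sketch of this merger contract, with a toy record type standing in for `HoodieRecord` (all classes below are illustrative, not the actual Hudi implementation):

```java
import java.util.Optional;

// Self-contained sketch of the HoodieRecordMerger contract, using toy types.
class MergerSketch {
    enum RecordType { SPARK, AVRO, FLINK }

    static class Rec {
        final String key;
        final long orderingVal;
        Rec(String key, long orderingVal) { this.key = key; this.orderingVal = orderingVal; }
    }

    interface RecordMerger {
        // Combines the roles of preCombine and combineAndGetUpdateValue;
        // returning Optional.empty() would mean the merge result is a delete.
        Optional<Rec> merge(Rec older, Rec newer);
        // The record type handled by this merger (SPARK, AVRO, FLINK).
        RecordType getRecordType();
    }

    // Latest-ordering-value-wins merger, analogous in spirit to
    // OverwriteWithLatest-style payload semantics.
    static class LatestWinsMerger implements RecordMerger {
        @Override
        public Optional<Rec> merge(Rec older, Rec newer) {
            return Optional.of(newer.orderingVal >= older.orderingVal ? newer : older);
        }

        @Override
        public RecordType getRecordType() {
            return RecordType.AVRO;
        }
    }
}
```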
 
-#### Migration from `HoodieRecordPayload` to `HoodieMerge`
+#### Migration from `HoodieRecordPayload` to `HoodieRecordMerger`
 
 To warrant backward-compatibility (BWC) on the code-level with already created subclasses of `HoodieRecordPayload` currently
-already used in production by Hudi users, we will provide a BWC-bridge in the form of instance of `HoodieMerge`, that will 
+already used in production by Hudi users, we will provide a BWC-bridge in the form of an instance of `HoodieRecordMerger` called `HoodieAvroRecordMerger`, that will 
 be using user-defined subclass of `HoodieRecordPayload` to combine the records.
 
 Leveraging such a bridge will provide for a seamless BWC migration to the 0.11 release; however, it will remove the performance 
 benefit of this refactoring, since it would unavoidably have to perform conversion to the intermediate representation (Avro). To realize
 the full suite of benefits of this refactoring, users will have to migrate their merging logic out of the `HoodieRecordPayload` subclass and into
-new `HoodieMerge` implementation.
+new `HoodieRecordMerger` implementation.
+
+Precombine is used to merge records from logs or incoming records; CombineAndGetUpdateValue is used to merge a record from a log file with a record from a base file.
+These two merge logics are not exactly the same for some RecordPayloads, such as OverwriteWithLatestAvroPayload.
+We add an Enum in HoodieRecord to mark where it comes from (BASE, LOG or WRITE). `HoodieAvroRecordMerger`'s API will look like the following:

Review Comment:
   I agree, but we need to check all the record payloads and the unit tests involved. Maybe we can do a bug fix later, since this function is still evolving and we can change it.
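For reference, a toy sketch of the Enum-based proposal under discussion (all names are illustrative; this is the design being debated, not a settled API): the record's origin decides whether OverwriteWithLatest-style merging ignores the ordering value:

```java
// Sketch of the quoted proposal: each record carries an enum marking its
// origin, so a single merge(...) can reproduce both the old preCombine
// (LOG vs LOG / incoming) and combineAndGetUpdateValue (BASE vs LOG) paths.
class SourceAwareMerge {
    enum RecordSource { BASE, LOG, WRITE }

    static class Rec {
        final RecordSource source;
        final long orderingVal;
        Rec(RecordSource source, long orderingVal) {
            this.source = source;
            this.orderingVal = orderingVal;
        }
    }

    // OverwriteWithLatest-style semantics: when the older record comes from
    // the base file, the newer record always wins (combineAndGetUpdateValue);
    // otherwise pick by ordering value (preCombine).
    static Rec merge(Rec older, Rec newer) {
        if (older.source == RecordSource.BASE) {
            return newer;
        }
        return newer.orderingVal >= older.orderingVal ? newer : older;
    }
}
```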





[GitHub] [hudi] hudi-bot commented on pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6132:
URL: https://github.com/apache/hudi/pull/6132#issuecomment-1200728114

   ## CI report:
   
   * 87b34db2554029d7e3ad945dfa1bb7d74f15b5c5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10480) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6132:
URL: https://github.com/apache/hudi/pull/6132#issuecomment-1187617279

   ## CI report:
   
   * 1f59ce7aa51032d91b2fe6748c310fa03c0b1c1d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10030) 
   * 9ce2c608b05103e2cf7123504252fa021b7bfcf6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10032) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6132:
URL: https://github.com/apache/hudi/pull/6132#issuecomment-1187049972

   ## CI report:
   
   * 1f59ce7aa51032d91b2fe6748c310fa03c0b1c1d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10030) 
   * 9ce2c608b05103e2cf7123504252fa021b7bfcf6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10032) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6132:
URL: https://github.com/apache/hudi/pull/6132#issuecomment-1191557418

   ## CI report:
   
   * b39ec4efe4a084dcd9ca6ffd56bb55b4d76d5ac7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10093) 
   * d2e5c9721b7e10c44291162f6c3604ec83cfa350 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #6132:
URL: https://github.com/apache/hudi/pull/6132#discussion_r928058440


##########
rfc/rfc-46/rfc-46.md:
##########
@@ -84,59 +84,90 @@ is known to have poor performance (compared to non-reflection based instantiatio
 
 #### Record Merge API
 
-Stateless component interface providing for API Combining Records will look like following:
+CombineAndGetUpdateValue and Precombine will converge into one API. The stateless component interface providing the record-combining API will look like the following:
 
 ```java
-interface HoodieMerge {
-   HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer);
-
-   Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException;
+interface HoodieRecordMerger {
+   // combineAndGetUpdateValue and precombine
+   Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException;
+   
+   // The record type handled by the current merger
+   // SPARK, AVRO, FLINK
+   HoodieRecordType getRecordType();
 }
 
-   /**
-    * Spark-specific implementation 
-    */
-   class HoodieSparkRecordMerge implements HoodieMerge {
+/**
+ * Spark-specific implementation 
+ */
+class HoodieSparkRecordMerger implements HoodieRecordMerger {
 
-      @Override
-      public HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer) {
-        // HoodieSparkRecords preCombine
-      }
+   @Override
+   Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException {
+     // HoodieSparkRecord precombine and combineAndGetUpdateValue 
+   }
 
-      @Override
-      public Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) {
-         // HoodieSparkRecord combineAndGetUpdateValue
-      }
+   @Override
+   HoodieRecordType getRecordType() {
+     return HoodieRecordType.SPARK;
    }
+}
    
-   /**
-    * Flink-specific implementation 
-    */
-   class HoodieFlinkRecordMerge implements HoodieMerge {
-
-      @Override
-      public HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer) {
-        // HoodieFlinkRecord preCombine
-      }
+/**
+ * Flink-specific implementation 
+ */
+class HoodieFlinkRecordMerger implements HoodieRecordMerger {
+
+   @Override
+   Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException {
+      // HoodieFlinkRecord precombine and combineAndGetUpdateValue 
+   }
 
-      @Override
-      public Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) {
-         // HoodieFlinkRecord combineAndGetUpdateValue
-      }
+   @Override
+   HoodieRecordType getRecordType() {
+      return HoodieRecordType.FLINK;
    }
+}
 ```
 Where user can provide their own subclass implementing such interface for the engines of interest.
 
-#### Migration from `HoodieRecordPayload` to `HoodieMerge`
+#### Migration from `HoodieRecordPayload` to `HoodieRecordMerger`
 
 To warrant backward-compatibility (BWC) on the code-level with already created subclasses of `HoodieRecordPayload` currently
-already used in production by Hudi users, we will provide a BWC-bridge in the form of instance of `HoodieMerge`, that will 
+already used in production by Hudi users, we will provide a BWC-bridge in the form of an instance of `HoodieRecordMerger` called `HoodieAvroRecordMerger`, that will 
 be using user-defined subclass of `HoodieRecordPayload` to combine the records.
 
 Leveraging such a bridge will provide for a seamless BWC migration to the 0.11 release; however, it will remove the performance 
 benefit of this refactoring, since it would unavoidably have to perform conversion to the intermediate representation (Avro). To realize
 the full suite of benefits of this refactoring, users will have to migrate their merging logic out of the `HoodieRecordPayload` subclass and into
-new `HoodieMerge` implementation.
+new `HoodieRecordMerger` implementation.
+
+Precombine is used to merge records from logs or incoming records; CombineAndGetUpdateValue is used to merge a record from a log file with a record from a base file.
+These two merge logics are not exactly the same for some RecordPayloads, such as OverwriteWithLatestAvroPayload.
+We add an Enum in HoodieRecord to mark where it comes from (BASE, LOG or WRITE). `HoodieAvroRecordMerger`'s API will look like the following:
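To make the bridge concrete, here is a rough sketch of how a `HoodieAvroRecordMerger`-style bridge could funnel both legacy merge paths through a payload. The types below are simplified stand-ins for illustration, not Hudi's actual classes or signatures:

```java
import java.util.Optional;

// Simplified stand-in for a user-defined HoodieRecordPayload subclass
// (illustrative only; not Hudi's real API).
class LegacyPayload {
    final String value;
    final long orderingVal;

    LegacyPayload(String value, long orderingVal) {
        this.value = value;
        this.orderingVal = orderingVal;
    }

    // preCombine-style dedupe: keep the payload with the higher ordering value.
    LegacyPayload preCombine(LegacyPayload older) {
        return this.orderingVal >= older.orderingVal ? this : older;
    }

    // combineAndGetUpdateValue-style merge: an empty Optional would mean "delete".
    Optional<String> combineAndGetUpdateValue(String currentValue) {
        return Optional.of(this.value);
    }
}

// The BWC bridge: a single merge() that reuses the legacy payload logic,
// at the cost of staying in the Avro-era representation.
class AvroBridgeMerger {
    static Optional<String> merge(LegacyPayload older, LegacyPayload newer) {
        LegacyPayload winner = newer.preCombine(older);
        return winner.combineAndGetUpdateValue(older.value);
    }
}
```

Here a single `merge` call covers what `preCombine` and `combineAndGetUpdateValue` did separately, which is the convergence the new interface proposes.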

Review Comment:
   This will be another breaking change then. I totally understand and am with you that it's not making our life easier, but I think it'd be better if we take it upfront, make all the breaking changes at once, and be done with it.
   
   For example, this new Enum is going to stick with us for a while if we introduce it and may produce new implementations using it which we will need to unwind in the future, which is exactly what I think we're trying to avoid.



##########
rfc/rfc-46/rfc-46.md:
##########
@@ -84,59 +84,90 @@ is known to have poor performance (compared to non-reflection based instantiatio
 
 #### Record Merge API
 
-Stateless component interface providing for API Combining Records will look like following:
+CombineAndGetUpdateValue and Precombine will converge into one API. The stateless component interface providing the record-combining API will look like the following:
 
 ```java
-interface HoodieMerge {
-   HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer);
-
-   Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException;
+interface HoodieRecordMerger {
+   // combineAndGetUpdateValue and precombine
+   Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException;
+   
+   // The record type handled by the current merger
+   // SPARK, AVRO, FLINK
+   HoodieRecordType getRecordType();
 }
 
-   /**
-    * Spark-specific implementation 
-    */
-   class HoodieSparkRecordMerge implements HoodieMerge {
+/**
+ * Spark-specific implementation 
+ */
+class HoodieSparkRecordMerger implements HoodieRecordMerger {
 
-      @Override
-      public HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer) {
-        // HoodieSparkRecords preCombine
-      }
+   @Override
+   Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException {
+     // HoodieSparkRecord precombine and combineAndGetUpdateValue 
+   }
 
-      @Override
-      public Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) {
-         // HoodieSparkRecord combineAndGetUpdateValue
-      }
+   @Override
+   HoodieRecordType getRecordType() {
+     return HoodieRecordType.SPARK;
    }
+}
    
-   /**
-    * Flink-specific implementation 
-    */
-   class HoodieFlinkRecordMerge implements HoodieMerge {
-
-      @Override
-      public HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer) {
-        // HoodieFlinkRecord preCombine
-      }
+/**
+ * Flink-specific implementation 
+ */
+class HoodieFlinkRecordMerger implements HoodieRecordMerger {
+
+   @Override
+   Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException {
+      // HoodieFlinkRecord precombine and combineAndGetUpdateValue 
+   }
 
-      @Override
-      public Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) {
-         // HoodieFlinkRecord combineAndGetUpdateValue
-      }
+   @Override
+   HoodieRecordType getRecordType() {
+      return HoodieRecordType.FLINK;
    }
+}
 ```
Users can provide their own subclass implementing this interface for the engines of interest.
 
-#### Migration from `HoodieRecordPayload` to `HoodieMerge`
+#### Migration from `HoodieRecordPayload` to `HoodieRecordMerger`
 
 To warrant backward-compatibility (BWC) on the code-level with already created subclasses of `HoodieRecordPayload` currently
-already used in production by Hudi users, we will provide a BWC-bridge in the form of instance of `HoodieMerge`, that will 
+already used in production by Hudi users, we will provide a BWC-bridge in the form of an instance of `HoodieRecordMerger` called `HoodieAvroRecordMerger`, which will
 be using user-defined subclass of `HoodieRecordPayload` to combine the records.
 
Leveraging such a bridge will provide for a seamless BWC migration to the 0.11 release; however, it will forgo the performance
benefit of this refactoring, since it would unavoidably have to perform conversion to an intermediate representation (Avro). To realize
the full suite of benefits of this refactoring, users will have to migrate their merging logic out of the `HoodieRecordPayload` subclass and into
-new `HoodieMerge` implementation.
+new `HoodieRecordMerger` implementation.
+
+Precombine is used to merge records from logs or incoming records; CombineAndGetUpdateValue is used to merge a record from a log file with a record from a base file.
+These two merge logics are not exactly the same for some RecordPayloads, such as OverwriteWithLatestAvroPayload.
+We add an Enum in HoodieRecord to mark where it comes from (BASE, LOG or WRITE). `HoodieAvroRecordMerger`'s API will look like the following:

Review Comment:
   This will be another breaking change then. I totally understand and am with you that it's not making our life easier, but I think it'd be better if we take it upfront, make all the breaking changes at once, and be done with it.
   
   For example, this new Enum is going to stick with us for a while if we introduce it and may produce new implementations using it which we will need to unwind in the future, which is exactly what I think we're trying to avoid.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] wzx140 commented on pull request #6132: [RFC-46][HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
wzx140 commented on PR #6132:
URL: https://github.com/apache/hudi/pull/6132#issuecomment-1251779085

   @alexeykudinkin I have changed the function to updateMetadataValues(Schema recordSchema, Properties props, MetadataValues metadataValues). truncateRecordKey will be put in HoodieRecordCompatibilityInterface.
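As a rough illustration of the shape described in this comment, a `MetadataValues`-style container might look like the sketch below. The class name comes from the comment itself; everything else is an assumption, not Hudi's actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for the MetadataValues container mentioned above
// (field names and fluent style are assumptions).
class MetadataValues {
    private final Map<String, String> values = new HashMap<>();

    // Record one metadata column to be set on the record.
    MetadataValues set(String field, String value) {
        values.put(field, value);
        return this; // fluent style for chaining several fields
    }

    Map<String, String> asMap() {
        return values;
    }
}
```

A record implementation could then apply these values in one pass inside `updateMetadataValues` instead of taking a raw map.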




[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6132: [RFC-46][HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #6132:
URL: https://github.com/apache/hudi/pull/6132#discussion_r938180595


##########
rfc/rfc-46/rfc-46.md:
##########
@@ -128,21 +173,88 @@ Following major components will be refactored:
 
 1. `HoodieWriteHandle`s will be  
    1. Accepting `HoodieRecord` instead of raw Avro payload (avoiding Avro conversion)
-   2. Using Combining API engine to merge records (when necessary) 
+   2. Using Record Merge API to merge records (when necessary) 
    3. Passes `HoodieRecord` as is to `FileWriter`
 2. `HoodieFileWriter`s will be 
    1. Accepting `HoodieRecord`
    2. Will be engine-specific (so that they're able to handle internal record representation)
 3. `HoodieRealtimeRecordReader`s 
    1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If you do not specify MERGE_CLASS_NAME, a default will be chosen according to your engine type.
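The engine-aware defaulting could be sketched as follows; the class names and resolution logic here are illustrative assumptions, not Hudi's actual code:

```java
// Illustrative sketch of engine-aware merger defaulting (assumed names).
enum EngineType { SPARK, FLINK, JAVA }

class MergerDefaults {
    // Honor an explicit MERGE_CLASS_NAME if set; otherwise fall back to
    // an engine-native default merger.
    static String resolveMergeClass(String configured, EngineType engine) {
        if (configured != null && !configured.isEmpty()) {
            return configured;
        }
        switch (engine) {
            case SPARK: return "org.apache.hudi.HoodieSparkRecordMerger";
            case FLINK: return "org.apache.hudi.HoodieFlinkRecordMerger";
            default:    return "org.apache.hudi.HoodieAvroRecordMerger";
        }
    }
}
```

The point is only that the fallback depends on the engine, while an explicit user setting always wins.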
+
+### Public API in HoodieRecord
+Because we implement different types of records, we need to implement functionality similar to AvroUtils in HoodieRecord for the different data representations (Avro, InternalRow, RowData).
+Its public API will look like the following:
+
+```java
+import java.util.Properties;
+
+class HoodieRecord {
+
+   /**
+    * Get column in record to support RDDCustomColumnsSortPartitioner
+    */
+   Object getRecordColumnValues(Schema recordSchema, String[] columns,
+           boolean consistentLogicalTimestampEnabled);
+
+   /**
+    * Support bootstrap.
+    */
+   HoodieRecord mergeWith(HoodieRecord other, Schema targetSchema) throws IOException;
+
+   /**
+    * Rewrite record into a new schema (add meta columns)
+    */
+   HoodieRecord rewriteRecord(Schema recordSchema, Properties props, Schema targetSchema)
+           throws IOException;
+
+   /**
+    * Support schema evolution.
+    */
+   HoodieRecord rewriteRecordWithNewSchema(Schema recordSchema, Properties props, Schema newSchema,
+           Map<String, String> renameCols) throws IOException;
+
+   HoodieRecord updateValues(Schema recordSchema, Properties props,
+           Map<String, String> metadataValues) throws IOException;
+
+   boolean isDelete(Schema recordSchema, Properties props) throws IOException;
+
+   /**
+    * Is EmptyRecord. Generated by ExpressionPayload.
+    */
+   boolean shouldIgnore(Schema recordSchema, Properties props) throws IOException;
+
+   /**
+    * This method is used to extract HoodieKey through parameters.
+    */
+   HoodieRecord getKeyWithParams(

Review Comment:
   Sorry, i think i previously was under wrong impression that this method returns `HoodieKey`, but now i see that it returns the same `HoodieRecord` and the only thing that is changed essentially is just the name of the method.
   
   I [left my comment](https://github.com/apache/hudi/pull/5629#discussion_r938253077) elaborating why i think we should have no need for this method, PTAL.





[GitHub] [hudi] wzx140 commented on a diff in pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
wzx140 commented on code in PR #6132:
URL: https://github.com/apache/hudi/pull/6132#discussion_r933207193


##########
rfc/rfc-46/rfc-46.md:
##########
@@ -84,59 +84,90 @@ is known to have poor performance (compared to non-reflection based instantiatio
 
 #### Record Merge API
 
-Stateless component interface providing for API Combining Records will look like following:
+CombineAndGetUpdateValue and Precombine will converge into one API. The stateless component interface providing the record-combining API will look like the following:
 
 ```java
-interface HoodieMerge {
-   HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer);
-
-   Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException;
+interface HoodieRecordMerger {
+   // combineAndGetUpdateValue and precombine
+   Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException;
+   
+   // The record type handled by the current merger
+   // SPARK, AVRO, FLINK
+   HoodieRecordType getRecordType();
 }
 
-   /**
-    * Spark-specific implementation 
-    */
-   class HoodieSparkRecordMerge implements HoodieMerge {
+/**
+ * Spark-specific implementation 
+ */
+class HoodieSparkRecordMerger implements HoodieRecordMerger {
 
-      @Override
-      public HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer) {
-        // HoodieSparkRecords preCombine
-      }
+   @Override
+   Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException {
+     // HoodieSparkRecord precombine and combineAndGetUpdateValue 
+   }
 
-      @Override
-      public Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) {
-         // HoodieSparkRecord combineAndGetUpdateValue
-      }
+   @Override
+   HoodieRecordType getRecordType() {
+     return HoodieRecordType.SPARK;
    }
+}
    
-   /**
-    * Flink-specific implementation 
-    */
-   class HoodieFlinkRecordMerge implements HoodieMerge {
-
-      @Override
-      public HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer) {
-        // HoodieFlinkRecord preCombine
-      }
+/**
+ * Flink-specific implementation 
+ */
+class HoodieFlinkRecordMerger implements HoodieRecordMerger {
+
+   @Override
+   Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException {
+      // HoodieFlinkRecord precombine and combineAndGetUpdateValue 
+   }
 
-      @Override
-      public Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) {
-         // HoodieFlinkRecord combineAndGetUpdateValue
-      }
+   @Override
+   HoodieRecordType getRecordType() {
+      return HoodieRecordType.FLINK;
    }
+}
 ```
Users can provide their own subclass implementing this interface for the engines of interest.
 
-#### Migration from `HoodieRecordPayload` to `HoodieMerge`
+#### Migration from `HoodieRecordPayload` to `HoodieRecordMerger`
 
 To warrant backward-compatibility (BWC) on the code-level with already created subclasses of `HoodieRecordPayload` currently
-already used in production by Hudi users, we will provide a BWC-bridge in the form of instance of `HoodieMerge`, that will 
+already used in production by Hudi users, we will provide a BWC-bridge in the form of an instance of `HoodieRecordMerger` called `HoodieAvroRecordMerger`, which will
 be using user-defined subclass of `HoodieRecordPayload` to combine the records.
 
Leveraging such a bridge will provide for a seamless BWC migration to the 0.11 release; however, it will forgo the performance
benefit of this refactoring, since it would unavoidably have to perform conversion to an intermediate representation (Avro). To realize
the full suite of benefits of this refactoring, users will have to migrate their merging logic out of the `HoodieRecordPayload` subclass and into
-new `HoodieMerge` implementation.
+new `HoodieRecordMerger` implementation.
+
+Precombine is used to merge records from logs or incoming records; CombineAndGetUpdateValue is used to merge a record from a log file with a record from a base file.
+These two merge logics are not exactly the same for some RecordPayloads, such as OverwriteWithLatestAvroPayload.
+We add an Enum in HoodieRecord to mark where it comes from (BASE, LOG or WRITE). `HoodieAvroRecordMerger`'s API will look like the following:

Review Comment:
   I think you're right. I've removed the mark(BASE, LOG or WRITE) in HoodieRecord and unified logic of HoodieSparkRecord.





[GitHub] [hudi] hudi-bot commented on pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6132:
URL: https://github.com/apache/hudi/pull/6132#issuecomment-1189914786

   ## CI report:
   
   * 9ce2c608b05103e2cf7123504252fa021b7bfcf6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10032) 
   * 63384a9ee861c2fb14da988d404382ced69f733d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10086) 
   * 011f79d738f845e2ae9224529f7ad5fa8250f4f2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10088) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] wzx140 commented on a diff in pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
wzx140 commented on code in PR #6132:
URL: https://github.com/apache/hudi/pull/6132#discussion_r925186365


##########
rfc/rfc-46/rfc-46.md:
##########
@@ -156,13 +187,76 @@ Following major components will be refactored:
 3. `HoodieRealtimeRecordReader`s 
    1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If you do not specify MERGE_CLASS_NAME, a default will be chosen according to your engine type.
+
+### Public API in HoodieRecord
+Because we implement different types of records, we need to move some functions from AvroUtils into HoodieRecord for the different data representations (Avro, InternalRow, RowData).
+Its public API will look like the following:
+
+```java
+class HoodieRecord {
+
+   /**
+    * Get column in record to support RDDCustomColumnsSortPartitioner
+    */
+   Object getRecordColumnValues(Schema recordSchema, String[] columns, boolean consistentLogicalTimestampEnabled);
+
+   /**
+    * Support bootstrap.
+    */
+   HoodieRecord mergeWith(HoodieRecord other) throws IOException;
+
+   /**
+    * Rewrite record into a new schema (add meta columns)
+    */
+   HoodieRecord rewriteRecord(Schema recordSchema, Properties props, Schema targetSchema) throws IOException;
+
+   /**
+    * Support schema evolution.
+    */
+   HoodieRecord rewriteRecordWithNewSchema(Schema recordSchema, Properties props, Schema newSchema, Map<String, String> renameCols) throws IOException;
+
+   HoodieRecord addMetadataValues(Schema recordSchema, Properties props, Map<HoodieMetadataField, String> metadataValues) throws IOException;
+
+   /**
+    * Is deleted.
+    */
+   boolean isPresent(Schema recordSchema, Properties props) throws IOException;
+
+   /**
+    * Is EmptyRecord. Generated by ExpressionPayload.
+    */
+   boolean shouldIgnore(Schema recordSchema, Properties props) throws IOException;
+
+   /**
+    * This method is used to extract HoodieKey without going through the keyGenerator.
+    */
+   HoodieRecord expansion(
+           String payloadClass,
+           String preCombineField,
+           Option<Pair<String, String>> simpleKeyGenFieldsOpt,
+           Boolean withOperation,
+           Option<String> partitionNameOp,
+           Boolean populateMetaFieldsOp) throws IOException;
+
+   /**
+    * This method is used to extract HoodieKey through the keyGenerator. It is used in ClusteringExecutionStrategy.
+    */
+   HoodieRecord transform(Properties props, Option<BaseKeyGenerator> keyGen);
+
+   Option<IndexedRecord> toIndexedRecord(Schema schema, Properties props) throws IOException;

Review Comment:
   fixed





[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #6132:
URL: https://github.com/apache/hudi/pull/6132#discussion_r925079371


##########
rfc/rfc-46/rfc-46.md:
##########
@@ -84,59 +84,90 @@ is known to have poor performance (compared to non-reflection based instantiatio
 
 #### Record Merge API
 
-Stateless component interface providing for API Combining Records will look like following:
+CombineAndGetUpdateValue and Precombine will converge into one API. The stateless component interface providing the record-combining API will look like the following:
 
 ```java
-interface HoodieMerge {
-   HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer);
-
-   Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException;
+interface HoodieRecordMerger {
+   // combineAndGetUpdateValue and precombine
+   Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException;
+   
+   // The record type handled by the current merger
+   // SPARK, AVRO, FLINK
+   HoodieRecordType getRecordType();
 }
 
-   /**
-    * Spark-specific implementation 
-    */
-   class HoodieSparkRecordMerge implements HoodieMerge {
+/**
+ * Spark-specific implementation 
+ */
+class HoodieSparkRecordMerger implements HoodieRecordMerger {
 
-      @Override
-      public HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer) {
-        // HoodieSparkRecords preCombine
-      }
+   @Override
+   Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException {
+     // HoodieSparkRecord precombine and combineAndGetUpdateValue 
+   }
 
-      @Override
-      public Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) {
-         // HoodieSparkRecord combineAndGetUpdateValue
-      }
+   @Override
+   HoodieRecordType getRecordType() {
+     return HoodieRecordType.SPARK;
    }
+}
    
-   /**
-    * Flink-specific implementation 
-    */
-   class HoodieFlinkRecordMerge implements HoodieMerge {
-
-      @Override
-      public HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer) {
-        // HoodieFlinkRecord preCombine
-      }
+/**
+ * Flink-specific implementation 
+ */
+class HoodieFlinkRecordMerger implements HoodieRecordMerger {
+
+   @Override
+   Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException {
+      // HoodieFlinkRecord precombine and combineAndGetUpdateValue 
+   }
 
-      @Override
-      public Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) {
-         // HoodieFlinkRecord combineAndGetUpdateValue
-      }
+   @Override
+   HoodieRecordType getRecordType() {
+      return HoodieRecordType.FLINK;
    }
+}
 ```
Users can provide their own subclass implementing this interface for the engines of interest.
 
-#### Migration from `HoodieRecordPayload` to `HoodieMerge`
+#### Migration from `HoodieRecordPayload` to `HoodieRecordMerger`
 
 To warrant backward-compatibility (BWC) on the code-level with already created subclasses of `HoodieRecordPayload` currently
-already used in production by Hudi users, we will provide a BWC-bridge in the form of instance of `HoodieMerge`, that will 
+already used in production by Hudi users, we will provide a BWC-bridge in the form of an instance of `HoodieRecordMerger` called `HoodieAvroRecordMerger`, which will
 be using user-defined subclass of `HoodieRecordPayload` to combine the records.
 
Leveraging such a bridge will provide for a seamless BWC migration to the 0.11 release; however, it will forgo the performance
benefit of this refactoring, since it would unavoidably have to perform conversion to an intermediate representation (Avro). To realize
the full suite of benefits of this refactoring, users will have to migrate their merging logic out of the `HoodieRecordPayload` subclass and into
-new `HoodieMerge` implementation.
+new `HoodieRecordMerger` implementation.
+
+Precombine is used to merge records from logs or incoming records; CombineAndGetUpdateValue is used to merge a record from a log file with a record from a base file.
+These two merge logics are not exactly the same for some RecordPayloads, such as OverwriteWithLatestAvroPayload.
+We add an Enum in HoodieRecord to mark where it comes from (BASE, LOG or WRITE). `HoodieAvroRecordMerger`'s API will look like the following:

Review Comment:
   I don't think we even need this: current implementation of `OverwriteWithLatestAvroPayload` is actually implemented inconsistently right now:
   
   - When de-duping it will use `orderingVal` to determine the latest record, but
   - When merging persisted record w/ incoming it will just assume that incoming somehow is more recent than persisted one (even though it could be the opposite)
   
   We should actually unify its semantic to be consistent across `preCombine` and `combineAndGetUpdateValue`



##########
rfc/rfc-46/rfc-46.md:
##########
@@ -84,59 +84,90 @@ is known to have poor performance (compared to non-reflection based instantiatio
 
 #### Record Merge API
 
-Stateless component interface providing for API Combining Records will look like following:
+CombineAndGetUpdateValue and Precombine will converge into one API. The stateless component interface providing the record-combining API will look like the following:
 
 ```java
-interface HoodieMerge {
-   HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer);
-
-   Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException;
+interface HoodieRecordMerger {
+   // combineAndGetUpdateValue and precombine
+   Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException;
+   
+   // The record type handled by the current merger
+   // SPARK, AVRO, FLINK
+   HoodieRecordType getRecordType();
 }
 
-   /**
-    * Spark-specific implementation 
-    */
-   class HoodieSparkRecordMerge implements HoodieMerge {
+/**
+ * Spark-specific implementation 
+ */
+class HoodieSparkRecordMerger implements HoodieRecordMerger {
 
-      @Override
-      public HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer) {
-        // HoodieSparkRecords preCombine
-      }
+   @Override
+   Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException {
+     // HoodieSparkRecord precombine and combineAndGetUpdateValue 
+   }
 
-      @Override
-      public Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) {
-         // HoodieSparkRecord combineAndGetUpdateValue
-      }
+   @Override
+   HoodieRecordType getRecordType() {
+     return HoodieRecordType.SPARK;
    }
+}
    
-   /**
-    * Flink-specific implementation 
-    */
-   class HoodieFlinkRecordMerge implements HoodieMerge {
-
-      @Override
-      public HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer) {
-        // HoodieFlinkRecord preCombine
-      }
+/**
+ * Flink-specific implementation 
+ */
+class HoodieFlinkRecordMerger implements HoodieRecordMerger {
+
+   @Override
+   Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException {
+      // HoodieFlinkRecord precombine and combineAndGetUpdateValue 
+   }
 
-      @Override
-      public Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) {
-         // HoodieFlinkRecord combineAndGetUpdateValue
-      }
+   @Override
+   HoodieRecordType getRecordType() {
+      return HoodieRecordType.FLINK;
    }
+}
 ```
Users can provide their own subclass implementing this interface for the engines of interest.
 
-#### Migration from `HoodieRecordPayload` to `HoodieMerge`
+#### Migration from `HoodieRecordPayload` to `HoodieRecordMerger`
 
 To warrant backward-compatibility (BWC) on the code-level with already created subclasses of `HoodieRecordPayload` currently
-already used in production by Hudi users, we will provide a BWC-bridge in the form of instance of `HoodieMerge`, that will 
+already used in production by Hudi users, we will provide a BWC-bridge in the form of an instance of `HoodieRecordMerger` called `HoodieAvroRecordMerger`, which will
 be using user-defined subclass of `HoodieRecordPayload` to combine the records.
 
 Leveraging such bridge will make provide for seamless BWC migration to the 0.11 release, however will be removing the performance 

Review Comment:
   Typo: "make provide"



##########
rfc/rfc-46/rfc-46.md:
##########
@@ -84,59 +84,90 @@ is known to have poor performance (compared to non-reflection based instantiatio
 
 #### Record Merge API
 
-Stateless component interface providing for API Combining Records will look like following:
+CombineAndGetUpdateValue and Precombine will converge into one API. The stateless component interface providing the record-combining API will look like the following:
 
 ```java
-interface HoodieMerge {
-   HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer);
-
-   Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException;
+interface HoodieRecordMerger {
+   // combineAndGetUpdateValue and precombine
+   Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException;
+   
+   // The record type handled by the current merger
+   // SPARK, AVRO, FLINK
+   HoodieRecordType getRecordType();
 }
 
-   /**
-    * Spark-specific implementation 
-    */
-   class HoodieSparkRecordMerge implements HoodieMerge {
+/**
+ * Spark-specific implementation 
+ */
+class HoodieSparkRecordMerger implements HoodieRecordMerger {
 
-      @Override
-      public HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer) {
-        // HoodieSparkRecords preCombine
-      }
+   @Override
+   Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException {
+     // HoodieSparkRecord precombine and combineAndGetUpdateValue 

Review Comment:
   Same comment as above



##########
rfc/rfc-46/rfc-46.md:
##########
@@ -156,13 +187,76 @@ Following major components will be refactored:
 3. `HoodieRealtimeRecordReader`s 
    1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If you do not specify MERGE_CLASS_NAME, a default will be chosen according to your engine type.
+
+### Public Api in HoodieRecord
+Because we implement different types of records, we need to transfer some func in AvroUtils into HoodieRecord for different data(avro, InternalRow, RowData).
+Its public API will look like following:
+
+```java
+class HoodieRecord {
+
+   /**
+    * Get column in record to support RDDCustomColumnsSortPartitioner
+    */
+   Object getRecordColumnValues(Schema recordSchema, String[] columns, boolean consistentLogicalTimestampEnabled);
+
+   /**
+    * Support bootstrap.
+    */
+   HoodieRecord mergeWith(HoodieRecord other) throws IOException;
+
+   /**
+    * Rewrite record into new schema(add meta columns)
+    */
+   HoodieRecord rewriteRecord(Schema recordSchema, Properties props, Schema targetSchema) throws IOException;
+
+   /**
+    * Support schema evolution.
+    */
+   HoodieRecord rewriteRecordWithNewSchema(Schema recordSchema, Properties props, Schema newSchema, Map<String, String> renameCols) throws IOException;
+
+   HoodieRecord addMetadataValues(Schema recordSchema, Properties props, Map<HoodieMetadataField, String> metadataValues) throws IOException;
+
+   /**
+    * Is deleted.
+    */
+   boolean isPresent(Schema recordSchema, Properties props) throws IOException;

Review Comment:
   Let's instead call it `isDelete`



##########
rfc/rfc-46/rfc-46.md:
##########
@@ -156,13 +187,76 @@ Following major components will be refactored:
 3. `HoodieRealtimeRecordReader`s 
    1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default merger implementation is chosen according to your engine type.
+
+### Public API in HoodieRecord
+Because we implement different types of records, we need to transfer some func in AvroUtils into HoodieRecord for different data(avro, InternalRow, RowData).
+Its public API will look like following:
+
+```java
+class HoodieRecord {
+
+   /**
+    * Get column in record to support RDDCustomColumnsSortPartitioner
+    */
+   Object getRecordColumnValues(Schema recordSchema, String[] columns, boolean consistentLogicalTimestampEnabled);
+
+   /**
+    * Support bootstrap.
+    */
+   HoodieRecord mergeWith(HoodieRecord other) throws IOException;
+
+   /**
+    * Rewrite record into new schema(add meta columns)
+    */
+   HoodieRecord rewriteRecord(Schema recordSchema, Properties props, Schema targetSchema) throws IOException;
+
+   /**
+    * Support schema evolution.
+    */
+   HoodieRecord rewriteRecordWithNewSchema(Schema recordSchema, Properties props, Schema newSchema, Map<String, String> renameCols) throws IOException;
+
+   HoodieRecord addMetadataValues(Schema recordSchema, Properties props, Map<HoodieMetadataField, String> metadataValues) throws IOException;
+
+   /**
+    * Is deleted.
+    */
+   boolean isPresent(Schema recordSchema, Properties props) throws IOException;
+
+   /**
+    * Is EmptyRecord. Generated by ExpressionPayload.
+    */
+   boolean shouldIgnore(Schema recordSchema, Properties props) throws IOException;
+
+   /**
+    * This method used to extract HoodieKey not through keyGenerator.
+    */
+   HoodieRecord expansion(
+           String payloadClass,
+           String preCombineField,
+           Option<Pair<String, String>> simpleKeyGenFieldsOpt,
+           Boolean withOperation,
+           Option<String> partitionNameOp,
+           Boolean populateMetaFieldsOp) throws IOException;
+
+   /**
+    * This method used to extract HoodieKey through keyGenerator. This method used in ClusteringExecutionStrategy.
+    */
+   HoodieRecord transform(Properties props, Option<BaseKeyGenerator> keyGen);
+
+   Option<IndexedRecord> toIndexedRecord(Schema schema, Properties props) throws IOException;

Review Comment:
   What's the purpose of this method? Are we using it to fall back to Avro?
   
   If that's the case then we should:
   
   1. Return `HoodieAvroRecord` instead of `IndexedRecord` 
   2. Make sure we annotate it as the one which usages should be very restricted (mostly for the time we migrate and in a few exceptions)
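
One way the second point could be expressed is a usage-restriction marker on the method. A minimal sketch with an invented annotation name (`TransitionalAvroBridge` and `SketchHoodieRecord` are illustrative stand-ins, not Hudi's actual classes; Hudi may choose a different mechanism such as deprecation or package-private access):

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

// Invented marker: flags methods that exist only as a transitional Avro
// bridge and whose usage should be restricted to migration code paths.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface TransitionalAvroBridge {
    String reason() default "Avro fallback for migration only";
}

class SketchHoodieRecord {
    @TransitionalAvroBridge
    Object toAvroRecord() { // would return HoodieAvroRecord in the real proposal
        return null;
    }
}

public class BridgeAnnotationDemo {
    public static void main(String[] args) throws NoSuchMethodException {
        Method m = SketchHoodieRecord.class.getDeclaredMethod("toAvroRecord");
        // Build tooling (e.g. checkstyle/ArchUnit rules) could reject calls
        // to methods carrying this marker outside an allow-list.
        System.out.println(m.isAnnotationPresent(TransitionalAvroBridge.class)); // true
    }
}
```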



##########
rfc/rfc-46/rfc-46.md:
##########
@@ -156,13 +187,76 @@ Following major components will be refactored:
 3. `HoodieRealtimeRecordReader`s 
    1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default merger implementation is chosen according to your engine type.
+
+### Public API in HoodieRecord
+Because we implement different types of records, we need to transfer some func in AvroUtils into HoodieRecord for different data(avro, InternalRow, RowData).

Review Comment:
   We might need to re-phrase this to make message a little more clear: we are saying that we will be implementing functionality similar to `AvroUtils` in `HoodieRecord`, right?



##########
rfc/rfc-46/rfc-46.md:
##########
@@ -156,13 +187,76 @@ Following major components will be refactored:
 3. `HoodieRealtimeRecordReader`s 
    1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default merger implementation is chosen according to your engine type.
+
+### Public API in HoodieRecord
+Because we implement different types of records, we need to transfer some func in AvroUtils into HoodieRecord for different data(avro, InternalRow, RowData).
+Its public API will look like following:
+
+```java
+class HoodieRecord {
+
+   /**
+    * Get column in record to support RDDCustomColumnsSortPartitioner
+    */
+   Object getRecordColumnValues(Schema recordSchema, String[] columns, boolean consistentLogicalTimestampEnabled);
+
+   /**
+    * Support bootstrap.
+    */
+   HoodieRecord mergeWith(HoodieRecord other) throws IOException;

Review Comment:
   All merging logic should be extracted from `HoodieRecord` and consolidated w/in `HoodieRecordMerger`



##########
rfc/rfc-46/rfc-46.md:
##########
@@ -156,13 +187,76 @@ Following major components will be refactored:
 3. `HoodieRealtimeRecordReader`s 
    1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default merger implementation is chosen according to your engine type.
+
+### Public API in HoodieRecord
+Because we implement different types of records, we need to transfer some func in AvroUtils into HoodieRecord for different data(avro, InternalRow, RowData).
+Its public API will look like following:
+
+```java
+class HoodieRecord {
+
+   /**
+    * Get column in record to support RDDCustomColumnsSortPartitioner
+    */
+   Object getRecordColumnValues(Schema recordSchema, String[] columns, boolean consistentLogicalTimestampEnabled);
+
+   /**
+    * Support bootstrap.
+    */
+   HoodieRecord mergeWith(HoodieRecord other) throws IOException;
+
+   /**
+    * Rewrite record into new schema(add meta columns)
+    */
+   HoodieRecord rewriteRecord(Schema recordSchema, Properties props, Schema targetSchema) throws IOException;
+
+   /**
+    * Support schema evolution.
+    */
+   HoodieRecord rewriteRecordWithNewSchema(Schema recordSchema, Properties props, Schema newSchema, Map<String, String> renameCols) throws IOException;

Review Comment:
   I'd suggest we avoid adding this to an API and keep it as a utility method elsewhere



##########
rfc/rfc-46/rfc-46.md:
##########
@@ -84,59 +84,90 @@ is known to have poor performance (compared to non-reflection based instantiatio
 
 #### Record Merge API
 
-Stateless component interface providing for API Combining Records will look like following:
+CombineAndGetUpdateValue and Precombine will converge into a single API. The stateless component interface for combining records will look like the following:
 
 ```java
-interface HoodieMerge {
-   HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer);
-
-   Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException;
+interface HoodieRecordMerger {
+   // combineAndGetUpdateValue and precombine

Review Comment:
   Let's update the comment to elaborate on what's expected semantic of this method (for ex, we should mention that this operation should be associative)
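
For instance, associativity here means that merging a group of records must yield the same survivor no matter how the pairwise merges are grouped. A self-contained sketch using simplified stand-in types (`Rec` and `PickLatestMerger` are invented for illustration, not Hudi's actual classes):

```java
import java.util.Optional;

// Simplified stand-ins for HoodieRecord and HoodieRecordMerger, only to
// illustrate the merge semantics.
class Rec {
    final String key;
    final long orderingValue; // plays the role of the precombine field
    final String payload;

    Rec(String key, long orderingValue, String payload) {
        this.key = key;
        this.orderingValue = orderingValue;
        this.payload = payload;
    }
}

class PickLatestMerger {
    // Associative merge: the record with the greater ordering value wins;
    // ties resolve in favor of the record presented as "newer".
    Optional<Rec> merge(Rec older, Rec newer) {
        return Optional.of(newer.orderingValue >= older.orderingValue ? newer : older);
    }
}

public class AssociativityDemo {
    public static void main(String[] args) {
        PickLatestMerger m = new PickLatestMerger();
        Rec a = new Rec("k", 1, "a");
        Rec b = new Rec("k", 3, "b");
        Rec c = new Rec("k", 2, "c");
        // (a merge b) merge c must equal a merge (b merge c) for log
        // compaction and precombine to be freely regroupable.
        Rec grouped1 = m.merge(m.merge(a, b).get(), c).get();
        Rec grouped2 = m.merge(a, m.merge(b, c).get()).get();
        System.out.println(grouped1.payload + " == " + grouped2.payload); // b == b
    }
}
```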



##########
rfc/rfc-46/rfc-46.md:
##########
@@ -156,13 +187,76 @@ Following major components will be refactored:
 3. `HoodieRealtimeRecordReader`s 
    1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default merger implementation is chosen according to your engine type.
+
+### Public API in HoodieRecord
+Because we implement different types of records, we need to transfer some func in AvroUtils into HoodieRecord for different data(avro, InternalRow, RowData).
+Its public API will look like following:
+
+```java
+class HoodieRecord {
+
+   /**
+    * Get column in record to support RDDCustomColumnsSortPartitioner
+    */
+   Object getRecordColumnValues(Schema recordSchema, String[] columns, boolean consistentLogicalTimestampEnabled);
+
+   /**
+    * Support bootstrap.
+    */
+   HoodieRecord mergeWith(HoodieRecord other) throws IOException;
+
+   /**
+    * Rewrite record into new schema(add meta columns)
+    */
+   HoodieRecord rewriteRecord(Schema recordSchema, Properties props, Schema targetSchema) throws IOException;
+
+   /**
+    * Support schema evolution.
+    */
+   HoodieRecord rewriteRecordWithNewSchema(Schema recordSchema, Properties props, Schema newSchema, Map<String, String> renameCols) throws IOException;
+
+   HoodieRecord addMetadataValues(Schema recordSchema, Properties props, Map<HoodieMetadataField, String> metadataValues) throws IOException;
+
+   /**
+    * Is deleted.
+    */
+   boolean isPresent(Schema recordSchema, Properties props) throws IOException;
+
+   /**
+    * Is EmptyRecord. Generated by ExpressionPayload.
+    */
+   boolean shouldIgnore(Schema recordSchema, Properties props) throws IOException;
+
+   /**
+    * This method used to extract HoodieKey not through keyGenerator.
+    */
+   HoodieRecord expansion(

Review Comment:
   We should not have this method: 
   
   1. It accepts `payloadClass` that we're eliminating
   2. It accepts `preCombineField` that should only be used inside `HoodieRecordMerger`
   3. Its purpose is unclear



##########
rfc/rfc-46/rfc-46.md:
##########
@@ -156,13 +187,76 @@ Following major components will be refactored:
 3. `HoodieRealtimeRecordReader`s 
    1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default merger implementation is chosen according to your engine type.
+
+### Public API in HoodieRecord
+Because we implement different types of records, we need to transfer some func in AvroUtils into HoodieRecord for different data(avro, InternalRow, RowData).
+Its public API will look like following:
+
+```java
+class HoodieRecord {
+
+   /**
+    * Get column in record to support RDDCustomColumnsSortPartitioner
+    */
+   Object getRecordColumnValues(Schema recordSchema, String[] columns, boolean consistentLogicalTimestampEnabled);

Review Comment:
   I'd suggest we go with singular option: 
   ```
   <T> T getField(String column)
   ```
   
   We should not be passing the schema in; the schema should be kept inside HoodieRecord itself. Also, let's get rid of `consistentLogicalTimestampEnabled`; you can just pass it as true in your implementation.
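
The suggested singular accessor might look like the following sketch (`RowLikeRecord` is a stand-in type invented for illustration, not Hudi's actual class); the schema lives inside the record, so call sites name only the column:

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for an engine-native record that carries its own schema/layout.
class RowLikeRecord {
    private final Map<String, Object> fields = new HashMap<>();

    RowLikeRecord put(String column, Object value) {
        fields.put(column, value);
        return this;
    }

    // The suggested shape: no Schema parameter threaded through, and no
    // consistentLogicalTimestampEnabled flag at the call site.
    @SuppressWarnings("unchecked")
    <T> T getField(String column) {
        return (T) fields.get(column);
    }
}

public class GetFieldDemo {
    public static void main(String[] args) {
        RowLikeRecord r = new RowLikeRecord().put("rider", "rider-213").put("fare", 27.7);
        String rider = r.getField("rider");
        double fare = r.getField("fare");
        System.out.println(rider + " / " + fare); // rider-213 / 27.7
    }
}
```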



##########
rfc/rfc-46/rfc-46.md:
##########
@@ -156,13 +187,76 @@ Following major components will be refactored:
 3. `HoodieRealtimeRecordReader`s 
    1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default merger implementation is chosen according to your engine type.
+
+### Public API in HoodieRecord
+Because we implement different types of records, we need to transfer some func in AvroUtils into HoodieRecord for different data(avro, InternalRow, RowData).
+Its public API will look like following:
+
+```java
+class HoodieRecord {
+
+   /**
+    * Get column in record to support RDDCustomColumnsSortPartitioner
+    */
+   Object getRecordColumnValues(Schema recordSchema, String[] columns, boolean consistentLogicalTimestampEnabled);
+
+   /**
+    * Support bootstrap.
+    */
+   HoodieRecord mergeWith(HoodieRecord other) throws IOException;
+
+   /**
+    * Rewrite record into new schema(add meta columns)
+    */
+   HoodieRecord rewriteRecord(Schema recordSchema, Properties props, Schema targetSchema) throws IOException;

Review Comment:
   Let's avoid passing in `recordSchema` -- it should be captured inside `HoodieRecord`



##########
rfc/rfc-46/rfc-46.md:
##########
@@ -156,13 +187,76 @@ Following major components will be refactored:
 3. `HoodieRealtimeRecordReader`s 
    1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default merger implementation is chosen according to your engine type.
+
+### Public API in HoodieRecord
+Because we implement different types of records, we need to transfer some func in AvroUtils into HoodieRecord for different data(avro, InternalRow, RowData).
+Its public API will look like following:
+
+```java
+class HoodieRecord {
+
+   /**
+    * Get column in record to support RDDCustomColumnsSortPartitioner
+    */
+   Object getRecordColumnValues(Schema recordSchema, String[] columns, boolean consistentLogicalTimestampEnabled);
+
+   /**
+    * Support bootstrap.
+    */
+   HoodieRecord mergeWith(HoodieRecord other) throws IOException;
+
+   /**
+    * Rewrite record into new schema(add meta columns)
+    */
+   HoodieRecord rewriteRecord(Schema recordSchema, Properties props, Schema targetSchema) throws IOException;
+
+   /**
+    * Support schema evolution.
+    */
+   HoodieRecord rewriteRecordWithNewSchema(Schema recordSchema, Properties props, Schema newSchema, Map<String, String> renameCols) throws IOException;
+
+   HoodieRecord addMetadataValues(Schema recordSchema, Properties props, Map<HoodieMetadataField, String> metadataValues) throws IOException;
+
+   /**
+    * Is deleted.
+    */
+   boolean isPresent(Schema recordSchema, Properties props) throws IOException;
+
+   /**
+    * Is EmptyRecord. Generated by ExpressionPayload.
+    */
+   boolean shouldIgnore(Schema recordSchema, Properties props) throws IOException;

Review Comment:
   Are you referring to `EmptyHoodieRecordPayload`? It's also used as a tombstone record (i.e. a delete).
   
   We should be fine with just one deleting mechanism.



##########
rfc/rfc-46/rfc-46.md:
##########
@@ -156,13 +187,76 @@ Following major components will be refactored:
 3. `HoodieRealtimeRecordReader`s 
    1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default merger implementation is chosen according to your engine type.
+
+### Public API in HoodieRecord
+Because we implement different types of records, we need to transfer some func in AvroUtils into HoodieRecord for different data(avro, InternalRow, RowData).
+Its public API will look like following:
+
+```java
+class HoodieRecord {
+
+   /**
+    * Get column in record to support RDDCustomColumnsSortPartitioner
+    */
+   Object getRecordColumnValues(Schema recordSchema, String[] columns, boolean consistentLogicalTimestampEnabled);
+
+   /**
+    * Support bootstrap.
+    */
+   HoodieRecord mergeWith(HoodieRecord other) throws IOException;
+
+   /**
+    * Rewrite record into new schema(add meta columns)
+    */
+   HoodieRecord rewriteRecord(Schema recordSchema, Properties props, Schema targetSchema) throws IOException;
+
+   /**
+    * Support schema evolution.
+    */
+   HoodieRecord rewriteRecordWithNewSchema(Schema recordSchema, Properties props, Schema newSchema, Map<String, String> renameCols) throws IOException;
+
+   HoodieRecord addMetadataValues(Schema recordSchema, Properties props, Map<HoodieMetadataField, String> metadataValues) throws IOException;
+
+   /**
+    * Is deleted.
+    */
+   boolean isPresent(Schema recordSchema, Properties props) throws IOException;
+
+   /**
+    * Is EmptyRecord. Generated by ExpressionPayload.
+    */
+   boolean shouldIgnore(Schema recordSchema, Properties props) throws IOException;
+
+   /**
+    * This method used to extract HoodieKey not through keyGenerator.
+    */
+   HoodieRecord expansion(
+           String payloadClass,
+           String preCombineField,
+           Option<Pair<String, String>> simpleKeyGenFieldsOpt,
+           Boolean withOperation,
+           Option<String> partitionNameOp,
+           Boolean populateMetaFieldsOp) throws IOException;
+
+   /**
+    * This method used to extract HoodieKey through keyGenerator. This method used in ClusteringExecutionStrategy.
+    */
+   HoodieRecord transform(Properties props, Option<BaseKeyGenerator> keyGen);

Review Comment:
   Same comment as above for `expansion`: what this method is needed for?



##########
rfc/rfc-46/rfc-46.md:
##########
@@ -156,13 +187,76 @@ Following major components will be refactored:
 3. `HoodieRealtimeRecordReader`s 
    1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default merger implementation is chosen according to your engine type.
+
+### Public API in HoodieRecord
+Because we implement different types of records, we need to transfer some func in AvroUtils into HoodieRecord for different data(avro, InternalRow, RowData).
+Its public API will look like following:
+
+```java
+class HoodieRecord {
+
+   /**
+    * Get column in record to support RDDCustomColumnsSortPartitioner
+    */
+   Object getRecordColumnValues(Schema recordSchema, String[] columns, boolean consistentLogicalTimestampEnabled);
+
+   /**
+    * Support bootstrap.
+    */
+   HoodieRecord mergeWith(HoodieRecord other) throws IOException;
+
+   /**
+    * Rewrite record into new schema(add meta columns)
+    */
+   HoodieRecord rewriteRecord(Schema recordSchema, Properties props, Schema targetSchema) throws IOException;

Review Comment:
   Signature should be 
   
   ```
   HoodieRecord rewriteInSchema(Schema targetSchema, Properties props)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #6132:
URL: https://github.com/apache/hudi/pull/6132#discussion_r926263553


##########
rfc/rfc-46/rfc-46.md:
##########
@@ -156,13 +187,76 @@ Following major components will be refactored:
 3. `HoodieRealtimeRecordReader`s 
    1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default merger implementation is chosen according to your engine type.
+
+### Public API in HoodieRecord
+Because we implement different types of records, we need to transfer some func in AvroUtils into HoodieRecord for different data(avro, InternalRow, RowData).
+Its public API will look like following:
+
+```java
+class HoodieRecord {
+
+   /**
+    * Get column in record to support RDDCustomColumnsSortPartitioner
+    */
+   Object getRecordColumnValues(Schema recordSchema, String[] columns, boolean consistentLogicalTimestampEnabled);
+
+   /**
+    * Support bootstrap.
+    */
+   HoodieRecord mergeWith(HoodieRecord other) throws IOException;
+
+   /**
+    * Rewrite record into new schema(add meta columns)
+    */
+   HoodieRecord rewriteRecord(Schema recordSchema, Properties props, Schema targetSchema) throws IOException;
+
+   /**
+    * Support schema evolution.
+    */
+   HoodieRecord rewriteRecordWithNewSchema(Schema recordSchema, Properties props, Schema newSchema, Map<String, String> renameCols) throws IOException;
+
+   HoodieRecord addMetadataValues(Schema recordSchema, Properties props, Map<HoodieMetadataField, String> metadataValues) throws IOException;
+
+   /**
+    * Is deleted.
+    */
+   boolean isPresent(Schema recordSchema, Properties props) throws IOException;
+
+   /**
+    * Is EmptyRecord. Generated by ExpressionPayload.
+    */
+   boolean shouldIgnore(Schema recordSchema, Properties props) throws IOException;
+
+   /**
+    * This method used to extract HoodieKey not through keyGenerator.
+    */
+   HoodieRecord expansion(

Review Comment:
   Please check my comment in the PR where it's introduced (https://github.com/apache/hudi/pull/5629#discussion_r926262756), i don't think we should have this method and i believe we should be able to get away without it in any form.



##########
rfc/rfc-46/rfc-46.md:
##########
@@ -84,59 +84,90 @@ is known to have poor performance (compared to non-reflection based instantiatio
 
 #### Record Merge API
 
-Stateless component interface providing for API Combining Records will look like following:
+CombineAndGetUpdateValue and Precombine will converge into a single API. The stateless component interface for combining records will look like the following:
 
 ```java
-interface HoodieMerge {
-   HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer);
-
-   Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException;
+interface HoodieRecordMerger {
+   // combineAndGetUpdateValue and precombine
+   Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException;
+   
+   // The record type handled by the current merger
+   // SPARK, AVRO, FLINK
+   HoodieRecordType getRecordType();
 }
 
-   /**
-    * Spark-specific implementation 
-    */
-   class HoodieSparkRecordMerge implements HoodieMerge {
+/**
+ * Spark-specific implementation 
+ */
+class HoodieSparkRecordMerger implements HoodieRecordMerger {
 
-      @Override
-      public HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer) {
-        // HoodieSparkRecords preCombine
-      }
+   @Override
+   Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException {
+     // HoodieSparkRecord precombine and combineAndGetUpdateValue 
+   }
 
-      @Override
-      public Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) {
-         // HoodieSparkRecord combineAndGetUpdateValue
-      }
+   @Override
+   HoodieRecordType getRecordType() {
+     return HoodieRecordType.SPARK;
    }
+}
    
-   /**
-    * Flink-specific implementation 
-    */
-   class HoodieFlinkRecordMerge implements HoodieMerge {
-
-      @Override
-      public HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer) {
-        // HoodieFlinkRecord preCombine
-      }
+/**
+ * Flink-specific implementation 
+ */
+class HoodieFlinkRecordMerger implements HoodieRecordMerger {
+
+   @Override
+   Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException {
+      // HoodieFlinkRecord precombine and combineAndGetUpdateValue 
+   }
 
-      @Override
-      public Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) {
-         // HoodieFlinkRecord combineAndGetUpdateValue
-      }
+   @Override
+   HoodieRecordType getRecordType() {
+      return HoodieRecordType.FLINK;
    }
+}
 ```
 Where user can provide their own subclass implementing such interface for the engines of interest.
 
-#### Migration from `HoodieRecordPayload` to `HoodieMerge`
+#### Migration from `HoodieRecordPayload` to `HoodieRecordMerger`
 
 To warrant backward-compatibility (BWC) on the code-level with already created subclasses of `HoodieRecordPayload` currently
-already used in production by Hudi users, we will provide a BWC-bridge in the form of instance of `HoodieMerge`, that will 
+already used in production by Hudi users, we will provide a BWC-bridge in the form of instance of `HoodieRecordMerger` called `HoodieAvroRecordMerger`, that will 
 be using user-defined subclass of `HoodieRecordPayload` to combine the records.
 
 Leveraging such bridge will make provide for seamless BWC migration to the 0.11 release, however will be removing the performance 
 benefit of this refactoring, since it would unavoidably have to perform conversion to intermediate representation (Avro). To realize
 full-suite of benefits of this refactoring, users will have to migrate their merging logic out of `HoodieRecordPayload` subclass and into
-new `HoodieMerge` implementation.
+new `HoodieRecordMerger` implementation.
+
+Precombine is used to merge incoming records or records from log files with each other; CombineAndGetUpdateValue is used to merge a record from a log file with a record from a base file.
+These two merge logics are not exactly the same for some RecordPayloads, such as OverwriteWithLatestAvroPayload.
+We add an Enum in HoodieRecord to mark where it comes from (BASE, LOG or WRITE). `HoodieAvroRecordMerger`'s API will look like the following:

Review Comment:
   I think we should qualify existing behavior as simply a bug and resolve it with this, instead of going out of the way to maintain BWC.
   
   What's your take folks @vinothchandar @prasannarajaperumal? 





[GitHub] [hudi] hudi-bot commented on pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6132:
URL: https://github.com/apache/hudi/pull/6132#issuecomment-1191563004

   ## CI report:
   
   * b39ec4efe4a084dcd9ca6ffd56bb55b4d76d5ac7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10093) 
   * d2e5c9721b7e10c44291162f6c3604ec83cfa350 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10149) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] wzx140 commented on pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
wzx140 commented on PR #6132:
URL: https://github.com/apache/hudi/pull/6132#issuecomment-1200655483

   @alexeykudinkin Could you review these changes if it's convenient for you? Removed HoodieSource, which marks whether the record comes from base or log. Renamed the expansion and transform functions.




[GitHub] [hudi] hudi-bot commented on pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6132:
URL: https://github.com/apache/hudi/pull/6132#issuecomment-1200684560

   ## CI report:
   
   * d2e5c9721b7e10c44291162f6c3604ec83cfa350 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10149) 
   * 87b34db2554029d7e3ad945dfa1bb7d74f15b5c5 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10480) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] wzx140 commented on a diff in pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
wzx140 commented on code in PR #6132:
URL: https://github.com/apache/hudi/pull/6132#discussion_r925106786


##########
rfc/rfc-46/rfc-46.md:
##########
@@ -156,13 +187,76 @@ Following major components will be refactored:
 3. `HoodieRealtimeRecordReader`s 
    1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default merge class is chosen according to your engine type.
+
+### Public API in HoodieRecord
+Because we implement different types of records, we need to move some functions from AvroUtils into HoodieRecord for the different data representations (Avro, InternalRow, RowData).
+Its public API will look like the following:
+
+```java
+class HoodieRecord {
+
+   /**
+    * Get column in record to support RDDCustomColumnsSortPartitioner
+    */
+   Object getRecordColumnValues(Schema recordSchema, String[] columns, boolean consistentLogicalTimestampEnabled);
+
+   /**
+    * Support bootstrap.
+    */
+   HoodieRecord mergeWith(HoodieRecord other) throws IOException;
+
+   /**
+    * Rewrite record into new schema(add meta columns)
+    */
+   HoodieRecord rewriteRecord(Schema recordSchema, Properties props, Schema targetSchema) throws IOException;
+
+   /**
+    * Support schema evolution.
+    */
+   HoodieRecord rewriteRecordWithNewSchema(Schema recordSchema, Properties props, Schema newSchema, Map<String, String> renameCols) throws IOException;
+
+   HoodieRecord addMetadataValues(Schema recordSchema, Properties props, Map<HoodieMetadataField, String> metadataValues) throws IOException;
+
+   /**
+    * Is deleted.
+    */
+   boolean isPresent(Schema recordSchema, Properties props) throws IOException;
+
+   /**
+    * Is EmptyRecord. Generated by ExpressionPayload.
+    */
+   boolean shouldIgnore(Schema recordSchema, Properties props) throws IOException;

Review Comment:
   No. This method checks whether the record is the EmptyRecord SENTINEL.
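For context, a minimal sketch of the sentinel pattern this check relies on. The class and field names below are illustrative stand-ins, not Hudi's actual API:

```java
// Illustrative sketch only -- not Hudi's actual classes. The idea is that
// ExpressionPayload emits a shared "empty record" instance, and the check
// is reference equality against that shared sentinel, not content inspection.
class SentinelDemoRecord {
    static final SentinelDemoRecord SENTINEL = new SentinelDemoRecord();

    boolean shouldIgnore() {
        return this == SENTINEL;  // identity check, not equals()
    }
}
```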





[GitHub] [hudi] hudi-bot commented on pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6132:
URL: https://github.com/apache/hudi/pull/6132#issuecomment-1190120167

   ## CI report:
   
   * 011f79d738f845e2ae9224529f7ad5fa8250f4f2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10088) 
   * b39ec4efe4a084dcd9ca6ffd56bb55b4d76d5ac7 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10093) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] wzx140 commented on a diff in pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
wzx140 commented on code in PR #6132:
URL: https://github.com/apache/hudi/pull/6132#discussion_r925100124


##########
rfc/rfc-46/rfc-46.md:
##########
@@ -156,13 +187,76 @@ Following major components will be refactored:
 3. `HoodieRealtimeRecordReader`s 
    1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default merge class is chosen according to your engine type.
+
+### Public API in HoodieRecord
+Because we implement different types of records, we need to move some functions from AvroUtils into HoodieRecord for the different data representations (Avro, InternalRow, RowData).
+Its public API will look like the following:
+
+```java
+class HoodieRecord {
+
+   /**
+    * Get column in record to support RDDCustomColumnsSortPartitioner
+    */
+   Object getRecordColumnValues(Schema recordSchema, String[] columns, boolean consistentLogicalTimestampEnabled);
+
+   /**
+    * Support bootstrap.
+    */
+   HoodieRecord mergeWith(HoodieRecord other) throws IOException;
+
+   /**
+    * Rewrite record into new schema(add meta columns)
+    */
+   HoodieRecord rewriteRecord(Schema recordSchema, Properties props, Schema targetSchema) throws IOException;

Review Comment:
   Same as `getRecordColumnValues`.





[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6132: [RFC-46][HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #6132:
URL: https://github.com/apache/hudi/pull/6132#discussion_r938175507


##########
rfc/rfc-46/rfc-46.md:
##########
@@ -74,49 +74,94 @@ Following (high-level) steps are proposed:
    2. Split into interface and engine-specific implementations (holding internal engine-specific representation of the payload) 
    3. Implementing new standardized record-level APIs (like `getPartitionKey` , `getRecordKey`, etc)
    4. Staying **internal** component, that will **NOT** contain any user-defined semantic (like merging)
-2. Extract Record Combining (Merge) API from `HoodieRecordPayload` into a standalone, stateless component (engine). Such component will be
+2. Extract Record Merge API from `HoodieRecordPayload` into a standalone, stateless component. Such component will be
    1. Abstracted as stateless object providing API to combine records (according to predefined semantics) for engines (Spark, Flink) of interest
    2. Plug-in point for user-defined combination semantics
 3. Gradually deprecate, phase-out and eventually remove `HoodieRecordPayload` abstraction
 
 Phasing out usage of `HoodieRecordPayload` will also bring the benefit of avoiding to use Java reflection in the hot-path, which
 is known to have poor performance (compared to non-reflection based instantiation).
 
-#### Combine API Engine
+#### Record Merge API
 
-Stateless component interface providing for API Combining Records will look like following:
+CombineAndGetUpdateValue and Precombine will converge into one API. The stateless component interface providing the Record Merge API will look like the following:
 
 ```java
-interface HoodieRecordCombiningEngine {
-  
-  default HoodieRecord precombine(HoodieRecord older, HoodieRecord newer) {
-    if (spark) {
-      precombineSpark((SparkHoodieRecord) older, (SparkHoodieRecord) newer);
-    } else if (flink) {
-      // precombine for Flink
-    }
-  }
+interface HoodieRecordMerger {
+   // This method converges combineAndGetUpdateValue and precombine from HoodiePayload. 
+   // It must be an associative operation: f(a, f(b, c)) = f(f(a, b), c) (i.e., given three versions A, B, C of a single record, both orders of application have to yield the same result)
+   Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException;
+   
+   // The record type handled by the current merger
+   // SPARK, AVRO, FLINK
+   HoodieRecordType getRecordType();
+}
 
-   /**
-    * Spark-specific implementation 
-    */
-  SparkHoodieRecord precombineSpark(SparkHoodieRecord older, SparkHoodieRecord newer);
-  
-  // ...
+/**
+ * Spark-specific implementation 
+ */
+class HoodieSparkRecordMerger implements HoodieRecordMerger {

Review Comment:
   Please check my comment here:
   https://github.com/apache/hudi/pull/5629#discussion_r938154747
   
   We should allow the user to implement a single `RecordMerger` object supporting every engine type.
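As a rough illustration of that suggestion, one merger class could dispatch on the record's own type instead of having one subclass per engine. The types below are simplified stand-ins, not the real Hudi API:

```java
import java.util.Optional;

// Simplified stand-ins for illustration; the real HoodieRecord/HoodieRecordMerger differ.
enum EngineType { AVRO, SPARK, FLINK }

class DemoRecord {
    final long orderingValue;  // e.g. the precombine field
    final String payload;
    final EngineType type;

    DemoRecord(long orderingValue, String payload, EngineType type) {
        this.orderingValue = orderingValue;
        this.payload = payload;
        this.type = type;
    }
}

class UniversalMerger {
    // Latest-ordering-value-wins. This rule is associative, matching the
    // f(a, f(b, c)) = f(f(a, b), c) requirement stated in the RFC, and it
    // works unchanged for whichever engine type the records carry.
    Optional<DemoRecord> merge(DemoRecord older, DemoRecord newer) {
        return Optional.of(newer.orderingValue >= older.orderingValue ? newer : older);
    }
}
```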



##########
rfc/rfc-46/rfc-46.md:
##########
@@ -128,21 +173,88 @@ Following major components will be refactored:
 
 1. `HoodieWriteHandle`s will be  
    1. Accepting `HoodieRecord` instead of raw Avro payload (avoiding Avro conversion)
-   2. Using Combining API engine to merge records (when necessary) 
+   2. Using Record Merge API to merge records (when necessary) 
    3. Passes `HoodieRecord` as is to `FileWriter`
 2. `HoodieFileWriter`s will be 
    1. Accepting `HoodieRecord`
    2. Will be engine-specific (so that they're able to handle internal record representation)
 3. `HoodieRealtimeRecordReader`s 
    1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default merge class is chosen according to your engine type.
+
+### Public API in HoodieRecord
+Because we implement different types of records, we need to implement functionality similar to AvroUtils in HoodieRecord for the different data representations (Avro, InternalRow, RowData).
+Its public API will look like the following:
+
+```java
+import java.util.Properties;
+
+class HoodieRecord {
+
+   /**
+    * Get column in record to support RDDCustomColumnsSortPartitioner
+    */
+   Object getRecordColumnValues(Schema recordSchema, String[] columns,

Review Comment:
   This should return an array of objects, right?
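A quick sketch of why an `Object[]` return type fits here: extracting several columns naturally yields one value per requested column. The map-backed record below is hypothetical, not Hudi's actual implementation:

```java
import java.util.Map;

// Hypothetical map-backed record, for illustration only: pulling several
// column values out of a record yields one value per column, i.e. Object[].
class ColumnValuesDemo {
    static Object[] getRecordColumnValues(Map<String, Object> record, String[] columns) {
        Object[] values = new Object[columns.length];
        for (int i = 0; i < columns.length; i++) {
            values[i] = record.get(columns[i]);
        }
        return values;
    }
}
```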



##########
rfc/rfc-46/rfc-46.md:
##########
@@ -128,21 +173,88 @@ Following major components will be refactored:
 
 1. `HoodieWriteHandle`s will be  
    1. Accepting `HoodieRecord` instead of raw Avro payload (avoiding Avro conversion)
-   2. Using Combining API engine to merge records (when necessary) 
+   2. Using Record Merge API to merge records (when necessary) 
    3. Passes `HoodieRecord` as is to `FileWriter`
 2. `HoodieFileWriter`s will be 
    1. Accepting `HoodieRecord`
    2. Will be engine-specific (so that they're able to handle internal record representation)
 3. `HoodieRealtimeRecordReader`s 
    1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default merge class is chosen according to your engine type.
+
+### Public API in HoodieRecord
+Because we implement different types of records, we need to implement functionality similar to AvroUtils in HoodieRecord for the different data representations (Avro, InternalRow, RowData).
+Its public API will look like the following:
+
+```java
+import java.util.Properties;
+
+class HoodieRecord {
+
+   /**
+    * Get column in record to support RDDCustomColumnsSortPartitioner
+    */
+   Object getRecordColumnValues(Schema recordSchema, String[] columns,
+           boolean consistentLogicalTimestampEnabled);
+
+   /**
+    * Support bootstrap.
+    */
+   HoodieRecord mergeWith(HoodieRecord other, Schema targetSchema) throws IOException;
+
+   /**
+    * Rewrite record into new schema(add meta columns)
+    */
+   HoodieRecord rewriteRecord(Schema recordSchema, Properties props, Schema targetSchema)
+           throws IOException;
+
+   /**
+    * Support schema evolution.
+    */
+   HoodieRecord rewriteRecordWithNewSchema(Schema recordSchema, Properties props, Schema newSchema,
+           Map<String, String> renameCols) throws IOException;
+
+   HoodieRecord updateValues(Schema recordSchema, Properties props,

Review Comment:
   I'd suggest we:
   
   1. Name it `updateMetadataValues` to avoid confusion (we shouldn't allow modifying the record's payload)
   2. Instead of `Map<String, String>`, let's create a strongly typed Java class w/ all meta-fields and pass it here
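A rough sketch of what such a strongly typed class might look like. The field set follows the well-known `_hoodie_*` meta columns, but the class itself is hypothetical, not Hudi's actual API:

```java
// Hypothetical builder-style holder for meta-field values, sketching the
// suggestion to replace Map<String, String> with a strongly typed class.
final class MetadataValuesDemo {
    private String commitTime;
    private String commitSeqno;
    private String recordKey;
    private String partitionPath;
    private String fileName;

    MetadataValuesDemo setCommitTime(String v)    { this.commitTime = v; return this; }
    MetadataValuesDemo setCommitSeqno(String v)   { this.commitSeqno = v; return this; }
    MetadataValuesDemo setRecordKey(String v)     { this.recordKey = v; return this; }
    MetadataValuesDemo setPartitionPath(String v) { this.partitionPath = v; return this; }
    MetadataValuesDemo setFileName(String v)      { this.fileName = v; return this; }

    String getCommitTime()    { return commitTime; }
    String getRecordKey()     { return recordKey; }
    String getPartitionPath() { return partitionPath; }
}
```

Unset fields stay `null`, so a writer can populate only the meta-fields it actually needs.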



##########
rfc/rfc-46/rfc-46.md:
##########
@@ -128,21 +173,88 @@ Following major components will be refactored:
 
 1. `HoodieWriteHandle`s will be  
    1. Accepting `HoodieRecord` instead of raw Avro payload (avoiding Avro conversion)
-   2. Using Combining API engine to merge records (when necessary) 
+   2. Using Record Merge API to merge records (when necessary) 
    3. Passes `HoodieRecord` as is to `FileWriter`
 2. `HoodieFileWriter`s will be 
    1. Accepting `HoodieRecord`
    2. Will be engine-specific (so that they're able to handle internal record representation)
 3. `HoodieRealtimeRecordReader`s 
    1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default merge class is chosen according to your engine type.
+
+### Public API in HoodieRecord
+Because we implement different types of records, we need to implement functionality similar to AvroUtils in HoodieRecord for the different data representations (Avro, InternalRow, RowData).
+Its public API will look like the following:
+
+```java
+import java.util.Properties;
+
+class HoodieRecord {
+
+   /**
+    * Get column in record to support RDDCustomColumnsSortPartitioner
+    */
+   Object getRecordColumnValues(Schema recordSchema, String[] columns,
+           boolean consistentLogicalTimestampEnabled);
+
+   /**
+    * Support bootstrap.
+    */
+   HoodieRecord mergeWith(HoodieRecord other, Schema targetSchema) throws IOException;

Review Comment:
   As we've discussed previously, we should avoid adding any merging semantics to the Record API itself. What is this method going to be used for?



##########
rfc/rfc-46/rfc-46.md:
##########
@@ -128,21 +173,88 @@ Following major components will be refactored:
 
 1. `HoodieWriteHandle`s will be  
    1. Accepting `HoodieRecord` instead of raw Avro payload (avoiding Avro conversion)
-   2. Using Combining API engine to merge records (when necessary) 
+   2. Using Record Merge API to merge records (when necessary) 
    3. Passes `HoodieRecord` as is to `FileWriter`
 2. `HoodieFileWriter`s will be 
    1. Accepting `HoodieRecord`
    2. Will be engine-specific (so that they're able to handle internal record representation)
 3. `HoodieRealtimeRecordReader`s 
    1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default merge class is chosen according to your engine type.
+
+### Public API in HoodieRecord
+Because we implement different types of records, we need to implement functionality similar to AvroUtils in HoodieRecord for the different data representations (Avro, InternalRow, RowData).
+Its public API will look like the following:
+
+```java
+import java.util.Properties;
+
+class HoodieRecord {
+
+   /**
+    * Get column in record to support RDDCustomColumnsSortPartitioner
+    */
+   Object getRecordColumnValues(Schema recordSchema, String[] columns,
+           boolean consistentLogicalTimestampEnabled);
+
+   /**
+    * Support bootstrap.
+    */
+   HoodieRecord mergeWith(HoodieRecord other, Schema targetSchema) throws IOException;
+
+   /**
+    * Rewrite record into new schema(add meta columns)
+    */
+   HoodieRecord rewriteRecord(Schema recordSchema, Properties props, Schema targetSchema)
+           throws IOException;
+
+   /**
+    * Support schema evolution.
+    */
+   HoodieRecord rewriteRecordWithNewSchema(Schema recordSchema, Properties props, Schema newSchema,
+           Map<String, String> renameCols) throws IOException;
+
+   HoodieRecord updateValues(Schema recordSchema, Properties props,
+           Map<String, String> metadataValues) throws IOException;
+
+   boolean isDelete(Schema recordSchema, Properties props) throws IOException;
+
+   /**
+    * Is EmptyRecord. Generated by ExpressionPayload.
+    */
+   boolean shouldIgnore(Schema recordSchema, Properties props) throws IOException;
+
+   /**
+    * This method used to extract HoodieKey through parameters.
+    */
+   HoodieRecord getKeyWithParams(

Review Comment:
   Great progress on this method! 
   
   Its arguments still don't make sense from the user's perspective, though: looking at it, I can't imagine where it could even be used. Can you please point me to where this method is planned to be used?



##########
rfc/rfc-46/rfc-46.md:
##########
@@ -128,21 +173,88 @@ Following major components will be refactored:
 
 1. `HoodieWriteHandle`s will be  
    1. Accepting `HoodieRecord` instead of raw Avro payload (avoiding Avro conversion)
-   2. Using Combining API engine to merge records (when necessary) 
+   2. Using Record Merge API to merge records (when necessary) 
    3. Passes `HoodieRecord` as is to `FileWriter`
 2. `HoodieFileWriter`s will be 
    1. Accepting `HoodieRecord`
    2. Will be engine-specific (so that they're able to handle internal record representation)
 3. `HoodieRealtimeRecordReader`s 
    1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default merge class is chosen according to your engine type.
+
+### Public API in HoodieRecord
+Because we implement different types of records, we need to implement functionality similar to AvroUtils in HoodieRecord for the different data representations (Avro, InternalRow, RowData).
+Its public API will look like the following:
+
+```java
+import java.util.Properties;
+
+class HoodieRecord {
+
+   /**
+    * Get column in record to support RDDCustomColumnsSortPartitioner
+    */
+   Object getRecordColumnValues(Schema recordSchema, String[] columns,
+           boolean consistentLogicalTimestampEnabled);
+
+   /**
+    * Support bootstrap.
+    */
+   HoodieRecord mergeWith(HoodieRecord other, Schema targetSchema) throws IOException;
+
+   /**
+    * Rewrite record into new schema(add meta columns)
+    */
+   HoodieRecord rewriteRecord(Schema recordSchema, Properties props, Schema targetSchema)
+           throws IOException;
+
+   /**
+    * Support schema evolution.
+    */
+   HoodieRecord rewriteRecordWithNewSchema(Schema recordSchema, Properties props, Schema newSchema,
+           Map<String, String> renameCols) throws IOException;
+
+   HoodieRecord updateValues(Schema recordSchema, Properties props,
+           Map<String, String> metadataValues) throws IOException;
+
+   boolean isDelete(Schema recordSchema, Properties props) throws IOException;
+
+   /**
+    * Is EmptyRecord. Generated by ExpressionPayload.
+    */
+   boolean shouldIgnore(Schema recordSchema, Properties props) throws IOException;
+
+   /**
+    * This method used to extract HoodieKey through parameters.
+    */
+   HoodieRecord getKeyWithParams(
+           Schema schema,
+           Properties props,
+           Option<Pair<String, String>>simpleKeyGenFieldsOpt,
+           Boolean withOperation,
+           Option<String> partitionNameOp,
+           Boolean populateMetaFieldsOp) throws IOException;
+
+   /**
+    * This method used to extract HoodieKey through keyGenerator. This method used in ClusteringExecutionStrategy.
+    */
+   HoodieRecord getKeyWithKeyGen(Properties props, Option<BaseKeyGenerator> keyGen);
+
+   Option<HoodieAvroIndexedRecord> toIndexedRecord(Schema schema, Properties props)

Review Comment:
   Why do we need `HoodieAvroIndexedRecord`? 
   I think `HoodieAvroRecord` should be enough



##########
rfc/rfc-46/rfc-46.md:
##########
@@ -156,13 +187,76 @@ Following major components will be refactored:
 3. `HoodieRealtimeRecordReader`s 
    1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default merge class is chosen according to your engine type.
+
+### Public API in HoodieRecord
+Because we implement different types of records, we need to move some functions from AvroUtils into HoodieRecord for the different data representations (Avro, InternalRow, RowData).
+Its public API will look like the following:
+
+```java
+class HoodieRecord {
+
+   /**
+    * Get column in record to support RDDCustomColumnsSortPartitioner
+    */
+   Object getRecordColumnValues(Schema recordSchema, String[] columns, boolean consistentLogicalTimestampEnabled);
+
+   /**
+    * Support bootstrap.
+    */
+   HoodieRecord mergeWith(HoodieRecord other) throws IOException;
+
+   /**
+    * Rewrite record into new schema(add meta columns)
+    */
+   HoodieRecord rewriteRecord(Schema recordSchema, Properties props, Schema targetSchema) throws IOException;
+
+   /**
+    * Support schema evolution.
+    */
+   HoodieRecord rewriteRecordWithNewSchema(Schema recordSchema, Properties props, Schema newSchema, Map<String, String> renameCols) throws IOException;
+
+   HoodieRecord addMetadataValues(Schema recordSchema, Properties props, Map<HoodieMetadataField, String> metadataValues) throws IOException;
+
+   /**
+    * Is deleted.
+    */
+   boolean isPresent(Schema recordSchema, Properties props) throws IOException;
+
+   /**
+    * Is EmptyRecord. Generated by ExpressionPayload.
+    */
+   boolean shouldIgnore(Schema recordSchema, Properties props) throws IOException;

Review Comment:
   We should probably update the javadoc to avoid referencing any particular implementation.





[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6132: [RFC-46][HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #6132:
URL: https://github.com/apache/hudi/pull/6132#discussion_r938180595


##########
rfc/rfc-46/rfc-46.md:
##########
@@ -128,21 +173,88 @@ Following major components will be refactored:
 
 1. `HoodieWriteHandle`s will be  
    1. Accepting `HoodieRecord` instead of raw Avro payload (avoiding Avro conversion)
-   2. Using Combining API engine to merge records (when necessary) 
+   2. Using Record Merge API to merge records (when necessary) 
    3. Passes `HoodieRecord` as is to `FileWriter`
 2. `HoodieFileWriter`s will be 
    1. Accepting `HoodieRecord`
    2. Will be engine-specific (so that they're able to handle internal record representation)
 3. `HoodieRealtimeRecordReader`s 
    1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default merge class is chosen according to your engine type.
+
+### Public API in HoodieRecord
+Because we implement different types of records, we need to implement functionality similar to AvroUtils in HoodieRecord for the different data representations (Avro, InternalRow, RowData).
+Its public API will look like the following:
+
+```java
+import java.util.Properties;
+
+class HoodieRecord {
+
+   /**
+    * Get column in record to support RDDCustomColumnsSortPartitioner
+    */
+   Object getRecordColumnValues(Schema recordSchema, String[] columns,
+           boolean consistentLogicalTimestampEnabled);
+
+   /**
+    * Support bootstrap.
+    */
+   HoodieRecord mergeWith(HoodieRecord other, Schema targetSchema) throws IOException;
+
+   /**
+    * Rewrite record into new schema(add meta columns)
+    */
+   HoodieRecord rewriteRecord(Schema recordSchema, Properties props, Schema targetSchema)
+           throws IOException;
+
+   /**
+    * Support schema evolution.
+    */
+   HoodieRecord rewriteRecordWithNewSchema(Schema recordSchema, Properties props, Schema newSchema,
+           Map<String, String> renameCols) throws IOException;
+
+   HoodieRecord updateValues(Schema recordSchema, Properties props,
+           Map<String, String> metadataValues) throws IOException;
+
+   boolean isDelete(Schema recordSchema, Properties props) throws IOException;
+
+   /**
+    * Is EmptyRecord. Generated by ExpressionPayload.
+    */
+   boolean shouldIgnore(Schema recordSchema, Properties props) throws IOException;
+
+   /**
+    * This method used to extract HoodieKey through parameters.
+    */
+   HoodieRecord getKeyWithParams(

Review Comment:
   Sorry, I think I was previously under the wrong impression that this method returns `HoodieKey`, but now I see that it returns the same `HoodieRecord`, and essentially the only thing that changes is the name of the method.
   
   I left [my comment in the actual PR](https://github.com/apache/hudi/pull/5629#discussion_r938253077) elaborating why I think we should have no need for this method, PTAL.





[GitHub] [hudi] hudi-bot commented on pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6132:
URL: https://github.com/apache/hudi/pull/6132#issuecomment-1190046526

   ## CI report:
   
   * 9ce2c608b05103e2cf7123504252fa021b7bfcf6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10032) 
   * 63384a9ee861c2fb14da988d404382ced69f733d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10086) 
   * 011f79d738f845e2ae9224529f7ad5fa8250f4f2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10088) 
   * b39ec4efe4a084dcd9ca6ffd56bb55b4d76d5ac7 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6132:
URL: https://github.com/apache/hudi/pull/6132#issuecomment-1186984578

   ## CI report:
   
   * 1f59ce7aa51032d91b2fe6748c310fa03c0b1c1d UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6132: [RFC-46][HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #6132:
URL: https://github.com/apache/hudi/pull/6132#discussion_r974718156


##########
rfc/rfc-46/rfc-46.md:
##########
@@ -128,21 +173,88 @@ Following major components will be refactored:
 
 1. `HoodieWriteHandle`s will be  
    1. Accepting `HoodieRecord` instead of raw Avro payload (avoiding Avro conversion)
-   2. Using Combining API engine to merge records (when necessary) 
+   2. Using Record Merge API to merge records (when necessary) 
    3. Passes `HoodieRecord` as is to `FileWriter`
 2. `HoodieFileWriter`s will be 
    1. Accepting `HoodieRecord`
    2. Will be engine-specific (so that they're able to handle internal record representation)
 3. `HoodieRealtimeRecordReader`s 
    1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default is chosen according to your engine type.
+
+### Public API in HoodieRecord
+Because we implement different types of records, we need to implement functionality similar to AvroUtils in HoodieRecord for different data representations (Avro, InternalRow, RowData).
+Its public API will look like the following:
+
+```java
+import java.util.Properties;
+
+class HoodieRecord {
+
+   /**
+    * Get column in record to support RDDCustomColumnsSortPartitioner
+    */
+   Object getRecordColumnValues(Schema recordSchema, String[] columns,

Review Comment:
   @wzx140 I understand where you're coming from.
   
   We should have already deprecated `getRecordColumnValues`, as this method is heavily coupled to where it's currently used and unfortunately isn't generic enough to serve its purpose. In this particular case, converting the values and concatenating them into a string doesn't make sense for a generic utility -- whenever someone requests a list of column values, they expect to get a list of values (as they are) rather than a string (!) of concatenated values.





[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6132: [RFC-46][HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #6132:
URL: https://github.com/apache/hudi/pull/6132#discussion_r974723729


##########
rfc/rfc-46/rfc-46.md:
##########
@@ -128,21 +173,88 @@ Following major components will be refactored:
 
 1. `HoodieWriteHandle`s will be  
    1. Accepting `HoodieRecord` instead of raw Avro payload (avoiding Avro conversion)
-   2. Using Combining API engine to merge records (when necessary) 
+   2. Using Record Merge API to merge records (when necessary) 
    3. Passes `HoodieRecord` as is to `FileWriter`
 2. `HoodieFileWriter`s will be 
    1. Accepting `HoodieRecord`
    2. Will be engine-specific (so that they're able to handle internal record representation)
 3. `HoodieRealtimeRecordReader`s 
    1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default is chosen according to your engine type.
+
+### Public API in HoodieRecord
+Because we implement different types of records, we need to implement functionality similar to AvroUtils in HoodieRecord for different data representations (Avro, InternalRow, RowData).
+Its public API will look like the following:
+
+```java
+import java.util.Properties;
+
+class HoodieRecord {
+
+   /**
+    * Get column in record to support RDDCustomColumnsSortPartitioner
+    */
+   Object getRecordColumnValues(Schema recordSchema, String[] columns,
+           boolean consistentLogicalTimestampEnabled);
+
+   /**
+    * Support bootstrap.
+    */
+   HoodieRecord mergeWith(HoodieRecord other, Schema targetSchema) throws IOException;
+
+   /**
+    * Rewrite record into new schema (add meta columns)
+    */
+   HoodieRecord rewriteRecord(Schema recordSchema, Properties props, Schema targetSchema)
+           throws IOException;
+
+   /**
+    * Support schema evolution.
+    */
+   HoodieRecord rewriteRecordWithNewSchema(Schema recordSchema, Properties props, Schema newSchema,
+           Map<String, String> renameCols) throws IOException;
+
+   HoodieRecord updateValues(Schema recordSchema, Properties props,

Review Comment:
   @wzx140 we should split these up:
   
    - The only legitimate use-case for us to update fields is Hudi's metadata
    - `HoodieHFileDataBlock` shouldn't be modifying the existing payload but should instead be _rewriting_ it without the field it wants to omit. We will tackle that separately; for the sake of RFC-46 we can create a temporary method `truncateRecordKey` that overwrites the record-key value for now (we will deprecate and remove this method after we address this)
   
   We should not leave a loophole that allows a record to be modified, so that nobody can start building against this API
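The split proposed above could look roughly like this, with metadata updates as the only supported mutation and a temporary `truncateRecordKey` helper. The class body is purely illustrative, not the actual Hudi implementation:

```java
import java.util.EnumMap;
import java.util.Map;

// Illustrative sketch of the proposed split (not the actual Hudi API):
// updating Hudi's own metadata fields is the only supported mutation,
// and truncateRecordKey is a temporary helper that only blanks the key.
class HoodieRecordSketch {
    enum MetadataField { COMMIT_TIME, RECORD_KEY }

    private final Map<MetadataField, String> metadata = new EnumMap<>(MetadataField.class);

    // The one legitimate mutation: setting Hudi metadata values.
    HoodieRecordSketch updateMetadataValues(Map<MetadataField, String> metadataValues) {
        metadata.putAll(metadataValues);
        return this;
    }

    // Temporary helper, to be deprecated once HoodieHFileDataBlock
    // rewrites records without the field instead of modifying them in place.
    HoodieRecordSketch truncateRecordKey() {
        metadata.put(MetadataField.RECORD_KEY, "");
        return this;
    }

    String getMetadataValue(MetadataField field) {
        return metadata.get(field);
    }
}
```

Keeping the general `updateValues` off the public surface closes the loophole: arbitrary field mutation simply has no entry point to build against.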





[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6132: [RFC-46][HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #6132:
URL: https://github.com/apache/hudi/pull/6132#discussion_r938178095


##########
rfc/rfc-46/rfc-46.md:
##########
@@ -156,13 +187,76 @@ Following major components will be refactored:
 3. `HoodieRealtimeRecordReader`s 
    1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default is chosen according to your engine type.
+
+### Public API in HoodieRecord
+Because we implement different types of records, we need to move some functions from AvroUtils into HoodieRecord for different data representations (Avro, InternalRow, RowData).
+Its public API will look like the following:
+
+```java
+class HoodieRecord {
+
+   /**
+    * Get column in record to support RDDCustomColumnsSortPartitioner
+    */
+   Object getRecordColumnValues(Schema recordSchema, String[] columns, boolean consistentLogicalTimestampEnabled);
+
+   /**
+    * Support bootstrap.
+    */
+   HoodieRecord mergeWith(HoodieRecord other) throws IOException;
+
+   /**
+    * Rewrite record into new schema (add meta columns)
+    */
+   HoodieRecord rewriteRecord(Schema recordSchema, Properties props, Schema targetSchema) throws IOException;
+
+   /**
+    * Support schema evolution.
+    */
+   HoodieRecord rewriteRecordWithNewSchema(Schema recordSchema, Properties props, Schema newSchema, Map<String, String> renameCols) throws IOException;
+
+   HoodieRecord addMetadataValues(Schema recordSchema, Properties props, Map<HoodieMetadataField, String> metadataValues) throws IOException;
+
+   /**
+    * Is deleted.
+    */
+   boolean isPresent(Schema recordSchema, Properties props) throws IOException;
+
+   /**
+    * Is EmptyRecord. Generated by ExpressionPayload.
+    */
+   boolean shouldIgnore(Schema recordSchema, Properties props) throws IOException;

Review Comment:
   We should probably update the javadoc then to avoid referencing any particular implementation
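Separately, the engine-aware MERGE_CLASS_NAME defaulting described in the quoted hunk could be resolved roughly as follows. Engine names and merger class names here are placeholders, not Hudi's actual config code:

```java
// Hypothetical sketch of engine-aware defaulting for MERGE_CLASS_NAME
// (engine names and merger class names are placeholders, not Hudi's code).
class MergeClassResolverSketch {
    static String resolveMergeClassName(String configured, String engineType) {
        if (configured != null && !configured.isEmpty()) {
            return configured; // an explicit user setting always wins
        }
        switch (engineType) { // otherwise pick a per-engine default
            case "SPARK":
                return "org.example.merge.SparkRecordMerger";
            case "FLINK":
                return "org.example.merge.FlinkRecordMerger";
            default:
                return "org.example.merge.AvroRecordMerger";
        }
    }
}
```

The point of the design is only that an unset config falls back to a merger matching the engine's native record representation, avoiding Avro conversion by default.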





[GitHub] [hudi] hudi-bot commented on pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6132:
URL: https://github.com/apache/hudi/pull/6132#issuecomment-1191766329

   ## CI report:
   
   * d2e5c9721b7e10c44291162f6c3604ec83cfa350 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10149) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6132:
URL: https://github.com/apache/hudi/pull/6132#issuecomment-1186988871

   ## CI report:
   
   * 1f59ce7aa51032d91b2fe6748c310fa03c0b1c1d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10030) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #6132: [HUDI-4414] Update the RFC-46 doc to fix comments feedback

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6132:
URL: https://github.com/apache/hudi/pull/6132#issuecomment-1200680609

   ## CI report:
   
   * d2e5c9721b7e10c44291162f6c3604ec83cfa350 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10149) 
   * 87b34db2554029d7e3ad945dfa1bb7d74f15b5c5 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>

