Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/01/27 00:31:48 UTC

[GitHub] [hudi] alexeykudinkin opened a new pull request #4697: [HUDI-3318] Drafted RFC-46

alexeykudinkin opened a new pull request #4697:
URL: https://github.com/apache/hudi/pull/4697


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.*
   
   ## What is the purpose of the pull request
   
   Drafting RFC-46
   
   ## Brief change log
   
   See above 
   
   ## Verify this pull request
   
   N/A
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #4697: [HUDI-3318] Drafted RFC-46

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4697:
URL: https://github.com/apache/hudi/pull/4697#issuecomment-1022732555


   ## CI report:
   
   * 829d13200dd05fe4fb7234c6c5d2de3c65668f15 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] alexeykudinkin commented on a change in pull request #4697: [HUDI-3318] RFC-46

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on a change in pull request #4697:
URL: https://github.com/apache/hudi/pull/4697#discussion_r794701227



##########
File path: rfc/rfc-46/rfc-46.md
##########
@@ -0,0 +1,159 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-46: Optimize Record Payload handling
+
+## Proposers
+
+- @alexeykudinkin
+
+## Approvers
+ - @vinothchandar
+ - @nsivabalan
+ - @xushiyan
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-3217
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Avro has historically been a centerpiece of the Hudi architecture: it is the default representation that many components expect
+when dealing with records (during merges, column value extraction, writing into storage, etc.).
+
+While having a single record representation certainly makes the implementation of some components simpler,
+it bears the unavoidable performance penalty of a de-/serialization loop: every record handled by Hudi has to be converted
+from its (low-level) engine-specific representation (`Row` for Spark, `RowData` for Flink, `ArrayWritable` for Hive) into the intermediate
+one (Avro), with some operations (like clustering and compaction) potentially incurring this penalty multiple times (on the read-
+and write-paths).
+
+As such, the goal of this effort is to remove the need to convert from engine-specific internal representations to Avro
+while handling records.
+
+## Background
+
+Since the early days of Hudi, Avro has settled in as the de-facto intermediate representation of a record's payload.
+As the project matured and the scale of installations grew, the necessity to convert into an intermediate representation quickly
+became a noticeable bottleneck in the performance of critical Hudi flows.
+
+At the center of it is the hierarchy of `HoodieRecordPayload`s, which is used to hold an individual record's payload,
+providing APIs like `preCombine` and `combineAndGetUpdateValue` to combine it with other records using some user-defined semantic.
+
+## Implementation
+
+### Revisiting Record Classes Hierarchy
+
+To achieve the stated goal of avoiding unnecessary conversions into the intermediate representation (Avro), existing Hudi workflows
+operating on individual records will have to be refactored and laid out in a way that is _unassuming about the internal
+representation_ of the record, ie code should work w/ a record as an _opaque object_: exposing certain APIs to access
+crucial data (precombine, primary and partition keys, etc.), but not providing access to the raw payload.
+
+Restructuring existing workflows around a record being an opaque object would allow us to encapsulate the
+internal representation of the record w/in its class hierarchy, which in turn would allow us to hold engine-specific (Spark, Flink, etc.)
+representations of records w/o exposing purely engine-agnostic components to them.
+
+The following (high-level) steps are proposed (see the sketch below):
+
+1. Promote `HoodieRecord` to become the standardized API for interacting with a single record. It will be
+   1. Replacing all accesses currently going through `HoodieRecordPayload`
+   2. Split into an interface and engine-specific implementations (holding the internal engine-specific representation of the payload)
+   3. Implementing new standardized record-level APIs (like `getPartitionKey`, `getRecordKey`, etc.)
+   4. Staying an **internal** component that will **NOT** contain any user-defined semantics (like merging)
+2. Extract the Record Combining (Merge) API from `HoodieRecordPayload` into a standalone, stateless component (engine). Such a component will be
+   1. Abstracted as a stateless object providing an API to combine records (according to predefined semantics) for the engines (Spark, Flink) of interest
+   2. The plug-in point for user-defined combination semantics
+3. Gradually deprecate, phase out, and eventually remove the `HoodieRecordPayload` abstraction
+
+Phasing out the usage of `HoodieRecordPayload` will also bring the benefit of avoiding Java reflection on the hot path, which
+is known to have poor performance (compared to non-reflection-based instantiation).
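+
+As a rough illustration only (the exact method set and names like `SparkHoodieRecord` are
+assumptions of this sketch, not the final API), the revised hierarchy could look like:
+
+```java
+import org.apache.spark.sql.catalyst.InternalRow;
+
+// Engine-agnostic record API: exposes crucial fields, hides the raw payload.
+public interface HoodieRecord {
+  String getRecordKey();
+  String getPartitionPath();
+  Comparable<?> getOrderingValue(); // precombine field
+}
+
+// Spark-specific implementation holding Spark's native row representation.
+public class SparkHoodieRecord implements HoodieRecord {
+  private final InternalRow row;          // engine-native payload, never exposed
+  private final String recordKey;         // extracted once, engine-agnostic
+  private final String partitionPath;
+  private final Comparable<?> orderingValue;
+
+  public SparkHoodieRecord(InternalRow row, String recordKey,
+                           String partitionPath, Comparable<?> orderingValue) {
+    this.row = row;
+    this.recordKey = recordKey;
+    this.partitionPath = partitionPath;
+    this.orderingValue = orderingValue;
+  }
+
+  @Override public String getRecordKey() { return recordKey; }
+  @Override public String getPartitionPath() { return partitionPath; }
+  @Override public Comparable<?> getOrderingValue() { return orderingValue; }
+}
+```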
+
+#### Combine API Engine
+
+The stateless component interface providing the record-combining API will look like the following:
+
+```java
+interface HoodieRecordCombiningEngine {
+
+  default HoodieRecord precombine(HoodieRecord older, HoodieRecord newer) {
+    if (older instanceof SparkHoodieRecord) {
+      return precombineSpark((SparkHoodieRecord) older, (SparkHoodieRecord) newer);
+    }
+    // ... analogous dispatch for Flink (`RowData`-based records), etc.
+    throw new UnsupportedOperationException("Unsupported record representation");
+  }
+
+  /**
+   * Spark-specific implementation
+   */
+  SparkHoodieRecord precombineSpark(SparkHoodieRecord older, SparkHoodieRecord newer);
+
+  // ...
+}
+```
+Users can provide their own subclass implementing this interface for the engines of interest.
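+
+For illustration, a minimal sketch of a user-defined engine implementing "latest ordering
+value wins" semantics for Spark (relying on the `getOrderingValue` accessor assumed above):
+
+```java
+// Hypothetical user-defined engine: keeps whichever record carries the
+// greater ordering (precombine) value, mirroring "overwrite with latest".
+class OverwriteWithLatestCombiningEngine implements HoodieRecordCombiningEngine {
+
+  @Override
+  @SuppressWarnings({"unchecked", "rawtypes"})
+  public SparkHoodieRecord precombineSpark(SparkHoodieRecord older, SparkHoodieRecord newer) {
+    Comparable olderVal = (Comparable) older.getOrderingValue();
+    Comparable newerVal = (Comparable) newer.getOrderingValue();
+    return newerVal.compareTo(olderVal) >= 0 ? newer : older;
+  }
+}
+```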
+
+#### Migration from `HoodieRecordPayload` to `HoodieRecordCombiningEngine`
+
+To warrant backward compatibility (BWC) at the code level with the subclasses of `HoodieRecordPayload` already
+used in production by Hudi users, we will provide a BWC-bridge in the form of an instance of `HoodieRecordCombiningEngine` that
+uses the user-defined subclass of `HoodieRecordPayload` to combine the records.
+
+Leveraging such a bridge will provide for a seamless BWC migration to the 0.11 release; however, it forfeits the performance
+benefit of this refactoring, since it unavoidably has to perform the conversion to the intermediate representation (Avro). To realize
+the full suite of benefits of this refactoring, users will have to migrate their merging logic out of their `HoodieRecordPayload` subclass and into
+a new `HoodieRecordCombiningEngine` implementation.
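+
+A rough sketch of what such a bridge could look like (the `toLegacyPayload`/`fromLegacyPayload`
+conversion helpers are assumptions of this sketch; they are exactly where the Avro penalty is paid):
+
+```java
+// Hypothetical BWC-bridge: converts both records back to the legacy
+// Avro-based payload, delegates to the user's preCombine, and wraps
+// the winner back into an engine-native record.
+@SuppressWarnings({"unchecked", "rawtypes"})
+class LegacyPayloadCombiningEngine implements HoodieRecordCombiningEngine {
+
+  @Override
+  public SparkHoodieRecord precombineSpark(SparkHoodieRecord older, SparkHoodieRecord newer) {
+    HoodieRecordPayload olderPayload = toLegacyPayload(older);  // Avro conversion
+    HoodieRecordPayload newerPayload = toLegacyPayload(newer);  // Avro conversion
+    HoodieRecordPayload combined = newerPayload.preCombine(olderPayload);
+    return fromLegacyPayload(combined, newer);                  // back to engine-native
+  }
+
+  // Assumed helpers, elided: these perform the very de-/serialization
+  // loop this RFC aims to make opt-in rather than mandatory.
+  private HoodieRecordPayload toLegacyPayload(SparkHoodieRecord record) {
+    throw new UnsupportedOperationException("sketch only");
+  }
+
+  private SparkHoodieRecord fromLegacyPayload(HoodieRecordPayload payload, SparkHoodieRecord template) {
+    throw new UnsupportedOperationException("sketch only");
+  }
+}
+```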
+
+### Refactoring Flows Directly Interacting w/ Records
+
+As called out above, to achieve the goal of sustaining engine-internal representations held by the `HoodieRecord`
+class w/o compromising major components' neutrality (ie their being engine-agnostic), components that today directly interact w/
+records' payloads will have to be refactored to instead interact w/ the standardized `HoodieRecord` API.
+
+The following major components will be refactored (see the `FileWriter` sketch below):
+
+1. `WriteHandle`s will be
+   1. Accepting `HoodieRecord` instead of the raw Avro payload (avoiding Avro conversion)
+   2. Using the Combining API engine to merge records (when necessary)
+   3. Passing `HoodieRecord` as-is to the `FileWriter`
+2. `FileWriter`s will be
+   1. Accepting `HoodieRecord`
+   2. Engine-specific (so that they're able to handle the internal record representation)
+3. `RecordReader`s
+   1. API will be returning an opaque `HoodieRecord` instead of the raw Avro payload

Review comment:
       Correct
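
    As a rough illustration of the refactored write path in the quoted list (the
    `HoodieFileWriter` shape and names here are assumptions, not the final API):

    ```java
    import java.io.IOException;

    // Illustrative sketch: engine-specific writers accept the opaque
    // HoodieRecord directly, with no intermediate Avro hop.
    interface HoodieFileWriter {
      void write(String recordKey, HoodieRecord record) throws IOException;

      void close() throws IOException;
    }

    // Spark-specific implementation, free to unwrap SparkHoodieRecord internally.
    class SparkParquetFileWriter implements HoodieFileWriter {
      @Override
      public void write(String recordKey, HoodieRecord record) throws IOException {
        SparkHoodieRecord sparkRecord = (SparkHoodieRecord) record;
        // ... write the record's InternalRow straight to Parquet ...
      }

      @Override
      public void close() throws IOException {
        // ... flush and close the underlying Parquet writer ...
      }
    }
    ```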







[GitHub] [hudi] xushiyan commented on a change in pull request #4697: [HUDI-3318] RFC-46

Posted by GitBox <gi...@apache.org>.
xushiyan commented on a change in pull request #4697:
URL: https://github.com/apache/hudi/pull/4697#discussion_r796143762



##########
File path: rfc/rfc-46/rfc-46.md
##########
+#### Combine API Engine
+
+The stateless component interface providing the record-combining API will look like the following:
+
+```java
+interface HoodieRecordCombiningEngine {

Review comment:
       ok so here it should be 
   
   ```suggestion
   class HoodieRecordCombiningEngine {
   ```




[GitHub] [hudi] hudi-bot commented on pull request #4697: [HUDI-3318] RFC-46

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4697:
URL: https://github.com/apache/hudi/pull/4697#issuecomment-1026401815


   ## CI report:
   
   * 0b681481c9b55c7bb9689688c83e884846bb9b57 UNKNOWN
   * 3c89821bea846f44dbee27eb841e869d496ea0dc Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5561) 
   * 755b4aecc11f7c9ca726a126f8a60ece3df7e3c9 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] alexeykudinkin commented on a change in pull request #4697: [HUDI-3318] RFC-46

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on a change in pull request #4697:
URL: https://github.com/apache/hudi/pull/4697#discussion_r796142512



##########
File path: rfc/rfc-46/rfc-46.md
##########
+#### Combine API Engine
+
+The stateless component interface providing the record-combining API will look like the following:
+
+```java
+interface HoodieRecordCombiningEngine {

Review comment:
       Engine would be a class (where you will need to provide different impls for Query Engines you're planning on supporting)
   
   




[GitHub] [hudi] alexeykudinkin commented on a change in pull request #4697: [HUDI-3318] RFC-46

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on a change in pull request #4697:
URL: https://github.com/apache/hudi/pull/4697#discussion_r794702417



##########
File path: rfc/rfc-46/rfc-46.md
##########
+#### Migration from `HoodieRecordPayload` to `HoodieRecordCombiningEngine`
+
+To warrant backward compatibility (BWC) at the code level with the subclasses of `HoodieRecordPayload` already
+used in production by Hudi users, we will provide a BWC-bridge in the form of an instance of `HoodieRecordCombiningEngine` that
+uses the user-defined subclass of `HoodieRecordPayload` to combine the records.

Review comment:
       Discussed offline: the plan is to continue on the migration path of #3893 and introduce `HoodieAvroRecord` as the intermediate step before we will migrate to engine-specific implementation (`SparkHoodieRecord`, `FlinkHoodieRecord`, etc)







[GitHub] [hudi] xushiyan commented on a change in pull request #4697: [HUDI-3318] RFC-46

Posted by GitBox <gi...@apache.org>.
xushiyan commented on a change in pull request #4697:
URL: https://github.com/apache/hudi/pull/4697#discussion_r796144212



##########
File path: rfc/rfc-46/rfc-46.md
##########
+## Rollout/Adoption Plan
+
+ - What impact (if any) will there be on existing users?
+   - Hudi users will observe considerably better performance for most routine operations (writing, reading, compaction, clustering, etc.) due to avoiding the superfluous intermediate de-/serialization penalty
+   - By default, the modified hierarchy would still leverage the Avro-based path (via the BWC-bridge)
+   - Users will need to rebase their record-combining logic off of subclassing `HoodieRecordPayload`, and instead implement the newly created `HoodieRecordCombiningEngine` interface, to get the full suite of performance benefits
+ - If we are changing behavior, how will we phase out the older behavior?
+   - The older behavior leveraging `HoodieRecordPayload` for merging will be marked as deprecated in 0.11, and subsequently removed in a 0.1x release
+ - If we need special migration tools, describe them here.
+   - No special migration tools will be necessary (other than the BWC-bridge to make sure users can use 0.11 out of the box, with no breaking changes to the public API)
+ - When will we remove the existing behavior?
+   - In subsequent releases (either 0.12 or 0.13)

Review comment:
       minor fix ☝️ 







[GitHub] [hudi] hudi-bot commented on pull request #4697: [HUDI-3318] RFC-46

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4697:
URL: https://github.com/apache/hudi/pull/4697#issuecomment-1026403354


   ## CI report:
   
   * 0b681481c9b55c7bb9689688c83e884846bb9b57 UNKNOWN
   * 3c89821bea846f44dbee27eb841e869d496ea0dc Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5561) 
   * 755b4aecc11f7c9ca726a126f8a60ece3df7e3c9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5635) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #4697: [HUDI-3318] Drafted RFC-46

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4697:
URL: https://github.com/apache/hudi/pull/4697#issuecomment-1023582575


   ## CI report:
   
   * 829d13200dd05fe4fb7234c6c5d2de3c65668f15 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5547) 
   * 0b681481c9b55c7bb9689688c83e884846bb9b57 UNKNOWN
   * 3c89821bea846f44dbee27eb841e869d496ea0dc UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #4697: [HUDI-3318] Drafted RFC-46

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4697:
URL: https://github.com/apache/hudi/pull/4697#issuecomment-1022764937


   ## CI report:
   
   * 829d13200dd05fe4fb7234c6c5d2de3c65668f15 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5547) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #4697: [HUDI-3318] Drafted RFC-46

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4697:
URL: https://github.com/apache/hudi/pull/4697#issuecomment-1022734156


   ## CI report:
   
   * 829d13200dd05fe4fb7234c6c5d2de3c65668f15 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5547) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] xushiyan merged pull request #4697: [HUDI-3318] [RFC-46] Optimize Record Payload handling

Posted by GitBox <gi...@apache.org>.
xushiyan merged pull request #4697:
URL: https://github.com/apache/hudi/pull/4697


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #4697: [HUDI-3318] Drafted RFC-46

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4697:
URL: https://github.com/apache/hudi/pull/4697#issuecomment-1023584865


   ## CI report:
   
   * 829d13200dd05fe4fb7234c6c5d2de3c65668f15 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5547) 
   * 0b681481c9b55c7bb9689688c83e884846bb9b57 UNKNOWN
   * 3c89821bea846f44dbee27eb841e869d496ea0dc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5561) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #4697: [HUDI-3318] Drafted RFC-46

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4697:
URL: https://github.com/apache/hudi/pull/4697#issuecomment-1023580374


   ## CI report:
   
   * 829d13200dd05fe4fb7234c6c5d2de3c65668f15 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5547) 
   * 0b681481c9b55c7bb9689688c83e884846bb9b57 UNKNOWN


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #4697: [HUDI-3318] Drafted RFC-46

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #4697:
URL: https://github.com/apache/hudi/pull/4697#issuecomment-1023638258


   ## CI report:
   
   * 0b681481c9b55c7bb9689688c83e884846bb9b57 UNKNOWN
   * 3c89821bea846f44dbee27eb841e869d496ea0dc Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5561) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] xushiyan commented on a change in pull request #4697: [HUDI-3318] Drafted RFC-46

Posted by GitBox <gi...@apache.org>.
xushiyan commented on a change in pull request #4697:
URL: https://github.com/apache/hudi/pull/4697#discussion_r794632615



##########
File path: rfc/rfc-46/rfc-46.md
##########
@@ -0,0 +1,159 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-46: Optimize Record Payload handling
+
+## Proposers
+
+- @alexeykudinkin
+
+## Approvers
+ - @vinothchandar
+ - @nsivabalan
+ - @xushiyan
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-3217
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Avro historically has been a centerpiece of the Hudi architecture: it's a default representation that many components expect
+when dealing with records (during merge, column value extractions, writing into storage, etc). 
+
+While having a single record representation certainly makes the implementation of some components simpler, 
+it bears the unavoidable performance penalty of a de-/serialization loop: every record handled by Hudi has to be converted
+from its (low-level) engine-specific representation (`Row` for Spark, `RowData` for Flink, `ArrayWritable` for Hive) into an intermediate 
+one (Avro), with some operations (like clustering, compaction) potentially incurring this penalty multiple times (on the read- 
+and write-paths). 
+
+As such, the goal of this effort is to remove the need for conversion from engine-specific internal representations to Avro 
+while handling records. 
+
+## Background
+
+Historically, Avro has settled in as the de-facto intermediate representation of a record's payload since the early days of Hudi.
+As the project matured and the scale of installations grew, the necessity to convert into an intermediate representation quickly 
+became a noticeable performance bottleneck in critical Hudi flows. 
+
+At the center of it is the hierarchy of `HoodieRecordPayload`s, which is used to hold an individual record's payload, 
+providing APIs like `preCombine` and `combineAndGetUpdateValue` to combine it with other records using some user-defined semantics. 
+
+## Implementation
+
+### Revisiting Record Classes Hierarchy
+
+To achieve the stated goal of avoiding unnecessary conversions into an intermediate representation (Avro), existing Hudi workflows
+operating on individual records will have to be refactored and laid out in a way that is _unassuming about the internal 
+representation_ of the record, i.e. code should work w/ a record as an _opaque object_: exposing certain APIs to access 
+crucial data (precombine, primary and partition keys, etc), but not providing access to the raw payload.
+
+Re-structuring existing workflows in such a way around a record being an opaque object would allow us to encapsulate the 
+internal representation of the record w/in its class hierarchy, which in turn would allow us to hold engine-specific (Spark, Flink, etc)
+representations of the records w/o exposing purely engine-agnostic components to them. 
+
+The following (high-level) steps are proposed: 
+
+1. Promote `HoodieRecord` to become the standardized API for interacting with a single record. It will be  
+   1. Replacing all accesses currently going through `HoodieRecordPayload`
+   2. Split into an interface and engine-specific implementations (holding the internal engine-specific representation of the payload) 
+   3. Implementing new standardized record-level APIs (like `getPartitionKey`, `getRecordKey`, etc)
+   4. Staying an **internal** component that will **NOT** contain any user-defined semantics (like merging)
+2. Extract the Record Combining (Merge) API from `HoodieRecordPayload` into a standalone, stateless component (engine). Such a component will be
+   1. Abstracted as a stateless object providing an API to combine records (according to predefined semantics) for the engines (Spark, Flink) of interest
+   2. The plug-in point for user-defined combination semantics
+3. Gradually deprecate, phase out and eventually remove the `HoodieRecordPayload` abstraction
+
+Phasing out the usage of `HoodieRecordPayload` will also bring the benefit of avoiding Java reflection on the hot path, which
+is known to have poor performance (compared to non-reflection-based instantiation).
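+
+For illustration, the reflection-based instantiation pattern being phased out looks roughly like the following (a minimal
+sketch; the constructor signature shown is merely typical of `HoodieRecordPayload` implementations, not an exact Hudi API):
+
+```java
+// Illustrative only: a payload class resolved from configuration is instantiated
+// reflectively for every record on the write path
+HoodieRecordPayload payload =
+    (HoodieRecordPayload) Class.forName(payloadClassName)
+        .getConstructor(GenericRecord.class, Comparable.class)
+        .newInstance(avroRecord, orderingValue);
+```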
+
+#### Combine API Engine
+
+The stateless component interface providing the record-combining API will look like the following:
+
+```java
+interface HoodieRecordCombiningEngine {

Review comment:
       Engine sounds like a class; maybe a shorter interface name `HoodieSupportsCombine` ?
   

##########
File path: rfc/rfc-46/rfc-46.md
##########
@@ -0,0 +1,159 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-46: Optimize Record Payload handling
+
+## Proposers
+
+- @alexeykudinkin
+
+## Approvers
+ - @vinothchandar
+ - @nsivabalan
+ - @xushiyan
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-3217
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Avro historically has been a centerpiece of the Hudi architecture: it's a default representation that many components expect
+when dealing with records (during merge, column value extractions, writing into storage, etc). 
+
+While having a single record representation certainly makes the implementation of some components simpler, 
+it bears the unavoidable performance penalty of a de-/serialization loop: every record handled by Hudi has to be converted
+from its (low-level) engine-specific representation (`Row` for Spark, `RowData` for Flink, `ArrayWritable` for Hive) into an intermediate 
+one (Avro), with some operations (like clustering, compaction) potentially incurring this penalty multiple times (on the read- 
+and write-paths). 
+
+As such, the goal of this effort is to remove the need for conversion from engine-specific internal representations to Avro 
+while handling records. 
+
+## Background
+
+Historically, Avro has settled in as the de-facto intermediate representation of a record's payload since the early days of Hudi.
+As the project matured and the scale of installations grew, the necessity to convert into an intermediate representation quickly 
+became a noticeable performance bottleneck in critical Hudi flows. 
+
+At the center of it is the hierarchy of `HoodieRecordPayload`s, which is used to hold an individual record's payload, 
+providing APIs like `preCombine` and `combineAndGetUpdateValue` to combine it with other records using some user-defined semantics. 
+
+## Implementation
+
+### Revisiting Record Classes Hierarchy
+
+To achieve the stated goal of avoiding unnecessary conversions into an intermediate representation (Avro), existing Hudi workflows
+operating on individual records will have to be refactored and laid out in a way that is _unassuming about the internal 
+representation_ of the record, i.e. code should work w/ a record as an _opaque object_: exposing certain APIs to access 
+crucial data (precombine, primary and partition keys, etc), but not providing access to the raw payload.
+
+Re-structuring existing workflows in such a way around a record being an opaque object would allow us to encapsulate the 
+internal representation of the record w/in its class hierarchy, which in turn would allow us to hold engine-specific (Spark, Flink, etc)
+representations of the records w/o exposing purely engine-agnostic components to them. 
+
+The following (high-level) steps are proposed (a sketch of the resulting record class hierarchy follows below): 
+
+1. Promote `HoodieRecord` to become the standardized API for interacting with a single record. It will be  
+   1. Replacing all accesses currently going through `HoodieRecordPayload`
+   2. Split into an interface and engine-specific implementations (holding the internal engine-specific representation of the payload) 
+   3. Implementing new standardized record-level APIs (like `getPartitionKey`, `getRecordKey`, etc)
+   4. Staying an **internal** component that will **NOT** contain any user-defined semantics (like merging)
+2. Extract the Record Combining (Merge) API from `HoodieRecordPayload` into a standalone, stateless component (engine). Such a component will be
+   1. Abstracted as a stateless object providing an API to combine records (according to predefined semantics) for the engines (Spark, Flink) of interest
+   2. The plug-in point for user-defined combination semantics
+3. Gradually deprecate, phase out and eventually remove the `HoodieRecordPayload` abstraction
+
+Phasing out the usage of `HoodieRecordPayload` will also bring the benefit of avoiding Java reflection on the hot path, which
+is known to have poor performance (compared to non-reflection-based instantiation).
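+
+To make the proposed split from step 1 concrete, a minimal sketch of what the engine-specific record hierarchy could look
+like is shown below (class and member names here are illustrative; the exact shape is subject to the implementation):
+
+```java
+// Illustrative sketch: engine-agnostic base API, with the payload representation
+// encapsulated behind a type parameter
+abstract class HoodieRecord<T> {
+  private final HoodieKey key;
+
+  protected HoodieRecord(HoodieKey key) {
+    this.key = key;
+  }
+
+  public String getRecordKey() {
+    return key.getRecordKey();
+  }
+
+  public String getPartitionPath() {
+    return key.getPartitionPath();
+  }
+
+  // engine-specific subclasses hold the internal representation of the payload
+  protected abstract T getData();
+}
+
+// Spark-specific implementation holding Spark's internal row representation,
+// requiring no conversion to Avro
+class SparkHoodieRecord extends HoodieRecord<InternalRow> {
+  private final InternalRow row;
+
+  SparkHoodieRecord(HoodieKey key, InternalRow row) {
+    super(key);
+    this.row = row;
+  }
+
+  @Override
+  protected InternalRow getData() {
+    return row;
+  }
+}
+```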
+
+#### Combine API Engine
+
+The stateless component interface providing the record-combining API will look like the following:
+
+```java
+interface HoodieRecordCombiningEngine {
+  
+  default HoodieRecord precombine(HoodieRecord older, HoodieRecord newer) {
+    if (older instanceof SparkHoodieRecord) {
+      return precombineSpark((SparkHoodieRecord) older, (SparkHoodieRecord) newer);
+    }
+    // dispatch to the respective engine-specific implementation (Flink, etc)
+    throw new UnsupportedOperationException();
+  }
+
+   /**
+    * Spark-specific implementation 
+    */
+  SparkHoodieRecord precombineSpark(SparkHoodieRecord older, SparkHoodieRecord newer);
+  
+  // ...
+}
+```
+Here, users can provide their own subclass implementing such an interface for the engines of interest.
+
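+As an illustration, a user-defined implementation keeping the record with the latest ordering value could look like the
+following (a minimal sketch; the class name and the `getOrderingValue` accessor are assumed here for illustration only):
+
+```java
+// Illustrative sketch of user-defined combining semantics: keep the record
+// carrying the greater ordering (precombine) value
+class OverwriteWithLatestCombiningEngine implements HoodieRecordCombiningEngine {
+
+  @Override
+  public SparkHoodieRecord precombineSpark(SparkHoodieRecord older, SparkHoodieRecord newer) {
+    // assumes an ordering-value accessor on the standardized HoodieRecord API
+    return newer.getOrderingValue().compareTo(older.getOrderingValue()) >= 0 ? newer : older;
+  }
+}
+```
+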
+#### Migration from `HoodieRecordPayload` to `HoodieRecordCombiningEngine`
+
+To warrant backward compatibility (BWC) on the code level with existing subclasses of `HoodieRecordPayload` 
+already used in production by Hudi users, we will provide a BWC-bridge in the form of an instance of `HoodieRecordCombiningEngine` that will 
+use the user-defined subclass of `HoodieRecordPayload` to combine the records.
+
+Leveraging such a bridge will provide for a seamless BWC migration to the 0.11 release; however, it forfeits the performance 
+benefit of this refactoring, since it would unavoidably have to perform a conversion to the intermediate representation (Avro). To realize
+the full suite of benefits of this refactoring, users will have to migrate their merging logic out of the `HoodieRecordPayload` subclass and into
+a new `HoodieRecordCombiningEngine` implementation.
+
+### Refactoring Flows Directly Interacting w/ Records:
+
+As was called out prior, to achieve the goal of being able to sustain engine-internal representations being held by the `HoodieRecord` 
+class w/o compromising major components' neutrality (i.e. being engine-agnostic), the components directly interacting w/
+records' payloads today will have to be refactored to instead interact w/ the standardized `HoodieRecord` API.
+
+The following major components will be refactored:
+
+1. `WriteHandle`s will be  
+   1. Accepting `HoodieRecord` instead of a raw Avro payload (avoiding Avro conversion)
+   2. Using the Combining API engine to merge records (when necessary) 
+   3. Passing `HoodieRecord` as is to `FileWriter`
+2. `FileWriter`s will be 
+   1. Accepting `HoodieRecord`
+   2. Engine-specific (so that they're able to handle the internal record representation)
+3. `RecordReader`s 
+   1. API will be returning an opaque `HoodieRecord` instead of a raw Avro payload

Review comment:
       was trying to look for the exact classes: are they `HoodieWriteHandle`, `HoodieFileWriter` and `HoodieFileReader` ?

##########
File path: rfc/rfc-46/rfc-46.md
##########
@@ -0,0 +1,159 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-46: Optimize Record Payload handling
+
+## Proposers
+
+- @alexeykudinkin
+
+## Approvers
+ - @vinothchandar
+ - @nsivabalan
+ - @xushiyan
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-3217
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Avro historically has been a centerpiece of the Hudi architecture: it's a default representation that many components expect
+when dealing with records (during merge, column value extractions, writing into storage, etc). 
+
+While having a single record representation certainly makes the implementation of some components simpler, 
+it bears the unavoidable performance penalty of a de-/serialization loop: every record handled by Hudi has to be converted
+from its (low-level) engine-specific representation (`Row` for Spark, `RowData` for Flink, `ArrayWritable` for Hive) into an intermediate 
+one (Avro), with some operations (like clustering, compaction) potentially incurring this penalty multiple times (on the read- 
+and write-paths). 
+
+As such, the goal of this effort is to remove the need for conversion from engine-specific internal representations to Avro 
+while handling records. 
+
+## Background
+
+Historically, Avro has settled in as the de-facto intermediate representation of a record's payload since the early days of Hudi.
+As the project matured and the scale of installations grew, the necessity to convert into an intermediate representation quickly 
+became a noticeable performance bottleneck in critical Hudi flows. 
+
+At the center of it is the hierarchy of `HoodieRecordPayload`s, which is used to hold an individual record's payload, 
+providing APIs like `preCombine` and `combineAndGetUpdateValue` to combine it with other records using some user-defined semantics. 
+
+## Implementation
+
+### Revisiting Record Classes Hierarchy
+
+To achieve the stated goal of avoiding unnecessary conversions into an intermediate representation (Avro), existing Hudi workflows
+operating on individual records will have to be refactored and laid out in a way that is _unassuming about the internal 
+representation_ of the record, i.e. code should work w/ a record as an _opaque object_: exposing certain APIs to access 
+crucial data (precombine, primary and partition keys, etc), but not providing access to the raw payload.
+
+Re-structuring existing workflows in such a way around a record being an opaque object would allow us to encapsulate the 
+internal representation of the record w/in its class hierarchy, which in turn would allow us to hold engine-specific (Spark, Flink, etc)
+representations of the records w/o exposing purely engine-agnostic components to them. 
+
+The following (high-level) steps are proposed: 
+
+1. Promote `HoodieRecord` to become the standardized API for interacting with a single record. It will be  
+   1. Replacing all accesses currently going through `HoodieRecordPayload`
+   2. Split into an interface and engine-specific implementations (holding the internal engine-specific representation of the payload) 
+   3. Implementing new standardized record-level APIs (like `getPartitionKey`, `getRecordKey`, etc)
+   4. Staying an **internal** component that will **NOT** contain any user-defined semantics (like merging)
+2. Extract the Record Combining (Merge) API from `HoodieRecordPayload` into a standalone, stateless component (engine). Such a component will be
+   1. Abstracted as a stateless object providing an API to combine records (according to predefined semantics) for the engines (Spark, Flink) of interest
+   2. The plug-in point for user-defined combination semantics
+3. Gradually deprecate, phase out and eventually remove the `HoodieRecordPayload` abstraction
+
+Phasing out the usage of `HoodieRecordPayload` will also bring the benefit of avoiding Java reflection on the hot path, which
+is known to have poor performance (compared to non-reflection-based instantiation).
+
+#### Combine API Engine
+
+The stateless component interface providing the record-combining API will look like the following:
+
+```java
+interface HoodieRecordCombiningEngine {
+  
+  default HoodieRecord precombine(HoodieRecord older, HoodieRecord newer) {
+    if (older instanceof SparkHoodieRecord) {
+      return precombineSpark((SparkHoodieRecord) older, (SparkHoodieRecord) newer);
+    }
+    // dispatch to the respective engine-specific implementation (Flink, etc)
+    throw new UnsupportedOperationException();
+  }
+
+   /**
+    * Spark-specific implementation 
+    */
+  SparkHoodieRecord precombineSpark(SparkHoodieRecord older, SparkHoodieRecord newer);
+  
+  // ...
+}
+```
+Here, users can provide their own subclass implementing such an interface for the engines of interest.
+
+#### Migration from `HoodieRecordPayload` to `HoodieRecordCombiningEngine`
+
+To warrant backward compatibility (BWC) on the code level with existing subclasses of `HoodieRecordPayload` 
+already used in production by Hudi users, we will provide a BWC-bridge in the form of an instance of `HoodieRecordCombiningEngine` that will 
+use the user-defined subclass of `HoodieRecordPayload` to combine the records.

Review comment:
       So in https://github.com/apache/hudi/pull/3893 we have `HoodieAvroRecord` mostly for compatible migration purpose. Maybe we name it `HoodieLegacyAvroRecord` so code change can be manageable. A second migration would be needed for `HoodieLegacyAvroRecord` -> `SparkHoodieRecord` / `FlinkHoodieRecord`

##########
File path: rfc/rfc-46/rfc-46.md
##########
@@ -0,0 +1,159 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-46: Optimize Record Payload handling
+
+## Proposers
+
+- @alexeykudinkin
+
+## Approvers
+ - @vinothchandar
+ - @nsivabalan
+ - @xushiyan
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-3217
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Avro historically has been a centerpiece of the Hudi architecture: it's a default representation that many components expect
+when dealing with records (during merge, column value extractions, writing into storage, etc). 
+
+While having a single record representation certainly makes the implementation of some components simpler, 
+it bears the unavoidable performance penalty of a de-/serialization loop: every record handled by Hudi has to be converted
+from its (low-level) engine-specific representation (`Row` for Spark, `RowData` for Flink, `ArrayWritable` for Hive) into an intermediate 
+one (Avro), with some operations (like clustering, compaction) potentially incurring this penalty multiple times (on the read- 
+and write-paths). 
+
+As such, the goal of this effort is to remove the need for conversion from engine-specific internal representations to Avro 
+while handling records. 
+
+## Background
+
+Historically, Avro has settled in as the de-facto intermediate representation of a record's payload since the early days of Hudi.
+As the project matured and the scale of installations grew, the necessity to convert into an intermediate representation quickly 
+became a noticeable performance bottleneck in critical Hudi flows. 
+
+At the center of it is the hierarchy of `HoodieRecordPayload`s, which is used to hold an individual record's payload, 
+providing APIs like `preCombine` and `combineAndGetUpdateValue` to combine it with other records using some user-defined semantics. 
+
+## Implementation
+
+### Revisiting Record Classes Hierarchy
+
+To achieve the stated goal of avoiding unnecessary conversions into an intermediate representation (Avro), existing Hudi workflows
+operating on individual records will have to be refactored and laid out in a way that is _unassuming about the internal 
+representation_ of the record, i.e. code should work w/ a record as an _opaque object_: exposing certain APIs to access 
+crucial data (precombine, primary and partition keys, etc), but not providing access to the raw payload.
+
+Re-structuring existing workflows in such a way around a record being an opaque object would allow us to encapsulate the 
+internal representation of the record w/in its class hierarchy, which in turn would allow us to hold engine-specific (Spark, Flink, etc)
+representations of the records w/o exposing purely engine-agnostic components to them. 
+
+The following (high-level) steps are proposed: 
+
+1. Promote `HoodieRecord` to become the standardized API for interacting with a single record. It will be  
+   1. Replacing all accesses currently going through `HoodieRecordPayload`
+   2. Split into an interface and engine-specific implementations (holding the internal engine-specific representation of the payload) 
+   3. Implementing new standardized record-level APIs (like `getPartitionKey`, `getRecordKey`, etc)
+   4. Staying an **internal** component that will **NOT** contain any user-defined semantics (like merging)
+2. Extract the Record Combining (Merge) API from `HoodieRecordPayload` into a standalone, stateless component (engine). Such a component will be
+   1. Abstracted as a stateless object providing an API to combine records (according to predefined semantics) for the engines (Spark, Flink) of interest
+   2. The plug-in point for user-defined combination semantics
+3. Gradually deprecate, phase out and eventually remove the `HoodieRecordPayload` abstraction
+
+Phasing out the usage of `HoodieRecordPayload` will also bring the benefit of avoiding Java reflection on the hot path, which
+is known to have poor performance (compared to non-reflection-based instantiation).
+
+#### Combine API Engine
+
+The stateless component interface providing the record-combining API will look like the following:
+
+```java
+interface HoodieRecordCombiningEngine {
+  
+  default HoodieRecord precombine(HoodieRecord older, HoodieRecord newer) {
+    if (older instanceof SparkHoodieRecord) {
+      return precombineSpark((SparkHoodieRecord) older, (SparkHoodieRecord) newer);
+    }
+    // dispatch to the respective engine-specific implementation (Flink, etc)
+    throw new UnsupportedOperationException();
+  }
+
+   /**
+    * Spark-specific implementation 
+    */
+  SparkHoodieRecord precombineSpark(SparkHoodieRecord older, SparkHoodieRecord newer);
+  
+  // ...
+}
+```
+Here, users can provide their own subclass implementing such an interface for the engines of interest.
+
+#### Migration from `HoodieRecordPayload` to `HoodieRecordCombiningEngine`
+
+To warrant backward compatibility (BWC) on the code level with existing subclasses of `HoodieRecordPayload` 
+already used in production by Hudi users, we will provide a BWC-bridge in the form of an instance of `HoodieRecordCombiningEngine` that will 
+use the user-defined subclass of `HoodieRecordPayload` to combine the records.
+
+Leveraging such a bridge will provide for a seamless BWC migration to the 0.11 release; however, it forfeits the performance 
+benefit of this refactoring, since it would unavoidably have to perform a conversion to the intermediate representation (Avro). To realize
+the full suite of benefits of this refactoring, users will have to migrate their merging logic out of the `HoodieRecordPayload` subclass and into
+a new `HoodieRecordCombiningEngine` implementation.
+
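+Roughly, such a bridge could be sketched as follows (a simplified sketch; the class name and the Avro conversion helpers
+`toAvroPayload` / `fromAvroPayload` are assumed here for illustration):
+
+```java
+// Illustrative sketch of the BWC-bridge: combines records by converting them back
+// to Avro and delegating to the user-defined (legacy) HoodieRecordPayload
+class HoodieRecordPayloadCombiningBridge implements HoodieRecordCombiningEngine {
+
+  @Override
+  public SparkHoodieRecord precombineSpark(SparkHoodieRecord older, SparkHoodieRecord newer) {
+    // this conversion to the intermediate representation (Avro) is exactly what
+    // forfeits the performance benefit of the refactoring
+    HoodieRecordPayload olderPayload = toAvroPayload(older);
+    HoodieRecordPayload newerPayload = toAvroPayload(newer);
+    HoodieRecordPayload combined = newerPayload.preCombine(olderPayload);
+    return fromAvroPayload(combined);
+  }
+
+  private HoodieRecordPayload toAvroPayload(SparkHoodieRecord record) {
+    throw new UnsupportedOperationException("illustrative stub");
+  }
+
+  private SparkHoodieRecord fromAvroPayload(HoodieRecordPayload payload) {
+    throw new UnsupportedOperationException("illustrative stub");
+  }
+}
+```
+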
+### Refactoring Flows Directly Interacting w/ Records:
+
+As was called out prior, to achieve the goal of being able to sustain engine-internal representations being held by the `HoodieRecord` 
+class w/o compromising major components' neutrality (i.e. being engine-agnostic), the components directly interacting w/
+records' payloads today will have to be refactored to instead interact w/ the standardized `HoodieRecord` API.
+
+The following major components will be refactored (a sketch of the writer API follows after this list):
+
+1. `WriteHandle`s will be  
+   1. Accepting `HoodieRecord` instead of a raw Avro payload (avoiding Avro conversion)
+   2. Using the Combining API engine to merge records (when necessary) 
+   3. Passing `HoodieRecord` as is to `FileWriter`
+2. `FileWriter`s will be 
+   1. Accepting `HoodieRecord`
+   2. Engine-specific (so that they're able to handle the internal record representation)
+3. `RecordReader`s 
+   1. API will be returning an opaque `HoodieRecord` instead of a raw Avro payload
+
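+A minimal sketch of what the refactored, engine-specific `FileWriter` API could look like (the interface shape shown here is
+illustrative, not the final API):
+
+```java
+// Illustrative sketch: a file writer accepting the opaque HoodieRecord directly,
+// so no conversion to Avro is required before writing
+interface HoodieFileWriter<R extends HoodieRecord> {
+
+  void write(String recordKey, R record) throws IOException;
+
+  boolean canWrite(R record);
+
+  void close() throws IOException;
+}
+
+// Engine-specific implementations (e.g. a Spark writer consuming SparkHoodieRecord
+// backed by InternalRow) would handle the internal record representation directly
+```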
+
+## Rollout/Adoption Plan
+
+ - What impact (if any) will there be on existing users? 
+   - Users of Hudi will observe considerably better performance for most routine operations (writing, reading, compaction, clustering, etc) due to avoiding the superfluous intermediate de-/serialization penalty
+   - By default, modified hierarchy would still leverage 
+   - Users will need to rebase their record-combining logic, currently implemented by subclassing `HoodieRecordPayload`, onto the newly created interface `HoodieRecordCombiningEngine` to get the full suite of performance benefits 
+ - If we are changing behavior how will we phase out the older behavior?
+   - The older behavior leveraging `HoodieRecordPayload` for merging will be marked as deprecated in 0.11, and subsequently removed in 0.1x
+ - If we need special migration tools, describe them here.
+   - No special migration tools will be necessary (other than the BWC-bridge to make sure users can use 0.11 out of the box, and there are no breaking changes to the public API)
+ - When will we remove the existing behavior?
+   - In subsequent releases (either 0.12 or 0.13) 

Review comment:
       the roadmap is like after 0.12 it'll be 1.0




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] danny0405 commented on a change in pull request #4697: [HUDI-3318] Drafted RFC-46

Posted by GitBox <gi...@apache.org>.
danny0405 commented on a change in pull request #4697:
URL: https://github.com/apache/hudi/pull/4697#discussion_r793205021



##########
File path: rfc/rfc-46/rfc-46.md
##########
@@ -0,0 +1,154 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-46: Optimize Record Payload handling
+
+## Proposers
+
+- @alexeykudinkin
+
+## Approvers
+ - @vinothchandar
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-3217
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Avro historically has been a centerpiece of the Hudi architecture: it's a default representation that many components expect
+when dealing with records (during merge, column value extractions, writing into storage, etc). 
+
+While having a single record representation certainly makes the implementation of some components simpler, 
+it bears the unavoidable performance penalty of a de-/serialization loop: every record handled by Hudi has to be converted
+from (low-level) engine-specific representation (`Row` for Spark, `DataRow` for Flink, `ArrayWritable` for Hive) into intermediate 
+one (Avro), with some operations (like clustering, compaction) potentially incurring this penalty multiple times (on read- 

Review comment:
       `RowData` for Flink




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] xushiyan commented on a change in pull request #4697: [HUDI-3318] RFC-46

Posted by GitBox <gi...@apache.org>.
xushiyan commented on a change in pull request #4697:
URL: https://github.com/apache/hudi/pull/4697#discussion_r796144048



##########
File path: rfc/rfc-46/rfc-46.md
##########
@@ -0,0 +1,159 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-46: Optimize Record Payload handling
+
+## Proposers
+
+- @alexeykudinkin
+
+## Approvers
+ - @vinothchandar
+ - @nsivabalan
+ - @xushiyan
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-3217
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Avro historically has been a centerpiece of the Hudi architecture: it's a default representation that many components expect
+when dealing with records (during merge, column value extractions, writing into storage, etc). 
+
+While having a single record representation certainly makes the implementation of some components simpler, 
+it bears the unavoidable performance penalty of a de-/serialization loop: every record handled by Hudi has to be converted
+from its (low-level) engine-specific representation (`Row` for Spark, `RowData` for Flink, `ArrayWritable` for Hive) into an intermediate 
+one (Avro), with some operations (like clustering, compaction) potentially incurring this penalty multiple times (on the read- 
+and write-paths). 
+
+As such, the goal of this effort is to remove the need for conversion from engine-specific internal representations to Avro 
+while handling records. 
+
+## Background
+
+Historically, Avro has settled in as the de-facto intermediate representation of a record's payload since the early days of Hudi.
+As the project matured and the scale of installations grew, the necessity to convert into an intermediate representation quickly 
+became a noticeable performance bottleneck in critical Hudi flows. 
+
+At the center of it is the hierarchy of `HoodieRecordPayload`s, which is used to hold an individual record's payload, 
+providing APIs like `preCombine` and `combineAndGetUpdateValue` to combine it with other records using some user-defined semantics. 
+
+## Implementation
+
+### Revisiting Record Classes Hierarchy
+
+To achieve the stated goal of avoiding unnecessary conversions into an intermediate representation (Avro), existing Hudi workflows
+operating on individual records will have to be refactored and laid out in a way that is _unassuming about the internal 
+representation_ of the record, i.e. code should work w/ a record as an _opaque object_: exposing certain APIs to access 
+crucial data (precombine, primary and partition keys, etc), but not providing access to the raw payload.
+
+Re-structuring existing workflows in such a way around a record being an opaque object would allow us to encapsulate the 
+internal representation of the record w/in its class hierarchy, which in turn would allow us to hold engine-specific (Spark, Flink, etc)
+representations of the records w/o exposing purely engine-agnostic components to them. 
+
+The following (high-level) steps are proposed: 
+
+1. Promote `HoodieRecord` to become the standardized API for interacting with a single record. It will be  
+   1. Replacing all accesses currently going through `HoodieRecordPayload`
+   2. Split into an interface and engine-specific implementations (holding the internal engine-specific representation of the payload) 
+   3. Implementing new standardized record-level APIs (like `getPartitionKey`, `getRecordKey`, etc)
+   4. Staying an **internal** component that will **NOT** contain any user-defined semantics (like merging)
+2. Extract the Record Combining (Merge) API from `HoodieRecordPayload` into a standalone, stateless component (engine). Such a component will be
+   1. Abstracted as a stateless object providing an API to combine records (according to predefined semantics) for the engines (Spark, Flink) of interest
+   2. The plug-in point for user-defined combination semantics
+3. Gradually deprecate, phase out and eventually remove the `HoodieRecordPayload` abstraction
+
+Phasing out the usage of `HoodieRecordPayload` will also bring the benefit of avoiding Java reflection on the hot path, which
+is known to have poor performance (compared to non-reflection-based instantiation).
+
+#### Combine API Engine
+
+The stateless component interface providing the record-combining API will look like the following:
+
+```java
+interface HoodieRecordCombiningEngine {
+  
+  default HoodieRecord precombine(HoodieRecord older, HoodieRecord newer) {
+    if (older instanceof SparkHoodieRecord) {
+      return precombineSpark((SparkHoodieRecord) older, (SparkHoodieRecord) newer);
+    }
+    // dispatch to the respective engine-specific implementation (Flink, etc)
+    throw new UnsupportedOperationException();
+  }
+
+   /**
+    * Spark-specific implementation 
+    */
+  SparkHoodieRecord precombineSpark(SparkHoodieRecord older, SparkHoodieRecord newer);
+  
+  // ...
+}
+```
+Here, users can provide their own subclass implementing such an interface for the engines of interest.
+
+#### Migration from `HoodieRecordPayload` to `HoodieRecordCombiningEngine`
+
+To warrant backward compatibility (BWC) on the code level with existing subclasses of `HoodieRecordPayload` 
+already used in production by Hudi users, we will provide a BWC-bridge in the form of an instance of `HoodieRecordCombiningEngine` that will 
+use the user-defined subclass of `HoodieRecordPayload` to combine the records.
+
+Leveraging such a bridge will provide for a seamless BWC migration to the 0.11 release; however, it forfeits the performance 
+benefit of this refactoring, since it would unavoidably have to perform a conversion to the intermediate representation (Avro). To realize
+the full suite of benefits of this refactoring, users will have to migrate their merging logic out of the `HoodieRecordPayload` subclass and into
+a new `HoodieRecordCombiningEngine` implementation.
+
+### Refactoring Flows Directly Interacting w/ Records:
+
+As was called out prior, to achieve the goal of being able to sustain engine-internal representations being held by the `HoodieRecord` 
+class w/o compromising major components' neutrality (i.e. being engine-agnostic), the components directly interacting w/
+records' payloads today will have to be refactored to instead interact w/ the standardized `HoodieRecord` API.
+
+The following major components will be refactored:
+
+1. `WriteHandle`s will be  
+   1. Accepting `HoodieRecord` instead of a raw Avro payload (avoiding Avro conversion)
+   2. Using the Combining API engine to merge records (when necessary) 
+   3. Passing `HoodieRecord` as is to `FileWriter`
+2. `FileWriter`s will be 
+   1. Accepting `HoodieRecord`
+   2. Engine-specific (so that they're able to handle the internal record representation)
+3. `RecordReader`s 
+   1. API will be returning an opaque `HoodieRecord` instead of a raw Avro payload

Review comment:
       ok would you fix the names so it reflects the exact class names?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org