Posted to commits@hudi.apache.org by "codope (via GitHub)" <gi...@apache.org> on 2023/02/14 12:31:10 UTC

[GitHub] [hudi] codope opened a new pull request, #7942: [HUDI-5753] Add docs for record payload

codope opened a new pull request, #7942:
URL: https://github.com/apache/hudi/pull/7942

   ### Change Logs
   
   Documentation about record payload under `Concepts` section.
   
   ### Impact
   
   Only docs.
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   Adds a new page under `Concepts` section of Hudi website.
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] codope commented on a diff in pull request #7942: [HUDI-5753] Add docs for record payload

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on code in PR #7942:
URL: https://github.com/apache/hudi/pull/7942#discussion_r1106755374


##########
website/docs/record_payload.md:
##########
@@ -0,0 +1,97 @@
+---
+title: Record Payload 
+keywords: [hudi, merge, upsert, precombine]
+---
+
+## Record Payload
+
+One of the core features of Hudi is the ability to incrementally upsert data, deduplicate and merge records on the fly.
+Additionally, users can implement their custom logic to merge the input records with the record on storage. Record
+payload is an abstract representation of a Hudi record that allows the aforementioned capability. As we shall see below,
+Hudi provides out-of-box support for different payloads for different use cases, and a new record merger API for
+optimized payload handling. But, first let us understand how record payload is used in the Hudi upsert path.
+
+<figure>
+    <img className="docimage" src={require("/assets/images/upsert_path.png").default} alt="upsert_path.png" />
+</figure>
+
+Figure above shows the main stages that records go through while being written to the Hudi table. In the precombining
+stage, Hudi performs any deduplication based on the payload implementation and precombine key configured by the user.
+Further, on index lookup, Hudi identifies which records are being updated and the record payload implementation tells
+Hudi how to merge the incoming record with the existing record on storage.
+
+### Existing Payloads
+
+#### OverwriteWithLatestAvroPayload
+
+This is the default record payload implementation. It picks the record with the greatest value (determined by calling
+.compareTo() on the value of precombine key) to break ties and simply picks the latest record while merging. This gives
+latest-write-wins style semantics.
+
+#### EventTimeAvroPayload
+
+Some use cases require merging records by event time and thus event time plays the role of an ordering field. This

Review Comment:
   Not much. `DefaultHoodieRecordPayload` maintains some additional metadata to track latency and freshness. I wanted to write about `DefaultHoodieRecordPayload`, but its name belies the actual default, so I left it out to avoid confusion. Perhaps we should make it the actual default. Is that covered in the config simplification story? cc @bhasudha 
   Let me just add some notes about `DefaultHoodieRecordPayload`.





[GitHub] [hudi] codope commented on pull request #7942: [HUDI-5753] Add docs for record payload

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on PR #7942:
URL: https://github.com/apache/hudi/pull/7942#issuecomment-1495277107

   > @codope is it possible for you to provide an example of extending the payload for a customized option? Also, are there configs provided out-of-the-box that the user should consider? If possible, can you specify all of them inline with the right class?
   
   @nfarah86 I have added a link to the FAQ, which has more details on how to implement a custom payload. I have also removed the record merger API; I need to follow up with a separate doc or update this doc in a separate PR.
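
For readers wondering which configs go with a payload class, here is a minimal sketch of writer options built as a plain Python dictionary. The option keys are the ones I understand Hudi to expose for the payload class, the precombine field, the payload ordering field, and the payload event time field; please verify them against the configurations page for your Hudi version. The helper `payload_options` and the field name `ts` are hypothetical and used only for illustration.

```python
from typing import Dict, Optional


def payload_options(payload_class: str,
                    ordering_field: str,
                    event_time_field: Optional[str] = None) -> Dict[str, str]:
    """Build writer options for a payload class (hypothetical helper, not a Hudi API)."""
    opts = {
        # Fully qualified class name of the payload implementation.
        "hoodie.datasource.write.payload.class": payload_class,
        # Field used to deduplicate records within the incoming batch.
        "hoodie.datasource.write.precombine.field": ordering_field,
        # Field used by ordering-aware payloads when merging with storage.
        "hoodie.payload.ordering.field": ordering_field,
    }
    if event_time_field is not None:
        # Only set when the payload should track an event time field.
        opts["hoodie.payload.event.time.field"] = event_time_field
    return opts


# Assumed example: an event-time payload ordered by a hypothetical `ts` column.
event_time_opts = payload_options(
    "org.apache.hudi.common.model.EventTimeAvroPayload",
    "ts",
    event_time_field="ts",
)
```

With PySpark, a dictionary like this could be passed to a DataFrame writer through `.options(**event_time_opts)`; the table name, record key, and other required options are left out of this sketch.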




[GitHub] [hudi] nsivabalan commented on a diff in pull request #7942: [HUDI-5753] Add docs for record payload

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on code in PR #7942:
URL: https://github.com/apache/hudi/pull/7942#discussion_r1162020746


##########
website/docs/record_payload.md:
##########
@@ -0,0 +1,97 @@
+---
+title: Record Payload 
+keywords: [hudi, merge, upsert, precombine]
+---
+
+## Record Payload
+
+One of the core features of Hudi is the ability to incrementally upsert data, deduplicate and merge records on the fly.
+Additionally, users can implement their custom logic to merge the input records with the record on storage. Record
+payload is an abstract representation of a Hudi record that allows the aforementioned capability. As we shall see below,
+Hudi provides out-of-box support for different payloads for different use cases, and a new record merger API for
+optimized payload handling. But, first let us understand how record payload is used in the Hudi upsert path.
+
+<figure>
+    <img className="docimage" src={require("/assets/images/upsert_path.png").default} alt="upsert_path.png" />
+</figure>
+
+Figure above shows the main stages that records go through while being written to the Hudi table. In the precombining
+stage, Hudi performs any deduplication based on the payload implementation and precombine key configured by the user.
+Further, on index lookup, Hudi identifies which records are being updated and the record payload implementation tells
+Hudi how to merge the incoming record with the existing record on storage.
+
+### Existing Payloads
+
+#### OverwriteWithLatestAvroPayload
+
+This is the default record payload implementation. It picks the record with the greatest value (determined by calling
+.compareTo() on the value of precombine key) to break ties and simply picks the latest record while merging. This gives
+latest-write-wins style semantics.
+
+#### EventTimeAvroPayload
+
+Some use cases require merging records by event time and thus event time plays the role of an ordering field. This
+payload is particularly useful in the case of late-arriving data. For such use cases, users need to set
+the [payload event time field](/docs/configurations#RECORD_PAYLOAD) configuration.
+
+#### ExpressionPayload
+
+This payload is very useful when you want to merge or delete records based on some conditional expression, especially
+when updating records using [`MERGE INTO`](/docs/quick-start-guide#mergeinto) statement.
+
+#### Payload to support partial update
+
+Typically, once the merge step resolves which record to pick, then the record on storage is fully replaced by the
+resolved record. But, in some cases, the requirement is to update only certain fields and not replace the whole record.
+This is called partial update.
+`PartialUpdateAvroPayload` in Hudi provides out-box-support for such use cases. To illustrate the point, let us look at
+a simple example:
+
+Let's say the order field is `ts` and schema is :
+
+```
+{
+  [
+    {"name":"id","type":"string"},
+    {"name":"ts","type":"long"},
+    {"name":"name","type":"string"},
+    {"name":"price","type":"string"}
+  ]
+}
+```
+
+Current record in storage:
+
+```
+    id      ts      name    price
+    1       2       name_1  null
+```
+
+Incoming record:
+
+```
+    id      ts      name    price
+    1       1       null    price_1
+```
+
+Result data after merging using `PartialUpdateAvroPayload`:
+
+```
+    id      ts      name    price
+    1       2       name_1  price_1

Review Comment:
   Oh, I see. If the two records had come in the proper order, the final snapshot would have been 
   1 2 name_1 price_1
   
   I get it now. 
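
The order-independence noted here can be illustrated with a small sketch. This is a simplified model of the example in the diff, not Hudi's `PartialUpdateAvroPayload` code: the record with the larger ordering value is taken as the base, and any field that is null in the base is filled from the other record. The function name `partial_update_merge` is hypothetical, and letting the incoming record win on equal ordering values is an assumption.

```python
from typing import Any, Dict, Optional

Record = Dict[str, Optional[Any]]


def partial_update_merge(stored: Record, incoming: Record,
                         ordering_field: str = "ts") -> Record:
    """Merge two records field by field (simplified model, not Hudi's implementation)."""
    # Pick the record with the larger ordering value as the base.
    # Assumption: the incoming record wins when the values are equal.
    if incoming[ordering_field] >= stored[ordering_field]:
        newer, older = incoming, stored
    else:
        newer, older = stored, incoming
    # Keep the base value unless it is null; then fall back to the other record.
    return {
        field: value if value is not None else older.get(field)
        for field, value in newer.items()
    }


stored = {"id": "1", "ts": 2, "name": "name_1", "price": None}
incoming = {"id": "1", "ts": 1, "name": None, "price": "price_1"}

late_arrival = partial_update_merge(stored, incoming)   # ts=1 arrives after ts=2
in_order = partial_update_merge(incoming, stored)       # ts=1 stored first, ts=2 arrives
```

Under this model both call orders produce the same row, which is why `ts` stays at 2 while `price` is still picked up from the late record.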
   



##########
website/docs/record_payload.md:
##########
@@ -0,0 +1,97 @@
+---
+title: Record Payload 
+keywords: [hudi, merge, upsert, precombine]
+---
+
+## Record Payload
+
+One of the core features of Hudi is the ability to incrementally upsert data, deduplicate and merge records on the fly.
+Additionally, users can implement their custom logic to merge the input records with the record on storage. Record
+payload is an abstract representation of a Hudi record that allows the aforementioned capability. As we shall see below,
+Hudi provides out-of-box support for different payloads for different use cases, and a new record merger API for
+optimized payload handling. But, first let us understand how record payload is used in the Hudi upsert path.
+
+<figure>
+    <img className="docimage" src={require("/assets/images/upsert_path.png").default} alt="upsert_path.png" />
+</figure>
+
+Figure above shows the main stages that records go through while being written to the Hudi table. In the precombining
+stage, Hudi performs any deduplication based on the payload implementation and precombine key configured by the user.
+Further, on index lookup, Hudi identifies which records are being updated and the record payload implementation tells
+Hudi how to merge the incoming record with the existing record on storage.
+
+### Existing Payloads
+
+#### OverwriteWithLatestAvroPayload
+
+This is the default record payload implementation. It picks the record with the greatest value (determined by calling
+.compareTo() on the value of precombine key) to break ties and simply picks the latest record while merging. This gives
+latest-write-wins style semantics.
+
+#### EventTimeAvroPayload
+
+Some use cases require merging records by event time and thus event time plays the role of an ordering field. This
+payload is particularly useful in the case of late-arriving data. For such use cases, users need to set
+the [payload event time field](/docs/configurations#RECORD_PAYLOAD) configuration.
+
+#### ExpressionPayload
+
+This payload is very useful when you want to merge or delete records based on some conditional expression, especially
+when updating records using [`MERGE INTO`](/docs/quick-start-guide#mergeinto) statement.
+
+#### Payload to support partial update
+
+Typically, once the merge step resolves which record to pick, then the record on storage is fully replaced by the
+resolved record. But, in some cases, the requirement is to update only certain fields and not replace the whole record.
+This is called partial update.
+`PartialUpdateAvroPayload` in Hudi provides out-box-support for such use cases. To illustrate the point, let us look at
+a simple example:
+
+Let's say the order field is `ts` and schema is :
+
+```
+{
+  [
+    {"name":"id","type":"string"},
+    {"name":"ts","type":"long"},
+    {"name":"name","type":"string"},
+    {"name":"price","type":"string"}
+  ]
+}
+```
+
+Current record in storage:
+
+```
+    id      ts      name    price
+    1       2       name_1  null
+```
+
+Incoming record:
+
+```
+    id      ts      name    price
+    1       1       null    price_1
+```
+
+Result data after merging using `PartialUpdateAvroPayload`:
+
+```
+    id      ts      name    price
+    1       2       name_1  price_1

Review Comment:
   This is a bit confusing, actually. 
   Based on the ordering field, we choose the record in storage. So shouldn't we ignore the new record entirely? 
   





[GitHub] [hudi] nfarah86 commented on pull request #7942: [HUDI-5753] Add docs for record payload

Posted by "nfarah86 (via GitHub)" <gi...@apache.org>.
nfarah86 commented on PR #7942:
URL: https://github.com/apache/hudi/pull/7942#issuecomment-1475435581

   @codope is it possible for you to provide an example of extending the payload for a customized option? Also, are there configs provided out-of-the-box that the user should consider? If possible, can you specify all of them inline with the right class?




[GitHub] [hudi] nfarah86 commented on pull request #7942: [HUDI-5753] Add docs for record payload

Posted by "nfarah86 (via GitHub)" <gi...@apache.org>.
nfarah86 commented on PR #7942:
URL: https://github.com/apache/hudi/pull/7942#issuecomment-1475418303

   Following up: I need to use this doc for the write operations, so that I can reference the payloads under the UPSERT operation. 




[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #7942: [HUDI-5753] Add docs for record payload

Posted by "alexeykudinkin (via GitHub)" <gi...@apache.org>.
alexeykudinkin commented on code in PR #7942:
URL: https://github.com/apache/hudi/pull/7942#discussion_r1108960751


##########
website/docs/record_payload.md:
##########
@@ -0,0 +1,97 @@
+---
+title: Record Payload 
+keywords: [hudi, merge, upsert, precombine]
+---
+
+## Record Payload
+
+One of the core features of Hudi is the ability to incrementally upsert data, deduplicate and merge records on the fly.
+Additionally, users can implement their custom logic to merge the input records with the record on storage. Record
+payload is an abstract representation of a Hudi record that allows the aforementioned capability. As we shall see below,
+Hudi provides out-of-box support for different payloads for different use cases, and a new record merger API for
+optimized payload handling. But, first let us understand how record payload is used in the Hudi upsert path.
+
+<figure>
+    <img className="docimage" src={require("/assets/images/upsert_path.png").default} alt="upsert_path.png" />
+</figure>
+
+Figure above shows the main stages that records go through while being written to the Hudi table. In the precombining
+stage, Hudi performs any deduplication based on the payload implementation and precombine key configured by the user.
+Further, on index lookup, Hudi identifies which records are being updated and the record payload implementation tells
+Hudi how to merge the incoming record with the existing record on storage.
+
+### Existing Payloads
+
+#### OverwriteWithLatestAvroPayload
+
+This is the default record payload implementation. It picks the record with the greatest value (determined by calling
+.compareTo() on the value of precombine key) to break ties and simply picks the latest record while merging. This gives
+latest-write-wins style semantics.
+
+#### EventTimeAvroPayload
+
+Some use cases require merging records by event time and thus event time plays the role of an ordering field. This
+payload is particularly useful in the case of late-arriving data. For such use cases, users need to set
+the [payload event time field](/docs/configurations#RECORD_PAYLOAD) configuration.
+
+#### ExpressionPayload
+
+This payload is very useful when you want to merge or delete records based on some conditional expression, especially

Review Comment:
   Correct, this is only supposed to be used internally (by `MERGE INTO`).





[GitHub] [hudi] nsivabalan merged pull request #7942: [HUDI-5753] Add docs for record payload

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan merged PR #7942:
URL: https://github.com/apache/hudi/pull/7942




[GitHub] [hudi] nsivabalan commented on a diff in pull request #7942: [HUDI-5753] Add docs for record payload

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on code in PR #7942:
URL: https://github.com/apache/hudi/pull/7942#discussion_r1106035940


##########
website/docs/record_payload.md:
##########
@@ -0,0 +1,97 @@
+---
+title: Record Payload 
+keywords: [hudi, merge, upsert, precombine]
+---
+
+## Record Payload
+
+One of the core features of Hudi is the ability to incrementally upsert data, deduplicate and merge records on the fly.
+Additionally, users can implement their custom logic to merge the input records with the record on storage. Record
+payload is an abstract representation of a Hudi record that allows the aforementioned capability. As we shall see below,
+Hudi provides out-of-box support for different payloads for different use cases, and a new record merger API for
+optimized payload handling. But, first let us understand how record payload is used in the Hudi upsert path.
+
+<figure>
+    <img className="docimage" src={require("/assets/images/upsert_path.png").default} alt="upsert_path.png" />
+</figure>
+
+Figure above shows the main stages that records go through while being written to the Hudi table. In the precombining
+stage, Hudi performs any deduplication based on the payload implementation and precombine key configured by the user.
+Further, on index lookup, Hudi identifies which records are being updated and the record payload implementation tells
+Hudi how to merge the incoming record with the existing record on storage.
+
+### Existing Payloads
+
+#### OverwriteWithLatestAvroPayload
+
+This is the default record payload implementation. It picks the record with the greatest value (determined by calling
+.compareTo() on the value of precombine key) to break ties and simply picks the latest record while merging. This gives

Review Comment:
   Actually, there is a little more to this. Let's land this doc for 0.13.0, but address these comments as an immediate follow-up. 
   We have the `preCombine` and `combineAndGetUpdateValue` methods, which are used on different occasions, 
   so calling out just `preCombine` may not be right: when merging with what is in storage, this payload (`OverwriteWithLatestAvroPayload`) specifically ignores the precombine value.
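
The distinction raised in this comment can be sketched as follows. This is a minimal Python model of the two code paths as described here, not the Java class itself: `OverwriteWithLatestSketch` and its method names are hypothetical stand-ins for `OverwriteWithLatestAvroPayload`, its `preCombine` method, and its merge-with-storage method.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class OverwriteWithLatestSketch:
    """Hypothetical model of latest-write-wins payload semantics."""
    record: Dict[str, Any]
    ordering_val: Any

    def pre_combine(self, other: "OverwriteWithLatestSketch") -> "OverwriteWithLatestSketch":
        # Deduplication within the incoming batch: keep `other` only when its
        # ordering value is strictly greater, otherwise keep this payload.
        return other if other.ordering_val > self.ordering_val else self

    def combine_and_get_update_value(self, stored_record: Dict[str, Any]) -> Dict[str, Any]:
        # Merge with the record on storage: the stored record and its
        # ordering value are ignored, so the incoming record always wins.
        return self.record


newer = OverwriteWithLatestSketch({"id": 1, "ts": 5, "v": "new"}, ordering_val=5)
older = OverwriteWithLatestSketch({"id": 1, "ts": 3, "v": "old"}, ordering_val=3)

deduplicated = older.pre_combine(newer)   # the batch keeps the ts=5 payload
on_storage = {"id": 1, "ts": 9, "v": "stored"}
merged = older.combine_and_get_update_value(on_storage)   # ts=3 still overwrites ts=9
```

In this model the ordering value matters only inside a batch. Once a payload reaches the merge step it replaces the stored record even if the stored record has a larger ordering value, which is the point the comment asks the doc to call out.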



##########
website/docs/record_payload.md:
##########
@@ -0,0 +1,97 @@
+---
+title: Record Payload 
+keywords: [hudi, merge, upsert, precombine]
+---
+
+## Record Payload
+
+One of the core features of Hudi is the ability to incrementally upsert data, deduplicate and merge records on the fly.
+Additionally, users can implement their custom logic to merge the input records with the record on storage. Record
+payload is an abstract representation of a Hudi record that allows the aforementioned capability. As we shall see below,
+Hudi provides out-of-box support for different payloads for different use cases, and a new record merger API for
+optimized payload handling. But, first let us understand how record payload is used in the Hudi upsert path.
+
+<figure>
+    <img className="docimage" src={require("/assets/images/upsert_path.png").default} alt="upsert_path.png" />
+</figure>
+
+Figure above shows the main stages that records go through while being written to the Hudi table. In the precombining
+stage, Hudi performs any deduplication based on the payload implementation and precombine key configured by the user.
+Further, on index lookup, Hudi identifies which records are being updated and the record payload implementation tells
+Hudi how to merge the incoming record with the existing record on storage.
+
+### Existing Payloads
+
+#### OverwriteWithLatestAvroPayload
+
+This is the default record payload implementation. It picks the record with the greatest value (determined by calling
+.compareTo() on the value of precombine key) to break ties and simply picks the latest record while merging. This gives
+latest-write-wins style semantics.
+
+#### EventTimeAvroPayload
+
+Some use cases require merging records by event time and thus event time plays the role of an ordering field. This
+payload is particularly useful in the case of late-arriving data. For such use cases, users need to set
+the [payload event time field](/docs/configurations#RECORD_PAYLOAD) configuration.
+
+#### ExpressionPayload
+
+This payload is very useful when you want to merge or delete records based on some conditional expression, especially

Review Comment:
   Should we remove this from the list? I thought it's meant to be used only internally. Can anyone directly set the expression payload for their table? 



##########
website/docs/record_payload.md:
##########
@@ -0,0 +1,97 @@
+---
+title: Record Payload 
+keywords: [hudi, merge, upsert, precombine]
+---
+
+## Record Payload
+
+One of the core features of Hudi is the ability to incrementally upsert data, deduplicate and merge records on the fly.
+Additionally, users can implement their custom logic to merge the input records with the record on storage. Record
+payload is an abstract representation of a Hudi record that allows the aforementioned capability. As we shall see below,
+Hudi provides out-of-box support for different payloads for different use cases, and a new record merger API for
+optimized payload handling. But, first let us understand how record payload is used in the Hudi upsert path.
+
+<figure>
+    <img className="docimage" src={require("/assets/images/upsert_path.png").default} alt="upsert_path.png" />
+</figure>
+
+Figure above shows the main stages that records go through while being written to the Hudi table. In the precombining
+stage, Hudi performs any deduplication based on the payload implementation and precombine key configured by the user.
+Further, on index lookup, Hudi identifies which records are being updated and the record payload implementation tells
+Hudi how to merge the incoming record with the existing record on storage.
+
+### Existing Payloads
+
+#### OverwriteWithLatestAvroPayload
+
+This is the default record payload implementation. It picks the record with the greatest value (determined by calling
+.compareTo() on the value of precombine key) to break ties and simply picks the latest record while merging. This gives
+latest-write-wins style semantics.
+
+#### EventTimeAvroPayload
+
+Some use cases require merging records by event time and thus event time plays the role of an ordering field. This
+payload is particularly useful in the case of late-arriving data. For such use cases, users need to set
+the [payload event time field](/docs/configurations#RECORD_PAYLOAD) configuration.
+
+#### ExpressionPayload
+
+This payload is very useful when you want to merge or delete records based on some conditional expression, especially
+when updating records using [`MERGE INTO`](/docs/quick-start-guide#mergeinto) statement.
+
+#### Payload to support partial update
+
+Typically, once the merge step resolves which record to pick, then the record on storage is fully replaced by the
+resolved record. But, in some cases, the requirement is to update only certain fields and not replace the whole record.
+This is called partial update.
+`PartialUpdateAvroPayload` in Hudi provides out-box-support for such use cases. To illustrate the point, let us look at
+a simple example:
+
+Let's say the order field is `ts` and schema is :
+
+```
+{
+  [
+    {"name":"id","type":"string"},
+    {"name":"ts","type":"long"},
+    {"name":"name","type":"string"},
+    {"name":"price","type":"string"}
+  ]
+}
+```
+
+Current record in storage:
+
+```
+    id      ts      name    price
+    1       2       name_1  null
+```
+
+Incoming record:
+
+```
+    id      ts      name    price
+    1       1       null    price_1
+```
+
+Result data after merging using `PartialUpdateAvroPayload`:
+
+```
+    id      ts      name    price
+    1       2       name_1  price_1

Review Comment:
   How is the value of `ts` 2? It's not intuitive to me. I thought that, for null values in the new incoming record, we would choose the older value. 



##########
website/docs/record_payload.md:
##########
@@ -0,0 +1,97 @@
+---
+title: Record Payload 
+keywords: [hudi, merge, upsert, precombine]
+---
+
+## Record Payload
+
+One of the core features of Hudi is the ability to incrementally upsert data, deduplicate and merge records on the fly.
+Additionally, users can implement their custom logic to merge the input records with the record on storage. Record
+payload is an abstract representation of a Hudi record that allows the aforementioned capability. As we shall see below,
+Hudi provides out-of-box support for different payloads for different use cases, and a new record merger API for
+optimized payload handling. But, first let us understand how record payload is used in the Hudi upsert path.
+
+<figure>
+    <img className="docimage" src={require("/assets/images/upsert_path.png").default} alt="upsert_path.png" />
+</figure>
+
+Figure above shows the main stages that records go through while being written to the Hudi table. In the precombining
+stage, Hudi performs any deduplication based on the payload implementation and precombine key configured by the user.
+Further, on index lookup, Hudi identifies which records are being updated and the record payload implementation tells
+Hudi how to merge the incoming record with the existing record on storage.
+
+### Existing Payloads
+
+#### OverwriteWithLatestAvroPayload
+
+This is the default record payload implementation. It picks the record with the greatest value (determined by calling
+.compareTo() on the value of precombine key) to break ties and simply picks the latest record while merging. This gives
+latest-write-wins style semantics.
+
+#### EventTimeAvroPayload
+
+Some use cases require merging records by event time and thus event time plays the role of an ordering field. This
+payload is particularly useful in the case of late-arriving data. For such use cases, users need to set
+the [payload event time field](/docs/configurations#RECORD_PAYLOAD) configuration.
+
+#### ExpressionPayload
+
+This payload is very useful when you want to merge or delete records based on some conditional expression, especially
+when updating records using [`MERGE INTO`](/docs/quick-start-guide#mergeinto) statement.
+
+#### Payload to support partial update
+
+Typically, once the merge step resolves which record to pick, then the record on storage is fully replaced by the
+resolved record. But, in some cases, the requirement is to update only certain fields and not replace the whole record.
+This is called partial update.
+`PartialUpdateAvroPayload` in Hudi provides out-box-support for such use cases. To illustrate the point, let us look at
+a simple example:
+
+Let's say the order field is `ts` and schema is :
+
+```
+{
+  [
+    {"name":"id","type":"string"},
+    {"name":"ts","type":"long"},
+    {"name":"name","type":"string"},
+    {"name":"price","type":"string"}
+  ]
+}
+```
+
+Current record in storage:
+
+```
+    id      ts      name    price
+    1       2       name_1  null
+```
+
+Incoming record:
+
+```
+    id      ts      name    price
+    1       1       null    price_1
+```
+
+Result data after merging using `PartialUpdateAvroPayload`:
+
+```
+    id      ts      name    price
+    1       2       name_1  price_1
+```
+
+There are quite a few other implementations provided by Hudi. For example, `MySqlDebeziumAvroPayload` and
+`PostgresDebeziumAvroPayload` provides support for seamlessly applying changes captured via Debezium for MySQL and
+PostgresDB.
+`AWSDmsAvroPayload` provides support for applying changes captured via Amazon Database Migration Service onto S3.

Review Comment:
   Please also mention `OverwriteNonDefaultsWithLatestAvroPayload` and `DefaultHoodieRecordPayload`.



##########
website/docs/record_payload.md:
##########
@@ -0,0 +1,97 @@
+---
+title: Record Payload 
+keywords: [hudi, merge, upsert, precombine]
+---
+
+## Record Payload
+
+One of the core features of Hudi is the ability to incrementally upsert data, deduplicate and merge records on the fly.
+Additionally, users can implement their custom logic to merge the input records with the record on storage. Record
+payload is an abstract representation of a Hudi record that allows the aforementioned capability. As we shall see below,
+Hudi provides out-of-box support for different payloads for different use cases, and a new record merger API for
+optimized payload handling. But, first let us understand how record payload is used in the Hudi upsert path.
+
+<figure>
+    <img className="docimage" src={require("/assets/images/upsert_path.png").default} alt="upsert_path.png" />
+</figure>
+
+Figure above shows the main stages that records go through while being written to the Hudi table. In the precombining
+stage, Hudi performs any deduplication based on the payload implementation and precombine key configured by the user.
+Further, on index lookup, Hudi identifies which records are being updated and the record payload implementation tells
+Hudi how to merge the incoming record with the existing record on storage.
+
+### Existing Payloads
+
+#### OverwriteWithLatestAvroPayload
+
+This is the default record payload implementation. It picks the record with the greatest value (determined by calling
+.compareTo() on the value of precombine key) to break ties and simply picks the latest record while merging. This gives
+latest-write-wins style semantics.
+
+#### EventTimeAvroPayload
+
+Some use cases require merging records by event time and thus event time plays the role of an ordering field. This

Review Comment:
   Curious to know: how is this different from using `DefaultHoodieRecordPayload`, where we use the event time as the payload ordering field? 





[GitHub] [hudi] codope commented on a diff in pull request #7942: [HUDI-5753] Add docs for record payload

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on code in PR #7942:
URL: https://github.com/apache/hudi/pull/7942#discussion_r1106781029


##########
website/docs/record_payload.md:
##########
@@ -0,0 +1,97 @@
+---
+title: Record Payload 
+keywords: [hudi, merge, upsert, precombine]
+---
+
+## Record Payload
+
+One of the core features of Hudi is the ability to incrementally upsert data, deduplicate and merge records on the fly.
+Additionally, users can implement their custom logic to merge the input records with the record on storage. Record
+payload is an abstract representation of a Hudi record that allows the aforementioned capability. As we shall see below,
+Hudi provides out-of-box support for different payloads for different use cases, and a new record merger API for
+optimized payload handling. But, first let us understand how record payload is used in the Hudi upsert path.
+
+<figure>
+    <img className="docimage" src={require("/assets/images/upsert_path.png").default} alt="upsert_path.png" />
+</figure>
+
+The figure above shows the main stages that records go through while being written to the Hudi table. In the precombining
+stage, Hudi performs any deduplication based on the payload implementation and precombine key configured by the user.
+Further, on index lookup, Hudi identifies which records are being updated and the record payload implementation tells
+Hudi how to merge the incoming record with the existing record on storage.
+
+### Existing Payloads
+
+#### OverwriteWithLatestAvroPayload
+
+This is the default record payload implementation. It picks the record with the greatest value (determined by calling
+.compareTo() on the value of precombine key) to break ties and simply picks the latest record while merging. This gives
+latest-write-wins style semantics.
+
+#### EventTimeAvroPayload
+
+Some use cases require merging records by event time and thus event time plays the role of an ordering field. This
+payload is particularly useful in the case of late-arriving data. For such use cases, users need to set
+the [payload event time field](/docs/configurations#RECORD_PAYLOAD) configuration.
+
+#### ExpressionPayload
+
+This payload is very useful when you want to merge or delete records based on some conditional expression, especially

Review Comment:
   Didn't know that this is meant to be used internally. Is there a guard like that on payload class config? cc @alexeykudinkin 





[GitHub] [hudi] codope commented on a diff in pull request #7942: [HUDI-5753] Add docs for record payload

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on code in PR #7942:
URL: https://github.com/apache/hudi/pull/7942#discussion_r1106781711


##########
website/docs/record_payload.md:
##########
@@ -0,0 +1,97 @@
+---
+title: Record Payload 
+keywords: [hudi, merge, upsert, precombine]
+---
+
+## Record Payload
+
+One of the core features of Hudi is the ability to incrementally upsert data, deduplicate and merge records on the fly.
+Additionally, users can implement their custom logic to merge the input records with the record on storage. Record
+payload is an abstract representation of a Hudi record that allows the aforementioned capability. As we shall see below,
+Hudi provides out-of-box support for different payloads for different use cases, and a new record merger API for
+optimized payload handling. But, first let us understand how record payload is used in the Hudi upsert path.
+
+<figure>
+    <img className="docimage" src={require("/assets/images/upsert_path.png").default} alt="upsert_path.png" />
+</figure>
+
+The figure above shows the main stages that records go through while being written to the Hudi table. In the precombining
+stage, Hudi performs any deduplication based on the payload implementation and precombine key configured by the user.
+Further, on index lookup, Hudi identifies which records are being updated and the record payload implementation tells
+Hudi how to merge the incoming record with the existing record on storage.
+
+### Existing Payloads
+
+#### OverwriteWithLatestAvroPayload
+
+This is the default record payload implementation. It picks the record with the greatest value (determined by calling
+.compareTo() on the value of precombine key) to break ties and simply picks the latest record while merging. This gives

Review Comment:
   That's true. But, I wanted to keep things simple for the user as it is a concepts doc. Towards the end, I have pointed to the FAQ which has more details.
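
   As a rough illustration of the simplified semantics in the doc text quoted above, here is a minimal Python sketch. It is not Hudi's implementation: the function names `precombine` and `merge_with_storage`, the dict-based records, and the tie-breaking rule are assumptions made purely for illustration.

```python
def precombine(a, b, ordering_field="ts"):
    """Deduplicate two incoming records that share a key.

    Keeps the record with the greater ordering value.
    Assumption: when the ordering values are equal, the second record wins.
    """
    return a if a[ordering_field] > b[ordering_field] else b


def merge_with_storage(stored, incoming):
    """Latest-write-wins merge: the incoming record replaces the stored one.

    Note that the ordering value is not consulted at this step in this sketch.
    """
    return incoming


batch_winner = precombine({"id": "1", "ts": 1, "v": "a"},
                          {"id": "1", "ts": 3, "v": "b"})
result = merge_with_storage({"id": "1", "ts": 5, "v": "old"}, batch_winner)
```

   In this sketch the stored record is overwritten even though its `ts` is higher, which is the latest-write-wins behaviour the section describes.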





[GitHub] [hudi] codope commented on a diff in pull request #7942: [HUDI-5753] Add docs for record payload

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on code in PR #7942:
URL: https://github.com/apache/hudi/pull/7942#discussion_r1106780501


##########
website/docs/record_payload.md:
##########
@@ -0,0 +1,97 @@
+---
+title: Record Payload 
+keywords: [hudi, merge, upsert, precombine]
+---
+
+## Record Payload
+
+One of the core features of Hudi is the ability to incrementally upsert data, deduplicate and merge records on the fly.
+Additionally, users can implement their custom logic to merge the input records with the record on storage. Record
+payload is an abstract representation of a Hudi record that allows the aforementioned capability. As we shall see below,
+Hudi provides out-of-box support for different payloads for different use cases, and a new record merger API for
+optimized payload handling. But, first let us understand how record payload is used in the Hudi upsert path.
+
+<figure>
+    <img className="docimage" src={require("/assets/images/upsert_path.png").default} alt="upsert_path.png" />
+</figure>
+
+The figure above shows the main stages that records go through while being written to the Hudi table. In the precombining
+stage, Hudi performs any deduplication based on the payload implementation and precombine key configured by the user.
+Further, on index lookup, Hudi identifies which records are being updated and the record payload implementation tells
+Hudi how to merge the incoming record with the existing record on storage.
+
+### Existing Payloads
+
+#### OverwriteWithLatestAvroPayload
+
+This is the default record payload implementation. It picks the record with the greatest value (determined by calling
+.compareTo() on the value of precombine key) to break ties and simply picks the latest record while merging. This gives
+latest-write-wins style semantics.
+
+#### EventTimeAvroPayload
+
+Some use cases require merging records by event time and thus event time plays the role of an ordering field. This
+payload is particularly useful in the case of late-arriving data. For such use cases, users need to set
+the [payload event time field](/docs/configurations#RECORD_PAYLOAD) configuration.
+
+#### ExpressionPayload
+
+This payload is very useful when you want to merge or delete records based on some conditional expression, especially
+when updating records using [`MERGE INTO`](/docs/quick-start-guide#mergeinto) statement.
+
+#### Payload to support partial update
+
+Typically, once the merge step resolves which record to pick, the record on storage is fully replaced by the
+resolved record. In some cases, however, the requirement is to update only certain fields and not replace the whole
+record. This is called a partial update.
+`PartialUpdateAvroPayload` in Hudi provides out-of-the-box support for such use cases. To illustrate the point, let us
+look at a simple example:
+
+Let's say the ordering field is `ts` and the schema is:
+
+```
+{
+  "fields": [
+    {"name":"id","type":"string"},
+    {"name":"ts","type":"long"},
+    {"name":"name","type":"string"},
+    {"name":"price","type":"string"}
+  ]
+}
+```
+
+Current record in storage:
+
+```
+    id      ts      name    price
+    1       2       name_1  null
+```
+
+Incoming record:
+
+```
+    id      ts      name    price
+    1       1       null    price_1
+```
+
+Result data after merging using `PartialUpdateAvroPayload`:
+
+```
+    id      ts      name    price
+    1       2       name_1  price_1

Review Comment:
   `ts` is the ordering field, so the record with the higher value is picked. The null value for the `name` column in the incoming record is indeed replaced by the value from the existing record.
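
   To make the example concrete, here is a minimal Python sketch of the merge described above. It is an illustration only, not the `PartialUpdateAvroPayload` code: the function name `partial_update_merge`, the dict-based records, and the handling of equal ordering values are assumptions.

```python
def partial_update_merge(stored, incoming, ordering_field="ts"):
    """Field-level merge of two records with the same key.

    The record with the higher ordering value is used as the base, and each
    of its null (None) fields is filled from the other record.
    Assumption: on equal ordering values the incoming record is the base.
    """
    if incoming[ordering_field] >= stored[ordering_field]:
        base, other = incoming, stored
    else:
        base, other = stored, incoming
    return {
        field: value if value is not None else other.get(field)
        for field, value in base.items()
    }


stored = {"id": "1", "ts": 2, "name": "name_1", "price": None}
incoming = {"id": "1", "ts": 1, "name": None, "price": "price_1"}
merged = partial_update_merge(stored, incoming)
```

   With the two records from the doc example, `merged` keeps `ts = 2` and `name_1` from the stored record and takes `price_1` from the incoming one.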


