Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/02/26 13:26:53 UTC

[GitHub] [hudi] nsivabalan commented on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC

nsivabalan commented on pull request #4880:
URL: https://github.com/apache/hudi/pull/4880#issuecomment-1052124364


   @danny0405 : Can you update the description of the PR?
   
   @danny0405 @xushiyan : I looked at the linked Jira. Without getting into implementation details (I have not looked at the actual fix yet), here is my understanding.
   Problem of interest: 
   ```
   Currently, the DELETE blocks are always written after the data blocks for one batch of data write, when there are INSERT/UPDATEs after the DELETE, the data would lost.
   
   What i can thought of is that the DELETE block should at least keep the event time sequence for #preCombine with other record payloads.
   ```
   
   My take is that we can't fix this in a foolproof manner unless we plan to retain delete records forever in Hudi.
   
   For example:
   If we get an insert in log block1, a delete in log block2 and an insert in log block3, then as per master we return the record of interest as valid, since we saw an insert after the delete (even if the delete record's preCombine value is greater than that of the insert following it).
   
   If I am not wrong, the patch fixes this, so that we still consider the record of interest deleted, since among all 3 versions of the record the delete record has the highest preCombine value.
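
A rough sketch of the two behaviours as I read them. This is a toy model, not Hudi code: `LogRecord`, `merge_master_like` and `merge_by_ordering` are made-up names, `ordering` stands in for the preCombine value, and the master-like rule is simplified to "a delete wipes whatever came before it, regardless of ordering".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LogRecord:
    key: str
    ordering: int            # stands in for the preCombine value
    value: Optional[str]     # None marks a delete


def merge_master_like(blocks: List[LogRecord]) -> Optional[str]:
    """Simplified master behaviour: a delete removes what was seen so far
    and its ordering value is ignored, so a later insert brings the key back."""
    current = None
    for rec in blocks:
        if rec.value is None:
            current = None
        elif current is None or rec.ordering >= current.ordering:
            current = rec
    return current.value if current else None


def merge_by_ordering(blocks: List[LogRecord]) -> Optional[str]:
    """Assumed patched behaviour: the highest ordering value wins,
    and deletes take part in the comparison like any other record."""
    current = None
    for rec in blocks:
        if current is None or rec.ordering >= current.ordering:
            current = rec
    return current.value if current else None


blocks = [
    LogRecord("k1", 1, "v1"),   # log block1: insert
    LogRecord("k1", 5, None),   # log block2: delete with the highest preCombine
    LogRecord("k1", 3, "v3"),   # log block3: insert with a lower preCombine
]

master_result = merge_master_like(blocks)    # record comes back as valid
patched_result = merge_by_ordering(blocks)   # record stays deleted
```

Under these assumptions the master-like merge returns the block3 value, while the ordering-based merge returns nothing for the key.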
   
   But what if compaction kicks in and we then come across a newer insert with a lower preCombine value? During compaction, we do not carry over any deleted records.
   
   To illustrate:
   an insert in log block1 and a delete in log block2,
   then compaction kicks in,
   and now let's say we get a new insert for the same record with a lower preCombine value.
   
   Since compaction does not carry over deleted records, the new record will be treated as a new insert. That matches what master does today even when compaction has not kicked in.
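
The compaction case can be sketched the same way. Again this is a toy model with hypothetical names (`merge`, `compact`), records are `(ordering, value)` pairs for a single key, `value=None` is a delete, and compaction is assumed to write out only the surviving record and drop delete markers.

```python
from typing import List, Optional, Tuple

Record = Tuple[int, Optional[str]]   # (ordering value, payload); None payload = delete


def merge(records: List[Record]) -> Optional[Record]:
    """Highest ordering value wins; deletes compete like any other record."""
    winner = None
    for rec in records:
        if winner is None or rec[0] >= winner[0]:
            winner = rec
    return winner


def compact(records: List[Record]) -> List[Record]:
    """Assumed compaction behaviour: keep the merged record, drop delete markers."""
    winner = merge(records)
    if winner is None or winner[1] is None:
        return []            # the delete is not carried over
    return [winner]


log = [(1, "v1"), (5, None)]     # insert in log block1, delete in log block2
late_insert = (3, "v3")          # arrives later with a lower preCombine

# No compaction in between: the delete still outranks the late insert.
without_compaction = merge(log + [late_insert])

# Compaction in between: the delete marker is gone, so the late insert wins.
base = compact(log)
with_compaction = merge(base + [late_insert])
```

The same late insert is rejected or accepted depending only on whether compaction ran first, which is the inconsistency I am asking about.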
   
   So, are we proposing to retain all deleted records forever in this patch, or are we only fixing the case where strict ordering is maintained until compaction kicks in?
   
   
   

