Posted to commits@hudi.apache.org by "Zheng yunhong (Jira)" <ji...@apache.org> on 2021/08/13 02:36:00 UTC

[jira] [Assigned] (HUDI-2299) The log format DELETE block lose the info orderingVal

     [ https://issues.apache.org/jira/browse/HUDI-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng yunhong reassigned HUDI-2299:
-----------------------------------

    Assignee: Zheng yunhong

> The log format DELETE block lose the info orderingVal
> -----------------------------------------------------
>
>                 Key: HUDI-2299
>                 URL: https://issues.apache.org/jira/browse/HUDI-2299
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: Common Core
>            Reporter: Danny Chen
>            Assignee: Zheng yunhong
>            Priority: Major
>             Fix For: 0.10.0
>
>
> The append handle now always writes the data block first and then the delete block, and the delete block keeps only the hoodie keys. When reading, the scanner reads the DELETE block without any ordering value information. Thus, if we write two records:
> insert: {id: 0, ts: 2}
> delete: {id: 0, ts: 1}
> the insert message ends up deleted. This is a critical bug for streaming writes, and we should fix it as soon as possible.
> _*Here is the discussion on slack*_:
> Danny Chan  12:42 PM
> https://issues.apache.org/jira/browse/HUDI-2299
> 12:43
> Hi, @vc, our user found a critical bug in the MOR log format: if there are out-of-order DELETEs in the streaming messages, the event time of the DELETEs is totally ignored.
> 12:44
> I guess this should be a blocker for 0.9 because it affects the correctness of the data set.
> vc  12:44 PM
> If we can fix it by end of day Friday PST
> 12:44
> we can add it
> 12:44
> Just want to cut a release this week.
> 12:45
> Do you have a sense for the fix? bandwidth to take it up?
> Danny Chan  12:46 PM
> I tried to fix it but cannot figure out a good way: if the DELETE block records the orderingVal, the format breaks compatibility.
> vc  1:05 PM
> We can version the format. That's doable. Should we precombine before even logging the deletes?
> Danny Chan  1:11 PM
> Yes, we should
> vc  1:26 PM
> I think that's how it's working today. Deletes don't have an ordering val per se, right?
> 1:28
> Delete block at t1 :
>   delete key k
> Data block at t2 :
>   ins key k with ordering val 2
> We can just fix it so that the insert shows up, since t2 > t1.
> For the kind of functionality you need, we need to do soft deletes, i.e. updates with an ordering value, instead of hard deletes.
> 1:28
> makes sense?
> Danny Chan  1:32 PM
> We can, but that’s not the perfect solution, especially if the dataset comes from a CDC source, for example the MySQL binlog. There is no extra flag in the schema for soft delete, though.
> 1:37
> In my opinion, it is not about soft DELETE or hard DELETE; even if we do a soft DELETE, the event time (orderingVal) is still important for consumers for versioning.
> vc  1:57 PM
> tbh, I don't see us fixing this in two days
> 1:58
> Let's do a 0.9.1 after this?
> 1:58
> shortly after with a bunch of bug fixes and the large pending PRs
> 1:58
> we can even make it 0.10.0
> Danny Chan  1:58 PM
> Yes, the cut time is very soon. We can move the fix to next version.
> vc  1:59 PM
> We have some inconsistent semantics in places
> 1:59
> some are commit time (arrival time) based and some are orderingVal (event time) based
> 2:00
> In the meantime, see HoodieDeleteBlockVersion: you can just define a new version for the delete block alone, for example,
> 2:00
> and add more information
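
To make the reported behaviour concrete, here is a minimal sketch of the two merge semantics discussed in the thread. It is not Hudi code: `LogRecord`, `merge_ignoring_order`, and `merge_by_ordering_val` are hypothetical names invented for illustration. It also assumes that a delete entry could carry an ordering value, which is exactly the information the issue says the DELETE block lacks today.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class LogRecord:
    """One entry read back from the log (hypothetical model, not a Hudi class)."""
    key: str
    ordering_val: int                   # event time, e.g. the `ts` field
    payload: Optional[Dict[str, Any]]   # None marks a delete


def merge_ignoring_order(records: List[LogRecord]) -> Dict[str, Dict[str, Any]]:
    """Behaviour described in the issue: a delete always removes the key."""
    state: Dict[str, LogRecord] = {}
    for rec in records:
        if rec.payload is None:
            state.pop(rec.key, None)    # ordering value is never consulted
        else:
            state[rec.key] = rec
    return {k: r.payload for k, r in state.items()}


def merge_by_ordering_val(records: List[LogRecord]) -> Dict[str, Dict[str, Any]]:
    """Event-time merge: an entry wins only if its ordering value is not older."""
    latest: Dict[str, LogRecord] = {}
    for rec in records:
        cur = latest.get(rec.key)
        if cur is None or rec.ordering_val >= cur.ordering_val:
            latest[rec.key] = rec
    # Keys whose newest entry is a delete are dropped from the result.
    return {k: r.payload for k, r in latest.items() if r.payload is not None}


# The two records from the issue, in the order the scanner sees them
# (data block first, then delete block).
records = [
    LogRecord(key="0", ordering_val=2, payload={"id": 0, "ts": 2}),
    LogRecord(key="0", ordering_val=1, payload=None),
]
print(merge_ignoring_order(records))    # insert is lost
print(merge_by_ordering_val(records))   # insert survives
```

With the ordering value available on the delete entry, the older delete (ts 1) no longer removes the newer insert (ts 2). Carrying that value in the log is what a new delete block version would have to provide.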



--
This message was sent by Atlassian Jira
(v8.3.4#803005)