Posted to commits@hudi.apache.org by "Christopher Weaver (Jira)" <ji...@apache.org> on 2020/04/17 18:09:00 UTC

[jira] [Created] (HUDI-802) AWSDmsTransformer does not handle insert -> delete of a row in a single batch correctly

Christopher Weaver created HUDI-802:
---------------------------------------

             Summary: AWSDmsTransformer does not handle insert -> delete of a row in a single batch correctly
                 Key: HUDI-802
                 URL: https://issues.apache.org/jira/browse/HUDI-802
             Project: Apache Hudi (incubating)
          Issue Type: Bug
          Components: DeltaStreamer
            Reporter: Christopher Weaver


The provided AWSDmsAvroPayload class ([https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/payload/AWSDmsAvroPayload.java]) currently handles the case where the "Op" column is "D" for a row that already exists in the table (the update path), and successfully removes the row from the resulting table.

However, when an insert is quickly followed by a delete of the same row (e.g. DMS processes them together and puts both records in the same parquet file), the row incorrectly appears in the resulting table. In this case the record is not yet in the table, so getInsertValue is called rather than combineAndGetUpdateValue. Since the delete check lives only in combineAndGetUpdateValue, it is skipped and the delete is missed. Something like this could fix the issue: [https://github.com/Weves/incubator-hudi/blob/release-0.5.1/hudi-spark/src/main/java/org/apache/hudi/payload/CustomAWSDmsAvroPayload.java].
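
To illustrate the idea, here is a minimal, self-contained sketch of applying the same delete check on both code paths. It is a simplified model and not the actual Hudi API: the class name DmsPayloadSketch, the helper handleDelete, and the use of a plain Map and java.util.Optional in place of Avro records and Hudi's own types are all assumptions made for illustration. Only the "Op" column, the "D" value, and the two method names come from the description above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for a DMS payload. A record is modeled as a
// Map<String, String> instead of an Avro record (assumption for brevity).
final class DmsPayloadSketch {
    static final String OP_FIELD = "Op";   // DMS operation column
    static final String DELETE_OP = "D";   // value that marks a delete

    private final Map<String, String> incoming;

    DmsPayloadSketch(Map<String, String> incoming) {
        this.incoming = new HashMap<>(incoming);
    }

    // Shared check: an empty result means "do not write this row".
    private Optional<Map<String, String>> handleDelete() {
        if (DELETE_OP.equals(incoming.get(OP_FIELD))) {
            return Optional.empty();
        }
        return Optional.of(new HashMap<>(incoming));
    }

    // Insert path: the key is not in the table yet. Without the delete
    // check here, a record whose latest operation is "D" would be written.
    Optional<Map<String, String>> getInsertValue() {
        return handleDelete();
    }

    // Update path: the key already exists in the table. The current value
    // is ignored in this sketch because the incoming record always wins.
    Optional<Map<String, String>> combineAndGetUpdateValue(Map<String, String> current) {
        return handleDelete();
    }
}
```

With this shape, a record carrying Op = "D" yields an empty result whether or not the key already exists in the table, so an insert followed by a delete in the same batch would not leave the row behind.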



--
This message was sent by Atlassian Jira
(v8.3.4#803005)