Posted to commits@hudi.apache.org by "Christopher Weaver (Jira)" <ji...@apache.org> on 2020/04/17 18:09:00 UTC
[jira] [Created] (HUDI-802) AWSDmsTransformer does not handle
insert -> delete of a row in a single batch correctly
Christopher Weaver created HUDI-802:
---------------------------------------
Summary: AWSDmsTransformer does not handle insert -> delete of a row in a single batch correctly
Key: HUDI-802
URL: https://issues.apache.org/jira/browse/HUDI-802
Project: Apache Hudi (incubating)
Issue Type: Bug
Components: DeltaStreamer
Reporter: Christopher Weaver
The provided AWSDmsAvroPayload class ([https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/payload/AWSDmsAvroPayload.java]) currently handles a record whose "Op" column is "D" only on the update path, where it successfully removes the row from the resulting table.
However, when an insert is quickly followed by a delete of the same row (e.g. DMS processes them together and writes both change records to the same parquet file), the row incorrectly appears in the resulting table. In this case the record is not yet in the table, so getInsertValue is called rather than combineAndGetUpdateValue. Since the check for a delete lives only in combineAndGetUpdateValue, it is skipped and the delete is missed. Something like this could fix the issue: [https://github.com/Weves/incubator-hudi/blob/release-0.5.1/hudi-spark/src/main/java/org/apache/hudi/payload/CustomAWSDmsAvroPayload.java].
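As a rough illustration of the idea, the sketch below applies the same "Op" check on both the insert and the update path. It is a simplified model, not the actual Hudi code and not necessarily what the linked CustomAWSDmsAvroPayload does: records are plain maps instead of Avro records, java.util.Optional stands in for Hudi's own option type, and the class name DmsPayloadSketch and the helper dropIfDelete are hypothetical. Only the method names getInsertValue and combineAndGetUpdateValue come from the description above, and their signatures are reduced here.

```java
import java.util.Map;
import java.util.Optional;

/**
 * Simplified stand-in for a DMS payload (hypothetical, not the Hudi class).
 * A record is modeled as a Map; an empty Optional means "no row should exist".
 */
final class DmsPayloadSketch {
    // Column that DMS uses to mark the operation: "I", "U" or "D".
    static final String OP_FIELD = "Op";

    private final Map<String, String> incoming;

    DmsPayloadSketch(Map<String, String> incoming) {
        this.incoming = incoming;
    }

    // Shared delete check (hypothetical helper): drop the record when Op is "D".
    private static Optional<Map<String, String>> dropIfDelete(Map<String, String> record) {
        boolean isDelete = "D".equals(record.get(OP_FIELD));
        return isDelete ? Optional.empty() : Optional.of(record);
    }

    // Insert path: the key is not in the table yet. Running the check here
    // keeps an insert -> delete pair in one batch from leaving a row behind.
    Optional<Map<String, String>> getInsertValue() {
        return dropIfDelete(incoming);
    }

    // Update path: the key already exists; the incoming record still decides.
    Optional<Map<String, String>> combineAndGetUpdateValue(Map<String, String> currentValue) {
        return dropIfDelete(incoming);
    }
}
```

With this shape, a payload carrying Op = "D" yields an empty result whether or not the key already exists in the table, while payloads with any other Op value pass through unchanged.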