Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/04/22 09:04:00 UTC

[GitHub] [incubator-hudi] PhatakN1 opened a new issue #1549: Potential issue when using Deltastreamer with DMS

PhatakN1 opened a new issue #1549:
URL: https://github.com/apache/incubator-hudi/issues/1549


   Hi,
   I am using DMS to stream data changes from a MySQL database to S3, and then use DeltaStreamer on EMR to push this data to Hudi. Since I also need to do some minor transformations along the way, I use org.apache.hudi.utilities.transform.SqlQueryBasedTransformer on the command line instead of org.apache.hudi.utilities.transform.AWSDmsTransformer. Because DMS does not include the Op column when doing a full load, I use
   hoodie.deltastreamer.transformer.sql=select C1,C2,...,'I' as Op from <SRC> to inject the column into the record. Everything works well with COPY_ON_WRITE datasets. However, I have found a couple of issues when using this with MERGE_ON_READ datasets.
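   (For reference, a sketch of how such a configuration might look in a DeltaStreamer properties file; the record key and partition path values are placeholders, not the exact setup described above:)
   ```
   # illustrative properties file passed to DeltaStreamer via --props
   hoodie.datasource.write.recordkey.field=tran_id
   hoodie.datasource.write.partitionpath.field=tran_date
   # inject a constant Op column so that full-load records look like inserts
   hoodie.deltastreamer.transformer.sql=SELECT C1, C2, 'I' AS Op FROM <SRC>
   ```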
   
   I have 2 scenarios:
   1. I do an insert and a delete on my source database, both of which belong to the same partition in Hudi. I explicitly set disable-compaction when I run DeltaStreamer. What I have found is that when I query the _ro table using Spark SQL, both the insert and the delete are applied and the result is returned accordingly. Ideally, since I disabled compaction, I should have seen the data without the insert and the delete applied. A query on _rt provides the right results, with both the delete and insert applied as expected. When I look into the specific S3 folder, I see that the insert and delete into the partition actually create a new .parquet file with no log file. That may be the reason why the select on the _ro table provides data with the insert and delete applied.
   2. In scenario 2, I delete 1 record each in the source table, belonging to 2 different Hudi partitions. This time, I again run DeltaStreamer with the disable-compaction option. When I look at the S3 folders for these partitions, I see the .log file in both folders. However, when I query both the _ro and _rt tables using Spark SQL, neither reflects the deletes. Based on my understanding, the _rt table should reflect the deletes.





[GitHub] [incubator-hudi] n3nash commented on issue #1549: Potential issue when using Deltastreamer with DMS

Posted by GitBox <gi...@apache.org>.
n3nash commented on issue #1549:
URL: https://github.com/apache/incubator-hudi/issues/1549#issuecomment-626030218


   @PhatakN1 any updates?





[GitHub] [incubator-hudi] PhatakN1 edited a comment on issue #1549: Potential issue when using Deltastreamer with DMS

Posted by GitBox <gi...@apache.org>.
PhatakN1 edited a comment on issue #1549:
URL: https://github.com/apache/incubator-hudi/issues/1549#issuecomment-619587422


   Does this mean that hard deletes would work out of the box for COW datasets but not for MOR datasets? I also went through the documentation at https://cwiki.apache.org/confluence/display/HUDI/2020/01/15/Delete+support+in+Hudi which talks about delete support in Hudi, and the example it provides is also on a COW dataset. It also needs the addition of a field called _hoodie_is_deleted to the source. I tried adding this field as well, but the query on the _rt table of a MOR dataset still shows the record (with _hoodie_is_deleted=true).
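   (For illustration, such a flag could be injected through the same SQL transformer by deriving it from the DMS Op column; this is a sketch, not the exact query used:)
   ```
   hoodie.deltastreamer.transformer.sql=SELECT C1, C2, Op, (Op = 'D') AS _hoodie_is_deleted FROM <SRC>
   ```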
   
   When I try the same insert/update/delete without even adding the _hoodie_is_deleted field, it works by looking at the Op field that DMS populates.
   
   So, as of now, when working with DMS data as the source and leveraging DeltaStreamer, is a COW dataset preferred over MOR?





[GitHub] [incubator-hudi] n3nash commented on issue #1549: Potential issue when using Deltastreamer with DMS

Posted by GitBox <gi...@apache.org>.
n3nash commented on issue #1549:
URL: https://github.com/apache/incubator-hudi/issues/1549#issuecomment-619420599


   @PhatakN1 So this is what is possibly happening : 
   
   1) Hard deletes in Hudi are only supported by following a certain contract with your payload. Your payload implementation should carry an "empty" record value. Something like this -> https://github.com/apache/incubator-hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/model/EmptyHoodieRecordPayload.java that is supported out of the box by HoodieWriteClient here -> https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java#L307.
   
   To simulate this with your own implementation (AwsDMSPayload), you can override the methods `getInsertValue` and `combineAndGetUpdateValue` (take a look at the EmptyHoodieRecordPayload above); a sketch follows after point 2 below.
   
   2) In your specific use-case, since your payload isn't an "empty" payload, even though the MERGE on the realtime query is happening through the implementation of your payload, Hudi doesn't know whether this is a "hard delete" or a "soft delete" - the only way for Hudi to know it's a hard delete is the way I described above. 
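   (For illustration, a rough sketch of the kind of payload described in point 1; the class name and the handling of the DMS "Op" column are assumptions based on this thread, not the actual AWSDmsAvroPayload source:)
   ```
   import java.io.IOException;
   
   import org.apache.avro.Schema;
   import org.apache.avro.generic.GenericRecord;
   import org.apache.avro.generic.IndexedRecord;
   import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
   import org.apache.hudi.common.util.Option;
   
   // Illustrative payload: records whose DMS "Op" column is "D" are turned into
   // "empty" values, which is the contract Hudi uses to treat a record as a hard delete.
   public class DmsDeleteAwarePayload extends OverwriteWithLatestAvroPayload {
   
     public DmsDeleteAwarePayload(GenericRecord record, Comparable orderingVal) {
       super(record, orderingVal);
     }
   
     // Return an empty Option when the record represents a DMS delete.
     private Option<IndexedRecord> dropIfDelete(IndexedRecord value) {
       Object op = ((GenericRecord) value).get("Op");
       boolean isDelete = op != null && "D".equalsIgnoreCase(op.toString());
       return isDelete ? Option.empty() : Option.of(value);
     }
   
     @Override
     public Option<IndexedRecord> getInsertValue(Schema schema) throws IOException {
       Option<IndexedRecord> insert = super.getInsertValue(schema);
       return insert.isPresent() ? dropIfDelete(insert.get()) : insert;
     }
   
     @Override
     public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema) throws IOException {
       Option<IndexedRecord> merged = super.combineAndGetUpdateValue(currentValue, schema);
       return merged.isPresent() ? dropIfDelete(merged.get()) : merged;
     }
   }
   ```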
   
   Let me know if you have further questions 








[GitHub] [incubator-hudi] PhatakN1 edited a comment on issue #1549: Potential issue when using Deltastreamer with DMS

Posted by GitBox <gi...@apache.org>.
PhatakN1 edited a comment on issue #1549:
URL: https://github.com/apache/incubator-hudi/issues/1549#issuecomment-618292312


   If MOR inserts go to a parquet file but updates go to a log file, then a query on the _ro table will show the inserts since the last compaction but not the updates. Isn't that like providing an inconsistent state of the data? So, I still see all inserts since the last compaction but none of the updates?
   
   These are the contents of the log file, obtained using `show logfile records` in hudi-cli:
   ```
   {"_hoodie_commit_time": "20200422083923", "_hoodie_commit_seqno": "20200422083923_1_2", "_hoodie_record_key": "11", "_hoodie_partition_path": "2019-03-14", "_hoodie_file_name": "c9df1d00-5dda-4bf7-8f27-1d4534bbbe4c-0", "dms_received_ts": "2020-04-22T08:38:36.873970Z", "tran_id": 11, "tran_date": "2019-03-14", "store_id": 5, "store_city": "CHICAGO", "store_state": "IL", "item_code": "XXXXXX", "quantity": 15, "total": 106.25, "Op": "D"}
   ```
   
   This is the log file metadata
   ```
   ║ 20200422083923 │ 1           │ AVRO_DATA_BLOCK │ {"SCHEMA":"{\"type\":\"record\",\"name\":\"retail_transactions\",\"fields\":[{\"name\":\"_hoodie_commit_time\",\"type\":[\"null\",\"string\"],\"doc\":\"\",\"default\":null},{\"name\":\"_hoodie_commit_seqno\",\"type\":[\"null\",\"string\"],\"doc\":\"\",\"default\":null},{\"name\":\"_hoodie_record_key\",\"type\":[\"null\",\"string\"],\"doc\":\"\",\"default\":null},{\"name\":\"_hoodie_partition_path\",\"type\":[\"null\",\"string\"],\"doc\":\"\",\"default\":null},{\"name\":\"_hoodie_file_name\",\"type\":[\"null\",\"string\"],\"doc\":\"\",\"default\":null},{\"name\":\"dms_received_ts\",\"type\":\"string\"},{\"name\":\"tran_id\",\"type\":\"int\"},{\"name\":\"tran_date\",\"type\":\"string\"},{\"name\":\"store_id\",\"type\":\"int\"},{\"name\":\"store_city\",\"type\":\"string\"},{\"name\":\"store_state\",\"type\":\"string\"},{\"name\":\"item_code\",\"type\":\"string\"},{\"name\":\"quantity\",\"type\":\"int\"},{\"name\":\"total\",\"type\":\"float\"},{\"name\":\"Op\",\"type\":\"string\"}]}","INSTANT_TIME":"20200422083923"} │ {}             ║
   ```
   
   The name of the parquet file in the partition is c9df1d00-5dda-4bf7-8f27-1d4534bbbe4c-0_3-23-40_20200422072539.parquet and the log file name is `c9df1d00-5dda-4bf7-8f27-1d4534bbbe4c-0_20200422072539.log.1_1-24-33`
   
   The partition metadata contents are:
   ```
   commitTime=20200422072539
   partitionDepth=1
   ```
   Not sure why a query on the _rt table does not reflect the delete. 
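   (For context, the comparison being made is between queries of roughly this shape, assuming Hive-synced tables named after hoodie.table.name with the usual _ro/_rt suffixes:)
   ```
   -- read-optimized view: reads only the compacted base parquet files
   SELECT tran_id, Op FROM retail_transactions_ro WHERE tran_id = 11;
   -- realtime view: merges the base files with the log files at query time
   SELECT tran_id, Op FROM retail_transactions_rt WHERE tran_id = 11;
   ```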





[GitHub] [incubator-hudi] PhatakN1 commented on issue #1549: Potential issue when using Deltastreamer with DMS

Posted by GitBox <gi...@apache.org>.
PhatakN1 commented on issue #1549:
URL: https://github.com/apache/incubator-hudi/issues/1549#issuecomment-617692524


   And this is on Hudi 0.5.2.








[GitHub] [incubator-hudi] n3nash edited a comment on issue #1549: Potential issue when using Deltastreamer with DMS

Posted by GitBox <gi...@apache.org>.
n3nash edited a comment on issue #1549:
URL: https://github.com/apache/incubator-hudi/issues/1549#issuecomment-618811796


   @vinothchandar  We do invoke the same payload when combining records during merge/compaction. For deletes, the payload has to be an empty payload and then the record should be skipped -> https://github.com/apache/incubator-hudi/blob/master/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeCompactedRecordReader.java#L94
   
   @PhatakN1 when you try deletes, is that an empty payload? Or is this something you just drive through configs in DeltaStreamer?





[GitHub] [incubator-hudi] vinothchandar commented on issue #1549: Potential issue when using Deltastreamer with DMS

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1549:
URL: https://github.com/apache/incubator-hudi/issues/1549#issuecomment-618806837


   as long as the records are the same and you are using the payload, it shouldn't matter... 
   
   Let me try to repro this myself.. I am puzzled since I do see the payload class written into hoodie.properties.. So what should happen is that the payload's `combineAndGetUpdateValue()` should be invoked ... From the code though, it seems like this may not be happening..
   
   cc @n3nash are you able to confirm? My understanding was we will invoke the same payload in rt merge path. no?





[GitHub] [incubator-hudi] vinothchandar commented on issue #1549: Potential issue when using Deltastreamer with DMS

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1549:
URL: https://github.com/apache/incubator-hudi/issues/1549#issuecomment-618179797


   
   > When I look into the specific S3 folder, I see that the insert and delete into the partition actually  create a new .parquet file with no log file.
   
   So inserts in MOR still go to a parquet file; only updates go to a log file (merging is much more expensive, since it reads, merges and writes parquet, rather than just writing parquet). So what you saw is expected behavior. 
   
   
   > it does not reflect the deletes.Based on my understanding, the _rt table should reflect the deletes
   
   True.. it should reflect the deletes. MOR would have logged a delete block into the log file and the keys should be listed there.. Do you know the log files? If so, you can use the CLI to see what's inside the logs; there is a command to inspect the log file there.. 
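   (For reference, a sketch of such a hudi-cli session; the base path and log-file pattern are placeholders, and the exact option names are worth confirming with `help` inside the CLI:)
   ```
   connect --path s3://my-bucket/retail_transactions
   show logfile metadata --logFilePathPattern "s3://my-bucket/retail_transactions/2019-03-14/.*"
   show logfile records --logFilePathPattern "s3://my-bucket/retail_transactions/2019-03-14/.*"
   ```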
   
   Happy to get this ironed out.. 
   
   
   





[GitHub] [incubator-hudi] n3nash edited a comment on issue #1549: Potential issue when using Deltastreamer with DMS

Posted by GitBox <gi...@apache.org>.
n3nash edited a comment on issue #1549:
URL: https://github.com/apache/incubator-hudi/issues/1549#issuecomment-619654653


   @PhatakN1 COW & MOR both support all the operations; choosing which one to use is based on your use-case rather than the feature set. 
   Are you overriding the method `combineAndGetUpdateValue` in your custom payload implementation? This part of the code -> https://github.com/apache/incubator-hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/model/OverwriteWithLatestAvroPayload.java#L70 basically takes care of that in the `OverwriteWithLatestAvroPayload` payload implementation. But if you override that and make use of your own class and method implementation, then you'll need to ensure you do the same in your code.








[GitHub] [incubator-hudi] vinothchandar commented on issue #1549: Potential issue when using Deltastreamer with DMS

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1549:
URL: https://github.com/apache/incubator-hudi/issues/1549#issuecomment-618800202


   @PhatakN1 ah okay.. Since Hudi itself is not aware of DMS or the `"Op": "D"` column, it does log a data block with the deleted record.. I suspect the `AwsDMSPayload` is not getting used for merging the base and log files for the query.. 
   
   Could you also paste the contents of `.hoodie/hoodie.properties`?  





[GitHub] [incubator-hudi] PhatakN1 commented on issue #1549: Potential issue when using Deltastreamer with DMS

Posted by GitBox <gi...@apache.org>.
PhatakN1 commented on issue #1549:
URL: https://github.com/apache/incubator-hudi/issues/1549#issuecomment-618848088


   DMS basically adds an 'Op' column with values 'I', 'U' or 'D' specifying the operation on the table. My understanding of HoodieDeltaStreamer with AWSDmsAvroPayload is that a record with Op=D will delete the data in the table. In fact, if the delete and insert happen in the same partition in a MOR table, this row does not show up in the query, which tells me that my understanding is right. However, if I run DeltaStreamer with no compaction, the payload with Op=D goes to the log file in the partition. And when I query the _rt table, this record shows up in the output (the Op field in the output is D).














[GitHub] [incubator-hudi] n3nash commented on issue #1549: Potential issue when using Deltastreamer with DMS

Posted by GitBox <gi...@apache.org>.
n3nash commented on issue #1549:
URL: https://github.com/apache/incubator-hudi/issues/1549#issuecomment-619146422


   @PhatakN1 okay, let me look into the test case for _rt deletes to see if there are any gaps. Once I confirm that, we can come back and drill down into your specific use-case to see what's happening.





[GitHub] [incubator-hudi] bvaradar commented on issue #1549: Potential issue when using Deltastreamer with DMS

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1549:
URL: https://github.com/apache/incubator-hudi/issues/1549#issuecomment-620252568


   @n3nash: Assigning this ticket to you.





[GitHub] [incubator-hudi] PhatakN1 edited a comment on issue #1549: Potential issue when using Deltastreamer with DMS

Posted by GitBox <gi...@apache.org>.
PhatakN1 edited a comment on issue #1549:
URL: https://github.com/apache/incubator-hudi/issues/1549#issuecomment-618803454


   These are the contents of hoodie.properties
   ```
   ----------------------------------------------------------------------------------------
   hoodie.compaction.payload.class=org.apache.hudi.payload.AWSDmsAvroPayload
   hoodie.table.name=retail_transactions
   hoodie.archivelog.folder=archived
   hoodie.table.type=MERGE_ON_READ
   hoodie.timeline.layout.version=1
   ----------------------------------------------------------------------------------------
   ```
   
   Some more background and context on what I did:
   I used MySQL --> DMS --> S3 --> Hudi for the initial load of the table. This is where I used hoodie.compaction.payload.class=org.apache.hudi.payload.AWSDmsAvroPayload in my command.
   
   For CDC, I used MySQL --> DMS --> Kafka --> Hudi. Here, I used JsonKafkaSource in my command. 
   Would this cause an issue somewhere?
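   (For context, a rough sketch of what the CDC invocation described above might look like; the jar, paths, and property file locations are placeholders, and flag names can vary slightly between releases:)
   ```
   spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
     /path/to/hudi-utilities-bundle.jar \
     --table-type MERGE_ON_READ \
     --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
     --source-ordering-field dms_received_ts \
     --payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
     --target-base-path s3://my-bucket/retail_transactions \
     --target-table retail_transactions \
     --props s3://my-bucket/config/dms-kafka-source.properties \
     --disable-compaction
   ```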

