You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by "Vinoth Chandar (Jira)" <ji...@apache.org> on 2019/11/26 13:02:00 UTC

[jira] [Commented] (HUDI-311) Support AWS DMS source on DeltaStreamer

    [ https://issues.apache.org/jira/browse/HUDI-311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16982449#comment-16982449 ] 

Vinoth Chandar commented on HUDI-311:
-------------------------------------

This is what we get as parquet files on S3, for bulk load, Insert, Update, Delete sequence off a MySQL CDC

{code}
scala> spark.read.parquet("file:///home/vinoth/Downloads/LOAD00000001.parquet").show(10, false)
+-------+----------+-------------+-------------+
|user_id|first_name|last_name    |company      |
+-------+----------+-------------+-------------+
|1      |vinoth    |chandar      |confluent inc|
|2      |balaji    |varadarajan  |uber         |
|3      |sudha     |saktheeswaran|uber         |
+-------+----------+-------------+-------------+


scala> spark.read.parquet("file:///home/vinoth/Downloads/20191126-124151666.parquet").show(10, false)
+---+-------+----------+-----------+---------+
|Op |user_id|first_name|last_name  |company  |
+---+-------+----------+-----------+---------+
|I  |4      |prasanna  |rajaperumal|snowflake|
+---+-------+----------+-----------+---------+


scala> spark.read.parquet("file:///home/vinoth/Downloads/20191126-124528981.parquet").show(10, false)
+---+-------+----------+---------+-------+
|Op |user_id|first_name|last_name|company|
+---+-------+----------+---------+-------+
|U  |1      |vinoth    |chandar  |????   |
+---+-------+----------+---------+-------+


scala> spark.read.parquet("file:///home/vinoth/Downloads/20191126-125001909.parquet").show(10, false)
+---+-------+----------+-----------+---------+
|Op |user_id|first_name|last_name  |company  |
+---+-------+----------+-----------+---------+
|D  |4      |prasanna  |rajaperumal|snowflake|
+---+-------+----------+-----------+---------+


scala> 
{code}

We need 
- a special payload implementation that looks at Op type and issues deletes
- Custom SQL transformer, that can  add the OP column if not present (seems its not present for the bulk load schema)


cc [~uditme] [~rbhartia] Seems doable..  Just FYI 

> Support AWS DMS source on DeltaStreamer
> ---------------------------------------
>
>                 Key: HUDI-311
>                 URL: https://issues.apache.org/jira/browse/HUDI-311
>             Project: Apache Hudi (incubating)
>          Issue Type: New Feature
>          Components: deltastreamer
>            Reporter: Vinoth Chandar
>            Assignee: Vinoth Chandar
>            Priority: Major
>             Fix For: 0.5.1
>
>
> https://aws.amazon.com/dms/ seems like a one-stop shop for database change logs. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)