Posted to commits@hudi.apache.org by "Raymond Xu (Jira)" <ji...@apache.org> on 2022/10/13 14:00:04 UTC
[jira] [Updated] (HUDI-310) DynamoDB/Kinesis Change Capture using Delta Streamer
[ https://issues.apache.org/jira/browse/HUDI-310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Raymond Xu updated HUDI-310:
----------------------------
Epic Link: HUDI-1385
> DynamoDB/Kinesis Change Capture using Delta Streamer
> ----------------------------------------------------
>
> Key: HUDI-310
> URL: https://issues.apache.org/jira/browse/HUDI-310
> Project: Apache Hudi
> Issue Type: New Feature
> Components: deltastreamer
> Reporter: Vinoth Chandar
> Assignee: Vinay
> Priority: Major
>
> The goal here is to capture changes (CDC) from DynamoDB and ingest them into S3 as a Hudi dataset.
> A few resources:
> # DynamoDB Streams [https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html] provides change capture logs through a Kinesis-like API.
> # Walkthrough [https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.KCLAdapter.Walkthrough.html] and code [https://github.com/awslabs/dynamodb-streams-kinesis-adapter] for the DynamoDB Streams Kinesis adapter.
> # Spark Streaming has support for reading Kinesis streams [https://spark.apache.org/docs/2.4.4/streaming-kinesis-integration.html]. One of the many resources showing how to change the Spark Kinesis example code to consume a DynamoDB stream: [https://medium.com/@ravi72munde/using-spark-streaming-with-dynamodb-d325b9a73c79]
> # In DeltaStreamer, we need to add some form of KinesisSource that returns an RDD with new data every time `fetchNewData` is called [https://github.com/apache/incubator-hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/Source.java]. DeltaStreamer itself does not use Spark Streaming APIs.
> # Internally, we have Avro, Json, Row sources that extract data in these formats.
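The pull-based contract described in item 4 can be sketched roughly as follows. This is a minimal Python model, not Hudi code: `InMemoryShard`, `KinesisLikeSource`, and `fetch_new_data` are hypothetical names, and an in-memory list stands in for a Kinesis shard. It only illustrates the assumed shape of the call: take the last checkpoint and a record limit, return the new records plus the next checkpoint.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class InMemoryShard:
    """Hypothetical stand-in for a stream shard: an append-only list of records."""
    records: List[str]


class KinesisLikeSource:
    """Hypothetical pull-based source; models the checkpoint contract only."""

    def __init__(self, shard: InMemoryShard) -> None:
        self.shard = shard

    def fetch_new_data(
        self, last_checkpoint: Optional[str], source_limit: int
    ) -> Tuple[List[str], str]:
        # The checkpoint is the offset of the next unread record, kept as a string.
        start = int(last_checkpoint) if last_checkpoint is not None else 0
        # Never move the checkpoint backwards, even if it points past the end.
        end = max(start, min(start + source_limit, len(self.shard.records)))
        return self.shard.records[start:end], str(end)


shard = InMemoryShard(records=["r0", "r1", "r2", "r3", "r4"])
source = KinesisLikeSource(shard)

# Each call resumes from the checkpoint returned by the previous call.
first_batch, ckpt = source.fetch_new_data(None, 2)
second_batch, ckpt = source.fetch_new_data(ckpt, 2)
third_batch, ckpt = source.fetch_new_data(ckpt, 2)
fourth_batch, ckpt = source.fetch_new_data(ckpt, 2)
```

A real source would presumably need one checkpoint entry per shard (for example, a sequence number per shard ID) rather than a single offset; the single integer here is a simplification.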
> Open questions:
> # Should this just be a KinesisSource inside Hudi that needs to be configured differently, or do we need two sources: a DynamoDBKinesisSource (that does some DynamoDB Streams specific setup/assumptions) and a plain KinesisSource? Which is more valuable to do, if we have to pick one?
> # For Kafka integration, we just reused the KafkaRDD in Spark Streaming easily and avoided writing a lot of code by hand. Could we pull the same thing off for Kinesis? (probably needs digging through Spark code)
> # What's the format of the data for DynamoDB streams?
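On question 3, a rough sketch, to be checked against the AWS documentation: a DynamoDB Streams record carries an `eventName` (`INSERT`, `MODIFY`, or `REMOVE`) and a `dynamodb` object with `Keys`, a `SequenceNumber`, and, depending on the stream view type, `NewImage` and/or `OldImage`. Each attribute is wrapped in a type descriptor such as `{"S": "..."}` or `{"N": "..."}`. The Python below flattens such a record into a plain row. The helper names `unwrap` and `flatten_record`, the `_op`/`_seq` column names, and the sample record are all made up for illustration.

```python
from decimal import Decimal
from typing import Any, Dict


def unwrap(attr: Dict[str, Any]) -> Any:
    """Convert one typed attribute value, e.g. {"N": "42"}, to a plain Python value."""
    type_tag, value = next(iter(attr.items()))
    if type_tag == "S":
        return value
    if type_tag == "N":
        return Decimal(value)  # numbers arrive as strings
    if type_tag == "BOOL":
        return bool(value)
    if type_tag == "NULL":
        return None
    if type_tag == "L":
        return [unwrap(v) for v in value]
    if type_tag == "M":
        return {k: unwrap(v) for k, v in value.items()}
    if type_tag == "SS":
        return set(value)
    if type_tag == "NS":
        return {Decimal(v) for v in value}
    # Binary types (B, BS) are left out of this sketch.
    raise ValueError(f"unhandled attribute type: {type_tag}")


def flatten_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten one stream record into a row, tagging it with the change type."""
    change = record["dynamodb"]
    # Prefer the new image; fall back to the old image or keys (e.g. for REMOVE).
    image = change.get("NewImage") or change.get("OldImage") or change["Keys"]
    row = {name: unwrap(attr) for name, attr in image.items()}
    row["_op"] = record["eventName"]
    row["_seq"] = change["SequenceNumber"]
    return row


# Made-up sample record in the assumed shape.
sample = {
    "eventID": "1",
    "eventName": "MODIFY",
    "dynamodb": {
        "Keys": {"id": {"S": "user-1"}},
        "NewImage": {
            "id": {"S": "user-1"},
            "visits": {"N": "3"},
            "active": {"BOOL": True},
        },
        "SequenceNumber": "111",
        "StreamViewType": "NEW_AND_OLD_IMAGES",
    },
}
row = flatten_record(sample)
```

If this shape holds, the `_op` field could drive upsert versus delete handling in Hudi, and the record key fields come from `Keys`.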
>
>
> We should probably flesh these out before going ahead with the implementation.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)