Posted to commits@seatunnel.apache.org by GitBox <gi...@apache.org> on 2022/08/10 08:20:48 UTC

[GitHub] [incubator-seatunnel] CalvinKirs opened a new issue, #2394: [Feature]Support CDC

CalvinKirs opened a new issue, #2394:
URL: https://github.com/apache/incubator-seatunnel/issues/2394

   Change data capture (CDC) refers to the process of identifying and capturing changes made to data in a database and then delivering those changes in real-time to a downstream process or system.
   
   CDC implementations fall mainly into two categories: query-based and binlog-based.
   MySQL records changes to the database in its binlog (binary log), so one of the simplest and most efficient ways to implement CDC is to read the binlog. There are already many open-source MySQL CDC implementations that work out of the box. Reading the binlog is not the only way to implement CDC (at least for MySQL): database triggers can do similar work, but they compare poorly in efficiency and in their impact on the database.
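
To make the query-based category concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `users` table, its `updated_at` column, and the `poll_changes` helper are assumptions made up for illustration; they do not come from SeaTunnel or from any CDC tool.

```python
import sqlite3


def poll_changes(conn, last_seen):
    """Return rows modified after `last_seen`, plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM users "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else last_seen
    return rows, new_watermark


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)
conn.execute("INSERT INTO users VALUES (1, 'alice', 100)")
conn.execute("INSERT INTO users VALUES (2, 'bob', 110)")

# First poll sees both inserted rows.
first_batch, watermark = poll_changes(conn, 0)

conn.execute("UPDATE users SET name = 'bobby', updated_at = 120 WHERE id = 2")
conn.execute("DELETE FROM users WHERE id = 1")  # leaves no row to poll

# Second poll sees the update, but the delete is invisible to polling.
second_batch, watermark = poll_changes(conn, watermark)
```

The second poll returns only the updated row. The deleted row never shows up, which is one reason log-based capture is usually preferred when deletes matter.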
   
   Typically, after a CDC tool captures changes to a database, it publishes the change events to a message queue for consumers. Debezium, for example, persists changes from MySQL (it also supports PostgreSQL, MongoDB, and others) to Kafka; by subscribing to those events in Kafka, we can get the content of the changes and implement the functionality we need.
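
As a rough illustration of the consumer side, the sketch below applies change events to an in-memory replica. The events are simplified JSON strings modeled on Debezium's `op` / `before` / `after` fields. Real Debezium messages carry more (source metadata, and optionally a schema wrapper), and reading them from Kafka is left out here. The `id` primary key and the `apply_event` helper are assumptions for this example.

```python
import json


def apply_event(replica, raw_event):
    """Apply one simplified Debezium-style change event to a dict keyed by id."""
    event = json.loads(raw_event)
    op = event["op"]
    if op in ("c", "r", "u"):  # create, snapshot read, update
        row = event["after"]
        replica[row["id"]] = row
    elif op == "d":  # delete
        replica.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unknown op: {op!r}")


events = [
    '{"op": "c", "before": null, "after": {"id": 1, "name": "alice"}}',
    '{"op": "u", "before": {"id": 1, "name": "alice"},'
    ' "after": {"id": 1, "name": "alicia"}}',
    '{"op": "c", "before": null, "after": {"id": 2, "name": "bob"}}',
    '{"op": "d", "before": {"id": 2, "name": "bob"}, "after": null}',
]

replica = {}
for raw in events:
    apply_event(replica, raw)
```

After the four events, the replica holds only row 1 with its updated name. Applying events strictly in order is what keeps the replica consistent with the source.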
   
   As a data synchronization tool, I think SeaTunnel needs to support CDC as a feature, and I would like to hear how you all think it can be implemented in SeaTunnel.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@seatunnel.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-seatunnel] 2013650523 commented on issue #2394: [Feature]Support CDC

Posted by GitBox <gi...@apache.org>.
2013650523 commented on issue #2394:
URL: https://github.com/apache/incubator-seatunnel/issues/2394#issuecomment-1212875617

   I'm going to use Debezium to get the incremental real-time log, pass it to the API of the new Connector, save the source state, and read it back.
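
One possible reading of "save the source state and read it back" is an offset checkpoint that lets a reader resume after a restart. The sketch below is only an assumption about what that could look like: `OffsetStore`, `run_reader`, and the list standing in for the change log are hypothetical and are not part of the SeaTunnel connector API or of Debezium.

```python
import json
import os
import tempfile


class OffsetStore:
    """Persists the reader position to a JSON file."""

    def __init__(self, path):
        self.path = path

    def load(self):
        if not os.path.exists(self.path):
            return None
        with open(self.path) as f:
            return json.load(f)

    def save(self, offset):
        # Write to a temp file first so a crash cannot leave a half-written file.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(offset, f)
        os.replace(tmp, self.path)


def run_reader(log, store, limit):
    """Emit up to `limit` events, starting after the last saved position."""
    state = store.load()
    position = state["position"] if state else 0
    emitted = []
    for event in log[position:position + limit]:
        emitted.append(event)
        position += 1
        store.save({"position": position})
    return emitted


log = ["e1", "e2", "e3", "e4"]  # stand-in for a database change log
path = os.path.join(tempfile.mkdtemp(), "offset.json")

first_run = run_reader(log, OffsetStore(path), limit=2)
# Simulate a restart: a fresh store object reads the saved position back.
second_run = run_reader(log, OffsetStore(path), limit=10)
```

Because the offset is saved after each event is emitted, a crash between the two steps would replay that event on restart. The sketch is therefore at-least-once rather than exactly-once.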




[GitHub] [incubator-seatunnel] melin commented on issue #2394: [Feature]Support CDC

Posted by GitBox <gi...@apache.org>.
melin commented on issue #2394:
URL: https://github.com/apache/incubator-seatunnel/issues/2394#issuecomment-1272277209

   Implement CDC data synchronization to Hudi based on Debezium Server
   
   https://github.com/apache/hudi/issues/6853
   
   @CalvinKirs 




[GitHub] [incubator-seatunnel] CalvinKirs commented on issue #2394: [Feature]Support CDC

Posted by GitBox <gi...@apache.org>.
CalvinKirs commented on issue #2394:
URL: https://github.com/apache/incubator-seatunnel/issues/2394#issuecomment-1214333683

   > We can implement this based on FlinkCDC, configure through conf, and dynamically generate DDL SQL for Source and Sink.
   
   FlinkCDC is based on Flink, and we need an independent engine.




[GitHub] [incubator-seatunnel] guanboo commented on issue #2394: [Feature]Support CDC

Posted by GitBox <gi...@apache.org>.
guanboo commented on issue #2394:
URL: https://github.com/apache/incubator-seatunnel/issues/2394#issuecomment-1213376728

   We can implement this based on FlinkCDC, configure through conf, and dynamically generate SQL for source and sink.




[GitHub] [incubator-seatunnel] github-actions[bot] closed issue #2394: [Feature]Support CDC

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed issue #2394: [Feature]Support CDC
URL: https://github.com/apache/incubator-seatunnel/issues/2394




[GitHub] [incubator-seatunnel] ashulin commented on issue #2394: [Feature]Support CDC

Posted by GitBox <gi...@apache.org>.
ashulin commented on issue #2394:
URL: https://github.com/apache/incubator-seatunnel/issues/2394#issuecomment-1275513274

   > We can implement it based on [Debezium](https://debezium.io/) and [Netflix's DBLog parallel algorithm](https://netflixtechblog.com/dblog-a-generic-change-data-capture-framework-69351fb9099b)
   
   1. Consider multi-table and sharding scenarios, to reduce the difficulty of user configuration
   2. Support parallel reading of historical data
   3. Support reading incremental data
   4. Support heartbeat detection (for low-traffic sources)
   5. Support dynamically adding new tables
   6. Support schema evolution (DDL)
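
For the point about parallel reading of historical data, one common approach is to split the primary-key range into chunks and read the chunks concurrently. The sketch below assumes a dense integer key and uses an in-memory dict in place of a real table. `split_chunks` and `read_chunk` are hypothetical helpers, not an existing SeaTunnel or Debezium API.

```python
from concurrent.futures import ThreadPoolExecutor


def split_chunks(min_key, max_key, chunk_size):
    """Split the inclusive key range into inclusive (low, high) chunks."""
    chunks = []
    low = min_key
    while low <= max_key:
        high = min(low + chunk_size - 1, max_key)
        chunks.append((low, high))
        low = high + 1
    return chunks


# Stand-in for a table with primary keys 1..10.
TABLE = {key: f"row-{key}" for key in range(1, 11)}


def read_chunk(bounds):
    """Read one chunk, like SELECT ... WHERE id BETWEEN low AND high."""
    low, high = bounds
    return [(key, TABLE[key]) for key in range(low, high + 1) if key in TABLE]


chunks = split_chunks(1, 10, chunk_size=4)

# Each chunk is read by its own worker; map() returns results in chunk order.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(read_chunk, chunks))

snapshot = [row for chunk_rows in results for row in chunk_rows]
```

Real tables rarely have dense integer keys, so a production splitter would need to sample the key distribution or page through the index. That detail is out of scope for this sketch.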




[GitHub] [incubator-seatunnel] CalvinKirs commented on issue #2394: [Feature]Support CDC

Posted by GitBox <gi...@apache.org>.
CalvinKirs commented on issue #2394:
URL: https://github.com/apache/incubator-seatunnel/issues/2394#issuecomment-1212890086

   > I'm going to use Debezium to get the incremental real-time bin log, pass it to the API of the new Connector, save the source status and read back.
   
   Any detailed designs can be posted here; we'll discuss them first.




[GitHub] [incubator-seatunnel] cason0126 commented on issue #2394: [Feature]Support CDC

Posted by GitBox <gi...@apache.org>.
cason0126 commented on issue #2394:
URL: https://github.com/apache/incubator-seatunnel/issues/2394#issuecomment-1310514331

   I have a few thoughts on the necessity and implementation of CDC:
   
   Necessity: For a new-generation data integration platform, CDC is essential, because in practice the demand for change stream capture in enterprises keeps growing. The rapid development of Flink CDC is a good example.
   --If SeaTunnel does not strengthen its CDC capabilities, users who process batch data with SeaTunnel will still need an additional CDC tool afterwards, such as Canal, Debezium, or Flink CDC.
   
   In addition, Flink CDC itself uses Debezium underneath for change stream capture.
   
   Several implementation concerns:
   ---Engine > Connector.
   Commonly requested databases, such as MySQL and Oracle, have roughly the same CDC implementation scheme: all are based on the database's log (the binlog for MySQL, the redo log for Oracle). For example:
   The Flink CDC Oracle connector uses Debezium for CDC content collection, and Debezium in turn uses a LogMiner-based solution.
   StreamSets' processing of Oracle is also based on LogMiner.
   Therefore, CDC content collection should have lower priority, and the design of the processing engine should be considered first. This may include the unified handling of CDC processing concerns such as consistency, resuming from a breakpoint, the handoff between batch and stream reading, fault tolerance, and failover.
   In this part, Flink CDC is a worthwhile reference.
   
   Since 2.3.0, SeaTunnel has had its own computing engine, which is the cornerstone for processing CDC (most of the time the outlet of a CDC stream is single-threaded, so processing does not need to be distributed and does not rely on computing engines such as Flink or Spark).
   
   ---Format compatibility. In most cases, once SeaTunnel has CDC processing capabilities, it will also need to process messages sent to Kafka by other CDC tools. Therefore, compatibility with common formats, such as Flink CDC's, should also be considered. In other words, whether SeaTunnel should design its own format independently to allow rapid development, or stay compatible with the common formats of other components, is also worth considering.
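
To illustrate the format-compatibility concern, the sketch below normalizes two simplified message shapes into one internal changelog representation. The shapes are modeled on Debezium (`op`, `before`, `after`) and Canal JSON (`type`, `data`, `old`) but are deliberately reduced. The row-kind names and both helper functions are assumptions for this example and do not describe a format that SeaTunnel actually defines.

```python
def from_debezium(event):
    """Convert a simplified Debezium-style event into (kind, row) records."""
    op = event["op"]
    if op in ("c", "r"):  # create or snapshot read
        return [("INSERT", event["after"])]
    if op == "u":
        return [("UPDATE_BEFORE", event["before"]),
                ("UPDATE_AFTER", event["after"])]
    if op == "d":
        return [("DELETE", event["before"])]
    raise ValueError(f"unknown op: {op!r}")


def from_canal(event):
    """Convert a simplified Canal-style event into (kind, row) records."""
    kind = event["type"]
    rows = event["data"]
    # "old" holds only the changed columns' previous values; absent on inserts.
    olds = event.get("old") or [{}] * len(rows)
    records = []
    for row, old in zip(rows, olds):
        if kind == "INSERT":
            records.append(("INSERT", row))
        elif kind == "DELETE":
            records.append(("DELETE", row))
        elif kind == "UPDATE":
            before = {**row, **old}  # rebuild the full before-image
            records.append(("UPDATE_BEFORE", before))
            records.append(("UPDATE_AFTER", row))
        else:
            raise ValueError(f"unknown type: {kind!r}")
    return records


debezium_update = {
    "op": "u",
    "before": {"id": 1, "name": "alice"},
    "after": {"id": 1, "name": "alicia"},
}
canal_update = {
    "type": "UPDATE",
    "data": [{"id": 1, "name": "alicia"}],
    "old": [{"name": "alice"}],
}

normalized_a = from_debezium(debezium_update)
normalized_b = from_canal(canal_update)
```

Both inputs describe the same update and normalize to the same pair of records. Downstream processing can then stay independent of which tool produced the message.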




[GitHub] [incubator-seatunnel] github-actions[bot] commented on issue #2394: [Feature]Support CDC

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on issue #2394:
URL: https://github.com/apache/incubator-seatunnel/issues/2394#issuecomment-1482082469

   This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in the next 7 days if no further activity occurs.




[GitHub] [incubator-seatunnel] TyrantLucifer commented on issue #2394: [Feature]Support CDC

Posted by GitBox <gi...@apache.org>.
TyrantLucifer commented on issue #2394:
URL: https://github.com/apache/incubator-seatunnel/issues/2394#issuecomment-1214289288

   What is the community's plan for implementing CDC in pure Java, without relying on a framework such as Flink?




[GitHub] [incubator-seatunnel] ashulin commented on issue #2394: [Feature]Support CDC

Posted by GitBox <gi...@apache.org>.
ashulin commented on issue #2394:
URL: https://github.com/apache/incubator-seatunnel/issues/2394#issuecomment-1220606097

   We can implement it based on [Debezium](https://debezium.io/) and  [Netflix's DBLog parallel algorithm](https://netflixtechblog.com/dblog-a-generic-change-data-capture-framework-69351fb9099b)
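
For readers unfamiliar with the DBLog approach, the sketch below simulates its watermark idea as I understand it from the linked blog post. A chunk of rows is selected between a low and a high watermark written into the log, and any chunk row whose key also appears in the log between the two watermarks is dropped so that it cannot overwrite a newer change. The tuple-based log and the `merge_chunk_with_log` helper are simplifications invented for this example, not DBLog's or Debezium's actual code.

```python
def merge_chunk_with_log(log_entries, chunk_rows):
    """Interleave a snapshot chunk with log events using watermarks.

    log_entries: ordered list of ("low",), ("high",) or ("event", key, value).
    chunk_rows:  dict of key -> value selected between the two watermarks.
    """
    output = []
    remaining = dict(chunk_rows)
    in_window = False
    for entry in log_entries:
        kind = entry[0]
        if kind == "low":
            in_window = True
        elif kind == "high":
            in_window = False
            # Emit surviving chunk rows at the high-watermark position.
            output.extend(("snapshot", k, v) for k, v in remaining.items())
            remaining = {}
        else:
            _, key, value = entry
            if in_window:
                # The log has a newer change for this key; drop the chunk row.
                remaining.pop(key, None)
            output.append(("log", key, value))
    return output


log_entries = [
    ("event", 9, "x"),
    ("low",),
    ("event", 2, "b-new"),  # key 2 changed while the chunk was being read
    ("high",),
    ("event", 3, "c-new"),
]
chunk_rows = {1: "a", 2: "b-old", 3: "c"}

merged = merge_chunk_with_log(log_entries, chunk_rows)
```

The stale snapshot row for key 2 is discarded. The snapshot rows for keys 1 and 3 are emitted before the later log event for key 3, so the newer value still wins downstream.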




[GitHub] [incubator-seatunnel] github-actions[bot] commented on issue #2394: [Feature]Support CDC

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on issue #2394:
URL: https://github.com/apache/incubator-seatunnel/issues/2394#issuecomment-1491129924

   This issue has been closed because it has not received a response for too long. You can reopen it if you encounter similar problems in the future.

