Posted to commits@seatunnel.apache.org by GitBox <gi...@apache.org> on 2022/07/07 03:03:05 UTC

[GitHub] [incubator-seatunnel] Emor-nj opened a new pull request, #2147: [Connector-V2]Add Hudi Source

Emor-nj opened a new pull request, #2147:
URL: https://github.com/apache/incubator-seatunnel/pull/2147

   Related to https://github.com/apache/incubator-seatunnel/issues/1946
   Add a new Hudi source connector.
   The first version only supports Hudi COW tables and Snapshot Query. (MOR tables, Incremental Query, and Read Optimized Query will be added later.)
   I have passed the Hudi source test in both Spark and Flink batch mode. (Source is Hudi, Sink is Console)
   example spark config:
   
   ```
   env {
     job.mode = BATCH
     spark.app.name = "SeaTunnel"
     spark.executor.instances = 2
     spark.executor.cores = 1
     spark.executor.memory = "1g"
     spark.master = local
   }
   source {
     Hudi {
       table.path = "/test/hudi/test_order/"
       table.type = "cow"
       conf.files = "/home/test/hdfs-site.xml;/home/test/core-site.xml;/home/test/yarn-site.xml"
       use.kerberos = true
       kerberos.principal = "test_user"
       kerberos.principal.file = "/home/test/test_user.keytab"
     }
   }
   transform {
   }
   sink {
     Console {}
   }
   ```
   example flink config:
   
   ```
   env {
     job.mode = "BATCH"
     execution.parallelism = 1
   }
   source {
     Hudi {
       table.path = "/test/hudi/test_order/"
       table.type = "cow"
       conf.files = "/home/test/hdfs-site.xml;/home/test/core-site.xml;/home/test/yarn-site.xml"
       use.kerberos = true
       kerberos.principal = "test_user"
       kerberos.principal.file = "/home/test_user/test_user.keytab"
     }
   }
   transform {
   }
   sink {
     Console {}
   }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@seatunnel.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-seatunnel] CalvinKirs merged pull request #2147: [Connector-V2]Add Hudi Source

Posted by GitBox <gi...@apache.org>.
CalvinKirs merged PR #2147:
URL: https://github.com/apache/incubator-seatunnel/pull/2147




[GitHub] [incubator-seatunnel] Emor-nj commented on pull request #2147: [Connector-V2]Add Hudi Source

Posted by GitBox <gi...@apache.org>.
Emor-nj commented on PR #2147:
URL: https://github.com/apache/incubator-seatunnel/pull/2147#issuecomment-1178458610

   > Could you add docs to here https://github.com/apache/incubator-seatunnel/tree/dev/docs/en/connector-v2
   
   Sure, docs added.




[GitHub] [incubator-seatunnel] cason0126 commented on pull request #2147: [Connector-V2]Add Hudi Source

Posted by GitBox <gi...@apache.org>.
cason0126 commented on PR #2147:
URL: https://github.com/apache/incubator-seatunnel/pull/2147#issuecomment-1310513308

   I have several opinions on the necessity and implementation of CDC:
   
   1. Necessity: as a new-generation data integration platform, SeaTunnel needs CDC support, because enterprise demand for change-stream capture keeps growing. The rapid adoption of Flink CDC is a good example.
   
   --If SeaTunnel does not strengthen its CDC processing capability, users will face the problem of needing additional CDC tools (such as Canal, Debezium, or Flink CDC) after using ST for batch data.
   
   In addition, note that Flink CDC itself uses Debezium underneath for change-stream capture.
   
   2. Several implementation concerns:
   	---Engine > Connector.
   		The databases with the most demand, such as MySQL and Oracle, have roughly the same CDC implementation schemes: all are based on the database's change log (binlog for MySQL, redo log for Oracle). For example:
     			**Flink's Oracle CDC uses Debezium for change capture, and Debezium in turn uses a LogMiner-based solution.
     			StreamSets' handling of Oracle is also based on LogMiner.**
   		Therefore, CDC content collection should be a lower priority, and the design of the processing engine should be considered first. This includes the parts of CDC processing that must be handled uniformly, such as **consistency, checkpoint/resume, batch-stream unification, fault tolerance, and failover**.
   		Flink CDC is worth referencing for this part.
   
   	Since SeaTunnel 2.3.0, it has its own computing engine, which is the cornerstone of CDC processing (most of the time, when the outlet of the CDC stream is a single thread, the processing does not need to be distributed, so it does not depend on a computing engine such as Flink or Spark).
   
   	---Format compatibility. In most cases, once SeaTunnel has CDC processing capability, it will need to process messages that other CDC tools have written to Kafka. Therefore, compatibility with common formats, such as the Debezium/Flink CDC format, should also be considered. In other words, SeaTunnel must decide whether to design its own format (to ensure rapid development) or stay compatible with formats already common in the market; this trade-off is worth considering.
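   To make the format-compatibility point concrete, here is a minimal Python sketch of consuming a Debezium-style change event. The sample record is hypothetical, but the `before`/`after`/`op` envelope fields follow Debezium's documented message shape; nothing here reflects an actual SeaTunnel API.
   
   ```python
   import json
   
   # Hypothetical Debezium-style change event, as it might arrive from Kafka.
   # "op" codes per Debezium: c = create, u = update, d = delete, r = snapshot read.
   raw_event = json.dumps({
       "before": None,
       "after": {"order_id": 1001, "status": "PAID"},
       "source": {"connector": "mysql", "table": "test_order"},
       "op": "c",
       "ts_ms": 1667812345678,
   })
   
   def classify(event_json: str) -> str:
       """Map a Debezium-style op code onto a generic row-change kind."""
       event = json.loads(event_json)
       return {"c": "INSERT", "r": "INSERT", "u": "UPDATE", "d": "DELETE"}[event["op"]]
   
   print(classify(raw_event))  # INSERT
   ```
   
   A tool that understands this envelope can interoperate with Debezium, Flink CDC, and anything else that emits the same shape, which is the argument for compatibility over a brand-new format.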

