Posted to commits@seatunnel.apache.org by GitBox <gi...@apache.org> on 2022/08/12 02:40:28 UTC

[GitHub] [incubator-seatunnel] ashulin opened a new issue, #1608: [Umbrella][Feature][Core] Decoupling connectors from compute engines

ashulin opened a new issue, #1608:
URL: https://github.com/apache/incubator-seatunnel/issues/1608

   ### Search before asking
   
   - [X] I had searched in the [feature](https://github.com/apache/incubator-seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22) and found no similar feature requirement.
   
   
   ### Description
   
   In the current implementation of SeaTunnel, connectors are coupled with the compute engine,
   so implementing a connector requires implementing the interface of each compute engine.
   
   The detailed design doc: https://docs.google.com/document/d/1h-BeGKK98VarSpYUfYkNtpYLVbYCh8PM_09N1pdq_aw/edit?usp=sharing
   
   ### Motivation
   1. A connector only needs to be implemented once and can then be used on all engines;
   2. Multiple versions of the Spark/Flink engines can be supported;
   3. The Source interface makes partition/shard/split/parallel logic explicit;
   4. JDBC/log connections can be multiplexed.
   
   **Why not use Apache Beam?**
   Apache Beam sources are divided into two categories, Unbounded and Bounded, so a connector cannot be written once and used for both.
   
   ### Overall Design
   
   ![SeaTunnel Framework](https://user-images.githubusercontent.com/36807946/162152969-c103a9a1-affe-4b15-94db-57be2678515a.png)
   
   -   **Catalog**: Metadata management, which can automatically discover the schema and other information of the structured database;
   -   **Catalog Storage**: Used to store metadata for unstructured storage engines (e.g. Kafka);
   
   -   **SQL**:
   
   -   **DataType**: Table Column Data Type;
   
   -   **Table API**: Used for context passing and SeaTunnel Source/Sink instantiation
   
   -   **Source API**:
        - Explicit partition/shard/split/parallel logic;
        - Batch & Streaming Unification;
        - Multiplexing source connection;
   
   -   **Sink API**:
        - Distributed transaction;
        - Aggregated commits;
   
   -   **Translation**:
        - Make the engine support the SeaTunnel connector.
        - Convert data to Row inside the engine.
        - Data distribution after multiplexing.
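
The Source API and Translation bullets above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the SeaTunnel API: `Boundedness`, `Split`, `Source`, `RangeSource`, and `run_on_toy_engine` are all hypothetical names. It assumes a source declares whether it is bounded, enumerates its own splits explicitly, and is run by a thin engine-side adapter that converts records to the engine's row type.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterator, List, Protocol, Tuple


class Boundedness(Enum):
    BOUNDED = "bounded"      # batch job
    UNBOUNDED = "unbounded"  # streaming job


@dataclass(frozen=True)
class Split:
    """Hypothetical unit of parallel work: one id range of a table."""
    table: str
    start: int
    end: int


class Source(Protocol):
    """Hypothetical engine-independent source contract (an assumption)."""
    boundedness: Boundedness

    def enumerate_splits(self, parallelism: int) -> List[Split]: ...

    def read(self, split: Split) -> Iterator[dict]: ...


class RangeSource:
    """Toy connector that emits rows with ids in [0, total)."""
    boundedness = Boundedness.BOUNDED

    def __init__(self, table: str, total: int) -> None:
        self.table = table
        self.total = total

    def enumerate_splits(self, parallelism: int) -> List[Split]:
        # Ceiling division so that at most `parallelism` splits are produced
        step = max(1, -(-self.total // parallelism))
        return [Split(self.table, lo, min(lo + step, self.total))
                for lo in range(0, self.total, step)]

    def read(self, split: Split) -> Iterator[dict]:
        for i in range(split.start, split.end):
            yield {"table": split.table, "id": i}


def run_on_toy_engine(source: Source, parallelism: int) -> List[Tuple[str, int]]:
    """Stand-in for a translation layer: turns records into engine rows (tuples)."""
    rows = []
    for split in source.enumerate_splits(parallelism):
        for record in source.read(split):
            rows.append((record["table"], record["id"]))
    return rows
```

In this sketch the connector never imports anything engine-specific; only the adapter function would need to be rewritten per engine.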
   
   ### Simple Flow
   
   ![SeaTunnel Flow](https://user-images.githubusercontent.com/36807946/162159155-038b1bea-6d2c-4345-8076-f393c76ba168.png)
   
   **Why do we need to multiplex connections?**
   In streaming scenarios:
   - An RDB (e.g. MySQL) may report "too many connections" errors or come under heavy load;
   - In change data capture (CDC) scenarios, the same logs (e.g. MySQL binlog, Oracle redo log) are parsed repeatedly;
   
   ### Simple Source & Sink Flow
   ![SeaTunnel Engine Flow](https://user-images.githubusercontent.com/36807946/163544249-2a0f99f8-55e5-42c5-9d12-cc4cdc9d541e.png)
   
   ### The subtasks: 
   - [x] #1701
   - [x] #1704
   - [x] #1734
   
   ### Are you willing to submit a PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@seatunnel.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-seatunnel] ashulin closed issue #1608: [Umbrella][Feature][Core] Decoupling connectors from compute engines

Posted by GitBox <gi...@apache.org>.
ashulin closed issue #1608: [Umbrella][Feature][Core] Decoupling connectors from compute engines
URL: https://github.com/apache/incubator-seatunnel/issues/1608




[GitHub] [incubator-seatunnel] yx91490 commented on issue #1608: [Umbrella][Feature][Core] Decoupling connectors from compute engines

Posted by GitBox <gi...@apache.org>.
yx91490 commented on issue #1608:
URL: https://github.com/apache/incubator-seatunnel/issues/1608#issuecomment-1100779988

   What do you mean by multiple connections? Does it mean that one job can have multiple sources of different types, or that one job can have multiple connections to the same source instance, working on the same or different splits of a table?




[GitHub] [incubator-seatunnel] lonelyGhostisdog commented on issue #1608: [Umbrella][Feature][Core] Decoupling connectors from compute engines

Posted by GitBox <gi...@apache.org>.
lonelyGhostisdog commented on issue #1608:
URL: https://github.com/apache/incubator-seatunnel/issues/1608#issuecomment-1122386352

   Is one Coordinator better than many? Because if an exception occurs at Sink and the Source does not know about it and continues to send data downstream, is the data discarded? If the Sink end notifies the Coordinator on the Source end, why not use the same one?




[GitHub] [incubator-seatunnel] ashulin closed issue #1608: [Umbrella][Feature][Core] Decoupling connectors from compute engines

Posted by "ashulin (via GitHub)" <gi...@apache.org>.
ashulin closed issue #1608: [Umbrella][Feature][Core] Decoupling connectors from compute engines
URL: https://github.com/apache/incubator-seatunnel/issues/1608




[GitHub] [incubator-seatunnel] gaojun2048 commented on issue #1608: [Umbrella][Feature][Core] Decoupling connectors from compute engines

Posted by GitBox <gi...@apache.org>.
gaojun2048 commented on issue #1608:
URL: https://github.com/apache/incubator-seatunnel/issues/1608#issuecomment-1123155696

   > > 
   > 
   > I have another question. If a Connector is not implemented in Spark or Flink, how will we implement this new Connector if we are based on a new engine? Do we still need to implement the corresponding Connector in Spark or Flink first?
   
   I suggest that the new engine also adapt to the unified API. Then, if a user develops a connector based on the unified API, the connector will be able to run on Spark, Flink, and SeaTunnel's own engine.




[GitHub] [incubator-seatunnel] lonelyGhostisdog commented on issue #1608: [Umbrella][Feature][Core] Decoupling connectors from compute engines

Posted by GitBox <gi...@apache.org>.
lonelyGhostisdog commented on issue #1608:
URL: https://github.com/apache/incubator-seatunnel/issues/1608#issuecomment-1122393741

   Anyway, looking forward to the new API. 🎉




[GitHub] [incubator-seatunnel] lonelyGhostisdog commented on issue #1608: [Umbrella][Feature][Core] Decoupling connectors from compute engines

Posted by GitBox <gi...@apache.org>.
lonelyGhostisdog commented on issue #1608:
URL: https://github.com/apache/incubator-seatunnel/issues/1608#issuecomment-1125037172

   From the code, I only saw the abstract top-level design for catalog, table, source, sink, etc., and some logic for converting SeaTunnel types to Flink types. There is no distributed execution, network transmission, or conversion of SeaTunnel connectors to Flink/Spark connectors.
   Has it not been committed? Where can I find the corresponding designs?




[GitHub] [incubator-seatunnel] gaojun2048 commented on issue #1608: [Umbrella][Feature][Core] Decoupling connectors from compute engines

Posted by GitBox <gi...@apache.org>.
gaojun2048 commented on issue #1608:
URL: https://github.com/apache/incubator-seatunnel/issues/1608#issuecomment-1127188341

   > Or are we going to do a datax-like model for now?
   
   A seatunnel-engine is being designed. For offline synchronization it is closer to DataX, and it supports distributed deployment and execution. It also needs to support real-time synchronization.
   
   We plan to discuss these in the mailing list after the preliminary design is completed.




[GitHub] [incubator-seatunnel] ashulin commented on issue #1608: [Umbrella][Feature][Core] Decoupling connectors from compute engines

Posted by GitBox <gi...@apache.org>.
ashulin commented on issue #1608:
URL: https://github.com/apache/incubator-seatunnel/issues/1608#issuecomment-1122507093

   > Is one Coordinator better than many? Because if an exception occurs at Sink and the Source does not know about it and continues to send data downstream, is the data discarded? If the Sink end notifies the Coordinator on the Source end, why not use the same one?
   
   @lonelyGhostisdog Fault tolerance between source and sink is provided by the Chandy-Lamport algorithm (i.e. checkpointing).
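
To make the checkpoint answer concrete, here is a heavily simplified sketch of barrier-style checkpointing in the spirit of Chandy-Lamport. It is an illustration under stated assumptions, not SeaTunnel code: there is a single source and a single sink, the barrier is modelled as a direct call rather than a marker message on a channel, and the names `Source`, `Sink`, and `run_with_checkpoints` are hypothetical. The sink only makes records durable at a barrier, so after a failure the source rewinds to the last snapshot offset and nothing is lost or duplicated.

```python
from typing import List, Optional


class Source:
    def __init__(self, data: List[str]) -> None:
        self.data = data
        self.offset = 0

    def snapshot(self) -> int:
        return self.offset

    def restore(self, offset: int) -> None:
        self.offset = offset

    def poll(self) -> Optional[str]:
        if self.offset >= len(self.data):
            return None
        record = self.data[self.offset]
        self.offset += 1
        return record


class Sink:
    def __init__(self) -> None:
        self.pending: List[str] = []    # written since the last barrier
        self.committed: List[str] = []  # durable output

    def write(self, record: str) -> None:
        self.pending.append(record)

    def on_barrier(self) -> None:
        # Everything received before the barrier becomes durable
        self.committed.extend(self.pending)
        self.pending.clear()

    def rollback(self) -> None:
        self.pending.clear()


def run_with_checkpoints(source: Source, sink: Sink, interval: int,
                         fail_on: Optional[str] = None) -> List[str]:
    """Checkpoint every `interval` records; fail once when `fail_on` is seen."""
    last_checkpoint = source.snapshot()
    since_barrier = 0
    while True:
        record = source.poll()
        if record is None:
            break
        if fail_on is not None and record == fail_on:
            fail_on = None                   # simulate a single sink failure
            sink.rollback()                  # drop uncommitted writes
            source.restore(last_checkpoint)  # rewind to the last snapshot
            since_barrier = 0
            continue
        sink.write(record)
        since_barrier += 1
        if since_barrier == interval:
            last_checkpoint = source.snapshot()
            sink.on_barrier()
            since_barrier = 0
    sink.on_barrier()  # final barrier at end of input
    return sink.committed
```

With input `a..e`, an interval of 2, and a failure on `d`, the uncommitted `c` is discarded, the source rewinds to offset 2, and the committed output still contains each record exactly once.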




[GitHub] [incubator-seatunnel] lonelyGhostisdog commented on issue #1608: [Umbrella][Feature][Core] Decoupling connectors from compute engines

Posted by GitBox <gi...@apache.org>.
lonelyGhostisdog commented on issue #1608:
URL: https://github.com/apache/incubator-seatunnel/issues/1608#issuecomment-1123132307

   > > We do translation work for different engines, so how do we handle inconsistent semantics or inconsistent functionality? For example, Spark and Flink both have checkpoints, but their mechanisms and functions are completely different. If we expose a checkpoint capability externally, how can we unify checkpointing across the two engines?
   > 
   > @lonelyGhostisdog Because the checkpoint mechanisms are not fully consistent, we are still discussing how to adapt Spark's micro-batch and continuous readers.
   
   Why don't we design a new engine that doesn't depend on Spark or Flink?




[GitHub] [incubator-seatunnel] gaojun2048 commented on issue #1608: [Umbrella][Feature][Core] Decoupling connectors from compute engines

Posted by GitBox <gi...@apache.org>.
gaojun2048 commented on issue #1608:
URL: https://github.com/apache/incubator-seatunnel/issues/1608#issuecomment-1123146072

   > > > We do translation work for different engines, so how do we handle inconsistent semantics or inconsistent functionality? For example, Spark and Flink both have checkpoints, but their mechanisms and functions are completely different. If we expose a checkpoint capability externally, how can we unify checkpointing across the two engines?
   > > 
   > > 
   > > @lonelyGhostisdog Because the checkpoint mechanisms are not fully consistent, we are still discussing how to adapt Spark's micro-batch and continuous readers.
   > 
   > Why don't we design a new engine that doesn't depend on Spark or Flink?
   
   The new engine is being designed. However, because a large number of SeaTunnel users are using Flink and Spark, we hope to stay compatible with the Flink and Spark engines as much as possible. In the future, if our own engine is good enough, we will discuss again whether to continue relying on Flink and Spark.




[GitHub] [incubator-seatunnel] lonelyGhostisdog commented on issue #1608: [Umbrella][Feature][Core] Decoupling connectors from compute engines

Posted by GitBox <gi...@apache.org>.
lonelyGhostisdog commented on issue #1608:
URL: https://github.com/apache/incubator-seatunnel/issues/1608#issuecomment-1122392243

   We do translation work for different engines, so how do we handle inconsistent semantics or inconsistent functionality? For example, Spark and Flink both have checkpoints, but their mechanisms and functions are completely different. If we expose a checkpoint capability externally, how can we unify checkpointing across the two engines?




[GitHub] [incubator-seatunnel] lonelyGhostisdog commented on issue #1608: [Umbrella][Feature][Core] Decoupling connectors from compute engines

Posted by GitBox <gi...@apache.org>.
lonelyGhostisdog commented on issue #1608:
URL: https://github.com/apache/incubator-seatunnel/issues/1608#issuecomment-1123193820

   Thanks, everyone, for answering my questions. Maybe I need to read the code to understand the design better.
   
   
   




[GitHub] [incubator-seatunnel] ashulin commented on issue #1608: [Umbrella][Feature][Core] Decoupling connectors from compute engines

Posted by GitBox <gi...@apache.org>.
ashulin commented on issue #1608:
URL: https://github.com/apache/incubator-seatunnel/issues/1608#issuecomment-1101054230

   > What do you mean by multiple connections? Does it mean that one job can have multiple sources of different types, or that one job can have multiple connections to the same source instance, working on the same or different splits of a table?
   
   @yx91490 Multiplexing source connection: a source instance can read data from multiple tables.
   At present, the Spark and Flink source connectors create a connection for each table; that is, a source instance only reads one table.
   This is fine for offline jobs, but not acceptable for real-time sync jobs with hundreds (or more) of tables.
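
The difference can be sketched as follows. This is a hypothetical illustration in Python, not the SeaTunnel implementation: `FakeConnection` stands in for a JDBC or binlog connection and simply counts how many are opened, while `MultiplexedSource` and `route` are invented names. One source instance opens a single connection, reads changes for several tables from it, and the records are then distributed back into per-table streams.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


class FakeConnection:
    """Stand-in for a database or log connection; counts opened connections."""
    opened = 0

    def __init__(self, tables_data: Dict[str, List[dict]]) -> None:
        FakeConnection.opened += 1
        self._data = tables_data

    def read_changes(self) -> Iterator[Tuple[str, dict]]:
        # A single pass over a shared log that contains every table
        for table, rows in self._data.items():
            for row in rows:
                yield table, row


class MultiplexedSource:
    """One source instance, one connection, many tables."""

    def __init__(self, data: Dict[str, List[dict]], tables: Iterable[str]) -> None:
        self._conn = FakeConnection(data)
        self._tables = set(tables)

    def read(self) -> Iterator[Tuple[str, dict]]:
        for table, row in self._conn.read_changes():
            if table in self._tables:
                yield table, row


def route(records: Iterable[Tuple[str, dict]]) -> Dict[str, List[dict]]:
    """Distribute multiplexed records back into per-table streams."""
    out: Dict[str, List[dict]] = {}
    for table, row in records:
        out.setdefault(table, []).append(row)
    return out
```

With a source per table, the connection count grows with the number of tables; here it stays at one no matter how many tables are subscribed.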




[GitHub] [incubator-seatunnel] ashulin commented on issue #1608: [Umbrella][Feature][Core] Decoupling connectors from compute engines

Posted by GitBox <gi...@apache.org>.
ashulin commented on issue #1608:
URL: https://github.com/apache/incubator-seatunnel/issues/1608#issuecomment-1123127769

   > We do translation work for different engines, so how do we handle inconsistent semantics or inconsistent functionality? For example, Spark and Flink both have checkpoints, but their mechanisms and functions are completely different. If we expose a checkpoint capability externally, how can we unify checkpointing across the two engines?
   
   @lonelyGhostisdog Because the checkpoint mechanisms are not fully consistent, we are still discussing how to adapt Spark's micro-batch and continuous readers.




[GitHub] [incubator-seatunnel] gaojun2048 commented on issue #1608: [Umbrella][Feature][Core] Decoupling connectors from compute engines

Posted by GitBox <gi...@apache.org>.
gaojun2048 commented on issue #1608:
URL: https://github.com/apache/incubator-seatunnel/issues/1608#issuecomment-1127186787

   > From the code, I only saw the abstract top-level design for catalog, table, source, sink, etc., and some logic for converting SeaTunnel types to Flink types. There is no distributed execution, network transmission, or conversion of SeaTunnel connectors to Flink/Spark connectors. Has it not been committed? Where can I find the corresponding designs?
   
   The `api-draft` branch contains only the connector API; it does not contain st-engine.




[GitHub] [incubator-seatunnel] lonelyGhostisdog commented on issue #1608: [Umbrella][Feature][Core] Decoupling connectors from compute engines

Posted by GitBox <gi...@apache.org>.
lonelyGhostisdog commented on issue #1608:
URL: https://github.com/apache/incubator-seatunnel/issues/1608#issuecomment-1123136195

   > > > We do translation work for different engines, so how do we handle inconsistent semantics or inconsistent functionality? For example, Spark and Flink both have checkpoints, but their mechanisms and functions are completely different. If we expose a checkpoint capability externally, how can we unify checkpointing across the two engines?
   > > 
   > > 
   > > @lonelyGhostisdog Because the checkpoint mechanisms are not fully consistent, we are still discussing how to adapt Spark's micro-batch and continuous readers.
   > 
   > Why don't we design a new engine that doesn't depend on Spark or Flink?
   
   Should we first complete the design and construction of Source, Sink and Transform, even if only in a stand-alone mode like DataX, and then refine fault tolerance, tables, and SQL through gradual iteration?
   




[GitHub] [incubator-seatunnel] ashulin commented on issue #1608: [Umbrella][Feature][Core] Decoupling connectors from compute engines

Posted by GitBox <gi...@apache.org>.
ashulin commented on issue #1608:
URL: https://github.com/apache/incubator-seatunnel/issues/1608#issuecomment-1212675295

   Enable the SPI factory classes to improve the entire process.




[GitHub] [incubator-seatunnel] dinggege1024 commented on issue #1608: [Umbrella][Feature][Core] Decoupling connectors from compute engines

Posted by GitBox <gi...@apache.org>.
dinggege1024 commented on issue #1608:
URL: https://github.com/apache/incubator-seatunnel/issues/1608#issuecomment-1220176556

   LGTM
   




[GitHub] [incubator-seatunnel] lonelyGhostisdog commented on issue #1608: [Umbrella][Feature][Core] Decoupling connectors from compute engines

Posted by GitBox <gi...@apache.org>.
lonelyGhostisdog commented on issue #1608:
URL: https://github.com/apache/incubator-seatunnel/issues/1608#issuecomment-1125042053

   Or are we going to do a datax-like model for now?




[GitHub] [incubator-seatunnel] ashulin commented on issue #1608: [Umbrella][Feature][Core] Decoupling connectors from compute engines

Posted by GitBox <gi...@apache.org>.
ashulin commented on issue #1608:
URL: https://github.com/apache/incubator-seatunnel/issues/1608#issuecomment-1197640592

   Completed!




[GitHub] [incubator-seatunnel] lonelyGhostisdog commented on issue #1608: [Umbrella][Feature][Core] Decoupling connectors from compute engines

Posted by GitBox <gi...@apache.org>.
lonelyGhostisdog commented on issue #1608:
URL: https://github.com/apache/incubator-seatunnel/issues/1608#issuecomment-1123147301

   > 
   
   I have another question. If a Connector is not implemented in Spark or Flink, how will we implement this new Connector if we are based on a new engine? Do we still need to implement the corresponding Connector in Spark or Flink first?

