You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@seatunnel.apache.org by GitBox <gi...@apache.org> on 2022/05/26 01:53:46 UTC

[GitHub] [incubator-seatunnel] gaojun2048 opened a new issue, #1954: [DISCUSS] Do we need to have our own engine

gaojun2048 opened a new issue, #1954:
URL: https://github.com/apache/incubator-seatunnel/issues/1954

   ### Search before asking
   
   - [X] I had searched in the [feature](https://github.com/apache/incubator-seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22) and found no similar feature requirement.
   
   
   ### Description
   
   # Why do we need the SeaTunnel Engine, And what problems do we want to solve?
   
   ### Better resource utilization rate
   
   Real time data synchronization is an important user scenario. Sometimes we need real time synchronization of a full database. Now, Some common data synchronization engine practices are one job per table. The advantage of this practice is that one job failure does not influence another one. But this practice will cause more waste of resources when most of the tables only have a small amount of data.
   
   We hope the SeaTunnel Engine can solve this problem. We plan to support a more flexible resource share strategy. It will allow some jobs to share the resources when they submit by the same user. Users can even specify which jobs share resources between them. If anyone has an idea, welcome to discuss in the mail list or github issue.
   
   ### Fewer database connectors
   
   Another common problem in full database synchronization use CDC is each table needs a database connector. This will put a lot of pressure on the db server when there are a lot of tables in the database.
   
   Can we design the database connectors as a shared resource between jobs? users can configure their database connectors pool. When a job uses the connector pool, SeaTunnel Engine will init the connector pool at the node which the source/sink connector at. And then push the connector pool in the source/sink connector. With the feature of  [Better resource utilization rate](https://docs.google.com/document/d/1W8YgzpPUFhiKMstYWRefMmvRy_LDad-DuQlW_ygaVG4/edit#heading=h.hlnmzqjxexv8), we can reduce the number of database connections to an acceptable range.
   
   Another way to reduce database connectors used by CDC Source Connector is to make multiple table read support in CDC Source Connector. And then the stream will be split by table name in the SeaTunnel Engine. 
   
   <img width="1238" alt="image" src="https://user-images.githubusercontent.com/32193458/170399080-05b2d481-8ee4-4f97-8c0f-2b15598f36c3.png">
   
   
   This way reduces database connectors used by CDC Source Connector but it can not reduce the database connectors used by sink if the synchronization target is database too. So a shared database connector pool will be a good way to solve it.
   
   ### Data Cache between Source and Sink
   	
   Flume is an excellent data synchronization project. Flume Channel can cache data 
   when the sink fails and can not write data. This is useful in some scenarios. For example, some users have limited time to save their database logs. CDC Source Connector must ensure it can read database logs even if sink can not write data.
   
   A feasible solution is to start two jobs.  One job uses CDC Source Connector to read database logs and then use Kafka Sink Connector to write data to kafka. And another job uses Kafka Source Connector to read data from kafka and then use the target Sink Connector to write data to the target. This solution needs the user to have a deep understanding of low-level technology, And two jobs will increase the difficulty of operation and maintenance. Because every job needs a JobMaster, So it will need more resources.
   
   Ideally, users only know they will read data from source and write data to the sink and at the same time, in this process, the data can be cached in case the sink fails.  The synchronization engine needs to auto add cache operation to the execution plan and ensure the source can work even if the sink fails. In this process, the engine needs to ensure the data written to the cache and read from the cache is transactional, this can ensure the consistency of data.
   
   
   
   The execution plan like this:
   ![Uploading image.png…]()
   
   
   ### Schema Evolution
   
   Schema evolution is a feature that allows users to easily change a table’s current schema to accommodate data that is changing over time. Most commonly, it’s used when performing an append or overwrite operation, to automatically adapt the schema to include one or more new columns.
   
   This feature is required in real-time data warehouse scenarios. Currently, flink and spark engines do not support this feature.
   
   ### Finer fault tolerance
   
   At present, most real-time processing engines will make the job fail when one of the tasks is failed. The main reason is that the downstream operator depends on the calculation results of the upstream operator. However, in the scenario of data synchronization, the data is simply read from the source and then written to sink. It does not need to save the intermediate result state. Therefore, the failure of one task will not affect whether the results of other tasks are correct. 
   
   The new engine should provide more sophisticated fault-tolerant management. It should support the failure of a single task without affecting the execution of other tasks. It should provide an interface so that users can manually retry failed tasks instead of retrying the entire job.
   
   ### Speed Control 
   
   In Batch jobs, we need support speed control. Let users choose the synchronization speed they want to prevent too much impact on the source or target database.
   
   
   I make a simple design about SeaTunnel Engine.  You can learn more details in the following documents.
   
   https://docs.google.com/document/d/e/2PACX-1vR5fJ-8sH03DpMHJd1oZ6CHwBtqfk9QESdQYoJyiF2QuGnuPM1a3lmu8m9NhGrUTvkYRSNcBWbSuX_G/pub
   
   ### Usage Scenario
   
   _No response_
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@seatunnel.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-seatunnel] EricJoy2048 commented on issue #1954: [DISCUSS] Do we need to have our own engine

Posted by GitBox <gi...@apache.org>.
EricJoy2048 commented on issue #1954:
URL: https://github.com/apache/incubator-seatunnel/issues/1954#issuecomment-1181377673

   > hi, @EricJoy2048 , what's the conclusion of this proposal?
   > 
   > Would you help add a comment to let githuber-not-in-mail developer have knowledge about this?
   > 
   > Thanks.
   
   I'm glad you're following this proposal. After email discussion the community accepted this proposal. Recently, I discussed with some community contributors and have some basic design about the `st-engine`. We will share the design and discuss it at the weekly meeting tonight. If you are interested, you are welcome to attend the meeting tonight.
   
   Thank you.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@seatunnel.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-seatunnel] ashulin commented on issue #1954: [DISCUSS] Do we need to have our own engine

Posted by GitBox <gi...@apache.org>.
ashulin commented on issue #1954:
URL: https://github.com/apache/incubator-seatunnel/issues/1954#issuecomment-1138642761

   I agree with you, because some of the data integration features are not supported in Spark and Flink engines.
   Of course, this is a challenging thing, and I look forward to more detailed design proposals.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@seatunnel.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-seatunnel] gaojun2048 commented on issue #1954: [DISCUSS] Do we need to have our own engine

Posted by GitBox <gi...@apache.org>.
gaojun2048 commented on issue #1954:
URL: https://github.com/apache/incubator-seatunnel/issues/1954#issuecomment-1139462687

   > It's a big feature, can you discuss it in email?
   
   Ok, I will discuss it in email.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@seatunnel.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-seatunnel] legendtkl commented on issue #1954: [DISCUSS] Do we need to have our own engine

Posted by GitBox <gi...@apache.org>.
legendtkl commented on issue #1954:
URL: https://github.com/apache/incubator-seatunnel/issues/1954#issuecomment-1180367251

   hi, @EricJoy2048 , what's the conclusion of this proposal? 
   
   Would you help add a comment to let githuber-not-in-mail developer have knowledge about this? 
   
   Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@seatunnel.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-seatunnel] Hisoka-X closed issue #1954: [DISCUSS] Do we need to have our own engine

Posted by GitBox <gi...@apache.org>.
Hisoka-X closed issue #1954: [DISCUSS] Do we need to have our own engine
URL: https://github.com/apache/incubator-seatunnel/issues/1954


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@seatunnel.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-seatunnel] lhyundeadsoul commented on issue #1954: [DISCUSS] Do we need to have our own engine

Posted by GitBox <gi...@apache.org>.
lhyundeadsoul commented on issue #1954:
URL: https://github.com/apache/incubator-seatunnel/issues/1954#issuecomment-1209365402

   Most people are curious about the unique features of the new engine:
   
   Diff with DataX:
   1. the ability of massive data exchange
   2. union processor of batch & streaming
   
   Diff with flink/spark:
   1. source can run independently using cache when sink fail
   2. combine the same data sources of different task
   3. sync thousands of tables by one connection
   4. pipeline-grade failover rather than whole task(flink)
   If there are some more unique features, we can add them continually. @EricJoy2048 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@seatunnel.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-seatunnel] CalvinKirs commented on issue #1954: [DISCUSS] Do we need to have our own engine

Posted by GitBox <gi...@apache.org>.
CalvinKirs commented on issue #1954:
URL: https://github.com/apache/incubator-seatunnel/issues/1954#issuecomment-1139247239

   It's a bit feature, can you discuss it in email?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@seatunnel.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org