Posted to dev@seatunnel.apache.org by Lidong Dai <li...@apache.org> on 2022/06/04 17:05:54 UTC

Re: [DISCUSS] Do we need to have our own engine

hi,

this engine is not designed to replace Flink or Spark; users can choose
which engine to run their jobs on



Best Regards



---------------
Apache DolphinScheduler PMC Chair & Apache SeaTunnel PPMC
Lidong Dai
lidongdai@apache.org
Linkedin: https://www.linkedin.com/in/dailidong
Twitter: @WorkflowEasy

---------------

On Sun, Jun 5, 2022 at 12:44 AM Lidong Dai <li...@apache.org> wrote:
>
> hi,
>
>
>
> 1. What is the relationship with Flink/Spark? Is our engine designed to
>    replace Flink/Spark?
>
>
>
> this engine is not designed to replace Flink or Spark; users can choose which engine to run their jobs on
>
> Best Regards
>
> ---------------
>
> Apache DolphinScheduler PMC Chair & Apache SeaTunnel PPMC
> Lidong Dai
> lidongdai@apache.org
> Linkedin: https://www.linkedin.com/in/dailidong
> Twitter: @WorkflowEasy
>
> ---------------
>
> On Mon, May 30, 2022 at 9:01 PM 陶克路 <ta...@gmail.com> wrote:
>
> Hi, gaojun, thanks for sharing.
>
> I have some questions about the engine:
>
>    1. What is the relationship with Flink/Spark? Is our engine designed to
>    replace Flink/Spark?
>    2. If it is designed to replace Flink/Spark, how can we build such a huge
>    thing from scratch?
>    3. If it is designed on top of Flink/Spark, how can we achieve our goals
>    without modifying Flink/Spark code?
>
>
> Thanks,
> Kelu
>
> On Fri, May 27, 2022 at 6:07 PM JUN GAO <ga...@apache.org> wrote:
>
> > Why do we need the SeaTunnel Engine, and what problems do we want to solve?
> >
> >
> >    - *Better resource utilization rate*
> >
> > Real-time data synchronization is an important user scenario. Sometimes we
> > need real-time synchronization of a full database. A common practice among
> > data synchronization engines today is one job per table. The advantage of
> > this practice is that the failure of one job does not influence the others,
> > but it wastes resources when most of the tables only hold a small amount of
> > data.
> >
> > We hope the SeaTunnel Engine can solve this problem. We plan to support a
> > more flexible resource sharing strategy: jobs submitted by the same user
> > will be able to share resources, and users will even be able to specify
> > exactly which jobs share resources with each other. If anyone has an idea,
> > you are welcome to discuss it on the mailing list or in a GitHub issue.
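> >
> > As a rough sketch, submitting two jobs into one shared resource group could
> > look like the following. This is illustrative Java only; the class and
> > method names are assumptions, not a final SeaTunnel Engine API:
> >
> >     // Hypothetical client API: jobs that name the same resource group
> >     // share task slots instead of each reserving its own workers.
> >     JobConfig tableA = JobConfig.builder()
> >             .jobName("sync_table_a")
> >             .resourceGroup("user-1-shared")
> >             .build();
> >     JobConfig tableB = JobConfig.builder()
> >             .jobName("sync_table_b")
> >             .resourceGroup("user-1-shared")   // same group => shared slots
> >             .build();
> >     engineClient.submit(tableA);
> >     engineClient.submit(tableB);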
> >
> >
> >    - *Fewer database connectors*
> >
> > Another common problem in full database synchronization with CDC is that
> > each table needs its own database connector. This puts a lot of pressure on
> > the database server when the database holds many tables.
> >
> > Can we design the database connectors as a resource shared between jobs?
> > Users could configure their own database connector pool. When a job uses
> > the connector pool, SeaTunnel Engine will initialize the pool on the node
> > where the source/sink connector runs, and then push the pool into the
> > source/sink connector. Combined with the Better resource utilization rate
> > feature
> > <
> > https://docs.google.com/document/d/e/2PACX-1vR5fJ-8sH03DpMHJd1oZ6CHwBtqfk9QESdQYoJyiF2QuGnuPM1a3lmu8m9NhGrUTvkYRSNcBWbSuX_G/pub#h.hlnmzqjxexv8
> > >,
> > we can reduce the number of database connections to an acceptable range.
> >
> > Another way to reduce the database connectors used by the CDC Source
> > Connector is to support reading multiple tables in one CDC Source Connector
> > and then split the stream by table name inside the SeaTunnel Engine.
> >
> > This reduces the database connectors used by the CDC Source Connector, but
> > it cannot reduce the connectors used by the sink if the synchronization
> > target is also a database. So a shared database connector pool would be a
> > good way to solve that as well.
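> >
> > As a minimal self-contained sketch of the pool idea (illustrative Java,
> > not SeaTunnel code):
> >
> >     import java.sql.Connection;
> >     import java.sql.DriverManager;
> >     import java.sql.SQLException;
> >     import java.util.concurrent.ArrayBlockingQueue;
> >     import java.util.concurrent.BlockingQueue;
> >
> >     // A bounded pool shared by the source/sink tasks on one node, so N
> >     // tables on one database share maxConnections links instead of
> >     // opening N separate connections.
> >     public final class SharedConnectorPool {
> >         private final BlockingQueue<Connection> idle;
> >
> >         public SharedConnectorPool(String jdbcUrl, String user, String pass,
> >                                    int maxConnections) throws SQLException {
> >             idle = new ArrayBlockingQueue<>(maxConnections);
> >             for (int i = 0; i < maxConnections; i++) {
> >                 idle.add(DriverManager.getConnection(jdbcUrl, user, pass));
> >             }
> >         }
> >
> >         // A task blocks here instead of opening a new connection, which
> >         // caps the total load on the database server.
> >         public Connection borrow() throws InterruptedException {
> >             return idle.take();
> >         }
> >
> >         public void giveBack(Connection connection) {
> >             idle.offer(connection);
> >         }
> >     }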
> >
> >
> >    - *Data Cache between Source and Sink*
> >
> >
> >
> > Flume is an excellent data synchronization project. A Flume Channel can
> > cache data when the sink fails and cannot write. This is useful in some
> > scenarios. For example, some users keep their database logs for only a
> > limited time; the CDC Source Connector must be able to keep reading those
> > logs even when the sink cannot write data.
> >
> > A feasible solution is to start two jobs. One job uses the CDC Source
> > Connector to read database logs and a Kafka Sink Connector to write the
> > data to Kafka; the other job uses a Kafka Source Connector to read the data
> > from Kafka and the target Sink Connector to write it to the target. This
> > solution requires the user to have a deep understanding of the low-level
> > technology, and two jobs increase the difficulty of operation and
> > maintenance. Because every job needs its own JobMaster, it also needs more
> > resources.
> >
> > Ideally, users should only need to know that data is read from the source
> > and written to the sink, while in between the data can be cached in case
> > the sink fails. The synchronization engine needs to automatically add the
> > cache operation to the execution plan and ensure the source keeps working
> > even if the sink fails. In this process, the engine also needs to make the
> > writes to and reads from the cache transactional, which ensures data
> > consistency.
> >
> > The execution plan would then look like this: source -> cache -> sink,
> > running as a single job.
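> >
> > A minimal sketch of the idea in Java (illustrative only; a real engine
> > would use durable, transactional storage where this uses an in-memory
> > queue):
> >
> >     import java.util.concurrent.BlockingQueue;
> >     import java.util.concurrent.LinkedBlockingQueue;
> >
> >     // Stand-in for the auto-inserted cache stage between source and sink.
> >     public final class CachedPipeline {
> >         private final BlockingQueue<String> cache =
> >                 new LinkedBlockingQueue<>(10_000);
> >
> >         // The source side keeps reading database logs and appending to the
> >         // cache, whether or not the sink can currently write.
> >         public void onSourceRecord(String record) throws InterruptedException {
> >             cache.put(record);
> >         }
> >
> >         // The sink side drains the cache at its own pace; a sink failure
> >         // pauses only this loop and never blocks the source.
> >         public void sinkLoop(SinkWriter writer) throws InterruptedException {
> >             while (true) {
> >                 writer.write(cache.take()); // a real engine would retry here
> >             }
> >         }
> >
> >         public interface SinkWriter {
> >             void write(String record);
> >         }
> >     }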
> >
> >
> >    - *Schema Evolution*
> >
> > Schema evolution is a feature that allows users to easily change a table’s
> > current schema to accommodate data that is changing over time. Most
> > commonly, it’s used when performing an append or overwrite operation, to
> > automatically adapt the schema to include one or more new columns.
> >
> > This feature is required in real-time data warehouse scenarios. Currently,
> > the Flink and Spark engines do not support it.
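> >
> > For example, a schema change could travel through the pipeline as an event
> > like the following (an illustrative Java sketch, not SeaTunnel code):
> >
> >     // Emitted by the source when it sees a new column; a sink that
> >     // supports schema evolution applies the DDL before writing rows
> >     // that carry the new column.
> >     public record AddColumnEvent(String table, String column, String sqlType) {
> >         public String toDdl() {
> >             return "ALTER TABLE " + table
> >                     + " ADD COLUMN " + column + " " + sqlType;
> >         }
> >     }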
> >
> >
> >    - *Finer fault tolerance*
> >
> > At present, most real-time processing engines fail the whole job when one
> > of its tasks fails, mainly because downstream operators depend on the
> > calculation results of upstream operators. In the data synchronization
> > scenario, however, the data is simply read from the source and written to
> > the sink; there is no intermediate result state to save. Therefore, the
> > failure of one task does not affect whether the results of the other tasks
> > are correct.
> >
> > The new engine should provide more fine-grained fault-tolerance
> > management. It should allow a single task to fail without affecting the
> > execution of other tasks, and it should provide an interface so that users
> > can manually retry failed tasks instead of retrying the entire job.
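> >
> > A sketch of what such an interface could look like (illustrative Java; the
> > names are assumptions, not a committed API):
> >
> >     // Task-level fault tolerance: one failed task is handled and retried
> >     // on its own while the rest of the job keeps running.
> >     public interface TaskFaultHandler {
> >         // Invoked when a single task fails; must not stop sibling tasks.
> >         void onTaskFailed(String jobId, int taskIndex, Throwable cause);
> >
> >         // Exposed to users so a failed task can be retried manually
> >         // instead of retrying the entire job.
> >         void retryTask(String jobId, int taskIndex);
> >     }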
> >
> >
> >    - *Speed Control*
> >
> > In batch jobs, we need to support speed control, letting users choose the
> > synchronization speed they want so that the job does not put too much
> > pressure on the source or target database.
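> >
> > A minimal sketch of such a limiter (illustrative Java, not SeaTunnel code):
> >
> >     // Paces a batch reader: acquire() is called once per row and sleeps
> >     // just long enough to hold the average rate at rowsPerSecond.
> >     public final class SpeedLimiter {
> >         private final double intervalNanos;
> >         private double nextFreeTime = System.nanoTime();
> >
> >         public SpeedLimiter(double rowsPerSecond) {
> >             this.intervalNanos = 1_000_000_000.0 / rowsPerSecond;
> >         }
> >
> >         public synchronized void acquire() throws InterruptedException {
> >             long now = System.nanoTime();
> >             if (nextFreeTime > now) {
> >                 Thread.sleep((long) ((nextFreeTime - now) / 1_000_000));
> >             }
> >             nextFreeTime = Math.max(nextFreeTime, (double) now) + intervalNanos;
> >         }
> >     }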
> >
> >
> >
> > *More Information*
> >
> >
> > I have made a simple design for the SeaTunnel Engine. You can find more
> > details in the following document.
> >
> >
> > https://docs.google.com/document/d/e/2PACX-1vR5fJ-8sH03DpMHJd1oZ6CHwBtqfk9QESdQYoJyiF2QuGnuPM1a3lmu8m9NhGrUTvkYRSNcBWbSuX_G/pub
> >
> >
> > --
> >
> > Best Regards
> >
> > ------------
> >
> > Apache DolphinScheduler PMC
> >
> > Jun Gao
> > gaojun2048@gmail.com
> >
>
>
> --
>
> Hello, find me here: www.legendtkl.com.

Re: [DISCUSS] Do we need to have our own engine

Posted by 李 明 <li...@hotmail.com>.
hi, gaojun

-1, this will make our project very complex and difficult for users to use.

Best,
Liming
________________________________
From: Lidong Dai <li...@apache.org>
Sent: June 5, 2022 1:05
To: dev@seatunnel.apache.org <de...@seatunnel.apache.org>
Subject: Re: [DISCUSS] Do we need to have our own engine
