Posted to dev@seatunnel.apache.org by JUN GAO <ga...@apache.org> on 2022/05/27 10:06:29 UTC

[DISCUSS] Do we need to have our own engine

Why do we need the SeaTunnel Engine, and what problems do we want to solve?


   - *Better resource utilization rate*

Real-time data synchronization is an important user scenario. Sometimes we
need real-time synchronization of a full database. Today, a common practice
among data synchronization engines is one job per table. The advantage of
this practice is that the failure of one job does not affect the others,
but it wastes resources when most of the tables hold only a small amount
of data.

We hope the SeaTunnel Engine can solve this problem. We plan to support a
more flexible resource-sharing strategy that allows jobs to share resources
when they are submitted by the same user. Users can even specify which jobs
share resources with each other. If anyone has ideas, you are welcome to
discuss them on the mailing list or in a GitHub issue.
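The sharing idea above can be illustrated with a small sketch. This is a
hypothetical Python illustration, not SeaTunnel's actual scheduler: the
function name `pack_tables` and the notion of a "slot capacity" are
assumptions made only for this example. It shows how grouping many small
table-sync tasks into shared slots needs far fewer resources than one job
per table.

```python
# Hypothetical sketch (not SeaTunnel's actual API): pack many small
# table-sync tasks into a few shared worker slots instead of one job
# per table, using a greedy first-fit bin-packing pass.

def pack_tables(table_sizes, slot_capacity):
    """Greedy first-fit: group tables whose combined load fits one slot."""
    slots = []  # each slot is a list of (table, size) pairs
    for table, size in sorted(table_sizes.items(), key=lambda kv: -kv[1]):
        for slot in slots:
            if sum(s for _, s in slot) + size <= slot_capacity:
                slot.append((table, size))
                break
        else:
            slots.append([(table, size)])
    return slots

# 10 small tables plus 1 big one: one job per table would need 11 jobs,
# while resource sharing packs them into just 2 shared slots here.
sizes = {f"t{i}": 10 for i in range(10)}
sizes["big"] = 90
slots = pack_tables(sizes, slot_capacity=100)
print(len(slots))  # prints 2
```

The same workload that would occupy eleven separate jobs shares two slots,
which is the kind of utilization gain a flexible sharing strategy targets.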


   - *Fewer database connectors*

Another common problem in full-database synchronization with CDC is that
each table needs its own database connection. This puts a lot of pressure
on the database server when the database contains many tables.

Can we design database connections as a shared resource between jobs?
Users could configure a database connection pool. When a job uses the
pool, the SeaTunnel Engine would initialize it on the node where the
source/sink connector runs and then pass the pool into the source/sink
connector. Combined with the feature of Better resource utilization rate
<https://docs.google.com/document/d/e/2PACX-1vR5fJ-8sH03DpMHJd1oZ6CHwBtqfk9QESdQYoJyiF2QuGnuPM1a3lmu8m9NhGrUTvkYRSNcBWbSuX_G/pub#h.hlnmzqjxexv8>,
we can reduce the number of database connections to an acceptable range.
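A minimal sketch of such a shared pool, assuming a reference-counted,
capped set of connections per database endpoint; the class name
`SharedConnectionPool` and its methods are hypothetical, and the connect
factory stands in for a real database driver:

```python
# Hypothetical sketch of a connection pool shared across jobs:
# connections are capped per database endpoint, so N tables on the same
# server share a bounded set of connections instead of opening N.

import threading

class SharedConnectionPool:
    def __init__(self, connect, max_size):
        self._connect = connect   # factory, e.g. a real DB driver's connect()
        self._max = max_size
        self._conns = []
        self._lock = threading.Lock()
        self._next = 0

    def acquire(self, endpoint):
        with self._lock:
            if len(self._conns) < self._max:
                self._conns.append(self._connect(endpoint))
                return self._conns[-1]
            # round-robin over existing connections once the cap is reached
            conn = self._conns[self._next % self._max]
            self._next += 1
            return conn

pool = SharedConnectionPool(connect=lambda ep: object(), max_size=3)
conns = {id(pool.acquire("db:3306")) for _ in range(10)}
print(len(conns))  # prints 3: ten acquisitions share three connections
```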

Another way to reduce the database connections used by the CDC Source
Connector is to support reading multiple tables in a single CDC Source
Connector; the stream is then split by table name inside the SeaTunnel
Engine.
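The split step can be sketched as a simple router. This is an illustration
only, not SeaTunnel's actual operator; the record shape (a dict with a
`table` field) is an assumption made for the example:

```python
# Hypothetical sketch: one multi-table CDC source emits change records
# for many tables over a single connection; the engine then routes each
# record to the per-table downstream pipeline by its table name.

from collections import defaultdict

def split_by_table(cdc_stream):
    """Group change records from one shared CDC stream by table name."""
    routed = defaultdict(list)
    for record in cdc_stream:
        routed[record["table"]].append(record)
    return routed

stream = [
    {"table": "orders", "op": "INSERT", "id": 1},
    {"table": "users", "op": "UPDATE", "id": 7},
    {"table": "orders", "op": "DELETE", "id": 1},
]
routed = split_by_table(stream)
print(len(routed["orders"]))  # prints 2
```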

This reduces the database connections used by the CDC Source Connector,
but it cannot reduce the connections used by the sink when the
synchronization target is also a database. A shared database connection
pool is therefore a good way to solve that side as well.


   - *Data Cache between Source and Sink*



Flume is an excellent data synchronization project. A Flume Channel can
cache data when the sink fails and cannot write. This is useful in some
scenarios: for example, some users keep their database logs for only a
limited time, so the CDC Source Connector must be able to keep reading
those logs even when the sink cannot write data.

A feasible solution is to start two jobs. One job uses the CDC Source
Connector to read database logs and a Kafka Sink Connector to write the
data to Kafka; another job uses a Kafka Source Connector to read the data
from Kafka and the target Sink Connector to write it to the target. This
solution requires the user to understand the underlying technology, and
two jobs increase the difficulty of operation and maintenance. Because
every job needs its own JobMaster, it also consumes more resources.

Ideally, users only need to know that data is read from the source and
written to the sink, and that along the way the data can be cached in
case the sink fails. The synchronization engine should automatically add
the cache operation to the execution plan and ensure the source keeps
working even if the sink fails. In this process, the engine must make the
writes to and reads from the cache transactional, which guarantees data
consistency.
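The transactional contract can be sketched as a log with a commit marker;
this is a toy illustration of the semantics, not a proposed implementation,
and the class name `TransactionalCache` is hypothetical:

```python
# Hypothetical sketch of the cache stage: the source appends records and
# then commits; the sink-side reader only consumes records up to the last
# commit, so a crash between write and commit never exposes partial data.

class TransactionalCache:
    def __init__(self):
        self._log = []
        self._committed = 0  # index just past the last durable record

    def write(self, records):
        self._log.extend(records)   # staged, not yet visible to readers

    def commit(self):
        self._committed = len(self._log)

    def read_committed(self):
        return self._log[:self._committed]

cache = TransactionalCache()
cache.write(["row1", "row2"])
cache.commit()
cache.write(["row3"])            # written but not committed (e.g. crash here)
print(cache.read_committed())    # prints ['row1', 'row2']
```

A reader that only ever sees committed records is what keeps the data
consistent even when the sink side fails mid-batch.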

The execution plan would look like this (figure omitted from the
plain-text archive):


   - *Schema Evolution*

Schema evolution is a feature that allows users to easily change a
table's current schema to accommodate data that changes over time. Most
commonly, it is used when performing an append or overwrite operation to
automatically adapt the schema to include one or more new columns.

This feature is required in real-time data warehouse scenarios.
Currently, the Flink and Spark engines do not support it.
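The append case can be sketched in a few lines. This is a simplified
illustration of the idea, not a real connector API; `evolve_schema` is a
hypothetical helper, and the schemas are plain column-name lists:

```python
# Hypothetical sketch of schema evolution on append: compare the incoming
# batch's columns with the sink's current schema and add the missing ones
# before writing, instead of failing the job.

def evolve_schema(sink_schema, batch_columns):
    """Return the sink schema extended with columns new to this batch."""
    added = [c for c in batch_columns if c not in sink_schema]
    return sink_schema + added, added

schema = ["id", "name"]
schema, added = evolve_schema(schema, ["id", "name", "email"])
print(added)   # prints ['email']
print(schema)  # prints ['id', 'name', 'email']
```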


   - *Finer fault tolerance*

At present, most real-time processing engines fail the whole job when one
of its tasks fails, mainly because downstream operators depend on the
results of upstream operators. In data synchronization, however, the data
is simply read from the source and written to the sink; there is no
intermediate result state to save. The failure of one task therefore does
not affect the correctness of other tasks' results.

The new engine should provide more fine-grained fault tolerance: it
should allow a single task to fail without affecting the execution of
other tasks, and it should provide an interface so that users can
manually retry failed tasks instead of retrying the entire job.
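A minimal sketch of that behavior, under the assumption that tasks are
independent callables (the names `run_job` and `retry` are hypothetical,
not a proposed API):

```python
# Hypothetical sketch of per-task fault tolerance: each table's task runs
# independently; a failing task is recorded for manual retry instead of
# failing the whole job, since sync tasks share no intermediate state.

def run_job(tasks):
    """Run independent sync tasks; collect failures instead of aborting."""
    failed = {}
    for name, task in tasks.items():
        try:
            task()
        except Exception as e:
            failed[name] = e       # other tasks keep running
    return failed

def retry(tasks, failed):
    """User-triggered retry of only the previously failed tasks."""
    return run_job({name: tasks[name] for name in failed})

def failing_task():
    raise RuntimeError("sink down")

tasks = {"orders": lambda: None, "users": failing_task}
failed = run_job(tasks)
print(sorted(failed))  # prints ['users']: the 'orders' task still succeeded
```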


   - *Speed Control*

For batch jobs, we need to support speed control, letting users choose
the synchronization speed they want in order to limit the impact on the
source or target database.
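One common way to implement such a limit is a token bucket; the sketch
below is an assumption about how it could work, not SeaTunnel's design,
and the rate/capacity parameters stand in for user configuration:

```python
# Hypothetical sketch of speed control for batch jobs: a token bucket
# caps rows per second so a full-table sync does not overwhelm the source
# or target database. Rate and capacity would be user-configurable.

import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate            # tokens (rows) replenished per second
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def throttle(self, n):
        """Block until n rows may pass."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= n:
                self.tokens -= n
                return
            time.sleep((n - self.tokens) / self.rate)

bucket = TokenBucket(rate=10000, capacity=1000)
start = time.monotonic()
for _ in range(5):
    bucket.throttle(500)           # 2500 rows at 10000 rows/s
elapsed = time.monotonic() - start
print(elapsed < 2.0)  # prints True: throttled, but fast at this rate
```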



*More Information*


I have made a simple design for the SeaTunnel Engine. You can find more
details in the following document:

https://docs.google.com/document/d/e/2PACX-1vR5fJ-8sH03DpMHJd1oZ6CHwBtqfk9QESdQYoJyiF2QuGnuPM1a3lmu8m9NhGrUTvkYRSNcBWbSuX_G/pub


-- 

Best Regards

------------

Apache DolphinScheduler PMC

Jun Gao
gaojun2048@gmail.com

Re: [DISCUSS] Do we need to have our own engine

Posted by JUN GAO <ga...@apache.org>.
Hi, @zhangtaoliang

Developers and users do not need to face three engines. As I understand
it, the community is promoting the api-draft branch: connector developers
only need to develop connectors against the API draft branch, and the
same connector can then run on Spark, Flink, and the SeaTunnel engine at
the same time.

张 涛 <zh...@outlook.com> wrote on Monday, June 6, 2022 at 10:48:

>
> Developers and users will face three types of engine connectors (Spark,
> Flink, SeaTunnel); this will be very painful for users.
>
> Best,
> Liming

-- 

Best Regards

------------

Apache DolphinScheduler PMC

Jun Gao
gaojun2048@gmail.com

Re: [DISCUSS] Do we need to have our own engine

Posted by 张 涛 <zh...@outlook.com>.
Developers and users will face three types of engine connectors (Spark,
Flink, SeaTunnel); this will be very painful for users.

Best,
Liming

Re: [DISCUSS] Do we need to have our own engine

Posted by 张 涛 <zh...@outlook.com>.
-1,

> At the same time reduce the user's component deployment and maintenance costs.

More engines will be unfriendly to users.

> I think both our own engine and Flink/Spark will exist in the short term.

Why do you draw this conclusion?

> I think it is possible to achieve it step by step.

OK, but this is not a good message to send to users.

BTW, Apache Flink/Spark can also run in standalone mode, not only
cluster mode. This is a bad decision, and I hope the SeaTunnel community
will listen to other people's opinions.

Best,
Liming

Re: [DISCUSS] Do we need to have our own engine

Posted by 范佳 <fa...@qq.com.INVALID>.
+1.
If we can implement the following features, they will help SeaTunnel
provide better usability and performance, and at the same time reduce
users' component deployment and maintenance costs.
I think both our own engine and Flink/Spark will exist in the short
term. For example, our engine can provide a simpler operation mode in a
single-machine environment, while Flink/Spark provide a clustered
operation mode. In the end, full replacement would be the best result.
To build such a large engine, I think it can be achieved step by step.

> 2022年5月27日 18:06,JUN GAO <ga...@apache.org> 写道:
> 
> Why do we need the SeaTunnel Engine, And what problems do we want to solve?
> 
> 
>   - *Better resource utilization rate*
> 
> Real time data synchronization is an important user scenario. Sometimes we
> need real time synchronization of a full database. Now, Some common data
> synchronization engine practices are one job per table. The advantage of
> this practice is that one job failure does not influence another one. But
> this practice will cause more waste of resources when most of the tables
> only have a small amount of data.
> 
> We hope the SeaTunnel Engine can solve this problem. We plan to support a
> more flexible resource share strategy. It will allow some jobs to share the
> resources when they submit by the same user. Users can even specify which
> jobs share resources between them. If anyone has an idea, welcome to
> discuss in the mail list or github issue.
> 
> 
>   - *Fewer database connectors*
> 
> Another common problem in full database synchronization use CDC is each
> table needs a database connector. This will put a lot of pressure on the db
> server when there are a lot of tables in the database.
> 
> Can we design the database connectors as a shared resource between jobs?
> users can configure their database connectors pool. When a job uses the
> connector pool, SeaTunnel Engine will init the connector pool at the node
> which the source/sink connector at. And then push the connector pool in the
> source/sink connector. With the feature of  Better resource utilization rate
> <https://docs.google.com/document/d/e/2PACX-1vR5fJ-8sH03DpMHJd1oZ6CHwBtqfk9QESdQYoJyiF2QuGnuPM1a3lmu8m9NhGrUTvkYRSNcBWbSuX_G/pub#h.hlnmzqjxexv8>,
> we can reduce the number of database connections to an acceptable range.
> 
> Another way to reduce database connectors used by CDC Source Connector is
> to make multiple table read support in CDC Source Connector. And then the
> stream will be split by table name in the SeaTunnel Engine.
> 
> This way reduces database connectors used by CDC Source Connector but it
> can not reduce the database connectors used by sink if the synchronization
> target is database too. So a shared database connector pool will be a good
> way to solve it.
> 
> 
>   - *Data Cache between Source and Sink*
> 
> 
> 
> Flume is an excellent data synchronization project. Flume Channel can cache
> data
> 
> when the sink fails and can not write data. This is useful in some scenarios.
> For example, some users have limited time to save their database logs. CDC
> Source Connector must ensure it can read database logs even if sink can not
> write data.
> 
> A feasible solution is to start two jobs.  One job uses CDC Source
> Connector to read database logs and then use Kafka Sink Connector to write
> data to kafka. And another job uses Kafka Source Connector to read data
> from kafka and then use the target Sink Connector to write data to the
> target. This solution needs the user to have a deep understanding of
> low-level technology, And two jobs will increase the difficulty of
> operation and maintenance. Because every job needs a JobMaster, So it will
> need more resources.
> 
> Ideally, users should only need to know that data is read from the source
> and written to the sink, while the data is cached along the way in case
> the sink fails. The synchronization engine needs to add the cache
> operation to the execution plan automatically and ensure the source keeps
> working even if the sink fails. In this process, the engine must make the
> writes to and reads from the cache transactional to guarantee data
> consistency.
> 
> The execution plan looks like this:
> 
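The cache-in-the-middle behaviour described above can be sketched as a toy model (the class and method names are invented for illustration; a real engine would use a durable, transactional store rather than an in-memory queue): the source always appends, and reads are only committed once the sink confirms the write.

```python
from collections import deque

class TransactionalCache:
    """Toy cache between source and sink: a read is only committed after
    the sink confirms the write, so a sink failure loses no record."""

    def __init__(self):
        self._queue = deque()

    def put(self, record):               # source side: never blocks on the sink
        self._queue.append(record)

    def read_committed(self, write):
        """Pop a record, hand it to the sink, and re-queue it on failure."""
        while self._queue:
            record = self._queue.popleft()
            try:
                write(record)            # successful sink write = commit
            except IOError:
                self._queue.appendleft(record)  # rollback: keep the record
                break

# The source keeps reading logs even while the sink is down.
cache = TransactionalCache()
for i in range(5):
    cache.put(i)

def broken_sink(record):
    raise IOError("sink unavailable")

delivered = []
cache.read_committed(broken_sink)        # sink down: nothing is lost
cache.read_committed(delivered.append)   # sink recovers: backlog drains
print(delivered)  # → [0, 1, 2, 3, 4]
```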
> 
>   - *Schema Evolution*
> 
> Schema evolution is a feature that allows users to easily change a table’s
> current schema to accommodate data that is changing over time. Most
> commonly, it’s used when performing an append or overwrite operation, to
> automatically adapt the schema to include one or more new columns.
> 
> This feature is required in real-time data warehouse scenarios. Currently,
> the Flink and Spark engines do not support this feature.
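A hedged sketch of the simplest evolution case, adding new columns, assuming a schema is just a column-to-type map (illustrative only; a real engine must also propagate the corresponding DDL change to the sink table):

```python
def evolve_schema(schema, record):
    """Return a schema extended with any columns the record introduces.
    Existing columns are left untouched; new columns take the type of
    the first value seen (a deliberately naive policy for illustration)."""
    evolved = dict(schema)
    for column, value in record.items():
        if column not in evolved:
            evolved[column] = type(value).__name__
    return evolved

schema = {"id": "int", "name": "str"}
# A record arrives with a column the table did not have yesterday.
record = {"id": 1, "name": "a", "signup_ts": 1653640000}
schema = evolve_schema(schema, record)
print(schema)  # → {'id': 'int', 'name': 'str', 'signup_ts': 'int'}
```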
> 
> 
>   - *Finer fault tolerance*
> 
> At present, most real-time processing engines fail the whole job when one
> of its tasks fails, mainly because a downstream operator depends on the
> calculation results of the upstream operator. In the data synchronization
> scenario, however, the data is simply read from the source and written to
> the sink, with no intermediate result state to save. The failure of one
> task therefore does not affect whether the results of other tasks are
> correct.
> 
> The new engine should provide more sophisticated fault-tolerant management.
> It should support the failure of a single task without affecting the
> execution of other tasks. It should provide an interface so that users can
> manually retry failed tasks instead of retrying the entire job.
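The per-task retry policy could look roughly like this (a sketch of the policy only, not the engine's actual scheduler; the task names and error type are made up): one task's failure neither stops the other tasks nor fails the whole job, and only the failed task is retried.

```python
def run_tasks(tasks, max_retries=1):
    """Run independent sync tasks; retry each failed task on its own
    instead of restarting the entire job."""
    results = {}
    for name, task in tasks.items():
        for attempt in range(max_retries + 1):
            try:
                results[name] = task()
                break
            except RuntimeError as err:
                results[name] = f"failed: {err}"   # recorded; job continues
    return results

flaky_calls = {"count": 0}

def flaky():
    """Fails once with a transient error, then succeeds."""
    flaky_calls["count"] += 1
    if flaky_calls["count"] == 1:
        raise RuntimeError("transient sink error")
    return "ok"

results = run_tasks({"table_a": lambda: "ok",
                     "table_b": flaky,
                     "table_c": lambda: "ok"})
print(results)  # → {'table_a': 'ok', 'table_b': 'ok', 'table_c': 'ok'}
```

Only table_b was retried; table_a and table_c ran exactly once and were never affected by the failure.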
> 
> 
>   - *Speed Control*
> 
> In batch jobs, we need to support speed control, letting users choose the
> synchronization speed they want so as to limit the impact on the source
> and target databases.
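Speed control is commonly implemented as a token bucket; here is a minimal sketch (the parameter names are not SeaTunnel configuration keys): the job calls acquire() per record, and once the bucket is drained it is throttled to the configured rate.

```python
import time

class RateLimiter:
    """Token-bucket limiter: callers never exceed rows_per_second on
    average, with at most one second's worth of burst."""

    def __init__(self, rows_per_second):
        self.rate = rows_per_second
        self.tokens = rows_per_second    # allow a full second's burst
        self.last = time.monotonic()

    def acquire(self, n=1):
        while True:
            now = time.monotonic()
            # Refill tokens for the elapsed time, capped at one second's worth.
            self.tokens = min(self.rate,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= n:
                self.tokens -= n
                return
            time.sleep((n - self.tokens) / self.rate)  # wait for a refill

limiter = RateLimiter(rows_per_second=1000)
start = time.monotonic()
for _ in range(2000):        # 2000 rows at 1000 rows/s: the first 1000
    limiter.acquire()        # burst through, the rest take about a second
elapsed = time.monotonic() - start
print(elapsed >= 0.9)  # → True
```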
> 
> 
> 
> *More Information*
> 
> 
> I have made a simple design for the SeaTunnel Engine. You can learn more
> details in the following document.
> 
> https://docs.google.com/document/d/e/2PACX-1vR5fJ-8sH03DpMHJd1oZ6CHwBtqfk9QESdQYoJyiF2QuGnuPM1a3lmu8m9NhGrUTvkYRSNcBWbSuX_G/pub
> 
> 
> -- 
> 
> Best Regards
> 
> ------------
> 
> Apache DolphinScheduler PMC
> 
> Jun Gao
> gaojun2048@gmail.com
> 


Re: [DISCUSS] Do we need to have our own engine

Posted by 李 明 <li...@hotmail.com>.
-1, I agree with Kelu.

I have the same questions about the engine:
  1. The relationship with Flink/Spark: is our engine designed to replace Flink/Spark?
  2. If it is designed on top of Flink/Spark, how do we achieve our goals without modifying Flink/Spark code?
  3. Why wouldn't users use Flink/Spark directly instead of the SeaTunnel Engine?
  4. How do users use the native Flink/Spark connectors?


________________________________
From: 陶克路 <ta...@gmail.com>
Sent: Monday, May 30, 2022 20:59
To: dev@seatunnel.apache.org <de...@seatunnel.apache.org>
Subject: Re: [DISCUSS] Do we need to have our own engine

Hi, gaojun, thanks for sharing.

I have some questions about the engine:

   1. The relationship with Flink/Spark: is our engine designed to replace
   Flink/Spark?
   2. If it is designed to replace Flink/Spark, how do we build such a huge
   thing from scratch?
   3. If it is designed on top of Flink/Spark, how do we achieve our goals
   without modifying Flink/Spark code?


Thanks,
Kelu



--

Hello, Find me here: www.legendtkl.com.

Re: [DISCUSS] Do we need to have our own engine

Posted by JUN GAO <ga...@gmail.com>.
Hi, Kelu

1. The SeaTunnel Engine is not designed to replace Flink/Spark; it will be the third engine supported by SeaTunnel, in addition to Flink/Spark. As I described in my email, it is designed to better solve the problems in the data synchronization process that Spark and Flink do not solve well.

2. I think that if we just develop an engine for data synchronization, it won't be too complicated. DataX is a good data synchronization engine, and in fact it doesn't have much code. Admittedly, the new engine will be more complex than DataX, because we need to support cluster mode. However, as a data synchronization engine, we do not need to consider shuffle, aggregation, join, and other operations, so it should be simpler than Spark or Flink. I have completed part of the validation code: https://github.com/apache/incubator-seatunnel/pull/1948 . I hope community members who are interested in the new engine will work with me to improve it.



Best Regards
---------------
Apache DolphinScheduler  PMC
Jun Gao 高俊
gaojun2048@gmail.com
---------------



Re: [DISCUSS] Do we need to have our own engine

Posted by 陶克路 <ta...@gmail.com>.
Hi, gaojun, thanks for sharing.

I have some questions about the engine:

   1. The relationship with Flink/Spark: is our engine designed to replace
   Flink/Spark?
   2. If it is designed to replace Flink/Spark, how do we build such a huge
   thing from scratch?
   3. If it is designed on top of Flink/Spark, how do we achieve our goals
   without modifying Flink/Spark code?


Thanks,
Kelu



-- 

Hello, Find me here: www.legendtkl.com.

Re: [DISCUSS] Do we need to have our own engine

Posted by JUN GAO <ga...@apache.org>.
Hi, Leo65535,

Connectors based on the api-draft branch will be able to run directly in
the new engine, just as they can run directly in Flink/Spark.

leo65535 <le...@163.com> wrote on Monday, May 30, 2022 at 18:06:

>
>
> Hi gaojun,
>
>
> I think this is a good idea. I have a question: what is the difference
> between the api-draft and engine branches?
>
>
> Best,
> Leo65535


-- 

Best Regards

------------

Apache DolphinScheduler PMC

Jun Gao
gaojun2048@gmail.com

Re:[DISCUSS] Do we need to have our own engine

Posted by leo65535 <le...@163.com>.

Hi gaojun,


I think this is a good idea. I have a question: what is the difference between the api-draft and engine branches?


Best,
Leo65535
At 2022-05-27 18:06:29, "JUN GAO" <ga...@apache.org> wrote:
>Why do we need the SeaTunnel Engine, And what problems do we want to solve?
>
>
>   - *Better resource utilization rate*
>
>Real time data synchronization is an important user scenario. Sometimes we
>need real time synchronization of a full database. Now, Some common data
>synchronization engine practices are one job per table. The advantage of
>this practice is that one job failure does not influence another one. But
>this practice will cause more waste of resources when most of the tables
>only have a small amount of data.
>
>We hope the SeaTunnel Engine can solve this problem. We plan to support a
>more flexible resource share strategy. It will allow some jobs to share the
>resources when they submit by the same user. Users can even specify which
>jobs share resources between them. If anyone has an idea, welcome to
>discuss in the mail list or github issue.
>
>
>   - *Fewer database connectors*
>
>Another common problem in full database synchronization use CDC is each
>table needs a database connector. This will put a lot of pressure on the db
>server when there are a lot of tables in the database.
>
>Can we design the database connectors as a shared resource between jobs?
>users can configure their database connectors pool. When a job uses the
>connector pool, SeaTunnel Engine will init the connector pool at the node
>which the source/sink connector at. And then push the connector pool in the
>source/sink connector. With the feature of  Better resource utilization rate
><https://docs.google.com/document/d/e/2PACX-1vR5fJ-8sH03DpMHJd1oZ6CHwBtqfk9QESdQYoJyiF2QuGnuPM1a3lmu8m9NhGrUTvkYRSNcBWbSuX_G/pub#h.hlnmzqjxexv8>,
>we can reduce the number of database connections to an acceptable range.
>
>Another way to reduce database connectors used by CDC Source Connector is
>to make multiple table read support in CDC Source Connector. And then the
>stream will be split by table name in the SeaTunnel Engine.
>
>This way reduces database connectors used by CDC Source Connector but it
>can not reduce the database connectors used by sink if the synchronization
>target is database too. So a shared database connector pool will be a good
>way to solve it.
>
>
>   - *Data Cache between Source and Sink*
>
>
>
>Flume is an excellent data synchronization project. Flume Channel can cache
>data
>
>when the sink fails and can not write data. This is useful in some scenarios.
>For example, some users have limited time to save their database logs. CDC
>Source Connector must ensure it can read database logs even if sink can not
>write data.
>
>A feasible solution is to start two jobs. One job uses the CDC Source
>Connector to read database logs and a Kafka Sink Connector to write the
>data to Kafka. The other job uses a Kafka Source Connector to read the
>data from Kafka and the target Sink Connector to write it to the
>target. This solution requires the user to have a deep understanding of
>low-level technology, and two jobs increase the difficulty of operation
>and maintenance. Because every job needs a JobMaster, it also needs
>more resources.
>
>Ideally, users should only need to know that data is read from the
>source and written to the sink, while the data is cached along the way
>in case the sink fails. The synchronization engine needs to add the
>cache operation to the execution plan automatically and ensure the
>source keeps working even if the sink fails. In this process, the
>engine must make writes to and reads from the cache transactional,
>which guarantees data consistency.
>
>The execution plan would look like this:
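The caching behavior around that plan can be sketched as follows. This is illustrative Python, not the engine's design: CacheQueue and run_once are hypothetical names, and a real cache stage would be durable and truly transactional rather than in-memory.

```python
# Hypothetical sketch of the rewritten plan: the engine inserts a cache
# stage so the source keeps running even while the sink is failing.
from collections import deque

class CacheQueue:
    """Stand-in for a durable, transactional cache between source and sink."""
    def __init__(self):
        self._q = deque()
    def write(self, batch):          # commit happens atomically per batch
        self._q.append(batch)
    def read(self):
        return self._q.popleft() if self._q else None

def run_once(source_batch, cache, sink_write):
    # Source side always succeeds: it only talks to the cache.
    cache.write(source_batch)
    # Sink side drains the cache; on failure the batch stays cached.
    batch = cache.read()
    try:
        sink_write(batch)
    except Exception:
        cache.write(batch)           # roll back: re-enqueue for a later retry
```

The point of the sketch is the failure path: when the sink raises, the batch is returned to the cache, so nothing read from the source is lost and the source never has to stop.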
>
>
>   - *Schema Evolution*
>
>Schema evolution is a feature that allows users to easily change a
>table’s current schema to accommodate data that changes over time. Most
>commonly, it is used when performing an append or overwrite operation,
>to automatically adapt the schema to include one or more new columns.
>
>This feature is required in real-time data warehouse scenarios.
>Currently, the Flink and Spark engines do not support it.
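As a rough illustration of the idea (not SeaTunnel, Flink, or Spark code; evolve_schema is a hypothetical name), a writer could extend the target schema whenever a row carries a column it has not seen before:

```python
# Illustrative sketch of automatic schema evolution on write: unknown
# columns in incoming rows are added to the target schema instead of
# failing the job.
def evolve_schema(target_schema, row):
    """Return the schema extended with any new columns seen in `row`."""
    evolved = dict(target_schema)
    for column, value in row.items():
        if column not in evolved:
            # A real engine would also emit an ALTER TABLE ... ADD COLUMN
            # against the target before writing the row.
            evolved[column] = type(value).__name__
    return evolved

schema = {"id": "int", "name": "str"}
row = {"id": 1, "name": "a", "email": "a@example.com"}  # new column arrives
schema = evolve_schema(schema, row)
```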
>
>
>   - *Finer fault tolerance*
>
>At present, most real-time processing engines fail the whole job when
>one of its tasks fails, mainly because downstream operators depend on
>the calculation results of upstream operators. In data synchronization,
>however, data is simply read from the source and written to the sink,
>with no intermediate result state to save. Therefore, the failure of
>one task does not affect whether the results of the other tasks are
>correct.
>
>The new engine should provide finer-grained fault-tolerance management.
>It should allow a single task to fail without affecting the execution
>of the other tasks, and it should provide an interface so that users
>can manually retry failed tasks instead of retrying the entire job.
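A toy sketch of per-task status tracking with manual retry (hypothetical names, not the proposed engine's API): each table-sync task fails independently, and a failed task can be retried by itself.

```python
# Illustrative sketch: tasks in one job fail independently, and a
# failed task can be retried alone instead of restarting the job.
class Job:
    def __init__(self, tasks):
        self.tasks = tasks                                # name -> callable
        self.status = {name: "PENDING" for name in tasks}

    def run(self):
        for name, task in self.tasks.items():
            try:
                task()
                self.status[name] = "FINISHED"
            except Exception:
                # Only this task is marked failed; the others keep going.
                self.status[name] = "FAILED"

    def retry(self, name):
        # User-triggered retry of a single failed task.
        if self.status[name] == "FAILED":
            self.tasks[name]()
            self.status[name] = "FINISHED"

broken = {"fail": True}
job = Job({
    "sync_orders": lambda: None,
    "sync_users":  lambda: (_ for _ in ()).throw(RuntimeError())
                   if broken["fail"] else None,
})
job.run()            # sync_users fails, sync_orders still finishes
broken["fail"] = False
job.retry("sync_users")
```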
>
>
>   - *Speed Control*
>
>In batch jobs, we need to support speed control, letting users choose
>the synchronization speed they want in order to limit the impact on the
>source or target database.
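One simple way to think about speed control is a rows-per-second throttle; the sketch below is illustrative Python (throttled_copy is a hypothetical name), not the engine's design.

```python
# Illustrative rate limiter for batch synchronization: caps rows/second
# so the copy does not overload the source or target database.
import time

def throttled_copy(rows, write, max_rows_per_sec=1000):
    start, written = time.monotonic(), 0
    for row in rows:
        write(row)
        written += 1
        # Sleep just enough to stay under the configured rate.
        expected = written / max_rows_per_sec
        elapsed = time.monotonic() - start
        if expected > elapsed:
            time.sleep(expected - elapsed)
    return written
```

A real engine would likely throttle per pipeline (and possibly by bytes rather than rows), but the core idea is the same: pace the writes against a user-configured budget.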
>
>
>
>*More Information*
>
>
>I have made a simple design for the SeaTunnel Engine. You can find more
>details in the following document.
>
>https://docs.google.com/document/d/e/2PACX-1vR5fJ-8sH03DpMHJd1oZ6CHwBtqfk9QESdQYoJyiF2QuGnuPM1a3lmu8m9NhGrUTvkYRSNcBWbSuX_G/pub
>
>
>-- 
>
>Best Regards
>
>------------
>
>Apache DolphinScheduler PMC
>
>Jun Gao
>gaojun2048@gmail.com

Re: [DISCUSS] Do we need to have our own engine

Posted by Zongwen Li <zo...@gmail.com>.
+1, Currently, Flink and Spark don't support some data integration
features, and both of them focus on data computing rather than data
integration.

Best,
Zongwen Li



Re: [DISCUSS] Do we need to have our own engine

Posted by Li Liu <m_...@163.com>.
+1

It is right to develop seatunnel-engine in the long run.
After all, Spark and Flink are general computing engines, not built for data synchronization scenarios. Although seatunnel-engine may not be able to surpass Spark and Flink in a short time, in the long run only seatunnel-engine can solve the problems specific to these synchronization scenarios. When seatunnel-engine is mature enough, we can also consider dropping the support for Spark and Flink.

Regards 

Ic4y
