Posted to dev@seatunnel.apache.org by 陶克路 <ta...@gmail.com> on 2022/04/29 03:06:42 UTC

[Proposal] Flink SQL Improvement

The background https://github.com/apache/incubator-seatunnel/issues/1753

Let me give a brief introduction to the background. I found that the Flink
SQL support in SeaTunnel is quite limited, so I want to make some
improvements in this area.

SeaTunnel also currently uses many deprecated APIs that the Flink community
encourages replacing with SQL, such as `StreamTableEnvironment.connect`.
SQL could be a good alternative.
[image: image.png]
[image: image.png]


Here are the improvement details:
1. Refactor start-seatunnel-sql.sh. start-seatunnel-flink.sh and
start-seatunnel-spark.sh have already been refactored, and their main logic
has been rewritten in Java. I think we should first make the three scripts
consistent.
2. Enrich the SQL config file. The Flink SQL job config is currently very
simple: it is essentially just the SQL script. I think we can add more
sub-configuration to it (see the sketch after this list).
3. SQL connector management. The Flink community provides a rich set of SQL
connectors, and only with these connectors can we run a job successfully
end to end.
4. SQL-related logic, such as validating statements before the job runs so
that errors are thrown as early as possible (see the sketch after this list).
5. Catalog support. With a catalog, we can reuse the tables/UDFs defined in
it.
6. Kubernetes native mode support. This is actually a universal feature, not
just about SQL. To run a job in Flink's Kubernetes native mode, we must
bundle the main jar and its dependencies into the Flink image, which is not
user-friendly. The community supports a workaround for this, namely pod
templates (see the sketch after this list).
7. ...
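
To make point 2 more concrete, here is a purely illustrative config sketch
(all keys are hypothetical and open to discussion; only the idea of adding
sub-config around the SQL script matters):

    env {
      execution.parallelism = 2
      execution.checkpoint.interval = 10000
    }

    sql {
      # either an external file or an inline script
      file = "job.sql"
      # statements to run before the main script, e.g. session options
      init = ["SET 'table.exec.state.ttl' = '1 h'"]
    }

For point 4, a minimal Java sketch of pre-run validation, assuming we rely on
`TableEnvironment#explainSql` to force parsing and planning of a DML
statement whose tables are already registered (just one possible approach):

    import org.apache.flink.table.api.TableEnvironment;

    public class SqlPreCheck {
        // Fails fast if the statement cannot be parsed/planned against the
        // tables already registered in the given environment.
        public static void validate(TableEnvironment tEnv, String statement) {
            try {
                tEnv.explainSql(statement); // parses and plans, does not execute
            } catch (RuntimeException e) {
                throw new IllegalArgumentException("Invalid SQL: " + statement, e);
            }
        }
    }

For point 6, the workaround I mean is Flink's pod template support. A hedged
sketch (the cluster id, jar name and URL are made up; fetching user jars with
an init container is only one way to avoid baking them into the image):

    ./bin/flink run-application \
        --target kubernetes-application \
        -Dkubernetes.cluster-id=seatunnel-sql-demo \
        -Dkubernetes.pod-template-file=./pod-template.yaml \
        local:///opt/flink/usrlib/seatunnel-flink-sql-job.jar

    # pod-template.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-template
    spec:
      initContainers:
        - name: fetch-user-jar
          image: busybox
          command: ["wget", "-O", "/usrlib/seatunnel-flink-sql-job.jar",
                    "https://example.com/seatunnel-flink-sql-job.jar"]
          volumeMounts:
            - name: usrlib
              mountPath: /usrlib
      containers:
        # Flink merges this entry into the main container definition.
        - name: flink-main-container
          volumeMounts:
            - name: usrlib
              mountPath: /opt/flink/usrlib
      volumes:
        - name: usrlib
          emptyDir: {}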

This is a long-term plan. We can implement it step by step.

What do you think about this proposal? Feel free to share any comments or
suggestions.

Thanks.
Kelu.
-- 

Hello, Find me here: www.legendtkl.com.

Re: [Proposal] Flink SQL Improvement

Posted by Zongwen Li <zo...@gmail.com>.
2. I think this needs a detailed discussion.
3. Flink's sql-connector is an uber jar, packaged in a similar way to
SeaTunnel's jars.
5. At present there is no unified SQL engine, and I don't think it is
possible to manage UDFs in a unified way.


陶克路 <ta...@gmail.com> wrote on Fri, Apr 29, 2022 at 17:46:

> Hi, @leo65535, thanks for the reply.
>
> Very glad to hear the topic about decoupling connectors from The Flink
> version. That's a good direction.
>
> About sql connectors management, actually I do not have a complete design
> in my head.
>
> But some points would be discussed firstly here:
>
> 1. Plugin style.
> For SQL, we don't need to write the glue code to "register" connectors.
> The only thing we need to do is to put the connector jars into ClassPath,
> then the process (jm/tm) could find the connector class, which implements
> specified interfaces, based on a mechanism like SPI. So maybe we can't
> manage the connectors as plugins. When we submit the job, we add the
> corresponding connector jars as resources to job.
>
> 2. Decouple connectors with Flink version.
> Decoupling is necessary. To my knowledge, Flink SQL uses a factory
> discovery mechanism to manage connectors. So when this interface is not
> refactored in the Flink core, we can expect it as backward compatible.
> Moreover, if it is version related, we can manage them in the arch like
> `/plugins/<flink-version>/connectors/kafka`
>
> 3. Connectors dynamic loading.
> To reduce the class conflict between connectors, I propose to do connectors
> loading when necessary, namely dynamic loading. About how to find the
> connectors used in SQL, I thinks we specified the connectors in config
> file, or parse SQL to get the connectors info.
>
> Maybe we can discuss this topic lately in a more detailed tech design doc.
>
> Thanks,
> Kelu.
>
> On Fri, Apr 29, 2022 at 4:10 PM leo65535 <le...@163.com> wrote:
>
> >
> >
> > Hi @taokelu,
> >
> >
> > This proposal is nice!!
> > We also disscuss another topic "Decoupling connectors from compute
> > engines"[1], I have a question how to manager flink sql connector?
> >
> >
> > [1] https://lists.apache.org/thread/j99crn7nkfpwovng6ycbxhw65sxg9xn2
> >
> >
> > Thanks, leo65535
> >
> >
> >
> > At 2022-04-29 11:06:42, "陶克路" <ta...@gmail.com> wrote:
> >
> > The background https://github.com/apache/incubator-seatunnel/issues/1753
> >
> >
> > Let me have a brief introduction about the background. I found the Flink
> > SQL support in Seatunnel is very simple, so I want to do some
> improvements
> > on this story.
> >
> >
> > And now seatunnel uses many deprecated datastream apis, which are
> > encouraged to be replaced with SQL, such as
> > `StreamTableEnvironment.connect`. Maybe SQL would be an alternative.
> >
> >
> >
> >
> >
> >
> >
> >
> > Here are the improvement details:
> > 1. refactor start-seatunnel-sql.sh. Now start-seatunnel-flink.sh and
> > start-seatunnel-spark.sh have been refactored, and the main logic has
> been
> > rewritten by java code. I think we can first keep them consistent.
> > 2. enrich sql config file. Now flink sql job config is very simple, and
> > it's all about the sql script. I think we can add more sub-config into
> it.
> > 3. sql connectors management. Flink community supports a rich set of SQL
> > connectors. Only with connectors, we can run our job successfully
> end-to-end
> > 4. sql related logic. Such as validation before job running, throwing the
> > error as soon as possible
> > 5. Catalog support. With catalog, we can reuse tables/udfs defined in
> > catalog.
> > 6. kubernetes native mode support. Actually, this is a universal feature,
> > not just about sql. In Flink, to run job in kubernetes native mode, we
> must
> > bundle the main jar and dependency files into the Flink image. This is
> not
> > user-friendly. Community support a workaround for this, namely
> podTemplate
> > 7. ...
> >
> >
> > This is a long-term plan. We can implement it step by step.
> >
> >
> > What do you think about this PROPOSAL? Feel free to give any comment or
> > suggestion.
> >
> >
> > Thanks.
> > Kelu.
> > --
> >
> >
> >
> > Hello, Find me here: www.legendtkl.com.
>
>
>
> --
>
> Hello, Find me here: www.legendtkl.com.
>

Re: [Proposal] Flink SQL Improvement

Posted by 陶克路 <ta...@gmail.com>.
Hi @leo65535, thanks for the reply.

Very glad to hear about the topic of decoupling connectors from the Flink
version. That's a good direction.

About SQL connector management, I actually do not have a complete design in
my head yet.

But some points can be discussed here first:

1. Plugin style.
For SQL, we don't need to write glue code to "register" connectors. The only
thing we need to do is put the connector jars on the classpath; the processes
(JM/TM) can then find the connector classes, which implement the specified
interfaces, through an SPI-like mechanism. So maybe we cannot manage the
connectors as plugins: when we submit the job, we add the corresponding
connector jars as resources of the job (a rough illustration follows below).
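
As a rough illustration of that discovery mechanism (not SeaTunnel code; it
assumes Flink's `org.apache.flink.table.factories.Factory` SPI, which the
planner resolves through `META-INF/services` entries shipped in each
connector jar):

    import java.util.ServiceLoader;
    import org.apache.flink.table.factories.Factory;

    public class ConnectorProbe {
        // Prints the factory identifiers (e.g. "kafka", "jdbc") visible on the
        // classpath. Flink's planner performs a similar ServiceLoader lookup
        // when it resolves the 'connector' option of a CREATE TABLE statement,
        // so a connector jar only has to be on the classpath of the client,
        // JM and TM processes to be usable.
        public static void main(String[] args) {
            for (Factory factory : ServiceLoader.load(Factory.class)) {
                System.out.println(factory.factoryIdentifier());
            }
        }
    }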

2. Decouple connectors from the Flink version.
Decoupling is necessary. To my knowledge, Flink SQL uses a factory discovery
mechanism to manage connectors, so as long as this interface is not
refactored in the Flink core, we can expect it to stay backward compatible.
Moreover, if a connector is version-related, we can manage it in a layout
like `/plugins/<flink-version>/connectors/kafka` (see the sketch below).
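
For example (a hypothetical layout; the versions and jar names are only
examples):

    plugins/
      flink-1.14/
        connectors/
          kafka/
            flink-sql-connector-kafka_2.11-1.14.4.jar
          jdbc/
            flink-connector-jdbc_2.11-1.14.4.jar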

3. Dynamic loading of connectors.
To reduce class conflicts between connectors, I propose loading connectors
only when necessary, i.e. dynamic loading. As for how to find the connectors
used in the SQL, I think we can either specify the connectors in the config
file or parse the SQL to get the connector info (a rough sketch follows
below).
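
A very rough sketch of the "parse the SQL" option (the helper is hypothetical,
and a real implementation would likely use a SQL parser rather than a regex):

    import java.util.HashSet;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ConnectorExtractor {
        // Matches 'connector' = 'kafka' inside the WITH (...) clause of a DDL.
        private static final Pattern CONNECTOR_OPTION =
                Pattern.compile("'connector'\\s*=\\s*'([^']+)'", Pattern.CASE_INSENSITIVE);

        // Returns the connector identifiers referenced by the SQL script, e.g.
        // {"kafka", "jdbc"}. The caller could then add the matching jars to the
        // job's classpath (e.g. via pipeline.jars) before submission.
        public static Set<String> extract(String sqlScript) {
            Set<String> connectors = new HashSet<>();
            Matcher m = CONNECTOR_OPTION.matcher(sqlScript);
            while (m.find()) {
                connectors.add(m.group(1));
            }
            return connectors;
        }
    }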

Maybe we can discuss this topic later in a more detailed tech design doc.

Thanks,
Kelu.

On Fri, Apr 29, 2022 at 4:10 PM leo65535 <le...@163.com> wrote:

>
>
> Hi @taokelu,
>
>
> This proposal is nice!!
> We also disscuss another topic "Decoupling connectors from compute
> engines"[1], I have a question how to manager flink sql connector?
>
>
> [1] https://lists.apache.org/thread/j99crn7nkfpwovng6ycbxhw65sxg9xn2
>
>
> Thanks, leo65535
>
>
>
> At 2022-04-29 11:06:42, "陶克路" <ta...@gmail.com> wrote:
>
> The background https://github.com/apache/incubator-seatunnel/issues/1753
>
>
> Let me have a brief introduction about the background. I found the Flink
> SQL support in Seatunnel is very simple, so I want to do some improvements
> on this story.
>
>
> And now seatunnel uses many deprecated datastream apis, which are
> encouraged to be replaced with SQL, such as
> `StreamTableEnvironment.connect`. Maybe SQL would be an alternative.
>
>
>
>
>
>
>
>
> Here are the improvement details:
> 1. refactor start-seatunnel-sql.sh. Now start-seatunnel-flink.sh and
> start-seatunnel-spark.sh have been refactored, and the main logic has been
> rewritten by java code. I think we can first keep them consistent.
> 2. enrich sql config file. Now flink sql job config is very simple, and
> it's all about the sql script. I think we can add more sub-config into it.
> 3. sql connectors management. Flink community supports a rich set of SQL
> connectors. Only with connectors, we can run our job successfully end-to-end
> 4. sql related logic. Such as validation before job running, throwing the
> error as soon as possible
> 5. Catalog support. With catalog, we can reuse tables/udfs defined in
> catalog.
> 6. kubernetes native mode support. Actually, this is a universal feature,
> not just about sql. In Flink, to run job in kubernetes native mode, we must
> bundle the main jar and dependency files into the Flink image. This is not
> user-friendly. Community support a workaround for this, namely podTemplate
> 7. ...
>
>
> This is a long-term plan. We can implement it step by step.
>
>
> What do you think about this PROPOSAL? Feel free to give any comment or
> suggestion.
>
>
> Thanks.
> Kelu.
> --
>
>
>
> Hello, Find me here: www.legendtkl.com.



-- 

Hello, Find me here: www.legendtkl.com.

Re: [Proposal] Flink SQL Improvement

Posted by leo65535 <le...@163.com>.

Hi @taokelu,


This proposal is nice!
We are also discussing another topic, "Decoupling connectors from compute engines"[1]. I have a question: how should we manage Flink SQL connectors?


[1] https://lists.apache.org/thread/j99crn7nkfpwovng6ycbxhw65sxg9xn2


Thanks, leo65535



At 2022-04-29 11:06:42, "陶克路" <ta...@gmail.com> wrote:

The background https://github.com/apache/incubator-seatunnel/issues/1753


Let me have a brief introduction about the background. I found the Flink SQL support in Seatunnel is very simple, so I want to do some improvements on this story.


And now seatunnel uses many deprecated datastream apis, which are encouraged to be replaced with SQL, such as `StreamTableEnvironment.connect`. Maybe SQL would be an alternative.








Here are the improvement details:
1. refactor start-seatunnel-sql.sh. Now start-seatunnel-flink.sh and start-seatunnel-spark.sh have been refactored, and the main logic has been rewritten by java code. I think we can first keep them consistent.
2. enrich sql config file. Now flink sql job config is very simple, and it's all about the sql script. I think we can add more sub-config into it.
3. sql connectors management. Flink community supports a rich set of SQL connectors. Only with connectors, we can run our job successfully end-to-end
4. sql related logic. Such as validation before job running, throwing the error as soon as possible
5. Catalog support. With catalog, we can reuse tables/udfs defined in catalog.
6. kubernetes native mode support. Actually, this is a universal feature, not just about sql. In Flink, to run job in kubernetes native mode, we must bundle the main jar and dependency files into the Flink image. This is not user-friendly. Community support a workaround for this, namely podTemplate 
7. ...


This is a long-term plan. We can implement it step by step.


What do you think about this PROPOSAL? Feel free to give any comment or suggestion.


Thanks.
Kelu.
--



Hello, Find me here: www.legendtkl.com.

Re: [Proposal] Flink SQL Improvement

Posted by 范佳 <fa...@qq.com.INVALID>.
This will be good progress for SeaTunnel. BTW, when you start, please provide a more detailed design. Thanks.

> On Apr 29, 2022, at 11:06, 陶克路 <ta...@gmail.com> wrote:
> 
> The background https://github.com/apache/incubator-seatunnel/issues/1753 <https://github.com/apache/incubator-seatunnel/issues/1753>
> 
> Let me have a brief introduction about the background. I found the Flink SQL support in Seatunnel is very simple, so I want to do some improvements on this story.
> 
> And now seatunnel uses many deprecated datastream apis, which are encouraged to be replaced with SQL, such as `StreamTableEnvironment.connect`. Maybe SQL would be an alternative.
> 
> 
> 
> 
> Here are the improvement details:
> 1. refactor start-seatunnel-sql.sh. Now start-seatunnel-flink.sh and start-seatunnel-spark.sh have been refactored, and the main logic has been rewritten by java code. I think we can first keep them consistent.
> 2. enrich sql config file. Now flink sql job config is very simple, and it's all about the sql script. I think we can add more sub-config into it.
> 3. sql connectors management. Flink community supports a rich set of SQL connectors. Only with connectors, we can run our job successfully end-to-end
> 4. sql related logic. Such as validation before job running, throwing the error as soon as possible
> 5. Catalog support. With catalog, we can reuse tables/udfs defined in catalog.
> 6. kubernetes native mode support. Actually, this is a universal feature, not just about sql. In Flink, to run job in kubernetes native mode, we must bundle the main jar and dependency files into the Flink image. This is not user-friendly. Community support a workaround for this, namely podTemplate 
> 7. ...
> 
> This is a long-term plan. We can implement it step by step.
> 
> What do you think about this PROPOSAL? Feel free to give any comment or suggestion.
> 
> Thanks.
> Kelu.
> -- 
> 
> Hello, Find me here: www.legendtkl.com <http://www.legendtkl.com/>.


Re: [Proposal] Flink SQL Improvement

Posted by wenjun <we...@apache.org>.
+1.
1, 2, 3, and 4 look good to me; 5 and 6 may need further discussion.

Re: [Proposal] Flink SQL Improvement

Posted by 陶克路 <ta...@gmail.com>.
Sorry for my rookie work :). I have updated the images as follows:

https://user-images.githubusercontent.com/2370761/165883497-f1480703-5251-46d7-a249-53439c077502.png

https://user-images.githubusercontent.com/2370761/165883516-07ca0ee2-08a8-4ed5-9859-690c7a73a198.png

On Fri, Apr 29, 2022 at 11:51 AM wenjun <we...@apache.org> wrote:

> Hi Kelu,
>
> Glad to hear this proposal, the photo cannot display successfully in the
> mail, you can upload the photo to issue, and add the photo link.
>
> Best,
> Wenjun Ruan
>
> On Fri, Apr 29, 2022 at 11:06 AM 陶克路 <ta...@gmail.com> wrote:
>
> > The background https://github.com/apache/incubator-seatunnel/issues/1753
> >
> > Let me have a brief introduction about the background. I found the Flink
> > SQL support in Seatunnel is very simple, so I want to do some
> improvements
> > on this story.
> >
> > And now seatunnel uses many deprecated datastream apis, which are
> > encouraged to be replaced with SQL, such as `
> > StreamTableEnvironment.connect`. Maybe SQL would be an alternative.
> > [image: image.png]
> > [image: image.png]
> >
> >
> > Here are the improvement details:
> > 1. refactor start-seatunnel-sql.sh. Now start-seatunnel-flink.sh
> > and start-seatunnel-spark.sh have been refactored, and the main logic has
> > been rewritten by java code. I think we can first keep them consistent.
> > 2. enrich sql config file. Now flink sql job config is very simple, and
> > it's all about the sql script. I think we can add more sub-config into
> it.
> > 3. sql connectors management. Flink community supports a rich set of SQL
> > connectors. Only with connectors, we can run our job successfully
> end-to-end
> > 4. sql related logic. Such as validation before job running, throwing the
> > error as soon as possible
> > 5. Catalog support. With catalog, we can reuse tables/udfs defined in
> > catalog.
> > 6. kubernetes native mode support. Actually, this is a universal feature,
> > not just about sql. In Flink, to run job in kubernetes native mode, we
> must
> > bundle the main jar and dependency files into the Flink image. This is
> not
> > user-friendly. Community support a workaround for this, namely
> podTemplate
> > 7. ...
> >
> > This is a long-term plan. We can implement it step by step.
> >
> > What do you think about this PROPOSAL? Feel free to give any comment or
> > suggestion.
> >
> > Thanks.
> > Kelu.
> > --
> >
> > Hello, Find me here: www.legendtkl.com.
> >
>


-- 

Hello, Find me here: www.legendtkl.com.

Re: [Proposal] Flink SQL Improvement

Posted by wenjun <we...@apache.org>.
Hi Kelu,

Glad to hear this proposal. The photos cannot be displayed in the mail; you
can upload them to the issue and add the links here.

Best,
Wenjun Ruan

On Fri, Apr 29, 2022 at 11:06 AM 陶克路 <ta...@gmail.com> wrote:

> The background https://github.com/apache/incubator-seatunnel/issues/1753
>
> Let me have a brief introduction about the background. I found the Flink
> SQL support in Seatunnel is very simple, so I want to do some improvements
> on this story.
>
> And now seatunnel uses many deprecated datastream apis, which are
> encouraged to be replaced with SQL, such as `
> StreamTableEnvironment.connect`. Maybe SQL would be an alternative.
> [image: image.png]
> [image: image.png]
>
>
> Here are the improvement details:
> 1. refactor start-seatunnel-sql.sh. Now start-seatunnel-flink.sh
> and start-seatunnel-spark.sh have been refactored, and the main logic has
> been rewritten by java code. I think we can first keep them consistent.
> 2. enrich sql config file. Now flink sql job config is very simple, and
> it's all about the sql script. I think we can add more sub-config into it.
> 3. sql connectors management. Flink community supports a rich set of SQL
> connectors. Only with connectors, we can run our job successfully end-to-end
> 4. sql related logic. Such as validation before job running, throwing the
> error as soon as possible
> 5. Catalog support. With catalog, we can reuse tables/udfs defined in
> catalog.
> 6. kubernetes native mode support. Actually, this is a universal feature,
> not just about sql. In Flink, to run job in kubernetes native mode, we must
> bundle the main jar and dependency files into the Flink image. This is not
> user-friendly. Community support a workaround for this, namely podTemplate
> 7. ...
>
> This is a long-term plan. We can implement it step by step.
>
> What do you think about this PROPOSAL? Feel free to give any comment or
> suggestion.
>
> Thanks.
> Kelu.
> --
>
> Hello, Find me here: www.legendtkl.com.
>