Posted to dev@flink.apache.org by Martijn Visser <ma...@apache.org> on 2022/03/07 11:23:00 UTC

[DISCUSS] Flink's supported APIs and Hive query syntax

Hi everyone,

Flink currently has 4 APIs with multiple language support which can be used
to develop applications:

* DataStream API, both Java and Scala
* Table API, both Java and Scala
* Flink SQL, both in Flink query syntax and Hive query syntax (partially)
* Python API

Since FLIP-152 [1], Flink SQL has been extended to also support the
Hive query syntax. There is now a follow-up, FLINK-26360 [2], to address
more syntax compatibility issues.

I would like to open a discussion on Flink directly supporting the Hive
query syntax. I have some concerns about whether 100% Hive query syntax
support is indeed something we should aim for in Flink.

I can understand that having Hive query syntax support in Flink could help
users with interoperability and migration. However:

- Adding full Hive query syntax support will mean that we go from 6 fully
supported API/language combinations to 7. I think we are already
struggling with maintaining the existing combinations, let alone one
more.
- Apache Hive is, or appears to be, a project that's no longer actively
developed. The last release was made in January 2021. Its popularity is
rapidly declining in Europe and the United States, partly because Hadoop
is becoming less popular.
- Related to the previous topic, other software like Snowflake,
Trino/Presto, Databricks are becoming more and more popular. If we add full
support for the Hive query syntax, then why not add support for Snowflake
and the others?
- We are supporting Hive versions that are no longer supported by the Hive
community and that have known security vulnerabilities. This makes Flink
vulnerable to those types of vulnerabilities as well.
- The current Hive implementation relies on many Flink internals, which
makes Flink hard to maintain, adds a lot of tech debt, and makes things
overly complex.

From my perspective, I think it would be better not to have Hive query
syntax compatibility directly in Flink itself. Of course we should have a
proper Hive connector and a proper Hive catalog to make connectivity with
Hive (the versions that are still supported by the Hive community)
possible. Alternatively, if Hive query syntax is so important, it should
not rely on internals but be available as a dialect/pluggable option. That
could also open up the possibility to add more syntax support for others in
the future, but I really think we should just focus on Flink SQL itself.
That's already hard enough to maintain and improve.
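To make the "dialect/pluggable option" idea concrete, here is a minimal
illustrative sketch in Python. All names in it (register_dialect,
parse_sql, the toy parsers) are hypothetical and are not Flink APIs:

```python
# Illustrative sketch only: a minimal dialect registry, not Flink's actual API.
from typing import Callable, Dict

# Each dialect contributes a parse function: SQL text -> a (toy) plan string.
DialectParser = Callable[[str], str]

_DIALECTS: Dict[str, DialectParser] = {}

def register_dialect(name: str, parser: DialectParser) -> None:
    """Called by a plugin (e.g. a connector) to contribute a dialect."""
    _DIALECTS[name.lower()] = parser

def parse_sql(statement: str, dialect: str = "default") -> str:
    """Dispatch parsing to whichever dialect plugin is registered."""
    try:
        parser = _DIALECTS[dialect.lower()]
    except KeyError:
        raise ValueError(f"Unknown SQL dialect: {dialect!r}") from None
    return parser(statement)

# The core engine registers its own dialect...
register_dialect("default", lambda sql: f"default-plan({sql})")
# ...and an external plugin could register another one without
# touching the core engine at all.
register_dialect("hive", lambda sql: f"hive-plan({sql})")

print(parse_sql("SELECT 1", dialect="hive"))  # hive-plan(SELECT 1)
```

The point of the sketch is that the core only knows the registry; a
connector can contribute a dialect without the core depending on its
internals.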

I'm looking forward to the thoughts of both Developers and Users, so I'm
cross-posting to both mailing lists.

Best regards,

Martijn Visser
https://twitter.com/MartijnVisser82

[1]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=165227316
[2] https://issues.apache.org/jira/browse/FLINK-21529

Reply: Re: Fwd: [DISCUSS] Flink's supported APIs and Hive query syntax

Posted by "罗宇侠(莫辞)" <lu...@alibaba-inc.com.INVALID>.
Hi, thanks for the input.
I wrote a Google doc, “Plan for decoupling Hive connector with Flink planner” [1], which shows how to decouple the Hive connector from the planner.

[1] https://docs.google.com/document/d/1LMQ_mWfB_mkYkEBCUa2DgCO2YdtiZV7YRs2mpXyjdP4/edit?usp=sharing

Best, 
Yuxia.
------------------------------------------------------------------
From: Jark Wu <im...@gmail.com>
Date: 2022-03-10 19:59:30
To: Francesco Guardiani <fr...@ververica.com>; dev <de...@flink.apache.org>
Cc: Martijn Visser <ma...@apache.org>; User <us...@flink.apache.org>; 罗宇侠(莫辞) <lu...@alibaba-inc.com>
Subject: Re: Fwd: [DISCUSS] Flink's supported APIs and Hive query syntax

Hi Francesco,

Yes. The Hive syntax is a syntax plugin provided by Hive connector.

> But right now I don't think it's a good idea to add new features on top,
as it will only create more maintenance burden, both for Hive developers and for table developers.

We are not adding new Hive features, but fixing compatibility or behavior bugs, and almost all of them
are related only to the Hive connector code, with nothing to do with the table planner. 

I agree we should investigate how to decouple the Hive connector from the planner, and how much work that entails, ASAP. 
We will come up with a Google doc soon. But AFAIK, this should not be a huge amount of work and shouldn't conflict with the bugfix work. 

Best,
Jark
On Thu, 10 Mar 2022 at 17:03, Francesco Guardiani <fr...@ververica.com> wrote:

> We still need some work to make the Hive dialect purely rely on public APIs, and the Hive connector should be decoupled from the table planner. 

From the table perspective, I think this is the big pain point at the moment. First of all, when we talk about the Hive syntax, we're really talking about the Hive connector, since my understanding is that without the Hive connector on the classpath you can't use the Hive syntax [1].
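The classpath dependency described above can be pictured with a small Python sketch (hypothetical, not Flink code): a dialect is only usable when the module providing it is importable, which is roughly the Python analogue of having the connector jar on the classpath:

```python
# Sketch of classpath-dependent dialect discovery (hypothetical, not Flink code).
# The Java analogue would be a ServiceLoader lookup that only finds the Hive
# parser when the connector jar is on the classpath.
import importlib.util

def dialect_available(module_name: str) -> bool:
    """A dialect is usable only if the module providing it can be found."""
    return importlib.util.find_spec(module_name) is not None

# A stdlib module is always "on the classpath"...
print(dialect_available("json"))                      # True
# ...while a missing plugin makes the dialect silently unavailable.
print(dialect_available("hypothetical_hive_plugin"))  # False
```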

The Hive connector relies heavily on internals [2], and this is an important struggle for the table project, as it sometimes impedes and slows down development of new features and creates a huge maintenance burden for table developers [3]. The planner itself has some classes specific to Hive [4], making the codebase of the planner more complex than it already is. Some of these are just legacy; others exist because some abstractions are missing on the table planner side, but those just need some work.

So I agree with Jark: when the two Hive modules (connector-hive and sql-parser-hive) reach a point where they don't depend at all on flink-table-planner, like every other connector (except for testing, of course), we should be good to move them to a separate repo and continue committing to them. But right now I don't think it's a good idea to add new features on top, as it will only create more maintenance burden, both for Hive developers and for table developers.

My concern with this plan is: how realistic is it to fix all the planner internal leaks in the existing Hive connector/parser? To me this seems like a huge task, including a non-trivial amount of work to stabilize and design new entry points in the Table API.

[1] HiveParser
[2] HiveParserCalcitePlanner
[3] Just talking about code coupling, not even mentioning problems like dependencies and security updates
[4] HiveAggSqlFunction
On Thu, Mar 10, 2022 at 9:05 AM Martijn Visser <ma...@apache.org> wrote:

Thank you Yuxia for volunteering, that's much appreciated. It would be great if you could create an umbrella ticket for that. 

It would be great to get some insights from current Flink and Hive users on which versions are being used.
@Jark I would indeed deprecate the old Hive versions in Flink 1.15 and then drop them in Flink 1.16. That would also remove some tech debt and reduce the work involved in externalizing connectors.

Best regards,

Martijn
On Thu, 10 Mar 2022 at 07:39, Jark Wu <im...@gmail.com> wrote:

Thanks Martijn for the reply and summary. 

I totally agree with your plan and thank Yuxia for volunteering for the Hive tech debt issue. 
I think we can create an umbrella issue for this and target version 1.16. We can discuss
details and create subtasks there. 

Regarding dropping old Hive versions, I'm also fine with that. But I would like to investigate
some Hive users first to see whether it's acceptable at this point. My first thought was we
can deprecate the old Hive versions in 1.15, and we can discuss dropping it in 1.16 or 1.17. 

Best,
Jark


On Thu, 10 Mar 2022 at 14:19, 罗宇侠(莫辞) <lu...@alibaba-inc.com> wrote:

Thanks Martijn for your insights.

About the tech debt/maintenance with regards to Hive query syntax, I would like to chip in, and I expect it can be resolved for Flink 1.16.

Best regards,

Yuxia


 ------------------ Original Message ------------------
From: Martijn Visser <ma...@apache.org>
Sent: Thu Mar 10 04:03:34 2022
To: User <us...@flink.apache.org>
Subject: Fwd: [DISCUSS] Flink's supported APIs and Hive query syntax

(Forwarding this also to the User mailing list as I made a typo when replying to this email thread)

---------- Forwarded message ---------
From: Martijn Visser <ma...@apache.org>
Date: Wed, 9 Mar 2022 at 20:57
Subject: Re: [DISCUSS] Flink's supported APIs and Hive query syntax
To: dev <de...@flink.apache.org>, Francesco Guardiani <fr...@ververica.com>, Timo Walther <tw...@apache.org>, <us...@flink.apache.org>


Hi everyone,

Thank you all very much for your input. From my perspective, I consider batch a special case of streaming. So with Flink SQL we can support both batch and streaming use cases, and I think we should use Flink SQL as our target. 

To reply on some of the comments:

@Jing on your remark:
> Since Flink has a clear vision of unified batch and stream processing, supporting batch jobs will be one of the critical core features to help us reach the vision and let Flink have an even bigger impact in the industry.

I fully agree with that statement. However, I don't think Hive syntax support helps with that unified batch and stream processing. We're making it easier for batch users to run their Hive batch jobs on Flink, but that doesn't fit the "unified" part, since it's focused on batch, while Flink SQL focuses on both batch and streaming. I would rather have invested time in making batch improvements to Flink and Flink SQL than in Hive syntax support. I do understand from the replies that Hive syntax support is valuable for those who are already running batch processing and would like to run these queries on Flink. I do think that's mostly limited to Chinese companies at the moment. 

@Jark I think you've provided great input and are spot on with: 
> Regarding the maintenance concern you raised, I think that's a good point and they are in the plan. The Hive dialect has already been a plugin and option now, and the implementation is located in the hive-connector module. We still need some work to make the Hive dialect purely rely on public APIs, and the Hive connector should be decoupled from the table planner. At that time, we can move the whole Hive connector into a separate repository (I guess this is also in the externalize connectors plan).

I'm looping in Francesco and Timo who can elaborate more in depth on the current maintenance issues. I think we need to have a proper plan on how this tech debt/maintenance can be addressed and to get commitment that this will be resolved in Flink 1.16, since we indeed need to move out all previously agreed connectors before Flink 1.16 is released.

> From my perspective, Hive is still widely used and there are many running Hive SQL jobs, so why not provide users a better experience to help them migrate Hive jobs to Flink? Also, it doesn't conflict with Flink SQL, as it's just a dialect option. 

I do think there is a conflict with Flink SQL: you can't use both of them at the same time, so you don't have access to all features in Flink. That increases feature sparsity and user friction. It also puts a bigger burden on the Flink community, because having both options available means more maintenance work. For example, an upgrade of Calcite is more impactful. The Flink codebase is already rather large and CI build times are already too long. More code means more risk of bugs. If a user at some point wants to change their Hive batch job to a streaming Flink SQL job, there's still migration work for the user; it just happens at a later stage. 
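The "you can't use both at the same time" point can be illustrated with a toy sketch (hypothetical names, not Flink code): a session holds exactly one active dialect, and every statement is interpreted under whichever dialect is active when it runs:

```python
# Sketch (hypothetical names): a session holds exactly one active dialect,
# so while the Hive dialect is active, features tied to the default
# dialect are unavailable until you switch back.
class Session:
    def __init__(self) -> None:
        self.dialect = "default"

    def set_dialect(self, name: str) -> None:
        # Switching replaces the previous dialect; they are never combined.
        self.dialect = name

    def execute(self, sql: str) -> str:
        # Each statement is parsed with whichever dialect is active *now*.
        return f"{self.dialect}:{sql}"

s = Session()
s.set_dialect("hive")
print(s.execute("SELECT 1"))  # hive:SELECT 1
s.set_dialect("default")
print(s.execute("SELECT 1"))  # default:SELECT 1
```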

@Jingsong I think you have a good argument that migrating SQL for Batch ETL is indeed an expensive effort. 

Last but not least, no one has yet commented on the supported Hive versions and the security issues. I've reached out to the Hive community, and the info I've received so far is that only Hive 3.1.x and Hive 2.3.x are still supported. The older Hive versions are no longer maintained and don't receive security updates. This is important because many companies scan the Flink project for vulnerabilities and won't allow using it when these types of vulnerabilities are included. 

My summary would be the following:
* Like Jark said, in the short term, Hive syntax compatibility is the ticket for us to have a seat in batch processing. Improving Hive syntax support in Flink can help with this. 
* In the long term, we can and should drop it and focus on Flink SQL itself both for batch and stream processing.
* The Hive maintainers/volunteers should come up with a plan on how the tech debt/maintenance with regards to Hive query syntax can be addressed and resolved for Flink 1.16. This includes things like using public APIs and decoupling it from the planner. This is also extremely important since we want to move out connectors with Flink 1.16 (the next Flink release). I'm hoping that those who can help out with this will chip in. 
* We follow the Hive versions that are still supported, which means we drop support for Hive 1.*, 2.1.x, and 2.2.x and upgrade Hive 2.3 and Hive 3.1 to their latest versions. 

Thanks again for your input and looking forward to your thoughts on this.

Best regards,

Martijn 
On Tue, 8 Mar 2022 at 10:39, 罗宇侠(莫辞) <lu...@alibaba-inc.com> wrote:

Hi Martijn,
Thanks for driving this discussion. 

About your concerns, I would like to share my opinion.
Actually, more precisely, FLIP-152 [1] does not extend Flink SQL to support the Hive query syntax; it provides a Hive dialect option that enables users to switch to the Hive dialect. From the commits for the corresponding FLINK-21529 [2], it doesn't involve many changes to Flink itself. 

- About the struggle with maintenance: the current implementation just provides an option for users to use the Hive dialect. I don't think there will be much extra burden.

- Although Apache Hive is less popular than it was, it has been widely used as open source data warehouse software over the years. There are still many Hive SQL jobs in many companies.

- As I said, the current implementation of the Hive SQL syntax is more like a pluggable one; we could also support Snowflake and the others if necessary.

- As for the known security vulnerabilities of Hive, maybe that's not a critical problem for this discussion.

- The current implementation of the Hive SQL syntax uses a pluggable HiveParser [3] to parse SQL statements. I don't think much complexity will be brought to Flink to support the Hive syntax. 

From my perspective, Hive is still widely used and there are many running Hive SQL jobs, so why not provide users a better experience to help them migrate Hive jobs to Flink? Also, it doesn't conflict with Flink SQL, as it's just a dialect option. 

[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=165227316
[2] https://issues.apache.org/jira/browse/FLINK-21529
[3] https://issues.apache.org/jira/browse/FLINK-21531
Best regards,
Yuxia.





Re: Fwd: [DISCUSS] Flink's supported APIs and Hive query syntax

Posted by Jark Wu <im...@gmail.com>.
Hi Francesco,

Yes. The Hive syntax is a syntax plugin provided by Hive connector.

> But right now I don't think It's a good idea adding new features on top,
as it will create only more maintenance burden both for Hive developers and
for table developers.

We are not adding new Hive features, but fixing compatibility and behavior
bugs, and almost all of them are related only to the Hive connector code,
with nothing to do with the table planner.

I agree we should investigate how to decouple the Hive connector from the
planner, and how much work that is, ASAP. We will come up with a Google doc
soon. But AFAIK, this may not be a huge amount of work and won't conflict
with the bugfix work.

Best,
Jark

On Thu, 10 Mar 2022 at 17:03, Francesco Guardiani <fr...@ververica.com>
wrote:

> > We still need some work to make the Hive dialect purely rely on public
> APIs, and the Hive connector should be decoupled from the table planner.
>
> From the table perspective, I think this is the big pain point at the
> moment. First of all, when we talk about the Hive syntax, we're really
> talking about the Hive connector, as my understanding is that without the
> Hive connector in the classpath you can't use the Hive syntax [1].
>
> The Hive connector is heavily relying on internals [2], and this is an
> important struggle for the table project, as it sometimes impedes and slows
> down development of new features and creates a huge maintenance burden for
> table developers [3]. The planner itself has some classes specific to Hive
> [4], making the codebase of the planner more complex than it already is.
> Some of these are just legacy; others exist because some abstractions are
> missing on the table planner side, but those just need some
> work.
>
> So I agree with Jark: when the two Hive modules (connector-hive and
> sql-parser-hive) reach a point where they don't depend at all on
> flink-table-planner, like every other connector (except for testing of
> course), we should be good to move them into a separate repo and continue
> committing to them. But right now I don't think it's a good idea to add new
> features on top, as it will only create more maintenance burden both for
> Hive developers and for table developers.
>
> My concern with this plan is: how realistic is it to fix all the planner
> internal leaks in the existing Hive connector/parser? To me this seems like
> a huge task, including a non-trivial amount of work to stabilize and design
> new entry points in the Table API.
>
> [1] HiveParser
> <https://github.com/apache/flink/blob/a5847e3871ffb9515af9c754bd10c42611976c82/flink-connectors/flink-connector-hive/src/main/java/org/apache/flink/table/planner/delegation/hive/HiveParser.java>
> [2] HiveParserCalcitePlanner
> <https://github.com/apache/flink/blob/6628237f72d818baec094a2426c236480ee33380/flink-connectors/flink-connector-hive/src/main/java/org/apache/flink/table/planner/delegation/hive/HiveParserCalcitePlanner.java>
> [3] Just talking about code coupling, not even mentioning problems like
> dependencies and security updates
> [4] HiveAggSqlFunction
> <https://github.com/apache/flink/blob/ab70dcfa19827febd2c3cdc5cb81e942caa5b2f0/flink-table/flink-table-planner/src/main/java/org/apache/flink/table/planner/functions/utils/HiveAggSqlFunction.java>
>
> On Thu, Mar 10, 2022 at 9:05 AM Martijn Visser <ma...@apache.org>
> wrote:
>
>> Thank you Yuxia for volunteering, that's really much appreciated. It
>> would be great if you can create an umbrella ticket for that.
>>
>> It would be great to get some insights from current Flink and Hive
>> users on which versions are being used.
>> @Jark I would indeed deprecate the old Hive versions in Flink 1.15 and
>> then drop them in Flink 1.16. That would also remove some tech debt and
>> make it less work with regards to externalizing connectors.
>>
>> Best regards,
>>
>> Martijn
>>
>> On Thu, 10 Mar 2022 at 07:39, Jark Wu <im...@gmail.com> wrote:
>>
>>> Thanks Martijn for the reply and summary.
>>>
>>> I totally agree with your plan and thank Yuxia for volunteering to take
>>> on the Hive tech debt issue. I think we can create an umbrella issue for
>>> this and target version 1.16. We can discuss details and create subtasks
>>> there.
>>>
>>> Regarding dropping old Hive versions, I'm also fine with that. But I
>>> would like to check with some Hive users first to see whether it's
>>> acceptable at this point. My first thought was that we can deprecate the
>>> old Hive versions in 1.15, and we can discuss dropping them in 1.16 or
>>> 1.17.
>>>
>>> Best,
>>> Jark
>>>
>>>
>>> On Thu, 10 Mar 2022 at 14:19, 罗宇侠(莫辞) <lu...@alibaba-inc.com>
>>> wrote:
>>>
>>>> Thanks Martijn for your insights.
>>>>
>>>> About the tech debt/maintenance with regards to Hive query syntax, I
>>>> would like to chip in and expect it can be resolved for Flink 1.16.
>>>>
>>>> Best regards,
>>>>
>>>> Yuxia
>>>>
>>>> ------------------ Original Message ------------------
>>>> *From:* Martijn Visser <ma...@apache.org>
>>>> *Sent:* Thu Mar 10 04:03:34 2022
>>>> *To:* User <us...@flink.apache.org>
>>>> *Subject:* Fwd: [DISCUSS] Flink's supported APIs and Hive query syntax
>>>>
>>>>> (Forwarding this also to the User mailing list as I made a typo when
>>>>> replying to this email thread)
>>>>>
>>>>> ---------- Forwarded message ---------
>>>>> From: Martijn Visser <ma...@apache.org>
>>>>> Date: Wed, 9 Mar 2022 at 20:57
>>>>> Subject: Re: [DISCUSS] Flink's supported APIs and Hive query syntax
>>>>> To: dev <de...@flink.apache.org>, Francesco Guardiani <
>>>>> francesco@ververica.com>, Timo Walther <tw...@apache.org>, <
>>>>> users@flink.apache.org>
>>>>>
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>> Thank you all very much for your input. From my perspective, I
>>>>> consider batch as a special case of streaming. So with Flink SQL, we can
>>>>> support both batch and streaming use cases and I think we should use Flink
>>>>> SQL as our target.
>>>>>
>>>>> To reply to some of the comments:
>>>>>
>>>>> @Jing on your remark:
>>>>> > Since Flink has a clear vision of unified batch and stream
>>>>> processing, supporting batch jobs will be one of the critical core features
>>>>> to help us reach the vision and let Flink have an even bigger impact in the
>>>>> industry.
>>>>>
>>>>> I fully agree with that statement. I do think that having Hive syntax
>>>>> support doesn't help with that unified batch and stream processing. We're
>>>>> making it easier for batch users to run their Hive batch jobs on Flink, but
>>>>> that doesn't fit the "unified" part since it's focused on batch, while
>>>>> Flink SQL focuses on both batch and streaming. I would rather have invested
>>>>> time in making batch improvements to Flink and Flink SQL than in
>>>>> Hive syntax support. I do understand from the given replies that Hive
>>>>> syntax support is valuable for those that are already running batch
>>>>> processing and would like to run these queries on Flink. I do think that's
>>>>> limited to mostly Chinese companies at the moment.
>>>>>
>>>>> @Jark I think you've provided great input and are spot on with:
>>>>> > Regarding the maintenance concern you raised, I think that's a good
>>>>> point and they are in the plan. The Hive dialect is already a plugin
>>>>> and option now, and the implementation is located in the hive-connector
>>>>> module. We still need some work to make the Hive dialect purely rely on
>>>>> public APIs, and the Hive connector should be decoupled from the table
>>>>> planner. At that time, we can move the whole Hive connector into a separate
>>>>> repository (I guess this is also in the externalize connectors plan).
>>>>>
>>>>> I'm looping in Francesco and Timo who can elaborate more in depth on
>>>>> the current maintenance issues. I think we need to have a proper plan on
>>>>> how this tech debt/maintenance can be addressed and to get commitment that
>>>>> this will be resolved in Flink 1.16, since we indeed need to move out all
>>>>> previously agreed connectors before Flink 1.16 is released.
>>>>>
>>>>> > From my perspective, Hive is still widely used and there exist many
>>>>> running Hive SQL jobs, so why not provide users a better experience to
>>>>> help them migrate Hive jobs to Flink? Also, it doesn't conflict with Flink
>>>>> SQL as it's just a dialect option.
>>>>>
>>>>> I do think there is a conflict with Flink SQL; you can't use both of
>>>>> them at the same time, so you don't have access to all features in Flink.
>>>>> That increases feature sparsity and user friction. It also puts a bigger
>>>>> burden on the Flink community, because having both options available means
>>>>> more maintenance work. For example, an upgrade of Calcite is more
>>>>> impactful. The Flink codebase is already rather large and CI build times
>>>>> are already too long. More code means more risk of bugs. If a user at some
>>>>> point wants to change their Hive batch job to a streaming Flink SQL job,
>>>>> there's still migration work for the user; it just needs to happen at a
>>>>> later stage.
>>>>>
>>>>> @Jingsong I think you have a good argument that migrating SQL for
>>>>> Batch ETL is indeed an expensive effort.
>>>>>
>>>>> Last but not least, no one has yet commented on the supported Hive
>>>>> versions and security issues. I've reached out to the Hive community, and
>>>>> the info I've received so far is that only Hive 3.1.x and Hive 2.3.x are
>>>>> still supported. The older Hive versions are no longer maintained and also
>>>>> don't receive security updates. This is important because many companies
>>>>> scan the Flink project for vulnerabilities and won't allow using it when
>>>>> these types of vulnerabilities are included.
>>>>>
>>>>> My summary would be the following:
>>>>> * Like Jark said, in the short term, Hive syntax compatibility is the
>>>>> ticket for us to have a seat in batch processing. Having improved Hive
>>>>> syntax support in Flink can help with this.
>>>>> * In the long term, we can and should drop it and focus on Flink SQL
>>>>> itself, both for batch and stream processing.
>>>>> * The Hive maintainers/volunteers should come up with a plan for how
>>>>> the tech debt/maintenance with regards to Hive query syntax can be
>>>>> addressed and resolved for Flink 1.16. This includes things like
>>>>> using public APIs and decoupling it from the planner. This is also
>>>>> extremely important since we want to move out connectors with Flink 1.16
>>>>> (the next Flink release). I'm hoping that those who can help out with
>>>>> this will chip in.
>>>>> * We follow the Hive versions that are still supported, which means we
>>>>> drop support for Hive 1.*, 2.1.x and 2.2.x and upgrade Hive 2.3 and Hive
>>>>> 3.1 to their latest versions.
>>>>>
>>>>> Thanks again for your input and looking forward to your thoughts on
>>>>> this.
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Martijn
>>>>>
>>>>> On Tue, 8 Mar 2022 at 10:39, 罗宇侠(莫辞) <
>>>>> luoyuxia.luoyuxia@alibaba-inc.com> wrote:
>>>>>
>>>>>> Hi Martijn,
>>>>>> Thanks for driving this discussion.
>>>>>>
>>>>>> About your concerns, I would like to share my opinion.
>>>>>>
>>>>>> Actually, more precisely, FLIP-152 [1] does not extend Flink SQL itself to support Hive query syntax; it provides a Hive dialect option that enables users to switch to the Hive dialect. From the commits for the corresponding FLINK-21529 [2], it doesn't involve much of Flink itself.
>>>>>>
>>>>>>
>>>>>> - About the struggle with maintenance: the current implementation just provides an option for users to use the Hive dialect. I think there won't be much burden.
>>>>>>
>>>>>>
>>>>>> - Although Apache Hive is less popular, it has been widely used as an open source data warehouse over the years. There still exist many Hive SQL jobs in many companies.
>>>>>>
>>>>>>
>>>>>> - As I said, the current implementation of the Hive SQL syntax is pluggable; we can also add support for Snowflake and the others if necessary.
>>>>>>
>>>>>>
>>>>>> - As for the known security vulnerabilities of Hive, maybe that's not a critical problem for this discussion.
>>>>>>
>>>>>> - The current implementation of the Hive SQL syntax uses a pluggable HiveParser [3] to parse SQL statements. I think supporting Hive syntax won't bring
>>>>>> much complexity to Flink.
>>>>>>
>>>>>>
>>>>>> From my perspective, Hive is still widely used and there exist many running Hive SQL jobs, so why not provide users a better experience to help them migrate Hive jobs to Flink? Also, it doesn't conflict with Flink SQL as it's just a dialect option.
>>>>>>
>>>>>> [1]
>>>>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=165227316
>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-21529
>>>>>> [3] https://issues.apache.org/jira/browse/FLINK-21531
>>>>>> Best regards,
>>>>>> Yuxia.
>>>>>>
>>>>>>
>>>>>> ------------------ Original Message ------------------
>>>>>> *From:* Martijn Visser <ma...@apache.org>
>>>>>> *Sent:* Mon Mar 7 19:23:15 2022
>>>>>> *To:* dev <de...@flink.apache.org>, User <us...@flink.apache.org>
>>>>>> *Subject:* [DISCUSS] Flink's supported APIs and Hive query syntax
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>>
>>>>>>> Flink currently has 4 APIs with multiple language support which can be used
>>>>>>> to develop applications:
>>>>>>>
>>>>>>> * DataStream API, both Java and Scala
>>>>>>> * Table API, both Java and Scala
>>>>>>>
>>>>>>> * Flink SQL, both in Flink query syntax and Hive query syntax (partially)
>>>>>>> * Python API
>>>>>>>
>>>>>>>
>>>>>>> Since FLIP-152 [1] the Flink SQL support has been extended to also support
>>>>>>>
>>>>>>> the Hive query syntax. There is now a follow-up FLINK-26360 [2] to address
>>>>>>> more syntax compatibility issues.
>>>>>>>
>>>>>>>
>>>>>>> I would like to open a discussion on Flink directly supporting the Hive
>>>>>>>
>>>>>>> query syntax. I have some concerns about whether 100% Hive query syntax
>>>>>>> compatibility is indeed something that we should aim for in Flink.
>>>>>>>
>>>>>>>
>>>>>>> I can understand that Hive query syntax support in Flink could help
>>>>>>> users with interoperability and migration. However:
>>>>>>>
>>>>>>>
>>>>>>> - Adding full Hive query syntax support will mean that we go from 6 fully
>>>>>>>
>>>>>>> supported API/language combinations to 7. I think we are currently already
>>>>>>>
>>>>>>> struggling with maintaining the existing combinations, let alone one
>>>>>>> more.
>>>>>>>
>>>>>>> - Apache Hive is/appears to be a project that's not that actively developed
>>>>>>>
>>>>>>> anymore. The last release was made in January 2021. Its popularity is
>>>>>>>
>>>>>>> rapidly declining in Europe and the United States, partly due to Hadoop
>>>>>>> becoming less popular.
>>>>>>> - Related to the previous topic, other software like Snowflake,
>>>>>>>
>>>>>>> Trino/Presto, Databricks are becoming more and more popular. If we add full
>>>>>>>
>>>>>>> support for the Hive query syntax, then why not add support for Snowflake
>>>>>>> and the others?
>>>>>>>
>>>>>>> - We are supporting Hive versions that are no longer supported by the Hive
>>>>>>> community and have known security vulnerabilities. This also makes Flink
>>>>>>> vulnerable to those types of vulnerabilities.
>>>>>>>
>>>>>>> - The current Hive implementation relies on a lot of Flink internals,
>>>>>>>
>>>>>>> making Flink hard to maintain, adding lots of tech debt and making
>>>>>>> things overly complex.
>>>>>>>
>>>>>>>
>>>>>>> From my perspective, I think it would be better to not have Hive query
>>>>>>>
>>>>>>> syntax compatibility directly in Flink itself. Of course we should have a
>>>>>>>
>>>>>>> proper Hive connector and a proper Hive catalog to make connectivity with
>>>>>>>
>>>>>>> Hive (the versions that are still supported by the Hive community) itself
>>>>>>>
>>>>>>> possible. Alternatively, if Hive query syntax is so important, it should
>>>>>>>
>>>>>>> not rely on internals but be available as a dialect/pluggable option. That
>>>>>>>
>>>>>>> could also open up the possibility to add more syntax support for others in
>>>>>>>
>>>>>>> the future, but I really think we should just focus on Flink SQL itself.
>>>>>>> That's already hard enough to maintain and improve on.
>>>>>>>
>>>>>>>
>>>>>>> I'm looking forward to the thoughts of both Developers and Users, so I'm
>>>>>>> cross-posting to both mailing lists.
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> Martijn Visser
>>>>>>> https://twitter.com/MartijnVisser82
>>>>>>>
>>>>>>> [1]
>>>>>>>
>>>>>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=165227316
>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-21529
>>>>>>>
>>>>>>

Re: Fwd: [DISCUSS] Flink's supported APIs and Hive query syntax

Posted by Jark Wu <im...@gmail.com>.
Hi Francesco,

Yes. The Hive syntax is a syntax plugin provided by Hive connector.

> But right now I don't think It's a good idea adding new features on top,
as it will create only more maintenance burden both for Hive developers and
for table developers.

We are not adding new Hive features, but fixing compatibility or behavior
bugs, and almost all of them
are just related to the Hive connector code, nothing to do with table
planner.

I agree we should investigate how to and how much work to decouple Hive
connector and planner ASAP.
We will come up with a google doc soon. But AFAIK, this may not be a huge
work and not conflict with the bugfix works.

Best,
Jark

On Thu, 10 Mar 2022 at 17:03, Francesco Guardiani <fr...@ververica.com>
wrote:

> > We still need some work to make the Hive dialect purely rely on public
> APIs, and the Hive connector should be decopule with table planner.
>
> From the table perspective, I think this is the big pain point at the
> moment. First of all, when we talk about the Hive syntax, we're really
> talking about the Hive connector, as my understanding is that without the
> Hive connector in the classpath you can't use the Hive syntax [1].
>
> The Hive connector is heavily relying on internals [2], and this is an
> important struggle for the table project, as sometimes is impedes and slows
> down development of new features and creates a huge maintenance burden for
> table developers [3]. The planner itself has some classes specific to Hive
> [4], making the codebase of the planner more complex than it already is.
> Some of these are just legacy, others exists because there are some
> abstractions missing in the table planner side, but those just need some
> work.
>
> So I agree with Jark, when the two Hive modules (connector-hive and
> sql-parser-hive) reach a point where they don't depend at all on
> flink-table-planner, like every other connector (except for testing of
> course), we should be good to move them in a separate repo and continue
> committing to them. But right now I don't think It's a good idea adding new
> features on top, as it will create only more maintenance burden both for
> Hive developers and for table developers.
>
> My concern with this plan is: how much realistic is to fix all the planner
> internal leaks in the existing Hive connector/parser? To me this seems like
> a huge task, including a non trivial amount of work to stabilize and design
> new entry points in Table API.
>
> [1] HiveParser
> <https://github.com/apache/flink/blob/a5847e3871ffb9515af9c754bd10c42611976c82/flink-connectors/flink-connector-hive/src/main/java/org/apache/flink/table/planner/delegation/hive/HiveParser.java>
> [2] HiveParserCalcitePlanner
> <https://github.com/apache/flink/blob/6628237f72d818baec094a2426c236480ee33380/flink-connectors/flink-connector-hive/src/main/java/org/apache/flink/table/planner/delegation/hive/HiveParserCalcitePlanner.java>
> [3] Just talking about code coupling, not even mentioning problems like
> dependencies and security updates
> [4] HiveAggSqlFunction
> <https://github.com/apache/flink/blob/ab70dcfa19827febd2c3cdc5cb81e942caa5b2f0/flink-table/flink-table-planner/src/main/java/org/apache/flink/table/planner/functions/utils/HiveAggSqlFunction.java>
>
> On Thu, Mar 10, 2022 at 9:05 AM Martijn Visser <ma...@apache.org>
> wrote:
>
>> Thank you Yuxia for volunteering, that's really much appreciated. It
>> would be great if you can create an umbrella ticket for that.
>>
>> It would be great to get some insights from currently Flink and Hive
>> users which versions are being used.
>> @Jark I would indeed deprecate the old Hive versions in Flink 1.15 and
>> then drop them in Flink 1.16. That would also remove some tech debt and
>> make it less work with regards to externalizing connectors.
>>
>> Best regards,
>>
>> Martijn
>>
>> On Thu, 10 Mar 2022 at 07:39, Jark Wu <im...@gmail.com> wrote:
>>
>>> Thanks Martijn for the reply and summary.
>>>
>>> I totally agree with your plan and thank Yuxia for volunteering the Hive
>>> tech debt issue.
>>> I think we can create an umbrella issue for this and target version
>>> 1.16. We can discuss
>>> details and create subtasks there.
>>>
>>> Regarding dropping old Hive versions, I'm also fine with that. But I
>>> would like to investigate
>>> some Hive users first to see whether it's acceptable at this point. My
>>> first thought was we
>>> can deprecate the old Hive versions in 1.15, and we can discuss dropping
>>> it in 1.16 or 1.17.
>>>
>>> Best,
>>> Jark
>>>
>>>
>>> On Thu, 10 Mar 2022 at 14:19, 罗宇侠(莫辞) <lu...@alibaba-inc.com>
>>> wrote:
>>>
>>>> Thanks Martijn for your insights.
>>>>
>>>> About the tech debt/maintenance with regards to Hive query syntax, I
>>>> would like to chip-in and expect it can be resolved for Flink 1.16.
>>>>
>>>> Best regards,
>>>>
>>>> Yuxia
>>>>
>>>> ------------------原始邮件 ------------------
>>>> *发件人:*Martijn Visser <ma...@apache.org>
>>>> *发送时间:*Thu Mar 10 04:03:34 2022
>>>> *收件人:*User <us...@flink.apache.org>
>>>> *主题:*Fwd: [DISCUSS] Flink's supported APIs and Hive query syntax
>>>>
>>>>> (Forwarding this also to the User mailing list as I made a typo when
>>>>> replying to this email thread)
>>>>>
>>>>> ---------- Forwarded message ---------
>>>>> From: Martijn Visser <ma...@apache.org>
>>>>> Date: Wed, 9 Mar 2022 at 20:57
>>>>> Subject: Re: [DISCUSS] Flink's supported APIs and Hive query syntax
>>>>> To: dev <de...@flink.apache.org>, Francesco Guardiani <
>>>>> francesco@ververica.com>, Timo Walther <tw...@apache.org>, <
>>>>> users@flink.apache.org>
>>>>>
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>> Thank you all very much for your input. From my perspective, I
>>>>> consider batch as a special case of streaming. So with Flink SQL, we can
>>>>> support both batch and streaming use cases and I think we should use Flink
>>>>> SQL as our target.
>>>>>
>>>>> To reply on some of the comments:
>>>>>
>>>>> @Jing on your remark:
>>>>> > Since Flink has a clear vision of unified batch and stream
>>>>> processing, supporting batch jobs will be one of the critical core features
>>>>> to help us reach the vision and let Flink have an even bigger impact in the
>>>>> industry.
>>>>>
>>>>> I fully agree with that statement. I do think that having Hive syntax
>>>>> support doesn't help in that unified batch and stream processing. We're
>>>>> making it easier for batch users to run their Hive batch jobs on Flink, but
>>>>> that doesn't fit the "unified" part since it's focussed on batch, while
>>>>> Flink SQL focusses on batch and streaming. I would have rather invested
>>>>> time in making batch improvements to Flink and Flink SQL vs investing in
>>>>> Hive syntax support. I do understand from the given replies that Hive
>>>>> syntax support is valuable for those that are already running batch
>>>>> processing and would like to run these queries on Flink. I do think that's
>>>>> limited to mostly Chinese companies at the moment.
>>>>>
>>>>> @Jark I think you've provided great input and are spot on with:
>>>>> > Regarding the maintenance concern you raised, I think that's a good
>>>>> point and they are in the plan. The Hive dialect has already been a plugin
>>>>> and option now, and the implementation is located in hive-connector module.
>>>>> We still need some work to make the Hive dialect purely rely on public
>>>>> APIs, and the Hive connector should be decopule with table planner. At that
>>>>> time, we can move the whole Hive connector into a separate repository (I
>>>>> guess this is also in the externalize connectors plan).
>>>>>
>>>>> I'm looping in Francesco and Timo who can elaborate more in depth on
>>>>> the current maintenance issues. I think we need to have a proper plan on
>>>>> how this tech debt/maintenance can be addressed and to get commitment that
>>>>> this will be resolved in Flink 1.16, since we indeed need to move out all
>>>>> previously agreed connectors before Flink 1.16 is released.
>>>>>
>>>>> > From my perspective, Hive is still widely used and there exists many
>>>>> running Hive SQL jobs, so why not to provide users a better experience to
>>>>> help them migrate Hive jobs to Flink? Also, it doesn't conflict with Flink
>>>>> SQL as it's just a dialect option.
>>>>>
>>>>> I do think there is a conflict with Flink SQL; you can't use both of
>>>>> them at the same time, so you don't have access to all features in Flink.
>>>>> That increases feature sparsity and user friction. It also puts a bigger
>>>>> burden on the Flink community, because having both options available means
>>>>> more maintenance work. For example, an upgrade of Calcite is more
>>>>> impactful. The Flink codebase is already rather large and CI build times
>>>>> are already too long. More code means more risk of bugs. If a user at some
>>>>> point wants to change his Hive batch job to a streaming Flink SQL job,
>>>>> there's still migration work for the user, it just needs to happen at a
>>>>> later stage.
>>>>>
>>>>> @Jingsong I think you have a good argument that migrating SQL for
>>>>> Batch ETL is indeed an expensive effort.
>>>>>
>>>>> Last but not least, there was no one who has yet commented on the
>>>>> supported Hive versions and security issues. I've reached out to the Hive
>>>>> community and from the info I've received so far is that only Hive 3.1.x
>>>>> and Hive 2.3.x are still supported. The older Hive versions are no longer
>>>>> maintained and also don't receive security updates. This is important
>>>>> because many companies scan the Flink project for vulnerabilities and won't
>>>>> allow using it when these types of vulnerabilities are included.
>>>>>
>>>>> My summary would be the following:
>>>>> * Like Jark said, in the short term, Hive syntax compatibility is the
>>>>> ticket for us to have a seat in the batch processing. Having improved Hive
>>>>> syntax support for that in Flink can help in this.
>>>>> * In the long term, we can and should drop it and focus on Flink SQL
>>>>> itself both for batch and stream processing.
>>>>> * The Hive maintainers/volunteers should come up with a plan on how
>>>>> the tech debt/maintenance with regards to Hive query syntax can be
>>>>> addressed and will be resolved for Flink 1.16. This includes stuff like
>>>>> using public APIs and decoupling it from the planner. This is also
>>>>> extremely important since we want to move out connectors with Flink 1.16
>>>>> (next Flink release). I'm hoping that those who can help out with this will
>>>>> chip-in.
>>>>> * We follow the Hive versions that are still supported, which means we
>>>>> drop support for Hive 1.*, 2.1.x and 2.2.x and upgrade Hive 2.3 and Hive
>>>>> 3.1 to the latest version.
>>>>>
>>>>> Thanks again for your input and looking forward to your thoughts on
>>>>> this.
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Martijn
>>>>>
>>>>> On Tue, 8 Mar 2022 at 10:39, 罗宇侠(莫辞) <
>>>>> luoyuxia.luoyuxia@alibaba-inc.com> wrote:
>>>>>
>>>>>> Hi Martijn,
>>>>>> Thanks for driving this discussion.
>>>>>>
>>>>>> About your concerns, I would like to share my opinion.
>>>>>>
>>>>>> Actually, more exactly, FLIP-152 [1] is not to extend Flink SQL to support Hive query synax, it provides a Hive dialect option to enable users to switch to Hive dialect. From the commits about the corresponding FLINK-21529, it doesn't involve much about Flink itself.
>>>>>>
>>>>>>
>>>>>> - About the struggling with maintaining. The current implementation is just to provide an option for user to use Hive dialect. I think there won't be much bother.
>>>>>>
>>>>>>
>>>>>> - Although Apache Hive is less popular, it's widely used as an open source database over the years. There still exists many Hive SQL jobs in many companies.
>>>>>>
>>>>>>
>>>>>> - As I said, the current implementation for Hive SQL synax is more like pluggable, we can also support for Snowflake and the others as long as it's necessary.
>>>>>>
>>>>>>
>>>>>> - As for the know security vulnerabilities of Hive, maybe it's not a critical problem in this discuss.
>>>>>>
>>>>>> - For current implementation for Hive SQL syntax, it uses a pluggable HiveParser[3] to parse the SQL statement. I think there won't be much complexity brought to Flink
>>>>>> to support Hive syntax.
>>>>>>
>>>>>>
>>>>>> From my perspective, Hive is still widely used and there exists many running Hive SQL jobs, so why not to provide users a better experience to help them migrate Hive jobs to Flink? Also, it doesn't conflict with Flink SQL as it's just a dialect option.
>>>>>>
>>>>>> [1]
>>>>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=165227316
>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-21529
>>>>>> [3] https://issues.apache.org/jira/browse/FLINK-21531
>>>>>> Best regards,
>>>>>> Yuxia.
>>>>>>
>>>>>>
>>>>>> ------------------原始邮件 ------------------
>>>>>> *发件人:*Martijn Visser <ma...@apache.org>
>>>>>> *发送时间:*Mon Mar 7 19:23:15 2022
>>>>>> *收件人:*dev <de...@flink.apache.org>, User <us...@flink.apache.org>
>>>>>> *主题:*[DISCUSS] Flink's supported APIs and Hive query syntax
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>>
>>>>>>> Flink currently has 4 APIs with multiple language support which can be used
>>>>>>> to develop applications:
>>>>>>>
>>>>>>> * DataStream API, both Java and Scala
>>>>>>> * Table API, both Java and Scala
>>>>>>>
>>>>>>> * Flink SQL, both in Flink query syntax and Hive query syntax (partially)
>>>>>>> * Python API
>>>>>>>
>>>>>>>
>>>>>>> Since FLIP-152 [1] the Flink SQL support has been extended to also support
>>>>>>>
>>>>>>> the Hive query syntax. There is now a follow-up FLINK-26360 [2] to address
>>>>>>> more syntax compatibility issues.
>>>>>>>
>>>>>>>
>>>>>>> I would like to open a discussion on Flink directly supporting the Hive
>>>>>>>
>>>>>>> query syntax. I have some concerns if having a 100% Hive query syntax is
>>>>>>> indeed something that we should aim for in Flink.
>>>>>>>
>>>>>>>
>>>>>>> I can understand that having Hive query syntax support in Flink could help
>>>>>>> users due to interoperability and being able to migrate. However:
>>>>>>>
>>>>>>>
>>>>>>> - Adding full Hive query syntax support will mean that we go from 6 fully
>>>>>>>
>>>>>>> supported API/language combinations to 7. I think we are currently already
>>>>>>>
>>>>>>> struggling with maintaining the existing combinations, let another one
>>>>>>> more.
>>>>>>>
>>>>>>> - Apache Hive is/appears to be a project that's not that actively developed
>>>>>>>
>>>>>>> anymore. The last release was made in January 2021. Its popularity is
>>>>>>>
>>>>>>> rapidly declining in Europe and the United States, also due to Hadoop
>>>>>>> becoming less popular.
>>>>>>> - Related to the previous topic, other software like Snowflake,
>>>>>>>
>>>>>>> Trino/Presto, Databricks are becoming more and more popular. If we add full
>>>>>>>
>>>>>>> support for the Hive query syntax, then why not add support for Snowflake
>>>>>>> and the others?
>>>>>>>
>>>>>>> - We are supporting Hive versions that are no longer supported by the Hive
>>>>>>> community and have known security vulnerabilities. This also makes Flink
>>>>>>> vulnerable to those types of vulnerabilities.
>>>>>>>
>>>>>>> - The current Hive implementation relies on a lot of Flink internals,
>>>>>>>
>>>>>>> making Flink hard to maintain, adding lots of tech debt and making
>>>>>>> things overly complex.
>>>>>>>
>>>>>>>
>>>>>>> From my perspective, I think it would be better to not have Hive query
>>>>>>>
>>>>>>> syntax compatibility directly in Flink itself. Of course we should have a
>>>>>>>
>>>>>>> proper Hive connector and a proper Hive catalog to make connectivity with
>>>>>>>
>>>>>>> Hive (the versions that are still supported by the Hive community) itself
>>>>>>>
>>>>>>> possible. Alternatively, if Hive query syntax is so important, it should
>>>>>>>
>>>>>>> not rely on internals but be available as a dialect/pluggable option. That
>>>>>>>
>>>>>>> could also open up the possibility to add more syntax support for others in
>>>>>>>
>>>>>>> the future, but I really think we should just focus on Flink SQL itself.
>>>>>>> That's already hard enough to maintain and improve on.
>>>>>>>
>>>>>>>
>>>>>>> I'm looking forward to the thoughts of both Developers and Users, so I'm
>>>>>>> cross-posting to both mailing lists.
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> Martijn Visser
>>>>>>> https://twitter.com/MartijnVisser82
>>>>>>>
>>>>>>> [1]
>>>>>>>
>>>>>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=165227316
>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-21529
>>>>>>>
>>>>>>
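The "dialect/pluggable option" raised in the email above already has a concrete form in Flink: per FLIP-152 the Hive dialect is switched through configuration rather than a separate API. A sketch of how a user toggles it, based on the `table.sql-dialect` option (verify against the Flink docs for your version):

```sql
-- In the SQL client: switch the parser to the Hive dialect ...
SET table.sql-dialect = hive;
-- ... and back to the default Flink dialect.
SET table.sql-dialect = default;
```

In the Table API, the equivalent is `tableEnv.getConfig().setSqlDialect(SqlDialect.HIVE)`.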

Re: Fwd: [DISCUSS] Flink's supported APIs and Hive query syntax

Posted by Francesco Guardiani <fr...@ververica.com>.
> We still need some work to make the Hive dialect purely rely on public
APIs, and the Hive connector should be decoupled from the table planner.

From the table perspective, I think this is the big pain point at the
moment. First of all, when we talk about the Hive syntax, we're really
talking about the Hive connector, as my understanding is that without the
Hive connector in the classpath you can't use the Hive syntax [1].

The Hive connector relies heavily on internals [2], and this is an
important struggle for the table project, as it sometimes impedes and slows
down development of new features and creates a huge maintenance burden for
table developers [3]. The planner itself has some classes specific to Hive
[4], making the codebase of the planner more complex than it already is.
Some of these are just legacy, others exist because some abstractions are
missing on the table planner side, but those just need some
work.

So I agree with Jark, when the two Hive modules (connector-hive and
sql-parser-hive) reach a point where they don't depend at all on
flink-table-planner, like every other connector (except for testing of
course), we should be good to move them to a separate repo and continue
committing to them. But right now I don't think it's a good idea to add new
features on top, as it will only create more maintenance burden both for
Hive developers and for table developers.

My concern with this plan is: how realistic is it to fix all the planner
internal leaks in the existing Hive connector/parser? To me this seems like
a huge task, including a non-trivial amount of work to stabilize and design
new entry points in the Table API.

[1] HiveParser
<https://github.com/apache/flink/blob/a5847e3871ffb9515af9c754bd10c42611976c82/flink-connectors/flink-connector-hive/src/main/java/org/apache/flink/table/planner/delegation/hive/HiveParser.java>
[2] HiveParserCalcitePlanner
<https://github.com/apache/flink/blob/6628237f72d818baec094a2426c236480ee33380/flink-connectors/flink-connector-hive/src/main/java/org/apache/flink/table/planner/delegation/hive/HiveParserCalcitePlanner.java>
[3] Just talking about code coupling, not even mentioning problems like
dependencies and security updates
[4] HiveAggSqlFunction
<https://github.com/apache/flink/blob/ab70dcfa19827febd2c3cdc5cb81e942caa5b2f0/flink-table/flink-table-planner/src/main/java/org/apache/flink/table/planner/functions/utils/HiveAggSqlFunction.java>
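The pluggable-dialect direction discussed above can be made concrete with a small sketch. This is purely illustrative and hypothetical, not Flink's actual API: each dialect registers a parser behind one stable interface, so the core planner consumes only the normalized result and never imports a dialect's internals.

```python
# Illustrative sketch only: hypothetical names, not Flink's real API.
# Each dialect ships its own parser behind one stable interface,
# registered by name; the "planner" only ever sees Statement objects.
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Statement:
    dialect: str
    operation: str  # normalized operation the planner understands


# Registry mapping a dialect name to its parser function.
_PARSERS: Dict[str, Callable[[str], Statement]] = {}


def register_dialect(name: str):
    """Decorator that registers a parser for the given dialect name."""
    def wrap(parser: Callable[[str], Statement]):
        _PARSERS[name] = parser
        return parser
    return wrap


@register_dialect("default")
def parse_default(sql: str) -> Statement:
    return Statement("default", sql.strip().split()[0].upper())


@register_dialect("hive")
def parse_hive(sql: str) -> Statement:
    # A real Hive parser would handle Hive-only syntax here; the point
    # is that the planner still receives only a normalized Statement.
    return Statement("hive", sql.strip().split()[0].upper())


def parse(sql: str, dialect: str = "default") -> Statement:
    """Entry point: resolve the dialect via the registry, never directly."""
    try:
        return _PARSERS[dialect](sql)
    except KeyError:
        raise ValueError(f"unknown dialect: {dialect}") from None


print(parse("SELECT * FROM t", "hive").dialect)  # prints: hive
```

With this shape, dropping or externalizing a dialect means removing one registration, not untangling planner classes, which is roughly the decoupling the thread is asking for.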

On Thu, Mar 10, 2022 at 9:05 AM Martijn Visser <ma...@apache.org>
wrote:

> Thank you Yuxia for volunteering, that's really much appreciated. It would
> be great if you can create an umbrella ticket for that.
>
> It would be great to get some insights from currently Flink and Hive users
> which versions are being used.
> @Jark I would indeed deprecate the old Hive versions in Flink 1.15 and
> then drop them in Flink 1.16. That would also remove some tech debt and
> make it less work with regards to externalizing connectors.
>
> Best regards,
>
> Martijn
>
> On Thu, 10 Mar 2022 at 07:39, Jark Wu <im...@gmail.com> wrote:
>
>> Thanks Martijn for the reply and summary.
>>
>> I totally agree with your plan and thank Yuxia for volunteering the Hive
>> tech debt issue.
>> I think we can create an umbrella issue for this and target version 1.16.
>> We can discuss
>> details and create subtasks there.
>>
>> Regarding dropping old Hive versions, I'm also fine with that. But I
>> would like to investigate
>> some Hive users first to see whether it's acceptable at this point. My
>> first thought was we
>> can deprecate the old Hive versions in 1.15, and we can discuss dropping
>> it in 1.16 or 1.17.
>>
>> Best,
>> Jark
>>
>>
>> On Thu, 10 Mar 2022 at 14:19, 罗宇侠(莫辞) <lu...@alibaba-inc.com>
>> wrote:
>>
>>> Thanks Martijn for your insights.
>>>
>>> About the tech debt/maintenance with regards to Hive query syntax, I
>>> would like to chip-in and expect it can be resolved for Flink 1.16.
>>>
>>> Best regards,
>>>
>>> Yuxia
>>>
>>> ------------------Original Message ------------------
>>> *From:* Martijn Visser <ma...@apache.org>
>>> *Sent:* Thu Mar 10 04:03:34 2022
>>> *To:* User <us...@flink.apache.org>
>>> *Subject:* Fwd: [DISCUSS] Flink's supported APIs and Hive query syntax
>>>
>>>> (Forwarding this also to the User mailing list as I made a typo when
>>>> replying to this email thread)
>>>>
>>>> ---------- Forwarded message ---------
>>>> From: Martijn Visser <ma...@apache.org>
>>>> Date: Wed, 9 Mar 2022 at 20:57
>>>> Subject: Re: [DISCUSS] Flink's supported APIs and Hive query syntax
>>>> To: dev <de...@flink.apache.org>, Francesco Guardiani <
>>>> francesco@ververica.com>, Timo Walther <tw...@apache.org>, <
>>>> users@flink.apache.org>
>>>>
>>>>
>>>> Hi everyone,
>>>>
>>>> Thank you all very much for your input. From my perspective, I consider
>>>> batch as a special case of streaming. So with Flink SQL, we can support
>>>> both batch and streaming use cases and I think we should use Flink SQL as
>>>> our target.
>>>>
>>>> To reply on some of the comments:
>>>>
>>>> @Jing on your remark:
>>>> > Since Flink has a clear vision of unified batch and stream
>>>> processing, supporting batch jobs will be one of the critical core features
>>>> to help us reach the vision and let Flink have an even bigger impact in the
>>>> industry.
>>>>
>>>> I fully agree with that statement. I do think that having Hive syntax
>>>> support doesn't help in that unified batch and stream processing. We're
>>>> making it easier for batch users to run their Hive batch jobs on Flink, but
>>>> that doesn't fit the "unified" part since it's focussed on batch, while
>>>> Flink SQL focusses on batch and streaming. I would have rather invested
>>>> time in making batch improvements to Flink and Flink SQL vs investing in
>>>> Hive syntax support. I do understand from the given replies that Hive
>>>> syntax support is valuable for those that are already running batch
>>>> processing and would like to run these queries on Flink. I do think that's
>>>> limited to mostly Chinese companies at the moment.
>>>>
>>>> @Jark I think you've provided great input and are spot on with:
>>>> > Regarding the maintenance concern you raised, I think that's a good
>>>> point and they are in the plan. The Hive dialect has already been a plugin
>>>> and option now, and the implementation is located in hive-connector module.
>>>> We still need some work to make the Hive dialect purely rely on public
>>>> APIs, and the Hive connector should be decoupled from the table planner. At that
>>>> time, we can move the whole Hive connector into a separate repository (I
>>>> guess this is also in the externalize connectors plan).
>>>>
>>>> I'm looping in Francesco and Timo who can elaborate more in depth on
>>>> the current maintenance issues. I think we need to have a proper plan on
>>>> how this tech debt/maintenance can be addressed and to get commitment that
>>>> this will be resolved in Flink 1.16, since we indeed need to move out all
>>>> previously agreed connectors before Flink 1.16 is released.
>>>>
>>>> > From my perspective, Hive is still widely used and there exists many
>>>> running Hive SQL jobs, so why not to provide users a better experience to
>>>> help them migrate Hive jobs to Flink? Also, it doesn't conflict with Flink
>>>> SQL as it's just a dialect option.
>>>>
>>>> I do think there is a conflict with Flink SQL; you can't use both of
>>>> them at the same time, so you don't have access to all features in Flink.
>>>> That increases feature sparsity and user friction. It also puts a bigger
>>>> burden on the Flink community, because having both options available means
>>>> more maintenance work. For example, an upgrade of Calcite is more
>>>> impactful. The Flink codebase is already rather large and CI build times
>>>> are already too long. More code means more risk of bugs. If a user at some
>>>> point wants to change his Hive batch job to a streaming Flink SQL job,
>>>> there's still migration work for the user, it just needs to happen at a
>>>> later stage.
>>>>
>>>> @Jingsong I think you have a good argument that migrating SQL for Batch
>>>> ETL is indeed an expensive effort.
>>>>
>>>> Last but not least, no one has yet commented on the
>>>> supported Hive versions and security issues. I've reached out to the Hive
>>>> community, and the info I've received so far is that only Hive 3.1.x
>>>> and Hive 2.3.x are still supported. The older Hive versions are no longer
>>>> maintained and also don't receive security updates. This is important
>>>> because many companies scan the Flink project for vulnerabilities and won't
>>>> allow using it when these types of vulnerabilities are included.
>>>>
>>>> My summary would be the following:
>>>> * Like Jark said, in the short term, Hive syntax compatibility is the
>>>> ticket for us to have a seat in the batch processing. Having improved Hive
>>>> syntax support for that in Flink can help in this.
>>>> * In the long term, we can and should drop it and focus on Flink SQL
>>>> itself both for batch and stream processing.
>>>> * The Hive maintainers/volunteers should come up with a plan on how the
>>>> tech debt/maintenance with regards to Hive query syntax can be addressed
>>>> and will be resolved for Flink 1.16. This includes stuff like using public
>>>> APIs and decoupling it from the planner. This is also extremely important
>>>> since we want to move out connectors with Flink 1.16 (next Flink release).
>>>> I'm hoping that those who can help out with this will chip-in.
>>>> * We follow the Hive versions that are still supported, which means we
>>>> drop support for Hive 1.*, 2.1.x and 2.2.x and upgrade Hive 2.3 and Hive
>>>> 3.1 to the latest version.
>>>>
>>>> Thanks again for your input and looking forward to your thoughts on
>>>> this.
>>>>
>>>> Best regards,
>>>>
>>>> Martijn
>>>>
>>>> On Tue, 8 Mar 2022 at 10:39, 罗宇侠(莫辞) <lu...@alibaba-inc.com>
>>>> wrote:
>>>>
>>>>> Hi Martijn,
>>>>> Thanks for driving this discussion.
>>>>>
>>>>> About your concerns, I would like to share my opinion.
>>>>>
>>>>> Actually, to be more exact, FLIP-152 [1] does not extend Flink SQL to support Hive query syntax; it provides a Hive dialect option that enables users to switch to the Hive dialect. From the commits for the corresponding FLINK-21529 [2], it doesn't involve much of Flink itself.
>>>>>
>>>>>
>>>>> - About the struggle with maintenance: the current implementation just provides an option for users to use the Hive dialect. I think there won't be much burden.
>>>>>
>>>>>
>>>>> - Although Apache Hive is less popular, it has been widely used as an open source data warehouse over the years. There still exist many Hive SQL jobs in many companies.
>>>>>
>>>>>
>>>>> - As I said, the current implementation for Hive SQL syntax is more like a pluggable one; we could also add support for Snowflake and the others if necessary.
>>>>>
>>>>>
>>>>> - As for the known security vulnerabilities of Hive, maybe that's not a critical problem in this discussion.
>>>>>
>>>>> - The current implementation for Hive SQL syntax uses a pluggable HiveParser [3] to parse the SQL statement. I think there won't be much complexity brought to Flink
>>>>> to support Hive syntax.
>>>>>
>>>>>
>>>>> From my perspective, Hive is still widely used and there exist many running Hive SQL jobs, so why not provide users a better experience to help them migrate Hive jobs to Flink? Also, it doesn't conflict with Flink SQL as it's just a dialect option.
>>>>>
>>>>> [1]
>>>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=165227316
>>>>> [2] https://issues.apache.org/jira/browse/FLINK-21529
>>>>> [3] https://issues.apache.org/jira/browse/FLINK-21531
>>>>> Best regards,
>>>>> Yuxia.

Re: Fwd: [DISCUSS] Flink's supported APIs and Hive query syntax

Posted by Martijn Visser <ma...@apache.org>.
Thank you Yuxia for volunteering, that's very much appreciated. It would
be great if you could create an umbrella ticket for that.

It would be great to get some insights from current Flink and Hive users
about which versions are being used.
@Jark I would indeed deprecate the old Hive versions in Flink 1.15 and then
drop them in Flink 1.16. That would also remove some tech debt and make it
less work with regards to externalizing connectors.

Best regards,

Martijn


Re: Fwd: [DISCUSS] Flink's supported APIs and Hive query syntax

Posted by Jark Wu <im...@gmail.com>.
Thanks Martijn for the reply and summary.

I totally agree with your plan and thank Yuxia for volunteering the Hive
tech debt issue.
I think we can create an umbrella issue for this and target version 1.16.
We can discuss
details and create subtasks there.

Regarding dropping old Hive versions, I'm also fine with that. But I would
like to investigate
some Hive users first to see whether it's acceptable at this point. My
first thought was we
can deprecate the old Hive versions in 1.15, and we can discuss dropping it
in 1.16 or 1.17.

Best,
Jark


On Thu, 10 Mar 2022 at 14:19, 罗宇侠(莫辞) <lu...@alibaba-inc.com>
wrote:

> Thanks Martijn for your insights.
>
> About the tech debt/maintenance with regards to Hive query syntax, I
> would like to chip-in and expect it can be resolved for Flink 1.16.
>
> Best regards,
>
> Yuxia
> ​
>
> ------------------原始邮件 ------------------
> *发件人:*Martijn Visser <ma...@apache.org>
> *发送时间:*Thu Mar 10 04:03:34 2022
> *收件人:*User <us...@flink.apache.org>
> *主题:*Fwd: [DISCUSS] Flink's supported APIs and Hive query syntax
>
>> (Forwarding this also to the User mailing list as I made a typo when
>> replying to this email thread)
>>
>> ---------- Forwarded message ---------
>> From: Martijn Visser <ma...@apache.org>
>> Date: Wed, 9 Mar 2022 at 20:57
>> Subject: Re: [DISCUSS] Flink's supported APIs and Hive query syntax
>> To: dev <de...@flink.apache.org>, Francesco Guardiani <
>> francesco@ververica.com>, Timo Walther <tw...@apache.org>, <
>> users@flink.apache.org>
>>
>>
>> Hi everyone,
>>
>> Thank you all very much for your input. From my perspective, I consider
>> batch as a special case of streaming. So with Flink SQL, we can support
>> both batch and streaming use cases and I think we should use Flink SQL as
>> our target.
>>
>> To reply on some of the comments:
>>
>> @Jing on your remark:
>> > Since Flink has a clear vision of unified batch and stream processing,
>> supporting batch jobs will be one of the critical core features to help us
>> reach the vision and let Flink have an even bigger impact in the industry.
>>
>> I fully agree with that statement. I do think that having Hive syntax
>> support doesn't help in that unified batch and stream processing. We're
>> making it easier for batch users to run their Hive batch jobs on Flink, but
>> that doesn't fit the "unified" part since it's focused on batch, while
>> Flink SQL focuses on batch and streaming. I would rather have invested
>> time in making batch improvements to Flink and Flink SQL than in
>> Hive syntax support. I do understand from the given replies that Hive
>> syntax support is valuable for those that are already running batch
>> processing and would like to run these queries on Flink. I do think that's
>> limited to mostly Chinese companies at the moment.
>>
>> @Jark I think you've provided great input and are spot on with:
>> > Regarding the maintenance concern you raised, I think that's a good
>> point and it is in the plan. The Hive dialect is already a plugin
>> and an option, and the implementation is located in the hive-connector module.
>> We still need some work to make the Hive dialect rely purely on public
>> APIs, and the Hive connector should be decoupled from the table planner. At that
>> time, we can move the whole Hive connector into a separate repository (I
>> guess this is also in the externalize-connectors plan).
>>
>> I'm looping in Francesco and Timo who can elaborate more in depth on the
>> current maintenance issues. I think we need to have a proper plan on how
>> this tech debt/maintenance can be addressed and to get commitment that this
>> will be resolved in Flink 1.16, since we indeed need to move out all
>> previously agreed connectors before Flink 1.16 is released.
>>
>> > From my perspective, Hive is still widely used and there exist many
>> running Hive SQL jobs, so why not provide users a better experience to
>> help them migrate Hive jobs to Flink? Also, it doesn't conflict with Flink
>> SQL as it's just a dialect option.
>>
>> I do think there is a conflict with Flink SQL; you can't use both of them
>> at the same time, so you don't have access to all features in Flink. That
>> increases feature sparsity and user friction. It also puts a bigger burden
>> on the Flink community, because having both options available means more
>> maintenance work. For example, an upgrade of Calcite is more impactful. The
>> Flink codebase is already rather large and CI build times are already too
>> long. More code means more risk of bugs. If a user at some point wants to
>> change his Hive batch job to a streaming Flink SQL job, there's still
>> migration work for the user, it just needs to happen at a later stage.
>>
>> @Jingsong I think you have a good argument that migrating SQL for Batch
>> ETL is indeed an expensive effort.
>>
>> Last but not least, there was no one who has yet commented on the
>> supported Hive versions and security issues. I've reached out to the Hive
>> community and from the info I've received so far is that only Hive 3.1.x
>> and Hive 2.3.x are still supported. The older Hive versions are no longer
>> maintained and also don't receive security updates. This is important
>> because many companies scan the Flink project for vulnerabilities and won't
>> allow using it when these types of vulnerabilities are included.
>>
>> My summary would be the following:
>> * Like Jark said, in the short term, Hive syntax compatibility is the
>> ticket for us to have a seat at the table in batch processing. Improved
>> Hive syntax support in Flink can help with this.
>> * In the long term, we can and should drop it and focus on Flink SQL
>> itself both for batch and stream processing.
>> * The Hive maintainers/volunteers should come up with a plan on how the
>> tech debt/maintenance with regards to Hive query syntax can be addressed
>> and will be resolved for Flink 1.16. This includes stuff like using public
>> APIs and decoupling it from the planner. This is also extremely important
>> since we want to move out connectors with Flink 1.16 (next Flink release).
>> I'm hoping that those who can help out with this will chip in.
>> * We follow the Hive versions that are still supported, which means we
>> drop support for Hive 1.*, 2.1.x and 2.2.x and upgrade Hive 2.3 and Hive
>> 3.1 to the latest version.
>>
>> Thanks again for your input and looking forward to your thoughts on this.
>>
>> Best regards,
>>
>> Martijn
>>
>> On Tue, 8 Mar 2022 at 10:39, 罗宇侠(莫辞) <lu...@alibaba-inc.com>
>> wrote:
>>
>>> Hi Martijn,
>>> Thanks for driving this discussion.
>>>
>>> About your concerns, I would like to share my opinion.
>>>
>>> Actually, to be more exact, FLIP-152 [1] is not about extending Flink SQL itself to support Hive query syntax; it provides a Hive dialect option that enables users to switch to the Hive dialect. Judging from the commits for the corresponding FLINK-21529 [2], it doesn't involve many changes to Flink itself.
>>>
>>>
>>> - About the struggle with maintenance: the current implementation just provides an option for users to use the Hive dialect. I don't think it will add much burden.
>>>
>>>
>>> - Although Apache Hive is less popular than it was, it has been widely used as an open source data warehouse over the years. There still exist many Hive SQL jobs in many companies.
>>>
>>>
>>> - As I said, the current implementation of Hive SQL syntax support is largely pluggable; we could also support Snowflake and others if necessary.
>>>
>>>
>>> - As for the known security vulnerabilities of Hive, maybe that's not a critical problem in this discussion.
>>>
>>> - The current implementation of Hive SQL syntax uses a pluggable HiveParser [3] to parse the SQL statement. I think it won't bring much complexity to Flink
>>> to support Hive syntax.
>>>
>>>
>>> From my perspective, Hive is still widely used and there exist many running Hive SQL jobs, so why not provide users a better experience to help them migrate Hive jobs to Flink? Also, it doesn't conflict with Flink SQL as it's just a dialect option.
>>>
>>> [1]
>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=165227316
>>> [2] https://issues.apache.org/jira/browse/FLINK-21529
>>> [3] https://issues.apache.org/jira/browse/FLINK-21531
>>> Best regards,
>>> Yuxia.
>>>
>>>
>>> ------------------ Original Message ------------------
>>> *From:* Martijn Visser <ma...@apache.org>
>>> *Sent:* Mon Mar 7 19:23:15 2022
>>> *To:* dev <de...@flink.apache.org>, User <us...@flink.apache.org>
>>> *Subject:* [DISCUSS] Flink's supported APIs and Hive query syntax
>>>
>>>> Hi everyone,
>>>>
>>>>
>>>> Flink currently has 4 APIs with multiple language support which can be used
>>>> to develop applications:
>>>>
>>>> * DataStream API, both Java and Scala
>>>> * Table API, both Java and Scala
>>>>
>>>> * Flink SQL, both in Flink query syntax and Hive query syntax (partially)
>>>> * Python API
>>>>
>>>>
>>>> Since FLIP-152 [1] the Flink SQL support has been extended to also support
>>>>
>>>> the Hive query syntax. There is now a follow-up FLINK-26360 [2] to address
>>>> more syntax compatibility issues.
>>>>
>>>> I would like to open a discussion on Flink directly supporting the Hive
>>>> query syntax. I have some concerns if having a 100% Hive query syntax is
>>>> indeed something that we should aim for in Flink.
>>>>
>>>>
>>>> I can understand that having Hive query syntax support in Flink could help
>>>> users due to interoperability and being able to migrate. However:
>>>>
>>>>
>>>> - Adding full Hive query syntax support will mean that we go from 6 fully
>>>>
>>>> supported API/language combinations to 7. I think we are already
>>>> struggling to maintain the existing combinations, let alone one
>>>> more.
>>>>
>>>> - Apache Hive is, or appears to be, a project that's no longer actively
>>>> developed. The last release was made in January 2021. Its popularity is
>>>>
>>>> rapidly declining in Europe and the United States, also due to Hadoop
>>>> becoming less popular.
>>>> - Related to the previous topic, other software like Snowflake,
>>>>
>>>> Trino/Presto, Databricks are becoming more and more popular. If we add full
>>>>
>>>> support for the Hive query syntax, then why not add support for Snowflake
>>>> and the others?
>>>>
>>>> - We are supporting Hive versions that are no longer supported by the Hive
>>>> community and have known security vulnerabilities. This also makes Flink
>>>> vulnerable to those types of vulnerabilities.
>>>>
>>>> - The current Hive implementation relies on a lot of Flink internals,
>>>> making Flink hard to maintain, with lots of tech debt and making
>>>> things overly complex.
>>>>
>>>> From my perspective, I think it would be better to not have Hive query
>>>>
>>>> syntax compatibility directly in Flink itself. Of course we should have a
>>>>
>>>> proper Hive connector and a proper Hive catalog to make connectivity with
>>>>
>>>> Hive (the versions that are still supported by the Hive community) itself
>>>> possible. Alternatively, if Hive query syntax is so important, it should
>>>>
>>>> not rely on internals but be available as a dialect/pluggable option. That
>>>>
>>>> could also open up the possibility to add more syntax support for others in
>>>> the future, but I really think we should just focus on Flink SQL itself.
>>>> That's already hard enough to maintain and improve on.
>>>>
>>>> I'm looking forward to the thoughts of both Developers and Users, so I'm
>>>> cross-posting to both mailing lists.
>>>>
>>>> Best regards,
>>>>
>>>> Martijn Visser
>>>> https://twitter.com/MartijnVisser82
>>>>
>>>> [1]
>>>>
>>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=165227316
>>>> [2] https://issues.apache.org/jira/browse/FLINK-21529
>>>>
>>>

Re: Fwd: [DISCUSS] Flink's supported APIs and Hive query syntax

Posted by "罗宇侠(莫辞)" <lu...@alibaba-inc.com>.
Thanks Martijn for your insights.

About the tech debt/maintenance with regards to Hive query syntax, I would like to chip in and expect it to be resolved for Flink 1.16.

Best regards,

Yuxia

 ------------------ Original Message ------------------
From: Martijn Visser <ma...@apache.org>
Sent: Thu Mar 10 04:03:34 2022
To: User <us...@flink.apache.org>
Subject: Fwd: [DISCUSS] Flink's supported APIs and Hive query syntax

(Forwarding this also to the User mailing list as I made a typo when replying to this email thread)

---------- Forwarded message ---------
From: Martijn Visser <ma...@apache.org>
Date: Wed, 9 Mar 2022 at 20:57
Subject: Re: [DISCUSS] Flink's supported APIs and Hive query syntax
To: dev <de...@flink.apache.org>, Francesco Guardiani <fr...@ververica.com>, Timo Walther <tw...@apache.org>, <us...@flink.apache.org>


Hi everyone,

Thank you all very much for your input. From my perspective, I consider batch as a special case of streaming. So with Flink SQL, we can support both batch and streaming use cases and I think we should use Flink SQL as our target. 

To reply on some of the comments:

@Jing on your remark:
> Since Flink has a clear vision of unified batch and stream processing, supporting batch jobs will be one of the critical core features to help us reach the vision and let Flink have an even bigger impact in the industry.

I fully agree with that statement. I do think that having Hive syntax support doesn't help in that unified batch and stream processing. We're making it easier for batch users to run their Hive batch jobs on Flink, but that doesn't fit the "unified" part since it's focused on batch, while Flink SQL focuses on batch and streaming. I would rather have invested time in making batch improvements to Flink and Flink SQL than in Hive syntax support. I do understand from the given replies that Hive syntax support is valuable for those that are already running batch processing and would like to run these queries on Flink. I do think that's limited to mostly Chinese companies at the moment.

@Jark I think you've provided great input and are spot on with: 
> Regarding the maintenance concern you raised, I think that's a good point and it is in the plan. The Hive dialect is already a plugin and an option, and the implementation is located in the hive-connector module. We still need some work to make the Hive dialect rely purely on public APIs, and the Hive connector should be decoupled from the table planner. At that time, we can move the whole Hive connector into a separate repository (I guess this is also in the externalize-connectors plan).

I'm looping in Francesco and Timo who can elaborate more in depth on the current maintenance issues. I think we need to have a proper plan on how this tech debt/maintenance can be addressed and to get commitment that this will be resolved in Flink 1.16, since we indeed need to move out all previously agreed connectors before Flink 1.16 is released.

> From my perspective, Hive is still widely used and there exist many running Hive SQL jobs, so why not provide users a better experience to help them migrate Hive jobs to Flink? Also, it doesn't conflict with Flink SQL as it's just a dialect option.

I do think there is a conflict with Flink SQL; you can't use both of them at the same time, so you don't have access to all features in Flink. That increases feature sparsity and user friction. It also puts a bigger burden on the Flink community, because having both options available means more maintenance work. For example, an upgrade of Calcite is more impactful. The Flink codebase is already rather large and CI build times are already too long. More code means more risk of bugs. If a user at some point wants to change his Hive batch job to a streaming Flink SQL job, there's still migration work for the user, it just needs to happen at a later stage. 

@Jingsong I think you have a good argument that migrating SQL for Batch ETL is indeed an expensive effort. 

Last but not least, there was no one who has yet commented on the supported Hive versions and security issues. I've reached out to the Hive community and from the info I've received so far is that only Hive 3.1.x and Hive 2.3.x are still supported. The older Hive versions are no longer maintained and also don't receive security updates. This is important because many companies scan the Flink project for vulnerabilities and won't allow using it when these types of vulnerabilities are included. 

My summary would be the following:
* Like Jark said, in the short term, Hive syntax compatibility is the ticket for us to have a seat at the table in batch processing. Improved Hive syntax support in Flink can help with this.
* In the long term, we can and should drop it and focus on Flink SQL itself both for batch and stream processing.
* The Hive maintainers/volunteers should come up with a plan for how the tech debt/maintenance with regards to Hive query syntax can be addressed and will be resolved for Flink 1.16. This includes things like using public APIs and decoupling it from the planner. This is also extremely important since we want to move out connectors with Flink 1.16 (the next Flink release). I'm hoping that those who can help out with this will chip in.
* We follow the Hive versions that are still supported, which means we drop support for Hive 1.*, 2.1.x and 2.2.x and upgrade Hive 2.3 and Hive 3.1 to the latest version. 

Thanks again for your input and looking forward to your thoughts on this.

Best regards,

Martijn 
On Tue, 8 Mar 2022 at 10:39, 罗宇侠(莫辞) <lu...@alibaba-inc.com> wrote:

Hi Martijn,
Thanks for driving this discussion. 

About your concerns, I would like to share my opinion.
Actually, to be more exact, FLIP-152 [1] is not about extending Flink SQL itself to support Hive query syntax; it provides a Hive dialect option that enables users to switch to the Hive dialect. Judging from the commits for the corresponding FLINK-21529 [2], it doesn't involve many changes to Flink itself.

- About the struggle with maintenance: the current implementation just provides an option for users to use the Hive dialect. I don't think it will add much burden.

- Although Apache Hive is less popular than it was, it has been widely used as an open source data warehouse over the years. There still exist many Hive SQL jobs in many companies.

- As I said, the current implementation of Hive SQL syntax support is largely pluggable; we could also support Snowflake and others if necessary.

- As for the known security vulnerabilities of Hive, maybe that's not a critical problem in this discussion.

- The current implementation of Hive SQL syntax uses a pluggable HiveParser [3] to parse the SQL statement. I think it won't bring much complexity to Flink to support Hive syntax.
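The pluggable design described above can be sketched abstractly. The following is a simplified, hypothetical model of dialect dispatch: none of these type names are Flink's actual internal API (the real entry point is the HiveParser referenced in [3]); the sketch only illustrates how a dialect can contribute its own parser without touching the engine core.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of dialect dispatch: each dialect registers its own
// parser, and the engine selects one by a configuration value, so adding a
// dialect does not change the core parsing path.
public class DialectDispatch {

    // A minimal parser abstraction; a real parser would return an AST.
    interface SqlParser {
        String parse(String statement);
    }

    static final Map<String, SqlParser> PARSERS = new HashMap<>();

    static {
        // The "default" dialect is handled by the engine's own parser.
        PARSERS.put("default", stmt -> "flink-ast:" + stmt);
        // A dialect plugin registers its parser alongside it.
        PARSERS.put("hive", stmt -> "hive-ast:" + stmt);
    }

    static String parse(String dialect, String statement) {
        SqlParser parser = PARSERS.get(dialect);
        if (parser == null) {
            throw new IllegalArgumentException("Unknown dialect: " + dialect);
        }
        return parser.parse(statement);
    }

    public static void main(String[] args) {
        System.out.println(parse("hive", "SELECT 1"));    // hive-ast:SELECT 1
        System.out.println(parse("default", "SELECT 1")); // flink-ast:SELECT 1
    }
}
```

The point of the sketch is that supporting another dialect (say, a different vendor's syntax) would mean registering another parser, not modifying the planner.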

From my perspective, Hive is still widely used and there exist many running Hive SQL jobs, so why not provide users a better experience to help them migrate Hive jobs to Flink? Also, it doesn't conflict with Flink SQL as it's just a dialect option.

[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=165227316
[2] https://issues.apache.org/jira/browse/FLINK-21529
[3] https://issues.apache.org/jira/browse/FLINK-21531
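As a concrete illustration of the dialect option discussed here, switching dialects in the Flink SQL client uses the documented `table.sql-dialect` setting; the table names and query below are hypothetical:

```sql
-- In the Flink SQL client: switch from the default Flink dialect
-- to the Hive dialect for subsequent statements.
SET table.sql-dialect = hive;

-- Statements are now parsed by the pluggable Hive parser, e.g. a
-- Hive-style INSERT OVERWRITE (table names are made up):
INSERT OVERWRITE TABLE sales_summary
SELECT region, SUM(amount) FROM sales GROUP BY region;

-- Switch back to the default Flink dialect.
SET table.sql-dialect = default;
```

Because the dialect is just a session option, the same client can mix Hive-syntax batch statements and Flink SQL streaming statements in one session, one dialect at a time.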
Best regards,
Yuxia.



 ------------------ Original Message ------------------
From: Martijn Visser <ma...@apache.org>
Sent: Mon Mar 7 19:23:15 2022
To: dev <de...@flink.apache.org>, User <us...@flink.apache.org>
Subject: [DISCUSS] Flink's supported APIs and Hive query syntax
Hi everyone,

Flink currently has 4 APIs with multiple language support which can be used
to develop applications:

* DataStream API, both Java and Scala
* Table API, both Java and Scala
* Flink SQL, both in Flink query syntax and Hive query syntax (partially)
* Python API

Since FLIP-152 [1] the Flink SQL support has been extended to also support
the Hive query syntax. There is now a follow-up FLINK-26360 [2] to address
more syntax compatibility issues.

I would like to open a discussion on Flink directly supporting the Hive
query syntax. I have some concerns if having a 100% Hive query syntax is
indeed something that we should aim for in Flink.

I can understand that having Hive query syntax support in Flink could help
users due to interoperability and being able to migrate. However:

- Adding full Hive query syntax support will mean that we go from 6 fully
supported API/language combinations to 7. I think we are already
struggling to maintain the existing combinations, let alone one
more.
- Apache Hive is, or appears to be, a project that's no longer actively
developed. The last release was made in January 2021. Its popularity is
rapidly declining in Europe and the United States, also due to Hadoop
becoming less popular.
- Related to the previous topic, other software like Snowflake,
Trino/Presto, Databricks are becoming more and more popular. If we add full
support for the Hive query syntax, then why not add support for Snowflake
and the others?
- We are supporting Hive versions that are no longer supported by the Hive
community and have known security vulnerabilities. This also makes Flink
vulnerable to those types of vulnerabilities.
- The current Hive implementation relies on a lot of Flink internals,
making Flink hard to maintain, with lots of tech debt and making
things overly complex.

From my perspective, I think it would be better to not have Hive query
syntax compatibility directly in Flink itself. Of course we should have a
proper Hive connector and a proper Hive catalog to make connectivity with
Hive (the versions that are still supported by the Hive community) itself
possible. Alternatively, if Hive query syntax is so important, it should
not rely on internals but be available as a dialect/pluggable option. That
could also open up the possibility to add more syntax support for others in
the future, but I really think we should just focus on Flink SQL itself.
That's already hard enough to maintain and improve on.

I'm looking forward to the thoughts of both Developers and Users, so I'm
cross-posting to both mailing lists.

Best regards,

Martijn Visser
https://twitter.com/MartijnVisser82

[1]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=165227316
[2] https://issues.apache.org/jira/browse/FLINK-21529

Fwd: [DISCUSS] Flink's supported APIs and Hive query syntax

Posted by Martijn Visser <ma...@apache.org>.
(Forwarding this also to the User mailing list as I made a typo when
replying to this email thread)

---------- Forwarded message ---------
From: Martijn Visser <ma...@apache.org>
Date: Wed, 9 Mar 2022 at 20:57
Subject: Re: [DISCUSS] Flink's supported APIs and Hive query syntax
To: dev <de...@flink.apache.org>, Francesco Guardiani <fr...@ververica.com>,
Timo Walther <tw...@apache.org>, <us...@flink.apache.org>


Hi everyone,

Thank you all very much for your input. From my perspective, I consider
batch as a special case of streaming. So with Flink SQL, we can support
both batch and streaming use cases and I think we should use Flink SQL as
our target.

To reply on some of the comments:

@Jing on your remark:
> Since Flink has a clear vision of unified batch and stream processing,
supporting batch jobs will be one of the critical core features to help us
reach the vision and let Flink have an even bigger impact in the industry.

I fully agree with that statement. I do think that having Hive syntax
support doesn't help in that unified batch and stream processing. We're
making it easier for batch users to run their Hive batch jobs on Flink, but
that doesn't fit the "unified" part since it's focused on batch, while
Flink SQL focuses on batch and streaming. I would rather have invested
time in making batch improvements to Flink and Flink SQL than in
Hive syntax support. I do understand from the given replies that Hive
syntax support is valuable for those that are already running batch
processing and would like to run these queries on Flink. I do think that's
limited to mostly Chinese companies at the moment.

@Jark I think you've provided great input and are spot on with:
> Regarding the maintenance concern you raised, I think that's a good point
and it is in the plan. The Hive dialect is already a plugin and an
option, and the implementation is located in the hive-connector module. We
still need some work to make the Hive dialect rely purely on public APIs,
and the Hive connector should be decoupled from the table planner. At that time,
we can move the whole Hive connector into a separate repository (I guess
this is also in the externalize-connectors plan).

I'm looping in Francesco and Timo who can elaborate more in depth on the
current maintenance issues. I think we need to have a proper plan on how
this tech debt/maintenance can be addressed and to get commitment that this
will be resolved in Flink 1.16, since we indeed need to move out all
previously agreed connectors before Flink 1.16 is released.

> From my perspective, Hive is still widely used and there exist many
running Hive SQL jobs, so why not provide users a better experience to
help them migrate Hive jobs to Flink? Also, it doesn't conflict with Flink
SQL as it's just a dialect option.

I do think there is a conflict with Flink SQL; you can't use both of them
at the same time, so you don't have access to all features in Flink. That
increases feature sparsity and user friction. It also puts a bigger burden
on the Flink community, because having both options available means more
maintenance work. For example, an upgrade of Calcite is more impactful. The
Flink codebase is already rather large and CI build times are already too
long. More code means more risk of bugs. If a user at some point wants to
change his Hive batch job to a streaming Flink SQL job, there's still
migration work for the user, it just needs to happen at a later stage.

@Jingsong I think you have a good argument that migrating SQL for Batch ETL
is indeed an expensive effort.

Last but not least, there was no one who has yet commented on the supported
Hive versions and security issues. I've reached out to the Hive community
and from the info I've received so far is that only Hive 3.1.x and Hive
2.3.x are still supported. The older Hive versions are no longer maintained
and also don't receive security updates. This is important because many
companies scan the Flink project for vulnerabilities and won't allow using
it when these types of vulnerabilities are included.

My summary would be the following:
* Like Jark said, in the short term, Hive syntax compatibility is the
ticket for us to have a seat at the table in batch processing. Improved
Hive syntax support in Flink can help with this.
* In the long term, we can and should drop it and focus on Flink SQL itself
both for batch and stream processing.
* The Hive maintainers/volunteers should come up with a plan on how the
tech debt/maintenance with regards to Hive query syntax can be addressed
and will be resolved for Flink 1.16. This includes stuff like using public
APIs and decoupling it from the planner. This is also extremely important
since we want to move out connectors with Flink 1.16 (next Flink release).
I'm hoping that those who can help out with this will chip in.
* We follow the Hive versions that are still supported, which means we drop
support for Hive 1.*, 2.1.x and 2.2.x and upgrade Hive 2.3 and Hive 3.1 to
the latest version.

Thanks again for your input and looking forward to your thoughts on this.

Best regards,

Martijn

On Tue, 8 Mar 2022 at 10:39, 罗宇侠(莫辞) <lu...@alibaba-inc.com>
wrote:

> Hi Martijn,
> Thanks for driving this discussion.
>
> About your concerns, I would like to share my opinion.
>
> Actually, to be more exact, FLIP-152 [1] is not about extending Flink SQL itself to support Hive query syntax; it provides a Hive dialect option that enables users to switch to the Hive dialect. Judging from the commits for the corresponding FLINK-21529 [2], it doesn't involve many changes to Flink itself.
>
>
> - About the struggle with maintenance: the current implementation just provides an option for users to use the Hive dialect. I don't think it will add much burden.
>
>
> - Although Apache Hive is less popular than it was, it has been widely used as an open source data warehouse over the years. There still exist many Hive SQL jobs in many companies.
>
>
> - As I said, the current implementation of Hive SQL syntax support is largely pluggable; we could also support Snowflake and others if necessary.
>
>
> - As for the known security vulnerabilities of Hive, maybe that's not a critical problem in this discussion.
>
> - The current implementation of Hive SQL syntax uses a pluggable HiveParser [3] to parse the SQL statement. I think it won't bring much complexity to Flink
> to support Hive syntax.
>
>
> From my perspective, Hive is still widely used and there exist many running Hive SQL jobs, so why not provide users a better experience to help them migrate Hive jobs to Flink? Also, it doesn't conflict with Flink SQL as it's just a dialect option.
>
> [1]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=165227316
> [2] https://issues.apache.org/jira/browse/FLINK-21529
> [3] https://issues.apache.org/jira/browse/FLINK-21531
> Best regards,
> Yuxia.
>
>
> ------------------ Original Message ------------------
> *From:* Martijn Visser <ma...@apache.org>
> *Sent:* Mon Mar 7 19:23:15 2022
> *To:* dev <de...@flink.apache.org>, User <us...@flink.apache.org>
> *Subject:* [DISCUSS] Flink's supported APIs and Hive query syntax
>
>> Hi everyone,
>>
>>
>> Flink currently has 4 APIs with multiple language support which can be used
>> to develop applications:
>>
>> * DataStream API, both Java and Scala
>> * Table API, both Java and Scala
>> * Flink SQL, both in Flink query syntax and Hive query syntax (partially)
>> * Python API
>>
>> Since FLIP-152 [1] the Flink SQL support has been extended to also support
>> the Hive query syntax. There is now a follow-up FLINK-26360 [2] to address
>> more syntax compatibility issues.
>>
>> I would like to open a discussion on Flink directly supporting the Hive
>> query syntax. I have some concerns if having a 100% Hive query syntax is
>> indeed something that we should aim for in Flink.
>>
>> I can understand that having Hive query syntax support in Flink could help
>> users due to interoperability and being able to migrate. However:
>>
>> - Adding full Hive query syntax support will mean that we go from 6 fully
>> supported API/language combinations to 7. I think we are already
>> struggling to maintain the existing combinations, let alone one
>> more.
>>
>> - Apache Hive is, or appears to be, a project that's no longer actively
>> developed. The last release was made in January 2021. Its popularity is
>> rapidly declining in Europe and the United States, also due to Hadoop
>> becoming less popular.
>> - Related to the previous topic, other software like Snowflake,
>>
>> Trino/Presto, Databricks are becoming more and more popular. If we add full
>> support for the Hive query syntax, then why not add support for Snowflake
>> and the others?
>> - We are supporting Hive versions that are no longer supported by the Hive
>> community and have known security vulnerabilities. This also makes Flink
>> vulnerable to those types of vulnerabilities.
>> - The current Hive implementation relies on a lot of Flink internals,
>> making Flink hard to maintain, with lots of tech debt and making
>> things overly complex.
>>
>> From my perspective, I think it would be better to not have Hive query
>> syntax compatibility directly in Flink itself. Of course we should have a
>> proper Hive connector and a proper Hive catalog to make connectivity with
>> Hive (the versions that are still supported by the Hive community) itself
>> possible. Alternatively, if Hive query syntax is so important, it should
>> not rely on internals but be available as a dialect/pluggable option. That
>>
>> could also open up the possibility to add more syntax support for others in
>> the future, but I really think we should just focus on Flink SQL itself.
>> That's already hard enough to maintain and improve on.
>>
>> I'm looking forward to the thoughts of both Developers and Users, so I'm
>> cross-posting to both mailing lists.
>>
>> Best regards,
>>
>> Martijn Visser
>> https://twitter.com/MartijnVisser82
>>
>> [1]
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=165227316
>> [2] https://issues.apache.org/jira/browse/FLINK-21529
>>
>

Re: [DISCUSS] Flink's supported APIs and Hive query syntax

Posted by Martijn Visser <ma...@apache.org>.
Hi everyone,

Thank you all very much for your input. From my perspective, I consider
batch as a special case of streaming. So with Flink SQL, we can support
both batch and streaming use cases and I think we should use Flink SQL as
our target.

To reply on some of the comments:

@Jing on your remark:
> Since Flink has a clear vision of unified batch and stream processing,
supporting batch jobs will be one of the critical core features to help us
reach the vision and let Flink have an even bigger impact in the industry.

I fully agree with that statement. I do think that having Hive syntax
support doesn't help in that unified batch and stream processing. We're
making it easier for batch users to run their Hive batch jobs on Flink, but
that doesn't fit the "unified" part since it's focused on batch, while
Flink SQL focuses on batch and streaming. I would rather have invested
time in making batch improvements to Flink and Flink SQL than in
Hive syntax support. I do understand from the given replies that Hive
syntax support is valuable for those that are already running batch
processing and would like to run these queries on Flink. I do think that's
limited to mostly Chinese companies at the moment.

@Jark I think you've provided great input and are spot on with:
> Regarding the maintenance concern you raised, I think that's a good point
and they are in the plan. The Hive dialect has already been a plugin and
option now, and the implementation is located in hive-connector module. We
still need some work to make the Hive dialect purely rely on public APIs,
and the Hive connector should be decoupled from the table planner. At that time,
we can move the whole Hive connector into a separate repository (I guess
this is also in the externalize connectors plan).

I'm looping in Francesco and Timo who can elaborate more in depth on the
current maintenance issues. I think we need to have a proper plan on how
this tech debt/maintenance can be addressed and to get commitment that this
will be resolved in Flink 1.16, since we indeed need to move out all
previously agreed connectors before Flink 1.16 is released.

> From my perspective, Hive is still widely used and there exist many
running Hive SQL jobs, so why not provide users a better experience to
help them migrate Hive jobs to Flink? Also, it doesn't conflict with Flink
SQL as it's just a dialect option.

I do think there is a conflict with Flink SQL; you can't use both of them
at the same time, so you don't have access to all features in Flink. That
increases feature sparsity and user friction. It also puts a bigger burden
on the Flink community, because having both options available means more
maintenance work. For example, an upgrade of Calcite is more impactful. The
Flink codebase is already rather large and CI build times are already too
long. More code means more risk of bugs. If a user at some point wants to
change their Hive batch job to a streaming Flink SQL job, there's still
migration work for the user; it just needs to happen at a later stage.

@Jingsong I think you have a good argument that migrating SQL for Batch ETL
is indeed an expensive effort.

Last but not least, no one has yet commented on the supported
Hive versions and security issues. I've reached out to the Hive community,
and the info I've received so far is that only Hive 3.1.x and Hive
2.3.x are still supported. The older Hive versions are no longer maintained
and also don't receive security updates. This is important because many
companies scan the Flink project for vulnerabilities and won't allow using
it when these types of vulnerabilities are included.

My summary would be the following:
* Like Jark said, in the short term, Hive syntax compatibility is the
ticket for us to have a seat in batch processing. Improved Hive syntax
support in Flink can help with this.
* In the long term, we can and should drop it and focus on Flink SQL itself
both for batch and stream processing.
* The Hive maintainers/volunteers should come up with a plan for how the
tech debt/maintenance with regard to Hive query syntax can be addressed
and resolved for Flink 1.16. This includes things like using public
APIs and decoupling it from the planner. This is also extremely important
since we want to move out connectors with Flink 1.16 (the next Flink
release). I'm hoping that those who can help out with this will chip in.
* We follow the Hive versions that are still supported, which means we drop
support for Hive 1.*, 2.1.x and 2.2.x and upgrade Hive 2.3 and Hive 3.1 to
the latest version.

Thanks again for your input and looking forward to your thoughts on this.

Best regards,

Martijn

On Tue, 8 Mar 2022 at 10:39, 罗宇侠(莫辞) <lu...@alibaba-inc.com>
wrote:

> Hi Martijn,
> Thanks for driving this discussion.
>
> About your concerns, I would like to share my opinion.
>
> Actually, more exactly, FLIP-152 [1] does not extend Flink SQL itself to support Hive query syntax; it provides a Hive dialect option that enables users to switch to the Hive dialect. Judging from the commits for the corresponding FLINK-21529, it doesn't involve much of Flink itself.
>
>
> - About the maintenance struggle: the current implementation just provides an option for users to use the Hive dialect. I think there won't be much bother.
>
>
> - Although Apache Hive is less popular, it has been widely used as an open source data warehouse over the years. There still exist many Hive SQL jobs in many companies.
>
>
> - As I said, the current implementation of Hive SQL syntax is more like a pluggable one; we could also support Snowflake and the others if necessary.
>
> - As for the known security vulnerabilities of Hive, maybe that's not a critical problem in this discussion.
>
> - For the current implementation of Hive SQL syntax, it uses a pluggable HiveParser [3] to parse the SQL statement. I think there won't be much complexity brought to Flink
> to support Hive syntax.
>
>
> From my perspective, Hive is still widely used and there exist many running Hive SQL jobs, so why not provide users a better experience to help them migrate Hive jobs to Flink? Also, it doesn't conflict with Flink SQL as it's just a dialect option.
>
> [1]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=165227316
> [2] https://issues.apache.org/jira/browse/FLINK-21529
> [3] https://issues.apache.org/jira/browse/FLINK-21531
> Best regards,
> Yuxia.
>
>
> ------------------ Original Message ------------------
> *From:* Martijn Visser <ma...@apache.org>
> *Sent:* Mon Mar 7 19:23:15 2022
> *To:* dev <de...@flink.apache.org>, User <us...@flink.apache.org>
> *Subject:* [DISCUSS] Flink's supported APIs and Hive query syntax
>
>> Hi everyone,
>>
>>
>> Flink currently has 4 APIs with multiple language support which can be used
>> to develop applications:
>>
>> * DataStream API, both Java and Scala
>> * Table API, both Java and Scala
>> * Flink SQL, both in Flink query syntax and Hive query syntax (partially)
>> * Python API
>>
>> Since FLIP-152 [1] the Flink SQL support has been extended to also support
>> the Hive query syntax. There is now a follow-up FLINK-26360 [2] to address
>> more syntax compatibility issues.
>>
>> I would like to open a discussion on Flink directly supporting the Hive
>> query syntax. I have some concerns if having a 100% Hive query syntax is
>> indeed something that we should aim for in Flink.
>>
>> I can understand that having Hive query syntax support in Flink could help
>> users due to interoperability and being able to migrate. However:
>>
>> - Adding full Hive query syntax support will mean that we go from 6 fully
>> supported API/language combinations to 7. I think we are currently already
>> struggling with maintaining the existing combinations, let alone one
>> more.
>>
>> - Apache Hive is/appears to be a project that's no longer actively developed.
>> The last release was made in January 2021. Its popularity is
>> rapidly declining in Europe and the United States, also due to Hadoop becoming
>> less popular.
>> - Related to the previous topic, other software like Snowflake,
>>
>> Trino/Presto, Databricks are becoming more and more popular. If we add full
>> support for the Hive query syntax, then why not add support for Snowflake
>> and the others?
>> - We are supporting Hive versions that are no longer supported by the Hive
>> community with known security vulnerabilities. This makes Flink also
>> vulnerable for those types of vulnerabilities.
>> - The current Hive implementation uses a lot of Flink internals,
>> making Flink hard to maintain, with lots of tech debt and making
>> things overly complex.
>>
>> From my perspective, I think it would be better to not have Hive query
>> syntax compatibility directly in Flink itself. Of course we should have a
>> proper Hive connector and a proper Hive catalog to make connectivity with
>> Hive (the versions that are still supported by the Hive community) itself
>> possible. Alternatively, if Hive query syntax is so important, it should
>> not rely on internals but be available as a dialect/pluggable option. That
>>
>> could also open up the possibility to add more syntax support for others in
>> the future, but I really think we should just focus on Flink SQL itself.
>> That's already hard enough to maintain and improve on.
>>
>> I'm looking forward to the thoughts of both Developers and Users, so I'm
>> cross-posting to both mailing lists.
>>
>> Best regards,
>>
>> Martijn Visser
>> https://twitter.com/MartijnVisser82
>>
>> [1]
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=165227316
>> [2] https://issues.apache.org/jira/browse/FLINK-21529
>>
>

Re: [DISCUSS] Flink's supported APIs and Hive query syntax

Posted by "罗宇侠(莫辞)" <lu...@alibaba-inc.com.INVALID>.
Hi Martijn,
Thanks for driving this discussion. 

About your concerns, I would like to share my opinion.
Actually, more exactly, FLIP-152 [1] does not extend Flink SQL itself to support Hive query syntax; it provides a Hive dialect option that enables users to switch to the Hive dialect. Judging from the commits for the corresponding FLINK-21529, it doesn't involve much of Flink itself.

- About the maintenance struggle: the current implementation just provides an option for users to use the Hive dialect. I think there won't be much bother.

- Although Apache Hive is less popular, it has been widely used as an open source data warehouse over the years. There still exist many Hive SQL jobs in many companies.

- As I said, the current implementation of Hive SQL syntax is more like a pluggable one; we could also support Snowflake and the others if necessary.

- As for the known security vulnerabilities of Hive, maybe that's not a critical problem in this discussion.

- For the current implementation of Hive SQL syntax, it uses a pluggable HiveParser [3] to parse the SQL statement. I think there won't be much complexity brought to Flink to support Hive syntax.

From my perspective, Hive is still widely used and there exist many running Hive SQL jobs, so why not provide users a better experience to help them migrate Hive jobs to Flink? Also, it doesn't conflict with Flink SQL as it's just a dialect option.
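As a minimal sketch of what the dialect option looks like from a user's point of view (assuming the `table.sql-dialect` session option as documented for the Flink SQL client around this time):

```sql
-- In the Flink SQL client: switch the parser to the Hive dialect (FLIP-152),
-- which routes statements through the pluggable HiveParser
SET table.sql-dialect = hive;

-- ... run Hive-syntax statements here ...

-- Switch back to the default Flink dialect at any point in the same session
SET table.sql-dialect = default;
```

Because this is per-session configuration rather than a separate engine, what has to be maintained is the dialect plugin itself, not a parallel SQL stack.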

[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=165227316
[2] https://issues.apache.org/jira/browse/FLINK-21529
[3] https://issues.apache.org/jira/browse/FLINK-21531
Best regards,
Yuxia.


 ------------------ Original Message ------------------
From: Martijn Visser <ma...@apache.org>
Sent: Mon Mar 7 19:23:15 2022
To: dev <de...@flink.apache.org>, User <us...@flink.apache.org>
Subject: [DISCUSS] Flink's supported APIs and Hive query syntax
Hi everyone,

Flink currently has 4 APIs with multiple language support which can be used
to develop applications:

* DataStream API, both Java and Scala
* Table API, both Java and Scala
* Flink SQL, both in Flink query syntax and Hive query syntax (partially)
* Python API

Since FLIP-152 [1] the Flink SQL support has been extended to also support
the Hive query syntax. There is now a follow-up FLINK-26360 [2] to address
more syntax compatibility issues.

I would like to open a discussion on Flink directly supporting the Hive
query syntax. I have some concerns if having a 100% Hive query syntax is
indeed something that we should aim for in Flink.

I can understand that having Hive query syntax support in Flink could help
users due to interoperability and being able to migrate. However:

- Adding full Hive query syntax support will mean that we go from 6 fully
supported API/language combinations to 7. I think we are currently already
struggling with maintaining the existing combinations, let alone one
more.
- Apache Hive is/appears to be a project that's no longer actively developed.
The last release was made in January 2021. Its popularity is
rapidly declining in Europe and the United States, also due to Hadoop becoming
less popular.
- Related to the previous topic, other software like Snowflake,
Trino/Presto, Databricks are becoming more and more popular. If we add full
support for the Hive query syntax, then why not add support for Snowflake
and the others?
- We are supporting Hive versions that are no longer supported by the Hive
community with known security vulnerabilities. This makes Flink also
vulnerable for those types of vulnerabilities.
- The current Hive implementation uses a lot of Flink internals,
making Flink hard to maintain, with lots of tech debt and making
things overly complex.

From my perspective, I think it would be better to not have Hive query
syntax compatibility directly in Flink itself. Of course we should have a
proper Hive connector and a proper Hive catalog to make connectivity with
Hive (the versions that are still supported by the Hive community) itself
possible. Alternatively, if Hive query syntax is so important, it should
not rely on internals but be available as a dialect/pluggable option. That
could also open up the possibility to add more syntax support for others in
the future, but I really think we should just focus on Flink SQL itself.
That's already hard enough to maintain and improve on.

I'm looking forward to the thoughts of both Developers and Users, so I'm
cross-posting to both mailing lists.

Best regards,

Martijn Visser
https://twitter.com/MartijnVisser82

[1]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=165227316
[2] https://issues.apache.org/jira/browse/FLINK-21529

Re: [DISCUSS] Flink's supported APIs and Hive query syntax

Posted by Jingsong Li <ji...@gmail.com>.
Thanks all for your discussions.

I'll share my opinion here:

1. Hive SQL and Hive-like SQL are the absolute mainstay of current
Batch ETL in China. Hive+Spark (HiveSQL-like)+Databricks also occupies
a large market worldwide.

- Unlike OLAP SQL (such as Presto, which is ANSI SQL rather than Hive
SQL), Batch ETL runs periodically, which means that a large number
of batch pipelines have already been built; if they need to be
migrated to a new system, migrating the SQLs will be extremely
costly.
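As a concrete (hypothetical) example of the statements such batch pipelines are full of, note the Hive-only `TABLE` keyword in `INSERT OVERWRITE TABLE ... PARTITION`; the default Flink dialect omits `TABLE`, and Hive-only clauses such as `SORT BY` or `DISTRIBUTE BY` have no direct equivalent, so every statement would need touching during a migration:

```sql
-- Hive-dialect batch ETL statement (table and column names are made up):
INSERT OVERWRITE TABLE dws.daily_active_users PARTITION (dt = '2022-03-07')
SELECT user_id, COUNT(1) AS event_cnt
FROM ods.app_events
WHERE dt = '2022-03-07'
GROUP BY user_id;
```

Multiplied over thousands of scheduled pipelines, rewriting even small differences like these is where the migration cost comes from.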

2. Our current Hive dialect is immature and we need to put more effort
into decoupling it from the Flink planner.

Best,
Jingsong

On Tue, Mar 8, 2022 at 4:27 PM Zou Dan <zo...@163.com> wrote:
>
> Hi Martijn,
> Thanks for bringing this up.
> Hive SQL (used in Hive & Spark) plays an important role in batch processing; it has almost become the de facto standard there. In our company, there are hundreds of thousands of Spark jobs each day.
> IMO, if we want to promote Flink batch, Hive syntax compatibility is a crucial point of it.
> Thanks to this feature, we have migrated 800+ Spark jobs to Flink smoothly.
>
> So, I quite agree with putting more effort into Hive syntax compatibility.
>
> Best,
> Dan Zou
>
> On Mon, 7 Mar 2022 at 19:23, Martijn Visser <ma...@apache.org> wrote:
>
> query
>
>

Re: [DISCUSS] Flink's supported APIs and Hive query syntax

Posted by Zou Dan <zo...@163.com>.
Hi Martijn,
Thanks for bringing this up.
Hive SQL (used in Hive & Spark) plays an important role in batch processing; it has almost become the de facto standard there. In our company, there are hundreds of thousands of Spark jobs each day.
IMO, if we want to promote Flink batch, Hive syntax compatibility is a crucial point of it.
Thanks to this feature, we have migrated 800+ Spark jobs to Flink smoothly.

So, I quite agree with putting more effort into Hive syntax compatibility.

Best,
Dan Zou

> On Mon, 7 Mar 2022 at 19:23, Martijn Visser <ma...@apache.org> wrote:
> 
> query


Re: Re: [DISCUSS] Flink's supported APIs and Hive query syntax

Posted by Jark Wu <im...@gmail.com>.
Hi Martijn,

Thanks for starting this discussion. I think it's great
for the community to reach a consensus on the roadmap
of Hive query syntax.

I agree that the Hive project is not actively developed nowadays.
However, Hive still occupies the majority of the batch market
and the Hive ecosystem is even more active now. For example,
the Apache Kyuubi[1] is a new project that is a JDBC server
which is compatible with HiveServer2. And the Apache Iceberg
and Apache Hudi are mainly using Hive Metastore as the table catalog.
The Spark SQL is 99% compatible with Hive SQL. We have to admit
that Hive is the open-source de facto standard for batch processing.

As far as I can see, almost all the companies (including ByteDance,
Kuaishou, NetEase, etc.) in China are using Hive SQL for batch
processing, even when the underlying engine is Spark.
I don't know how the batch users can migrate to Flink if Flink
doesn't provide the Hive compatibility. IMO, in the short term,
Hive syntax compatibility is the ticket for us to have a seat
in the batch processing. In the long term, we can drop it and
focus on Flink SQL itself both for batch and stream processing.

Regarding the maintenance concern you raised, I think that's a good
point and they are in the plan. The Hive dialect has already been
a plugin and option now, and the implementation is located in
hive-connector module. We still need some work to make the Hive
dialect purely rely on public APIs, and the Hive connector should be
decoupled from the table planner. At that time, we can move the whole Hive
connector into a separate repository (I guess this is also in the
externalize connectors plan).

What do you think?

Best,
Jark

[1]:
https://kyuubi.apache.org/docs/latest/overview/kyuubi_vs_thriftserver.html
[2]: https://iceberg.apache.org/docs/latest/spark-configuration/
[3]: https://hudi.apache.org/docs/next/syncing_metastore/

On Tue, 8 Mar 2022 at 11:46, Mang Zhang <zh...@163.com> wrote:

> Hi Martijn,
>
> Thanks for driving this discussion.
>
> +1 on efforts on more Hive/Spark syntax compatibility. The Hive/Spark
> syntax is the most popular in batch computing. Within our company, many
> users have the desire to use Flink to realize the integration of streaming
> and batching, and some users have been running it in production for months. And
> we have integrated Flink with our internal remote shuffle service; Flink
> saves users a lot of development and maintenance costs, and user feedback is very
> good. Enriching Flink's ecology provides users with more choices, so I think
> pluggable support for Hive/Spark dialects is very necessary. We need better
> designs for future multi-source fusion.
>
>
>
>
>
>
>
> Best regards,
>
> Mang Zhang
>
>
>
>
>
> At 2022-03-07 20:52:42, "Jing Zhang" <be...@gmail.com> wrote:
> >Hi Martijn,
> >
> >Thanks for driving this discussion.
> >
> >+1 on efforts on more hive syntax compatibility.
> >
> >With the efforts on batch processing in recent versions (1.10–1.15), many
> >users have run batch processing jobs based on Flink.
> >In our team, we are trying to migrate most of the existing online batch
> >jobs from Hive/Spark to Flink. We hope this migration does not require
> >users to modify their sql.
> >Although Hive is not as popular as it used to be, Hive SQL is still alive
> >because many users still use Hive SQL to run Spark jobs.
> >Therefore, compatibility with more Hive syntax is critical to this
> >migration work.
> >
> >Best,
> >Jing Zhang
> >
> >
> >
> >On Mon, 7 Mar 2022 at 19:23, Martijn Visser <ma...@apache.org> wrote:
> >
> >> Hi everyone,
> >>
> >> Flink currently has 4 APIs with multiple language support which can be
> used
> >> to develop applications:
> >>
> >> * DataStream API, both Java and Scala
> >> * Table API, both Java and Scala
> >> * Flink SQL, both in Flink query syntax and Hive query syntax
> (partially)
> >> * Python API
> >>
> >> Since FLIP-152 [1] the Flink SQL support has been extended to also
> support
> >> the Hive query syntax. There is now a follow-up FLINK-26360 [2] to
> address
> >> more syntax compatibility issues.
> >>
> >> I would like to open a discussion on Flink directly supporting the Hive
> >> query syntax. I have some concerns if having a 100% Hive query syntax is
> >> indeed something that we should aim for in Flink.
> >>
> >> I can understand that having Hive query syntax support in Flink could
> help
> >> users due to interoperability and being able to migrate. However:
> >>
> >> - Adding full Hive query syntax support will mean that we go from 6 fully
> >> supported API/language combinations to 7. I think we are currently already
> >> struggling with maintaining the existing combinations, let alone one
> >> more.
> >> - Apache Hive is/appears to be a project that's no longer actively developed.
> >> The last release was made in January 2021. Its popularity is
> >> rapidly declining in Europe and the United States, also due to Hadoop
> >> becoming less popular.
> >> - Related to the previous topic, other software like Snowflake,
> >> Trino/Presto, Databricks are becoming more and more popular. If we add full
> >> support for the Hive query syntax, then why not add support for Snowflake
> >> and the others?
> >> - We are supporting Hive versions that are no longer supported by the Hive
> >> community with known security vulnerabilities. This makes Flink also
> >> vulnerable for those types of vulnerabilities.
> >> - The current Hive implementation uses a lot of Flink internals,
> >> making Flink hard to maintain, with lots of tech debt and making
> >> things overly complex.
> >>
> >> From my perspective, I think it would be better to not have Hive query
> >> syntax compatibility directly in Flink itself. Of course we should have a
> >> proper Hive connector and a proper Hive catalog to make connectivity with
> >> Hive (the versions that are still supported by the Hive community) itself
> >> possible. Alternatively, if Hive query syntax is so important, it should
> >> not rely on internals but be available as a dialect/pluggable option. That
> >> could also open up the possibility to add more syntax support for others in
> >> the future, but I really think we should just focus on Flink SQL itself.
> >> That's already hard enough to maintain and improve on.
> >>
> >> I'm looking forward to the thoughts of both Developers and Users, so I'm
> >> cross-posting to both mailing lists.
> >>
> >> Best regards,
> >>
> >> Martijn Visser
> >> https://twitter.com/MartijnVisser82
> >>
> >> [1]
> >>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=165227316
> >> [2] https://issues.apache.org/jira/browse/FLINK-21529
> >>
>


Re:Re: [DISCUSS] Flink's supported APIs and Hive query syntax

Posted by Mang Zhang <zh...@163.com>.
Hi Martijn,

Thanks for driving this discussion.

+1 on efforts on more Hive/Spark syntax compatibility. The Hive/Spark syntax is the most popular in batch computing. Within our company, many users have the desire to use Flink to realize the integration of streaming and batching, and some users have been running it in production for months. And we have integrated Flink with our internal remote shuffle service; Flink saves users a lot of development and maintenance costs, and user feedback is very good. Enriching Flink's ecology provides users with more choices, so I think pluggable support for Hive/Spark dialects is very necessary. We need better designs for future multi-source fusion.







Best regards,

Mang Zhang





At 2022-03-07 20:52:42, "Jing Zhang" <be...@gmail.com> wrote:
>Hi Martijn,
>
>Thanks for driving this discussion.
>
>+1 on efforts on more hive syntax compatibility.
>
>With the efforts on batch processing in recent versions (1.10–1.15), many
>users have run batch processing jobs based on Flink.
>In our team, we are trying to migrate most of the existing online batch
>jobs from Hive/Spark to Flink. We hope this migration does not require
>users to modify their sql.
>Although Hive is not as popular as it used to be, Hive SQL is still alive
>because many users still use Hive SQL to run Spark jobs.
>Therefore, compatibility with more Hive syntax is critical to this
>migration work.
>
>Best,
>Jing Zhang
>
>
>
>On Mon, 7 Mar 2022 at 19:23, Martijn Visser <ma...@apache.org> wrote:
>
>> Hi everyone,
>>
>> Flink currently has 4 APIs with multiple language support which can be used
>> to develop applications:
>>
>> * DataStream API, both Java and Scala
>> * Table API, both Java and Scala
>> * Flink SQL, both in Flink query syntax and Hive query syntax (partially)
>> * Python API
>>
>> Since FLIP-152 [1] the Flink SQL support has been extended to also support
>> the Hive query syntax. There is now a follow-up FLINK-26360 [2] to address
>> more syntax compatibility issues.
>>
>> I would like to open a discussion on Flink directly supporting the Hive
>> query syntax. I have some concerns if having a 100% Hive query syntax is
>> indeed something that we should aim for in Flink.
>>
>> I can understand that having Hive query syntax support in Flink could help
>> users due to interoperability and being able to migrate. However:
>>
>> - Adding full Hive query syntax support will mean that we go from 6 fully
>> supported API/language combinations to 7. I think we are currently already
>> struggling with maintaining the existing combinations, let alone one
>> more.
>> - Apache Hive is/appears to be a project that's not that actively developed
>> anymore. The last release was made in January 2021. Its popularity is
>> rapidly declining in Europe and the United States, also due to Hadoop
>> becoming less popular.
>> - Related to the previous topic, other software like Snowflake,
>> Trino/Presto, Databricks are becoming more and more popular. If we add full
>> support for the Hive query syntax, then why not add support for Snowflake
>> and the others?
>> - We are supporting Hive versions that are no longer supported by the Hive
>> community with known security vulnerabilities. This also makes Flink
>> vulnerable to those types of vulnerabilities.
>> - The current Hive implementation is done by using a lot of internals of
>> Flink, making Flink hard to maintain, with lots of tech debt and making
>>
>> From my perspective, I think it would be better to not have Hive query
>> syntax compatibility directly in Flink itself. Of course we should have a
>> proper Hive connector and a proper Hive catalog to make connectivity with
>> Hive (the versions that are still supported by the Hive community) itself
>> possible. Alternatively, if Hive query syntax is so important, it should
>> not rely on internals but be available as a dialect/pluggable option. That
>> could also open up the possibility to add more syntax support for others in
>> the future, but I really think we should just focus on Flink SQL itself.
>> That's already hard enough to maintain and improve on.
>>
>> I'm looking forward to the thoughts of both Developers and Users, so I'm
>> cross-posting to both mailing lists.
>>
>> Best regards,
>>
>> Martijn Visser
>> https://twitter.com/MartijnVisser82
>>
>> [1]
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=165227316
>> [2] https://issues.apache.org/jira/browse/FLINK-21529
>>
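
For context on the dialect/pluggable option discussed in the thread: Flink already exposes a session-level dialect switch in the SQL client, which is the kind of mechanism being debated. A minimal sketch follows; the exact option name and behavior depend on the Flink version, and the Hive dialect/connector jars must be on the classpath, so treat this as illustrative rather than a verified recipe.

```sql
-- Sketch: switching the SQL dialect per session in the Flink SQL client.
-- Assumes a Flink build with Hive dialect support; not verified against
-- any specific release.

-- Flink's own query syntax (the default)
SET table.sql-dialect = default;

-- Switch this session to the Hive query syntax
SET table.sql-dialect = hive;

-- Subsequent statements are parsed with the Hive dialect, e.g. Hive-style
-- DDL such as:
-- CREATE TABLE t (x INT) PARTITIONED BY (dt STRING) STORED AS ORC;
```

In the Table API, the equivalent switch (as of recent 1.x releases) is `tableEnv.getConfig().setSqlDialect(SqlDialect.HIVE)`.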

Re: [DISCUSS] Flink's supported APIs and Hive query syntax

Posted by Jing Ge <ji...@ververica.com>.
Hi,

Thanks Martijn for driving this discussion. Your concerns are very
rational.

We should do our best to keep the Flink development on the right track. I
would suggest discussing it in a vision/goal oriented way. Since Flink has
a clear vision of unified batch and stream processing, supporting batch
jobs will be one of the critical core features to help us reach the vision
and let Flink have an even bigger impact in the industry. I fully agree
with you that we should not focus on the Hive query syntax. Instead, we
should build a plan/schedule to support batch query syntax for the vision.
If there is any conflict between Hive query syntax and common batch query
syntax, we should stick with the common batch query syntax. For any
Hive-specific query syntax that is not supported as a common case by
other batch processing engines, we should think very carefully and
implement it as a dialect extension like you suggested, but only when it
is a critical business requirement and has a broad impact on many use
cases. Last but not least, from an architecture perspective, it is good
to have the
capability to support arbitrary syntax via dialect/extension/plugin. But it
will also require a lot of effort to make it happen. Trade-off is always
the key. Currently, I have to agree with you again, we should focus more on
the common (batch) cases.


Best regards,
Jing

On Mon, Mar 7, 2022 at 1:53 PM Jing Zhang <be...@gmail.com> wrote:

> Hi Martijn,
>
> Thanks for driving this discussion.
>
> +1 on efforts on more hive syntax compatibility.
>
> With the efforts on batch processing in recent versions (1.10~1.15), many
> users have run batch processing jobs based on Flink.
> In our team, we are trying to migrate most of the existing online batch
> jobs from Hive/Spark to Flink. We hope this migration does not require
> users to modify their SQL.
> Although Hive is not as popular as it used to be, Hive SQL is still alive
> because many users still use Hive SQL to run Spark jobs.
> Therefore, compatibility with more Hive syntax is critical to this
> migration work.
>
> Best,
> Jing Zhang
>
>
>
> Martijn Visser <ma...@apache.org> wrote on Mon, Mar 7, 2022, at 19:23:
>
>> Hi everyone,
>>
>> Flink currently has 4 APIs with multiple language support which can be
>> used
>> to develop applications:
>>
>> * DataStream API, both Java and Scala
>> * Table API, both Java and Scala
>> * Flink SQL, both in Flink query syntax and Hive query syntax (partially)
>> * Python API
>>
>> Since FLIP-152 [1] the Flink SQL support has been extended to also support
>> the Hive query syntax. There is now a follow-up FLINK-26360 [2] to address
>> more syntax compatibility issues.
>>
>> I would like to open a discussion on Flink directly supporting the Hive
>> query syntax. I have some concerns if having a 100% Hive query syntax is
>> indeed something that we should aim for in Flink.
>>
>> I can understand that having Hive query syntax support in Flink could help
>> users due to interoperability and being able to migrate. However:
>>
>> - Adding full Hive query syntax support will mean that we go from 6 fully
>> supported API/language combinations to 7. I think we are currently already
>> struggling with maintaining the existing combinations, let alone one
>> more.
>> - Apache Hive is/appears to be a project that's not that actively
>> developed
>> anymore. The last release was made in January 2021. Its popularity is
>> rapidly declining in Europe and the United States, also due to Hadoop
>> becoming less popular.
>> - Related to the previous topic, other software like Snowflake,
>> Trino/Presto, Databricks are becoming more and more popular. If we add
>> full
>> support for the Hive query syntax, then why not add support for Snowflake
>> and the others?
>> - We are supporting Hive versions that are no longer supported by the Hive
>> community with known security vulnerabilities. This also makes Flink
>> vulnerable to those types of vulnerabilities.
>> - The current Hive implementation is done by using a lot of internals of
>> Flink, making Flink hard to maintain, with lots of tech debt and making
>> things overly complex.
>>
>> From my perspective, I think it would be better to not have Hive query
>> syntax compatibility directly in Flink itself. Of course we should have a
>> proper Hive connector and a proper Hive catalog to make connectivity with
>> Hive (the versions that are still supported by the Hive community) itself
>> possible. Alternatively, if Hive query syntax is so important, it should
>> not rely on internals but be available as a dialect/pluggable option. That
>> could also open up the possibility to add more syntax support for others
>> in
>> the future, but I really think we should just focus on Flink SQL itself.
>> That's already hard enough to maintain and improve on.
>>
>> I'm looking forward to the thoughts of both Developers and Users, so I'm
>> cross-posting to both mailing lists.
>>
>> Best regards,
>>
>> Martijn Visser
>> https://twitter.com/MartijnVisser82
>>
>> [1]
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=165227316
>> [2] https://issues.apache.org/jira/browse/FLINK-21529
>>
>


Re: [DISCUSS] Flink's supported APIs and Hive query syntax

Posted by Jing Zhang <be...@gmail.com>.
Hi Martijn,

Thanks for driving this discussion.

+1 on efforts on more hive syntax compatibility.

With the efforts on batch processing in recent versions (1.10~1.15), many
users have run batch processing jobs based on Flink.
In our team, we are trying to migrate most of the existing online batch
jobs from Hive/Spark to Flink. We hope this migration does not require
users to modify their SQL.
Although Hive is not as popular as it used to be, Hive SQL is still alive
because many users still use Hive SQL to run Spark jobs.
Therefore, compatibility with more Hive syntax is critical to this
migration work.

Best,
Jing Zhang



Martijn Visser <ma...@apache.org> wrote on Mon, Mar 7, 2022, at 19:23:

> Hi everyone,
>
> Flink currently has 4 APIs with multiple language support which can be used
> to develop applications:
>
> * DataStream API, both Java and Scala
> * Table API, both Java and Scala
> * Flink SQL, both in Flink query syntax and Hive query syntax (partially)
> * Python API
>
> Since FLIP-152 [1] the Flink SQL support has been extended to also support
> the Hive query syntax. There is now a follow-up FLINK-26360 [2] to address
> more syntax compatibility issues.
>
> I would like to open a discussion on Flink directly supporting the Hive
> query syntax. I have some concerns if having a 100% Hive query syntax is
> indeed something that we should aim for in Flink.
>
> I can understand that having Hive query syntax support in Flink could help
> users due to interoperability and being able to migrate. However:
>
> - Adding full Hive query syntax support will mean that we go from 6 fully
> supported API/language combinations to 7. I think we are currently already
> struggling with maintaining the existing combinations, let alone one
> more.
> - Apache Hive is/appears to be a project that's not that actively developed
> anymore. The last release was made in January 2021. Its popularity is
> rapidly declining in Europe and the United States, also due to Hadoop
> becoming less popular.
> - Related to the previous topic, other software like Snowflake,
> Trino/Presto, Databricks are becoming more and more popular. If we add full
> support for the Hive query syntax, then why not add support for Snowflake
> and the others?
> - We are supporting Hive versions that are no longer supported by the Hive
> community with known security vulnerabilities. This also makes Flink
> vulnerable to those types of vulnerabilities.
> - The current Hive implementation is done by using a lot of internals of
> Flink, making Flink hard to maintain, with lots of tech debt and making
> things overly complex.
>
> From my perspective, I think it would be better to not have Hive query
> syntax compatibility directly in Flink itself. Of course we should have a
> proper Hive connector and a proper Hive catalog to make connectivity with
> Hive (the versions that are still supported by the Hive community) itself
> possible. Alternatively, if Hive query syntax is so important, it should
> not rely on internals but be available as a dialect/pluggable option. That
> could also open up the possibility to add more syntax support for others in
> the future, but I really think we should just focus on Flink SQL itself.
> That's already hard enough to maintain and improve on.
>
> I'm looking forward to the thoughts of both Developers and Users, so I'm
> cross-posting to both mailing lists.
>
> Best regards,
>
> Martijn Visser
> https://twitter.com/MartijnVisser82
>
> [1]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=165227316
> [2] https://issues.apache.org/jira/browse/FLINK-21529
>

Re: [DISCUSS] Flink's supported APIs and Hive query syntax

Posted by Zou Dan <zo...@163.com>.
Hi Martijn,
Thanks for bringing this up.
Hive SQL (used in both Hive & Spark) plays an important role in batch processing; it has almost become the de facto standard there. In our company, there are hundreds of thousands of Spark jobs each day.
IMO, if we want to promote Flink for batch, Hive syntax compatibility is a crucial part of it.
Thanks to this feature, we have migrated 800+ Spark jobs to Flink smoothly.

So, I quite agree with putting more effort into Hive syntax compatibility.

Best,
Dan Zou

> On Mar 7, 2022, at 19:23, Martijn Visser <ma...@apache.org> wrote:
> 
> query


Re: [DISCUSS] Flink's supported APIs and Hive query syntax

Posted by Ingo Bürk <ai...@apache.org>.
Hi,

thanks Martijn for bringing this up and raising very valid concerns. I 
agree with the notion that Flink supporting Hive should come with a 
proper commitment, and otherwise we should consider not supporting it at 
all (in Flink itself, that is).

Given that Hive is an Apache project, my first thought was whether we 
shouldn't just reach out to the project to understand their plans 
regarding vulnerabilities and the future of the project?


Best
Ingo

On 07.03.22 12:23, Martijn Visser wrote:
> Hi everyone,
> 
> Flink currently has 4 APIs with multiple language support which can be 
> used to develop applications:
> 
> * DataStream API, both Java and Scala
> * Table API, both Java and Scala
> * Flink SQL, both in Flink query syntax and Hive query syntax (partially)
> * Python API
> 
> Since FLIP-152 [1] the Flink SQL support has been extended to also 
> support the Hive query syntax. There is now a follow-up FLINK-26360 [2] 
> to address more syntax compatibility issues.
> 
> I would like to open a discussion on Flink directly supporting the Hive 
> query syntax. I have some concerns if having a 100% Hive query syntax is 
> indeed something that we should aim for in Flink.
> 
> I can understand that having Hive query syntax support in Flink could 
> help users due to interoperability and being able to migrate. However:
> 
> - Adding full Hive query syntax support will mean that we go from 6 
> fully supported API/language combinations to 7. I think we are currently 
> already struggling with maintaining the existing combinations, let 
> alone one more.
> - Apache Hive is/appears to be a project that's not that actively 
> developed anymore. The last release was made in January 2021. Its 
> popularity is rapidly declining in Europe and the United States, also 
> due to Hadoop becoming less popular.
> - Related to the previous topic, other software like Snowflake, 
> Trino/Presto, Databricks are becoming more and more popular. If we add 
> full support for the Hive query syntax, then why not add support for 
> Snowflake and the others?
> - We are supporting Hive versions that are no longer supported by the 
> Hive community with known security vulnerabilities. This also makes 
> Flink vulnerable to those types of vulnerabilities.
> - The current Hive implementation is done by using a lot of internals 
> of Flink, making Flink hard to maintain, with lots of tech debt and 
> making things overly complex.
> 
>  From my perspective, I think it would be better to not have Hive query 
> syntax compatibility directly in Flink itself. Of course we should have 
> a proper Hive connector and a proper Hive catalog to make connectivity 
> with Hive (the versions that are still supported by the Hive community) 
> itself possible. Alternatively, if Hive query syntax is so important, it 
> should not rely on internals but be available as a dialect/pluggable 
> option. That could also open up the possibility to add more syntax 
> support for others in the future, but I really think we should just 
> focus on Flink SQL itself. That's already hard enough to maintain and 
> improve on.
> 
> I'm looking forward to the thoughts of both Developers and Users, so I'm 
> cross-posting to both mailing lists.
> 
> Best regards,
> 
> Martijn Visser
> https://twitter.com/MartijnVisser82
> 
> [1] 
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=165227316
> [2] https://issues.apache.org/jira/browse/FLINK-21529