Posted to user@flink.apache.org by "Zhang, Xuefu" <xu...@alibaba-inc.com> on 2018/10/09 17:22:23 UTC

[DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi all,

Along with the community's effort, inside Alibaba we have explored Flink's potential as an execution engine not just for stream processing but also for batch processing. We are encouraged by our findings and have initiated an effort to make Flink's SQL capabilities full-fledged. When comparing what's available in Flink to the offerings of competing data processing engines, we identified a major gap in Flink: solid integration with the Hive ecosystem. This is crucial to the success of Flink SQL and batch processing because of the well-established data ecosystem around Hive. We have therefore done some initial work in this direction, but a lot of effort is still needed.

We have two strategies in mind. The first is to make Flink SQL full-fledged and well integrated with the Hive ecosystem, similar to the approach Spark SQL adopted. The second is to make Hive itself work with Flink, similar to the proposal in [1]. Each approach has its pros and cons, but they need not be mutually exclusive, with each targeting different users and use cases. We believe that both will promote a much greater adoption of Flink beyond stream processing.

We have been focused on the first approach and would like to showcase Flink's batch and SQL capabilities with Flink SQL. However, we also plan to start strategy #2 as a follow-up effort.

I'm completely new to Flink (a short bio is included below [2]), though many of my colleagues here at Alibaba are long-time contributors. Nevertheless, I'd like to share our thoughts and invite your early feedback. At the same time, I am working on a detailed proposal for Flink SQL's integration with the Hive ecosystem, which will also be shared when ready.

While the ideas are simple, each approach will demand significant effort, more than we can afford alone. Thus, input and contributions from the community are greatly welcomed and appreciated.

Regards,


Xuefu

References:

[1] https://issues.apache.org/jira/browse/HIVE-10712
[2] Xuefu Zhang is a long-time open source veteran who has worked on many projects under the Apache Software Foundation, of which he is also an honored member. About 10 years ago he worked on the Hadoop team at Yahoo, when those projects were just getting started. Later he worked at Cloudera, initiating and leading the development of the Hive on Spark project across the community and many organizations. Prior to joining Alibaba, he worked at Uber, where he moved all of Uber's SQL-on-Hadoop workload to Hive on Spark and significantly improved Uber's cluster efficiency.



Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by "Zhang, Xuefu" <xu...@alibaba-inc.com>.
Hi Timo,

Thank you for your input. It's exciting to see that the community has already started on some of these topics. We'd certainly like to leverage the current and previous work and make progress in phases. Here I'd like to comment on a few things on top of your feedback.

1. I think there are two aspects to #1 and #2 with regard to the Hive metastore: a) as a backend store for Flink's metadata (currently kept in memory), and b) as an external catalog (much like a JDBC catalog) that Flink can interact with. While it may be possible, and would be nice, to achieve both in a single design, our focus has been on the latter. We will consider both cases in our design.
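To make the distinction between the two roles concrete, here is a toy sketch (all names are hypothetical illustrations, not Flink's actual catalog API): role (a) is the engine owning and storing its own metadata, role (b) is a read-through view of metadata owned by an external system such as the Hive metastore.

```python
from abc import ABC, abstractmethod

class Catalog(ABC):
    """Minimal catalog contract: list and describe tables."""
    @abstractmethod
    def list_tables(self, database): ...
    @abstractmethod
    def get_table(self, database, name): ...

class InMemoryCatalog(Catalog):
    """Role (a): the engine's own metadata store, here kept in memory."""
    def __init__(self):
        self._tables = {}
    def create_table(self, database, name, schema):
        self._tables[(database, name)] = schema
    def list_tables(self, database):
        return sorted(n for (db, n) in self._tables if db == database)
    def get_table(self, database, name):
        return self._tables[(database, name)]

class ExternalCatalog(Catalog):
    """Role (b): a read-through view of tables owned by an external system;
    in practice the client would wrap a Hive metastore Thrift connection."""
    def __init__(self, metastore_client):
        self._client = metastore_client
    def list_tables(self, database):
        return self._client.get_all_tables(database)
    def get_table(self, database, name):
        return self._client.get_table(database, name)
```

A single design could satisfy both roles by having the engine treat its own store as just another Catalog implementation, which is one way to read the "achieve both in a single design" remark above.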

2. Re #5, I agree that Flink seems to have the majority of the data types. However, supporting some of them (such as STRUCT) at the SQL layer requires work on the parser (Calcite).

3. Similarly for #6, work needs to be done on the parsing side. We could certainly ask the Calcite community to provide Hive dialect parsing, but that can be challenging and time-consuming. At the same time, we can also explore solving the problem within Flink, for example by using Calcite's official extension mechanism. We will open that discussion when we get there.
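As a toy illustration of why Hive dialect support is genuine parser work rather than a surface tweak: Hive DDL contains clauses (e.g. STORED AS, ROW FORMAT DELIMITED) that a standard SQL grammar simply rejects. The sketch below merely strips such clauses with regexes to show where the dialects diverge; a real solution would extend the grammar itself, e.g. through Calcite's extension points, rather than rewrite text.

```python
import re

# Two Hive-only DDL clauses, for illustration; the real dialect gap is larger.
HIVE_ONLY_CLAUSES = [
    r"\s+STORED\s+AS\s+\w+",
    r"\s+ROW\s+FORMAT\s+DELIMITED(\s+FIELDS\s+TERMINATED\s+BY\s+'[^']*')?",
]

def strip_hive_clauses(ddl: str) -> str:
    """Remove Hive-specific DDL clauses that a standard SQL parser rejects."""
    for pattern in HIVE_ONLY_CLAUSES:
        ddl = re.sub(pattern, "", ddl, flags=re.IGNORECASE)
    return ddl.strip()
```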

Yes, I agree with you that we should start with a small scope while thinking ahead. Specifically, we will first look at metadata and data compatibility, data types, DDL/DML, queries, UDFs, and so on. I think we align well on this.

Please let me know if you have further thoughts or comments.

Thanks,
Xuefu


------------------------------------------------------------------
Sender:Timo Walther <tw...@apache.org>
Sent at:2018 Oct 11 (Thu) 15:46
Recipient:dev <de...@flink.apache.org>; "Jörn Franke" <jo...@gmail.com>; Xuefu <xu...@alibaba-inc.com>
Cc:vino yang <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>; user <us...@flink.apache.org>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Xuefu,

thanks for your proposal, it is a nice summary. Here are my thoughts to 
your list:

1. I think this is also on our current mid-term roadmap. Flink has lacked 
proper catalog support for a very long time. Before we can connect 
catalogs, we need to define how to map all the information from a catalog 
to Flink's representation. This is why the work on the unified connector 
API [1] has been going on for quite some time: it is the first approach to 
discuss and represent the pure characteristics of connectors.
2. It would be helpful to figure out what is missing in [1] to ensure 
this point. I guess we will need a new design document just for a proper 
Hive catalog integration.
3. This is already work in progress. ORC has been merged, Parquet is on 
its way [2].
4. This should be easy. There was a PR in the past that I reviewed, but 
it was not maintained anymore.
5. The type system of Flink SQL is very flexible. Only the UNION type is 
missing.
6. A Flink SQL DDL is on the roadmap and should come soon once we are done 
with [1]. Support for Hive syntax also needs cooperation with Apache Calcite.
7-11. Long-term goals.
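Timo's first point, mapping catalog information to the engine's internal representation, can be sketched as follows (all names are hypothetical; Flink's real type system is far richer, and the UNION type from point 5 is deliberately shown as the unsupported case):

```python
from dataclasses import dataclass

# Assumed mapping for a handful of primitives, for illustration only.
HIVE_TO_FLINK = {
    "int": "INT", "bigint": "BIGINT", "string": "VARCHAR",
    "double": "DOUBLE", "boolean": "BOOLEAN",
}

@dataclass
class Column:
    name: str
    flink_type: str

@dataclass
class TableSchema:
    columns: list

def map_hive_schema(hive_cols):
    """Translate a list of (name, hive_type) pairs into the engine's schema,
    rejecting types the engine cannot represent (e.g. Hive's uniontype)."""
    columns = []
    for name, hive_type in hive_cols:
        if hive_type not in HIVE_TO_FLINK:
            raise ValueError(f"unsupported Hive type: {hive_type}")
        columns.append(Column(name, HIVE_TO_FLINK[hive_type]))
    return TableSchema(columns)
```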

I would also propose to start with a smaller scope from which current 
Flink SQL users can also profit: 1, 2, 5, 3. This would allow the 
Flink SQL ecosystem to grow. After that we can aim to be fully compatible, 
including syntax and UDFs (4, 6, etc.). Once the core is ready, we can 
work on the tooling (7, 8, 9) and performance (10, 11).

@Jörn: Yes, we should not have a tight dependency on Hive. It should be 
treated as one "connector" system out of many.

Thanks,
Timo

[1] 
https://docs.google.com/document/d/1Yaxp1UJUFW-peGLt8EIidwKIZEWrrA-pznWLuvaH39Y/edit?ts=5bb62df4#
[2] https://github.com/apache/flink/pull/6483

Am 11.10.18 um 07:54 schrieb Jörn Franke:
> Would it maybe make sense to provide Flink as an engine on Hive („flink-on-Hive“)? E.g., to address 4, 5, 6, 8, 9, 10, this could be more loosely coupled than integrating Hive into all possible Flink core modules and thus introducing a very tight dependency on Hive in the core.
> 1, 2, 3 could be achieved via a connector based on the Flink Table API.
> Just a proposal to start this endeavour as independent projects (Hive engine, connector) to avoid too tight a coupling with Flink. Maybe in the more distant future, if the Hive integration is heavily demanded, one could then integrate it more tightly if needed.
>
> What is meant by 11?
>> Am 11.10.2018 um 05:01 schrieb Zhang, Xuefu <xu...@alibaba-inc.com>:
>>
>> Hi Fabian/Vino,
>>
>> Thank you very much for your encouragement and inquiry. Sorry that I didn't see Fabian's email until I read Vino's response just now. (Somehow Fabian's went to the spam folder.)
>>
>> My proposal contains long-term and short-term goals. Nevertheless, the effort will focus on the following areas, including Fabian's list:
>>
>> 1. Hive metastore connectivity - This covers both read and write access, which means Flink can make full use of Hive's metastore as its catalog (at least for batch, but this can be extended to streaming as well).
>> 2. Metadata compatibility - Objects (databases, tables, partitions, etc.) created by Hive can be understood by Flink, and vice versa.
>> 3. Data compatibility - Similar to #2, data produced by Hive can be consumed by Flink, and vice versa.
>> 4. Support for Hive UDFs - For all of Hive's native UDFs, Flink either provides its own implementation or makes Hive's implementation work in Flink. Further, for user-created UDFs in Hive, Flink SQL should provide a mechanism that allows users to import them into Flink without requiring any code change.
>> 5. Data types - Flink SQL should support all data types that are available in Hive.
>> 6. SQL language - Flink SQL should support the SQL standard (such as SQL:2003) with extensions to support Hive's syntax and language features, covering DDL, DML, and SELECT queries.
>> 7. SQL CLI - this is currently under development in Flink, but more effort is needed.
>> 8. Server - provide a server that's compatible with Hive's HiveServer2 in its Thrift APIs, such that HiveServer2 users can reuse their existing clients (such as Beeline) but connect to Flink's Thrift server instead.
>> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for other applications to use to connect to its Thrift server.
>> 10. Support other users' customizations in Hive, such as Hive SerDes, storage handlers, etc.
>> 11. Better task fault tolerance and task scheduling in the Flink runtime.
>>
>> As you can see, achieving all of this requires significant effort across all layers of Flink. However, a short-term goal could include only core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope (such as #3, #6).
>>
>> Please share your further thoughts. If we generally agree that this is the right direction, I could come up with a formal proposal quickly and then we can follow up with broader discussions.
>>
>> Thanks,
>> Xuefu
>>
>>
>>
>> ------------------------------------------------------------------
>> Sender:vino yang <ya...@gmail.com>
>> Sent at:2018 Oct 11 (Thu) 09:45
>> Recipient:Fabian Hueske <fh...@gmail.com>
>> Cc:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>; user <us...@flink.apache.org>
>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>
>> Hi Xuefu,
>>
>> I appreciate this proposal and, like Fabian, think it would look even better if you could give more details of the plan.
>>
>> Thanks, vino.
>>
>> Fabian Hueske <fh...@gmail.com> 于2018年10月10日周三 下午5:27写道:
>> Hi Xuefu,
>>
>> Welcome to the Flink community and thanks for starting this discussion! Better Hive integration would be really great!
>> Can you go into the details of what you are proposing? I can think of a couple of ways to improve Flink in that regard:
>>
>> * Support for Hive UDFs
>> * Support for Hive metadata catalog
>> * Support for HiveQL syntax
>> * ???
>>
>> Best, Fabian
>>
>> Am Di., 9. Okt. 2018 um 19:22 Uhr schrieb Zhang, Xuefu <xu...@alibaba-inc.com>:
>> Hi all,
>>
>> Along with the community's effort, inside Alibaba we have explored Flink's potential as an execution engine not just for stream processing but also for batch processing. We are encouraged by our findings and have initiated our effort to make Flink's SQL capabilities full-fledged. When comparing what's available in Flink to the offerings from competitive data processing engines, we identified a major gap in Flink: a well integration with Hive ecosystem. This is crucial to the success of Flink SQL and batch due to the well-established data ecosystem around Hive. Therefore, we have done some initial work along this direction but there are still a lot of effort needed.
>>
>> We have two strategies in mind. The first one is to make Flink SQL full-fledged and well-integrated with Hive ecosystem. This is a similar approach to what Spark SQL adopted. The second strategy is to make Hive itself work with Flink, similar to the proposal in [1]. Each approach bears its pros and cons, but they don’t need to be mutually exclusive with each targeting at different users and use cases. We believe that both will promote a much greater adoption of Flink beyond stream processing.
>>
>> We have been focused on the first approach and would like to showcase Flink's batch and SQL capabilities with Flink SQL. However, we have also planned to start strategy #2 as the follow-up effort.
>>
>> I'm completely new to Flink(, with a short bio [2] below), though many of my colleagues here at Alibaba are long-time contributors. Nevertheless, I'd like to share our thoughts and invite your early feedback. At the same time, I am working on a detailed proposal on Flink SQL's integration with Hive ecosystem, which will be also shared when ready.
>>
>> While the ideas are simple, each approach will demand significant effort, more than what we can afford. Thus, the input and contributions from the communities are greatly welcome and appreciated.
>>
>> Regards,
>>
>>
>> Xuefu
>>
>> References:
>>
>> [1] https://issues.apache.org/jira/browse/HIVE-10712
>> [2] Xuefu Zhang is a long-time open source veteran, worked or working on many projects under Apache Foundation, of which he is also an honored member. About 10 years ago he worked in the Hadoop team at Yahoo where the projects just got started. Later he worked at Cloudera, initiating and leading the development of Hive on Spark project in the communities and across many organizations. Prior to joining Alibaba, he worked at Uber where he promoted Hive on Spark to all Uber's SQL on Hadoop workload and significantly improved Uber's cluster efficiency.
>>
>>

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by Taher Koitawala <ta...@gslab.com>.
Sounds smashing; I think the initial integration will help 60% or so of
Flink SQL users, and a lot of other use cases will emerge once we solve the first one.

Thanks,
Taher Koitawala




On Fri 12 Oct, 2018, 10:13 AM Zhang, Xuefu, <xu...@alibaba-inc.com> wrote:

> Hi Taher,
>
> Thank you for your input. I think you emphasized two important points:
>
> 1. Hive metastore could be used for storing Flink metadata
> 2. There are some usability issues around Flink SQL configuration
>
> I think we all agree on #1. #2 may well be true, and the usability should
> be improved. However, I'm afraid that this is orthogonal to Hive
> integration, and the proposed solution might be just one of several possible
> solutions. On the surface, the extensions you proposed seem to go beyond
> the syntax and semantics of the SQL language in general.
>
> I don't disagree on the value of your proposal. I guess it's better to
> solve #1 first and leave #2 for follow-up discussions. How does this sound
> to you?
>
> Thanks,
> Xuefu
>
> ------------------------------------------------------------------
> Sender:Taher Koitawala <ta...@gslab.com>
> Sent at:2018 Oct 12 (Fri) 10:06
> Recipient:Xuefu <xu...@alibaba-inc.com>
> Cc:Rong Rong <wa...@gmail.com>; Timo Walther <tw...@apache.org>;
> dev <de...@flink.apache.org>; jornfranke <jo...@gmail.com>; vino yang <
> yanghua1127@gmail.com>; Fabian Hueske <fh...@gmail.com>; user <
> user@flink.apache.org>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> One other thought along the same lines was to use Hive tables to store Kafka
> information for processing streaming tables. Something like:
>
> "create table streaming_table (
> bootstrapServers string,
> topic string, keySerialiser string, ValueSerialiser string)"
>
> Insert into streaming_table values("10.17.1.1:9092,10.17.2.2:9092,
> 10.17.3.3:9092", "KafkaTopicName", "SimpleStringSchema",
> "SimpleSchemaString");
>
> Create table processingtable(
> -- enter fields here that match the Kafka record schema);
>
> Now we make a custom clause called something like "using"
>
> The way we use this is:
>
> Using streaming_table as configuration select count(*) from
> processingtable as streaming;
>
>
> This way users can now pass Flink SQL info easily and get rid of the Flink
> SQL configuration file all together. This is simple and easy to understand
> and I think most users would follow this.
>
> Thanks,
> Taher Koitawala
>
> On Fri 12 Oct, 2018, 7:24 AM Taher Koitawala, <ta...@gslab.com>
> wrote:
> I think integrating Flink with Hive would be an amazing option, and it
> would also be great to get Flink's SQL up to pace.
>
> The current Flink SQL syntax to prepare and process a table is too verbose;
> users manually need to retype table definitions, and that's a pain. Hive
> metastore integration should be done, though; many users are okay with defining
> their table schemas in Hive, as these are easy to maintain, change, or even migrate.
>
> Also, we could allow choosing between batch and stream processing simply
> with something like a "process as" clause.
>
> select count(*) from flink_mailing_list process as stream;
>
> select count(*) from flink_mailing_list process as batch;
>
> This way we could completely get rid of Flink SQL configuration files.
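The proposed clause is not standard SQL, but a front end could recognize and strip such a suffix before handing the rest of the query to the parser. A toy sketch of that idea follows (the "process as" syntax is only Taher's proposal here, not real Flink SQL):

```python
import re

# Recognize a trailing "process as stream|batch" suffix on a query string.
_PROCESS_AS = re.compile(r"\bprocess\s+as\s+(stream|batch)\s*;?\s*$",
                         re.IGNORECASE)

def split_execution_mode(query: str):
    """Return (core_query, mode) where mode is 'stream' or 'batch'."""
    m = _PROCESS_AS.search(query)
    if not m:
        raise ValueError("query must end with 'process as stream|batch'")
    return query[:m.start()].rstrip(), m.group(1).lower()
```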
>
> Thanks,
> Taher Koitawala
>
> On Fri 12 Oct, 2018, 2:35 AM Zhang, Xuefu, <xu...@alibaba-inc.com>
> wrote:
> Hi Rong,
>
> Thanks for your feedback. Some of my earlier comments might have addressed
> some of your points, so here I'd like to cover some specifics.
>
> 1. Yes, I expect that table stats stored in Hive will be used in Flink's
> plan optimization, but that's not part of the compatibility concern (yet).
> 2. Both implementing Hive UDFs natively in Flink and making Hive UDFs work
> in Flink are being considered.
> 3. I am aware of FLIP-24, but here the proposal is to make remote server
> compatible with HiveServer2. They are not mutually exclusive either.
> 4. The JDBC/ODBC driver in question is for the remote server that Flink
> provides. It's usually the service owner who provides drivers for their
> services. We weren't talking about JDBC/ODBC drivers for external DB systems.
>
> Let me know if you have further questions.
>
> Thanks,
> Xuefu
>
> ------------------------------------------------------------------
> Sender:Rong Rong <wa...@gmail.com>
> Sent at:2018 Oct 12 (Fri) 01:52
> Recipient:Timo Walther <tw...@apache.org>
> Cc:dev <de...@flink.apache.org>; jornfranke <jo...@gmail.com>; Xuefu <
> xuefu.z@alibaba-inc.com>; vino yang <ya...@gmail.com>; Fabian
> Hueske <fh...@gmail.com>; user <us...@flink.apache.org>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi Xuefu,
>
> Thanks for putting together the overview. I would like to add some more on
> top of Timo's comments.
> 1,2. I agree with Timo that proper catalog support should also address
> the metadata compatibility issues. I was actually wondering if you are
> referring to something like utilizing table stats for plan optimization?
> 4. If the key is to have users integrate Hive UDF without code changes to
> Flink UDF, it shouldn't be a problem as Timo mentioned. Is your concern
> mostly on the support of Hive UDFs that should be implemented in
> Flink-table natively?
> 7,8. Correct me if I am wrong, but I feel like some of the related
> components might have already been discussed in the longer-term roadmap of
> FLIP-24 [1]?
> 9. Per Jörn's comment to steer clear of a tight dependency on Hive and
> treat it as one "connector" system: should we also consider treating the
> JDBC/ODBC driver as a component of the connector system instead
> of having Flink provide it?
>
> Thanks,
> Rong
>
> [1].
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-24+-+SQL+Client
>
> On Thu, Oct 11, 2018 at 12:46 AM Timo Walther <tw...@apache.org> wrote:
> Hi Xuefu,
>
> thanks for your proposal, it is a nice summary. Here are my thoughts to
> your list:
>
> 1. I think this is also on our current mid-term roadmap. Flink lacks a
> poper catalog support for a very long time. Before we can connect
> catalogs we need to define how to map all the information from a catalog
> to Flink's representation. This is why the work on the unified connector
> API [1] is going on for quite some time as it is the first approach to
> discuss and represent the pure characteristics of connectors.
> 2. It would be helpful to figure out what is missing in [1] to to ensure
> this point. I guess we will need a new design document just for a proper
> Hive catalog integration.
> 3. This is already work in progress. ORC has been merged, Parquet is on
> its way [1].
> 4. This should be easy. There was a PR in past that I reviewed but was
> not maintained anymore.
> 5. The type system of Flink SQL is very flexible. Only UNION type is
> missing.
> 6. A Flink SQL DDL is on the roadmap soon once we are done with [1].
> Support for Hive syntax also needs cooperation with Apache Calcite.
> 7-11. Long-term goals.
>
> I would also propose to start with a smaller scope where also current
> Flink SQL users can profit: 1, 2, 5, 3. This would allow to grow the
> Flink SQL ecosystem. After that we can aim to be fully compatible
> including syntax and UDFs (4, 6 etc.). Once the core is ready, we can
> work on the tooling (7, 8, 9) and performance (10, 11).
>
> @Jörn: Yes, we should not have a tight dependency on Hive. It should be
> treated as one "connector" system out of many.
>
> Thanks,
> Timo
>
> [1]
>
> https://docs.google.com/document/d/1Yaxp1UJUFW-peGLt8EIidwKIZEWrrA-pznWLuvaH39Y/edit?ts=5bb62df4#
> [2] https://github.com/apache/flink/pull/6483
>
> Am 11.10.18 um 07:54 schrieb Jörn Franke:
> > Would it maybe make sense to provide Flink as an engine on Hive
> („flink-on-Hive“)? Eg to address 4,5,6,8,9,10. this could be more loosely
> coupled than integrating hive in all possible flink core modules and thus
> introducing a very tight dependency to Hive in the core.
> > 1,2,3 could be achieved via a connector based on the Flink Table API.
> > Just as a proposal to start this Endeavour as independent projects (hive
> engine, connector) to avoid too tight coupling with Flink. Maybe in a more
> distant future if the Hive integration is heavily demanded one could then
> integrate it more tightly if needed.
> >
> > What is meant by 11?
> >> Am 11.10.2018 um 05:01 schrieb Zhang, Xuefu <xu...@alibaba-inc.com>:
> >>
> >> Hi Fabian/Vno,
> >>
> >> Thank you very much for your encouragement inquiry. Sorry that I didn't
> see Fabian's email until I read Vino's response just now. (Somehow Fabian's
> went to the spam folder.)
> >>
> >> My proposal contains long-term and short-terms goals. Nevertheless, the
> effort will focus on the following areas, including Fabian's list:
> >>
> >> 1. Hive metastore connectivity - This covers both read/write access,
> which means Flink can make full use of Hive's metastore as its catalog (at
> least for the batch but can extend for streaming as well).
> >> 2. Metadata compatibility - Objects (databases, tables, partitions,
> etc) created by Hive can be understood by Flink and the reverse direction
> is true also.
> >> 3. Data compatibility - Similar to #2, data produced by Hive can be
> consumed by Flink and vise versa.
> >> 4. Support Hive UDFs - For all Hive's native udfs, Flink either
> provides its own implementation or make Hive's implementation work in
> Flink. Further, for user created UDFs in Hive, Flink SQL should provide a
> mechanism allowing user to import them into Flink without any code change
> required.
> >> 5. Data types -  Flink SQL should support all data types that are
> available in Hive.
> >> 6. SQL Language - Flink SQL should support SQL standard (such as
> SQL2003) with extension to support Hive's syntax and language features,
> around DDL, DML, and SELECT queries.
> >> 7.  SQL CLI - this is currently developing in Flink but more effort is
> needed.
> >> 8. Server - provide a server that's compatible with Hive's HiverServer2
> in thrift APIs, such that HiveServer2 users can reuse their existing client
> (such as beeline) but connect to Flink's thrift server instead.
> >> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for
> other application to use to connect to its thrift server
> >> 10. Support other user's customizations in Hive, such as Hive Serdes,
> storage handlers, etc.
> >> 11. Better task failure tolerance and task scheduling at Flink runtime.
> >>
> >> As you can see, achieving all those requires significant effort and
> across all layers in Flink. However, a short-term goal could  include only
> core areas (such as 1, 2, 4, 5, 6, 7) or start  at a smaller scope (such as
> #3, #6).
> >>
> >> Please share your further thoughts. If we generally agree that this is
> the right direction, I could come up with a formal proposal quickly and
> then we can follow up with broader discussions.
> >>
> >> Thanks,
> >> Xuefu
> >>
> >>
> >>
> >> ------------------------------------------------------------------
> >> Sender:vino yang <ya...@gmail.com>
> >> Sent at:2018 Oct 11 (Thu) 09:45
> >> Recipient:Fabian Hueske <fh...@gmail.com>
> >> Cc:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>; user <
> user@flink.apache.org>
> >> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> >>
> >> Hi Xuefu,
> >>
> >> Appreciate this proposal, and like Fabian, it would look better if you
> can give more details of the plan.
> >>
> >> Thanks, vino.
> >>
> >> Fabian Hueske <fh...@gmail.com> 于2018年10月10日周三 下午5:27写道:
> >> Hi Xuefu,
> >>
> >> Welcome to the Flink community and thanks for starting this discussion!
> Better Hive integration would be really great!
> >> Can you go into details of what you are proposing? I can think of a
> couple ways to improve Flink in that regard:
> >>
> >> * Support for Hive UDFs
> >> * Support for Hive metadata catalog
> >> * Support for HiveQL syntax
> >> * ???
> >>
> >> Best, Fabian
> >>
> >> Am Di., 9. Okt. 2018 um 19:22 Uhr schrieb Zhang, Xuefu <
> xuefu.z@alibaba-inc.com>:
> >> Hi all,
> >>
> >> Along with the community's effort, inside Alibaba we have explored
> Flink's potential as an execution engine not just for stream processing but
> also for batch processing. We are encouraged by our findings and have
> initiated our effort to make Flink's SQL capabilities full-fledged. When
> comparing what's available in Flink to the offerings from competitive data
> processing engines, we identified a major gap in Flink: a well integration
> with Hive ecosystem. This is crucial to the success of Flink SQL and batch
> due to the well-established data ecosystem around Hive. Therefore, we have
> done some initial work along this direction but there are still a lot of
> effort needed.
> >>
> >> We have two strategies in mind. The first one is to make Flink SQL
> full-fledged and well-integrated with Hive ecosystem. This is a similar
> approach to what Spark SQL adopted. The second strategy is to make Hive
> itself work with Flink, similar to the proposal in [1]. Each approach bears
> its pros and cons, but they don’t need to be mutually exclusive with each
> targeting at different users and use cases. We believe that both will
> promote a much greater adoption of Flink beyond stream processing.
> >>
> >> We have been focused on the first approach and would like to showcase
> Flink's batch and SQL capabilities with Flink SQL. However, we have also
> planned to start strategy #2 as the follow-up effort.
> >>
> >> I'm completely new to Flink(, with a short bio [2] below), though many
> of my colleagues here at Alibaba are long-time contributors. Nevertheless,
> I'd like to share our thoughts and invite your early feedback. At the same
> time, I am working on a detailed proposal on Flink SQL's integration with
> Hive ecosystem, which will be also shared when ready.
> >>
> >> While the ideas are simple, each approach will demand significant
> effort, more than what we can afford. Thus, the input and contributions
> from the communities are greatly welcome and appreciated.
> >>
> >> Regards,
> >>
> >>
> >> Xuefu
> >>
> >> References:
> >>
> >> [1] https://issues.apache.org/jira/browse/HIVE-10712
> >> [2] Xuefu Zhang is a long-time open source veteran, worked or working
> on many projects under Apache Foundation, of which he is also an honored
> member. About 10 years ago he worked in the Hadoop team at Yahoo where the
> projects just got started. Later he worked at Cloudera, initiating and
> leading the development of Hive on Spark project in the communities and
> across many organizations. Prior to joining Alibaba, he worked at Uber
> where he promoted Hive on Spark to all Uber's SQL on Hadoop workload and
> significantly improved Uber's cluster efficiency.
> >>
> >>
>
>

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by Taher Koitawala <ta...@gslab.com>.
Sounds smashing; I think the initial integration will help 60% or so flink
sql users and a lot other use cases will emerge when we solve the first one.

Thanks,
Taher Koitawala




On Fri 12 Oct, 2018, 10:13 AM Zhang, Xuefu, <xu...@alibaba-inc.com> wrote:

> Hi Taher,
>
> Thank you for your input. I think you emphasized two important points:
>
> 1. Hive metastore could be used for storing Flink metadata
> 2. There are some usability issues around Flink SQL configuration
>
> I think we all agree on #1. #2 may be well true and the usability should
> be improved. However, I'm afraid that this is orthogonal to Hive
> integration and the proposed solution might be just one of the possible
> solutions. On the surface, the extensions you proposed seem going beyond
> the syntax and semantics of SQL language in general.
>
> I don't disagree on the value of your proposal. I guess it's better to
> solve #1 first and leave #2 for follow-up discussions. How does this sound
> to you?
>
> Thanks,
> Xuefu
>
> ------------------------------------------------------------------
> Sender:Taher Koitawala <ta...@gslab.com>
> Sent at:2018 Oct 12 (Fri) 10:06
> Recipient:Xuefu <xu...@alibaba-inc.com>
> Cc:Rong Rong <wa...@gmail.com>; Timo Walther <tw...@apache.org>;
> dev <de...@flink.apache.org>; jornfranke <jo...@gmail.com>; vino yang <
> yanghua1127@gmail.com>; Fabian Hueske <fh...@gmail.com>; user <
> user@flink.apache.org>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> One other thought on the same lines was to use hive tables to store kafka
> information to process streaming tables. Something like
>
> "create table streaming_table (
> bootstrapServers string,
> topic string, keySerialiser string, ValueSerialiser string)"
>
> Insert into streaming_table values(,"10.17.1.1:9092,10.17.2.2:9092,
> 10.17.3.3:9092", "KafkaTopicName", "SimpleStringSchema",
> "SimpleSchemaString");
>
> Create table processingtable(
> //Enter fields here which match the kafka records schema);
>
> Now we make a custom clause called something like "using"
>
> The way we use this is:
>
> Using streaming_table as configuration select count(*) from
> processingtable as streaming;
>
>
> This way users can now pass Flink SQL info easily and get rid of the Flink
> SQL configuration file all together. This is simple and easy to understand
> and I think most users would follow this.
>
> Thanks,
> Taher Koitawala
>
> On Fri 12 Oct, 2018, 7:24 AM Taher Koitawala, <ta...@gslab.com>
> wrote:
> I think integrating Flink with Hive would be an amazing option and also to
> get Flink's SQL up to pace would be amazing.
>
> Current Flink Sql syntax to prepare and process a table is too verbose,
> users manually need to retype table definitions and that's a pain. Hive
> metastore integration should be done through, many users are okay defining
> their table schemas in Hive as it is easy to main, change or even migrate.
>
> Also we could simply choosing batch and stream there with simply something
> like a "process as" clause.
>
> select count(*) from flink_mailing_list process as stream;
>
> select count(*) from flink_mailing_list process as batch;
>
> This way we could completely get rid of Flink SQL configuration files.
>
> Thanks,
> Taher Koitawala
>
> Integrating
> On Fri 12 Oct, 2018, 2:35 AM Zhang, Xuefu, <xu...@alibaba-inc.com>
> wrote:
> Hi Rong,
>
> Thanks for your feedback. Some of my earlier comments might have addressed
> some of your points, so here I'd like to cover some specifics.
>
> 1. Yes, I expect that table stats stored in Hive will be used in Flink
> plan optimization, but it's not part of the compatibility concern (yet).
> 2. Both implementing Hive UDFs in Flink natively and making Hive UDFs work
> in Flink are considered.
> 3. I am aware of FLIP-24, but here the proposal is to make remote server
> compatible with HiveServer2. They are not mutually exclusive either.
> 4. The JDBC/ODBC driver in question is for the remote server that Flink
> provides. It's usually the service owner who provides drivers to their
> services. We weren't talking about JDBC/ODBC drivers to external DB systems.
>
> Let me know if you have further questions.
>
> Thanks,
> Xuefu
>
> ------------------------------------------------------------------
> Sender:Rong Rong <wa...@gmail.com>
> Sent at:2018 Oct 12 (Fri) 01:52
> Recipient:Timo Walther <tw...@apache.org>
> Cc:dev <de...@flink.apache.org>; jornfranke <jo...@gmail.com>; Xuefu <
> xuefu.z@alibaba-inc.com>; vino yang <ya...@gmail.com>; Fabian
> Hueske <fh...@gmail.com>; user <us...@flink.apache.org>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi Xuefu,
>
> Thanks for putting together the overview. I would like to add some more on
> top of Timo's comments.
> 1,2. I agree with Timo that proper catalog support should also address
> the metadata compatibility issues. I was actually wondering if you are
> referring to something like utilizing table stats for plan optimization?
> 4. If the key is to let users integrate Hive UDFs without code changes,
> it shouldn't be a problem, as Timo mentioned. Is your concern
> mostly about the support of Hive UDFs that should be implemented in
> Flink-table natively?
> 7,8. Correct me if I am wrong, but I feel like some of the related
> components might have already been discussed in the longer term road map of
> FLIP-24 [1]?
> 9. Per Jörn's comment to steer clear of a tight dependency on Hive and
> treat it as one "connector" system: should we also consider treating the
> JDBC/ODBC driver as part of the connector system instead
> of having Flink provide them?
>
> Thanks,
> Rong
>
> [1].
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-24+-+SQL+Client
>
> On Thu, Oct 11, 2018 at 12:46 AM Timo Walther <tw...@apache.org> wrote:
> Hi Xuefu,
>
> thanks for your proposal, it is a nice summary. Here are my thoughts to
> your list:
>
> 1. I think this is also on our current mid-term roadmap. Flink has lacked
> proper catalog support for a very long time. Before we can connect
> catalogs, we need to define how to map all the information from a catalog
> to Flink's representation. This is why the work on the unified connector
> API [1] has been going on for quite some time; it is the first approach to
> discuss and represent the pure characteristics of connectors.
> 2. It would be helpful to figure out what is missing in [1] to ensure
> this point. I guess we will need a new design document just for a proper
> Hive catalog integration.
> 3. This is already work in progress. ORC has been merged, and Parquet is on
> its way [2].
> 4. This should be easy. There was a PR in the past that I reviewed, but it
> was no longer maintained.
> 5. The type system of Flink SQL is very flexible. Only UNION type is
> missing.
> 6. A Flink SQL DDL is on the roadmap soon once we are done with [1].
> Support for Hive syntax also needs cooperation with Apache Calcite.
> 7-11. Long-term goals.
>
> I would also propose to start with a smaller scope from which current
> Flink SQL users can also profit: 1, 2, 5, 3. This would allow the
> Flink SQL ecosystem to grow. After that we can aim to be fully compatible,
> including syntax and UDFs (4, 6 etc.). Once the core is ready, we can
> work on the tooling (7, 8, 9) and performance (10, 11).
>
> @Jörn: Yes, we should not have a tight dependency on Hive. It should be
> treated as one "connector" system out of many.
>
> Thanks,
> Timo
>
> [1]
>
> https://docs.google.com/document/d/1Yaxp1UJUFW-peGLt8EIidwKIZEWrrA-pznWLuvaH39Y/edit?ts=5bb62df4#
> [2] https://github.com/apache/flink/pull/6483
>
> Am 11.10.18 um 07:54 schrieb Jörn Franke:
> > Would it maybe make sense to provide Flink as an engine on Hive
> („flink-on-Hive“)? E.g. to address 4, 5, 6, 8, 9, 10. This could be more
> loosely coupled than integrating Hive in all possible Flink core modules and
> thus introducing a very tight dependency on Hive in the core.
> > 1, 2, 3 could be achieved via a connector based on the Flink Table API.
> > Just a proposal to start this endeavour as independent projects (hive
> engine, connector) to avoid too tight coupling with Flink. Maybe in a more
> distant future, if the Hive integration is heavily demanded, one could then
> integrate it more tightly if needed.
> >
> > What is meant by 11?
> >> Am 11.10.2018 um 05:01 schrieb Zhang, Xuefu <xu...@alibaba-inc.com>:
> >>
> >> Hi Fabian/Vino,
> >>
> >> Thank you very much for your encouragement and inquiry. Sorry that I didn't
> see Fabian's email until I read Vino's response just now. (Somehow Fabian's
> went to the spam folder.)
> >>
> >> My proposal contains long-term and short-term goals. Nevertheless, the
> effort will focus on the following areas, including Fabian's list:
> >>
> >> 1. Hive metastore connectivity - This covers both read/write access,
> which means Flink can make full use of Hive's metastore as its catalog (at
> least for batch, but this can extend to streaming as well).
> >> 2. Metadata compatibility - Objects (databases, tables, partitions,
> etc) created by Hive can be understood by Flink and the reverse direction
> is true also.
> >> 3. Data compatibility - Similar to #2, data produced by Hive can be
> consumed by Flink and vice versa.
> >> 4. Support Hive UDFs - For all of Hive's native UDFs, Flink either
> provides its own implementation or makes Hive's implementation work in
> Flink. Further, for user-created UDFs in Hive, Flink SQL should provide a
> mechanism allowing users to import them into Flink without any code change
> required.
> >> 5. Data types -  Flink SQL should support all data types that are
> available in Hive.
> >> 6. SQL Language - Flink SQL should support SQL standard (such as
> SQL2003) with extension to support Hive's syntax and language features,
> around DDL, DML, and SELECT queries.
> >> 7.  SQL CLI - this is currently developing in Flink but more effort is
> needed.
> >> 8. Server - provide a server that's compatible with Hive's HiveServer2
> in thrift APIs, such that HiveServer2 users can reuse their existing clients
> (such as beeline) but connect to Flink's thrift server instead.
> >> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for
> other applications to use to connect to its thrift server.
> >> 10. Support other user's customizations in Hive, such as Hive Serdes,
> storage handlers, etc.
> >> 11. Better task failure tolerance and task scheduling at Flink runtime.
> >>
> >> As you can see, achieving all of this requires significant effort
> across all layers in Flink. However, a short-term goal could include only
> core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope (such as
> #3, #6).
> >>
> >> Please share your further thoughts. If we generally agree that this is
> the right direction, I could come up with a formal proposal quickly and
> then we can follow up with broader discussions.
> >>
> >> Thanks,
> >> Xuefu
> >>
> >>
> >>
> >> ------------------------------------------------------------------
> >> Sender:vino yang <ya...@gmail.com>
> >> Sent at:2018 Oct 11 (Thu) 09:45
> >> Recipient:Fabian Hueske <fh...@gmail.com>
> >> Cc:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>; user <
> user@flink.apache.org>
> >> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> >>
> >> Hi Xuefu,
> >>
> >> Appreciate this proposal, and like Fabian, it would look better if you
> can give more details of the plan.
> >>
> >> Thanks, vino.
> >>
> >> Fabian Hueske <fh...@gmail.com> 于2018年10月10日周三 下午5:27写道:
> >> Hi Xuefu,
> >>
> >> Welcome to the Flink community and thanks for starting this discussion!
> Better Hive integration would be really great!
> >> Can you go into details of what you are proposing? I can think of a
> couple ways to improve Flink in that regard:
> >>
> >> * Support for Hive UDFs
> >> * Support for Hive metadata catalog
> >> * Support for HiveQL syntax
> >> * ???
> >>
> >> Best, Fabian
> >>
> >> Am Di., 9. Okt. 2018 um 19:22 Uhr schrieb Zhang, Xuefu <
> xuefu.z@alibaba-inc.com>:
> >> Hi all,
> >>
> >> Along with the community's effort, inside Alibaba we have explored
> Flink's potential as an execution engine not just for stream processing but
> also for batch processing. We are encouraged by our findings and have
> initiated our effort to make Flink's SQL capabilities full-fledged. When
> comparing what's available in Flink to the offerings from competitive data
> processing engines, we identified a major gap in Flink: good integration
> with the Hive ecosystem. This is crucial to the success of Flink SQL and
> batch due to the well-established data ecosystem around Hive. Therefore, we
> have done some initial work along this direction, but there is still a lot
> of effort needed.
> >>
> >> We have two strategies in mind. The first one is to make Flink SQL
> full-fledged and well-integrated with Hive ecosystem. This is a similar
> approach to what Spark SQL adopted. The second strategy is to make Hive
> itself work with Flink, similar to the proposal in [1]. Each approach bears
> its pros and cons, but they don't need to be mutually exclusive, with each
> targeting different users and use cases. We believe that both will
> promote a much greater adoption of Flink beyond stream processing.
> >>
> >> We have been focused on the first approach and would like to showcase
> Flink's batch and SQL capabilities with Flink SQL. However, we have also
> planned to start strategy #2 as the follow-up effort.
> >>
> >> I'm completely new to Flink (with a short bio [2] below), though many
> of my colleagues here at Alibaba are long-time contributors. Nevertheless,
> I'd like to share our thoughts and invite your early feedback. At the same
> time, I am working on a detailed proposal on Flink SQL's integration with
> the Hive ecosystem, which will also be shared when ready.
> >>
> >> While the ideas are simple, each approach will demand significant
> effort, more than what we can afford. Thus, the input and contributions
> from the communities are greatly welcome and appreciated.
> >>
> >> Regards,
> >>
> >>
> >> Xuefu
> >>
> >> References:
> >>
> >> [1] https://issues.apache.org/jira/browse/HIVE-10712
> >> [2] Xuefu Zhang is a long-time open source veteran who has worked or is
> working on many projects under the Apache Software Foundation, of which he
> is also an honored member. About 10 years ago he worked in the Hadoop team
> at Yahoo, where the projects had just gotten started. Later he worked at
> Cloudera, initiating and leading the development of the Hive on Spark
> project in the communities and across many organizations. Prior to joining
> Alibaba, he worked at Uber, where he promoted Hive on Spark for all of
> Uber's SQL-on-Hadoop workload and significantly improved Uber's cluster
> efficiency.
> >>
> >>
>
>

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by "Zhang, Xuefu" <xu...@alibaba-inc.com>.
Hi Taher,

Thank you for your input. I think you emphasized two important points:

1. Hive metastore could be used for storing Flink metadata
2. There are some usability issues around Flink SQL configuration

I think we all agree on #1. #2 may well be true, and the usability should be improved. However, I'm afraid that this is orthogonal to Hive integration, and the proposed solution might be just one of the possible solutions. On the surface, the extensions you proposed seem to go beyond the syntax and semantics of the SQL language in general.

I don't disagree with the value of your proposal. I guess it's better to solve #1 first and leave #2 for follow-up discussions. How does this sound to you?

Thanks,
Xuefu


------------------------------------------------------------------
Sender:Taher Koitawala <ta...@gslab.com>
Sent at:2018 Oct 12 (Fri) 10:06
Recipient:Xuefu <xu...@alibaba-inc.com>
Cc:Rong Rong <wa...@gmail.com>; Timo Walther <tw...@apache.org>; dev <de...@flink.apache.org>; jornfranke <jo...@gmail.com>; vino yang <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>; user <us...@flink.apache.org>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

One other thought on the same lines was to use hive tables to store kafka information to process streaming tables. Something like 

"create table streaming_table (
bootstrapServers string,
topic string, keySerialiser string, ValueSerialiser string)"

Insert into streaming_table values("10.17.1.1:9092,10.17.2.2:9092,10.17.3.3:9092", "KafkaTopicName", "SimpleStringSchema", "SimpleSchemaString");

Create table processingtable(
//Enter fields here which match the kafka records schema);

Now we make a custom clause called something like "using"

The way we use this is:

Using streaming_table as configuration select count(*) from processingtable as streaming;


This way users can now pass Flink SQL info easily and get rid of the Flink SQL configuration file altogether. This is simple and easy to understand, and I think most users would follow this.
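Purely as a thought experiment (no such clause exists in Flink SQL; the clause name and helper below are invented for illustration), the proposed "using ... as configuration" prefix could be peeled off a statement before the remaining query is handed to the planner:

```python
import re

# Hypothetical rewrite step for the proposed clause:
#   USING <config_table> AS CONFIGURATION <query>
# Splits the statement into the configuration table name and the actual query.
USING_CLAUSE = re.compile(
    r"^\s*using\s+(?P<cfg>\w+)\s+as\s+configuration\s+(?P<query>.+)$",
    re.IGNORECASE | re.DOTALL,
)

def split_using_clause(statement: str):
    """Return (config_table, query); config_table is None if no clause given."""
    m = USING_CLAUSE.match(statement.strip().rstrip(";"))
    if not m:
        return None, statement.strip().rstrip(";")
    return m.group("cfg"), m.group("query").strip()

cfg, query = split_using_clause(
    "Using streaming_table as configuration "
    "select count(*) from processingtable as streaming;"
)
print(cfg)    # streaming_table
print(query)  # select count(*) from processingtable as streaming
```

This is only meant to show that the proposal is a pre-parsing concern; nothing here reflects an actual Flink API.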

Thanks, 
Taher Koitawala 
On Fri 12 Oct, 2018, 7:24 AM Taher Koitawala, <ta...@gslab.com> wrote:

I think integrating Flink with Hive would be an amazing option and also to get Flink's SQL up to pace would be amazing. 

Current Flink SQL syntax to prepare and process a table is too verbose; users manually need to retype table definitions, and that's a pain. Hive metastore integration should be done, though; many users are okay defining their table schemas in Hive as it is easy to maintain, change, or even migrate.

Also, we could simplify choosing between batch and stream processing with something like a "process as" clause.

select count(*) from flink_mailing_list process as stream;

select count(*) from flink_mailing_list process as batch;

This way we could completely get rid of Flink SQL configuration files. 
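Again as a sketch only (no such clause exists; every name here is made up), the trailing "process as" modifier above could be parsed off the query to pick the execution mode:

```python
import re

# Hypothetical handling of the proposed trailing modifier:
#   <query> PROCESS AS { STREAM | BATCH }
PROCESS_AS = re.compile(r"\s+process\s+as\s+(stream|batch)\s*;?\s*$", re.IGNORECASE)

def extract_mode(statement: str, default: str = "stream"):
    """Return (query, mode); mode falls back to `default` if no clause is given."""
    m = PROCESS_AS.search(statement)
    if not m:
        return statement.rstrip("; \t\n"), default
    return statement[: m.start()], m.group(1).lower()

print(extract_mode("select count(*) from flink_mailing_list process as batch;"))
# ('select count(*) from flink_mailing_list', 'batch')
```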

Thanks,
Taher Koitawala 

On Fri 12 Oct, 2018, 2:35 AM Zhang, Xuefu, <xu...@alibaba-inc.com> wrote:
Hi Rong,

Thanks for your feedback. Some of my earlier comments might have addressed some of your points, so here I'd like to cover some specifics.

1. Yes, I expect that table stats stored in Hive will be used in Flink plan optimization, but it's not part of the compatibility concern (yet).
2. Both implementing Hive UDFs in Flink natively and making Hive UDFs work in Flink are considered.
3. I am aware of FLIP-24, but here the proposal is to make remote server compatible with HiveServer2. They are not mutually exclusive either.
4. The JDBC/ODBC driver in question is for the remote server that Flink provides. It's usually the service owner who provides drivers to their services. We weren't talking about JDBC/ODBC drivers to external DB systems.

Let me know if you have further questions.

Thanks,
Xuefu

------------------------------------------------------------------
Sender:Rong Rong <wa...@gmail.com>
Sent at:2018 Oct 12 (Fri) 01:52
Recipient:Timo Walther <tw...@apache.org>
Cc:dev <de...@flink.apache.org>; jornfranke <jo...@gmail.com>; Xuefu <xu...@alibaba-inc.com>; vino yang <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>; user <us...@flink.apache.org>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Xuefu, 

Thanks for putting together the overview. I would like to add some more on top of Timo's comments.
1,2. I agree with Timo that proper catalog support should also address the metadata compatibility issues. I was actually wondering if you are referring to something like utilizing table stats for plan optimization?
4. If the key is to let users integrate Hive UDFs without code changes, it shouldn't be a problem, as Timo mentioned. Is your concern mostly about the support of Hive UDFs that should be implemented in Flink-table natively?
7,8. Correct me if I am wrong, but I feel like some of the related components might have already been discussed in the longer term road map of FLIP-24 [1]?
9. Per Jörn's comment to steer clear of a tight dependency on Hive and treat it as one "connector" system: should we also consider treating the JDBC/ODBC driver as part of the connector system instead of having Flink provide them?
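The "no code change" UDF idea being discussed in point 4 is, at its core, an adapter: wrap an object that follows Hive's `evaluate` convention behind Flink's `eval` convention. A toy sketch of that shape, with every class name invented for illustration and no real Hive or Flink API used:

```python
class HiveStyleUpper:
    """Stand-in for a user's existing Hive-style UDF: exposes evaluate()."""
    def evaluate(self, s):
        return None if s is None else s.upper()

class FlinkUdfAdapter:
    """Toy adapter: presents a Hive-convention UDF through eval()."""
    def __init__(self, hive_udf):
        self._udf = hive_udf

    def eval(self, *args):
        # Delegate to the wrapped Hive-style UDF unchanged -- the user's
        # original class needs no modification.
        return self._udf.evaluate(*args)

upper = FlinkUdfAdapter(HiveStyleUpper())
print(upper.eval("flink"))  # FLINK
```

A real implementation would additionally have to bridge type systems and object inspectors, which is where most of the actual work lies.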

Thanks,
Rong

[1]. https://cwiki.apache.org/confluence/display/FLINK/FLIP-24+-+SQL+Client
On Thu, Oct 11, 2018 at 12:46 AM Timo Walther <tw...@apache.org> wrote:
Hi Xuefu,

 thanks for your proposal, it is a nice summary. Here are my thoughts to 
 your list:

 1. I think this is also on our current mid-term roadmap. Flink has lacked 
 proper catalog support for a very long time. Before we can connect 
 catalogs, we need to define how to map all the information from a catalog 
 to Flink's representation. This is why the work on the unified connector 
 API [1] has been going on for quite some time; it is the first approach to 
 discuss and represent the pure characteristics of connectors.
 2. It would be helpful to figure out what is missing in [1] to ensure 
 this point. I guess we will need a new design document just for a proper 
 Hive catalog integration.
 3. This is already work in progress. ORC has been merged, and Parquet is on 
 its way [2].
 4. This should be easy. There was a PR in the past that I reviewed, but it 
 was no longer maintained.
 5. The type system of Flink SQL is very flexible. Only UNION type is 
 missing.
 6. A Flink SQL DDL is on the roadmap soon once we are done with [1]. 
 Support for Hive syntax also needs cooperation with Apache Calcite.
 7-11. Long-term goals.
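To make point 5 above concrete, a rough mapping of common Hive types onto Flink SQL types might look like the sketch below. This is illustrative only, not normative -- the real mapping would be defined by the catalog integration itself; as noted above, UNION is the type with no counterpart:

```python
# Illustrative sketch only: an approximate Hive -> Flink SQL type mapping.
HIVE_TO_FLINK = {
    "TINYINT": "TINYINT",
    "SMALLINT": "SMALLINT",
    "INT": "INT",
    "BIGINT": "BIGINT",
    "FLOAT": "FLOAT",
    "DOUBLE": "DOUBLE",
    "DECIMAL": "DECIMAL",
    "STRING": "VARCHAR",
    "BOOLEAN": "BOOLEAN",
    "DATE": "DATE",
    "TIMESTAMP": "TIMESTAMP",
    "ARRAY": "ARRAY",
    "MAP": "MAP",
    "STRUCT": "ROW",
    # "UNIONTYPE" deliberately absent: no Flink SQL counterpart (point 5).
}

def to_flink_type(hive_type: str) -> str:
    try:
        return HIVE_TO_FLINK[hive_type.upper()]
    except KeyError:
        raise ValueError(f"no Flink SQL counterpart for Hive type {hive_type}")

print(to_flink_type("string"))  # VARCHAR
```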

 I would also propose to start with a smaller scope from which current 
 Flink SQL users can also profit: 1, 2, 5, 3. This would allow the 
 Flink SQL ecosystem to grow. After that we can aim to be fully compatible, 
 including syntax and UDFs (4, 6 etc.). Once the core is ready, we can 
 work on the tooling (7, 8, 9) and performance (10, 11).

 @Jörn: Yes, we should not have a tight dependency on Hive. It should be 
 treated as one "connector" system out of many.

 Thanks,
 Timo

 [1] 
https://docs.google.com/document/d/1Yaxp1UJUFW-peGLt8EIidwKIZEWrrA-pznWLuvaH39Y/edit?ts=5bb62df4#
 [2] https://github.com/apache/flink/pull/6483

 Am 11.10.18 um 07:54 schrieb Jörn Franke:
 > Would it maybe make sense to provide Flink as an engine on Hive („flink-on-Hive“)? E.g. to address 4, 5, 6, 8, 9, 10. This could be more loosely coupled than integrating Hive in all possible Flink core modules and thus introducing a very tight dependency on Hive in the core.
 > 1, 2, 3 could be achieved via a connector based on the Flink Table API.
 > Just a proposal to start this endeavour as independent projects (hive engine, connector) to avoid too tight coupling with Flink. Maybe in a more distant future, if the Hive integration is heavily demanded, one could then integrate it more tightly if needed.
 >
 > What is meant by 11?
 >> Am 11.10.2018 um 05:01 schrieb Zhang, Xuefu <xu...@alibaba-inc.com>:
 >>
 >> Hi Fabian/Vino,
 >>
 >> Thank you very much for your encouragement and inquiry. Sorry that I didn't see Fabian's email until I read Vino's response just now. (Somehow Fabian's went to the spam folder.)
 >>
 >> My proposal contains long-term and short-term goals. Nevertheless, the effort will focus on the following areas, including Fabian's list:
 >>
 >> 1. Hive metastore connectivity - This covers both read/write access, which means Flink can make full use of Hive's metastore as its catalog (at least for batch, but this can extend to streaming as well).
 >> 2. Metadata compatibility - Objects (databases, tables, partitions, etc) created by Hive can be understood by Flink and the reverse direction is true also.
 >> 3. Data compatibility - Similar to #2, data produced by Hive can be consumed by Flink and vice versa.
 >> 4. Support Hive UDFs - For all of Hive's native UDFs, Flink either provides its own implementation or makes Hive's implementation work in Flink. Further, for user-created UDFs in Hive, Flink SQL should provide a mechanism allowing users to import them into Flink without any code change required.
 >> 5. Data types -  Flink SQL should support all data types that are available in Hive.
 >> 6. SQL Language - Flink SQL should support SQL standard (such as SQL2003) with extension to support Hive's syntax and language features, around DDL, DML, and SELECT queries.
 >> 7.  SQL CLI - this is currently developing in Flink but more effort is needed.
 >> 8. Server - provide a server that's compatible with Hive's HiveServer2 in thrift APIs, such that HiveServer2 users can reuse their existing clients (such as beeline) but connect to Flink's thrift server instead.
 >> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for other applications to use to connect to its thrift server.
 >> 10. Support other user's customizations in Hive, such as Hive Serdes, storage handlers, etc.
 >> 11. Better task failure tolerance and task scheduling at Flink runtime.
 >>
 >> As you can see, achieving all of this requires significant effort across all layers in Flink. However, a short-term goal could include only core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope (such as #3, #6).
 >>
 >> Please share your further thoughts. If we generally agree that this is the right direction, I could come up with a formal proposal quickly and then we can follow up with broader discussions.
 >>
 >> Thanks,
 >> Xuefu
 >>
 >>
 >>
 >> ------------------------------------------------------------------
 >> Sender:vino yang <ya...@gmail.com>
 >> Sent at:2018 Oct 11 (Thu) 09:45
 >> Recipient:Fabian Hueske <fh...@gmail.com>
 >> Cc:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>; user <us...@flink.apache.org>
 >> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
 >>
 >> Hi Xuefu,
 >>
 >> Appreciate this proposal, and like Fabian, it would look better if you can give more details of the plan.
 >>
 >> Thanks, vino.
 >>
 >> Fabian Hueske <fh...@gmail.com> 于2018年10月10日周三 下午5:27写道:
 >> Hi Xuefu,
 >>
 >> Welcome to the Flink community and thanks for starting this discussion! Better Hive integration would be really great!
 >> Can you go into details of what you are proposing? I can think of a couple ways to improve Flink in that regard:
 >>
 >> * Support for Hive UDFs
 >> * Support for Hive metadata catalog
 >> * Support for HiveQL syntax
 >> * ???
 >>
 >> Best, Fabian
 >>
 >> Am Di., 9. Okt. 2018 um 19:22 Uhr schrieb Zhang, Xuefu <xu...@alibaba-inc.com>:
 >> Hi all,
 >>
 >> Along with the community's effort, inside Alibaba we have explored Flink's potential as an execution engine not just for stream processing but also for batch processing. We are encouraged by our findings and have initiated our effort to make Flink's SQL capabilities full-fledged. When comparing what's available in Flink to the offerings from competitive data processing engines, we identified a major gap in Flink: good integration with the Hive ecosystem. This is crucial to the success of Flink SQL and batch due to the well-established data ecosystem around Hive. Therefore, we have done some initial work along this direction, but there is still a lot of effort needed.
 >>
 >> We have two strategies in mind. The first one is to make Flink SQL full-fledged and well-integrated with the Hive ecosystem. This is a similar approach to what Spark SQL adopted. The second strategy is to make Hive itself work with Flink, similar to the proposal in [1]. Each approach bears its pros and cons, but they don't need to be mutually exclusive, with each targeting different users and use cases. We believe that both will promote a much greater adoption of Flink beyond stream processing.
 >>
 >> We have been focused on the first approach and would like to showcase Flink's batch and SQL capabilities with Flink SQL. However, we have also planned to start strategy #2 as the follow-up effort.
 >>
 >> I'm completely new to Flink (with a short bio [2] below), though many of my colleagues here at Alibaba are long-time contributors. Nevertheless, I'd like to share our thoughts and invite your early feedback. At the same time, I am working on a detailed proposal on Flink SQL's integration with the Hive ecosystem, which will also be shared when ready.
 >>
 >> While the ideas are simple, each approach will demand significant effort, more than what we can afford. Thus, the input and contributions from the communities are greatly welcome and appreciated.
 >>
 >> Regards,
 >>
 >>
 >> Xuefu
 >>
 >> References:
 >>
 >> [1] https://issues.apache.org/jira/browse/HIVE-10712
 >> [2] Xuefu Zhang is a long-time open source veteran who has worked or is working on many projects under the Apache Software Foundation, of which he is also an honored member. About 10 years ago he worked in the Hadoop team at Yahoo, where the projects had just gotten started. Later he worked at Cloudera, initiating and leading the development of the Hive on Spark project in the communities and across many organizations. Prior to joining Alibaba, he worked at Uber, where he promoted Hive on Spark for all of Uber's SQL-on-Hadoop workload and significantly improved Uber's cluster efficiency.
 >>
 >>


Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by "Zhang, Xuefu" <xu...@alibaba-inc.com>.
Hi Taher,

Thank you for your input. I think you emphasized two important points:

1. Hive metastore could be used for storing Flink metadata
2. There are some usability issues around Flink SQL configuration

I think we all agree on #1. #2 may be well true and the usability should be improved. However, I'm afraid that this is orthogonal to Hive integration and the proposed solution might be just one of the possible solutions. On the surface, the extensions you proposed seem going beyond the syntax and semantics of SQL language in general.

I don't disagree on the value of your proposal. I guess it's better to solve #1 first and leave #2 for follow-up discussions. How does this sound to you?

Thanks,
Xuefu


------------------------------------------------------------------
Sender:Taher Koitawala <ta...@gslab.com>
Sent at:2018 Oct 12 (Fri) 10:06
Recipient:Xuefu <xu...@alibaba-inc.com>
Cc:Rong Rong <wa...@gmail.com>; Timo Walther <tw...@apache.org>; dev <de...@flink.apache.org>; jornfranke <jo...@gmail.com>; vino yang <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>; user <us...@flink.apache.org>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

One other thought on the same lines was to use hive tables to store kafka information to process streaming tables. Something like 

"create table streaming_table (
bootstrapServers string,
topic string, keySerialiser string, ValueSerialiser string)"

Insert into streaming_table values(,"10.17.1.1:9092,10.17.2.2:9092,10.17.3.3:9092", "KafkaTopicName", "SimpleStringSchema", "SimpleSchemaString");

Create table processingtable(
//Enter fields here which match the kafka records schema);

Now we make a custom clause called something like "using"

The way we use this is:

Using streaming_table as configuration select count(*) from processingtable as streaming;


This way users can now pass Flink SQL info easily and get rid of the Flink SQL configuration file all together. This is simple and easy to understand and I think most users would follow this.

Thanks, 
Taher Koitawala 
On Fri 12 Oct, 2018, 7:24 AM Taher Koitawala, <ta...@gslab.com> wrote:

I think integrating Flink with Hive would be an amazing option and also to get Flink's SQL up to pace would be amazing. 

Current Flink Sql syntax to prepare and process a table is too verbose, users manually need to retype table definitions and that's a pain. Hive metastore integration should be done through, many users are okay defining their table schemas in Hive as it is easy to main, change or even migrate. 

Also we could simply choosing batch and stream there with simply something like a "process as" clause. 

select count(*) from flink_mailing_list process as stream;

select count(*) from flink_mailing_list process as batch;

This way we could completely get rid of Flink SQL configuration files. 

Thanks,
Taher Koitawala 

Integrating 
On Fri 12 Oct, 2018, 2:35 AM Zhang, Xuefu, <xu...@alibaba-inc.com> wrote:
Hi Rong,

Thanks for your feedback. Some of my earlier comments might have addressed some of your points, so here I'd like to cover some specifics.

1. Yes, I expect that table stats stored in Hive will be used in Flink plan optimization, but it's not part of compatibility concern (yet).
2. Both implementing Hive UDFs in Flink natively and making Hive UDFs work in Flink are considered.
3. I am aware of FLIP-24, but here the proposal is to make remote server compatible with HiveServer2. They are not mutually exclusive either.
4. The JDBC/ODBC driver in question is for the remote server that Flink provides. It's usually the servicer owner who provides drivers to their services. We weren't talking about JDBC/ODBC driver to external DB systems.

Let me know if you have further questions.

Thanks,
Xuefu

------------------------------------------------------------------
Sender:Rong Rong <wa...@gmail.com>
Sent at:2018 Oct 12 (Fri) 01:52
Recipient:Timo Walther <tw...@apache.org>
Cc:dev <de...@flink.apache.org>; jornfranke <jo...@gmail.com>; Xuefu <xu...@alibaba-inc.com>; vino yang <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>; user <us...@flink.apache.org>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Xuefu, 

Thanks for putting together the overview. I would like to add some more on top of Timo's comments.
1,2. I agree with Timo that a proper catalog support should also address the metadata compatibility issues. I was actually wondering if you are referring to something like utilizing table stats for plan optimization?
4. If the key is to have users integrate Hive UDF without code changes to Flink UDF, it shouldn't be a problem as Timo mentioned. Is your concern mostly on the support of Hive UDFs that should be implemented in Flink-table natively?
7,8. Correct me if I am wrong, but I feel like some of the related components might have already been discussed in the longer term road map of FLIP-24 [1]?
9. per Jorn's comment to stay clear from a tight dependency on Hive and treat it as one "connector" system. Should we also consider treating JDBC/ODBC driver as part of the component from the connector system instead of having Flink to provide them?

Thanks,
Rong

[1]. https://cwiki.apache.org/confluence/display/FLINK/FLIP-24+-+SQL+Client
On Thu, Oct 11, 2018 at 12:46 AM Timo Walther <tw...@apache.org> wrote:
Hi Xuefu,

 thanks for your proposal, it is a nice summary. Here are my thoughts to 
 your list:

 1. I think this is also on our current mid-term roadmap. Flink lacks a 
 poper catalog support for a very long time. Before we can connect 
 catalogs we need to define how to map all the information from a catalog 
 to Flink's representation. This is why the work on the unified connector 
 API [1] is going on for quite some time as it is the first approach to 
 discuss and represent the pure characteristics of connectors.
 2. It would be helpful to figure out what is missing in [1] to to ensure 
 this point. I guess we will need a new design document just for a proper 
 Hive catalog integration.
 3. This is already work in progress. ORC has been merged, Parquet is on 
 its way [1].
 4. This should be easy. There was a PR in past that I reviewed but was 
 not maintained anymore.
 5. The type system of Flink SQL is very flexible. Only UNION type is 
 missing.
 6. A Flink SQL DDL is on the roadmap soon once we are done with [1]. 
 Support for Hive syntax also needs cooperation with Apache Calcite.
 7-11. Long-term goals.

 I would also propose to start with a smaller scope from which current 
 Flink SQL users can profit as well: 1, 2, 5, 3. This would allow the 
 Flink SQL ecosystem to grow. After that, we can aim to be fully 
 compatible, including syntax and UDFs (4, 6, etc.). Once the core is 
 ready, we can work on the tooling (7, 8, 9) and performance (10, 11).

 @Jörn: Yes, we should not have a tight dependency on Hive. It should be 
 treated as one "connector" system out of many.

 Thanks,
 Timo

 [1] 
https://docs.google.com/document/d/1Yaxp1UJUFW-peGLt8EIidwKIZEWrrA-pznWLuvaH39Y/edit?ts=5bb62df4#
 [2] https://github.com/apache/flink/pull/6483

 Am 11.10.18 um 07:54 schrieb Jörn Franke:
 > Would it maybe make sense to provide Flink as an engine on Hive („flink-on-Hive“)? E.g. to address 4, 5, 6, 8, 9, 10, this could be more loosely coupled than integrating Hive into all possible Flink core modules and thus introducing a very tight dependency on Hive in the core.
 > 1, 2, 3 could be achieved via a connector based on the Flink Table API.
 > Just a proposal to start this endeavor as independent projects (Hive engine, connector) to avoid too tight a coupling with Flink. Maybe in a more distant future, if the Hive integration is heavily demanded, one could then integrate it more tightly if needed.
 >
 > What is meant by 11?
 >> Am 11.10.2018 um 05:01 schrieb Zhang, Xuefu <xu...@alibaba-inc.com>:
 >>
 >> Hi Fabian/Vino,
 >>
 >> Thank you very much for your encouragement and inquiry. Sorry that I didn't see Fabian's email until I read Vino's response just now. (Somehow Fabian's went to the spam folder.)
 >>
 >> My proposal contains long-term and short-term goals. Nevertheless, the effort will focus on the following areas, including Fabian's list:
 >>
 >> 1. Hive metastore connectivity - This covers both read/write access, which means Flink can make full use of Hive's metastore as its catalog (at least for batch, but it can be extended to streaming as well).
 >> 2. Metadata compatibility - Objects (databases, tables, partitions, etc.) created by Hive can be understood by Flink, and the reverse direction is true as well.
 >> 3. Data compatibility - Similar to #2, data produced by Hive can be consumed by Flink and vice versa.
 >> 4. Support Hive UDFs - For all of Hive's native UDFs, Flink either provides its own implementation or makes Hive's implementation work in Flink. Further, for user-created UDFs in Hive, Flink SQL should provide a mechanism that allows users to import them into Flink without any code change required.
 >> 5. Data types - Flink SQL should support all data types that are available in Hive.
 >> 6. SQL language - Flink SQL should support the SQL standard (such as SQL:2003) with extensions to support Hive's syntax and language features, around DDL, DML, and SELECT queries.
 >> 7. SQL CLI - This is currently under development in Flink, but more effort is needed.
 >> 8. Server - Provide a server that's compatible with Hive's HiveServer2 in its Thrift APIs, such that HiveServer2 users can reuse their existing clients (such as Beeline) but connect to Flink's Thrift server instead.
 >> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for other applications to use to connect to its Thrift server.
 >> 10. Support other user customizations in Hive, such as Hive SerDes, storage handlers, etc.
 >> 11. Better task failure tolerance and task scheduling at Flink runtime.
 >>
 >> As you can see, achieving all of those requires significant effort across all layers of Flink. However, a short-term goal could include only the core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope (such as #3, #6).
 >>
 >> Please share your further thoughts. If we generally agree that this is the right direction, I could come up with a formal proposal quickly and then we can follow up with broader discussions.
 >>
 >> Thanks,
 >> Xuefu
 >>
 >>
 >>
 >> ------------------------------------------------------------------
 >> Sender:vino yang <ya...@gmail.com>
 >> Sent at:2018 Oct 11 (Thu) 09:45
 >> Recipient:Fabian Hueske <fh...@gmail.com>
 >> Cc:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>; user <us...@flink.apache.org>
 >> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
 >>
 >> Hi Xuefu,
 >>
 >> Appreciate this proposal, and like Fabian, it would look better if you can give more details of the plan.
 >>
 >> Thanks, vino.
 >>
 >> Fabian Hueske <fh...@gmail.com> 于2018年10月10日周三 下午5:27写道:
 >> Hi Xuefu,
 >>
 >> Welcome to the Flink community and thanks for starting this discussion! Better Hive integration would be really great!
 >> Can you go into details of what you are proposing? I can think of a couple ways to improve Flink in that regard:
 >>
 >> * Support for Hive UDFs
 >> * Support for Hive metadata catalog
 >> * Support for HiveQL syntax
 >> * ???
 >>
 >> Best, Fabian
 >>
 >> Am Di., 9. Okt. 2018 um 19:22 Uhr schrieb Zhang, Xuefu <xu...@alibaba-inc.com>:
 >> Hi all,
 >>
 >> Along with the community's effort, inside Alibaba we have explored Flink's potential as an execution engine not just for stream processing but also for batch processing. We are encouraged by our findings and have initiated our effort to make Flink's SQL capabilities full-fledged. When comparing what's available in Flink to the offerings from competitive data processing engines, we identified a major gap in Flink: a well integration with Hive ecosystem. This is crucial to the success of Flink SQL and batch due to the well-established data ecosystem around Hive. Therefore, we have done some initial work along this direction but there are still a lot of effort needed.
 >>
 >> We have two strategies in mind. The first one is to make Flink SQL full-fledged and well-integrated with Hive ecosystem. This is a similar approach to what Spark SQL adopted. The second strategy is to make Hive itself work with Flink, similar to the proposal in [1]. Each approach bears its pros and cons, but they don’t need to be mutually exclusive with each targeting at different users and use cases. We believe that both will promote a much greater adoption of Flink beyond stream processing.
 >>
 >> We have been focused on the first approach and would like to showcase Flink's batch and SQL capabilities with Flink SQL. However, we have also planned to start strategy #2 as the follow-up effort.
 >>
 >> I'm completely new to Flink(, with a short bio [2] below), though many of my colleagues here at Alibaba are long-time contributors. Nevertheless, I'd like to share our thoughts and invite your early feedback. At the same time, I am working on a detailed proposal on Flink SQL's integration with Hive ecosystem, which will be also shared when ready.
 >>
 >> While the ideas are simple, each approach will demand significant effort, more than what we can afford. Thus, the input and contributions from the communities are greatly welcome and appreciated.
 >>
 >> Regards,
 >>
 >>
 >> Xuefu
 >>
 >> References:
 >>
 >> [1] https://issues.apache.org/jira/browse/HIVE-10712
 >> [2] Xuefu Zhang is a long-time open source veteran, worked or working on many projects under Apache Foundation, of which he is also an honored member. About 10 years ago he worked in the Hadoop team at Yahoo where the projects just got started. Later he worked at Cloudera, initiating and leading the development of Hive on Spark project in the communities and across many organizations. Prior to joining Alibaba, he worked at Uber where he promoted Hive on Spark to all Uber's SQL on Hadoop workload and significantly improved Uber's cluster efficiency.
 >>
 >>


Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by Taher Koitawala <ta...@gslab.com>.
One other thought along the same lines was to use Hive tables to store the
Kafka information needed to process streaming tables. Something like:

"create table streaming_table (
bootstrapServers string,
topic string, keySerialiser string, valueSerialiser string)"

Insert into streaming_table values("10.17.1.1:9092,10.17.2.2:9092,
10.17.3.3:9092", "KafkaTopicName", "SimpleStringSchema",
"SimpleStringSchema");

Create table processingtable(
-- enter fields here that match the Kafka record schema
);

Now we introduce a custom clause called something like "using".

The way we use this is:

Using streaming_table as configuration select count(*) from processingtable
as streaming;


This way users can easily pass Flink SQL configuration and get rid of the Flink
SQL configuration file altogether. This is simple and easy to understand,
and I think most users would follow it.
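A minimal sketch of the idea (plain Python; the table, column names, and property keys are all hypothetical): rows of the Hive-backed configuration table are resolved into the properties a Kafka source needs, which is exactly the job the configuration file does today.

```python
# Hypothetical in-memory stand-in for the Hive-backed "streaming_table".
streaming_table = [{
    "bootstrapServers": "10.17.1.1:9092,10.17.2.2:9092,10.17.3.3:9092",
    "topic": "KafkaTopicName",
    "keySerialiser": "SimpleStringSchema",
    "valueSerialiser": "SimpleStringSchema",
}]

def resolve_using_clause(config_rows):
    """Turn configuration-table rows into Kafka source properties.

    This is what a 'using <table> as configuration' clause would do
    internally, instead of reading a separate configuration file.
    """
    row = config_rows[0]  # assume one config row per source, by convention
    return {
        "bootstrap.servers": row["bootstrapServers"],
        "topic": row["topic"],
        "key.deserializer": row["keySerialiser"],
        "value.deserializer": row["valueSerialiser"],
    }

props = resolve_using_clause(streaming_table)
```

The planner would run this resolution step before wiring up the source, so changing connection details becomes an INSERT into the config table rather than a file edit.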

Thanks,
Taher Koitawala

On Fri 12 Oct, 2018, 7:24 AM Taher Koitawala, <ta...@gslab.com>
wrote:

> I think integrating Flink with Hive would be an amazing option, and
> getting Flink's SQL up to pace would be amazing as well.
>
> The current Flink SQL syntax to prepare and process a table is too verbose;
> users manually need to retype table definitions, and that's a pain. Hive
> metastore integration should be done though; many users are okay defining
> their table schemas in Hive, as it is easy to maintain, change, or even migrate.
>
> Also, we could simplify choosing between batch and stream with something
> like a "process as" clause.
>
> select count(*) from flink_mailing_list process as stream;
>
> select count(*) from flink_mailing_list process as batch;
>
> This way we could completely get rid of Flink SQL configuration files.
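The "process as" idea could be prototyped as a thin rewrite step in front of the normal SQL parser. A sketch (hypothetical syntax, plain Python) that strips the clause and reports the requested execution mode:

```python
import re

# Hypothetical 'process as <mode>' suffix, stripped before regular SQL parsing.
_PROCESS_AS = re.compile(r"\s+process\s+as\s+(stream|batch)\s*;?\s*$",
                         re.IGNORECASE)

def split_process_as(sql):
    """Return (core_sql, mode); mode defaults to 'batch' when absent.

    The default is an arbitrary choice for this sketch; a real design
    would have to pick and document one.
    """
    match = _PROCESS_AS.search(sql)
    if match is None:
        return sql.rstrip("; \t"), "batch"
    return sql[:match.start()], match.group(1).lower()
```

The stripped query goes to the existing parser unchanged, and the extracted mode selects the batch or streaming execution environment, removing that setting from the configuration file.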
>
> Thanks,
> Taher Koitawala
>
> On Fri 12 Oct, 2018, 2:35 AM Zhang, Xuefu, <xu...@alibaba-inc.com>
> wrote:
>
>> Hi Rong,
>>
>> Thanks for your feedback. Some of my earlier comments might have
>> addressed some of your points, so here I'd like to cover some specifics.
>>
>> 1. Yes, I expect that table stats stored in Hive will be used in Flink
>> plan optimization, but that's not part of the compatibility concern (yet).
>> 2. Both implementing Hive UDFs natively in Flink and making Hive UDFs
>> work in Flink are considered.
>> 3. I am aware of FLIP-24, but the proposal here is to make the remote
>> server compatible with HiveServer2. They are not mutually exclusive either.
>> 4. The JDBC/ODBC driver in question is for the remote server that Flink
>> provides. It's usually the service owner who provides drivers to their
>> services. We weren't talking about JDBC/ODBC drivers for external DB systems.
>>
>> Let me know if you have further questions.
>>
>> Thanks,
>> Xuefu
>>
>> ------------------------------------------------------------------
>> Sender:Rong Rong <wa...@gmail.com>
>> Sent at:2018 Oct 12 (Fri) 01:52
>> Recipient:Timo Walther <tw...@apache.org>
>> Cc:dev <de...@flink.apache.org>; jornfranke <jo...@gmail.com>; Xuefu <
>> xuefu.z@alibaba-inc.com>; vino yang <ya...@gmail.com>; Fabian
>> Hueske <fh...@gmail.com>; user <us...@flink.apache.org>
>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>
>> Hi Xuefu,
>>
>> Thanks for putting together the overview. I would like to add some more
>> on top of Timo's comments.
>> 1,2. I agree with Timo that a proper catalog support should also address
>> the metadata compatibility issues. I was actually wondering if you are
>> referring to something like utilizing table stats for plan optimization?
>> 4. If the key is to have users integrate Hive UDF without code changes to
>> Flink UDF, it shouldn't be a problem as Timo mentioned. Is your concern
>> mostly on the support of Hive UDFs that should be implemented in
>> Flink-table natively?
>> 7,8. Correct me if I am wrong, but I feel like some of the related
>> components might have already been discussed in the longer term road map of
>> FLIP-24 [1]?
>> 9. per Jorn's comment to stay clear from a tight dependency on Hive and
>> treat it as one "connector" system. Should we also consider treating
>> JDBC/ODBC driver as part of the component from the connector system instead
>> of having Flink to provide them?
>>
>> Thanks,
>> Rong
>>
>> [1].
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-24+-+SQL+Client
>>
>> On Thu, Oct 11, 2018 at 12:46 AM Timo Walther <tw...@apache.org> wrote:
>> Hi Xuefu,
>>
>> thanks for your proposal, it is a nice summary. Here are my thoughts to
>> your list:
>>
>> 1. I think this is also on our current mid-term roadmap. Flink lacks a
>> poper catalog support for a very long time. Before we can connect
>> catalogs we need to define how to map all the information from a catalog
>> to Flink's representation. This is why the work on the unified connector
>> API [1] is going on for quite some time as it is the first approach to
>> discuss and represent the pure characteristics of connectors.
>> 2. It would be helpful to figure out what is missing in [1] to to ensure
>> this point. I guess we will need a new design document just for a proper
>> Hive catalog integration.
>> 3. This is already work in progress. ORC has been merged, Parquet is on
>> its way [1].
>> 4. This should be easy. There was a PR in past that I reviewed but was
>> not maintained anymore.
>> 5. The type system of Flink SQL is very flexible. Only UNION type is
>> missing.
>> 6. A Flink SQL DDL is on the roadmap soon once we are done with [1].
>> Support for Hive syntax also needs cooperation with Apache Calcite.
>> 7-11. Long-term goals.
>>
>> I would also propose to start with a smaller scope where also current
>> Flink SQL users can profit: 1, 2, 5, 3. This would allow to grow the
>> Flink SQL ecosystem. After that we can aim to be fully compatible
>> including syntax and UDFs (4, 6 etc.). Once the core is ready, we can
>> work on the tooling (7, 8, 9) and performance (10, 11).
>>
>> @Jörn: Yes, we should not have a tight dependency on Hive. It should be
>> treated as one "connector" system out of many.
>>
>> Thanks,
>> Timo
>>
>> [1]
>>
>> https://docs.google.com/document/d/1Yaxp1UJUFW-peGLt8EIidwKIZEWrrA-pznWLuvaH39Y/edit?ts=5bb62df4#
>> [2] https://github.com/apache/flink/pull/6483
>>
>> Am 11.10.18 um 07:54 schrieb Jörn Franke:
>> > Would it maybe make sense to provide Flink as an engine on Hive
>> („flink-on-Hive“)? Eg to address 4,5,6,8,9,10. this could be more loosely
>> coupled than integrating hive in all possible flink core modules and thus
>> introducing a very tight dependency to Hive in the core.
>> > 1,2,3 could be achieved via a connector based on the Flink Table API.
>> > Just as a proposal to start this Endeavour as independent projects
>> (hive engine, connector) to avoid too tight coupling with Flink. Maybe in a
>> more distant future if the Hive integration is heavily demanded one could
>> then integrate it more tightly if needed.
>> >
>> > What is meant by 11?
>> >> Am 11.10.2018 um 05:01 schrieb Zhang, Xuefu <xu...@alibaba-inc.com>:
>> >>
>> >> Hi Fabian/Vno,
>> >>
>> >> Thank you very much for your encouragement inquiry. Sorry that I
>> didn't see Fabian's email until I read Vino's response just now. (Somehow
>> Fabian's went to the spam folder.)
>> >>
>> >> My proposal contains long-term and short-terms goals. Nevertheless,
>> the effort will focus on the following areas, including Fabian's list:
>> >>
>> >> 1. Hive metastore connectivity - This covers both read/write access,
>> which means Flink can make full use of Hive's metastore as its catalog (at
>> least for the batch but can extend for streaming as well).
>> >> 2. Metadata compatibility - Objects (databases, tables, partitions,
>> etc) created by Hive can be understood by Flink and the reverse direction
>> is true also.
>> >> 3. Data compatibility - Similar to #2, data produced by Hive can be
>> consumed by Flink and vise versa.
>> >> 4. Support Hive UDFs - For all Hive's native udfs, Flink either
>> provides its own implementation or make Hive's implementation work in
>> Flink. Further, for user created UDFs in Hive, Flink SQL should provide a
>> mechanism allowing user to import them into Flink without any code change
>> required.
>> >> 5. Data types -  Flink SQL should support all data types that are
>> available in Hive.
>> >> 6. SQL Language - Flink SQL should support SQL standard (such as
>> SQL2003) with extension to support Hive's syntax and language features,
>> around DDL, DML, and SELECT queries.
>> >> 7.  SQL CLI - this is currently developing in Flink but more effort is
>> needed.
>> >> 8. Server - provide a server that's compatible with Hive's
>> HiverServer2 in thrift APIs, such that HiveServer2 users can reuse their
>> existing client (such as beeline) but connect to Flink's thrift server
>> instead.
>> >> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for
>> other application to use to connect to its thrift server
>> >> 10. Support other user's customizations in Hive, such as Hive Serdes,
>> storage handlers, etc.
>> >> 11. Better task failure tolerance and task scheduling at Flink runtime.
>> >>
>> >> As you can see, achieving all those requires significant effort and
>> across all layers in Flink. However, a short-term goal could  include only
>> core areas (such as 1, 2, 4, 5, 6, 7) or start  at a smaller scope (such as
>> #3, #6).
>> >>
>> >> Please share your further thoughts. If we generally agree that this is
>> the right direction, I could come up with a formal proposal quickly and
>> then we can follow up with broader discussions.
>> >>
>> >> Thanks,
>> >> Xuefu
>> >>
>> >>
>> >>
>> >> ------------------------------------------------------------------
>> >> Sender:vino yang <ya...@gmail.com>
>> >> Sent at:2018 Oct 11 (Thu) 09:45
>> >> Recipient:Fabian Hueske <fh...@gmail.com>
>> >> Cc:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>; user <
>> user@flink.apache.org>
>> >> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>> >>
>> >> Hi Xuefu,
>> >>
>> >> Appreciate this proposal, and like Fabian, it would look better if you
>> can give more details of the plan.
>> >>
>> >> Thanks, vino.
>> >>
>> >> Fabian Hueske <fh...@gmail.com> 于2018年10月10日周三 下午5:27写道:
>> >> Hi Xuefu,
>> >>
>> >> Welcome to the Flink community and thanks for starting this
>> discussion! Better Hive integration would be really great!
>> >> Can you go into details of what you are proposing? I can think of a
>> couple ways to improve Flink in that regard:
>> >>
>> >> * Support for Hive UDFs
>> >> * Support for Hive metadata catalog
>> >> * Support for HiveQL syntax
>> >> * ???
>> >>
>> >> Best, Fabian
>> >>
>> >> Am Di., 9. Okt. 2018 um 19:22 Uhr schrieb Zhang, Xuefu <
>> xuefu.z@alibaba-inc.com>:
>> >> Hi all,
>> >>
>> >> Along with the community's effort, inside Alibaba we have explored
>> Flink's potential as an execution engine not just for stream processing but
>> also for batch processing. We are encouraged by our findings and have
>> initiated our effort to make Flink's SQL capabilities full-fledged. When
>> comparing what's available in Flink to the offerings from competitive data
>> processing engines, we identified a major gap in Flink: a well integration
>> with Hive ecosystem. This is crucial to the success of Flink SQL and batch
>> due to the well-established data ecosystem around Hive. Therefore, we have
>> done some initial work along this direction but there are still a lot of
>> effort needed.
>> >>
>> >> We have two strategies in mind. The first one is to make Flink SQL
>> full-fledged and well-integrated with Hive ecosystem. This is a similar
>> approach to what Spark SQL adopted. The second strategy is to make Hive
>> itself work with Flink, similar to the proposal in [1]. Each approach bears
>> its pros and cons, but they don’t need to be mutually exclusive with each
>> targeting at different users and use cases. We believe that both will
>> promote a much greater adoption of Flink beyond stream processing.
>> >>
>> >> We have been focused on the first approach and would like to showcase
>> Flink's batch and SQL capabilities with Flink SQL. However, we have also
>> planned to start strategy #2 as the follow-up effort.
>> >>
>> >> I'm completely new to Flink(, with a short bio [2] below), though many
>> of my colleagues here at Alibaba are long-time contributors. Nevertheless,
>> I'd like to share our thoughts and invite your early feedback. At the same
>> time, I am working on a detailed proposal on Flink SQL's integration with
>> Hive ecosystem, which will be also shared when ready.
>> >>
>> >> While the ideas are simple, each approach will demand significant
>> effort, more than what we can afford. Thus, the input and contributions
>> from the communities are greatly welcome and appreciated.
>> >>
>> >> Regards,
>> >>
>> >>
>> >> Xuefu
>> >>
>> >> References:
>> >>
>> >> [1] https://issues.apache.org/jira/browse/HIVE-10712
>> >> [2] Xuefu Zhang is a long-time open source veteran, worked or working
>> on many projects under Apache Foundation, of which he is also an honored
>> member. About 10 years ago he worked in the Hadoop team at Yahoo where the
>> projects just got started. Later he worked at Cloudera, initiating and
>> leading the development of Hive on Spark project in the communities and
>> across many organizations. Prior to joining Alibaba, he worked at Uber
>> where he promoted Hive on Spark to all Uber's SQL on Hadoop workload and
>> significantly improved Uber's cluster efficiency.
>> >>
>> >>
>>
>>

>> on many projects under Apache Foundation, of which he is also an honored
>> member. About 10 years ago he worked in the Hadoop team at Yahoo where the
>> projects just got started. Later he worked at Cloudera, initiating and
>> leading the development of Hive on Spark project in the communities and
>> across many organizations. Prior to joining Alibaba, he worked at Uber
>> where he promoted Hive on Spark to all Uber's SQL on Hadoop workload and
>> significantly improved Uber's cluster efficiency.
>> >>
>> >>
>>
>>

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by Taher Koitawala <ta...@gslab.com>.
I think integrating Flink with Hive would be an amazing option, and
bringing Flink's SQL up to pace would be just as valuable.

The current Flink SQL syntax to prepare and process a table is too verbose;
users manually need to retype table definitions, and that's a pain. Hive
metastore integration should be done thoroughly, since many users are
comfortable defining their table schemas in Hive as it is easy to maintain,
change, or even migrate them.
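As a rough illustration of why reusing Hive-defined schemas is attractive, here is a toy sketch (plain Python, not the real Hive or Flink APIs) of mapping a Hive-style table definition onto a generic schema an engine could consume without the user retyping anything:

```python
# Toy sketch: map a Hive-style table definition to a generic engine schema.
# The type names and mapping table are illustrative, not the real APIs.

HIVE_TO_ENGINE_TYPES = {
    "string": "VARCHAR",
    "int": "INTEGER",
    "bigint": "BIGINT",
    "double": "DOUBLE",
    "boolean": "BOOLEAN",
}

def map_hive_schema(hive_columns):
    """Translate [(name, hive_type), ...] into engine (name, sql_type) pairs.

    Raises KeyError for a Hive type the engine does not know -- exactly the
    kind of gap a real metadata-compatibility effort has to close.
    """
    return [(name, HIVE_TO_ENGINE_TYPES[htype]) for name, htype in hive_columns]

# A table defined once in the Hive metastore...
hive_table = [("user_id", "bigint"), ("event", "string"), ("score", "double")]

# ...becomes usable by the engine without retyping the definition.
print(map_hive_schema(hive_table))
```

In a real integration the mapping would come from the metastore client rather than a hand-written dictionary, but the shape of the problem is the same.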

Also, we could simplify choosing between batch and stream execution with
something like a "process as" clause.

select count(*) from flink_mailing_list process as stream;

select count(*) from flink_mailing_list process as batch;

This way we could completely get rid of Flink SQL configuration files.
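A "process as" clause could be prototyped as a thin rewrite layer in front of the planner. Here is a minimal sketch in plain Python; the clause itself is a proposal from this thread, not existing Flink syntax:

```python
import re

# Hypothetical "process as" clause: strip it from the query text and return
# the execution mode the planner should use. Sketch only, not Flink code.
CLAUSE = re.compile(r"\s+process\s+as\s+(stream|batch)\s*;?\s*$", re.IGNORECASE)

def split_mode(query, default="stream"):
    """Return (clean_query, mode); fall back to `default` if no clause."""
    m = CLAUSE.search(query)
    if m is None:
        return query.rstrip("; \t"), default
    return query[:m.start()], m.group(1).lower()

q, mode = split_mode("select count(*) from flink_mailing_list process as batch;")
print(q, mode)  # select count(*) from flink_mailing_list batch
```

The cleaned query goes to the regular planner, and the extracted mode replaces what would otherwise live in a configuration file.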

Thanks,
Taher Koitawala

On Fri 12 Oct, 2018, 2:35 AM Zhang, Xuefu, <xu...@alibaba-inc.com> wrote:

> Hi Rong,
>
> Thanks for your feedback. Some of my earlier comments might have addressed
> some of your points, so here I'd like to cover some specifics.
>
> 1. Yes, I expect that table stats stored in Hive will be used in Flink
> plan optimization, but it's not part of compatibility concern (yet).
> 2. Both implementing Hive UDFs in Flink natively and making Hive UDFs work
> in Flink are considered.
> 3. I am aware of FLIP-24, but here the proposal is to make remote server
> compatible with HiveServer2. They are not mutually exclusive either.
> 4. The JDBC/ODBC driver in question is for the remote server that Flink
> provides. It's usually the servicer owner who provides drivers to their
> services. We weren't talking about JDBC/ODBC driver to external DB systems.
>
> Let me know if you have further questions.
>
> Thanks,
> Xuefu
>
> ------------------------------------------------------------------
> Sender:Rong Rong <wa...@gmail.com>
> Sent at:2018 Oct 12 (Fri) 01:52
> Recipient:Timo Walther <tw...@apache.org>
> Cc:dev <de...@flink.apache.org>; jornfranke <jo...@gmail.com>; Xuefu <
> xuefu.z@alibaba-inc.com>; vino yang <ya...@gmail.com>; Fabian
> Hueske <fh...@gmail.com>; user <us...@flink.apache.org>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi Xuefu,
>
> Thanks for putting together the overview. I would like to add some more on
> top of Timo's comments.
> 1,2. I agree with Timo that a proper catalog support should also address
> the metadata compatibility issues. I was actually wondering if you are
> referring to something like utilizing table stats for plan optimization?
> 4. If the key is to have users integrate Hive UDF without code changes to
> Flink UDF, it shouldn't be a problem as Timo mentioned. Is your concern
> mostly on the support of Hive UDFs that should be implemented in
> Flink-table natively?
> 7,8. Correct me if I am wrong, but I feel like some of the related
> components might have already been discussed in the longer term road map of
> FLIP-24 [1]?
> 9. per Jorn's comment to stay clear from a tight dependency on Hive and
> treat it as one "connector" system. Should we also consider treating
> JDBC/ODBC driver as part of the component from the connector system instead
> of having Flink to provide them?
>
> Thanks,
> Rong
>
> [1].
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-24+-+SQL+Client
>
> On Thu, Oct 11, 2018 at 12:46 AM Timo Walther <tw...@apache.org> wrote:
> Hi Xuefu,
>
> thanks for your proposal, it is a nice summary. Here are my thoughts to
> your list:
>
> 1. I think this is also on our current mid-term roadmap. Flink lacks a
> poper catalog support for a very long time. Before we can connect
> catalogs we need to define how to map all the information from a catalog
> to Flink's representation. This is why the work on the unified connector
> API [1] is going on for quite some time as it is the first approach to
> discuss and represent the pure characteristics of connectors.
> 2. It would be helpful to figure out what is missing in [1] to to ensure
> this point. I guess we will need a new design document just for a proper
> Hive catalog integration.
> 3. This is already work in progress. ORC has been merged, Parquet is on
> its way [1].
> 4. This should be easy. There was a PR in past that I reviewed but was
> not maintained anymore.
> 5. The type system of Flink SQL is very flexible. Only UNION type is
> missing.
> 6. A Flink SQL DDL is on the roadmap soon once we are done with [1].
> Support for Hive syntax also needs cooperation with Apache Calcite.
> 7-11. Long-term goals.
>
> I would also propose to start with a smaller scope where also current
> Flink SQL users can profit: 1, 2, 5, 3. This would allow to grow the
> Flink SQL ecosystem. After that we can aim to be fully compatible
> including syntax and UDFs (4, 6 etc.). Once the core is ready, we can
> work on the tooling (7, 8, 9) and performance (10, 11).
>
> @Jörn: Yes, we should not have a tight dependency on Hive. It should be
> treated as one "connector" system out of many.
>
> Thanks,
> Timo
>
> [1]
>
> https://docs.google.com/document/d/1Yaxp1UJUFW-peGLt8EIidwKIZEWrrA-pznWLuvaH39Y/edit?ts=5bb62df4#
> [2] https://github.com/apache/flink/pull/6483
>
> Am 11.10.18 um 07:54 schrieb Jörn Franke:
> > Would it maybe make sense to provide Flink as an engine on Hive
> („flink-on-Hive“)? Eg to address 4,5,6,8,9,10. this could be more loosely
> coupled than integrating hive in all possible flink core modules and thus
> introducing a very tight dependency to Hive in the core.
> > 1,2,3 could be achieved via a connector based on the Flink Table API.
> > Just as a proposal to start this Endeavour as independent projects (hive
> engine, connector) to avoid too tight coupling with Flink. Maybe in a more
> distant future if the Hive integration is heavily demanded one could then
> integrate it more tightly if needed.
> >
> > What is meant by 11?
> >> Am 11.10.2018 um 05:01 schrieb Zhang, Xuefu <xu...@alibaba-inc.com>:
> >>
> >> Hi Fabian/Vno,
> >>
> >> Thank you very much for your encouragement inquiry. Sorry that I didn't
> see Fabian's email until I read Vino's response just now. (Somehow Fabian's
> went to the spam folder.)
> >>
> >> My proposal contains long-term and short-terms goals. Nevertheless, the
> effort will focus on the following areas, including Fabian's list:
> >>
> >> 1. Hive metastore connectivity - This covers both read/write access,
> which means Flink can make full use of Hive's metastore as its catalog (at
> least for the batch but can extend for streaming as well).
> >> 2. Metadata compatibility - Objects (databases, tables, partitions,
> etc) created by Hive can be understood by Flink and the reverse direction
> is true also.
> >> 3. Data compatibility - Similar to #2, data produced by Hive can be
> consumed by Flink and vise versa.
> >> 4. Support Hive UDFs - For all Hive's native udfs, Flink either
> provides its own implementation or make Hive's implementation work in
> Flink. Further, for user created UDFs in Hive, Flink SQL should provide a
> mechanism allowing user to import them into Flink without any code change
> required.
> >> 5. Data types -  Flink SQL should support all data types that are
> available in Hive.
> >> 6. SQL Language - Flink SQL should support SQL standard (such as
> SQL2003) with extension to support Hive's syntax and language features,
> around DDL, DML, and SELECT queries.
> >> 7.  SQL CLI - this is currently developing in Flink but more effort is
> needed.
> >> 8. Server - provide a server that's compatible with Hive's HiverServer2
> in thrift APIs, such that HiveServer2 users can reuse their existing client
> (such as beeline) but connect to Flink's thrift server instead.
> >> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for
> other application to use to connect to its thrift server
> >> 10. Support other user's customizations in Hive, such as Hive Serdes,
> storage handlers, etc.
> >> 11. Better task failure tolerance and task scheduling at Flink runtime.
> >>
> >> As you can see, achieving all those requires significant effort and
> across all layers in Flink. However, a short-term goal could  include only
> core areas (such as 1, 2, 4, 5, 6, 7) or start  at a smaller scope (such as
> #3, #6).
> >>
> >> Please share your further thoughts. If we generally agree that this is
> the right direction, I could come up with a formal proposal quickly and
> then we can follow up with broader discussions.
> >>
> >> Thanks,
> >> Xuefu
> >>
> >>
> >>
> >> ------------------------------------------------------------------
> >> Sender:vino yang <ya...@gmail.com>
> >> Sent at:2018 Oct 11 (Thu) 09:45
> >> Recipient:Fabian Hueske <fh...@gmail.com>
> >> Cc:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>; user <
> user@flink.apache.org>
> >> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> >>
> >> Hi Xuefu,
> >>
> >> Appreciate this proposal, and like Fabian, it would look better if you
> can give more details of the plan.
> >>
> >> Thanks, vino.
> >>
> >> Fabian Hueske <fh...@gmail.com> 于2018年10月10日周三 下午5:27写道:
> >> Hi Xuefu,
> >>
> >> Welcome to the Flink community and thanks for starting this discussion!
> Better Hive integration would be really great!
> >> Can you go into details of what you are proposing? I can think of a
> couple ways to improve Flink in that regard:
> >>
> >> * Support for Hive UDFs
> >> * Support for Hive metadata catalog
> >> * Support for HiveQL syntax
> >> * ???
> >>
> >> Best, Fabian
> >>
> >> Am Di., 9. Okt. 2018 um 19:22 Uhr schrieb Zhang, Xuefu <
> xuefu.z@alibaba-inc.com>:
> >> Hi all,
> >>
> >> Along with the community's effort, inside Alibaba we have explored
> Flink's potential as an execution engine not just for stream processing but
> also for batch processing. We are encouraged by our findings and have
> initiated our effort to make Flink's SQL capabilities full-fledged. When
> comparing what's available in Flink to the offerings from competitive data
> processing engines, we identified a major gap in Flink: a well integration
> with Hive ecosystem. This is crucial to the success of Flink SQL and batch
> due to the well-established data ecosystem around Hive. Therefore, we have
> done some initial work along this direction but there are still a lot of
> effort needed.
> >>
> >> We have two strategies in mind. The first one is to make Flink SQL
> full-fledged and well-integrated with Hive ecosystem. This is a similar
> approach to what Spark SQL adopted. The second strategy is to make Hive
> itself work with Flink, similar to the proposal in [1]. Each approach bears
> its pros and cons, but they don’t need to be mutually exclusive with each
> targeting at different users and use cases. We believe that both will
> promote a much greater adoption of Flink beyond stream processing.
> >>
> >> We have been focused on the first approach and would like to showcase
> Flink's batch and SQL capabilities with Flink SQL. However, we have also
> planned to start strategy #2 as the follow-up effort.
> >>
> >> I'm completely new to Flink(, with a short bio [2] below), though many
> of my colleagues here at Alibaba are long-time contributors. Nevertheless,
> I'd like to share our thoughts and invite your early feedback. At the same
> time, I am working on a detailed proposal on Flink SQL's integration with
> Hive ecosystem, which will be also shared when ready.
> >>
> >> While the ideas are simple, each approach will demand significant
> effort, more than what we can afford. Thus, the input and contributions
> from the communities are greatly welcome and appreciated.
> >>
> >> Regards,
> >>
> >>
> >> Xuefu
> >>
> >> References:
> >>
> >> [1] https://issues.apache.org/jira/browse/HIVE-10712
> >> [2] Xuefu Zhang is a long-time open source veteran, worked or working
> on many projects under Apache Foundation, of which he is also an honored
> member. About 10 years ago he worked in the Hadoop team at Yahoo where the
> projects just got started. Later he worked at Cloudera, initiating and
> leading the development of Hive on Spark project in the communities and
> across many organizations. Prior to joining Alibaba, he worked at Uber
> where he promoted Hive on Spark to all Uber's SQL on Hadoop workload and
> significantly improved Uber's cluster efficiency.
> >>
> >>
>
>


Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by "Zhang, Xuefu" <xu...@alibaba-inc.com>.
Hi Rong,

Thanks for your feedback. Some of my earlier comments might have addressed some of your points, so here I'd like to cover some specifics.

1. Yes, I expect that table stats stored in Hive will be used in Flink plan optimization, but they are not part of the compatibility concern (yet).
2. Both implementing Hive UDFs natively in Flink and making Hive's own UDF implementations work in Flink are considered.
3. I am aware of FLIP-24, but here the proposal is to make the remote server compatible with HiveServer2. They are not mutually exclusive either.
4. The JDBC/ODBC drivers in question are for the remote server that Flink provides. It's usually the service owner who provides drivers for their services. We weren't talking about JDBC/ODBC drivers to external DB systems.
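The benefit behind point 1 can be shown with a toy cost rule: with row counts from a catalog such as the Hive metastore, a planner can broadcast a small join side instead of shuffling both. This is a deliberately simplified sketch with a made-up threshold, not Flink's optimizer:

```python
# Toy illustration of stats-driven planning: choose a join strategy from
# table row counts, as a catalog (e.g. the Hive metastore) could supply them.
# The threshold and strategy names are invented for the example.

BROADCAST_THRESHOLD = 1_000_000  # rows; hypothetical tuning knob

def choose_join_strategy(left_rows, right_rows):
    """Broadcast the smaller side if it is small enough, else shuffle both."""
    small = min(left_rows, right_rows)
    if small <= BROADCAST_THRESHOLD:
        side = "left" if left_rows <= right_rows else "right"
        return f"broadcast-{side}"
    return "shuffle-hash"

# With stats available, a 10k-row dimension table gets broadcast:
print(choose_join_strategy(2_000_000_000, 10_000))      # broadcast-right
# When both sides are large, the planner falls back to shuffling:
print(choose_join_strategy(2_000_000_000, 50_000_000))  # shuffle-hash
```

Without catalog-provided stats a planner has no basis for such a choice, which is why stats access is worth having even before full metadata compatibility.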

Let me know if you have further questions.

Thanks,
Xuefu


------------------------------------------------------------------
Sender:Rong Rong <wa...@gmail.com>
Sent at:2018 Oct 12 (Fri) 01:52
Recipient:Timo Walther <tw...@apache.org>
Cc:dev <de...@flink.apache.org>; jornfranke <jo...@gmail.com>; Xuefu <xu...@alibaba-inc.com>; vino yang <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>; user <us...@flink.apache.org>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Xuefu, 

Thanks for putting together the overview. I would like to add some more on top of Timo's comments.
1,2. I agree with Timo that a proper catalog support should also address the metadata compatibility issues. I was actually wondering if you are referring to something like utilizing table stats for plan optimization?
4. If the key is to have users integrate Hive UDF without code changes to Flink UDF, it shouldn't be a problem as Timo mentioned. Is your concern mostly on the support of Hive UDFs that should be implemented in Flink-table natively?
7,8. Correct me if I am wrong, but I feel like some of the related components might have already been discussed in the longer term road map of FLIP-24 [1]?
9. per Jorn's comment to stay clear from a tight dependency on Hive and treat it as one "connector" system. Should we also consider treating JDBC/ODBC driver as part of the component from the connector system instead of having Flink to provide them?

Thanks,
Rong

[1]. https://cwiki.apache.org/confluence/display/FLINK/FLIP-24+-+SQL+Client
On Thu, Oct 11, 2018 at 12:46 AM Timo Walther <tw...@apache.org> wrote:
Hi Xuefu,

 thanks for your proposal, it is a nice summary. Here are my thoughts to 
 your list:

 1. I think this is also on our current mid-term roadmap. Flink lacks a 
 poper catalog support for a very long time. Before we can connect 
 catalogs we need to define how to map all the information from a catalog 
 to Flink's representation. This is why the work on the unified connector 
 API [1] is going on for quite some time as it is the first approach to 
 discuss and represent the pure characteristics of connectors.
 2. It would be helpful to figure out what is missing in [1] to to ensure 
 this point. I guess we will need a new design document just for a proper 
 Hive catalog integration.
 3. This is already work in progress. ORC has been merged, Parquet is on 
 its way [1].
 4. This should be easy. There was a PR in past that I reviewed but was 
 not maintained anymore.
 5. The type system of Flink SQL is very flexible. Only UNION type is 
 missing.
 6. A Flink SQL DDL is on the roadmap soon once we are done with [1]. 
 Support for Hive syntax also needs cooperation with Apache Calcite.
 7-11. Long-term goals.

 I would also propose to start with a smaller scope where also current 
 Flink SQL users can profit: 1, 2, 5, 3. This would allow to grow the 
 Flink SQL ecosystem. After that we can aim to be fully compatible 
 including syntax and UDFs (4, 6 etc.). Once the core is ready, we can 
 work on the tooling (7, 8, 9) and performance (10, 11).

 @Jörn: Yes, we should not have a tight dependency on Hive. It should be 
 treated as one "connector" system out of many.

 Thanks,
 Timo

 [1] 
https://docs.google.com/document/d/1Yaxp1UJUFW-peGLt8EIidwKIZEWrrA-pznWLuvaH39Y/edit?ts=5bb62df4#
 [2] https://github.com/apache/flink/pull/6483

 Am 11.10.18 um 07:54 schrieb Jörn Franke:
 > Would it maybe make sense to provide Flink as an engine on Hive („flink-on-Hive“)? Eg to address 4,5,6,8,9,10. this could be more loosely coupled than integrating hive in all possible flink core modules and thus introducing a very tight dependency to Hive in the core.
 > 1,2,3 could be achieved via a connector based on the Flink Table API.
 > Just as a proposal to start this Endeavour as independent projects (hive engine, connector) to avoid too tight coupling with Flink. Maybe in a more distant future if the Hive integration is heavily demanded one could then integrate it more tightly if needed.
 >
 > What is meant by 11?
 >> Am 11.10.2018 um 05:01 schrieb Zhang, Xuefu <xu...@alibaba-inc.com>:
 >>
 >> Hi Fabian/Vino,
 >>
 >> Thank you very much for your encouragement and inquiry. Sorry that I didn't see Fabian's email until I read Vino's response just now. (Somehow Fabian's went to the spam folder.)
 >>
 >> My proposal contains long-term and short-term goals. Nevertheless, the effort will focus on the following areas, including Fabian's list:
 >>
 >> 1. Hive metastore connectivity - This covers both read and write access, which means Flink can make full use of Hive's metastore as its catalog (at least for batch, but this can be extended to streaming as well).
 >> 2. Metadata compatibility - Objects (databases, tables, partitions, etc.) created by Hive can be understood by Flink, and vice versa.
 >> 3. Data compatibility - Similar to #2, data produced by Hive can be consumed by Flink, and vice versa.
 >> 4. Support for Hive UDFs - For all of Hive's native UDFs, Flink either provides its own implementation or makes Hive's implementation work in Flink. Further, for user-created UDFs in Hive, Flink SQL should provide a mechanism allowing users to import them into Flink without requiring any code change.
 >> 5. Data types - Flink SQL should support all data types that are available in Hive.
 >> 6. SQL language - Flink SQL should support the SQL standard (such as SQL:2003) with extensions to support Hive's syntax and language features, around DDL, DML, and SELECT queries.
 >> 7. SQL CLI - This is currently under development in Flink, but more effort is needed.
 >> 8. Server - Provide a server that's compatible with Hive's HiveServer2 in its Thrift APIs, such that HiveServer2 users can reuse their existing clients (such as Beeline) but connect to Flink's Thrift server instead.
 >> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for other applications to connect to its Thrift server.
 >> 10. Support for other user customizations in Hive, such as Hive SerDes, storage handlers, etc.
 >> 11. Better fault tolerance and task scheduling in the Flink runtime.
 >>
 >> As you can see, achieving all of this requires significant effort across all layers in Flink. However, a short-term goal could include only the core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope (such as #3, #6).
 >>
 >> Please share your further thoughts. If we generally agree that this is the right direction, I could come up with a formal proposal quickly and then we can follow up with broader discussions.
 >>
 >> Thanks,
 >> Xuefu
 >>
 >>
 >>
 >> ------------------------------------------------------------------
 >> Sender:vino yang <ya...@gmail.com>
 >> Sent at:2018 Oct 11 (Thu) 09:45
 >> Recipient:Fabian Hueske <fh...@gmail.com>
 >> Cc:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>; user <us...@flink.apache.org>
 >> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
 >>
 >> Hi Xuefu,
 >>
 >> Appreciate this proposal, and like Fabian, it would be better if you could give more details of the plan.
 >>
 >> Thanks, vino.
 >>
 >> Fabian Hueske <fh...@gmail.com> 于2018年10月10日周三 下午5:27写道:
 >> Hi Xuefu,
 >>
 >> Welcome to the Flink community and thanks for starting this discussion! Better Hive integration would be really great!
 >> Can you go into details of what you are proposing? I can think of a couple ways to improve Flink in that regard:
 >>
 >> * Support for Hive UDFs
 >> * Support for Hive metadata catalog
 >> * Support for HiveQL syntax
 >> * ???
 >>
 >> Best, Fabian
 >>
 >> Am Di., 9. Okt. 2018 um 19:22 Uhr schrieb Zhang, Xuefu <xu...@alibaba-inc.com>:
 >> Hi all,
 >>
 >> Along with the community's effort, inside Alibaba we have explored Flink's potential as an execution engine not just for stream processing but also for batch processing. We are encouraged by our findings and have initiated our effort to make Flink's SQL capabilities full-fledged. When comparing what's available in Flink to the offerings from competitive data processing engines, we identified a major gap in Flink: good integration with the Hive ecosystem. This is crucial to the success of Flink SQL and batch processing due to the well-established data ecosystem around Hive. Therefore, we have done some initial work in this direction, but a lot of effort is still needed.
 >>
 >> We have two strategies in mind. The first one is to make Flink SQL full-fledged and well-integrated with the Hive ecosystem. This is a similar approach to what Spark SQL adopted. The second strategy is to make Hive itself work with Flink, similar to the proposal in [1]. Each approach bears its pros and cons, but they don't need to be mutually exclusive, with each targeting different users and use cases. We believe that both will promote a much greater adoption of Flink beyond stream processing.
 >>
 >> We have been focused on the first approach and would like to showcase Flink's batch and SQL capabilities with Flink SQL. However, we have also planned to start strategy #2 as the follow-up effort.
 >>
 >> I'm completely new to Flink (a short bio [2] is below), though many of my colleagues here at Alibaba are long-time contributors. Nevertheless, I'd like to share our thoughts and invite your early feedback. At the same time, I am working on a detailed proposal on Flink SQL's integration with the Hive ecosystem, which will also be shared when ready.
 >>
 >> While the ideas are simple, each approach will demand significant effort, more than what we can afford. Thus, input and contributions from the community are greatly welcomed and appreciated.
 >>
 >> Regards,
 >>
 >>
 >> Xuefu
 >>
 >> References:
 >>
 >> [1] https://issues.apache.org/jira/browse/HIVE-10712
 >> [2] Xuefu Zhang is a long-time open source veteran who has worked on many projects under the Apache Software Foundation, of which he is also an honored member. About 10 years ago, he worked on the Hadoop team at Yahoo when the projects had just gotten started. Later, he worked at Cloudera, initiating and leading the development of the Hive on Spark project across the community and many organizations. Prior to joining Alibaba, he worked at Uber, where he brought Hive on Spark to all of Uber's SQL-on-Hadoop workload and significantly improved Uber's cluster efficiency.
 >>
 >>
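Picking up point 4 from the list above, one way to run an existing Hive UDF unchanged is a thin adapter that bridges its lifecycle to Flink's scalar-function contract. The sketch below is purely illustrative: `HiveStyleUDF` and `FlinkStyleScalarFunction` only imitate the shape of Hive's `GenericUDF` and Flink's `ScalarFunction`; they are stand-ins, not the real APIs.

```python
# Illustrative adapter: run a Hive-style UDF behind a Flink-style
# scalar-function interface. The two base classes are stand-ins that
# only imitate the shape of the real Java APIs.

class HiveStyleUDF:
    """Mimics Hive's GenericUDF lifecycle: initialize once, then evaluate."""
    def initialize(self, arg_types):
        self.arg_types = arg_types
    def evaluate(self, *args):
        raise NotImplementedError

class UpperUDF(HiveStyleUDF):
    """A user's existing 'Hive' UDF, which we want to reuse untouched."""
    def evaluate(self, s):
        return s.upper() if s is not None else None

class FlinkStyleScalarFunction:
    """Mimics Flink's ScalarFunction: open() once, eval() per row."""
    def open(self):
        pass
    def eval(self, *args):
        raise NotImplementedError

class HiveUDFAdapter(FlinkStyleScalarFunction):
    """Wraps a Hive-style UDF so Flink-side code can call it unchanged."""
    def __init__(self, hive_udf, arg_types):
        self.hive_udf = hive_udf
        self.arg_types = arg_types
    def open(self):
        # Map Flink's open() onto Hive's initialize().
        self.hive_udf.initialize(self.arg_types)
    def eval(self, *args):
        # Delegate per-row evaluation to the wrapped Hive UDF.
        return self.hive_udf.evaluate(*args)
```

The user's `UpperUDF` is then registered as `HiveUDFAdapter(UpperUDF(), ["string"])` without touching its code, which is exactly the "no code change required" property point 4 asks for.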


Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by "Zhang, Xuefu" <xu...@alibaba-inc.com>.
Hi Rong,

Thanks for your feedback. Some of my earlier comments might have addressed some of your points, so here I'd like to cover some specifics.

1. Yes, I expect that table stats stored in Hive will be used in Flink's plan optimization, but that's not part of the compatibility concern (yet).
2. Both implementing Hive UDFs natively in Flink and making Hive UDFs work in Flink are considered.
3. I am aware of FLIP-24, but here the proposal is to make the remote server compatible with HiveServer2. They are not mutually exclusive either.
4. The JDBC/ODBC drivers in question are for the remote server that Flink provides. It's usually the service owner who provides drivers for their service. We weren't talking about JDBC/ODBC drivers for external DB systems.

Let me know if you have further questions.

Thanks,
Xuefu
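The metadata round-trip behind points 1 and 2 of the earlier list can be pictured as both engines reading and writing table descriptors through one shared catalog. The sketch below is hypothetical: the classes only imitate what a real HiveCatalog backed by the Hive metastore would expose, and all names are illustrative.

```python
# Illustrative sketch of the metadata round-trip: both "engines" go
# through one shared table registry that stands in for the Hive
# metastore. All class names here are hypothetical.
from dataclasses import dataclass, field

@dataclass
class TableDescriptor:
    name: str
    columns: dict                                   # column name -> type name
    properties: dict = field(default_factory=dict)  # free-form table properties

class InMemoryMetastore:
    """Stand-in for the Hive metastore: one registry, two consumers."""
    def __init__(self):
        self._tables = {}
    def create_table(self, db: str, table: TableDescriptor):
        self._tables[(db, table.name)] = table
    def get_table(self, db: str, name: str) -> TableDescriptor:
        return self._tables[(db, name)]

metastore = InMemoryMetastore()

# A table "created by Hive" ...
metastore.create_table("default", TableDescriptor(
    name="clicks",
    columns={"user_id": "BIGINT", "url": "STRING"},
    properties={"created_by": "hive"},
))

# ... is visible to "Flink" through the same catalog, and vice versa.
clicks = metastore.get_table("default", "clicks")
```

The point of the sketch is only the direction of the dependency: once Flink reads and writes the same descriptors, objects created on either side are understood by the other, which is what point 2 requires.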


------------------------------------------------------------------
Sender:Rong Rong <wa...@gmail.com>
Sent at:2018 Oct 12 (Fri) 01:52
Recipient:Timo Walther <tw...@apache.org>
Cc:dev <de...@flink.apache.org>; jornfranke <jo...@gmail.com>; Xuefu <xu...@alibaba-inc.com>; vino yang <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>; user <us...@flink.apache.org>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Xuefu, 

Thanks for putting together the overview. I would like to add some more on top of Timo's comments.
1, 2. I agree with Timo that proper catalog support should also address the metadata compatibility issues. I was actually wondering if you are referring to something like utilizing table stats for plan optimization?
4. If the key is to have users integrate Hive UDFs without code changes to Flink UDFs, it shouldn't be a problem, as Timo mentioned. Is your concern mostly about the support of Hive UDFs that should be implemented natively in flink-table?
7, 8. Correct me if I am wrong, but I feel like some of the related components might have already been discussed in the longer-term roadmap of FLIP-24 [1]?
9. Per Jörn's comment to steer clear of a tight dependency on Hive and treat it as one "connector" system: should we also consider treating the JDBC/ODBC drivers as part of the connector system instead of having Flink provide them?

Thanks,
Rong

[1]. https://cwiki.apache.org/confluence/display/FLINK/FLIP-24+-+SQL+Client
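Rong's point 9 is essentially a packaging question: does the driver live with Flink core, or with the connector? A toy registry (all names hypothetical) shows the connector-side option, where each "connector" system declares the client driver it bundles and Flink core never depends on any particular one.

```python
# Toy sketch of point 9: each connector registers itself together with
# the client driver it bundles, so Flink core never has to ship or
# depend on a particular driver. All names are hypothetical.

class ConnectorRegistry:
    def __init__(self):
        self._connectors = {}
    def register(self, name: str, driver_class: str):
        # The connector, not Flink core, declares its driver.
        self._connectors[name] = {"driver": driver_class}
    def driver_for(self, name: str) -> str:
        return self._connectors[name]["driver"]

registry = ConnectorRegistry()
registry.register("hive", "org.example.hive.jdbc.HiveDriver")    # hypothetical class
registry.register("kafka", "org.example.kafka.ClientFactory")    # hypothetical class
```

Under this packaging, dropping a connector from the classpath also drops its driver, which keeps the loose coupling Jörn argued for.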



Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by Rong Rong <wa...@gmail.com>.
Hi Xuefu,

Thanks for putting together the overview. I would like to add some more on
top of Timo's comments.
1,2. I agree with Timo that a proper catalog support should also address
the metadata compatibility issues. I was actually wondering if you are
referring to something like utilizing table stats for plan optimization?
4. If the key is to have users integrate Hive UDF without code changes to
Flink UDF, it shouldn't be a problem as Timo mentioned. Is your concern
mostly on the support of Hive UDFs that should be implemented in
Flink-table natively?
7,8. Correct me if I am wrong, but I feel like some of the related
components might have already been discussed in the longer term road map of
FLIP-24 [1]?
9. per Jorn's comment to stay clear from a tight dependency on Hive and
treat it as one "connector" system. Should we also consider treating
JDBC/ODBC driver as part of the component from the connector system instead
of having Flink to provide them?

Thanks,
Rong

[1]. https://cwiki.apache.org/confluence/display/FLINK/FLIP-24+-+SQL+Client

On Thu, Oct 11, 2018 at 12:46 AM Timo Walther <tw...@apache.org> wrote:

> Hi Xuefu,
>
> thanks for your proposal, it is a nice summary. Here are my thoughts to
> your list:
>
> 1. I think this is also on our current mid-term roadmap. Flink lacks a
> poper catalog support for a very long time. Before we can connect
> catalogs we need to define how to map all the information from a catalog
> to Flink's representation. This is why the work on the unified connector
> API [1] is going on for quite some time as it is the first approach to
> discuss and represent the pure characteristics of connectors.
> 2. It would be helpful to figure out what is missing in [1] to to ensure
> this point. I guess we will need a new design document just for a proper
> Hive catalog integration.
> 3. This is already work in progress. ORC has been merged, Parquet is on
> its way [1].
> 4. This should be easy. There was a PR in past that I reviewed but was
> not maintained anymore.
> 5. The type system of Flink SQL is very flexible. Only UNION type is
> missing.
> 6. A Flink SQL DDL is on the roadmap soon once we are done with [1].
> Support for Hive syntax also needs cooperation with Apache Calcite.
> 7-11. Long-term goals.
>
> I would also propose to start with a smaller scope where also current
> Flink SQL users can profit: 1, 2, 5, 3. This would allow to grow the
> Flink SQL ecosystem. After that we can aim to be fully compatible
> including syntax and UDFs (4, 6 etc.). Once the core is ready, we can
> work on the tooling (7, 8, 9) and performance (10, 11).
>
> @Jörn: Yes, we should not have a tight dependency on Hive. It should be
> treated as one "connector" system out of many.
>
> Thanks,
> Timo
>
> [1]
>
> https://docs.google.com/document/d/1Yaxp1UJUFW-peGLt8EIidwKIZEWrrA-pznWLuvaH39Y/edit?ts=5bb62df4#
> [2] https://github.com/apache/flink/pull/6483
>
> Am 11.10.18 um 07:54 schrieb Jörn Franke:
> > Would it maybe make sense to provide Flink as an engine on Hive
> („flink-on-Hive“)? Eg to address 4,5,6,8,9,10. this could be more loosely
> coupled than integrating hive in all possible flink core modules and thus
> introducing a very tight dependency to Hive in the core.
> > 1,2,3 could be achieved via a connector based on the Flink Table API.
> > Just as a proposal to start this Endeavour as independent projects (hive
> engine, connector) to avoid too tight coupling with Flink. Maybe in a more
> distant future if the Hive integration is heavily demanded one could then
> integrate it more tightly if needed.
> >
> > What is meant by 11?
> >> Am 11.10.2018 um 05:01 schrieb Zhang, Xuefu <xu...@alibaba-inc.com>:
> >>
> >> Hi Fabian/Vno,
> >>
> >> Thank you very much for your encouragement inquiry. Sorry that I didn't
> see Fabian's email until I read Vino's response just now. (Somehow Fabian's
> went to the spam folder.)
> >>
> >> My proposal contains long-term and short-terms goals. Nevertheless, the
> effort will focus on the following areas, including Fabian's list:
> >>
> >> 1. Hive metastore connectivity - This covers both read and write
> access, which means Flink can make full use of Hive's metastore as its
> catalog (at least for batch, but this can be extended to streaming as
> well).
> >> 2. Metadata compatibility - Objects (databases, tables, partitions,
> etc.) created by Hive can be understood by Flink, and the reverse is
> true as well.
> >> 3. Data compatibility - Similar to #2, data produced by Hive can be
> consumed by Flink and vice versa.
> >> 4. Support for Hive UDFs - For all of Hive's native UDFs, Flink
> either provides its own implementation or makes Hive's implementation
> work in Flink. Further, for user-created UDFs in Hive, Flink SQL should
> provide a mechanism allowing users to import them into Flink without
> any code change required.
> >> 5. Data types - Flink SQL should support all data types that are
> available in Hive.
> >> 6. SQL language - Flink SQL should support the SQL standard (such as
> SQL:2003) with extensions to support Hive's syntax and language
> features, around DDL, DML, and SELECT queries.
> >> 7. SQL CLI - This is currently being developed in Flink, but more
> effort is needed.
> >> 8. Server - Provide a server that's compatible with Hive's
> HiveServer2 in its Thrift APIs, such that HiveServer2 users can reuse
> their existing clients (such as Beeline) but connect to Flink's Thrift
> server instead.
> >> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers
> for other applications to use to connect to its Thrift server.
> >> 10. Support other user customizations in Hive, such as Hive SerDes,
> storage handlers, etc.
> >> 11. Better task failure tolerance and task scheduling in the Flink
> runtime.
> >>
> >> As you can see, achieving all of this requires significant effort
> across all layers of Flink. However, a short-term goal could include only
> core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope (such
> as #3, #6).
> >>
> >> Please share your further thoughts. If we generally agree that this is
> the right direction, I could come up with a formal proposal quickly and
> then we can follow up with broader discussions.
> >>
> >> Thanks,
> >> Xuefu
> >>
> >>
> >>
> >> ------------------------------------------------------------------
> >> Sender:vino yang <ya...@gmail.com>
> >> Sent at:2018 Oct 11 (Thu) 09:45
> >> Recipient:Fabian Hueske <fh...@gmail.com>
> >> Cc:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>; user <
> user@flink.apache.org>
> >> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> >>
> >> Hi Xuefu,
> >>
> >> Appreciate this proposal, and like Fabian, it would look better if you
> can give more details of the plan.
> >>
> >> Thanks, vino.
> >>
> >> Fabian Hueske <fh...@gmail.com> 于2018年10月10日周三 下午5:27写道:
> >> Hi Xuefu,
> >>
> >> Welcome to the Flink community and thanks for starting this discussion!
> Better Hive integration would be really great!
> >> Can you go into details of what you are proposing? I can think of a
> couple ways to improve Flink in that regard:
> >>
> >> * Support for Hive UDFs
> >> * Support for Hive metadata catalog
> >> * Support for HiveQL syntax
> >> * ???
> >>
> >> Best, Fabian
> >>
> >> Am Di., 9. Okt. 2018 um 19:22 Uhr schrieb Zhang, Xuefu <
> xuefu.z@alibaba-inc.com>:
> >> Hi all,
> >>
> >> Along with the community's effort, inside Alibaba we have explored
> Flink's potential as an execution engine not just for stream processing
> but also for batch processing. We are encouraged by our findings and have
> initiated an effort to make Flink's SQL capabilities full-fledged. When
> comparing what's available in Flink to the offerings from competing data
> processing engines, we identified a major gap in Flink: good integration
> with the Hive ecosystem. This is crucial to the success of Flink SQL and
> batch due to the well-established data ecosystem around Hive. Therefore,
> we have done some initial work in this direction, but a lot of effort is
> still needed.
> >>
> >> We have two strategies in mind. The first is to make Flink SQL
> full-fledged and well integrated with the Hive ecosystem. This is similar
> to the approach that Spark SQL adopted. The second strategy is to make
> Hive itself work with Flink, similar to the proposal in [1]. Each approach
> has its pros and cons, but they don't need to be mutually exclusive, with
> each targeting different users and use cases. We believe that both will
> promote a much greater adoption of Flink beyond stream processing.
> >>
> >> We have been focusing on the first approach and would like to
> showcase Flink's batch and SQL capabilities through Flink SQL. However,
> we also plan to start strategy #2 as a follow-up effort.
> >>
> >> I'm completely new to Flink (a short bio is included below [2]),
> though many of my colleagues here at Alibaba are long-time contributors.
> Nevertheless, I'd like to share our thoughts and invite your early
> feedback. At the same time, I am working on a detailed proposal on Flink
> SQL's integration with the Hive ecosystem, which will also be shared when
> ready.
> >>
> >> While the ideas are simple, each approach will demand significant
> effort, more than what we can afford on our own. Thus, input and
> contributions from the community are greatly welcome and appreciated.
> >>
> >> Regards,
> >>
> >>
> >> Xuefu
> >>
> >> References:
> >>
> >> [1] https://issues.apache.org/jira/browse/HIVE-10712
> >> [2] Xuefu Zhang is a long-time open source veteran who has worked on
> many projects under the Apache Foundation, of which he is also an honored
> member. About 10 years ago he worked on the Hadoop team at Yahoo when the
> projects were just getting started. Later he worked at Cloudera,
> initiating and leading the development of the Hive on Spark project
> across the community and many organizations. Prior to joining Alibaba, he
> worked at Uber, where he drove the adoption of Hive on Spark for all of
> Uber's SQL-on-Hadoop workload and significantly improved Uber's cluster
> efficiency.
> >>
> >>
>
>

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by "Zhang, Xuefu" <xu...@alibaba-inc.com>.
Hi Timo,

Thank you for your input. It's exciting to see that the community has already initiated some of these topics. We'd certainly like to leverage the current and previous work and make progress in phases. Here I'd like to comment on a few things on top of your feedback.

1. I think there are two aspects to #1 and #2 with regard to the Hive metastore: a) as a backend store for Flink's metadata (which currently lives in memory), and b) as an external catalog (much like a JDBC catalog) that Flink can interact with. While it may be possible, and would be nice, to achieve both in a single design, our focus has been on the latter. We will consider both cases in our design.
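
To make aspect b) concrete, here is a purely illustrative sketch of the external-catalog idea: Flink would ask an outside system for table metadata on demand rather than owning it. All class and method names below are hypothetical stand-ins and do not come from the actual Flink or Hive APIs.

```python
# Hypothetical sketch of the "external catalog" role (aspect b): the query
# engine asks an external system (here a stub standing in for the Hive
# metastore) for table metadata on demand, instead of storing it itself.
# None of these names come from the real Flink or Hive APIs.

class StubMetastore:
    """Stands in for a Hive metastore client (in reality, Thrift-based)."""
    def __init__(self):
        self._tables = {
            ("sales", "orders"): [("order_id", "bigint"), ("amount", "double")],
        }

    def get_table(self, db, table):
        # Return (column name, Hive type string) pairs for the table.
        return self._tables[(db, table)]


class HiveExternalCatalog:
    """Adapter exposing metastore contents in an engine-friendly shape."""
    def __init__(self, client):
        self.client = client

    def get_table_schema(self, db, table):
        # Translate (name, hive_type) pairs into a simple schema mapping.
        return {name: hive_type for name, hive_type in self.client.get_table(db, table)}


catalog = HiveExternalCatalog(StubMetastore())
print(catalog.get_table_schema("sales", "orders"))
# {'order_id': 'bigint', 'amount': 'double'}
```

The point of the adapter layer is that the engine never caches ownership of the metadata; a table created by Hive becomes visible to Flink on the next lookup, which is what metadata compatibility (#2) requires.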

2. Re #5, I agree that Flink seems to have the majority of the data types. However, supporting some of them (such as struct) at the SQL layer requires work on the parser (Calcite).
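
As a purely illustrative aside, the parser work exists because Hive type strings can nest, so mapping them to SQL types is more than a lookup table. The toy parser below is an assumption-laden sketch; the target labels ("ROW", "ARRAY", etc.) are simplified stand-ins, not actual Flink type names.

```python
# Illustrative only: a tiny recursive parser for Hive type strings, showing
# the kind of work needed to surface nested types like struct<> at the SQL
# layer. The result labels are simplified, not real Flink type names.

def parse_hive_type(s):
    s = s.strip()
    if s.startswith("struct<") and s.endswith(">"):
        # Split the struct body on top-level commas only (depth-aware).
        fields, depth, start = [], 0, 0
        inner = s[7:-1]
        for i, ch in enumerate(inner):
            if ch == "<":
                depth += 1
            elif ch == ">":
                depth -= 1
            elif ch == "," and depth == 0:
                fields.append(inner[start:i])
                start = i + 1
        fields.append(inner[start:])
        return ("ROW", [(f.split(":", 1)[0], parse_hive_type(f.split(":", 1)[1]))
                        for f in fields])
    if s.startswith("array<") and s.endswith(">"):
        return ("ARRAY", parse_hive_type(s[6:-1]))
    primitives = {"int": "INT", "bigint": "BIGINT", "string": "VARCHAR",
                  "double": "DOUBLE", "boolean": "BOOLEAN"}
    return primitives[s]

print(parse_hive_type("struct<a:int,b:array<string>>"))
# ('ROW', [('a', 'INT'), ('b', ('ARRAY', 'VARCHAR'))])
```

A real implementation would of course live in the SQL parser (Calcite) rather than in ad-hoc string handling; the sketch only shows why struct support is parser work rather than a type-system gap.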

3. Similarly for #6, work needs to be done on the parsing side. We can certainly ask the Calcite community to provide Hive dialect parsing, but this can be challenging and time-consuming. At the same time, we can also explore the possibility of solving the problem within Flink, such as by using Calcite's official extension mechanism. We will open that discussion when we get there.
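
To illustrate the dialect gap, a few HiveQL constructs simply do not exist in standard SQL:2003. The toy scan below is hypothetical and is nothing like how a real parser detects dialect features; it only makes the gap concrete.

```python
# Hypothetical illustration of the Hive dialect gap: a handful of HiveQL
# constructs that a standard SQL:2003 grammar does not include. This is a
# toy keyword scan, not how a real parser (e.g. Calcite) handles dialects.

HIVE_ONLY = ["LATERAL VIEW", "STORED AS", "CLUSTERED BY",
             "DISTRIBUTE BY", "SORT BY"]

def hive_specific_constructs(sql):
    # Return the Hive-only constructs that appear in the statement.
    upper = sql.upper()
    return [kw for kw in HIVE_ONLY if kw in upper]

ddl = "CREATE TABLE t (a INT) PARTITIONED BY (dt STRING) STORED AS ORC"
print(hive_specific_constructs(ddl))
# ['STORED AS']
```

Each such construct needs either a Hive dialect in Calcite or an extension on the Flink side, which is why this item is more than a configuration change.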

Yes, I agree with you that we should start with a small scope while still thinking ahead. Specifically, we will first look at metadata and data compatibility, data types, DDL/DML, queries, UDFs, and so on. I think we align well on this.

Please let me know if you have further thoughts or comments.

Thanks,
Xuefu


------------------------------------------------------------------
Sender:Timo Walther <tw...@apache.org>
Sent at:2018 Oct 11 (Thu) 15:46
Recipient:dev <de...@flink.apache.org>; "Jörn Franke" <jo...@gmail.com>; Xuefu <xu...@alibaba-inc.com>
Cc:vino yang <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>; user <us...@flink.apache.org>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Xuefu,

thanks for your proposal, it is a nice summary. Here are my thoughts to 
your list:

1. I think this is also on our current mid-term roadmap. Flink has lacked 
proper catalog support for a very long time. Before we can connect 
catalogs we need to define how to map all the information from a catalog 
to Flink's representation. This is why the work on the unified connector 
API [1] is going on for quite some time as it is the first approach to 
discuss and represent the pure characteristics of connectors.
2. It would be helpful to figure out what is missing in [1] to ensure 
this point. I guess we will need a new design document just for a proper 
Hive catalog integration.
3. This is already work in progress. ORC has been merged, and Parquet is 
on its way [2].
4. This should be easy. There was a PR in the past that I reviewed, but 
it was not maintained anymore.
5. The type system of Flink SQL is very flexible. Only UNION type is 
missing.
6. A Flink SQL DDL is on the roadmap soon once we are done with [1]. 
Support for Hive syntax also needs cooperation with Apache Calcite.
7-11. Long-term goals.

I would also propose to start with a smaller scope from which current 
Flink SQL users can also profit: 1, 2, 5, 3. This would allow the Flink 
SQL ecosystem to grow. After that, we can aim to be fully compatible, 
including syntax and UDFs (4, 6, etc.). Once the core is ready, we can 
work on the tooling (7, 8, 9) and performance (10, 11).

@Jörn: Yes, we should not have a tight dependency on Hive. It should be 
treated as one "connector" system out of many.

Thanks,
Timo

[1] 
https://docs.google.com/document/d/1Yaxp1UJUFW-peGLt8EIidwKIZEWrrA-pznWLuvaH39Y/edit?ts=5bb62df4#
[2] https://github.com/apache/flink/pull/6483

Am 11.10.18 um 07:54 schrieb Jörn Franke:
> Would it maybe make sense to provide Flink as an engine on Hive („flink-on-Hive“)? E.g., to address 4, 5, 6, 8, 9, 10, this could be more loosely coupled than integrating Hive into all possible Flink core modules and thus introducing a very tight dependency on Hive in the core.
> 1, 2, 3 could be achieved via a connector based on the Flink Table API.
> Just a proposal to start this endeavor as independent projects (Hive engine, connector) to avoid overly tight coupling with Flink. Maybe in the more distant future, if the Hive integration is heavily demanded, one could then integrate it more tightly if needed.
>
> What is meant by 11?
>> Am 11.10.2018 um 05:01 schrieb Zhang, Xuefu <xu...@alibaba-inc.com>:
>>
>> Hi Fabian/Vino,
>>
>> Thank you very much for your encouraging inquiry. Sorry that I didn't see Fabian's email until I read Vino's response just now. (Somehow Fabian's went to the spam folder.)
>>
>> My proposal contains long-term and short-term goals. Nevertheless, the effort will focus on the following areas, including Fabian's list:
>>
>> 1. Hive metastore connectivity - This covers both read and write access, which means Flink can make full use of Hive's metastore as its catalog (at least for batch, but this can be extended to streaming as well).
>> 2. Metadata compatibility - Objects (databases, tables, partitions, etc.) created by Hive can be understood by Flink, and the reverse is true as well.
>> 3. Data compatibility - Similar to #2, data produced by Hive can be consumed by Flink and vice versa.
>> 4. Support for Hive UDFs - For all of Hive's native UDFs, Flink either provides its own implementation or makes Hive's implementation work in Flink. Further, for user-created UDFs in Hive, Flink SQL should provide a mechanism allowing users to import them into Flink without any code change required.
>> 5. Data types - Flink SQL should support all data types that are available in Hive.
>> 6. SQL language - Flink SQL should support the SQL standard (such as SQL:2003) with extensions to support Hive's syntax and language features, around DDL, DML, and SELECT queries.
>> 7. SQL CLI - This is currently being developed in Flink, but more effort is needed.
>> 8. Server - Provide a server that's compatible with Hive's HiveServer2 in its Thrift APIs, such that HiveServer2 users can reuse their existing clients (such as Beeline) but connect to Flink's Thrift server instead.
>> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for other applications to use to connect to its Thrift server.
>> 10. Support other user customizations in Hive, such as Hive SerDes, storage handlers, etc.
>> 11. Better task failure tolerance and task scheduling in the Flink runtime.
>>
>> As you can see, achieving all of this requires significant effort across all layers of Flink. However, a short-term goal could include only core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope (such as #3, #6).
>>
>> Please share your further thoughts. If we generally agree that this is the right direction, I could come up with a formal proposal quickly and then we can follow up with broader discussions.
>>
>> Thanks,
>> Xuefu
>>
>>
>>
>> ------------------------------------------------------------------
>> Sender:vino yang <ya...@gmail.com>
>> Sent at:2018 Oct 11 (Thu) 09:45
>> Recipient:Fabian Hueske <fh...@gmail.com>
>> Cc:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>; user <us...@flink.apache.org>
>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>
>> Hi Xuefu,
>>
>> Appreciate this proposal, and like Fabian, it would look better if you can give more details of the plan.
>>
>> Thanks, vino.
>>
>> Fabian Hueske <fh...@gmail.com> 于2018年10月10日周三 下午5:27写道:
>> Hi Xuefu,
>>
>> Welcome to the Flink community and thanks for starting this discussion! Better Hive integration would be really great!
>> Can you go into details of what you are proposing? I can think of a couple ways to improve Flink in that regard:
>>
>> * Support for Hive UDFs
>> * Support for Hive metadata catalog
>> * Support for HiveQL syntax
>> * ???
>>
>> Best, Fabian
>>
>> Am Di., 9. Okt. 2018 um 19:22 Uhr schrieb Zhang, Xuefu <xu...@alibaba-inc.com>:
>> Hi all,
>>
>> Along with the community's effort, inside Alibaba we have explored Flink's potential as an execution engine not just for stream processing but also for batch processing. We are encouraged by our findings and have initiated an effort to make Flink's SQL capabilities full-fledged. When comparing what's available in Flink to the offerings from competing data processing engines, we identified a major gap in Flink: good integration with the Hive ecosystem. This is crucial to the success of Flink SQL and batch due to the well-established data ecosystem around Hive. Therefore, we have done some initial work in this direction, but a lot of effort is still needed.
>>
>> We have two strategies in mind. The first is to make Flink SQL full-fledged and well integrated with the Hive ecosystem. This is similar to the approach that Spark SQL adopted. The second strategy is to make Hive itself work with Flink, similar to the proposal in [1]. Each approach has its pros and cons, but they don't need to be mutually exclusive, with each targeting different users and use cases. We believe that both will promote a much greater adoption of Flink beyond stream processing.
>>
>> We have been focusing on the first approach and would like to showcase Flink's batch and SQL capabilities through Flink SQL. However, we also plan to start strategy #2 as a follow-up effort.
>>
>> I'm completely new to Flink (a short bio is included below [2]), though many of my colleagues here at Alibaba are long-time contributors. Nevertheless, I'd like to share our thoughts and invite your early feedback. At the same time, I am working on a detailed proposal on Flink SQL's integration with the Hive ecosystem, which will also be shared when ready.
>>
>> While the ideas are simple, each approach will demand significant effort, more than what we can afford on our own. Thus, input and contributions from the community are greatly welcome and appreciated.
>>
>> Regards,
>>
>>
>> Xuefu
>>
>> References:
>>
>> [1] https://issues.apache.org/jira/browse/HIVE-10712
>> [2] Xuefu Zhang is a long-time open source veteran who has worked on many projects under the Apache Foundation, of which he is also an honored member. About 10 years ago he worked on the Hadoop team at Yahoo when the projects were just getting started. Later he worked at Cloudera, initiating and leading the development of the Hive on Spark project across the community and many organizations. Prior to joining Alibaba, he worked at Uber, where he drove the adoption of Hive on Spark for all of Uber's SQL-on-Hadoop workload and significantly improved Uber's cluster efficiency.
>>
>>

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by Timo Walther <tw...@apache.org>.
Hi Xuefu,

thanks for your proposal, it is a nice summary. Here are my thoughts to 
your list:

1. I think this is also on our current mid-term roadmap. Flink has lacked 
proper catalog support for a very long time. Before we can connect 
catalogs we need to define how to map all the information from a catalog 
to Flink's representation. This is why the work on the unified connector 
API [1] is going on for quite some time as it is the first approach to 
discuss and represent the pure characteristics of connectors.
2. It would be helpful to figure out what is missing in [1] to ensure 
this point. I guess we will need a new design document just for a proper 
Hive catalog integration.
3. This is already work in progress. ORC has been merged, and Parquet is 
on its way [2].
4. This should be easy. There was a PR in the past that I reviewed, but 
it was not maintained anymore.
5. The type system of Flink SQL is very flexible. Only UNION type is 
missing.
6. A Flink SQL DDL is on the roadmap soon once we are done with [1]. 
Support for Hive syntax also needs cooperation with Apache Calcite.
7-11. Long-term goals.

I would also propose to start with a smaller scope from which current 
Flink SQL users can also profit: 1, 2, 5, 3. This would allow the Flink 
SQL ecosystem to grow. After that, we can aim to be fully compatible, 
including syntax and UDFs (4, 6, etc.). Once the core is ready, we can 
work on the tooling (7, 8, 9) and performance (10, 11).

@Jörn: Yes, we should not have a tight dependency on Hive. It should be 
treated as one "connector" system out of many.

Thanks,
Timo

[1] 
https://docs.google.com/document/d/1Yaxp1UJUFW-peGLt8EIidwKIZEWrrA-pznWLuvaH39Y/edit?ts=5bb62df4#
[2] https://github.com/apache/flink/pull/6483

Am 11.10.18 um 07:54 schrieb Jörn Franke:
> Would it maybe make sense to provide Flink as an engine on Hive („flink-on-Hive“)? E.g., to address 4, 5, 6, 8, 9, 10, this could be more loosely coupled than integrating Hive into all possible Flink core modules and thus introducing a very tight dependency on Hive in the core.
> 1, 2, 3 could be achieved via a connector based on the Flink Table API.
> Just a proposal to start this endeavor as independent projects (Hive engine, connector) to avoid overly tight coupling with Flink. Maybe in the more distant future, if the Hive integration is heavily demanded, one could then integrate it more tightly if needed.
>
> What is meant by 11?
>> Am 11.10.2018 um 05:01 schrieb Zhang, Xuefu <xu...@alibaba-inc.com>:
>>
>> Hi Fabian/Vino,
>>
>> Thank you very much for your encouraging inquiry. Sorry that I didn't see Fabian's email until I read Vino's response just now. (Somehow Fabian's went to the spam folder.)
>>
>> My proposal contains long-term and short-term goals. Nevertheless, the effort will focus on the following areas, including Fabian's list:
>>
>> 1. Hive metastore connectivity - This covers both read and write access, which means Flink can make full use of Hive's metastore as its catalog (at least for batch, but this can be extended to streaming as well).
>> 2. Metadata compatibility - Objects (databases, tables, partitions, etc.) created by Hive can be understood by Flink, and the reverse is true as well.
>> 3. Data compatibility - Similar to #2, data produced by Hive can be consumed by Flink and vice versa.
>> 4. Support for Hive UDFs - For all of Hive's native UDFs, Flink either provides its own implementation or makes Hive's implementation work in Flink. Further, for user-created UDFs in Hive, Flink SQL should provide a mechanism allowing users to import them into Flink without any code change required.
>> 5. Data types - Flink SQL should support all data types that are available in Hive.
>> 6. SQL language - Flink SQL should support the SQL standard (such as SQL:2003) with extensions to support Hive's syntax and language features, around DDL, DML, and SELECT queries.
>> 7. SQL CLI - This is currently being developed in Flink, but more effort is needed.
>> 8. Server - Provide a server that's compatible with Hive's HiveServer2 in its Thrift APIs, such that HiveServer2 users can reuse their existing clients (such as Beeline) but connect to Flink's Thrift server instead.
>> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for other applications to use to connect to its Thrift server.
>> 10. Support other user customizations in Hive, such as Hive SerDes, storage handlers, etc.
>> 11. Better task failure tolerance and task scheduling in the Flink runtime.
>>
>> As you can see, achieving all of this requires significant effort across all layers of Flink. However, a short-term goal could include only core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope (such as #3, #6).
>>
>> Please share your further thoughts. If we generally agree that this is the right direction, I could come up with a formal proposal quickly and then we can follow up with broader discussions.
>>
>> Thanks,
>> Xuefu
>>
>>
>>
>> ------------------------------------------------------------------
>> Sender:vino yang <ya...@gmail.com>
>> Sent at:2018 Oct 11 (Thu) 09:45
>> Recipient:Fabian Hueske <fh...@gmail.com>
>> Cc:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>; user <us...@flink.apache.org>
>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>
>> Hi Xuefu,
>>
>> Appreciate this proposal, and like Fabian, it would look better if you can give more details of the plan.
>>
>> Thanks, vino.
>>
>> Fabian Hueske <fh...@gmail.com> 于2018年10月10日周三 下午5:27写道:
>> Hi Xuefu,
>>
>> Welcome to the Flink community and thanks for starting this discussion! Better Hive integration would be really great!
>> Can you go into details of what you are proposing? I can think of a couple ways to improve Flink in that regard:
>>
>> * Support for Hive UDFs
>> * Support for Hive metadata catalog
>> * Support for HiveQL syntax
>> * ???
>>
>> Best, Fabian
>>
>> Am Di., 9. Okt. 2018 um 19:22 Uhr schrieb Zhang, Xuefu <xu...@alibaba-inc.com>:
>> Hi all,
>>
>> Along with the community's effort, inside Alibaba we have explored Flink's potential as an execution engine not just for stream processing but also for batch processing. We are encouraged by our findings and have initiated an effort to make Flink's SQL capabilities full-fledged. When comparing what's available in Flink to the offerings from competing data processing engines, we identified a major gap in Flink: good integration with the Hive ecosystem. This is crucial to the success of Flink SQL and batch due to the well-established data ecosystem around Hive. Therefore, we have done some initial work in this direction, but a lot of effort is still needed.
>>
>> We have two strategies in mind. The first is to make Flink SQL full-fledged and well integrated with the Hive ecosystem. This is similar to the approach that Spark SQL adopted. The second strategy is to make Hive itself work with Flink, similar to the proposal in [1]. Each approach has its pros and cons, but they don't need to be mutually exclusive, with each targeting different users and use cases. We believe that both will promote a much greater adoption of Flink beyond stream processing.
>>
>> We have been focusing on the first approach and would like to showcase Flink's batch and SQL capabilities through Flink SQL. However, we also plan to start strategy #2 as a follow-up effort.
>>
>> I'm completely new to Flink (a short bio is included below [2]), though many of my colleagues here at Alibaba are long-time contributors. Nevertheless, I'd like to share our thoughts and invite your early feedback. At the same time, I am working on a detailed proposal on Flink SQL's integration with the Hive ecosystem, which will also be shared when ready.
>>
>> While the ideas are simple, each approach will demand significant effort, more than what we can afford on our own. Thus, input and contributions from the community are greatly welcome and appreciated.
>>
>> Regards,
>>
>>
>> Xuefu
>>
>> References:
>>
>> [1] https://issues.apache.org/jira/browse/HIVE-10712
>> [2] Xuefu Zhang is a long-time open source veteran who has worked on many projects under the Apache Foundation, of which he is also an honored member. About 10 years ago he worked on the Hadoop team at Yahoo when the projects were just getting started. Later he worked at Cloudera, initiating and leading the development of the Hive on Spark project across the community and many organizations. Prior to joining Alibaba, he worked at Uber, where he drove the adoption of Hive on Spark for all of Uber's SQL-on-Hadoop workload and significantly improved Uber's cluster efficiency.
>>
>>


Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by Timo Walther <tw...@apache.org>.
Hi Xuefu,

thanks for your proposal, it is a nice summary. Here are my thoughts to 
your list:

1. I think this is also on our current mid-term roadmap. Flink lacks a 
poper catalog support for a very long time. Before we can connect 
catalogs we need to define how to map all the information from a catalog 
to Flink's representation. This is why the work on the unified connector 
API [1] is going on for quite some time as it is the first approach to 
discuss and represent the pure characteristics of connectors.
2. It would be helpful to figure out what is missing in [1] to to ensure 
this point. I guess we will need a new design document just for a proper 
Hive catalog integration.
3. This is already work in progress. ORC has been merged, Parquet is on 
its way [1].
4. This should be easy. There was a PR in past that I reviewed but was 
not maintained anymore.
5. The type system of Flink SQL is very flexible. Only UNION type is 
missing.
6. A Flink SQL DDL is on the roadmap soon once we are done with [1]. 
Support for Hive syntax also needs cooperation with Apache Calcite.
7-11. Long-term goals.

I would also propose to start with a smaller scope where also current 
Flink SQL users can profit: 1, 2, 5, 3. This would allow to grow the 
Flink SQL ecosystem. After that we can aim to be fully compatible 
including syntax and UDFs (4, 6 etc.). Once the core is ready, we can 
work on the tooling (7, 8, 9) and performance (10, 11).

@Jörn: Yes, we should not have a tight dependency on Hive. It should be 
treated as one "connector" system out of many.

Thanks,
Timo

[1] 
https://docs.google.com/document/d/1Yaxp1UJUFW-peGLt8EIidwKIZEWrrA-pznWLuvaH39Y/edit?ts=5bb62df4#
[2] https://github.com/apache/flink/pull/6483

Am 11.10.18 um 07:54 schrieb Jörn Franke:
> Would it maybe make sense to provide Flink as an engine on Hive („flink-on-Hive“)? Eg to address 4,5,6,8,9,10. this could be more loosely coupled than integrating hive in all possible flink core modules and thus introducing a very tight dependency to Hive in the core.
> 1,2,3 could be achieved via a connector based on the Flink Table API.
> Just as a proposal to start this Endeavour as independent projects (hive engine, connector) to avoid too tight coupling with Flink. Maybe in a more distant future if the Hive integration is heavily demanded one could then integrate it more tightly if needed.
>
> What is meant by 11?
>> Am 11.10.2018 um 05:01 schrieb Zhang, Xuefu <xu...@alibaba-inc.com>:
>>
>> Hi Fabian/Vno,
>>
>> Thank you very much for your encouragement inquiry. Sorry that I didn't see Fabian's email until I read Vino's response just now. (Somehow Fabian's went to the spam folder.)
>>
>> My proposal contains long-term and short-terms goals. Nevertheless, the effort will focus on the following areas, including Fabian's list:
>>
>> 1. Hive metastore connectivity - This covers both read/write access, which means Flink can make full use of Hive's metastore as its catalog (at least for the batch but can extend for streaming as well).
>> 2. Metadata compatibility - Objects (databases, tables, partitions, etc) created by Hive can be understood by Flink and the reverse direction is true also.
>> 3. Data compatibility - Similar to #2, data produced by Hive can be consumed by Flink and vise versa.
>> 4. Support Hive UDFs - For all Hive's native udfs, Flink either provides its own implementation or make Hive's implementation work in Flink. Further, for user created UDFs in Hive, Flink SQL should provide a mechanism allowing user to import them into Flink without any code change required.
>> 5. Data types -  Flink SQL should support all data types that are available in Hive.
>> 6. SQL Language - Flink SQL should support SQL standard (such as SQL2003) with extension to support Hive's syntax and language features, around DDL, DML, and SELECT queries.
>> 7.  SQL CLI - this is currently developing in Flink but more effort is needed.
>> 8. Server - provide a server that's compatible with Hive's HiverServer2 in thrift APIs, such that HiveServer2 users can reuse their existing client (such as beeline) but connect to Flink's thrift server instead.
>> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for other application to use to connect to its thrift server
>> 10. Support other user's customizations in Hive, such as Hive Serdes, storage handlers, etc.
>> 11. Better task failure tolerance and task scheduling at Flink runtime.
>>
>> As you can see, achieving all those requires significant effort and across all layers in Flink. However, a short-term goal could  include only core areas (such as 1, 2, 4, 5, 6, 7) or start  at a smaller scope (such as #3, #6).
>>
>> Please share your further thoughts. If we generally agree that this is the right direction, I could come up with a formal proposal quickly and then we can follow up with broader discussions.
>>
>> Thanks,
>> Xuefu
>>
>>
>>
>> ------------------------------------------------------------------
>> Sender:vino yang <ya...@gmail.com>
>> Sent at:2018 Oct 11 (Thu) 09:45
>> Recipient:Fabian Hueske <fh...@gmail.com>
>> Cc:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>; user <us...@flink.apache.org>
>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>
>> Hi Xuefu,
>>
>> Appreciate this proposal, and like Fabian, it would look better if you can give more details of the plan.
>>
>> Thanks, vino.
>>
>> Fabian Hueske <fh...@gmail.com> 于2018年10月10日周三 下午5:27写道:
>> Hi Xuefu,
>>
>> Welcome to the Flink community and thanks for starting this discussion! Better Hive integration would be really great!
>> Can you go into details of what you are proposing? I can think of a couple ways to improve Flink in that regard:
>>
>> * Support for Hive UDFs
>> * Support for Hive metadata catalog
>> * Support for HiveQL syntax
>> * ???
>>
>> Best, Fabian
>>
>> Am Di., 9. Okt. 2018 um 19:22 Uhr schrieb Zhang, Xuefu <xu...@alibaba-inc.com>:
>> Hi all,
>>
>> Along with the community's effort, inside Alibaba we have explored Flink's potential as an execution engine not just for stream processing but also for batch processing. We are encouraged by our findings and have initiated an effort to make Flink's SQL capabilities full-fledged. When comparing what's available in Flink to the offerings from competing data processing engines, we identified a major gap in Flink: good integration with the Hive ecosystem. This is crucial to the success of Flink SQL and batch processing due to the well-established data ecosystem around Hive. Therefore, we have done some initial work along this direction, but a lot of effort is still needed.
>>
>> We have two strategies in mind. The first one is to make Flink SQL full-fledged and well integrated with the Hive ecosystem. This is similar to the approach that Spark SQL adopted. The second strategy is to make Hive itself work with Flink, similar to the proposal in [1]. Each approach bears its pros and cons, but they don't need to be mutually exclusive, with each targeting different users and use cases. We believe that both will promote a much greater adoption of Flink beyond stream processing.
>>
>> We have been focused on the first approach and would like to showcase Flink's batch and SQL capabilities with Flink SQL. However, we have also planned to start strategy #2 as a follow-up effort.
>>
>> I'm completely new to Flink (a short bio [2] is below), though many of my colleagues here at Alibaba are long-time contributors. Nevertheless, I'd like to share our thoughts and invite your early feedback. At the same time, I am working on a detailed proposal for Flink SQL's integration with the Hive ecosystem, which will also be shared when ready.
>>
>> While the ideas are simple, each approach will demand significant effort, more than what we can afford. Thus, input and contributions from the communities are greatly welcome and appreciated.
>>
>> Regards,
>>
>>
>> Xuefu
>>
>> References:
>>
>> [1] https://issues.apache.org/jira/browse/HIVE-10712
>> [2] Xuefu Zhang is a long-time open source veteran who has worked or is working on many projects under the Apache Foundation, of which he is also an honored member. About 10 years ago he worked in the Hadoop team at Yahoo, when those projects were just getting started. Later he worked at Cloudera, initiating and leading the development of the Hive on Spark project in the communities and across many organizations. Prior to joining Alibaba, he worked at Uber, where he rolled out Hive on Spark to all of Uber's SQL-on-Hadoop workloads and significantly improved Uber's cluster efficiency.
>>
>>


Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by "Zhang, Xuefu" <xu...@alibaba-inc.com>.
Hi Bowen,

Thank you for your feedback and interest in the project. Your contribution is certainly welcome. Per your suggestion, I have created an umbrella JIRA (https://issues.apache.org/jira/browse/FLINK-10556) to track our overall effort on this. For each subtask, we'd like to see a short description of the status quo and what we plan to add or change. A design doc should be provided when it's deemed necessary.

I'm looking forward to seeing your contributions!

Thanks,
Xuefu





------------------------------------------------------------------
Sender:Bowen <bo...@gmail.com>
Sent at:2018 Oct 13 (Sat) 21:55
Recipient:Xuefu <xu...@alibaba-inc.com>; Fabian Hueske <fh...@gmail.com>
Cc:dev <de...@flink.apache.org>; user <us...@flink.apache.org>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem


Thank you Xuefu, for bringing up this awesome, detailed proposal! It will resolve lots of existing pain for users like me.

In general, I totally agree that improving Flink SQL's completeness would be a much better starting point than building 'Hive on Flink', as the Hive community is concerned about Flink SQL's incompleteness and lack of proven batch performance, as shown in https://issues.apache.org/jira/browse/HIVE-10712. Improving Flink SQL seems a more natural direction to start with in order to achieve the integration.

Xuefu and Timo have laid out a quite clear path of what to tackle next. Given that there are already some efforts going on for items 1, 2, 5, 3, 4, 6 in Xuefu's list, shall we:

* identify gaps between a) Xuefu's proposal/discussion results in this thread and b) all the ongoing work/discussions?
* then, create some new top-level JIRA tickets to keep track of them and start more detailed discussions?

It's going to be a great and influential project, and I'd love to participate in it to move Flink SQL's adoption and ecosystem even further.

Thanks,
Bowen


On Oct 12, 2018, at 3:37 PM, Jörn Franke <jo...@gmail.com> wrote:


Thank you, very nice. I fully agree with that.

On Oct 11, 2018, at 19:31, Zhang, Xuefu <xu...@alibaba-inc.com> wrote:

Hi Jörn,

Thanks for your feedback. Yes, I think Hive on Flink makes sense, and in fact it is one of the two approaches that I named at the beginning of the thread. As also pointed out there, this isn't mutually exclusive with the work we proposed inside Flink; the two target different user groups and use cases. Further, what we proposed to do in Flink should be a good showcase that demonstrates Flink's capabilities in batch processing and convinces the Hive community of the worth of a new engine. As you might know, the idea encountered some doubt and resistance. Nevertheless, we do have a solid plan for Hive on Flink, which we will execute once Flink SQL is in good shape.

I also agree with you that Flink SQL shouldn't be closely coupled with Hive. While we mentioned Hive in many of the proposed items, most of them are coupled only in concepts and functionality rather than in code or libraries. We are taking advantage of the connector framework in Flink. The one possible exception is support for Hive's built-in UDFs, which we may not make work out of the box, in order to avoid the coupling. We could, for example, require users to bring in the Hive library and register the functions themselves. This is subject to further discussion.
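As a purely illustrative sketch of that registration idea (hypothetical names throughout — this is not Flink's or Hive's actual API), a runtime function registry lets the engine call user-supplied UDF implementations without any compile-time dependency on Hive:

```python
# Hypothetical sketch: models registering externally provided UDFs at
# runtime so the SQL engine has no compile-time dependency on Hive.
class FunctionCatalog:
    """Registry mapping SQL function names to user-supplied callables."""

    def __init__(self):
        self._functions = {}

    def register(self, name, impl):
        # Users bring their own implementation (e.g. a wrapper around a
        # Hive UDF class loaded from a jar they supply) and register it.
        self._functions[name.lower()] = impl

    def lookup(self, name):
        # SQL function names are treated as case-insensitive here.
        try:
            return self._functions[name.lower()]
        except KeyError:
            raise LookupError(f"Unknown function: {name}")

catalog = FunctionCatalog()
catalog.register("upper_trim", lambda s: s.strip().upper())
print(catalog.lookup("UPPER_TRIM")("  flink  "))  # FLINK
```

The engine only ever sees the registry interface; whether a callable wraps a Hive class or a native implementation is invisible to it, which is exactly what keeps the coupling out of the core.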

#11 is about Flink runtime enhancements meant to make task failures more tolerable (so that the job doesn't have to start from the beginning in case of a task failure) and to make task scheduling more resource-efficient. Flink's current design in those two aspects leans more toward stream processing, which may not be good enough for batch processing. We will provide a more detailed design when we get to it.
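To make the intent behind #11 concrete, the following self-contained Python sketch (made-up names; not Flink's runtime) shows per-task retry with cached upstream results, so a transient failure in one task does not force the whole job to restart:

```python
class FlakyOnce:
    """Callable that fails on its first invocation, then succeeds."""
    def __init__(self, fn):
        self.fn = fn
        self.calls = 0
    def __call__(self, x):
        self.calls += 1
        if self.calls == 1:
            raise RuntimeError("transient task failure")
        return self.fn(x)

def run_pipeline(tasks, source, max_retries=2):
    """Run tasks sequentially; on failure, retry only the failed task.

    Because each task's input is the cached result of its predecessor,
    upstream tasks are never recomputed after a downstream failure.
    """
    attempts = {name: 0 for name, _ in tasks}
    value = source
    for name, fn in tasks:
        for attempt in range(max_retries + 1):
            attempts[name] += 1
            try:
                value = fn(value)
                break
            except RuntimeError:
                if attempt == max_retries:
                    raise
    return value, attempts

flaky = FlakyOnce(lambda x: x * 2)
tasks = [("read", lambda x: x + 1), ("transform", flaky), ("write", lambda x: x - 3)]
result, attempts = run_pipeline(tasks, 10)
print(result)    # 19
print(attempts)  # {'read': 1, 'transform': 2, 'write': 1}
```

A real engine would persist intermediate results across processes rather than hold them in one function's local variable, but the resource argument is the same: recompute only what failed.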

Please let me know if you have further thoughts or feedback.

Thanks,
Xuefu


------------------------------------------------------------------
Sender:Jörn Franke <jo...@gmail.com>
Sent at:2018 Oct 11 (Thu) 13:54
Recipient:Xuefu <xu...@alibaba-inc.com>
Cc:vino yang <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>; dev <de...@flink.apache.org>; user <us...@flink.apache.org>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Would it maybe make sense to provide Flink as an engine on Hive („flink-on-Hive“)? E.g., to address 4, 5, 6, 8, 9, 10. This could be more loosely coupled than integrating Hive into all possible Flink core modules and thus introducing a very tight dependency on Hive in the core.
1, 2, 3 could be achieved via a connector based on the Flink Table API.
This is just a proposal to start this endeavor as independent projects (Hive engine, connector) to avoid too-tight coupling with Flink. Maybe, in a more distant future, if the Hive integration is heavily demanded, one could integrate it more tightly.

What is meant by 11?
On Oct 11, 2018, at 05:01, Zhang, Xuefu <xu...@alibaba-inc.com> wrote:

Hi Fabian/Vino,

Thank you very much for your encouragement and inquiry. Sorry that I didn't see Fabian's email until I read Vino's response just now. (Somehow Fabian's went to the spam folder.)

My proposal contains both long-term and short-term goals. Nevertheless, the effort will focus on the following areas, including Fabian's list:

1. Hive metastore connectivity - This covers both read and write access, which means Flink can make full use of Hive's metastore as its catalog (at least for batch, though this can extend to streaming as well).
2. Metadata compatibility - Objects (databases, tables, partitions, etc.) created by Hive can be understood by Flink, and vice versa.
3. Data compatibility - Similar to #2, data produced by Hive can be consumed by Flink, and vice versa.
4. Support for Hive UDFs - For all of Hive's native UDFs, Flink either provides its own implementation or makes Hive's implementation work in Flink. Further, for user-created UDFs in Hive, Flink SQL should provide a mechanism allowing users to import them into Flink without any code change.
5. Data types - Flink SQL should support all data types that are available in Hive.
6. SQL language - Flink SQL should support the SQL standard (such as SQL:2003) with extensions to support Hive's syntax and language features, around DDL, DML, and SELECT queries.
7. SQL CLI - This is currently under development in Flink, but more effort is needed.
8. Server - Provide a server that's compatible with Hive's HiveServer2 in its Thrift APIs, such that HiveServer2 users can reuse their existing clients (such as Beeline) but connect to Flink's Thrift server instead.
9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for other applications to connect to its Thrift server.
10. Support for other user customizations in Hive, such as Hive SerDes, storage handlers, etc.
11. Better task failure tolerance and task scheduling at Flink runtime.

As you can see, achieving all of those requires significant effort across all layers in Flink. However, a short-term goal could include only the core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope (such as #3, #6).
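As an illustration of what items 1 and 2 amount to — a shared catalog of database/table metadata that both engines can read and write — here is a hypothetical in-memory sketch in Python (not the Hive metastore API or any Flink interface):

```python
# Hypothetical in-memory model only: it illustrates the kind of catalog
# surface items 1 and 2 call for, where both engines read and write the
# same database/table metadata.
class Catalog:
    def __init__(self):
        self.databases = {}          # db name -> {table name -> schema}

    def create_database(self, db):
        self.databases.setdefault(db, {})

    def create_table(self, db, table, schema):
        # Assumes the database exists; schema is column name -> SQL type.
        self.databases[db][table] = schema

    def get_table(self, db, table):
        return self.databases[db][table]

    def list_tables(self, db):
        return sorted(self.databases[db])

# A table created through one engine's session is visible to the other,
# because both talk to the same shared catalog.
meta = Catalog()
meta.create_database("sales")
meta.create_table("sales", "orders", {"id": "BIGINT", "amount": "DOUBLE"})
print(meta.list_tables("sales"))                    # ['orders']
print(meta.get_table("sales", "orders")["amount"])  # DOUBLE
```

The real proposal is about wiring Flink to Hive's existing metastore service rather than building a new store, but the contract sketched here — shared, engine-neutral metadata — is the point of items 1 and 2.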

Please share your further thoughts. If we generally agree that this is the right direction, I could come up with a formal proposal quickly and then we can follow up with broader discussions.

Thanks,
Xuefu



------------------------------------------------------------------
Sender:vino yang <ya...@gmail.com>
Sent at:2018 Oct 11 (Thu) 09:45
Recipient:Fabian Hueske <fh...@gmail.com>
Cc:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>; user <us...@flink.apache.org>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Xuefu,

Appreciate this proposal, and like Fabian, I think it would be better if you could give more details of the plan.

Thanks, vino.
Fabian Hueske <fh...@gmail.com> wrote on Wed, Oct 10, 2018 at 5:27 PM:
Hi Xuefu,

Welcome to the Flink community and thanks for starting this discussion! Better Hive integration would be really great!
Can you go into details of what you are proposing? I can think of a couple ways to improve Flink in that regard:

* Support for Hive UDFs
* Support for Hive metadata catalog
* Support for HiveQL syntax
* ???

Best, Fabian

On Tue, Oct 9, 2018 at 19:22, Zhang, Xuefu <xu...@alibaba-inc.com> wrote:
Hi all,

Along with the community's effort, inside Alibaba we have explored Flink's potential as an execution engine not just for stream processing but also for batch processing. We are encouraged by our findings and have initiated an effort to make Flink's SQL capabilities full-fledged. When comparing what's available in Flink to the offerings from competing data processing engines, we identified a major gap in Flink: good integration with the Hive ecosystem. This is crucial to the success of Flink SQL and batch processing due to the well-established data ecosystem around Hive. Therefore, we have done some initial work along this direction, but a lot of effort is still needed.

We have two strategies in mind. The first one is to make Flink SQL full-fledged and well integrated with the Hive ecosystem. This is similar to the approach that Spark SQL adopted. The second strategy is to make Hive itself work with Flink, similar to the proposal in [1]. Each approach bears its pros and cons, but they don't need to be mutually exclusive, with each targeting different users and use cases. We believe that both will promote a much greater adoption of Flink beyond stream processing.

We have been focused on the first approach and would like to showcase Flink's batch and SQL capabilities with Flink SQL. However, we have also planned to start strategy #2 as a follow-up effort.

I'm completely new to Flink (a short bio [2] is below), though many of my colleagues here at Alibaba are long-time contributors. Nevertheless, I'd like to share our thoughts and invite your early feedback. At the same time, I am working on a detailed proposal for Flink SQL's integration with the Hive ecosystem, which will also be shared when ready.

While the ideas are simple, each approach will demand significant effort, more than what we can afford. Thus, input and contributions from the communities are greatly welcome and appreciated.

 Regards,


 Xuefu

 References:

 [1] https://issues.apache.org/jira/browse/HIVE-10712
[2] Xuefu Zhang is a long-time open source veteran who has worked or is working on many projects under the Apache Foundation, of which he is also an honored member. About 10 years ago he worked in the Hadoop team at Yahoo, when those projects were just getting started. Later he worked at Cloudera, initiating and leading the development of the Hive on Spark project in the communities and across many organizations. Prior to joining Alibaba, he worked at Uber, where he rolled out Hive on Spark to all of Uber's SQL-on-Hadoop workloads and significantly improved Uber's cluster efficiency.





Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by Bowen <bo...@gmail.com>.
Thank you Xuefu, for bringing up this awesome, detailed proposal! It will resolve lots of existing pain for users like me.

In general, I totally agree that improving Flink SQL's completeness would be a much better starting point than building 'Hive on Flink', as the Hive community is concerned about Flink SQL's incompleteness and lack of proven batch performance, as shown in https://issues.apache.org/jira/browse/HIVE-10712. Improving Flink SQL seems a more natural direction to start with in order to achieve the integration.

Xuefu and Timo have laid out a quite clear path of what to tackle next. Given that there are already some efforts going on for items 1, 2, 5, 3, 4, 6 in Xuefu's list, shall we:

* identify gaps between a) Xuefu's proposal/discussion results in this thread and b) all the ongoing work/discussions?
* then, create some new top-level JIRA tickets to keep track of them and start more detailed discussions?

It's going to be a great and influential project, and I'd love to participate in it to move Flink SQL's adoption and ecosystem even further.

Thanks,
Bowen


> On Oct 12, 2018, at 3:37 PM, Jörn Franke <jo...@gmail.com> wrote:
> 
> Thank you, very nice. I fully agree with that.
> 
>> On Oct 11, 2018, at 19:31, Zhang, Xuefu <xu...@alibaba-inc.com> wrote:
>> 
>> Hi Jörn,
>> 
>> Thanks for your feedback. Yes, I think Hive on Flink makes sense, and in fact it is one of the two approaches that I named at the beginning of the thread. As also pointed out there, this isn't mutually exclusive with the work we proposed inside Flink; the two target different user groups and use cases. Further, what we proposed to do in Flink should be a good showcase that demonstrates Flink's capabilities in batch processing and convinces the Hive community of the worth of a new engine. As you might know, the idea encountered some doubt and resistance. Nevertheless, we do have a solid plan for Hive on Flink, which we will execute once Flink SQL is in good shape.
>> 
>> I also agree with you that Flink SQL shouldn't be closely coupled with Hive. While we mentioned Hive in many of the proposed items, most of them are coupled only in concepts and functionality rather than in code or libraries. We are taking advantage of the connector framework in Flink. The one possible exception is support for Hive's built-in UDFs, which we may not make work out of the box, in order to avoid the coupling. We could, for example, require users to bring in the Hive library and register the functions themselves. This is subject to further discussion.
>> 
>> #11 is about Flink runtime enhancements meant to make task failures more tolerable (so that the job doesn't have to start from the beginning in case of a task failure) and to make task scheduling more resource-efficient. Flink's current design in those two aspects leans more toward stream processing, which may not be good enough for batch processing. We will provide a more detailed design when we get to it.
>> 
>> Please let me know if you have further thoughts or feedback.
>> 
>> Thanks,
>> Xuefu
>> 
>> 
>> ------------------------------------------------------------------
>> Sender:Jörn Franke <jo...@gmail.com>
>> Sent at:2018 Oct 11 (Thu) 13:54
>> Recipient:Xuefu <xu...@alibaba-inc.com>
>> Cc:vino yang <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>; dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>> 
>> Would it maybe make sense to provide Flink as an engine on Hive („flink-on-Hive“)? E.g., to address 4, 5, 6, 8, 9, 10. This could be more loosely coupled than integrating Hive into all possible Flink core modules and thus introducing a very tight dependency on Hive in the core.
>> 1, 2, 3 could be achieved via a connector based on the Flink Table API.
>> This is just a proposal to start this endeavor as independent projects (Hive engine, connector) to avoid too-tight coupling with Flink. Maybe, in a more distant future, if the Hive integration is heavily demanded, one could integrate it more tightly.
>> 
>> What is meant by 11?
>> On Oct 11, 2018, at 05:01, Zhang, Xuefu <xu...@alibaba-inc.com> wrote:
>> 
>> Hi Fabian/Vino,
>> 
>> Thank you very much for your encouragement and inquiry. Sorry that I didn't see Fabian's email until I read Vino's response just now. (Somehow Fabian's went to the spam folder.)
>> 
>> My proposal contains both long-term and short-term goals. Nevertheless, the effort will focus on the following areas, including Fabian's list:
>> 
>> 1. Hive metastore connectivity - This covers both read and write access, which means Flink can make full use of Hive's metastore as its catalog (at least for batch, though this can extend to streaming as well).
>> 2. Metadata compatibility - Objects (databases, tables, partitions, etc.) created by Hive can be understood by Flink, and vice versa.
>> 3. Data compatibility - Similar to #2, data produced by Hive can be consumed by Flink, and vice versa.
>> 4. Support for Hive UDFs - For all of Hive's native UDFs, Flink either provides its own implementation or makes Hive's implementation work in Flink. Further, for user-created UDFs in Hive, Flink SQL should provide a mechanism allowing users to import them into Flink without any code change.
>> 5. Data types - Flink SQL should support all data types that are available in Hive.
>> 6. SQL language - Flink SQL should support the SQL standard (such as SQL:2003) with extensions to support Hive's syntax and language features, around DDL, DML, and SELECT queries.
>> 7. SQL CLI - This is currently under development in Flink, but more effort is needed.
>> 8. Server - Provide a server that's compatible with Hive's HiveServer2 in its Thrift APIs, such that HiveServer2 users can reuse their existing clients (such as Beeline) but connect to Flink's Thrift server instead.
>> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for other applications to connect to its Thrift server.
>> 10. Support for other user customizations in Hive, such as Hive SerDes, storage handlers, etc.
>> 11. Better task failure tolerance and task scheduling at Flink runtime.
>> 
>> As you can see, achieving all of those requires significant effort across all layers in Flink. However, a short-term goal could include only the core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope (such as #3, #6).
>> 
>> Please share your further thoughts. If we generally agree that this is the right direction, I could come up with a formal proposal quickly and then we can follow up with broader discussions.
>> 
>> Thanks,
>> Xuefu
>> 
>> 
>> 
>> ------------------------------------------------------------------
>> Sender:vino yang <ya...@gmail.com>
>> Sent at:2018 Oct 11 (Thu) 09:45
>> Recipient:Fabian Hueske <fh...@gmail.com>
>> Cc:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>; user <us...@flink.apache.org>
>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>> 
>> Hi Xuefu,
>> 
>> Appreciate this proposal, and like Fabian, I think it would be better if you could give more details of the plan.
>> 
>> Thanks, vino.
>> 
>> Fabian Hueske <fh...@gmail.com> wrote on Wed, Oct 10, 2018 at 5:27 PM:
>> Hi Xuefu,
>> 
>> Welcome to the Flink community and thanks for starting this discussion! Better Hive integration would be really great!
>> Can you go into details of what you are proposing? I can think of a couple ways to improve Flink in that regard:
>> 
>> * Support for Hive UDFs
>> * Support for Hive metadata catalog
>> * Support for HiveQL syntax
>> * ???
>> 
>> Best, Fabian
>> 
>> On Tue, Oct 9, 2018 at 19:22, Zhang, Xuefu <xu...@alibaba-inc.com> wrote:
>> Hi all,
>> 
>> Along with the community's effort, inside Alibaba we have explored Flink's potential as an execution engine not just for stream processing but also for batch processing. We are encouraged by our findings and have initiated an effort to make Flink's SQL capabilities full-fledged. When comparing what's available in Flink to the offerings from competing data processing engines, we identified a major gap in Flink: good integration with the Hive ecosystem. This is crucial to the success of Flink SQL and batch processing due to the well-established data ecosystem around Hive. Therefore, we have done some initial work along this direction, but a lot of effort is still needed.
>> 
>> We have two strategies in mind. The first one is to make Flink SQL full-fledged and well integrated with the Hive ecosystem. This is similar to the approach that Spark SQL adopted. The second strategy is to make Hive itself work with Flink, similar to the proposal in [1]. Each approach bears its pros and cons, but they don't need to be mutually exclusive, with each targeting different users and use cases. We believe that both will promote a much greater adoption of Flink beyond stream processing.
>> 
>> We have been focused on the first approach and would like to showcase Flink's batch and SQL capabilities with Flink SQL. However, we have also planned to start strategy #2 as a follow-up effort.
>> 
>> I'm completely new to Flink (a short bio [2] is below), though many of my colleagues here at Alibaba are long-time contributors. Nevertheless, I'd like to share our thoughts and invite your early feedback. At the same time, I am working on a detailed proposal for Flink SQL's integration with the Hive ecosystem, which will also be shared when ready.
>> 
>> While the ideas are simple, each approach will demand significant effort, more than what we can afford. Thus, input and contributions from the communities are greatly welcome and appreciated.
>> 
>> Regards,
>> 
>> 
>> Xuefu
>> 
>> References:
>> 
>> [1] https://issues.apache.org/jira/browse/HIVE-10712
>> [2] Xuefu Zhang is a long-time open source veteran, worked or working on many projects under Apache Foundation, of which he is also an honored member. About 10 years ago he worked in the Hadoop team at Yahoo where the projects just got started. Later he worked at Cloudera, initiating and leading the development of Hive on Spark project in the communities and across many organizations. Prior to joining Alibaba, he worked at Uber where he promoted Hive on Spark to all Uber's SQL on Hadoop workload and significantly improved Uber's cluster efficiency.
>> 
>> 

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by Bowen <bo...@gmail.com>.
Thank you, Xuefu, for bringing up this awesome, detailed proposal! It will resolve a lot of existing pain for users like me.

In general, I totally agree that improving FlinkSQL's completeness would be a much better starting point than building 'Hive on Flink', as the Hive community is concerned about Flink's SQL incompleteness and lack of proven batch performance, as shown in https://issues.apache.org/jira/browse/HIVE-10712. Improving FlinkSQL seems a more natural direction to start with in order to achieve the integration.

Xuefu and Timo have laid out a quite clear path of what to tackle next. Given that there are already some efforts going on for items 1,2,5,3,4,6 in Xuefu's list, shall we:
* identify gaps between a) Xuefu's proposal/discussion result in this thread and b) all the ongoing work/discussions?
* then create some new top-level JIRA tickets to keep track of them and start more detailed discussions with?
It's going to be a great and influential project, and I'd love to participate in it to move FlinkSQL's adoption and ecosystem even further.

Thanks,
Bowen


> On Oct 12, 2018, at 3:37 PM, Jörn Franke <jo...@gmail.com> wrote:
> 
> Thank you, very nice. I fully agree with that.
> 

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by Jörn Franke <jo...@gmail.com>.
Thank you, very nice. I fully agree with that.

> On 11.10.2018, at 19:31, Zhang, Xuefu <xu...@alibaba-inc.com> wrote:
> 
> Hi Jörn,
> 
> Thanks for your feedback. Yes, I think Hive on Flink makes sense; in fact, it is one of the two approaches that I named at the beginning of the thread. As also pointed out there, it isn't mutually exclusive from the work we proposed inside Flink, and the two target different user groups and use cases. Further, what we proposed to do in Flink should be a good showcase that demonstrates Flink's capabilities in batch processing and convinces the Hive community of the worth of a new engine. As you might know, the idea encountered some doubt and resistance. Nevertheless, we do have a solid plan for Hive on Flink, which we will execute once Flink SQL is in good shape.
> 
> I also agree with you that Flink SQL shouldn't be closely coupled with Hive. While we mentioned Hive in many of the proposed items, most of them are coupled only in concepts and functionality rather than in code or libraries. We are taking advantage of the connector framework in Flink. The only possible exception is support for Hive's built-in UDFs, which we may not make work out of the box, in order to avoid the coupling. We could, for example, require users to bring in the Hive library and register the functions themselves. This is subject to further discussion.
> 
> #11 is about Flink runtime enhancements that are meant to make task failures more tolerable (so that the job doesn't have to restart from the beginning in case of a task failure) and to make task scheduling more resource-efficient. Flink's current design in those two aspects leans more toward stream processing, which may not be good enough for batch processing. We will provide a more detailed design when we get to them.
> 
> Please let me know if you have further thoughts or feedback.
> 
> Thanks,
> Xuefu
> 
> 
> ------------------------------------------------------------------
> Sender:Jörn Franke <jo...@gmail.com>
> Sent at:2018 Oct 11 (Thu) 13:54
> Recipient:Xuefu <xu...@alibaba-inc.com>
> Cc:vino yang <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>; dev <de...@flink.apache.org>; user <us...@flink.apache.org>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> 
> Would it maybe make sense to provide Flink as an engine on Hive („flink-on-Hive“)? E.g., to address 4, 5, 6, 8, 9, and 10, this could be more loosely coupled than integrating Hive into all possible Flink core modules and thus introducing a very tight dependency on Hive in the core.
> 1, 2, and 3 could be achieved via a connector based on the Flink Table API.
> Just as a proposal: start this endeavor as independent projects (Hive engine, connector) to avoid too tight a coupling with Flink. Maybe, in a more distant future, if the Hive integration is heavily demanded, one could integrate it more tightly if needed.
> 
> What is meant by 11?
> On 11.10.2018, at 05:01, Zhang, Xuefu <xu...@alibaba-inc.com> wrote:
> 
> Hi Fabian/Vino,
> 
> Thank you very much for your encouragement and inquiry. Sorry that I didn't see Fabian's email until I read Vino's response just now. (Somehow Fabian's went to the spam folder.)
> 
> My proposal contains long-term and short-term goals. Nevertheless, the effort will focus on the following areas, including Fabian's list:
> 
> 1. Hive metastore connectivity - This covers both read and write access, which means Flink can make full use of Hive's metastore as its catalog (at least for batch, but this can be extended to streaming as well).
> 2. Metadata compatibility - Objects (databases, tables, partitions, etc.) created by Hive can be understood by Flink, and vice versa.
> 3. Data compatibility - Similar to #2, data produced by Hive can be consumed by Flink, and vice versa.
> 4. Support for Hive UDFs - For all of Hive's native UDFs, Flink either provides its own implementation or makes Hive's implementation work in Flink. Further, for user-created UDFs in Hive, Flink SQL should provide a mechanism allowing users to import them into Flink without any code change required.
> 5. Data types - Flink SQL should support all data types that are available in Hive.
> 6. SQL language - Flink SQL should support the SQL standard (such as SQL:2003) with extensions to support Hive's syntax and language features, around DDL, DML, and SELECT queries.
> 7. SQL CLI - this is currently under development in Flink, but more effort is needed.
> 8. Server - provide a server that's compatible with Hive's HiveServer2 in its Thrift APIs, such that HiveServer2 users can reuse their existing clients (such as Beeline) but connect to Flink's Thrift server instead.
> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for other applications to use to connect to its Thrift server.
> 10. Support for other user customizations in Hive, such as Hive SerDes, storage handlers, etc.
> 11. Better task failure tolerance and task scheduling at Flink runtime.
> 
> As you can see, achieving all of this requires significant effort across all layers in Flink. However, a short-term goal could include only the core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope (such as #3, #6).
> 
> Please share your further thoughts. If we generally agree that this is the right direction, I could come up with a formal proposal quickly and then we can follow up with broader discussions.
> 
> Thanks,
> Xuefu
> 

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by "Zhang, Xuefu" <xu...@alibaba-inc.com>.
Hi Jörn,

Thanks for your feedback. Yes, I think Hive on Flink makes sense; in fact, it is one of the two approaches that I named at the beginning of the thread. As also pointed out there, it isn't mutually exclusive from the work we proposed inside Flink, and the two target different user groups and use cases. Further, what we proposed to do in Flink should be a good showcase that demonstrates Flink's capabilities in batch processing and convinces the Hive community of the worth of a new engine. As you might know, the idea encountered some doubt and resistance. Nevertheless, we do have a solid plan for Hive on Flink, which we will execute once Flink SQL is in good shape.

I also agree with you that Flink SQL shouldn't be closely coupled with Hive. While we mentioned Hive in many of the proposed items, most of them are coupled only in concepts and functionality rather than in code or libraries. We are taking advantage of the connector framework in Flink. The only possible exception is support for Hive's built-in UDFs, which we may not make work out of the box, in order to avoid the coupling. We could, for example, require users to bring in the Hive library and register the functions themselves. This is subject to further discussion.
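As a rough illustration of the "bring your own Hive library and register it yourself" idea, the sketch below wraps a UDF-style class (any class exposing an `evaluate` method, which is the shape of Hive's simple UDFs) behind reflection, so the host engine never compiles against Hive itself. Both `ReflectiveUdfWrapper` and `FakeHiveUpper` are hypothetical stand-ins for this discussion, not real Flink or Hive APIs:

```java
import java.lang.reflect.Method;

// Sketch: expose a Hive-style UDF through reflection so there is no
// compile-time dependency on the Hive jars. Users would supply the UDF
// class name (from their own jar) at registration time.
public class ReflectiveUdfWrapper {
    private final Object udfInstance;
    private final Method evaluate;

    public ReflectiveUdfWrapper(Class<?> udfClass, Class<?>... argTypes) throws Exception {
        // Hive's simple UDFs are plain classes with a no-arg constructor
        // and one or more `evaluate` methods.
        this.udfInstance = udfClass.getDeclaredConstructor().newInstance();
        this.evaluate = udfClass.getMethod("evaluate", argTypes);
    }

    public Object eval(Object... args) throws Exception {
        return evaluate.invoke(udfInstance, args);
    }

    // Stand-in for a user's Hive UDF (hypothetical; not a real Hive class).
    public static class FakeHiveUpper {
        public String evaluate(String s) {
            return s == null ? null : s.toUpperCase();
        }
    }

    public static void main(String[] args) throws Exception {
        ReflectiveUdfWrapper w =
            new ReflectiveUdfWrapper(FakeHiveUpper.class, String.class);
        System.out.println(w.eval("flink")); // prints FLINK
    }
}
```

A real integration would additionally have to map Hive's type system and handle `GenericUDF`s, but the load-and-reflect mechanism, with the user providing the Hive jar, would look similar.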

#11 is about Flink runtime enhancements that are meant to make task failures more tolerable (so that the job doesn't have to restart from the beginning in case of a task failure) and to make task scheduling more resource-efficient. Flink's current design in those two aspects leans more toward stream processing, which may not be good enough for batch processing. We will provide a more detailed design when we get to them.
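To make the intent of #11 concrete: one direction this work took in later Flink releases is region failover, i.e. restarting only the pipelined region containing the failed task instead of the whole job graph. The fragment below is illustrative only; this option did not exist at the time of this thread:

```yaml
# Illustrative flink-conf.yaml fragment (option from later Flink releases).
# Restart only the failover region containing the failed task,
# instead of restarting the entire job graph:
jobmanager.execution.failover-strategy: region
```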

Please let me know if you have further thoughts or feedback.

Thanks,
Xuefu



------------------------------------------------------------------
Sender:Jörn Franke <jo...@gmail.com>
Sent at:2018 Oct 11 (Thu) 13:54
Recipient:Xuefu <xu...@alibaba-inc.com>
Cc:vino yang <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>; dev <de...@flink.apache.org>; user <us...@flink.apache.org>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem


Would it maybe make sense to provide Flink as an engine on Hive („Flink-on-Hive“)? E.g., to address 4, 5, 6, 8, 9, and 10, this could be more loosely coupled than integrating Hive into all possible Flink core modules and thus introducing a very tight dependency on Hive in the core.
1, 2, and 3 could be achieved via a connector based on the Flink Table API.
This is just a proposal to start this endeavour as independent projects (Hive engine, connector) to avoid too tight a coupling with Flink. Maybe, in a more distant future, if the Hive integration is heavily demanded, one could then integrate it more tightly if needed.

What is meant by 11?
On Oct 11, 2018 at 05:01, Zhang, Xuefu <xu...@alibaba-inc.com> wrote:


Hi Fabian/Vino,

Thank you very much for your encouragement and inquiries. Sorry that I didn't see Fabian's email until I read Vino's response just now. (Somehow Fabian's went to the spam folder.)

My proposal contains long-term and short-term goals. Nevertheless, the effort will focus on the following areas, which include Fabian's list:

1. Hive metastore connectivity - This covers both read and write access, which means Flink can make full use of Hive's metastore as its catalog (at least for batch, though this can be extended to streaming as well).
2. Metadata compatibility - Objects (databases, tables, partitions, etc.) created by Hive can be understood by Flink, and the reverse is also true.
3. Data compatibility - Similar to #2, data produced by Hive can be consumed by Flink and vice versa.
4. Support for Hive UDFs - For all of Hive's native UDFs, Flink either provides its own implementation or makes Hive's implementation work in Flink. Further, for user-created UDFs in Hive, Flink SQL should provide a mechanism that allows users to import them into Flink without any code change required.
5. Data types - Flink SQL should support all data types that are available in Hive.
6. SQL language - Flink SQL should support the SQL standard (such as SQL:2003) with extensions to support Hive's syntax and language features, around DDL, DML, and SELECT queries.
7. SQL CLI - this is currently under development in Flink, but more effort is needed.
8. Server - provide a server that's compatible with Hive's HiveServer2 in its Thrift APIs, such that HiveServer2 users can reuse their existing clients (such as Beeline) but connect to Flink's Thrift server instead.
9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for other applications to connect to its Thrift server.
10. Support for other user customizations in Hive, such as Hive SerDes, storage handlers, etc.
11. Better task failure tolerance and task scheduling in the Flink runtime.
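As one purely illustrative sketch of how items 1 and 7 might surface to users, a SQL CLI environment file could declare a Hive metastore catalog along these lines; every key name below is an assumption, since the catalog interfaces were still being designed:

```yaml
# Hypothetical SQL CLI environment file fragment -- key names are
# illustrative only, not a finalized configuration format.
catalogs:
  - name: hive
    type: hive-metastore            # assumed catalog/connector type
    hive-conf-dir: /etc/hive/conf   # directory containing hive-site.xml
    default-database: default
```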

As you can see, achieving all of those requires significant effort across all layers of Flink. However, a short-term goal could include only the core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope (such as #3, #6).

Please share your further thoughts. If we generally agree that this is the right direction, I could come up with a formal proposal quickly and then we can follow up with broader discussions.

Thanks,
Xuefu



------------------------------------------------------------------
Sender:vino yang <ya...@gmail.com>
Sent at:2018 Oct 11 (Thu) 09:45
Recipient:Fabian Hueske <fh...@gmail.com>
Cc:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>; user <us...@flink.apache.org>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Xuefu,

I appreciate this proposal and, like Fabian, I think it would be better if you could give more details of the plan.

Thanks, vino.
Fabian Hueske <fh...@gmail.com> wrote on Wednesday, Oct 10, 2018 at 5:27 PM:
Hi Xuefu,

Welcome to the Flink community and thanks for starting this discussion! Better Hive integration would be really great!
Can you go into details of what you are proposing? I can think of a couple ways to improve Flink in that regard:

* Support for Hive UDFs
* Support for Hive metadata catalog
* Support for HiveQL syntax
* ???

Best, Fabian

On Tue., Oct 9, 2018 at 19:22, Zhang, Xuefu <xu...@alibaba-inc.com> wrote:
Hi all,

 Along with the community's effort, inside Alibaba we have explored Flink's potential as an execution engine not just for stream processing but also for batch processing. We are encouraged by our findings and have initiated an effort to make Flink's SQL capabilities full-fledged. When comparing what's available in Flink to the offerings from competing data processing engines, we identified a major gap in Flink: good integration with the Hive ecosystem. This is crucial to the success of Flink SQL and batch processing due to the well-established data ecosystem around Hive. Therefore, we have done some initial work in this direction, but a lot of effort is still needed.

 We have two strategies in mind. The first is to make Flink SQL full-fledged and well integrated with the Hive ecosystem. This is similar to the approach that Spark SQL adopted. The second strategy is to make Hive itself work with Flink, similar to the proposal in [1]. Each approach has its pros and cons, but they don't need to be mutually exclusive, with each targeting different users and use cases. We believe that both will promote a much greater adoption of Flink beyond stream processing.

 We have been focused on the first approach and would like to showcase Flink's batch and SQL capabilities with Flink SQL. However, we have also planned to start strategy #2 as the follow-up effort.

 I'm completely new to Flink (a short bio of mine is included in [2] below), though many of my colleagues here at Alibaba are long-time contributors. Nevertheless, I'd like to share our thoughts and invite your early feedback. At the same time, I am working on a detailed proposal on Flink SQL's integration with the Hive ecosystem, which will also be shared when ready.

 While the ideas are simple, each approach will demand significant effort, more than what we can afford. Thus, input and contributions from the community are greatly welcome and appreciated.

 Regards,


 Xuefu

 References:

 [1] https://issues.apache.org/jira/browse/HIVE-10712
 [2] Xuefu Zhang is a long-time open source veteran who has worked on many projects under the Apache Software Foundation, of which he is also an honored member. About 10 years ago he worked on the Hadoop team at Yahoo, when those projects had just got started. Later he worked at Cloudera, initiating and leading the development of the Hive on Spark project in the community and across many organizations. Prior to joining Alibaba, he worked at Uber, where he rolled out Hive on Spark to all of Uber's SQL-on-Hadoop workload and significantly improved Uber's cluster efficiency.



Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by Jörn Franke <jo...@gmail.com>.
Would it maybe make sense to provide Flink as an engine on Hive („Flink-on-Hive“)? E.g., to address 4, 5, 6, 8, 9, and 10, this could be more loosely coupled than integrating Hive into all possible Flink core modules and thus introducing a very tight dependency on Hive in the core.
1, 2, and 3 could be achieved via a connector based on the Flink Table API.
This is just a proposal to start this endeavour as independent projects (Hive engine, connector) to avoid too tight a coupling with Flink. Maybe, in a more distant future, if the Hive integration is heavily demanded, one could then integrate it more tightly if needed.

What is meant by 11?
> On Oct 11, 2018 at 05:01, Zhang, Xuefu <xu...@alibaba-inc.com> wrote:
> 
> Hi Fabian/Vino,
> 
> Thank you very much for your encouragement and inquiries. Sorry that I didn't see Fabian's email until I read Vino's response just now. (Somehow Fabian's went to the spam folder.)
> 
> My proposal contains long-term and short-term goals. Nevertheless, the effort will focus on the following areas, which include Fabian's list:
> 
> 1. Hive metastore connectivity - This covers both read and write access, which means Flink can make full use of Hive's metastore as its catalog (at least for batch, though this can be extended to streaming as well).
> 2. Metadata compatibility - Objects (databases, tables, partitions, etc.) created by Hive can be understood by Flink, and the reverse is also true.
> 3. Data compatibility - Similar to #2, data produced by Hive can be consumed by Flink and vice versa.
> 4. Support for Hive UDFs - For all of Hive's native UDFs, Flink either provides its own implementation or makes Hive's implementation work in Flink. Further, for user-created UDFs in Hive, Flink SQL should provide a mechanism that allows users to import them into Flink without any code change required.
> 5. Data types - Flink SQL should support all data types that are available in Hive.
> 6. SQL language - Flink SQL should support the SQL standard (such as SQL:2003) with extensions to support Hive's syntax and language features, around DDL, DML, and SELECT queries.
> 7. SQL CLI - this is currently under development in Flink, but more effort is needed.
> 8. Server - provide a server that's compatible with Hive's HiveServer2 in its Thrift APIs, such that HiveServer2 users can reuse their existing clients (such as Beeline) but connect to Flink's Thrift server instead.
> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for other applications to connect to its Thrift server.
> 10. Support for other user customizations in Hive, such as Hive SerDes, storage handlers, etc.
> 11. Better task failure tolerance and task scheduling in the Flink runtime.
> 
> As you can see, achieving all of those requires significant effort across all layers of Flink. However, a short-term goal could include only the core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope (such as #3, #6).
> 
> Please share your further thoughts. If we generally agree that this is the right direction, I could come up with a formal proposal quickly and then we can follow up with broader discussions.
> 
> Thanks,
> Xuefu
> 
> 
> 
> ------------------------------------------------------------------
> Sender:vino yang <ya...@gmail.com>
> Sent at:2018 Oct 11 (Thu) 09:45
> Recipient:Fabian Hueske <fh...@gmail.com>
> Cc:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>; user <us...@flink.apache.org>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> 
> Hi Xuefu,
> 
> I appreciate this proposal and, like Fabian, I think it would be better if you could give more details of the plan.
> 
> Thanks, vino.
> 
> Fabian Hueske <fh...@gmail.com> wrote on Wednesday, Oct 10, 2018 at 5:27 PM:
> Hi Xuefu,
> 
> Welcome to the Flink community and thanks for starting this discussion! Better Hive integration would be really great!
> Can you go into details of what you are proposing? I can think of a couple ways to improve Flink in that regard:
> 
> * Support for Hive UDFs
> * Support for Hive metadata catalog
> * Support for HiveQL syntax
> * ???
> 
> Best, Fabian
> 
> On Tue., Oct 9, 2018 at 19:22, Zhang, Xuefu <xu...@alibaba-inc.com> wrote:
> Hi all,
> 
> Along with the community's effort, inside Alibaba we have explored Flink's potential as an execution engine not just for stream processing but also for batch processing. We are encouraged by our findings and have initiated an effort to make Flink's SQL capabilities full-fledged. When comparing what's available in Flink to the offerings from competing data processing engines, we identified a major gap in Flink: good integration with the Hive ecosystem. This is crucial to the success of Flink SQL and batch processing due to the well-established data ecosystem around Hive. Therefore, we have done some initial work in this direction, but a lot of effort is still needed.
> 
> We have two strategies in mind. The first is to make Flink SQL full-fledged and well integrated with the Hive ecosystem. This is similar to the approach that Spark SQL adopted. The second strategy is to make Hive itself work with Flink, similar to the proposal in [1]. Each approach has its pros and cons, but they don't need to be mutually exclusive, with each targeting different users and use cases. We believe that both will promote a much greater adoption of Flink beyond stream processing.
> 
> We have been focused on the first approach and would like to showcase Flink's batch and SQL capabilities with Flink SQL. However, we have also planned to start strategy #2 as the follow-up effort.
> 
> I'm completely new to Flink (a short bio of mine is included in [2] below), though many of my colleagues here at Alibaba are long-time contributors. Nevertheless, I'd like to share our thoughts and invite your early feedback. At the same time, I am working on a detailed proposal on Flink SQL's integration with the Hive ecosystem, which will also be shared when ready.
> 
> While the ideas are simple, each approach will demand significant effort, more than what we can afford. Thus, input and contributions from the community are greatly welcome and appreciated.
> 
> Regards,
> 
> 
> Xuefu
> 
> References:
> 
> [1] https://issues.apache.org/jira/browse/HIVE-10712
> [2] Xuefu Zhang is a long-time open source veteran who has worked on many projects under the Apache Software Foundation, of which he is also an honored member. About 10 years ago he worked on the Hadoop team at Yahoo, when those projects had just got started. Later he worked at Cloudera, initiating and leading the development of the Hive on Spark project in the community and across many organizations. Prior to joining Alibaba, he worked at Uber, where he rolled out Hive on Spark to all of Uber's SQL-on-Hadoop workload and significantly improved Uber's cluster efficiency.
> 
> 

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by Timo Walther <tw...@apache.org>.
Hi Bowen,

thanks for your feedback. We should not change the Google doc anymore
but instead apply additional comments on the wiki page. I will also add a bit
more explanation to some parts so that people know about certain design
decisions.

Regards,
Timo


On Jan 8, 2019 at 22:54, Bowen Li wrote:
> Thank you, Xuefu and Timo, for putting together the FLIP! I like that both
> its scope and implementation plan are clear. I look forward to feedback from
> the group.
>
> I also added a few more complementary details in the doc.
>
> Thanks,
> Bowen
>
>
> On Mon, Jan 7, 2019 at 8:37 PM Zhang, Xuefu <xu...@alibaba-inc.com> wrote:
>
>> Thanks, Timo!
>>
>> I have started putting the content from the Google doc into FLIP-30 [1].
>> However, please still keep the discussion along this thread.
>>
>> Thanks,
>> Xuefu
>>
>> [1]
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-30%3A+Unified+Catalog+APIs
>>
>>
>> ------------------------------------------------------------------
>> From:Timo Walther <tw...@apache.org>
>> Sent At:2019 Jan. 7 (Mon.) 05:59
>> To:dev <de...@flink.apache.org>
>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>
>> Hi everyone,
>>
>> Xuefu and I went through multiple iterations over the catalog design document
>> [1]. I believe that it is in good shape now to be converted into a FLIP.
>> Maybe we need a bit more explanation in some places, but the general
>> design is ready now.
>>
>> The design document covers the following changes:
>> - Unify external catalog interface and Flink's internal catalog in
>> TableEnvironment
>> - Clearly define a hierarchy of reference objects namely:
>> "catalog.database.table"
>> - Enable a tight integration with Hive + Hive data connectors as well as
>> a broad integration with existing TableFactories and discovery mechanism
>> - Make the catalog interfaces more feature complete by adding views and
>> functions
>>
>> If you have any further feedback, it would be great to give it now
>> before we convert it into a FLIP.
>>
>> Thanks,
>> Timo
>>
>> [1]
>>
>> https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit#
>>
>>
>>
>> On Jan 7, 2019 at 13:51, Timo Walther wrote:
>>> Hi Eron,
>>>
>>> thank you very much for the contributions. I merged the first little
>>> bug fixes. For the remaining PRs I think we can review and merge them
>>> soon. As you said, the code is agnostic to the details of the
>>> ExternalCatalog interface and I don't expect bigger merge conflicts in
>>> the near future.
>>>
>>> However, exposing the current external catalog interfaces to SQL
>>> Client users would make it even more difficult to change the
>>> interfaces in the future. So maybe I would first wait until the
>>> general catalog discussion is over and the FLIP has been created. This
>>> should happen shortly.
>>>
>>> We should definitely coordinate the efforts better in the future to
>>> avoid duplicate work.
>>>
>>> Thanks,
>>> Timo
>>>
>>>
>>> On Jan 7, 2019 at 00:24, Eron Wright wrote:
>>>> Thanks, Timo, for merging a couple of the PRs. Are you also able to
>>>> review the others that I mentioned? Xuefu, I would like to incorporate
>>>> your feedback too.
>>>>
>>>> Check out this short demonstration of using a catalog in SQL Client:
>>>> https://asciinema.org/a/C8xuAjmZSxCuApgFgZQyeIHuo
>>>>
>>>> Thanks again!
>>>>
>>>> On Thu, Jan 3, 2019 at 9:37 AM Eron Wright <eronwright@gmail.com> wrote:
>>>>
>>>>      Would a couple folks raise their hand to make a review pass thru
>>>>      the 6 PRs listed above?  It is a lovely stack of PRs that is 'all
>>>>      green' at the moment.   I would be happy to open follow-on PRs to
>>>>      rapidly align with other efforts.
>>>>
>>>>      Note that the code is agnostic to the details of the
>>>>      ExternalCatalog interface; the code would not be obsolete if/when
>>>>      the catalog interface is enhanced as per the design doc.
>>>>
>>>>
>>>>
>>>>      On Wed, Jan 2, 2019 at 1:35 PM Eron Wright <eronwright@gmail.com> wrote:
>>>>
>>>>          I propose that the community review and merge the PRs that I
>>>>          posted, and then evolve the design thru 1.8 and beyond.  I
>>>>          think having a basic infrastructure in place now will
>>>>          accelerate the effort, do you agree?
>>>>
>>>>          Thanks again!
>>>>
>>>>          On Wed, Jan 2, 2019 at 11:20 AM Zhang, Xuefu <xuefu.z@alibaba-inc.com> wrote:
>>>>
>>>>              Hi Eron,
>>>>
>>>>              Happy New Year!
>>>>
>>>>              Thank you very much for your contribution, especially
>>>>              during the holidays. While I'm encouraged by your work, I'd
>>>>              also like to share my thoughts on how to move forward.
>>>>
>>>>              First, please note that the design discussion is still
>>>>              being finalized, and we expect some moderate changes,
>>>>              especially around TableFactories. Another pending change
>>>>              is our decision to shy away from Scala, which will
>>>>              impact our work.
>>>>
>>>>              Secondly, while your work seems to be about plugging
>>>>              catalog definitions into the execution environment, which
>>>>              is less impacted by the TableFactory change, I did notice
>>>>              some duplication between your work and ours. This is no
>>>>              big deal, but going forward we should probably
>>>>              communicate better about work assignments so as to avoid
>>>>              any possible duplication. On the other hand, I think
>>>>              some of your work is interesting and valuable for
>>>>              inclusion once we finalize the overall design.
>>>>
>>>>              Thus, please continue your research and experiments, and
>>>>              let us know when you start working on anything so we can
>>>>              coordinate better.
>>>>
>>>>              Thanks again for your interest and contributions.
>>>>
>>>>              Thanks,
>>>>              Xuefu
>>>>
>>>>
>>>>
>>>> ------------------------------------------------------------------
>>>>                  From:Eron Wright <eronwright@gmail.com>
>>>>                  Sent At:2019 Jan. 1 (Tue.) 18:39
>>>>                  To:dev <dev@flink.apache.org>; Xuefu <xuefu.z@alibaba-inc.com>
>>>>                  Cc:Xiaowei Jiang <xiaoweij@gmail.com>; twalthr <twalthr@apache.org>;
>>>>                  piotr <piotr@data-artisans.com>; Fabian Hueske <fhueske@gmail.com>;
>>>>                  suez1224 <suez1224@gmail.com>; Bowen Li <bowenli86@gmail.com>
>>>>                  Subject:Re: [DISCUSS] Integrate Flink SQL well with
>>>>                  Hive ecosystem
>>>>
>>>>                  Hi folks, there are clearly some incremental steps to be
>>>>                  taken to introduce catalog support to SQL Client,
>>>>                  complementary to what is proposed in the Flink-Hive
>>>>                  Metastore design doc.  I was quietly working on this
>>>>                  over the holidays.   I posted some new sub-tasks, PRs,
>>>>                  and sample code to FLINK-10744.
>>>>
>>>>                  What inspired me to get involved is that the catalog
>>>>                  interface seems like a great way to encapsulate a
>>>>                  'library' of Flink tables and functions. For example,
>>>>                  the NYC Taxi dataset (TaxiRides, TaxiFares, various
>>>>                  UDFs) may be nicely encapsulated as a catalog
>>>>                  (TaxiData).  Such a library should be fully consumable
>>>>                  in SQL Client.
>>>>
>>>>                  I implemented the above. Some highlights:
>>>>                  1. A fully-worked example of using the Taxi dataset in
>>>>                  SQL Client via an environment file.
>>>>                  - an ASCII video showing the SQL Client in action:
>>>> https://asciinema.org/a/C8xuAjmZSxCuApgFgZQyeIHuo
>>>>
>>>>                  - the corresponding environment file (will be even
>>>>                  more concise once 'FLINK-10696 Catalog UDFs' is merged):
>>>> https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/dist/conf/sql-client-defaults.yaml
>>>>                  - the typed API for standalone table applications:
>>>> https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/java/com/dataartisans/flinktraining/examples/table_java/examples/ViaCatalog.java#L50
>>>>                  2. Implementation of the core catalog descriptor and
>>>>                  factory.  I realize that some renames may later occur
>>>>                  as per the design doc, and would be happy to do that
>>>>                  as a follow-up.
>>>>                  https://github.com/apache/flink/pull/7390
>>>>
>>>>                  3. Implementation of a connect-style API on
>>>>                  TableEnvironment to use catalog descriptor.
>>>>                  https://github.com/apache/flink/pull/7392
>>>>
>>>>                  4. Integration into SQL-Client's environment file:
>>>>                  https://github.com/apache/flink/pull/7393
>>>>
>>>>                  I realize that the overall Hive integration is still
>>>>                  evolving, but I believe that these PRs are a good
>>>>                  stepping stone. Here's the list (in bottom-up order):
>>>>                  - https://github.com/apache/flink/pull/7386
>>>>                  - https://github.com/apache/flink/pull/7388
>>>>                  - https://github.com/apache/flink/pull/7389
>>>>                  - https://github.com/apache/flink/pull/7390
>>>>                  - https://github.com/apache/flink/pull/7392
>>>>                  - https://github.com/apache/flink/pull/7393
>>>>
>>>>                  Thanks and enjoy 2019!
>>>>                  Eron W
>>>>
>>>>
>>>>                  On Sun, Nov 18, 2018 at 3:04 PM Zhang, Xuefu
>>>>                  <xuefu.z@alibaba-inc.com> wrote:
>>>>                  Hi Xiaowei,
>>>>
>>>>                  Thanks for bringing up the question. In the current
>>>>                  design, the properties for meta objects are meant to
>>>>                  cover anything that's specific to a particular catalog
>>>>                  and opaque to Flink. Anything that is common (such
>>>>                  as the schema for tables, query text for views, and UDF
>>>>                  classname) is abstracted as members of the respective
>>>>                  classes. However, this is still under discussion, and
>>>>                  Timo and I will go over this and provide an update.
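The split described above can be sketched roughly as follows. This is an illustrative sketch only — the class and field names are hypothetical, not the actual Flink interfaces under discussion in the design doc:

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the design being discussed: attributes every
// catalog understands (e.g. the table schema) are strongly typed members,
// while catalog-specific settings live in a free-form properties map that
// Flink treats as opaque. Names are illustrative, not real Flink classes.
public class CatalogTableSketch {

    public static final class TableSchema {
        public final List<String> columnNames;
        public final List<String> columnTypes;

        public TableSchema(List<String> columnNames, List<String> columnTypes) {
            this.columnNames = columnNames;
            this.columnTypes = columnTypes;
        }
    }

    public static final class CatalogTable {
        public final TableSchema schema;              // common, understood by Flink
        public final Map<String, String> properties;  // catalog-specific, opaque to Flink

        public CatalogTable(TableSchema schema, Map<String, String> properties) {
            this.schema = schema;
            this.properties = properties;
        }
    }

    public static void main(String[] args) {
        CatalogTable rides = new CatalogTable(
                new TableSchema(List.of("rideId", "fare"), List.of("BIGINT", "DOUBLE")),
                Map.of("hive.storage.format", "ORC")); // e.g. a Hive-only detail
        System.out.println(rides.schema.columnNames + " " + rides.properties);
    }
}
```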
>>>>
>>>>                  Please note that UDF is a little more involved than
>>>>                  what the current design doc shows. I'm still refining
>>>>                  this part.
>>>>
>>>>                  Thanks,
>>>>                  Xuefu
>>>>
>>>>
>>>> ------------------------------------------------------------------
>>>>                  Sender: Xiaowei Jiang <xiaoweij@gmail.com>
>>>>                  Sent at: 2018 Nov 18 (Sun) 15:17
>>>>                  Recipient: dev <dev@flink.apache.org>
>>>>                  Cc: Xuefu <xuefu.z@alibaba-inc.com>; twalthr
>>>>                  <twalthr@apache.org>; piotr <piotr@data-artisans.com>;
>>>>                  Fabian Hueske <fhueske@gmail.com>; suez1224
>>>>                  <suez1224@gmail.com>
>>>>                  Subject: Re: [DISCUSS] Integrate Flink SQL well with
>>>>                  Hive ecosystem
>>>>
>>>>                  Thanks Xuefu for the detailed design doc! One question
>>>>                  on the properties associated with the catalog objects:
>>>>                  are we going to leave them completely free-form, or
>>>>                  are we going to set some standard for them? I think
>>>>                  the answer may depend on whether we want to explore
>>>>                  catalog-specific optimization opportunities. In any
>>>>                  case, I think it might be helpful to standardize as
>>>>                  much as possible into strongly typed classes and leave
>>>>                  these properties for catalog-specific things. But I
>>>>                  think we can do it in steps.
>>>>
>>>>                  Xiaowei
>>>>                  On Fri, Nov 16, 2018 at 4:00 AM Bowen Li
>>>>                  <bowenli86@gmail.com> wrote:
>>>>                  Thanks for continuing to improve the overall design,
>>>>                  Xuefu! It looks quite good to me now.
>>>>
>>>>                  It would be nice if the cc-ed Flink committers could
>>>>                  help review and confirm!
>>>>
>>>>
>>>>
>>>>                   One minor suggestion: since the last section of the
>>>>                  design doc already touches on
>>>>                   some new SQL statements, shall we add another section
>>>>                  to the doc and
>>>>                   formalize the new SQL statements in SQL Client and
>>>>                  TableEnvironment that
>>>>                   will naturally come along with our design? Here
>>>>                  are some that the
>>>>                   design doc mentioned and some that I came up with:
>>>>
>>>>                   To be added:
>>>>
>>>>                      - USE <catalog> - set default catalog
>>>>                      - USE <catalog.schema> - set default schema
>>>>                      - SHOW CATALOGS - show all registered catalogs
>>>>                      - SHOW SCHEMAS [FROM catalog] - list schemas in
>>>>                  the current default
>>>>                      catalog or the specified catalog
>>>>                      - DESCRIBE VIEW view - show the view's definition
>>>>                  in CatalogView
>>>>                      - SHOW VIEWS [FROM schema/catalog.schema] - show
>>>>                  views from current or a
>>>>                      specified schema.
>>>>
>>>>                      (DDLs that can be addressed by either our design
>>>>                  or Shuyi's DDL design)
>>>>
>>>>                      - CREATE/DROP/ALTER SCHEMA schema
>>>>                      - CREATE/DROP/ALTER CATALOG catalog
>>>>
>>>>                   To be modified:
>>>>
>>>>                      - SHOW TABLES [FROM schema/catalog.schema] - show
>>>>                  tables from current or
>>>>                      a specified schema. Add 'from schema' to existing
>>>>                  'SHOW TABLES' statement
>>>>                      - SHOW FUNCTIONS [FROM schema/catalog.schema] -
>>>>                      show functions from current or a specified schema.
>>>>                      Add 'from schema' to the existing 'SHOW FUNCTIONS'
>>>>                      statement
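Taken together, a hypothetical SQL Client session exercising the statements proposed above might look like this. None of these were implemented at the time of this thread; the catalog and schema names are illustrative:

```sql
-- Sketch of the proposed session-level statements (illustrative only):
SHOW CATALOGS;                        -- list all registered catalogs
USE mycatalog;                        -- set the default catalog
SHOW SCHEMAS FROM mycatalog;          -- list schemas in a catalog
USE mycatalog.mydb;                   -- set the default schema
SHOW TABLES FROM mycatalog.mydb;      -- list tables in a specific schema
SHOW VIEWS FROM mydb;                 -- list views in a schema
DESCRIBE VIEW myview;                 -- show the view's definition
SHOW FUNCTIONS FROM mycatalog.mydb;   -- list functions in a schema
CREATE SCHEMA newdb;                  -- DDL covered by either design
```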
>>>>
>>>>
>>>>                   Thanks, Bowen
>>>>
>>>>
>>>>
>>>>                   On Wed, Nov 14, 2018 at 10:39 PM Zhang, Xuefu
>>>>                   <xuefu.z@alibaba-inc.com> wrote:
>>>>
>>>>                   > Thanks, Bowen, for catching the error. I have
>>>>                   > granted comment permission on the link.
>>>>                   >
>>>>                   > I also updated the doc with the latest class
>>>>                  definitions. Everyone is
>>>>                   > encouraged to review and comment.
>>>>                   >
>>>>                   > Thanks,
>>>>                   > Xuefu
>>>>                   >
>>>>                   >
>>>> ------------------------------------------------------------------
>>>>                   > Sender: Bowen Li <bowenli86@gmail.com>
>>>>                   > Sent at: 2018 Nov 14 (Wed) 06:44
>>>>                   > Recipient: Xuefu <xuefu.z@alibaba-inc.com>
>>>>                   > Cc: piotr <piotr@data-artisans.com>; dev
>>>>                   > <dev@flink.apache.org>; Shuyi Chen
>>>>                   > <suez1224@gmail.com>
>>>>                   > Subject: Re: [DISCUSS] Integrate Flink SQL well
>>>>                   > with Hive ecosystem
>>>>                   >
>>>>                   > Hi Xuefu,
>>>>                   >
>>>>                   > Currently the new design doc
>>>>                   > <https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit>
>>>>                   > is in "view only" mode, and people cannot leave
>>>>                   > comments. Can you please change it to "can comment"
>>>>                   > or "can edit" mode?
>>>>                   >
>>>>                   > Thanks, Bowen
>>>>                   >
>>>>                   >
>>>>                   > On Mon, Nov 12, 2018 at 9:51 PM Zhang, Xuefu
>>>>                   > <xuefu.z@alibaba-inc.com> wrote:
>>>>                   > Hi Piotr
>>>>                   >
>>>>                   > I have extracted the API portion of the design, and
>>>>                   > the Google doc is here:
>>>>                   > <https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit?usp=sharing>.
>>>>                   > Please review and provide your feedback.
>>>>                   >
>>>>                   > Thanks,
>>>>                   > Xuefu
>>>>                   >
>>>>                   >
>>>> ------------------------------------------------------------------
>>>>                   > Sender: Xuefu <xuefu.z@alibaba-inc.com>
>>>>                   > Sent at: 2018 Nov 12 (Mon) 12:43
>>>>                   > Recipient: Piotr Nowojski <piotr@data-artisans.com>;
>>>>                   > dev <dev@flink.apache.org>
>>>>                   > Cc: Bowen Li <bowenli86@gmail.com>; Shuyi Chen
>>>>                   > <suez1224@gmail.com>
>>>>                   > Subject: Re: [DISCUSS] Integrate Flink SQL well
>>>>                   > with Hive ecosystem
>>>>                   >
>>>>                   > Hi Piotr,
>>>>                   >
>>>>                   > That sounds good to me. Let's close all the open
>>>>                   > questions (there are a couple of them) in the
>>>>                   > Google doc, and I should be able to quickly split
>>>>                   > it into the three proposals as you suggested.
>>>>                   >
>>>>                   > Thanks,
>>>>                   > Xuefu
>>>>                   >
>>>>                   >
>>>> ------------------------------------------------------------------
>>>>                   > Sender: Piotr Nowojski <piotr@data-artisans.com>
>>>>                   > Sent at: 2018 Nov 9 (Fri) 22:46
>>>>                   > Recipient: dev <dev@flink.apache.org>; Xuefu
>>>>                   > <xuefu.z@alibaba-inc.com>
>>>>                   > Cc: Bowen Li <bowenli86@gmail.com>; Shuyi Chen
>>>>                   > <suez1224@gmail.com>
>>>>                   > Subject: Re: [DISCUSS] Integrate Flink SQL well
>>>>                   > with Hive ecosystem
>>>>                   >
>>>>                   > Hi,
>>>>                   >
>>>>                   >
>>>>                   > Yes, it seems like the best solution. Maybe someone
>>>>                   > else can also suggest whether we can split it
>>>>                   > further: changes to the interfaces in one doc,
>>>>                   > reading from the Hive metastore in another, and
>>>>                   > finally storing our meta information in the Hive
>>>>                   > metastore in a third?
>>>>                   >
>>>>                   > Piotrek
>>>>                   >
>>>>                   > > On 9 Nov 2018, at 01:44, Zhang, Xuefu
>>>>                   > > <xuefu.z@alibaba-inc.com> wrote:
>>>>                   > >
>>>>                   > > Hi Piotr,
>>>>                   > >
>>>>                   > > That seems to be a good idea!
>>>>                   > >
>>>>                   >
>>>>                   > > Since the Google doc for the design is currently
>>>>                   > > under extensive review, I will leave it as it is
>>>>                   > > for now. However, I'll convert it into two
>>>>                   > > different FLIPs when the time comes.
>>>>                   > >
>>>>                   > > How does it sound to you?
>>>>                   > >
>>>>                   > > Thanks,
>>>>                   > > Xuefu
>>>>                   > >
>>>>                   > >
>>>>                   > >
>>>> ------------------------------------------------------------------
>>>>                   > > Sender: Piotr Nowojski <piotr@data-artisans.com>
>>>>                   > > Sent at: 2018 Nov 9 (Fri) 02:31
>>>>                   > > Recipient: dev <dev@flink.apache.org>
>>>>                   > > Cc: Bowen Li <bowenli86@gmail.com>; Xuefu
>>>>                   > > <xuefu.z@alibaba-inc.com>; Shuyi Chen
>>>>                   > > <suez1224@gmail.com>
>>>>                   > > Subject: Re: [DISCUSS] Integrate Flink SQL well
>>>>                   > > with Hive ecosystem
>>>>                   > >
>>>>                   > > Hi,
>>>>                   > >
>>>>                   >
>>>>                   > > Maybe we should split this topic (and the design
>>>>                   > > doc) into a couple of smaller ones, hopefully
>>>>                   > > independent. The questions that Fabian has
>>>>                   > > asked, for example, have very little to do with
>>>>                   > > reading metadata from the Hive Metastore.
>>>>                   > >
>>>>                   > > Piotrek
>>>>                   > >
>>>>                   > >> On 7 Nov 2018, at 14:27, Fabian Hueske
>>>>                   > >> <fhueske@gmail.com> wrote:
>>>>                   > >>
>>>>                   > >> Hi Xuefu and all,
>>>>                   > >>
>>>>                   > >> Thanks for sharing this design document!
>>>>                   >
>>>>                   > >> I'm very much in favor of restructuring /
>>>>                  reworking the catalog handling in
>>>>                   > >> Flink SQL as outlined in the document.
>>>>                   >
>>>>                   > >> Most changes described in the design document
>>>>                  seem to be rather general and
>>>>                   > >> not specifically related to the Hive integration.
>>>>                   > >>
>>>>                   >
>>>>                   > >> IMO, there are some aspects, especially those at
>>>>                  the boundary of Hive and
>>>>                   > >> Flink, that need a bit more discussion. For
>>>> example
>>>>                   > >>
>>>>                   > >> * What does it take to make Flink schema
>>>>                  compatible with Hive schema?
>>>>                   > >> * How will Flink tables (descriptors) be stored
>>>>                  in HMS?
>>>>                   > >> * How do the two Hive catalogs differ? Could
>>>>                   > >> they be integrated into a single one? When to
>>>>                   > >> use which one?
>>>>                   >
>>>>                   > >> * What meta information is provided by HMS? What
>>>>                  of this can be leveraged
>>>>                   > >> by Flink?
>>>>                   > >>
>>>>                   > >> Thank you,
>>>>                   > >> Fabian
>>>>                   > >>
>>>>                   > >> On Fri, Nov 2, 2018 at 00:31 Bowen Li
>>>>                   > >> <bowenli86@gmail.com> wrote:
>>>>                   > >>
>>>>                   > >>> After taking a look at how other discussion
>>>>                   > >>> threads work, I think it's actually fine to
>>>>                   > >>> just keep our discussion here. It's up to you,
>>>>                   > >>> Xuefu.
>>>>                   > >>>
>>>>                   > >>> The google doc LGTM. I left some minor comments.
>>>>                   > >>>
>>>>                   > >>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li
>>>>                   > >>> <bowenli86@gmail.com> wrote:
>>>>                   > >>>
>>>>                   > >>>> Hi all,
>>>>                   > >>>>
>>>>                   > >>>> As Xuefu has published the design doc on
>>>>                   > >>>> Google, I agree with Shuyi's suggestion that
>>>>                   > >>>> we should probably start a new email thread
>>>>                   > >>>> like "[DISCUSS] ... Hive integration design
>>>>                   > >>>> ..." on the dev mailing list only, for
>>>>                   > >>>> community devs to review. The current thread
>>>>                   > >>>> goes to both the dev and user lists.
>>>>                   > >>>>
>>>>                   >
>>>>                   > >>>> This email thread is more about validating
>>>>                   > >>>> the general idea and direction with the
>>>>                   > >>>> community, and it's been pretty long and
>>>>                   > >>>> crowded so far. Since everyone is in favor of
>>>>                   > >>>> the idea, we can move forward with another
>>>>                   > >>>> thread to discuss and finalize the design.
>>>>                   > >>>>
>>>>                   > >>>> Thanks,
>>>>                   > >>>> Bowen
>>>>                   > >>>>
>>>>                   > >>>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu
>>>>                   > >>>> <xuefu.z@alibaba-inc.com> wrote:
>>>>                   > >>>>
>>>>                   > >>>>> Hi Shuyi,
>>>>                   > >>>>>
>>>>                   >
>>>>                   > >>>>> Good idea. Actually, the PDF was converted
>>>>                   > >>>>> from a Google doc. Here is its link:
>>>>                   > >>>>> <https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing>
>>>>                   > >>>>> Once we reach an agreement, I can convert it
>>>>                  to a FLIP.
>>>>                   > >>>>>
>>>>                   > >>>>> Thanks,
>>>>                   > >>>>> Xuefu
>>>>                   > >>>>>
>>>>                   > >>>>>
>>>>                   > >>>>>
>>>>                   > >>>>>
>>>> ------------------------------------------------------------------
>>>>                   > >>>>> Sender: Shuyi Chen <suez1224@gmail.com>
>>>>                   > >>>>> Sent at: 2018 Nov 1 (Thu) 02:47
>>>>                   > >>>>> Recipient: Xuefu <xuefu.z@alibaba-inc.com>
>>>>                   > >>>>> Cc: vino yang <yanghua1127@gmail.com>; Fabian
>>>>                   > >>>>> Hueske <fhueske@gmail.com>; dev
>>>>                   > >>>>> <dev@flink.apache.org>; user
>>>>                   > >>>>> <user@flink.apache.org>
>>>>                   > >>>>> Subject: Re: [DISCUSS] Integrate Flink SQL
>>>>                   > >>>>> well with Hive ecosystem
>>>>                   > >>>>>
>>>>                   > >>>>> Hi Xuefu,
>>>>                   > >>>>>
>>>>                   >
>>>>                   > >>>>> Thanks a lot for driving this big effort. I
>>>>                   > >>>>> would suggest converting your proposal and
>>>>                   > >>>>> design doc into a Google doc and sharing it
>>>>                   > >>>>> on the dev mailing list for the community to
>>>>                   > >>>>> review and comment on, with a title like
>>>>                   > >>>>> "[DISCUSS] ... Hive integration design ...".
>>>>                   > >>>>> Once approved, we can document it as a FLIP
>>>>                   > >>>>> (Flink Improvement Proposal) and use JIRAs to
>>>>                   > >>>>> track the implementation.
>>>>                   > >>>>> What do you think?
>>>>                   > >>>>>
>>>>                   > >>>>> Shuyi
>>>>                   > >>>>>
>>>>                   > >>>>> On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu
>>>>                   > >>>>> <xuefu.z@alibaba-inc.com> wrote:
>>>>                   > >>>>> Hi all,
>>>>                   > >>>>>
>>>>                   > >>>>> I have also shared a design doc on Hive
>>>>                  metastore integration that is
>>>>                   >
>>>>                   > >>>>> attached here and also to FLINK-10556[1].
>>>>                  Please kindly review and share
>>>>                   > >>>>> your feedback.
>>>>                   > >>>>>
>>>>                   > >>>>>
>>>>                   > >>>>> Thanks,
>>>>                   > >>>>> Xuefu
>>>>                   > >>>>>
>>>>                   > >>>>> [1]
>>>> https://issues.apache.org/jira/browse/FLINK-10556
>>>>                   > >>>>>
>>>> ------------------------------------------------------------------
>>>>                   > >>>>> Sender: Xuefu <xuefu.z@alibaba-inc.com>
>>>>                   > >>>>> Sent at: 2018 Oct 25 (Thu) 01:08
>>>>                   > >>>>> Recipient: Xuefu <xuefu.z@alibaba-inc.com>;
>>>>                   > >>>>> Shuyi Chen <suez1224@gmail.com>
>>>>                   > >>>>> Cc: yanghua1127 <yanghua1127@gmail.com>;
>>>>                   > >>>>> Fabian Hueske <fhueske@gmail.com>; dev
>>>>                   > >>>>> <dev@flink.apache.org>; user
>>>>                   > >>>>> <user@flink.apache.org>
>>>>                   > >>>>> Subject: Re: [DISCUSS] Integrate Flink SQL
>>>>                   > >>>>> well with Hive ecosystem
>>>>                   > >>>>>
>>>>                   > >>>>> Hi all,
>>>>                   > >>>>>
>>>>                   > >>>>> To wrap up the discussion, I have attached a
>>>>                  PDF describing the
>>>>                   >
>>>>                   > >>>>> proposal, which is also attached to
>>>>                  FLINK-10556 [1]. Please feel free to
>>>>                   > >>>>> watch that JIRA to track the progress.
>>>>                   > >>>>>
>>>>                   > >>>>> Please also let me know if you have
>>>>                  additional comments or questions.
>>>>                   > >>>>>
>>>>                   > >>>>> Thanks,
>>>>                   > >>>>> Xuefu
>>>>                   > >>>>>
>>>>                   > >>>>> [1]
>>>> https://issues.apache.org/jira/browse/FLINK-10556
>>>>                   > >>>>>
>>>>                   > >>>>>
>>>>                   > >>>>>
>>>> ------------------------------------------------------------------
>>>>                   > >>>>> Sender: Xuefu <xuefu.z@alibaba-inc.com>
>>>>                   > >>>>> Sent at: 2018 Oct 16 (Tue) 03:40
>>>>                   > >>>>> Recipient: Shuyi Chen <suez1224@gmail.com>
>>>>                   > >>>>> Cc: yanghua1127 <yanghua1127@gmail.com>;
>>>>                   > >>>>> Fabian Hueske <fhueske@gmail.com>; dev
>>>>                   > >>>>> <dev@flink.apache.org>; user
>>>>                   > >>>>> <user@flink.apache.org>
>>>>                   > >>>>> Subject: Re: [DISCUSS] Integrate Flink SQL
>>>>                   > >>>>> well with Hive ecosystem
>>>>                   > >>>>>
>>>>                   > >>>>> Hi Shuyi,
>>>>                   > >>>>>
>>>>                   >
>>>>                   > >>>>> Thank you for your input. Yes, I agreed with
>>>>                  a phased approach and like
>>>>                   >
>>>>                   > >>>>> to move forward fast. :) We did some work
>>>>                   > >>>>> internally on DDL utilizing the babel parser
>>>>                   > >>>>> in Calcite. While babel makes Calcite's
>>>>                   > >>>>> grammar extensible, at first impression it
>>>>                   > >>>>> still seems too cumbersome for a project when
>>>>                   > >>>>> too many extensions are made. It's even
>>>>                   > >>>>> challenging to find where the extension
>>>>                   > >>>>> is needed! It would certainly be better if
>>>>                  Calcite can magically support
>>>>                   >
>>>>                   > >>>>> Hive QL by just turning on a flag, such as
>>>>                  that for MYSQL_5. I can also
>>>>                   >
>>>>                   > >>>>> see that this could mean a lot of work on
>>>>                  Calcite. Nevertheless, I will
>>>>                   >
>>>>                   > >>>>> bring up the discussion over there and to see
>>>>                  what their community thinks.
>>>>                   > >>>>>
>>>>                   > >>>>> Would you mind sharing more info about the
>>>>                   > >>>>> DDL proposal you mentioned? We can certainly
>>>>                   > >>>>> collaborate on this.
>>>>                   > >>>>>
>>>>                   > >>>>> Thanks,
>>>>                   > >>>>> Xuefu
>>>>                   > >>>>>
>>>>                   > >>>>>
>>>> ------------------------------------------------------------------
>>>>                   > >>>>> Sender: Shuyi Chen <suez1224@gmail.com>
>>>>                   > >>>>> Sent at: 2018 Oct 14 (Sun) 08:30
>>>>                   > >>>>> Recipient: Xuefu <xuefu.z@alibaba-inc.com>
>>>>                   > >>>>> Cc: yanghua1127 <yanghua1127@gmail.com>;
>>>>                   > >>>>> Fabian Hueske <fhueske@gmail.com>; dev
>>>>                   > >>>>> <dev@flink.apache.org>; user
>>>>                   > >>>>> <user@flink.apache.org>
>>>>                   > >>>>> Subject: Re: [DISCUSS] Integrate Flink SQL
>>>>                   > >>>>> well with Hive ecosystem
>>>>                   > >>>>>
>>>>                   > >>>>> Welcome to the community and thanks for the
>>>>                   > >>>>> great proposal, Xuefu! I think the proposal
>>>>                   > >>>>> can be divided into two stages: making Flink
>>>>                   > >>>>> support Hive features, and making Hive work
>>>>                   > >>>>> with Flink. I agree with Timo on starting
>>>>                   > >>>>> with a smaller scope, so we can make progress
>>>>                   > >>>>> faster. As for [6], a proposal for DDL is
>>>>                   > >>>>> already in progress and will come after the
>>>>                   > >>>>> unified SQL connector API is done. For
>>>>                   > >>>>> supporting Hive syntax, we might need to work
>>>>                   > >>>>> with the Calcite community, and a recent
>>>>                   > >>>>> effort called babel
>>>>                   > >>>>> (https://issues.apache.org/jira/browse/CALCITE-2280)
>>>>                   > >>>>> in Calcite might help here.
>>>>                   > >>>>>
>>>>                   > >>>>> Thanks
>>>>                   > >>>>> Shuyi
>>>>                   > >>>>>
>>>>                   > >>>>> On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu
>>>>                   > >>>>> <xuefu.z@alibaba-inc.com> wrote:
>>>>                   > >>>>> Hi Fabian/Vino,
>>>>                   > >>>>>
>>>>                   > >>>>> Thank you very much for your encouragement
>>>>                   > >>>>> and inquiry. Sorry that I didn't see Fabian's
>>>>                   > >>>>> email until I read Vino's response just now.
>>>>                   > >>>>> (Somehow Fabian's went to the spam folder.)
>>>>                   > >>>>>
>>>>                   >
>>>>                   > >>>>> My proposal contains long-term and
>>>>                   > >>>>> short-term goals. Nevertheless, the effort
>>>>                   > >>>>> will focus on the following areas, including
>>>>                   > >>>>> Fabian's list:
>>>>                   > >>>>>
>>>>                   > >>>>> 1. Hive metastore connectivity - This covers
>>>>                   > >>>>> both read and write access, which means Flink
>>>>                   > >>>>> can make full use of Hive's metastore as its
>>>>                   > >>>>> catalog (at least for batch, but this can be
>>>>                   > >>>>> extended to streaming as well).
>>>>                   >
>>>>                   > >>>>> 2. Metadata compatibility - Objects
>>>>                   > >>>>> (databases, tables, partitions, etc.) created
>>>>                   > >>>>> by Hive can be understood by Flink, and the
>>>>                   > >>>>> reverse direction is true as well.
>>>>                   > >>>>> 3. Data compatibility - Similar to #2, data
>>>>                   > >>>>> produced by Hive can be consumed by Flink and
>>>>                   > >>>>> vice versa.
>>>>                   >
>>>>                   > >>>>> 4. Support Hive UDFs - For all of Hive's
>>>>                   > >>>>> native UDFs, Flink either provides its own
>>>>                   > >>>>> implementation or makes Hive's implementation
>>>>                   > >>>>> work in Flink. Further, for user-created UDFs
>>>>                   > >>>>> in Hive, Flink SQL should provide a mechanism
>>>>                   > >>>>> allowing users to import them into Flink
>>>>                   > >>>>> without any code change required.
>>>>                   > >>>>> 5. Data types - Flink SQL should support all
>>>>                  data types that are
>>>>                   > >>>>> available in Hive.
>>>>                   > >>>>> 6. SQL Language - Flink SQL should support
>>>>                   > >>>>> the SQL standard (such as SQL:2003) with
>>>>                   > >>>>> extensions to support Hive's syntax and
>>>>                   > >>>>> language features around DDL, DML, and SELECT
>>>>                   > >>>>> queries.
>>>>                   >
>>>>                   > >>>>> 7. SQL CLI - this is currently being
>>>>                   > >>>>> developed in Flink, but more effort is
>>>>                   > >>>>> needed.
>>>>                   >
>>>>                   > >>>>> 8. Server - provide a server that's
>>>>                   > >>>>> compatible with Hive's HiveServer2 in its
>>>>                   > >>>>> thrift APIs, such that HiveServer2 users can
>>>>                   > >>>>> reuse their existing clients (such as
>>>>                   > >>>>> beeline) but connect to Flink's thrift
>>>>                   > >>>>> server instead.
>>>>                   >
>>>>                   > >>>>> 9. JDBC/ODBC drivers - Flink may provide its
>>>>                   > >>>>> own JDBC/ODBC drivers for other applications
>>>>                   > >>>>> to use to connect to its thrift server.
>>>>                   > >>>>> 10. Support other user customizations in
>>>>                   > >>>>> Hive, such as Hive SerDes, storage handlers,
>>>>                   > >>>>> etc.
>>>>                   >
>>>>                   > >>>>> 11. Better task failure tolerance and task
>>>>                  scheduling in the Flink runtime.
>>>>                   > >>>>>
>>>>                   > >>>>> As you can see, achieving all of those requires
>>>>                  significant effort
>>>>                   >
>>>>                   > >>>>> across all layers of Flink. However, a
>>>>                  short-term goal could include only
>>>>                   >
>>>>                   > >>>>> the core areas (such as #1, #2, #4, #5, #6, #7) or
>>>>                  start at a smaller scope (such as
>>>>                   > >>>>> #3, #6).
>>>>                   > >>>>>
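To make item #4 concrete: one way to let users import existing Hive-style UDFs without any code change is an adapter that loads the UDF class by name and forwards calls through reflection. The sketch below is purely illustrative; `UpperUdf` and `HiveUdfAdapter` are hypothetical names, not real Flink or Hive classes, and a real adapter would also need to handle overloaded signatures, type inference, and Hive's ObjectInspectors.

```java
import java.lang.reflect.Method;

// Hypothetical sketch; none of these names are actual Flink or Hive APIs.
// Stands in for a user-written Hive-style UDF, which by convention
// exposes a public `evaluate` method.
class UpperUdf {
    public String evaluate(String s) {
        return s.toUpperCase();
    }
}

// Adapter that instantiates the user's UDF class by name and forwards
// calls, so the UDF can be registered without changing its code.
class HiveUdfAdapter {
    private final Object udf;
    private final Method eval;

    HiveUdfAdapter(String className) {
        try {
            Class<?> cls = Class.forName(className);
            udf = cls.getDeclaredConstructor().newInstance();
            // Simplified: assume a single-String-argument evaluate method.
            eval = cls.getMethod("evaluate", String.class);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("cannot load UDF " + className, e);
        }
    }

    String call(String arg) {
        try {
            return (String) eval.invoke(udf, arg);
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }
}
```

The register-by-classname idea is the essence of "import without code change": the user supplies only the class name they already use in Hive.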
>>>>                   >
>>>>                   > >>>>> Please share your further thoughts. If we
>>>>                  generally agree that this is
>>>>                   >
>>>>                   > >>>>> the right direction, I could come up with a
>>>>                  formal proposal quickly and
>>>>                   > >>>>> then we can follow up with broader discussions.
>>>>                   > >>>>>
>>>>                   > >>>>> Thanks,
>>>>                   > >>>>> Xuefu
>>>>                   > >>>>>
>>>>                   > >>>>>
>>>>                   > >>>>>
>>>>                   > >>>>>
>>>> ------------------------------------------------------------------
>>>>                   > >>>>> Sender:vino yang <yanghua1127@gmail.com>
>>>>                   > >>>>> Sent at:2018 Oct 11 (Thu) 09:45
>>>>                   > >>>>> Recipient:Fabian Hueske <fhueske@gmail.com>
>>>>                   > >>>>> Cc:dev <dev@flink.apache.org>; Xuefu
>>>>                  <xuefu.z@alibaba-inc.com>; user <user@flink.apache.org>
>>>>                   > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL
>>>>                  well with Hive ecosystem
>>>>                   > >>>>>
>>>>                   > >>>>> Hi Xuefu,
>>>>                   > >>>>>
>>>>                   >
>>>>                   > >>>>> I appreciate this proposal, and like Fabian, I
>>>>                  think it would be better if you
>>>>                   > >>>>> could give more details of the plan.
>>>>                   > >>>>>
>>>>                   > >>>>> Thanks, vino.
>>>>                   > >>>>>
>>>>                   > >>>>> Fabian Hueske <fhueske@gmail.com> wrote on
>>>>                  Wednesday, Oct 10, 2018 at 5:27 PM:
>>>>                   > >>>>> Hi Xuefu,
>>>>                   > >>>>>
>>>>                   >
>>>>                   > >>>>> Welcome to the Flink community and thanks for
>>>>                  starting this discussion!
>>>>                   > >>>>> Better Hive integration would be really great!
>>>>                   > >>>>> Can you go into details of what you are
>>>>                  proposing? I can think of a
>>>>                   > >>>>> couple ways to improve Flink in that regard:
>>>>                   > >>>>>
>>>>                   > >>>>> * Support for Hive UDFs
>>>>                   > >>>>> * Support for Hive metadata catalog
>>>>                   > >>>>> * Support for HiveQL syntax
>>>>                   > >>>>> * ???
>>>>                   > >>>>>
>>>>                   > >>>>> Best, Fabian
>>>>                   > >>>>>
>>>>                   > >>>>> On Tue., Oct. 9, 2018 at 19:22, Zhang, Xuefu
>>>>                  <xuefu.z@alibaba-inc.com> wrote:
>>>>                   > >>>>> Hi all,
>>>>                   > >>>>>
>>>>                   > >>>>> Along with the community's effort, inside
>>>>                  Alibaba we have explored
>>>>                   >
>>>>                   > >>>>> Flink's potential as an execution engine not
>>>>                  just for stream processing but
>>>>                   > >>>>> also for batch processing. We are encouraged
>>>>                  by our findings and have
>>>>                   >
>>>>                   > >>>>> initiated our effort to make Flink's SQL
>>>>                  capabilities full-fledged. When
>>>>                   >
>>>>                   > >>>>> comparing what's available in Flink to the
>>>>                  offerings from competitive data
>>>>                   >
>>>>                   > >>>>> processing engines, we identified a major gap
>>>>                  in Flink: good integration
>>>>                   >
>>>>                   > >>>>> with the Hive ecosystem. This is crucial to the
>>>>                  success of Flink SQL and batch
>>>>                   >
>>>>                   > >>>>> due to the well-established data ecosystem
>>>>                  around Hive. Therefore, we have
>>>>                   >
>>>>                   > >>>>> done some initial work in this direction,
>>>>                  but a lot of
>>>>                   > >>>>> effort is still needed.
>>>>                   > >>>>>
>>>>                   > >>>>> We have two strategies in mind. The first one
>>>>                  is to make Flink SQL
>>>>                   >
>>>>                   > >>>>> full-fledged and well-integrated with Hive
>>>>                  ecosystem. This is a similar
>>>>                   >
>>>>                   > >>>>> approach to what Spark SQL adopted. The
>>>>                  second strategy is to make Hive
>>>>                   >
>>>>                   > >>>>> itself work with Flink, similar to the
>>>>                  proposal in [1]. Each approach bears
>>>>                   >
>>>>                   > >>>>> its pros and cons, but they don’t need to be
>>>>                  mutually exclusive, with each
>>>>                   > >>>>> targeting different users and use cases.
>>>>                  We believe that both will
>>>>                   > >>>>> promote a much greater adoption of Flink
>>>>                  beyond stream processing.
>>>>                   > >>>>>
>>>>                   > >>>>> We have been focused on the first approach
>>>>                  and would like to showcase
>>>>                   >
>>>>                   > >>>>> Flink's batch and SQL capabilities with Flink
>>>>                  SQL. However, we have also
>>>>                   > >>>>> planned to start strategy #2 as the follow-up
>>>>                  effort.
>>>>                   > >>>>>
>>>>                   >
>>>>                   > >>>>> I'm completely new to Flink (with a short
>>>>                  bio [2] below), though many
>>>>                   >
>>>>                   > >>>>> of my colleagues here at Alibaba are
>>>>                  long-time contributors. Nevertheless,
>>>>                   >
>>>>                   > >>>>> I'd like to share our thoughts and invite
>>>>                  your early feedback. At the same
>>>>                   >
>>>>                   > >>>>> time, I am working on a detailed proposal on
>>>>                  Flink SQL's integration with the
>>>>                   > >>>>> Hive ecosystem, which will also be shared
>>>>                  when ready.
>>>>                   > >>>>>
>>>>                   > >>>>> While the ideas are simple, each approach
>>>>                  will demand significant
>>>>                   >
>>>>                   > >>>>> effort, more than what we can afford. Thus,
>>>>                  the input and contributions
>>>>                   > >>>>> from the communities are greatly welcome and
>>>>                  appreciated.
>>>>                   > >>>>>
>>>>                   > >>>>> Regards,
>>>>                   > >>>>>
>>>>                   > >>>>>
>>>>                   > >>>>> Xuefu
>>>>                   > >>>>>
>>>>                   > >>>>> References:
>>>>                   > >>>>>
>>>>                   > >>>>> [1]
>>>>                  https://issues.apache.org/jira/browse/HIVE-10712
>>>>                   >
>>>>                   > >>>>> [2] Xuefu Zhang is a long-time open source
>>>>                  veteran who has worked or is working on
>>>>                   > >>>>> many projects under the Apache Foundation, of
>>>>                  which he is also an honored
>>>>                   >
>>>>                   > >>>>> member. About 10 years ago he worked on the
>>>>                  Hadoop team at Yahoo, where the
>>>>                   >
>>>>                   > >>>>> projects just got started. Later he worked at
>>>>                  Cloudera, initiating and
>>>>                   >
>>>>                   > >>>>> leading the development of Hive on Spark
>>>>                  project in the communities and
>>>>                   >
>>>>                   > >>>>> across many organizations. Prior to joining
>>>>                  Alibaba, he worked at Uber
>>>>                   >
>>>>                   > >>>>> where he promoted Hive on Spark for all of Uber's
>>>>                  SQL-on-Hadoop workloads and
>>>>                   > >>>>> significantly improved Uber's cluster
>>>> efficiency.
>>>>                   > >>>>>
>>>>                   > >>>>>
>>>>                   > >>>>>
>>>>                   > >>>>>
>>>>                   > >>>>> --
>>>>                   >
>>>>                   > >>>>> "So you have to trust that the dots will
>>>>                  somehow connect in your future."
>>>>                   > >>>>>
>>>>                   > >>>>>
>>>>                   >
>>>>                   >
>>>>
>>>
>>


Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by Bowen Li <bo...@gmail.com>.
Thank you, Xuefu and Timo, for putting together the FLIP! I like that both
its scope and implementation plan are clear. Looking forward to feedback from
the group.

I also added a few more complementary details in the doc.

Thanks,
Bowen


On Mon, Jan 7, 2019 at 8:37 PM Zhang, Xuefu <xu...@alibaba-inc.com> wrote:

> Thanks, Timo!
>
> I have started putting the content from the Google doc into FLIP-30 [1].
> However, please keep the discussion on this thread.
>
> Thanks,
> Xuefu
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-30%3A+Unified+Catalog+APIs
>
>
> ------------------------------------------------------------------
> From:Timo Walther <tw...@apache.org>
> Sent At:2019 Jan. 7 (Mon.) 05:59
> To:dev <de...@flink.apache.org>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi everyone,
>
> Xuefu and I had multiple iterations over the catalog design document
> [1]. I believe that it is now in good shape to be converted into a FLIP.
> Maybe we need a bit more explanation in some places, but the general
> design is ready now.
>
> The design document covers the following changes:
> - Unify external catalog interface and Flink's internal catalog in
> TableEnvironment
> - Clearly define a hierarchy of reference objects namely:
> "catalog.database.table"
> - Enable a tight integration with Hive + Hive data connectors as well as
> a broad integration with existing TableFactories and discovery mechanism
> - Make the catalog interfaces more feature-complete by adding views and
> functions
>
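As a rough illustration of the hierarchy described above, a minimal sketch of a unified "catalog.database.table" reference and a shared catalog interface might look like the following. All names here are illustrative only, not the actual FLIP interfaces:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative sketch only; not the real FLIP-30 interfaces.
// A fully qualified reference to a catalog object: catalog.database.table
final class ObjectPath {
    final String catalog, database, object;

    ObjectPath(String catalog, String database, String object) {
        this.catalog = catalog;
        this.database = database;
        this.object = object;
    }

    String fullName() {
        return catalog + "." + database + "." + object;
    }
}

// One interface that both Flink's internal catalog and external
// catalogs (e.g. Hive Metastore) could implement in TableEnvironment.
interface ReadableCatalog {
    Optional<String> getTable(String database, String name); // simplified payload
}

// In-memory stand-in demonstrating the shared interface.
class InMemoryCatalog implements ReadableCatalog {
    private final Map<String, String> tables = new HashMap<>();

    void registerTable(String database, String name, String definition) {
        tables.put(database + "." + name, definition);
    }

    public Optional<String> getTable(String database, String name) {
        return Optional.ofNullable(tables.get(database + "." + name));
    }
}
```

The point of the unification is that a resolver can walk the same `catalog.database.object` path regardless of where the metadata actually lives.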
> If you have any further feedback, it would be great to give it now
> before we convert it into a FLIP.
>
> Thanks,
> Timo
>
> [1]
>
> https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit#
>
>
>
> On 07.01.19 at 13:51, Timo Walther wrote:
> > Hi Eron,
> >
> > thank you very much for the contributions. I merged the first little
> > bug fixes. For the remaining PRs I think we can review and merge them
> > soon. As you said, the code is agnostic to the details of the
> > ExternalCatalog interface and I don't expect bigger merge conflicts in
> > the near future.
> >
> > However, exposing the current external catalog interfaces to SQL
> > Client users would make it even more difficult to change the
> > interfaces in the future. So maybe I would first wait until the
> > general catalog discussion is over and the FLIP has been created. This
> > should happen shortly.
> >
> > We should definitely coordinate the efforts better in the future to
> > avoid duplicate work.
> >
> > Thanks,
> > Timo
> >
> >
> > On 07.01.19 at 00:24, Eron Wright wrote:
> >> Thanks Timo for merging a couple of the PRs. Are you also able to
> >> review the others that I mentioned? Xuefu, I would like to incorporate
> >> your feedback too.
> >>
> >> Check out this short demonstration of using a catalog in SQL Client:
> >> https://asciinema.org/a/C8xuAjmZSxCuApgFgZQyeIHuo
> >>
> >> Thanks again!
> >>
> >> On Thu, Jan 3, 2019 at 9:37 AM Eron Wright <eronwright@gmail.com> wrote:
> >>
> >>     Would a couple of folks raise their hands to make a review pass
> >>     through the 6 PRs listed above? It is a lovely stack of PRs that is
> >>     'all green' at the moment. I would be happy to open follow-on PRs to
> >>     rapidly align with other efforts.
> >>
> >>     Note that the code is agnostic to the details of the
> >>     ExternalCatalog interface; the code would not be obsolete if/when
> >>     the catalog interface is enhanced as per the design doc.
> >>
> >>
> >>
> >>         On Wed, Jan 2, 2019 at 1:35 PM Eron Wright <eronwright@gmail.com> wrote:
> >>
> >>         I propose that the community review and merge the PRs that I
> >>         posted, and then evolve the design thru 1.8 and beyond.  I
> >>         think having a basic infrastructure in place now will
> >>         accelerate the effort, do you agree?
> >>
> >>         Thanks again!
> >>
> >>         On Wed, Jan 2, 2019 at 11:20 AM Zhang, Xuefu
> >>         <xuefu.z@alibaba-inc.com <ma...@alibaba-inc.com>>
> >> wrote:
> >>
> >>             Hi Eron,
> >>
> >>             Happy New Year!
> >>
> >>             Thank you very much for your contribution, especially
> >>             during the holidays. While I'm encouraged by your work, I'd
> >>             also like to share my thoughts on how to move forward.
> >>
> >>             First, please note that the design discussion is still
> >>             finalizing, and we expect some moderate changes,
> >>             especially around TableFactories. Another pending change
> >>             is our decision to shy away from Scala, which will
> >>             impact our work.
> >>
> >>             Secondly, while your work seems to be about plugging
> >>             catalog definitions into the execution environment, which
> >>             is less impacted by the TableFactory change, I did notice
> >>             some duplication between your work and ours. This is no big
> >>             deal, but going forward we should probably
> >>             communicate better on work assignments so as to avoid any
> >>             possible duplication. On the other hand, I think
> >>             some of your work is interesting and valuable for
> >>             inclusion once we finalize the overall design.
> >>
> >>             Thus, please continue your research and experiment and let
> >>             us know when you start working on anything so we can
> >>             better coordinate.
> >>
> >>             Thanks again for your interest and contributions.
> >>
> >>             Thanks,
> >>             Xuefu
> >>
> >>
> >>
> >> ------------------------------------------------------------------
> >>                 From:Eron Wright <eronwright@gmail.com>
> >>                 Sent At:2019 Jan. 1 (Tue.) 18:39
> >>                 To:dev <dev@flink.apache.org>; Xuefu
> >>                 <xuefu.z@alibaba-inc.com>
> >>                 Cc:Xiaowei Jiang <xiaoweij@gmail.com>; twalthr
> >>                 <twalthr@apache.org>; piotr <piotr@data-artisans.com>;
> >>                 Fabian Hueske <fhueske@gmail.com>; suez1224
> >>                 <suez1224@gmail.com>; Bowen Li <bowenli86@gmail.com>
> >>                 Subject:Re: [DISCUSS] Integrate Flink SQL well with
> >>                 Hive ecosystem
> >>
> >>                 Hi folks, there's clearly some incremental steps to be
> >>                 taken to introduce catalog support to SQL Client,
> >>                 complementary to what is proposed in the Flink-Hive
> >>                 Metastore design doc.  I was quietly working on this
> >>                 over the holidays.   I posted some new sub-tasks, PRs,
> >>                 and sample code to FLINK-10744.
> >>
> >>                 What inspired me to get involved is that the catalog
> >>                 interface seems like a great way to encapsulate a
> >>                 'library' of Flink tables and functions. For example,
> >>                 the NYC Taxi dataset (TaxiRides, TaxiFares, various
> >>                 UDFs) may be nicely encapsulated as a catalog
> >>                 (TaxiData).  Such a library should be fully consumable
> >>                 in SQL Client.
> >>
> >>                 I implemented the above. Some highlights:
> >>                 1. A fully-worked example of using the Taxi dataset in
> >>                 SQL Client via an environment file.
> >>                 - an ASCII video showing the SQL Client in action:
> >> https://asciinema.org/a/C8xuAjmZSxCuApgFgZQyeIHuo
> >>
> >>                 - the corresponding environment file (will be even
> >>                 more concise once 'FLINK-10696 Catalog UDFs' is merged):
> >>
> https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/dist/conf/sql-client-defaults.yaml
> >>
> >>                 - the typed API for standalone table applications:
> >>
> https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/java/com/dataartisans/flinktraining/examples/table_java/examples/ViaCatalog.java#L50
> >>
> >>                 2. Implementation of the core catalog descriptor and
> >>                 factory.  I realize that some renames may later occur
> >>                 as per the design doc, and would be happy to do that
> >>                 as a follow-up.
> >>                 https://github.com/apache/flink/pull/7390
> >>
> >>                 3. Implementation of a connect-style API on
> >>                 TableEnvironment to use catalog descriptor.
> >>                 https://github.com/apache/flink/pull/7392
> >>
> >>                 4. Integration into SQL-Client's environment file:
> >>                 https://github.com/apache/flink/pull/7393
> >>
> >>                 I realize that the overall Hive integration is still
> >>                 evolving, but I believe that these PRs are a good
> >>                 stepping stone. Here's the list (in bottom-up order):
> >>                 - https://github.com/apache/flink/pull/7386
> >>                 - https://github.com/apache/flink/pull/7388
> >>                 - https://github.com/apache/flink/pull/7389
> >>                 - https://github.com/apache/flink/pull/7390
> >>                 - https://github.com/apache/flink/pull/7392
> >>                 - https://github.com/apache/flink/pull/7393
> >>
> >>                 Thanks and enjoy 2019!
> >>                 Eron W
> >>
> >>
> >>                 On Sun, Nov 18, 2018 at 3:04 PM Zhang, Xuefu
> >>                 <xuefu.z@alibaba-inc.com> wrote:
> >>                 Hi Xiaowei,
> >>
> >>                 Thanks for bringing up the question. In the current
> >>                 design, the properties for meta objects are meant to
> >>                 cover anything that's specific to a particular catalog
> >>                 and agnostic to Flink. Anything that is common (such
> >>                 as schema for tables, query text for views, and udf
> >>                 classname) are abstracted as members of the respective
> >>                 classes. However, this is still in discussion, and
> >>                 Timo and I will go over this and provide an update.
> >>
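The split described above, strongly typed members for what is common across catalogs and a free-form map for catalog-specific properties, could be sketched roughly as follows. The names are illustrative only, not actual Flink classes:

```java
import java.util.Collections;
import java.util.Map;

// Illustrative sketch of the design point above, not an actual Flink class.
// Common, Flink-agnostic metadata is strongly typed; anything specific to a
// particular catalog (e.g. Hive SerDe info) lives in the properties map.
class CatalogView {
    private final String[] schemaColumns;         // common: every view has a schema
    private final String queryText;               // common: every view has query text
    private final Map<String, String> properties; // catalog-specific, free-form

    CatalogView(String[] schemaColumns, String queryText,
                Map<String, String> properties) {
        this.schemaColumns = schemaColumns;
        this.queryText = queryText;
        this.properties = Collections.unmodifiableMap(properties);
    }

    String[] getSchemaColumns() { return schemaColumns; }
    String getQueryText()       { return queryText; }
    Map<String, String> getProperties() { return properties; }
}
```

The typed members are what the planner can rely on for optimization; the map stays opaque to Flink and round-trips through the external catalog untouched.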
> >>                 Please note that UDF is a little more involved than
> >>                 what the current design doc shows. I'm still refining
> >>                 this part.
> >>
> >>                 Thanks,
> >>                 Xuefu
> >>
> >>
> >> ------------------------------------------------------------------
> >>                 Sender:Xiaowei Jiang <xiaoweij@gmail.com>
> >>                 Sent at:2018 Nov 18 (Sun) 15:17
> >>                 Recipient:dev <dev@flink.apache.org>
> >>                 Cc:Xuefu <xuefu.z@alibaba-inc.com>; twalthr
> >>                 <twalthr@apache.org>; piotr <piotr@data-artisans.com>;
> >>                 Fabian Hueske <fhueske@gmail.com>; suez1224
> >>                 <suez1224@gmail.com>
> >>                 Subject:Re: [DISCUSS] Integrate Flink SQL well with
> >>                 Hive ecosystem
> >>
> >>                 Thanks Xuefu for the detailed design doc! One question
> >>                 on the properties associated with the catalog objects.
> >>                 Are we going to leave them completely free-form, or are
> >>                 we going to set some standard for them? I think that the
> >>                 answer may depend on whether we want to explore
> >>                 catalog-specific optimization opportunities. In any case,
> >>                 I think that it might be helpful to standardize as much
> >>                 as possible into strongly typed classes and leave these
> >>                 properties for catalog-specific things. But I think that
> >>                 we can do it in steps.
> >>
> >>                 Xiaowei
> >>                 On Fri, Nov 16, 2018 at 4:00 AM Bowen Li
> >>                 <bowenli86@gmail.com> wrote:
> >>                 Thanks for keeping on improving the overall design,
> >>                 Xuefu! It looks quite
> >>                  good to me now.
> >>
> >>                  Would be nice that cc-ed Flink committers can help to
> >>                 review and confirm!
> >>
> >>
> >>
> >>                  One minor suggestion: Since the last section of
> >>                 design doc already touches
> >>                  some new sql statements, shall we add another section
> >>                 in our doc and
> >>                  formalize the new sql statements in SQL Client and
> >>                 TableEnvironment that
> >>                  are gonna come along naturally with our design? Here
> >>                 are some that the
> >>                  design doc mentioned and some that I came up with:
> >>
> >>                  To be added:
> >>
> >>                     - USE <catalog> - set default catalog
> >>                     - USE <catalog.schema> - set default schema
> >>                     - SHOW CATALOGS - show all registered catalogs
> >>                     - SHOW SCHEMAS [FROM catalog] - list schemas in
> >>                 the current default
> >>                     catalog or the specified catalog
> >>                     - DESCRIBE VIEW view - show the view's definition
> >>                 in CatalogView
> >>                     - SHOW VIEWS [FROM schema/catalog.schema] - show
> >>                 views from current or a
> >>                     specified schema.
> >>
> >>                     (DDLs that can be addressed by either our design
> >>                 or Shuyi's DDL design)
> >>
> >>                     - CREATE/DROP/ALTER SCHEMA schema
> >>                     - CREATE/DROP/ALTER CATALOG catalog
> >>
> >>                  To be modified:
> >>
> >>                     - SHOW TABLES [FROM schema/catalog.schema] - show
> >>                 tables from current or
> >>                     a specified schema. Add 'from schema' to existing
> >>                 'SHOW TABLES' statement
> >>                     - SHOW FUNCTIONS [FROM schema/catalog.schema] -
> >>                 show functions from
> >>                     current or a specified schema. Add 'from schema'
> >>                 to the existing 'SHOW FUNCTIONS'
> >>                     statement
> >>
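The session-level semantics behind the proposed `USE <catalog>` and `SHOW CATALOGS` statements above can be sketched as a small piece of session state. This is a hypothetical helper class, not actual SQL Client code:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the session state the new statements would
// manipulate; not actual SQL Client code.
class CatalogSession {
    private final Set<String> catalogs = new LinkedHashSet<>();
    private String currentCatalog;

    void registerCatalog(String name) {
        catalogs.add(name);
    }

    // USE <catalog> - set the default catalog for unqualified names
    void use(String name) {
        if (!catalogs.contains(name)) {
            throw new IllegalArgumentException("unknown catalog: " + name);
        }
        currentCatalog = name;
    }

    // SHOW CATALOGS - list all registered catalogs
    List<String> showCatalogs() {
        return new ArrayList<>(catalogs);
    }

    String currentCatalog() {
        return currentCatalog;
    }
}
```

`USE <catalog.schema>` would extend the same idea with a current-database field, and the SHOW/DESCRIBE statements would then resolve unqualified names against that state.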
> >>
> >>                  Thanks, Bowen
> >>
> >>
> >>
> >>                  On Wed, Nov 14, 2018 at 10:39 PM Zhang, Xuefu
> >>                 <xuefu.z@alibaba-inc.com>
> >>                  wrote:
> >>
> >>                  > Thanks, Bowen, for catching the error. I have
> >>                 granted comment permission
> >>                  > with the link.
> >>                  >
> >>                  > I also updated the doc with the latest class
> >>                 definitions. Everyone is
> >>                  > encouraged to review and comment.
> >>                  >
> >>                  > Thanks,
> >>                  > Xuefu
> >>                  >
> >>                  >
> >> ------------------------------------------------------------------
> >>                  > Sender:Bowen Li <bowenli86@gmail.com>
> >>                  > Sent at:2018 Nov 14 (Wed) 06:44
> >>                  > Recipient:Xuefu <xuefu.z@alibaba-inc.com>
> >>                  > Cc:piotr <piotr@data-artisans.com>; dev
> >>                 <dev@flink.apache.org>; Shuyi
> >>                  > Chen <suez1224@gmail.com>
> >>                  > Subject:Re: [DISCUSS] Integrate Flink SQL well with
> >>                 Hive ecosystem
> >>                  >
> >>                  > Hi Xuefu,
> >>                  >
> >>                  > Currently the new design doc
> >>                  >
> >>                  > <https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit>
> >>                  > is on “view only" mode, and people cannot leave
> >>                 comments. Can you please
> >>                  > change it to "can comment" or "can edit" mode?
> >>                  >
> >>                  > Thanks, Bowen
> >>                  >
> >>                  >
> >>                  > On Mon, Nov 12, 2018 at 9:51 PM Zhang, Xuefu
> >>                 <xuefu.z@alibaba-inc.com>
> >>                  > wrote:
> >>                  > Hi Piotr
> >>                  >
> >>                  > I have extracted the API portion of  the design and
> >>                 the google doc is here
> >>                  >
> >>                  > <https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit?usp=sharing>.
> >>                  > Please review and provide your feedback.
> >>                  >
> >>                  > Thanks,
> >>                  > Xuefu
> >>                  >
> >>                  >
> >> ------------------------------------------------------------------
> >>                  > Sender:Xuefu <xuefu.z@alibaba-inc.com>
> >>                  > Sent at:2018 Nov 12 (Mon) 12:43
> >>                  > Recipient:Piotr Nowojski <piotr@data-artisans.com>;
> >>                  > dev <dev@flink.apache.org>
> >>                  > Cc:Bowen Li <bowenli86@gmail.com>; Shuyi Chen
> >>                 <suez1224@gmail.com>
> >>                  > Subject:Re: [DISCUSS] Integrate Flink SQL well with
> >>                 Hive ecosystem
> >>                  >
> >>                  > Hi Piotr,
> >>                  >
> >>                  > That sounds good to me. Let's close all the open
> >>                 questions (there are a
> >>                  > couple of them) in the Google doc, and I should be
> >>                 able to quickly split
> >>                  > it into the three proposals as you suggested.
> >>                  >
> >>                  > Thanks,
> >>                  > Xuefu
> >>                  >
> >>                  >
> >> ------------------------------------------------------------------
> >>                  > Sender:Piotr Nowojski <piotr@data-artisans.com>
> >>                  > Sent at:2018 Nov 9 (Fri) 22:46
> >>                  > Recipient:dev <dev@flink.apache.org>; Xuefu
> >>                 <xuefu.z@alibaba-inc.com>
> >>                  > Cc:Bowen Li <bowenli86@gmail.com>; Shuyi Chen
> >>                 <suez1224@gmail.com>
> >>                  > Subject:Re: [DISCUSS] Integrate Flink SQL well with
> >>                 Hive ecosystem
> >>                  >
> >>                  > Hi,
> >>                  >
> >>                  >
> >>                  > Yes, it seems like the best solution. Maybe someone
> >>                 else can also suggest whether we can split it further?
> >>                 Maybe changes to the interface in one doc, reading
> >>                 from the Hive Metastore in another, and finally storing
> >>                 our meta information in the Hive Metastore?
> >>                  >
> >>                  > Piotrek
> >>                  >
> >>                  > > On 9 Nov 2018, at 01:44, Zhang, Xuefu
> >>                 <xuefu.z@alibaba-inc.com> wrote:
> >>                  > >
> >>                  > > Hi Piotr,
> >>                  > >
> >>                  > > That seems to be good idea!
> >>                  > >
> >>                  >
> >>                  > > Since the google doc for the design is currently
> >>                 under extensive review, I will leave it as it is for
> >>                 now. However, I'll convert it to two different FLIPs
> >>                 when the time comes.
> >>                  > >
> >>                  > > How does it sound to you?
> >>                  > >
> >>                  > > Thanks,
> >>                  > > Xuefu
> >>                  > >
> >>                  > >
> >>                  > >
> >> ------------------------------------------------------------------
> >>                  > > Sender:Piotr Nowojski <piotr@data-artisans.com>
> >>                  > > Sent at:2018 Nov 9 (Fri) 02:31
> >>                  > > Recipient:dev <dev@flink.apache.org>
> >>                  > > Cc:Bowen Li <bowenli86@gmail.com>; Xuefu
> >>                 <xuefu.z@alibaba-inc.com>; Shuyi Chen <suez1224@gmail.com>
> >>                  > > Subject:Re: [DISCUSS] Integrate Flink SQL well
> >>                 with Hive ecosystem
> >>                  > >
> >>                  > > Hi,
> >>                  > >
> >>                  >
> >>                  > > Maybe we should split this topic (and the design
> >>                 doc) into a couple of smaller ones, hopefully
> >>                 independent. The questions that you have asked Fabian
> >>                 have for example very little to do with reading
> >>                 metadata from Hive Meta Store?
> >>                  > >
> >>                  > > Piotrek
> >>                  > >
> >>                  > >> On 7 Nov 2018, at 14:27, Fabian Hueske
> >>                 <fhueske@gmail.com> wrote:
> >>                  > >>
> >>                  > >> Hi Xuefu and all,
> >>                  > >>
> >>                  > >> Thanks for sharing this design document!
> >>                  >
> >>                  > >> I'm very much in favor of restructuring /
> >>                 reworking the catalog handling in
> >>                  > >> Flink SQL as outlined in the document.
> >>                  >
> >>                  > >> Most changes described in the design document
> >>                 seem to be rather general and
> >>                  > >> not specifically related to the Hive integration.
> >>                  > >>
> >>                  >
> >>                  > >> IMO, there are some aspects, especially those at
> >>                 the boundary of Hive and
> >>                  > >> Flink, that need a bit more discussion. For
> >> example
> >>                  > >>
> >>                  > >> * What does it take to make Flink schema
> >>                 compatible with Hive schema?
> >>                  > >> * How will Flink tables (descriptors) be stored
> >>                 in HMS?
> >>                  > >> * How do both Hive catalogs differ? Could they
> >>                 be integrated into a
> >>                  > >> single one? When to use which one?
> >>                  >
> >>                  > >> * What meta information is provided by HMS? What
> >>                 of this can be leveraged
> >>                  > >> by Flink?
> >>                  > >>
> >>                  > >> Thank you,
> >>                  > >> Fabian
> >>                  > >>
> >>                  > >> On Fri., 2 Nov 2018 at 00:31, Bowen
> >>                 Li <bowenli86@gmail.com> wrote:
> >>                  > >>
> >>                  > >>> After taking a look at how other discussion
> >>                 threads work, I think it's
> >>                  > >>> actually fine to just keep our discussion here.
> >>                 It's up to you, Xuefu.
> >>                  > >>>
> >>                  > >>> The google doc LGTM. I left some minor comments.
> >>                  > >>>
> >>                  > >>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li
> >>                 <bowenli86@gmail.com> wrote:
> >>                  > >>>
> >>                  > >>>> Hi all,
> >>                  > >>>>
> >>                  > >>>> As Xuefu has published the design doc on
> >>                 google, I agree with Shuyi's
> >>                  >
> >>                  > >>>> suggestion that we probably should start a new
> >>                 email thread like "[DISCUSS]
> >>                  >
> >>                  > >>>> ... Hive integration design ..." on only dev
> >>                 mailing list for community
> >>                  > >>>> devs to review. The current thread sends to
> >>                 both dev and user list.
> >>                  > >>>>
> >>                  >
> >>                  > >>>> This email thread is more like validating the
> >>                 general idea and direction
> >>                  >
> >>                  > >>>> with the community, and it's been pretty long
> >>                 and crowded so far. Since
> >>                  >
> >>                  > >>>> everyone is in favor of the idea, we can move
> >>                 forward with another thread to
> >>                  > >>>> discuss and finalize the design.
> >>                  > >>>>
> >>                  > >>>> Thanks,
> >>                  > >>>> Bowen
> >>                  > >>>>
> >>                  > >>>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu
> >>                 <xuefu.z@alibaba-inc.com> wrote:
> >>                  > >>>>
> >>                  > >>>>> Hi Shuyi,
> >>                  > >>>>>
> >>                  >
> >>                  > >>>>> Good idea. Actually the PDF was converted
> >>                 from a google doc. Here is its
> >>                  > >>>>> link:
> >>                  > >>>>>
> >>                  > >>>>>
> >>                  >
> >>
> https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
> >>                  > >>>>> Once we reach an agreement, I can convert it
> >>                 to a FLIP.
> >>                  > >>>>>
> >>                  > >>>>> Thanks,
> >>                  > >>>>> Xuefu
> >>                  > >>>>>
> >>                  > >>>>>
> >>                  > >>>>>
> >>                  > >>>>>
> >> ------------------------------------------------------------------
> >>                  > >>>>> Sender: Shuyi Chen <suez1224@gmail.com>
> >>                  > >>>>> Sent at: 2018 Nov 1 (Thu) 02:47
> >>                  > >>>>> Recipient: Xuefu <xuefu.z@alibaba-inc.com>
> >>                  > >>>>> Cc: vino yang <yanghua1127@gmail.com>; Fabian Hueske
> >>                 <fhueske@gmail.com>; dev <dev@flink.apache.org>;
> >>                 user <user@flink.apache.org>
> >>                  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL
> >>                 well with Hive ecosystem
> >>                  > >>>>>
> >>                  > >>>>> Hi Xuefu,
> >>                  > >>>>>
> >>                  >
> >>                  > >>>>> Thanks a lot for driving this big effort. I
> >>                 would suggest converting your
> >>                  >
> >>                  > >>>>> proposal and design doc into a google doc,
> >>                 and share it on the dev mailing
> >>                  >
> >>                  > >>>>> list for the community to review and comment
> >>                 with a title like "[DISCUSS] ...
> >>                  >
> >>                  > >>>>> Hive integration design ..." . Once
> >>                 approved, we can document it as a FLIP
> >>                  >
> >>                  > >>>>> (Flink Improvement Proposals), and use JIRAs
> >>                 to track the implementations.
> >>                  > >>>>> What do you think?
> >>                  > >>>>>
> >>                  > >>>>> Shuyi
> >>                  > >>>>>
> >>                  > >>>>> On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu
> >>                 <xuefu.z@alibaba-inc.com> wrote:
> >>                  > >>>>> Hi all,
> >>                  > >>>>>
> >>                  > >>>>> I have also shared a design doc on Hive
> >>                 metastore integration that is
> >>                  >
> >>                  > >>>>> attached here and also to FLINK-10556[1].
> >>                 Please kindly review and share
> >>                  > >>>>> your feedback.
> >>                  > >>>>>
> >>                  > >>>>>
> >>                  > >>>>> Thanks,
> >>                  > >>>>> Xuefu
> >>                  > >>>>>
> >>                  > >>>>> [1]
> >> https://issues.apache.org/jira/browse/FLINK-10556
> >>                  > >>>>>
> >> ------------------------------------------------------------------
> >>                  > >>>>> Sender: Xuefu <xuefu.z@alibaba-inc.com>
> >>                  > >>>>> Sent at: 2018 Oct 25 (Thu) 01:08
> >>                  > >>>>> Recipient: Xuefu <xuefu.z@alibaba-inc.com>;
> >>                 Shuyi Chen <suez1224@gmail.com>
> >>                  > >>>>> Cc: yanghua1127 <yanghua1127@gmail.com>; Fabian Hueske
> >>                 <fhueske@gmail.com>; dev <dev@flink.apache.org>;
> >>                 user <user@flink.apache.org>
> >>                  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL
> >>                 well with Hive ecosystem
> >>                  > >>>>>
> >>                  > >>>>> Hi all,
> >>                  > >>>>>
> >>                  > >>>>> To wrap up the discussion, I have attached a
> >>                 PDF describing the
> >>                  >
> >>                  > >>>>> proposal, which is also attached to
> >>                 FLINK-10556 [1]. Please feel free to
> >>                  > >>>>> watch that JIRA to track the progress.
> >>                  > >>>>>
> >>                  > >>>>> Please also let me know if you have
> >>                 additional comments or questions.
> >>                  > >>>>>
> >>                  > >>>>> Thanks,
> >>                  > >>>>> Xuefu
> >>                  > >>>>>
> >>                  > >>>>> [1]
> >> https://issues.apache.org/jira/browse/FLINK-10556
> >>                  > >>>>>
> >>                  > >>>>>
> >>                  > >>>>>
> >> ------------------------------------------------------------------
> >>                  > >>>>> Sender: Xuefu <xuefu.z@alibaba-inc.com>
> >>                  > >>>>> Sent at: 2018 Oct 16 (Tue) 03:40
> >>                  > >>>>> Recipient: Shuyi Chen <suez1224@gmail.com>
> >>                  > >>>>> Cc: yanghua1127 <yanghua1127@gmail.com>; Fabian Hueske
> >>                 <fhueske@gmail.com>; dev <dev@flink.apache.org>;
> >>                 user <user@flink.apache.org>
> >>                  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL
> >>                 well with Hive ecosystem
> >>                  > >>>>>
> >>                  > >>>>> Hi Shuyi,
> >>                  > >>>>>
> >>                  >
> >>                  > >>>>> Thank you for your input. Yes, I agree with
> >>                 a phased approach and would like
> >>                  >
> >>                  > >>>>> to move forward fast. :) We did some work
> >>                 internally on DDL utilizing babel
> >>                  > >>>>> parser in Calcite. While babel makes
> >>                 Calcite's grammar extensible, at
> >>                  > >>>>> first impression it still seems too
> >>                 cumbersome for a project when too
> >>                  >
> >>                 many extensions are made. It's even
> >>                 challenging to find where the extension
> >>                  >
> >>                  > >>>>> is needed! It would certainly be better if
> >>                 Calcite could magically support
> >>                  >
> >>                  > >>>>> Hive QL by just turning on a flag, such as
> >>                 that for MYSQL_5. I can also
> >>                  >
> >>                  > >>>>> see that this could mean a lot of work on
> >>                 Calcite. Nevertheless, I will
> >>                  >
> >>                  > >>>>> bring up the discussion over there and see
> >>                 what their community thinks.
> >>                  > >>>>>
> >>                  > >>>>> Would you mind sharing more info about the
> >>                 proposal on DDL that you
> >>                  > >>>>> mentioned? We can certainly collaborate on
> >> this.
> >>                  > >>>>>
> >>                  > >>>>> Thanks,
> >>                  > >>>>> Xuefu
> >>                  > >>>>>
> >>                  > >>>>>
> >> ------------------------------------------------------------------
> >>                  > >>>>> Sender: Shuyi Chen <suez1224@gmail.com>
> >>                  > >>>>> Sent at: 2018 Oct 14 (Sun) 08:30
> >>                  > >>>>> Recipient: Xuefu <xuefu.z@alibaba-inc.com>
> >>                  > >>>>> Cc: yanghua1127 <yanghua1127@gmail.com>; Fabian Hueske
> >>                 <fhueske@gmail.com>; dev <dev@flink.apache.org>;
> >>                 user <user@flink.apache.org>
> >>                  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL
> >>                 well with Hive ecosystem
> >>                  > >>>>>
> >>                  > >>>>> Welcome to the community and thanks for the
> >>                 great proposal, Xuefu! I
> >>                  >
> >>                  > >>>>> think the proposal can be divided into 2
> >>                 stages: making Flink support
> >>                  >
> >>                  > >>>>> Hive features, and making Hive work with
> >>                 Flink. I agree with Timo on
> >>                  >
> >>                  > >>>>> starting with a smaller scope, so we can make
> >>                 progress faster. As for [6],
> >>                  >
> >>                  > >>>>> a proposal for DDL is already in progress,
> >>                 and will come after the unified
> >>                  >
> >>                  > >>>>> SQL connector API is done. For supporting
> >>                 Hive syntax, we might need to
> >>                  > >>>>> work with the Calcite community, and a recent
> >>                 effort called babel (
> >>                  > >>>>>
> >> https://issues.apache.org/jira/browse/CALCITE-2280) in
> >>                 Calcite might
> >>                  > >>>>> help here.
> >>                  > >>>>>
> >>                  > >>>>> Thanks
> >>                  > >>>>> Shuyi
> >>                  > >>>>>
> >>                  > >>>>> On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu
> >>                 <xuefu.z@alibaba-inc.com> wrote:
> >>                  > >>>>> Hi Fabian/Vino,
> >>                  > >>>>>
> >>                  >
> >>                  > >>>>> Thank you very much for your encouragement
> >>                 and inquiry. Sorry that I didn't
> >>                  >
> >>                  > >>>>> see Fabian's email until I read Vino's
> >>                 response just now. (Somehow Fabian's
> >>                  > >>>>> went to the spam folder.)
> >>                  > >>>>>
> >>                  >
> >>                  > >>>>> My proposal contains long-term and
> >>                 short-terms goals. Nevertheless, the
> >>                  > >>>>> effort will focus on the following areas,
> >>                 including Fabian's list:
> >>                  > >>>>>
> >>                  > >>>>> 1. Hive metastore connectivity - This covers
> >>                 both read/write access,
> >>                  >
> >>                  > >>>>> which means Flink can make full use of Hive's
> >>                 metastore as its catalog (at
> >>                  > >>>>> least for batch but can be extended for
> >>                 streaming as well).
> >>                  >
> >>                  > >>>>> 2. Metadata compatibility - Objects
> >>                 (databases, tables, partitions, etc)
> >>                  >
> >>                  > >>>>> created by Hive can be understood by Flink
> >>                 and the reverse direction is
> >>                  > >>>>> true also.
> >>                  > >>>>> 3. Data compatibility - Similar to #2, data
> >>                 produced by Hive can be
> >>                  > >>>>> consumed by Flink and vice versa.
> >>                  >
> >>                  > >>>>> 4. Support Hive UDFs - For all Hive's native
> >>                 UDFs, Flink either provides
> >>                  > >>>>> its own implementation or makes Hive's
> >>                 implementation work in Flink.
> >>                  > >>>>> Further, for user created UDFs in Hive, Flink
> >>                 SQL should provide a
> >>                  >
> >>                  > >>>>> mechanism allowing users to import them into
> >>                 Flink without any code change
> >>                  > >>>>> required.
> >>                  > >>>>> 5. Data types - Flink SQL should support all
> >>                 data types that are
> >>                  > >>>>> available in Hive.
> >>                  > >>>>> 6. SQL Language - Flink SQL should support
> >>                 SQL standard (such as
> >>                  >
> >>                  > >>>>> SQL2003) with extension to support Hive's
> >>                 syntax and language features,
> >>                  > >>>>> around DDL, DML, and SELECT queries.
> >>                  >
> >>                  > >>>>> 7.  SQL CLI - this is currently developing in
> >>                 Flink but more effort is
> >>                  > >>>>> needed.
> >>                  >
> >>                  > >>>>> 8. Server - provide a server that's
> >>                 compatible with Hive's HiveServer2
> >>                  >
> >>                  > >>>>> in thrift APIs, such that HiveServer2 users
> >>                 can reuse their existing client
> >>                  > >>>>> (such as beeline) but connect to Flink's
> >>                 thrift server instead.
> >>                  >
> >>                  > >>>>> 9. JDBC/ODBC drivers - Flink may provide its
> >>                 own JDBC/ODBC drivers for
> >>                  > >>>>> other applications to connect to its
> >>                 thrift server
> >>                  > >>>>> 10. Support other user's customizations in
> >>                 Hive, such as Hive Serdes,
> >>                  > >>>>> storage handlers, etc.
> >>                  >
> >>                  > >>>>> 11. Better task failure tolerance and task
> >>                 scheduling at Flink runtime.
> >>                  > >>>>>
> >>                  > >>>>> As you can see, achieving all those requires
> >>                 significant effort and
> >>                  >
> >>                  > >>>>> across all layers in Flink. However, a
> >>                 short-term goal could include only
> >>                  >
> >>                  > >>>>> core areas (such as 1, 2, 4, 5, 6, 7) or
> >>                 start at a smaller scope (such as
> >>                  > >>>>> #3, #6).
> >>                  > >>>>>
> >>                  >
> >>                  > >>>>> Please share your further thoughts. If we
> >>                 generally agree that this is
> >>                  >
> >>                  > >>>>> the right direction, I could come up with a
> >>                 formal proposal quickly and
> >>                  > >>>>> then we can follow up with broader discussions.
> >>                  > >>>>>
> >>                  > >>>>> Thanks,
> >>                  > >>>>> Xuefu
> >>                  > >>>>>
> >>                  > >>>>>
> >>                  > >>>>>
> >>                  > >>>>>
> >> ------------------------------------------------------------------
> >>                  > >>>>> Sender: vino yang <yanghua1127@gmail.com>
> >>                  > >>>>> Sent at: 2018 Oct 11 (Thu) 09:45
> >>                  > >>>>> Recipient: Fabian Hueske <fhueske@gmail.com>
> >>                  > >>>>> Cc: dev <dev@flink.apache.org>; Xuefu
> >>                 <xuefu.z@alibaba-inc.com>; user <user@flink.apache.org>
> >>                  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL
> >>                 well with Hive ecosystem
> >>                  > >>>>>
> >>                  > >>>>> Hi Xuefu,
> >>                  > >>>>>
> >>                  >
> >>                  > >>>>> I appreciate this proposal, and like Fabian, I
> >>                 think it would be better if you
> >>                  > >>>>> could give more details of the plan.
> >>                  > >>>>>
> >>                  > >>>>> Thanks, vino.
> >>                  > >>>>>
> >>                  > >>>>> On Wed, Oct 10, 2018 at 5:27 PM, Fabian Hueske
> >>                 <fhueske@gmail.com> wrote:
> >>                  > >>>>> Hi Xuefu,
> >>                  > >>>>>
> >>                  >
> >>                  > >>>>> Welcome to the Flink community and thanks for
> >>                 starting this discussion!
> >>                  > >>>>> Better Hive integration would be really great!
> >>                  > >>>>> Can you go into details of what you are
> >>                 proposing? I can think of a
> >>                  > >>>>> couple ways to improve Flink in that regard:
> >>                  > >>>>>
> >>                  > >>>>> * Support for Hive UDFs
> >>                  > >>>>> * Support for Hive metadata catalog
> >>                  > >>>>> * Support for HiveQL syntax
> >>                  > >>>>> * ???
> >>                  > >>>>>
> >>                  > >>>>> Best, Fabian
> >>                  > >>>>>
> >>                  > >>>>> On Tue., 9 Oct 2018 at 19:22, Zhang, Xuefu
> >>                 <xuefu.z@alibaba-inc.com> wrote:
> >>                  > >>>>> Hi all,
> >>                  > >>>>>
> >>                  > >>>>> Along with the community's effort, inside
> >>                 Alibaba we have explored
> >>                  >
> >>                  > >>>>> Flink's potential as an execution engine not
> >>                 just for stream processing but
> >>                  > >>>>> also for batch processing. We are encouraged
> >>                 by our findings and have
> >>                  >
> >>                  > >>>>> initiated our effort to make Flink's SQL
> >>                 capabilities full-fledged. When
> >>                  >
> >>                  > >>>>> comparing what's available in Flink to the
> >>                 offerings from competitive data
> >>                  >
> >>                  > >>>>> processing engines, we identified a major gap
> >>                 in Flink: a good integration
> >>                  >
> >>                  > >>>>> with the Hive ecosystem. This is crucial to the
> >>                 success of Flink SQL and batch
> >>                  >
> >>                  > >>>>> due to the well-established data ecosystem
> >>                 around Hive. Therefore, we have
> >>                  >
> >>                  > >>>>> done some initial work along this direction
> >>                 but there are still a lot of
> >>                  > >>>>> effort needed.
> >>                  > >>>>>
> >>                  > >>>>> We have two strategies in mind. The first one
> >>                 is to make Flink SQL
> >>                  >
> >>                  > >>>>> full-fledged and well-integrated with Hive
> >>                 ecosystem. This is a similar
> >>                  >
> >>                  > >>>>> approach to what Spark SQL adopted. The
> >>                 second strategy is to make Hive
> >>                  >
> >>                  > >>>>> itself work with Flink, similar to the
> >>                 proposal in [1]. Each approach bears
> >>                  >
> >>                  > >>>>> its pros and cons, but they don’t need to be
> >>                 mutually exclusive with each
> >>                  > >>>>> targeting different users and use cases.
> >>                 We believe that both will
> >>                  > >>>>> promote a much greater adoption of Flink
> >>                 beyond stream processing.
> >>                  > >>>>>
> >>                  > >>>>> We have been focused on the first approach
> >>                 and would like to showcase
> >>                  >
> >>                  > >>>>> Flink's batch and SQL capabilities with Flink
> >>                 SQL. However, we have also
> >>                  > >>>>> planned to start strategy #2 as the follow-up
> >>                 effort.
> >>                  > >>>>>
> >>                  >
> >>                  > >>>>> I'm completely new to Flink (with a short
> >>                 bio [2] below), though many
> >>                  >
> >>                  > >>>>> of my colleagues here at Alibaba are
> >>                 long-time contributors. Nevertheless,
> >>                  >
> >>                  > >>>>> I'd like to share our thoughts and invite
> >>                 your early feedback. At the same
> >>                  >
> >>                  > >>>>> time, I am working on a detailed proposal on
> >>                 Flink SQL's integration with
> >>                  > >>>>> Hive ecosystem, which will be also shared
> >>                 when ready.
> >>                  > >>>>>
> >>                  > >>>>> While the ideas are simple, each approach
> >>                 will demand significant
> >>                  >
> >>                  > >>>>> effort, more than what we can afford. Thus,
> >>                 the input and contributions
> >>                  > >>>>> from the communities are greatly welcome and
> >>                 appreciated.
> >>                  > >>>>>
> >>                  > >>>>> Regards,
> >>                  > >>>>>
> >>                  > >>>>>
> >>                  > >>>>> Xuefu
> >>                  > >>>>>
> >>                  > >>>>> References:
> >>                  > >>>>>
> >>                  > >>>>> [1]
> >>                 https://issues.apache.org/jira/browse/HIVE-10712
> >>                  >
> >>                  > >>>>> [2] Xuefu Zhang is a long-time open source
> >>                 veteran, worked or working on
> >>                  > >>>>> many projects under Apache Foundation, of
> >>                 which he is also an honored
> >>                  >
> >>                  > >>>>> member. About 10 years ago he worked in the
> >>                 Hadoop team at Yahoo where the
> >>                  >
> >>                  > >>>>> projects just got started. Later he worked at
> >>                 Cloudera, initiating and
> >>                  >
> >>                  > >>>>> leading the development of Hive on Spark
> >>                 project in the communities and
> >>                  >
> >>                  > >>>>> across many organizations. Prior to joining
> >>                 Alibaba, he worked at Uber
> >>                  >
> >>                  > >>>>> where he promoted Hive on Spark to all Uber's
> >>                 SQL on Hadoop workload and
> >>                  > >>>>> significantly improved Uber's cluster
> >> efficiency.
> >>                  > >>>>>
> >>                  > >>>>>
> >>                  > >>>>>
> >>                  > >>>>>
> >>                  > >>>>> --
> >>                  >
> >>                  > >>>>> "So you have to trust that the dots will
> >>                 somehow connect in your future."
> >>                  > >>>>>
> >>                  > >>>>>
> >>                  > >>>>> --
> >>                  >
> >>                  > >>>>> "So you have to trust that the dots will
> >>                 somehow connect in your future."
> >>                  > >>>>>
> >>                  >
> >>                  >
> >>
> >
> >
>
>
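
[Editor's note: the metadata and data-type compatibility work in items 2 and 5 of the list quoted above boils down to a bidirectional mapping between Hive's type system and Flink's. A toy sketch of the name-level part of that mapping is below (Python for illustration; the type strings and function names are hypothetical, not actual Flink or Hive APIs):]

```python
# Toy name-level mapping from Hive type strings to assumed Flink SQL type
# names. A real integration must also handle nullability, precision
# defaults, and Hive types with no direct Flink counterpart.
HIVE_TO_FLINK = {
    "tinyint": "TINYINT", "smallint": "SMALLINT", "int": "INT",
    "bigint": "BIGINT", "float": "FLOAT", "double": "DOUBLE",
    "boolean": "BOOLEAN", "string": "STRING", "binary": "BYTES",
    "date": "DATE", "timestamp": "TIMESTAMP",
}

def map_hive_type(hive_type: str) -> str:
    """Map a Hive type string to an (assumed) Flink SQL type string."""
    t = hive_type.strip().lower()
    if t.startswith("array<") and t.endswith(">"):
        # Recurse into the element type of a complex type.
        return "ARRAY<%s>" % map_hive_type(t[6:-1])
    if t.startswith("decimal(") and t.endswith(")"):
        return t.upper()  # precision and scale carried over unchanged
    if t in HIVE_TO_FLINK:
        return HIVE_TO_FLINK[t]
    raise ValueError("no Flink mapping for Hive type: " + hive_type)

print(map_hive_type("array<string>"))   # ARRAY<STRING>
print(map_hive_type("decimal(10,2)"))   # DECIMAL(10,2)
```

The reverse direction (Flink to Hive, for item 2's "reverse direction is true also") would be a second table built the same way.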

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by "Zhang, Xuefu" <xu...@alibaba-inc.com>.
Thanks, Timo!

I have started putting the content from the google doc into FLIP-30 [1]. However, please keep the discussion on this thread.

Thanks,
Xuefu

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-30%3A+Unified+Catalog+APIs


------------------------------------------------------------------
From:Timo Walther <tw...@apache.org>
Sent At:2019 Jan. 7 (Mon.) 05:59
To:dev <de...@flink.apache.org>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi everyone,

Xuefu and I went through multiple iterations of the catalog design document 
[1]. I believe it is now in good shape to be converted into a FLIP. 
Maybe we need a bit more explanation in some places, but the general 
design is ready now.

The design document covers the following changes:
- Unify external catalog interface and Flink's internal catalog in 
TableEnvironment
- Clearly define a hierarchy of reference objects, namely 
"catalog.database.table"
- Enable a tight integration with Hive + Hive data connectors as well as 
a broad integration with existing TableFactories and discovery mechanism
- Make the catalog interfaces more feature-complete by adding views and 
functions
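
[Editor's note: as a rough illustration of the "catalog.database.table" hierarchy described above, here is a minimal in-memory sketch (Python for brevity; the class and method names are made up for this sketch and are not the interfaces proposed in the design doc):]

```python
# Minimal in-memory model of a catalog.database.table hierarchy.
# A TableEnvironment-like registry resolves fully qualified names by
# walking catalog -> database -> table.
class InMemoryCatalog:
    def __init__(self):
        self.databases = {}  # db name -> {table name -> table metadata}

    def create_table(self, db, table, schema):
        self.databases.setdefault(db, {})[table] = schema

class CatalogRegistry:
    def __init__(self):
        self.catalogs = {}  # catalog name -> catalog instance

    def register_catalog(self, name, catalog):
        self.catalogs[name] = catalog

    def resolve(self, qualified_name):
        """Resolve 'catalog.database.table' to the table's metadata."""
        parts = qualified_name.split(".")
        if len(parts) != 3:
            raise ValueError("expected catalog.database.table: " + qualified_name)
        cat, db, table = parts
        return self.catalogs[cat].databases[db][table]

registry = CatalogRegistry()
hive = InMemoryCatalog()
hive.create_table("default", "orders", {"id": "BIGINT", "amount": "DOUBLE"})
registry.register_catalog("myhive", hive)
print(registry.resolve("myhive.default.orders"))
```

In the actual design, a Hive-backed catalog would implement the same lookup against HMS instead of an in-memory dict, which is what makes the unification of external and internal catalogs possible.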

If you have any further feedback, it would be great to give it now 
before we convert it into a FLIP.

Thanks,
Timo

[1] 
https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit#



Am 07.01.19 um 13:51 schrieb Timo Walther:
> Hi Eron,
>
> thank you very much for the contributions. I merged the first few small 
> bug fixes. For the remaining PRs I think we can review and merge them 
> soon. As you said, the code is agnostic to the details of the 
> ExternalCatalog interface and I don't expect bigger merge conflicts in 
> the near future.
>
> However, exposing the current external catalog interfaces to SQL 
> Client users would make it even more difficult to change the 
> interfaces in the future. So maybe I would first wait until the 
> general catalog discussion is over and the FLIP has been created. This 
> should happen shortly.
>
> We should definitely coordinate the efforts better in the future to 
> avoid duplicate work.
>
> Thanks,
> Timo
>
>
> Am 07.01.19 um 00:24 schrieb Eron Wright:
>> Thanks Timo for merging a couple of the PRs. Are you also able to 
>> review the others that I mentioned? Xuefu, I would like to incorporate 
>> your feedback too.
>>
>> Check out this short demonstration of using a catalog in SQL Client:
>> https://asciinema.org/a/C8xuAjmZSxCuApgFgZQyeIHuo
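
[Editor's note: for readers who cannot view the recording, a session along the following lines is what such a demonstration would show. The commands are illustrative of the intended workflow, not a transcript of the recording, and the catalog/table names are made up:]

```sql
-- Illustrative SQL Client session with an external catalog registered
Flink SQL> SHOW CATALOGS;
Flink SQL> USE CATALOG mycatalog;
Flink SQL> SHOW TABLES;
Flink SQL> SELECT * FROM mytable LIMIT 10;
```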
>>
>> Thanks again!
>>
>> On Thu, Jan 3, 2019 at 9:37 AM Eron Wright
>> <eronwright@gmail.com> wrote:
>>
>>     Would a couple of folks raise their hands to make a review pass through
>>     the 6 PRs listed above? It is a lovely stack of PRs that is 'all
>>     green' at the moment. I would be happy to open follow-on PRs to
>>     rapidly align with other efforts.
>>
>>     Note that the code is agnostic to the details of the
>>     ExternalCatalog interface; the code would not be obsolete if/when
>>     the catalog interface is enhanced as per the design doc.
>>
>>
>>
>>     On Wed, Jan 2, 2019 at 1:35 PM Eron Wright
>>     <eronwright@gmail.com> wrote:
>>
>>         I propose that the community review and merge the PRs that I
>>         posted, and then evolve the design thru 1.8 and beyond.  I
>>         think having a basic infrastructure in place now will
>>         accelerate the effort, do you agree?
>>
>>         Thanks again!
>>
>>         On Wed, Jan 2, 2019 at 11:20 AM Zhang, Xuefu
>>         <xuefu.z@alibaba-inc.com> wrote:
>>
>>             Hi Eron,
>>
>>             Happy New Year!
>>
>>             Thank you very much for your contribution, especially
>>             during the holidays. While I'm encouraged by your work, I'd
>>             also like to share my thoughts on how to move forward.
>>
>>             First, please note that the design discussion is still
>>             being finalized, and we expect some moderate changes,
>>             especially around TableFactories. Another pending change
>>             is our decision to shy away from Scala, which
>>             will impact our work.
>>
>>             Secondly, while your work seems to be about plugging
>>             catalog definitions into the execution environment, which
>>             is less impacted by the TableFactory change, I did notice some
>>             duplication between your work and ours. This is no big deal,
>>             but going forward, we should probably communicate better
>>             on work assignments so as to avoid any
>>             possible duplication of work. On the other hand, I think
>>             some of your work is interesting and valuable for
>>             inclusion once we finalize the overall design.
>>
>>             Thus, please continue your research and experiment and let
>>             us know when you start working on anything so we can
>>             better coordinate.
>>
>>             Thanks again for your interest and contributions.
>>
>>             Thanks,
>>             Xuefu
>>
>>
>>
>> ------------------------------------------------------------------
>>                 From:Eron Wright <eronwright@gmail.com
>>                 <ma...@gmail.com>>
>>                 Sent At:2019 Jan. 1 (Tue.) 18:39
>>                 To:dev <dev@flink.apache.org
>>                 <ma...@flink.apache.org>>; Xuefu
>>                 <xuefu.z@alibaba-inc.com 
>> <ma...@alibaba-inc.com>>
>>                 Cc:Xiaowei Jiang <xiaoweij@gmail.com
>>                 <ma...@gmail.com>>; twalthr
>>                 <twalthr@apache.org <ma...@apache.org>>;
>>                 piotr <piotr@data-artisans.com
>>                 <ma...@data-artisans.com>>; Fabian Hueske
>>                 <fhueske@gmail.com <ma...@gmail.com>>;
>>                 suez1224 <suez1224@gmail.com
>>                 <ma...@gmail.com>>; Bowen Li
>>                 <bowenli86@gmail.com <ma...@gmail.com>>
>>                 Subject:Re: [DISCUSS] Integrate Flink SQL well with
>>                 Hive ecosystem
>>
>>                 Hi folks, there are clearly some incremental steps to be
>>                 taken to introduce catalog support to SQL Client,
>>                 complementary to what is proposed in the Flink-Hive
>>                 Metastore design doc.  I was quietly working on this
>>                 over the holidays.   I posted some new sub-tasks, PRs,
>>                 and sample code to FLINK-10744.
>>
>>                 What inspired me to get involved is that the catalog
>>                 interface seems like a great way to encapsulate a
>>                 'library' of Flink tables and functions. For example,
>>                 the NYC Taxi dataset (TaxiRides, TaxiFares, various
>>                 UDFs) may be nicely encapsulated as a catalog
>>                 (TaxiData).  Such a library should be fully consumable
>>                 in SQL Client.
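The "library catalog" idea above (a catalog bundling the Taxi tables and UDFs so they are consumable as a unit) can be sketched in plain Java. The interface and class names below are simplified, hypothetical stand-ins for illustration only, not Flink's actual ExternalCatalog API:

```java
import java.util.*;

// Hypothetical, simplified stand-in for a catalog interface; Flink's real
// ExternalCatalog API differs and was still under design at this time.
interface TableCatalog {
    Set<String> listTables();
    Map<String, String> getTableProperties(String name);
}

// A "library" catalog bundling the NYC Taxi dataset tables, analogous to
// the TaxiData example described in the mail above.
class TaxiData implements TableCatalog {
    private final Map<String, Map<String, String>> tables = new HashMap<>();

    TaxiData() {
        // Connector properties here are made up for the sketch.
        tables.put("TaxiRides", Map.of("connector", "kafka", "topic", "rides"));
        tables.put("TaxiFares", Map.of("connector", "kafka", "topic", "fares"));
    }

    @Override
    public Set<String> listTables() {
        return Collections.unmodifiableSet(tables.keySet());
    }

    @Override
    public Map<String, String> getTableProperties(String name) {
        return tables.getOrDefault(name, Map.of());
    }
}
```

A client (such as a SQL shell) would only need the interface to enumerate and resolve the bundled tables, which is what makes the catalog a convenient packaging unit.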
>>
>>                 I implemented the above. Some highlights:
>>                 1. A fully-worked example of using the Taxi dataset in
>>                 SQL Client via an environment file.
>>                 - an ASCII video showing the SQL Client in action:
>> https://asciinema.org/a/C8xuAjmZSxCuApgFgZQyeIHuo
>>
>>                 - the corresponding environment file (will be even
>>                 more concise once 'FLINK-10696 Catalog UDFs' is merged):
>> https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/dist/conf/sql-client-defaults.yaml
>>
>>                 - the typed API for standalone table applications:
>> https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/java/com/dataartisans/flinktraining/examples/table_java/examples/ViaCatalog.java#L50
>>
>>                 2. Implementation of the core catalog descriptor and
>>                 factory.  I realize that some renames may later occur
>>                 as per the design doc, and would be happy to do that
>>                 as a follow-up.
>>                 https://github.com/apache/flink/pull/7390
>>
>>                 3. Implementation of a connect-style API on
>>                 TableEnvironment to use catalog descriptor.
>>                 https://github.com/apache/flink/pull/7392
>>
>>                 4. Integration into SQL-Client's environment file:
>>                 https://github.com/apache/flink/pull/7393
>>
>>                 I realize that the overall Hive integration is still
>>                 evolving, but I believe that these PRs are a good
>>                 stepping stone. Here's the list (in bottom-up order):
>>                 - https://github.com/apache/flink/pull/7386
>>                 - https://github.com/apache/flink/pull/7388
>>                 - https://github.com/apache/flink/pull/7389
>>                 - https://github.com/apache/flink/pull/7390
>>                 - https://github.com/apache/flink/pull/7392
>>                 - https://github.com/apache/flink/pull/7393
>>
>>                 Thanks and enjoy 2019!
>>                 Eron W
>>
>>
>>                 On Sun, Nov 18, 2018 at 3:04 PM Zhang, Xuefu
>>                 <xuefu.z@alibaba-inc.com
>>                 <ma...@alibaba-inc.com>> wrote:
>>                 Hi Xiaowei,
>>
>>                 Thanks for bringing up the question. In the current
>>                 design, the properties for meta objects are meant to
>>                 cover anything that's specific to a particular catalog
>>                 and agnostic to Flink. Anything that is common (such
>>                 as schema for tables, query text for views, and udf
>>                 classname) are abstracted as members of the respective
>>                 classes. However, this is still in discussion, and
>>                 Timo and I will go over this and provide an update.
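The split described above — common attributes (schema, view query text, UDF classname) as typed members, catalog-specific details in a free-form properties map — might look roughly like this sketch. All names here are illustrative assumptions for this email, not classes from the design doc:

```java
import java.util.*;

// Illustrative sketch of the typed-members-plus-properties split for a
// catalog table. Class and field names are invented for this example.
class CatalogTableSketch {
    // Common, catalog-agnostic attributes are abstracted as typed members:
    final List<String> columnNames;
    final List<String> columnTypes;
    // Anything specific to a particular catalog stays in free-form properties:
    final Map<String, String> properties;

    CatalogTableSketch(List<String> columnNames, List<String> columnTypes,
                       Map<String, String> properties) {
        this.columnNames = List.copyOf(columnNames);
        this.columnTypes = List.copyOf(columnTypes);
        this.properties = Map.copyOf(properties);
    }
}
```

The point of the split is that Flink's planner can rely on the typed members, while a connector for a given catalog interprets its own property keys.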
>>
>>                 Please note that UDF is a little more involved than
>>                 what the current design doc shows. I'm still refining
>>                 this part.
>>
>>                 Thanks,
>>                 Xuefu
>>
>>
>> ------------------------------------------------------------------
>>                 Sender:Xiaowei Jiang <xiaoweij@gmail.com
>>                 <ma...@gmail.com>>
>>                 Sent at:2018 Nov 18 (Sun) 15:17
>>                 Recipient:dev <dev@flink.apache.org
>>                 <ma...@flink.apache.org>>
>>                 Cc:Xuefu <xuefu.z@alibaba-inc.com
>>                 <ma...@alibaba-inc.com>>; twalthr
>>                 <twalthr@apache.org <ma...@apache.org>>;
>>                 piotr <piotr@data-artisans.com
>>                 <ma...@data-artisans.com>>; Fabian Hueske
>>                 <fhueske@gmail.com <ma...@gmail.com>>;
>>                 suez1224 <suez1224@gmail.com 
>> <ma...@gmail.com>>
>>                 Subject:Re: [DISCUSS] Integrate Flink SQL well with
>>                 Hive ecosystem
>>
>>                 Thanks Xuefu for the detailed design doc! One question
>>                 on the properties associated with the catalog objects.
>>                 Are we going to leave them completely free-form, or are
>>                 we going to set some standard for them? I think that
>>                 the answer may depend on if we want to explore catalog
>>                 specific optimization opportunities. In any case, I
>>                 think that it might be helpful to standardize as much
>>                 as possible into strongly typed classes and leave
>>                 these properties for catalog-specific things. But I
>>                 think that we can do it in steps.
>>
>>                 Xiaowei
>>                 On Fri, Nov 16, 2018 at 4:00 AM Bowen Li
>>                 <bowenli86@gmail.com <ma...@gmail.com>> 
>> wrote:
>>                 Thanks for keeping on improving the overall design,
>>                 Xuefu! It looks quite
>>                  good to me now.
>>
>>                  Would be nice that cc-ed Flink committers can help to
>>                 review and confirm!
>>
>>
>>
>>                  One minor suggestion: Since the last section of
>>                 design doc already touches
>>                  some new sql statements, shall we add another section
>>                 in our doc and
>>                  formalize the new sql statements in SQL Client and
>>                 TableEnvironment that
>>                  are gonna come along naturally with our design? Here
>>                 are some that the
>>                  design doc mentioned and some that I came up with:
>>
>>                  To be added:
>>
>>                     - USE <catalog> - set default catalog
>>                     - USE <catalog.schema> - set default schema
>>                     - SHOW CATALOGS - show all registered catalogs
>>                     - SHOW SCHEMAS [FROM catalog] - list schemas in
>>                 the current default
>>                     catalog or the specified catalog
>>                     - DESCRIBE VIEW view - show the view's definition
>>                 in CatalogView
>>                     - SHOW VIEWS [FROM schema/catalog.schema] - show
>>                 views from current or a
>>                     specified schema.
>>
>>                     (DDLs that can be addressed by either our design
>>                 or Shuyi's DDL design)
>>
>>                     - CREATE/DROP/ALTER SCHEMA schema
>>                     - CREATE/DROP/ALTER CATALOG catalog
>>
>>                  To be modified:
>>
>>                     - SHOW TABLES [FROM schema/catalog.schema] - show
>>                 tables from current or
>>                     a specified schema. Add 'from schema' to existing
>>                 'SHOW TABLES' statement
>>                     - SHOW FUNCTIONS [FROM schema/catalog.schema] -
>>                 show functions from
>>                     current or a specified schema. Add 'from schema'
>>                 to existing 'SHOW FUNCTIONS'
>>                     statement
>>
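Strung together, the statements proposed above might look like the following SQL Client session. This is illustrative only; the catalog and schema names are made up, and the exact syntax was still to be formalized at this point:

```sql
-- Illustrative session using the proposed statements; syntax not final.
SHOW CATALOGS;                    -- list all registered catalogs
USE mycatalog;                    -- set the default catalog
SHOW SCHEMAS FROM mycatalog;      -- list schemas in a catalog
USE mycatalog.mydb;               -- set the default schema
SHOW TABLES FROM mycatalog.mydb;  -- list tables in a specific schema
SHOW VIEWS FROM mydb;             -- list views in a schema
DESCRIBE VIEW myview;             -- show the view's definition
```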
>>
>>                  Thanks, Bowen
>>
>>
>>
>>                  On Wed, Nov 14, 2018 at 10:39 PM Zhang, Xuefu
>>                 <xuefu.z@alibaba-inc.com 
>> <ma...@alibaba-inc.com>>
>>                  wrote:
>>
>>                  > Thanks, Bowen, for catching the error. I have
>>                 granted comment permission
>>                  > with the link.
>>                  >
>>                  > I also updated the doc with the latest class
>>                 definitions. Everyone is
>>                  > encouraged to review and comment.
>>                  >
>>                  > Thanks,
>>                  > Xuefu
>>                  >
>>                  >
>> ------------------------------------------------------------------
>>                  > Sender:Bowen Li <bowenli86@gmail.com
>>                 <ma...@gmail.com>>
>>                  > Sent at:2018 Nov 14 (Wed) 06:44
>>                  > Recipient:Xuefu <xuefu.z@alibaba-inc.com
>>                 <ma...@alibaba-inc.com>>
>>                  > Cc:piotr <piotr@data-artisans.com
>>                 <ma...@data-artisans.com>>; dev
>>                 <dev@flink.apache.org <ma...@flink.apache.org>>;
>>                 Shuyi
>>                  > Chen <suez1224@gmail.com <ma...@gmail.com>>
>>                  > Subject:Re: [DISCUSS] Integrate Flink SQL well with
>>                 Hive ecosystem
>>                  >
>>                  > Hi Xuefu,
>>                  >
>>                  > Currently the new design doc
>>                  >
>> <https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit>
>>                  > is on “view only" mode, and people cannot leave
>>                 comments. Can you please
>>                  > change it to "can comment" or "can edit" mode?
>>                  >
>>                  > Thanks, Bowen
>>                  >
>>                  >
>>                  > On Mon, Nov 12, 2018 at 9:51 PM Zhang, Xuefu
>>                 <xuefu.z@alibaba-inc.com 
>> <ma...@alibaba-inc.com>>
>>                  > wrote:
>>                  > Hi Piotr
>>                  >
>>                  > I have extracted the API portion of  the design and
>>                 the google doc is here
>>                  >
>> <https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit?usp=sharing>.
>>                  > Please review and provide your feedback.
>>                  >
>>                  > Thanks,
>>                  > Xuefu
>>                  >
>>                  >
>> ------------------------------------------------------------------
>>                  > Sender:Xuefu <xuefu.z@alibaba-inc.com
>>                 <ma...@alibaba-inc.com>>
>>                  > Sent at:2018 Nov 12 (Mon) 12:43
>>                  > Recipient:Piotr Nowojski <piotr@data-artisans.com
>>                 <ma...@data-artisans.com>>; dev <
>>                  > dev@flink.apache.org <ma...@flink.apache.org>>
>>                  > Cc:Bowen Li <bowenli86@gmail.com
>>                 <ma...@gmail.com>>; Shuyi Chen
>>                 <suez1224@gmail.com <ma...@gmail.com>>
>>                  > Subject:Re: [DISCUSS] Integrate Flink SQL well with
>>                 Hive ecosystem
>>                  >
>>                  > Hi Piotr,
>>                  >
>>                  > That sounds good to me. Let's close all the open
>>                 questions (there are a
>>                  > couple of them) in the Google doc and I should be
>>                 able to quickly split
>>                  > it into the three proposals as you suggested.
>>                  >
>>                  > Thanks,
>>                  > Xuefu
>>                  >
>>                  >
>> ------------------------------------------------------------------
>>                  > Sender:Piotr Nowojski <piotr@data-artisans.com
>>                 <ma...@data-artisans.com>>
>>                  > Sent at:2018 Nov 9 (Fri) 22:46
>>                  > Recipient:dev <dev@flink.apache.org
>>                 <ma...@flink.apache.org>>; Xuefu
>>                 <xuefu.z@alibaba-inc.com 
>> <ma...@alibaba-inc.com>>
>>                  > Cc:Bowen Li <bowenli86@gmail.com
>>                 <ma...@gmail.com>>; Shuyi Chen
>>                 <suez1224@gmail.com <ma...@gmail.com>>
>>                  > Subject:Re: [DISCUSS] Integrate Flink SQL well with
>>                 Hive ecosystem
>>                  >
>>                  > Hi,
>>                  >
>>                  >
>>                  > Yes, it seems like the best solution. Maybe someone
>>                 else can also suggest how we can split it further?
>>                 Maybe changes in the interface in one doc, reading
>>                 from the Hive metastore in another, and finally storing our
>>                 meta information in the Hive metastore?
>>                  >
>>                  > Piotrek
>>                  >
>>                  > > On 9 Nov 2018, at 01:44, Zhang, Xuefu
>>                 <xuefu.z@alibaba-inc.com
>>                 <ma...@alibaba-inc.com>> wrote:
>>                  > >
>>                  > > Hi Piotr,
>>                  > >
>>                  > > That seems to be good idea!
>>                  > >
>>                  >
>>                  > > Since the google doc for the design is currently
>>                 under extensive review, I will leave it as it is for
>>                 now. However, I'll convert it to two different FLIPs
>>                 when the time comes.
>>                  > >
>>                  > > How does it sound to you?
>>                  > >
>>                  > > Thanks,
>>                  > > Xuefu
>>                  > >
>>                  > >
>>                  > >
>> ------------------------------------------------------------------
>>                  > > Sender:Piotr Nowojski <piotr@data-artisans.com
>>                 <ma...@data-artisans.com>>
>>                  > > Sent at:2018 Nov 9 (Fri) 02:31
>>                  > > Recipient:dev <dev@flink.apache.org
>>                 <ma...@flink.apache.org>>
>>                  > > Cc:Bowen Li <bowenli86@gmail.com
>>                 <ma...@gmail.com>>; Xuefu
>>                 <xuefu.z@alibaba-inc.com 
>> <ma...@alibaba-inc.com>
>>                  > >; Shuyi Chen <suez1224@gmail.com
>>                 <ma...@gmail.com>>
>>                  > > Subject:Re: [DISCUSS] Integrate Flink SQL well
>>                 with Hive ecosystem
>>                  > >
>>                  > > Hi,
>>                  > >
>>                  >
>>                  > > Maybe we should split this topic (and the design
>>                 doc) into a couple of smaller ones, hopefully
>>                 independent. The questions that you have asked Fabian,
>>                 for example, have very little to do with reading
>>                 metadata from the Hive Metastore?
>>                  > >
>>                  > > Piotrek
>>                  > >
>>                  > >> On 7 Nov 2018, at 14:27, Fabian Hueske
>>                 <fhueske@gmail.com <ma...@gmail.com>> wrote:
>>                  > >>
>>                  > >> Hi Xuefu and all,
>>                  > >>
>>                  > >> Thanks for sharing this design document!
>>                  >
>>                  > >> I'm very much in favor of restructuring /
>>                 reworking the catalog handling in
>>                  > >> Flink SQL as outlined in the document.
>>                  >
>>                  > >> Most changes described in the design document
>>                 seem to be rather general and
>>                  > >> not specifically related to the Hive integration.
>>                  > >>
>>                  >
>>                  > >> IMO, there are some aspects, especially those at
>>                 the boundary of Hive and
>>                  > >> Flink, that need a bit more discussion. For 
>> example
>>                  > >>
>>                  > >> * What does it take to make Flink schema
>>                 compatible with Hive schema?
>>                  > >> * How will Flink tables (descriptors) be stored
>>                 in HMS?
>>                  > >> * How do both Hive catalogs differ? Could they
>>                 be integrated into to a
>>                  > >> single one? When to use which one?
>>                  >
>>                  > >> * What meta information is provided by HMS? What
>>                 of this can be leveraged
>>                  > >> by Flink?
>>                  > >>
>>                  > >> Thank you,
>>                  > >> Fabian
>>                  > >>
>>                  > >> Am Fr., 2. Nov. 2018 um 00:31 Uhr schrieb Bowen
>>                 Li <bowenli86@gmail.com <ma...@gmail.com>
>>                  > >:
>>                  > >>
>>                  > >>> After taking a look at how other discussion
>>                 threads work, I think it's
>>                  > >>> actually fine to just keep our discussion here.
>>                 It's up to you, Xuefu.
>>                  > >>>
>>                  > >>> The google doc LGTM. I left some minor comments.
>>                  > >>>
>>                  > >>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li
>>                 <bowenli86@gmail.com <ma...@gmail.com>> 
>> wrote:
>>                  > >>>
>>                  > >>>> Hi all,
>>                  > >>>>
>>                  > >>>> As Xuefu has published the design doc on
>>                 google, I agree with Shuyi's
>>                  >
>>                  > >>>> suggestion that we probably should start a new
>>                 email thread like "[DISCUSS]
>>                  >
>>                  > >>>> ... Hive integration design ..." on only dev
>>                 mailing list for community
>>                  > >>>> devs to review. The current thread sends to
>>                 both dev and user list.
>>                  > >>>>
>>                  >
>>                  > >>>> This email thread is more like validating the
>>                 general idea and direction
>>                  >
>>                  > >>>> with the community, and it's been pretty long
>>                 and crowded so far. Since
>>                  >
>>                  > >>>> everyone is pro for the idea, we can move
>>                 forward with another thread to
>>                  > >>>> discuss and finalize the design.
>>                  > >>>>
>>                  > >>>> Thanks,
>>                  > >>>> Bowen
>>                  > >>>>
>>                  > >>>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu <
>>                  > xuefu.z@alibaba-inc.com
>>                 <ma...@alibaba-inc.com>>
>>                  > >>>> wrote:
>>                  > >>>>
>>                  > >>>>> Hi Shuyi,
>>                  > >>>>>
>>                  >
>>                  > >>>>> Good idea. Actually the PDF was converted
>>                 from a google doc. Here is its
>>                  > >>>>> link:
>>                  > >>>>>
>>                  > >>>>>
>>                  >
>> https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
>>                  > >>>>> Once we reach an agreement, I can convert it
>>                 to a FLIP.
>>                  > >>>>>
>>                  > >>>>> Thanks,
>>                  > >>>>> Xuefu
>>                  > >>>>>
>>                  > >>>>>
>>                  > >>>>>
>>                  > >>>>>
>> ------------------------------------------------------------------
>>                  > >>>>> Sender:Shuyi Chen <suez1224@gmail.com
>>                 <ma...@gmail.com>>
>>                  > >>>>> Sent at:2018 Nov 1 (Thu) 02:47
>>                  > >>>>> Recipient:Xuefu <xuefu.z@alibaba-inc.com
>>                 <ma...@alibaba-inc.com>>
>>                  > >>>>> Cc:vino yang <yanghua1127@gmail.com
>>                 <ma...@gmail.com>>; Fabian Hueske <
>>                  > fhueske@gmail.com <ma...@gmail.com>>;
>>                  > >>>>> dev <dev@flink.apache.org
>>                 <ma...@flink.apache.org>>; user
>>                 <user@flink.apache.org <ma...@flink.apache.org>>
>>                  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL
>>                 well with Hive ecosystem
>>                  > >>>>>
>>                  > >>>>> Hi Xuefu,
>>                  > >>>>>
>>                  >
>>                  > >>>>> Thanks a lot for driving this big effort. I
>>                 would suggest converting your
>>                  >
>>                  > >>>>> proposal and design doc into a google doc,
>>                 and share it on the dev mailing
>>                  >
>>                  > >>>>> list for the community to review and comment
>>                 with title like "[DISCUSS] ...
>>                  >
>>                  > >>>>> Hive integration design ..." . Once
>>                 approved,  we can document it as a FLIP
>>                  >
>>                  > >>>>> (Flink Improvement Proposals), and use JIRAs
>>                 to track the implementations.
>>                  > >>>>> What do you think?
>>                  > >>>>>
>>                  > >>>>> Shuyi
>>                  > >>>>>
>>                  > >>>>> On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu <
>>                  > xuefu.z@alibaba-inc.com
>>                 <ma...@alibaba-inc.com>>
>>                  > >>>>> wrote:
>>                  > >>>>> Hi all,
>>                  > >>>>>
>>                  > >>>>> I have also shared a design doc on Hive
>>                 metastore integration that is
>>                  >
>>                  > >>>>> attached here and also to FLINK-10556[1].
>>                 Please kindly review and share
>>                  > >>>>> your feedback.
>>                  > >>>>>
>>                  > >>>>>
>>                  > >>>>> Thanks,
>>                  > >>>>> Xuefu
>>                  > >>>>>
>>                  > >>>>> [1]
>> https://issues.apache.org/jira/browse/FLINK-10556
>>                  > >>>>>
>> ------------------------------------------------------------------
>>                  > >>>>> Sender:Xuefu <xuefu.z@alibaba-inc.com
>>                 <ma...@alibaba-inc.com>>
>>                  > >>>>> Sent at:2018 Oct 25 (Thu) 01:08
>>                  > >>>>> Recipient:Xuefu <xuefu.z@alibaba-inc.com
>>                 <ma...@alibaba-inc.com>>; Shuyi Chen <
>>                  > >>>>> suez1224@gmail.com <ma...@gmail.com>>
>>                  > >>>>> Cc:yanghua1127 <yanghua1127@gmail.com
>>                 <ma...@gmail.com>>; Fabian Hueske <
>>                  > fhueske@gmail.com <ma...@gmail.com>>;
>>                  > >>>>> dev <dev@flink.apache.org
>>                 <ma...@flink.apache.org>>; user
>>                 <user@flink.apache.org <ma...@flink.apache.org>>
>>                  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL
>>                 well with Hive ecosystem
>>                  > >>>>>
>>                  > >>>>> Hi all,
>>                  > >>>>>
>>                  > >>>>> To wrap up the discussion, I have attached a
>>                 PDF describing the
>>                  >
>>                  > >>>>> proposal, which is also attached to
>>                 FLINK-10556 [1]. Please feel free to
>>                  > >>>>> watch that JIRA to track the progress.
>>                  > >>>>>
>>                  > >>>>> Please also let me know if you have
>>                 additional comments or questions.
>>                  > >>>>>
>>                  > >>>>> Thanks,
>>                  > >>>>> Xuefu
>>                  > >>>>>
>>                  > >>>>> [1]
>> https://issues.apache.org/jira/browse/FLINK-10556
>>                  > >>>>>
>>                  > >>>>>
>>                  > >>>>>
>> ------------------------------------------------------------------
>>                  > >>>>> Sender:Xuefu <xuefu.z@alibaba-inc.com
>>                 <ma...@alibaba-inc.com>>
>>                  > >>>>> Sent at:2018 Oct 16 (Tue) 03:40
>>                  > >>>>> Recipient:Shuyi Chen <suez1224@gmail.com
>>                 <ma...@gmail.com>>
>>                  > >>>>> Cc:yanghua1127 <yanghua1127@gmail.com
>>                 <ma...@gmail.com>>; Fabian Hueske <
>>                  > fhueske@gmail.com <ma...@gmail.com>>;
>>                  > >>>>> dev <dev@flink.apache.org
>>                 <ma...@flink.apache.org>>; user
>>                 <user@flink.apache.org <ma...@flink.apache.org>>
>>                  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL
>>                 well with Hive ecosystem
>>                  > >>>>>
>>                  > >>>>> Hi Shuyi,
>>                  > >>>>>
>>                  >
>>                  > >>>>> Thank you for your input. Yes, I agree with
>>                 a phased approach and would like
>>                  >
>>                  > >>>>> to move forward fast. :) We did some work
>>                 internally on DDL utilizing babel
>>                  > >>>>> parser in Calcite. While babel makes
>>                 Calcite's grammar extensible, at
>>                  > >>>>> first impression it still seems too
>>                 cumbersome for a project when too
>>                  >
>>                  > >>>>> many extensions are made. It's even
>>                 challenging to find where the extension
>>                  >
>>                  > >>>>> is needed! It would certainly be better if
>>                 Calcite can magically support
>>                  >
>>                  > >>>>> Hive QL by just turning on a flag, such as
>>                 that for MYSQL_5. I can also
>>                  >
>>                  > >>>>> see that this could mean a lot of work on
>>                 Calcite. Nevertheless, I will
>>                  >
>>                  > >>>>> bring up the discussion over there to see
>>                 what their community thinks.
>>                  > >>>>>
>>                  > >>>>> Would you mind sharing more info about the
>>                 proposal on DDL that you
>>                  > >>>>> mentioned? We can certainly collaborate on 
>> this.
>>                  > >>>>>
>>                  > >>>>> Thanks,
>>                  > >>>>> Xuefu
>>                  > >>>>>
>>                  > >>>>>
>> ------------------------------------------------------------------
>>                  > >>>>> Sender:Shuyi Chen <suez1224@gmail.com
>>                 <ma...@gmail.com>>
>>                  > >>>>> Sent at:2018 Oct 14 (Sun) 08:30
>>                  > >>>>> Recipient:Xuefu <xuefu.z@alibaba-inc.com
>>                 <ma...@alibaba-inc.com>>
>>                  > >>>>> Cc:yanghua1127 <yanghua1127@gmail.com
>>                 <ma...@gmail.com>>; Fabian Hueske <
>>                  > fhueske@gmail.com <ma...@gmail.com>>;
>>                  > >>>>> dev <dev@flink.apache.org
>>                 <ma...@flink.apache.org>>; user
>>                 <user@flink.apache.org <ma...@flink.apache.org>>
>>                  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL
>>                 well with Hive ecosystem
>>                  > >>>>>
>>                  > >>>>> Welcome to the community and thanks for the
>>                 great proposal, Xuefu! I
>>                  >
>>                  > >>>>> think the proposal can be divided into 2
>>                 stages: making Flink support
>>                  >
>>                  > >>>>> Hive features, and making Hive work with
>>                 Flink. I agree with Timo on
>>                  >
>>                  > >>>>> starting with a smaller scope, so we can make
>>                 progress faster. As for [6],
>>                  >
>>                  > >>>>> a proposal for DDL is already in progress,
>>                 and will come after the unified
>>                  >
>>                  > >>>>> SQL connector API is done. For supporting
>>                 Hive syntax, we might need to
>>                  > >>>>> work with the Calcite community, and a recent
>>                 effort called babel (
>>                  > >>>>>
>> https://issues.apache.org/jira/browse/CALCITE-2280) in
>>                 Calcite might
>>                  > >>>>> help here.
>>                  > >>>>>
>>                  > >>>>> Thanks
>>                  > >>>>> Shuyi
>>                  > >>>>>
>>                  > >>>>> On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <
>>                  > xuefu.z@alibaba-inc.com
>>                 <ma...@alibaba-inc.com>>
>>                  > >>>>> wrote:
>>                  > >>>>> Hi Fabian/Vno,
>>                  > >>>>>
>>                  >
>>                  > >>>>> Thank you very much for your encouragement
and inquiry. Sorry that I didn't
>>                  >
>>                  > >>>>> see Fabian's email until I read Vino's
>>                 response just now. (Somehow Fabian's
>>                  > >>>>> went to the spam folder.)
>>                  > >>>>>
>>                  >
>>                  > >>>>> My proposal contains long-term and
>>                 short-term goals. Nevertheless, the
>>                  > >>>>> effort will focus on the following areas,
>>                 including Fabian's list:
>>                  > >>>>>
>>                  > >>>>> 1. Hive metastore connectivity - This covers
>>                 both read/write access,
>>                  >
>>                  > >>>>> which means Flink can make full use of Hive's
>>                 metastore as its catalog (at
>>                  > >>>>> least for batch, but this can extend to
>>                 streaming as well).
>>                  >
>>                  > >>>>> 2. Metadata compatibility - Objects
>>                 (databases, tables, partitions, etc)
>>                  >
>>                  > >>>>> created by Hive can be understood by Flink
>>                 and the reverse direction is
>>                  > >>>>> true also.
>>                  > >>>>> 3. Data compatibility - Similar to #2, data
>>                 produced by Hive can be
>>                  > >>>>> consumed by Flink and vice versa.
>>                  >
>>                  > >>>>> 4. Support Hive UDFs - For all of Hive's native
>>                 UDFs, Flink either provides
>>                  > >>>>> its own implementation or makes Hive's
>>                 implementation work in Flink.
>>                  > >>>>> Further, for user-created UDFs in Hive, Flink
>>                 SQL should provide a
>>                  >
>>                  > >>>>> mechanism allowing users to import them into
>>                 Flink without any code change
>>                  > >>>>> required.
>>                  > >>>>> 5. Data types - Flink SQL should support all
>>                 data types that are
>>                  > >>>>> available in Hive.
>>                  > >>>>> 6. SQL Language - Flink SQL should support
>>                 the SQL standard (such as
>>                  >
>>                  > >>>>> SQL:2003) with extensions to support Hive's
>>                 syntax and language features,
>>                  > >>>>> around DDL, DML, and SELECT queries.
>>                  >
>>                  > >>>>> 7. SQL CLI - this is currently in development in
>>                 Flink but more effort is
>>                  > >>>>> needed.
>>                  >
>>                  > >>>>> 8. Server - provide a server that's
>>                 compatible with Hive's HiveServer2
>>                  >
>>                  > >>>>> in its Thrift APIs, such that HiveServer2 users
>>                 can reuse their existing clients
>>                  > >>>>> (such as Beeline) but connect to Flink's
>>                 Thrift server instead.
>>                  >
>>                  > >>>>> 9. JDBC/ODBC drivers - Flink may provide its
>>                 own JDBC/ODBC drivers for
>>                  > >>>>> other applications to connect to its
>>                 Thrift server.
>>                  > >>>>> 10. Support other user customizations in
>>                 Hive, such as Hive SerDes,
>>                  > >>>>> storage handlers, etc.
>>                  >
>>                  > >>>>> 11. Better task failure tolerance and task
>>                 scheduling in the Flink runtime.
>>                  > >>>>>
>>                  > >>>>> As you can see, achieving all of this requires
>>                 significant effort
>>                  >
>>                  > >>>>> across all layers of Flink. However, a
>>                 short-term goal could include only
>>                  >
>>                  > >>>>> the core areas (such as #1, #2, #4, #5, #6, #7) or
>>                 start at a smaller scope (such as
>>                  > >>>>> #3, #6).
>>                  > >>>>>
>>                  >
>>                  > >>>>> Please share your further thoughts. If we
>>                 generally agree that this is
>>                  >
>>                  > >>>>> the right direction, I could come up with a
>>                 formal proposal quickly and
>>                  > >>>>> then we can follow up with broader discussions.
>>                  > >>>>>
>>                  > >>>>> Thanks,
>>                  > >>>>> Xuefu
>>                  > >>>>>
>>                  > >>>>>
>>                  > >>>>>
>>                  > >>>>>
>> ------------------------------------------------------------------
>>                  > >>>>> Sender:vino yang <yanghua1127@gmail.com>
>>                  > >>>>> Sent at:2018 Oct 11 (Thu) 09:45
>>                  > >>>>> Recipient:Fabian Hueske <fhueske@gmail.com>
>>                  > >>>>> Cc:dev <dev@flink.apache.org>; Xuefu
>>                 <xuefu.z@alibaba-inc.com>; user
>>                  > >>>>> <user@flink.apache.org>
>>                  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL
>>                 well with Hive ecosystem
>>                  > >>>>>
>>                  > >>>>> Hi Xuefu,
>>                  > >>>>>
>>                  >
>>                  > >>>>> I appreciate this proposal, and like Fabian, it
>>                 would be better if you
>>                  > >>>>> could give more details of the plan.
>>                  > >>>>>
>>                  > >>>>> Thanks, vino.
>>                  > >>>>>
>>                  > >>>>> Fabian Hueske <fhueske@gmail.com> wrote on
>>                 Wed, Oct 10, 2018 at 5:27 PM:
>>                  > >>>>> Hi Xuefu,
>>                  > >>>>>
>>                  >
>>                  > >>>>> Welcome to the Flink community and thanks for
>>                 starting this discussion!
>>                  > >>>>> Better Hive integration would be really great!
>>                  > >>>>> Can you go into details of what you are
>>                 proposing? I can think of a
>>                  > >>>>> couple ways to improve Flink in that regard:
>>                  > >>>>>
>>                  > >>>>> * Support for Hive UDFs
>>                  > >>>>> * Support for Hive metadata catalog
>>                  > >>>>> * Support for HiveQL syntax
>>                  > >>>>> * ???
>>                  > >>>>>
>>                  > >>>>> Best, Fabian
>>                  > >>>>>
>>                  > >>>>> On Tue, Oct 9, 2018 at 19:22, Zhang, Xuefu
>>                 <xuefu.z@alibaba-inc.com> wrote:
>>                  > >>>>> Hi all,
>>                  > >>>>>
>>                  > >>>>> Along with the community's effort, inside
>>                 Alibaba we have explored
>>                  >
>>                  > >>>>> Flink's potential as an execution engine not
>>                 just for stream processing but
>>                  > >>>>> also for batch processing. We are encouraged
>>                 by our findings and have
>>                  >
>>                  > >>>>> initiated our effort to make Flink's SQL
>>                 capabilities full-fledged. When
>>                  >
>>                  > >>>>> comparing what's available in Flink to the
>>                 offerings from competitive data
>>                  >
>>                  > >>>>> processing engines, we identified a major gap
>>                 in Flink: good integration
>>                  >
>>                  > >>>>> with the Hive ecosystem. This is crucial to the
>>                 success of Flink SQL and batch
>>                  >
>>                  > >>>>> due to the well-established data ecosystem
>>                 around Hive. Therefore, we have
>>                  >
>>                  > >>>>> done some initial work along this direction,
>>                 but a lot of
>>                  > >>>>> effort is still needed.
>>                  > >>>>>
>>                  > >>>>> We have two strategies in mind. The first one
>>                 is to make Flink SQL
>>                  >
>>                  > >>>>> full-fledged and well-integrated with Hive
>>                 ecosystem. This is a similar
>>                  >
>>                  > >>>>> approach to what Spark SQL adopted. The
>>                 second strategy is to make Hive
>>                  >
>>                  > >>>>> itself work with Flink, similar to the
>>                 proposal in [1]. Each approach bears
>>                  >
>>                  > >>>>> its pros and cons, but they don’t need to be
>>                 mutually exclusive, with each
>>                  > >>>>> targeting different users and use cases.
>>                 We believe that both will
>>                  > >>>>> promote a much greater adoption of Flink
>>                 beyond stream processing.
>>                  > >>>>>
>>                  > >>>>> We have been focused on the first approach
>>                 and would like to showcase
>>                  >
>>                  > >>>>> Flink's batch and SQL capabilities with Flink
>>                 SQL. However, we have also
>>                  > >>>>> planned to start strategy #2 as the follow-up
>>                 effort.
>>                  > >>>>>
>>                  >
>>                  > >>>>> I'm completely new to Flink (a short
>>                 bio [2] is below), though many
>>                  >
>>                  > >>>>> of my colleagues here at Alibaba are
>>                 long-time contributors. Nevertheless,
>>                  >
>>                  > >>>>> I'd like to share our thoughts and invite
>>                 your early feedback. At the same
>>                  >
>>                  > >>>>> time, I am working on a detailed proposal on
>>                 Flink SQL's integration with
>>                  > >>>>> Hive ecosystem, which will be also shared
>>                 when ready.
>>                  > >>>>>
>>                  > >>>>> While the ideas are simple, each approach
>>                 will demand significant
>>                  >
>>                  > >>>>> effort, more than what we can afford. Thus,
>>                 the input and contributions
>>                  > >>>>> from the communities are greatly welcome and
>>                 appreciated.
>>                  > >>>>>
>>                  > >>>>> Regards,
>>                  > >>>>>
>>                  > >>>>>
>>                  > >>>>> Xuefu
>>                  > >>>>>
>>                  > >>>>> References:
>>                  > >>>>>
>>                  > >>>>> [1]
>>                 https://issues.apache.org/jira/browse/HIVE-10712
>>                  >
>>                  > >>>>> [2] Xuefu Zhang is a long-time open source
>>                 veteran who has worked or is working on
>>                  > >>>>> many projects under the Apache Foundation, of
>>                 which he is also an honored
>>                  >
>>                  > >>>>> member. About 10 years ago he worked in the
>>                 Hadoop team at Yahoo where the
>>                  >
>>                  > >>>>> projects just got started. Later he worked at
>>                 Cloudera, initiating and
>>                  >
>>                  > >>>>> leading the development of Hive on Spark
>>                 project in the communities and
>>                  >
>>                  > >>>>> across many organizations. Prior to joining
>>                 Alibaba, he worked at Uber
>>                  >
>>                  > >>>>> where he promoted Hive on Spark to all Uber's
>>                 SQL on Hadoop workload and
>>                  > >>>>> significantly improved Uber's cluster 
>> efficiency.
>>                  > >>>>>
>>                  > >>>>>
>>                  > >>>>>
>>                  > >>>>>
>>                  > >>>>> --
>>                  >
>>                  > >>>>> "So you have to trust that the dots will
>>                 somehow connect in your future."
>>                  > >>>>>
>>                  > >>>>>
>>                  >
>>                  >
>>
>
>


Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by Timo Walther <tw...@apache.org>.
Hi everyone,

Xuefu and I had multiple iterations over the catalog design document 
[1]. I believe that it is now in good shape to be converted into a FLIP. 
Maybe we need a bit more explanation in some places, but the general 
design is ready now.

The design document covers the following changes:
- Unify external catalog interface and Flink's internal catalog in 
TableEnvironment
- Clearly define a hierarchy of reference objects namely: 
"catalog.database.table"
- Enable a tight integration with Hive + Hive data connectors as well as 
a broad integration with existing TableFactories and discovery mechanism
- Make the catalog interfaces more feature complete by adding views and 
functions
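To make the "catalog.database.table" hierarchy concrete, here is a minimal sketch of how a partially qualified table reference could be resolved against session defaults. This is purely illustrative: the `ReferenceResolver` class and its method names are invented for this example and are not part of the proposed Flink API.

```python
# Illustrative sketch only: "ReferenceResolver" and its methods are
# invented for this example, not part of the proposed Flink API.

class ReferenceResolver:
    """Resolves table references against the "catalog.database.table"
    hierarchy, falling back to session defaults for missing parts."""

    def __init__(self, default_catalog, default_database):
        self.default_catalog = default_catalog
        self.default_database = default_database

    def resolve(self, reference):
        parts = reference.split(".")
        if len(parts) == 3:   # fully qualified: catalog.database.table
            return tuple(parts)
        if len(parts) == 2:   # database.table -> use default catalog
            return (self.default_catalog, parts[0], parts[1])
        if len(parts) == 1:   # table -> use default catalog and database
            return (self.default_catalog, self.default_database, parts[0])
        raise ValueError("invalid reference: " + reference)

resolver = ReferenceResolver("hive", "default")
print(resolver.resolve("orders"))               # ('hive', 'default', 'orders')
print(resolver.resolve("sales.orders"))         # ('hive', 'sales', 'orders')
print(resolver.resolve("memory.sales.orders"))  # ('memory', 'sales', 'orders')
```

A "USE <catalog>" or "USE <catalog.schema>" statement would then simply update the two defaults held by such a resolver.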

If you have any further feedback, it would be great to give it now 
before we convert it into a FLIP.

Thanks,
Timo

[1] 
https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit#



On 07.01.19 at 13:51, Timo Walther wrote:
> Hi Eron,
>
> thank you very much for the contributions. I merged the first little 
> bug fixes. For the remaining PRs I think we can review and merge them 
> soon. As you said, the code is agnostic to the details of the 
> ExternalCatalog interface and I don't expect bigger merge conflicts in 
> the near future.
>
> However, exposing the current external catalog interfaces to SQL 
> Client users would make it even more difficult to change the 
> interfaces in the future. So maybe I would first wait until the 
> general catalog discussion is over and the FLIP has been created. This 
> should happen shortly.
>
> We should definitely coordinate the efforts better in the future to 
> avoid duplicate work.
>
> Thanks,
> Timo
>
>
> On 07.01.19 at 00:24, Eron Wright wrote:
>> Thanks Timo for merging a couple of the PRs. Are you also able to 
>> review the others that I mentioned? Xuefu, I would like to incorporate 
>> your feedback too.
>>
>> Check out this short demonstration of using a catalog in SQL Client:
>> https://asciinema.org/a/C8xuAjmZSxCuApgFgZQyeIHuo
>>
>> Thanks again!
>>
>> On Thu, Jan 3, 2019 at 9:37 AM Eron Wright <eronwright@gmail.com> wrote:
>>
>>     Would a couple folks raise their hand to make a review pass thru
>>     the 6 PRs listed above?  It is a lovely stack of PRs that is 'all
>>     green' at the moment.   I would be happy to open follow-on PRs to
>>     rapidly align with other efforts.
>>
>>     Note that the code is agnostic to the details of the
>>     ExternalCatalog interface; the code would not be obsolete if/when
>>     the catalog interface is enhanced as per the design doc.
>>
>>
>>
>>     On Wed, Jan 2, 2019 at 1:35 PM Eron Wright <eronwright@gmail.com> wrote:
>>
>>         I propose that the community review and merge the PRs that I
>>         posted, and then evolve the design thru 1.8 and beyond.  I
>>         think having a basic infrastructure in place now will
>>         accelerate the effort, do you agree?
>>
>>         Thanks again!
>>
>>         On Wed, Jan 2, 2019 at 11:20 AM Zhang, Xuefu
>>         <xuefu.z@alibaba-inc.com> wrote:
>>
>>             Hi Eron,
>>
>>             Happy New Year!
>>
>>             Thank you very much for your contribution, especially
>>             during the holidays. While I'm encouraged by your work, I'd
>>             also like to share my thoughts on how to move forward.
>>
>>             First, please note that the design discussion is still
>>             being finalized, and we expect some moderate changes,
>>             especially around TableFactories. Another pending change
>>             is our decision to shy away from Scala, which will
>>             impact our work.
>>
>>             Secondly, while your work seems to be about plugging
>>             catalog definitions into the execution environment, which
>>             is less impacted by TableFactory change, I did notice some
>>             duplication of your work and ours. This is no big deal,
>>             but going forward, we should probably communicate better
>>             on work assignments so as to avoid any
>>             possible duplication of work. On the other hand, I think
>>             some of your work is interesting and valuable for
>>             inclusion once we finalize the overall design.
>>
>>             Thus, please continue your research and experimentation, and let
>>             us know when you start working on anything so we can
>>             better coordinate.
>>
>>             Thanks again for your interest and contributions.
>>
>>             Thanks,
>>             Xuefu
>>
>>
>>
>> ------------------------------------------------------------------
>>                 From:Eron Wright <eronwright@gmail.com>
>>                 Sent At:2019 Jan. 1 (Tue.) 18:39
>>                 To:dev <dev@flink.apache.org>; Xuefu
>>                 <xuefu.z@alibaba-inc.com>
>>                 Cc:Xiaowei Jiang <xiaoweij@gmail.com>; twalthr
>>                 <twalthr@apache.org>; piotr <piotr@data-artisans.com>;
>>                 Fabian Hueske <fhueske@gmail.com>; suez1224
>>                 <suez1224@gmail.com>; Bowen Li <bowenli86@gmail.com>
>>                 Subject:Re: [DISCUSS] Integrate Flink SQL well with
>>                 Hive ecosystem
>>
>>                 Hi folks, there's clearly some incremental steps to be
>>                 taken to introduce catalog support to SQL Client,
>>                 complementary to what is proposed in the Flink-Hive
>>                 Metastore design doc.  I was quietly working on this
>>                 over the holidays.   I posted some new sub-tasks, PRs,
>>                 and sample code to FLINK-10744.
>>
>>                 What inspired me to get involved is that the catalog
>>                 interface seems like a great way to encapsulate a
>>                 'library' of Flink tables and functions. For example,
>>                 the NYC Taxi dataset (TaxiRides, TaxiFares, various
>>                 UDFs) may be nicely encapsulated as a catalog
>>                 (TaxiData).  Such a library should be fully consumable
>>                 in SQL Client.
>>
>>                 I implemented the above. Some highlights:
>>                 1. A fully-worked example of using the Taxi dataset in
>>                 SQL Client via an environment file.
>>                 - an ASCII video showing the SQL Client in action:
>> https://asciinema.org/a/C8xuAjmZSxCuApgFgZQyeIHuo
>>
>>                 - the corresponding environment file (will be even
>>                 more concise once 'FLINK-10696 Catalog UDFs' is merged):
>> https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/dist/conf/sql-client-defaults.yaml
>>
>>                 - the typed API for standalone table applications:
>> https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/java/com/dataartisans/flinktraining/examples/table_java/examples/ViaCatalog.java#L50
>>
>>                 2. Implementation of the core catalog descriptor and
>>                 factory.  I realize that some renames may later occur
>>                 as per the design doc, and would be happy to do that
>>                 as a follow-up.
>>                 https://github.com/apache/flink/pull/7390
>>
>>                 3. Implementation of a connect-style API on
>>                 TableEnvironment to use catalog descriptor.
>>                 https://github.com/apache/flink/pull/7392
>>
>>                 4. Integration into SQL-Client's environment file:
>>                 https://github.com/apache/flink/pull/7393
>>
>>                 I realize that the overall Hive integration is still
>>                 evolving, but I believe that these PRs are a good
>>                 stepping stone. Here's the list (in bottom-up order):
>>                 - https://github.com/apache/flink/pull/7386
>>                 - https://github.com/apache/flink/pull/7388
>>                 - https://github.com/apache/flink/pull/7389
>>                 - https://github.com/apache/flink/pull/7390
>>                 - https://github.com/apache/flink/pull/7392
>>                 - https://github.com/apache/flink/pull/7393
>>
>>                 Thanks and enjoy 2019!
>>                 Eron W
>>
>>
>>                 On Sun, Nov 18, 2018 at 3:04 PM Zhang, Xuefu
>>                 <xuefu.z@alibaba-inc.com> wrote:
>>                 Hi Xiaowei,
>>
>>                 Thanks for bringing up the question. In the current
>>                 design, the properties for meta objects are meant to
>>                 cover anything that's specific to a particular catalog
>>                 and agnostic to Flink. Anything that is common (such
>>                 as schema for tables, query text for views, and UDF
>>                 classname) is abstracted into members of the respective
>>                 classes. However, this is still in discussion, and
>>                 Timo and I will go over this and provide an update.
>>
>>                 Please note that UDF is a little more involved than
>>                 what the current design doc shows. I'm still refining
>>                 this part.
>>
>>                 Thanks,
>>                 Xuefu
>>
>>
>> ------------------------------------------------------------------
>>                 Sender:Xiaowei Jiang <xiaoweij@gmail.com>
>>                 Sent at:2018 Nov 18 (Sun) 15:17
>>                 Recipient:dev <dev@flink.apache.org>
>>                 Cc:Xuefu <xuefu.z@alibaba-inc.com>; twalthr
>>                 <twalthr@apache.org>; piotr <piotr@data-artisans.com>;
>>                 Fabian Hueske <fhueske@gmail.com>; suez1224
>>                 <suez1224@gmail.com>
>>                 Subject:Re: [DISCUSS] Integrate Flink SQL well with
>>                 Hive ecosystem
>>
>>                 Thanks Xuefu for the detailed design doc! One question
>>                 on the properties associated with the catalog objects.
>>                 Are we going to leave them completely free form or we
>>                 are going to set some standard for that? I think that
>>                 the answer may depend on if we want to explore catalog
>>                 specific optimization opportunities. In any case, I
>>                 think that it might be helpful to standardize as much
>>                 as possible into strongly typed classes and leave
>>                 these properties for catalog-specific things. But I
>>                 think that we can do it in steps.
>>
>>                 Xiaowei
>>                 On Fri, Nov 16, 2018 at 4:00 AM Bowen Li
>>                 <bowenli86@gmail.com> wrote:
>>                 Thanks for keeping on improving the overall design,
>>                 Xuefu! It looks quite
>>                  good to me now.
>>
>>                  It would be nice if the cc-ed Flink committers could
>>                 help review and confirm!
>>
>>
>>
>>                  One minor suggestion: Since the last section of the
>>                 design doc already touches
>>                  some new SQL statements, shall we add another section
>>                 in our doc and
>>                  formalize the new SQL statements in SQL Client and
>>                 TableEnvironment that
>>                  will come along naturally with our design? Here
>>                 are some that the
>>                  design doc mentioned and some that I came up with:
>>
>>                  To be added:
>>
>>                     - USE <catalog> - set default catalog
>>                     - USE <catalog.schema> - set default schema
>>                     - SHOW CATALOGS - show all registered catalogs
>>                     - SHOW SCHEMAS [FROM catalog] - list schemas in
>>                 the current default
>>                     catalog or the specified catalog
>>                     - DESCRIBE VIEW view - show the view's definition
>>                 in CatalogView
>>                     - SHOW VIEWS [FROM schema/catalog.schema] - show
>>                 views from current or a
>>                     specified schema.
>>
>>                     (DDLs that can be addressed by either our design
>>                 or Shuyi's DDL design)
>>
>>                     - CREATE/DROP/ALTER SCHEMA schema
>>                     - CREATE/DROP/ALTER CATALOG catalog
>>
>>                  To be modified:
>>
>>                     - SHOW TABLES [FROM schema/catalog.schema] - show
>>                 tables from current or
>>                     a specified schema. Add 'from schema' to existing
>>                 'SHOW TABLES' statement
>>                     - SHOW FUNCTIONS [FROM schema/catalog.schema] -
>>                 show functions from
>>                     current or a specified schema. Add 'from schema'
>>                 to the existing 'SHOW FUNCTIONS'
>>                     statement
>>
>>
>>                  Thanks, Bowen
>>
>>
>>
>>                  On Wed, Nov 14, 2018 at 10:39 PM Zhang, Xuefu
>>                 <xuefu.z@alibaba-inc.com> wrote:
>>
>>                  > Thanks, Bowen, for catching the error. I have
>>                 granted comment permission
>>                  > with the link.
>>                  >
>>                  > I also updated the doc with the latest class
>>                 definitions. Everyone is
>>                  > encouraged to review and comment.
>>                  >
>>                  > Thanks,
>>                  > Xuefu
>>                  >
>>                  >
>> ------------------------------------------------------------------
>>                  > Sender:Bowen Li <bowenli86@gmail.com>
>>                  > Sent at:2018 Nov 14 (Wed) 06:44
>>                  > Recipient:Xuefu <xuefu.z@alibaba-inc.com>
>>                  > Cc:piotr <piotr@data-artisans.com>; dev
>>                 <dev@flink.apache.org>; Shuyi
>>                  > Chen <suez1224@gmail.com>
>>                  > Subject:Re: [DISCUSS] Integrate Flink SQL well with
>>                 Hive ecosystem
>>                  >
>>                  > Hi Xuefu,
>>                  >
>>                  > Currently the new design doc
>>                  >
>> <https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit>
>>                  > is on “view only" mode, and people cannot leave
>>                 comments. Can you please
>>                  > change it to "can comment" or "can edit" mode?
>>                  >
>>                  > Thanks, Bowen
>>                  >
>>                  >
>>                  > On Mon, Nov 12, 2018 at 9:51 PM Zhang, Xuefu
>>                 <xuefu.z@alibaba-inc.com>
>>                  > wrote:
>>                  > Hi Piotr
>>                  >
>>                  > I have extracted the API portion of  the design and
>>                 the google doc is here
>>                  >
>> <https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit?usp=sharing>.
>>                  > Please review and provide your feedback.
>>                  >
>>                  > Thanks,
>>                  > Xuefu
>>                  >
>>                  >
>> ------------------------------------------------------------------
>>                  > Sender:Xuefu <xuefu.z@alibaba-inc.com>
>>                  > Sent at:2018 Nov 12 (Mon) 12:43
>>                  > Recipient:Piotr Nowojski <piotr@data-artisans.com>;
>>                  > dev <dev@flink.apache.org>
>>                  > Cc:Bowen Li <bowenli86@gmail.com>; Shuyi Chen
>>                 <suez1224@gmail.com>
>>                  > Subject:Re: [DISCUSS] Integrate Flink SQL well with
>>                 Hive ecosystem
>>                  >
>>                  > Hi Piotr,
>>                  >
>>                  > That sounds good to me. Let's close all the open
>>                 questions (there are a
>>                  > couple of them) in the Google doc and I should be
>>                 able to quickly split
>>                  > it into the three proposals as you suggested.
>>                  >
>>                  > Thanks,
>>                  > Xuefu
>>                  >
>>                  >
>> ------------------------------------------------------------------
>>                  > Sender:Piotr Nowojski <piotr@data-artisans.com>
>>                  > Sent at:2018 Nov 9 (Fri) 22:46
>>                  > Recipient:dev <dev@flink.apache.org>; Xuefu
>>                 <xuefu.z@alibaba-inc.com>
>>                  > Cc:Bowen Li <bowenli86@gmail.com>; Shuyi Chen
>>                 <suez1224@gmail.com>
>>                  > Subject:Re: [DISCUSS] Integrate Flink SQL well with
>>                 Hive ecosystem
>>                  >
>>                  > Hi,
>>                  >
>>                  >
>>                  > Yes, it seems like the best solution. Maybe someone
>>                 else can also suggest whether we can split it further?
>>                 Maybe the interface changes in one doc, reading
>>                 from the Hive metastore in another, and finally storing our
>>                 meta information in the Hive metastore?
>>                  >
>>                  > Piotrek
>>                  >
>>                  > > On 9 Nov 2018, at 01:44, Zhang, Xuefu
>>                 <xuefu.z@alibaba-inc.com> wrote:
>>                  > >
>>                  > > Hi Piotr,
>>                  > >
>>                  > > That seems to be good idea!
>>                  > >
>>                  >
>>                  > > Since the google doc for the design is currently
>>                 under extensive review, I will leave it as it is for
>>                 now. However, I'll convert it to two different FLIPs
>>                 when the time comes.
>>                  > >
>>                  > > How does it sound to you?
>>                  > >
>>                  > > Thanks,
>>                  > > Xuefu
>>                  > >
>>                  > >
>>                  > >
>> ------------------------------------------------------------------
>>                  > > Sender:Piotr Nowojski <piotr@data-artisans.com>
>>                  > > Sent at:2018 Nov 9 (Fri) 02:31
>>                  > > Recipient:dev <dev@flink.apache.org>
>>                  > > Cc:Bowen Li <bowenli86@gmail.com>; Xuefu
>>                 <xuefu.z@alibaba-inc.com>; Shuyi Chen
>>                  > > <suez1224@gmail.com>
>>                  > > Subject:Re: [DISCUSS] Integrate Flink SQL well
>>                 with Hive ecosystem
>>                  > >
>>                  > > Hi,
>>                  > >
>>                  >
>>                  > > Maybe we should split this topic (and the design
>>                 doc) into a couple of smaller ones, hopefully
>>                 independent. The questions that you have asked, Fabian,
>>                 for example have very little to do with reading
>>                 metadata from the Hive Metastore.
>>                  > >
>>                  > > Piotrek
>>                  > >
>>                  > >> On 7 Nov 2018, at 14:27, Fabian Hueske
>>                 <fhueske@gmail.com <ma...@gmail.com>> wrote:
>>                  > >>
>>                  > >> Hi Xuefu and all,
>>                  > >>
>>                  > >> Thanks for sharing this design document!
>>                  >
>>                  > >> I'm very much in favor of restructuring /
>>                 reworking the catalog handling in
>>                  > >> Flink SQL as outlined in the document.
>>                  >
>>                  > >> Most changes described in the design document
>>                 seem to be rather general and
>>                  > >> not specifically related to the Hive integration.
>>                  > >>
>>                  >
>>                  > >> IMO, there are some aspects, especially those at
>>                 the boundary of Hive and
>>                  > >> Flink, that need a bit more discussion. For 
>> example
>>                  > >>
>>                  > >> * What does it take to make Flink schema
>>                 compatible with Hive schema?
>>                  > >> * How will Flink tables (descriptors) be stored
>>                 in HMS?
>>                  > >> * How do both Hive catalogs differ? Could they
>>                 be integrated into a
>>                  > >> single one? When to use which one?
>>                  >
>>                  > >> * What meta information is provided by HMS? Which
>>                 of it can be leveraged
>>                  > >> by Flink?
>>                  > >>
>>                  > >> Thank you,
>>                  > >> Fabian
>>                  > >>
>>                  > >> Am Fr., 2. Nov. 2018 um 00:31 Uhr schrieb Bowen
>>                 Li <bowenli86@gmail.com <ma...@gmail.com>
>>                  > >:
>>                  > >>
>>                  > >>> After taking a look at how other discussion
>>                 threads work, I think it's
>>                  > >>> actually fine just to keep our discussion here.
>>                 It's up to you, Xuefu.
>>                  > >>>
>>                  > >>> The google doc LGTM. I left some minor comments.
>>                  > >>>
>>                  > >>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li
>>                 <bowenli86@gmail.com <ma...@gmail.com>> 
>> wrote:
>>                  > >>>
>>                  > >>>> Hi all,
>>                  > >>>>
>>                  > >>>> As Xuefu has published the design doc on
>>                 google, I agree with Shuyi's
>>                  >
>>                  > >>>> suggestion that we probably should start a new
>>                 email thread like "[DISCUSS]
>>                  >
>>                  > >>>> ... Hive integration design ..." on only dev
>>                 mailing list for community
>>                  > >>>> devs to review. The current thread sends to
>>                 both dev and user list.
>>                  > >>>>
>>                  >
>>                  > >>>> This email thread is more like validating the
>>                 general idea and direction
>>                  >
>>                  > >>>> with the community, and it's been pretty long
>>                 and crowded so far. Since
>>                  >
>>                  > >>>> everyone is in favor of the idea, we can move
>>                 forward with another thread to
>>                  > >>>> discuss and finalize the design.
>>                  > >>>>
>>                  > >>>> Thanks,
>>                  > >>>> Bowen
>>                  > >>>>
>>                  > >>>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu <
>>                  > xuefu.z@alibaba-inc.com
>>                 <ma...@alibaba-inc.com>>
>>                  > >>>> wrote:
>>                  > >>>>
>>                  > >>>>> Hi Shuyi,
>>                  > >>>>>
>>                  >
>>                  > >>>>> Good idea. Actually the PDF was converted
>>                 from a google doc. Here is its
>>                  > >>>>> link:
>>                  > >>>>>
>>                  > >>>>>
>>                  >
>> https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
>>                  > >>>>> Once we reach an agreement, I can convert it
>>                 to a FLIP.
>>                  > >>>>>
>>                  > >>>>> Thanks,
>>                  > >>>>> Xuefu
>>                  > >>>>>
>>                  > >>>>>
>>                  > >>>>>
>>                  > >>>>>
>> ------------------------------------------------------------------
>>                  > >>>>> Sender:Shuyi Chen <suez1224@gmail.com
>>                 <ma...@gmail.com>>
>>                  > >>>>> Sent at:2018 Nov 1 (Thu) 02:47
>>                  > >>>>> Recipient:Xuefu <xuefu.z@alibaba-inc.com
>>                 <ma...@alibaba-inc.com>>
>>                  > >>>>> Cc:vino yang <yanghua1127@gmail.com
>>                 <ma...@gmail.com>>; Fabian Hueske <
>>                  > fhueske@gmail.com <ma...@gmail.com>>;
>>                  > >>>>> dev <dev@flink.apache.org
>>                 <ma...@flink.apache.org>>; user
>>                 <user@flink.apache.org <ma...@flink.apache.org>>
>>                  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL
>>                 well with Hive ecosystem
>>                  > >>>>>
>>                  > >>>>> Hi Xuefu,
>>                  > >>>>>
>>                  >
>>                  > >>>>> Thanks a lot for driving this big effort. I
>>                 would suggest converting your
>>                  >
>>                  > >>>>> proposal and design doc into a google doc,
>>                 and share it on the dev mailing
>>                  >
>>                  > >>>>> list for the community to review and comment
>>                 with a title like "[DISCUSS] ...
>>                  >
>>                  > >>>>> Hive integration design ...". Once
>>                 approved, we can document it as a FLIP
>>                  >
>>                  > >>>>> (Flink Improvement Proposals), and use JIRAs
>>                 to track the implementations.
>>                  > >>>>> What do you think?
>>                  > >>>>>
>>                  > >>>>> Shuyi
>>                  > >>>>>
>>                  > >>>>> On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu <
>>                  > xuefu.z@alibaba-inc.com
>>                 <ma...@alibaba-inc.com>>
>>                  > >>>>> wrote:
>>                  > >>>>> Hi all,
>>                  > >>>>>
>>                  > >>>>> I have also shared a design doc on Hive
>>                 metastore integration that is
>>                  >
>>                  > >>>>> attached here and also to FLINK-10556 [1].
>>                 Please kindly review and share
>>                  > >>>>> your feedback.
>>                  > >>>>>
>>                  > >>>>>
>>                  > >>>>> Thanks,
>>                  > >>>>> Xuefu
>>                  > >>>>>
>>                  > >>>>> [1]
>> https://issues.apache.org/jira/browse/FLINK-10556
>>                  > >>>>>
>> ------------------------------------------------------------------
>>                  > >>>>> Sender:Xuefu <xuefu.z@alibaba-inc.com
>>                 <ma...@alibaba-inc.com>>
>>                  > >>>>> Sent at:2018 Oct 25 (Thu) 01:08
>>                  > >>>>> Recipient:Xuefu <xuefu.z@alibaba-inc.com
>>                 <ma...@alibaba-inc.com>>; Shuyi Chen <
>>                  > >>>>> suez1224@gmail.com <ma...@gmail.com>>
>>                  > >>>>> Cc:yanghua1127 <yanghua1127@gmail.com
>>                 <ma...@gmail.com>>; Fabian Hueske <
>>                  > fhueske@gmail.com <ma...@gmail.com>>;
>>                  > >>>>> dev <dev@flink.apache.org
>>                 <ma...@flink.apache.org>>; user
>>                 <user@flink.apache.org <ma...@flink.apache.org>>
>>                  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL
>>                 well with Hive ecosystem
>>                  > >>>>>
>>                  > >>>>> Hi all,
>>                  > >>>>>
>>                  > >>>>> To wrap up the discussion, I have attached a
>>                 PDF describing the
>>                  >
>>                  > >>>>> proposal, which is also attached to
>>                 FLINK-10556 [1]. Please feel free to
>>                  > >>>>> watch that JIRA to track the progress.
>>                  > >>>>>
>>                  > >>>>> Please also let me know if you have
>>                 additional comments or questions.
>>                  > >>>>>
>>                  > >>>>> Thanks,
>>                  > >>>>> Xuefu
>>                  > >>>>>
>>                  > >>>>> [1]
>> https://issues.apache.org/jira/browse/FLINK-10556
>>                  > >>>>>
>>                  > >>>>>
>>                  > >>>>>
>> ------------------------------------------------------------------
>>                  > >>>>> Sender:Xuefu <xuefu.z@alibaba-inc.com
>>                 <ma...@alibaba-inc.com>>
>>                  > >>>>> Sent at:2018 Oct 16 (Tue) 03:40
>>                  > >>>>> Recipient:Shuyi Chen <suez1224@gmail.com
>>                 <ma...@gmail.com>>
>>                  > >>>>> Cc:yanghua1127 <yanghua1127@gmail.com
>>                 <ma...@gmail.com>>; Fabian Hueske <
>>                  > fhueske@gmail.com <ma...@gmail.com>>;
>>                  > >>>>> dev <dev@flink.apache.org
>>                 <ma...@flink.apache.org>>; user
>>                 <user@flink.apache.org <ma...@flink.apache.org>>
>>                  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL
>>                 well with Hive ecosystem
>>                  > >>>>>
>>                  > >>>>> Hi Shuyi,
>>                  > >>>>>
>>                  >
>>                  > >>>>> Thank you for your input. Yes, I agree with
>>                 a phased approach and would like
>>                  >
>>                  > >>>>> to move forward fast. :) We did some work
>>                 internally on DDL utilizing the Babel
>>                  > >>>>> parser in Calcite. While Babel makes
>>                 Calcite's grammar extensible, at
>>                  > >>>>> first impression it still seems too
>>                 cumbersome for a project when too
>>                  >
>>                  > >>>>> many extensions are made. It's even
>>                 challenging to find where the extension
>>                  >
>>                  > >>>>> is needed! It would be certainly better if
>>                 Calcite can magically support
>>                  >
>>                  > >>>>> Hive QL by just turning on a flag, such as
>>                 that for MYSQL_5. I can also
>>                  >
>>                  > >>>>> see that this could mean a lot of work on
>>                 Calcite. Nevertheless, I will
>>                  >
>>                  > >>>>> bring up the discussion over there and to see
>>                 what their community thinks.
>>                  > >>>>>
>>                  > >>>>> Would you mind sharing more info about the
>>                 proposal on DDL that you
>>                  > >>>>> mentioned? We can certainly collaborate on 
>> this.
>>                  > >>>>>
>>                  > >>>>> Thanks,
>>                  > >>>>> Xuefu
>>                  > >>>>>
>>                  > >>>>>
>> ------------------------------------------------------------------
>>                  > >>>>> Sender:Shuyi Chen <suez1224@gmail.com
>>                 <ma...@gmail.com>>
>>                  > >>>>> Sent at:2018 Oct 14 (Sun) 08:30
>>                  > >>>>> Recipient:Xuefu <xuefu.z@alibaba-inc.com
>>                 <ma...@alibaba-inc.com>>
>>                  > >>>>> Cc:yanghua1127 <yanghua1127@gmail.com
>>                 <ma...@gmail.com>>; Fabian Hueske <
>>                  > fhueske@gmail.com <ma...@gmail.com>>;
>>                  > >>>>> dev <dev@flink.apache.org
>>                 <ma...@flink.apache.org>>; user
>>                 <user@flink.apache.org <ma...@flink.apache.org>>
>>                  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL
>>                 well with Hive ecosystem
>>                  > >>>>>
>>                  > >>>>> Welcome to the community and thanks for the
>>                 great proposal, Xuefu! I
>>                  >
>>                  > >>>>> think the proposal can be divided into 2
>>                 stages: making Flink support
>>                  >
>>                  > >>>>> Hive features, and making Hive work with
>>                 Flink. I agree with Timo on
>>                  >
>>                  > >>>>> starting with a smaller scope, so we can make
>>                 progress faster. As for [6],
>>                  >
>>                  > >>>>> a proposal for DDL is already in progress,
>>                 and will come after the unified
>>                  >
>>                  > >>>>> SQL connector API is done. For supporting
>>                 Hive syntax, we might need to
>>                  > >>>>> work with the Calcite community, and a recent
>>                 effort called babel (
>>                  > >>>>>
>> https://issues.apache.org/jira/browse/CALCITE-2280) in
>>                 Calcite might
>>                  > >>>>> help here.
>>                  > >>>>>
>>                  > >>>>> Thanks
>>                  > >>>>> Shuyi
>>                  > >>>>>
>>                  > >>>>> On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <
>>                  > xuefu.z@alibaba-inc.com
>>                 <ma...@alibaba-inc.com>>
>>                  > >>>>> wrote:
>>                  > >>>>> Hi Fabian/Vno,
>>                  > >>>>>
>>                  >
>>                  > >>>>> Thank you very much for your encouragement
>>                 and inquiry. Sorry that I didn't
>>                  >
>>                  > >>>>> see Fabian's email until I read Vino's
>>                 response just now. (Somehow Fabian's
>>                  > >>>>> went to the spam folder.)
>>                  > >>>>>
>>                  >
>>                  > >>>>> My proposal contains long-term and
>>                 short-terms goals. Nevertheless, the
>>                  > >>>>> effort will focus on the following areas,
>>                 including Fabian's list:
>>                  > >>>>>
>>                  > >>>>> 1. Hive metastore connectivity - This covers
>>                 both read/write access,
>>                  >
>>                  > >>>>> which means Flink can make full use of Hive's
>>                 metastore as its catalog (at
>>                  > >>>>> least for batch, but this can extend to
>>                 streaming as well).
>>                  >
>>                  > >>>>> 2. Metadata compatibility - Objects
>>                 (databases, tables, partitions, etc)
>>                  >
>>                  > >>>>> created by Hive can be understood by Flink
>>                 and the reverse direction is
>>                  > >>>>> true also.
>>                  > >>>>> 3. Data compatibility - Similar to #2, data
>>                 produced by Hive can be
>>                  > >>>>> consumed by Flink and vice versa.
>>                  >
>>                  > >>>>> 4. Support Hive UDFs - For all Hive's native
>>                 UDFs, Flink either provides
>>                  > >>>>> its own implementation or makes Hive's
>>                 implementation work in Flink.
>>                  > >>>>> Further, for user-created UDFs in Hive, Flink
>>                 SQL should provide a
>>                  >
>>                  > >>>>> mechanism allowing users to import them into
>>                 Flink without any code change
>>                  > >>>>> required.
>>                  > >>>>> 5. Data types - Flink SQL should support all
>>                 data types that are
>>                  > >>>>> available in Hive.
>>                  > >>>>> 6. SQL Language - Flink SQL should support
>>                 SQL standard (such as
>>                  >
>>                  > >>>>> SQL2003) with extension to support Hive's
>>                 syntax and language features,
>>                  > >>>>> around DDL, DML, and SELECT queries.
>>                  >
>>                  > >>>>> 7. SQL CLI - this is currently being developed in
>>                 Flink but more effort is
>>                  > >>>>> needed.
>>                  >
>>                  > >>>>> 8. Server - provide a server that's
>>                 compatible with Hive's HiveServer2
>>                  >
>>                  > >>>>> in thrift APIs, such that HiveServer2 users
>>                 can reuse their existing client
>>                  > >>>>> (such as beeline) but connect to Flink's
>>                 thrift server instead.
>>                  >
>>                  > >>>>> 9. JDBC/ODBC drivers - Flink may provide its
>>                 own JDBC/ODBC drivers for
>>                  > >>>>> other applications to connect to its
>>                 thrift server.
>>                  > >>>>> 10. Support other users' customizations in
>>                 Hive, such as Hive Serdes,
>>                  > >>>>> storage handlers, etc.
>>                  >
>>                  > >>>>> 11. Better task failure tolerance and task
>>                 scheduling at Flink runtime.
>>                  > >>>>>
>>                  > >>>>> As you can see, achieving all those requires
>>                 significant effort
>>                  >
>>                  > >>>>> across all layers in Flink. However, a
>>                 short-term goal could include only
>>                  >
>>                  > >>>>> core areas (such as 1, 2, 4, 5, 6, 7) or
>>                 start at a smaller scope (such as
>>                  > >>>>> #3, #6).
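To make item 5 above concrete, here is a minimal sketch in plain Python (deliberately not Flink code; the target type names are illustrative assumptions, not Flink's finalized type system) of what a Hive-to-Flink type translation table might look like:

```python
# Illustrative sketch only: a toy Hive -> Flink SQL type-name mapping.
# The right-hand names are assumptions for illustration; complex types
# (ARRAY, MAP, STRUCT) are left out and rejected explicitly.
HIVE_TO_FLINK = {
    "TINYINT": "TINYINT",
    "SMALLINT": "SMALLINT",
    "INT": "INT",
    "BIGINT": "BIGINT",
    "FLOAT": "FLOAT",
    "DOUBLE": "DOUBLE",
    "BOOLEAN": "BOOLEAN",
    "STRING": "VARCHAR",
    "BINARY": "VARBINARY",
    "DATE": "DATE",
    "TIMESTAMP": "TIMESTAMP",
}

def to_flink_type(hive_type: str) -> str:
    """Translate a primitive Hive type name, failing loudly on gaps."""
    t = hive_type.strip().upper()
    if t not in HIVE_TO_FLINK:
        raise ValueError(f"no Flink mapping yet for Hive type: {hive_type}")
    return HIVE_TO_FLINK[t]
```

A real implementation would also need to cover parameterized and complex types (DECIMAL(p,s), ARRAY, MAP, STRUCT), which is where much of the effort behind item 5 would lie.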
>>                  > >>>>>
>>                  >
>>                  > >>>>> Please share your further thoughts. If we
>>                 generally agree that this is
>>                  >
>>                  > >>>>> the right direction, I could come up with a
>>                 formal proposal quickly and
>>                  > >>>>> then we can follow up with broader discussions.
>>                  > >>>>>
>>                  > >>>>> Thanks,
>>                  > >>>>> Xuefu
>>                  > >>>>>
>>                  > >>>>>
>>                  > >>>>>
>>                  > >>>>>
>> ------------------------------------------------------------------
>>                  > >>>>> Sender:vino yang <yanghua1127@gmail.com
>>                 <ma...@gmail.com>>
>>                  > >>>>> Sent at:2018 Oct 11 (Thu) 09:45
>>                  > >>>>> Recipient:Fabian Hueske <fhueske@gmail.com
>>                 <ma...@gmail.com>>
>>                  > >>>>> Cc:dev <dev@flink.apache.org
>>                 <ma...@flink.apache.org>>; Xuefu
>>                 <xuefu.z@alibaba-inc.com 
>> <ma...@alibaba-inc.com>
>>                  > >; user <
>>                  > >>>>> user@flink.apache.org
>>                 <ma...@flink.apache.org>>
>>                  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL
>>                 well with Hive ecosystem
>>                  > >>>>>
>>                  > >>>>> Hi Xuefu,
>>                  > >>>>>
>>                  >
>>                  > >>>>> I appreciate this proposal, and like Fabian,
>>                 I think it would be great if you
>>                  > >>>>> could give more details of the plan.
>>                  > >>>>>
>>                  > >>>>> Thanks, vino.
>>                  > >>>>>
>>                  > >>>>> Fabian Hueske <fhueske@gmail.com
>>                 <ma...@gmail.com>> 于2018年10月10日周三
>>                 下午5:27写道:
>>                  > >>>>> Hi Xuefu,
>>                  > >>>>>
>>                  >
>>                  > >>>>> Welcome to the Flink community and thanks for
>>                 starting this discussion!
>>                  > >>>>> Better Hive integration would be really great!
>>                  > >>>>> Can you go into details of what you are
>>                 proposing? I can think of a
>>                  > >>>>> couple ways to improve Flink in that regard:
>>                  > >>>>>
>>                  > >>>>> * Support for Hive UDFs
>>                  > >>>>> * Support for Hive metadata catalog
>>                  > >>>>> * Support for HiveQL syntax
>>                  > >>>>> * ???
>>                  > >>>>>
>>                  > >>>>> Best, Fabian
>>                  > >>>>>
>>                  > >>>>> Am Di., 9. Okt. 2018 um 19:22 Uhr schrieb
>>                 Zhang, Xuefu <
>>                  > >>>>> xuefu.z@alibaba-inc.com
>>                 <ma...@alibaba-inc.com>>:
>>                  > >>>>> Hi all,
>>                  > >>>>>
>>                  > >>>>> Along with the community's effort, inside
>>                 Alibaba we have explored
>>                  >
>>                  > >>>>> Flink's potential as an execution engine not
>>                 just for stream processing but
>>                  > >>>>> also for batch processing. We are encouraged
>>                 by our findings and have
>>                  >
>>                  > >>>>> initiated our effort to make Flink's SQL
>>                 capabilities full-fledged. When
>>                  >
>>                  > >>>>> comparing what's available in Flink to the
>>                 offerings from competitive data
>>                  >
>>                  > >>>>> processing engines, we identified a major gap
>>                 in Flink: good integration
>>                  >
>>                  > >>>>> with the Hive ecosystem. This is crucial to the
>>                 success of Flink SQL and batch
>>                  >
>>                  > >>>>> due to the well-established data ecosystem
>>                 around Hive. Therefore, we have
>>                  >
>>                  > >>>>> done some initial work along this direction
>>                 but a lot of
>>                  > >>>>> effort is still needed.
>>                  > >>>>>
>>                  > >>>>> We have two strategies in mind. The first one
>>                 is to make Flink SQL
>>                  >
>>                  > >>>>> full-fledged and well-integrated with Hive
>>                 ecosystem. This is a similar
>>                  >
>>                  > >>>>> approach to what Spark SQL adopted. The
>>                 second strategy is to make Hive
>>                  >
>>                  > >>>>> itself work with Flink, similar to the
>>                 proposal in [1]. Each approach bears
>>                  >
>>                  > >>>>> its pros and cons, but they don’t need to be
>>                 mutually exclusive with each
>>                  > >>>>> targeting different users and use cases.
>>                 We believe that both will
>>                  > >>>>> promote a much greater adoption of Flink
>>                 beyond stream processing.
>>                  > >>>>>
>>                  > >>>>> We have been focused on the first approach
>>                 and would like to showcase
>>                  >
>>                  > >>>>> Flink's batch and SQL capabilities with Flink
>>                 SQL. However, we have also
>>                  > >>>>> planned to start strategy #2 as the follow-up
>>                 effort.
>>                  > >>>>>
>>                  >
>>                  > >>>>> I'm completely new to Flink (with a short
>>                 bio [2] below), though many
>>                  >
>>                  > >>>>> of my colleagues here at Alibaba are
>>                 long-time contributors. Nevertheless,
>>                  >
>>                  > >>>>> I'd like to share our thoughts and invite
>>                 your early feedback. At the same
>>                  >
>>                  > >>>>> time, I am working on a detailed proposal on
>>                 Flink SQL's integration with
>>                  > >>>>> Hive ecosystem, which will be also shared
>>                 when ready.
>>                  > >>>>>
>>                  > >>>>> While the ideas are simple, each approach
>>                 will demand significant
>>                  >
>>                  > >>>>> effort, more than what we can afford. Thus,
>>                 the input and contributions
>>                  > >>>>> from the communities are greatly welcome and
>>                 appreciated.
>>                  > >>>>>
>>                  > >>>>> Regards,
>>                  > >>>>>
>>                  > >>>>>
>>                  > >>>>> Xuefu
>>                  > >>>>>
>>                  > >>>>> References:
>>                  > >>>>>
>>                  > >>>>> [1]
>>                 https://issues.apache.org/jira/browse/HIVE-10712
>>                  >
>>                  > >>>>> [2] Xuefu Zhang is a long-time open source
>>                 veteran who has worked or is working on
>>                  > >>>>> many projects under the Apache Foundation, of
>>                 which he is also an honored
>>                  >
>>                  > >>>>> member. About 10 years ago he worked in the
>>                 Hadoop team at Yahoo where the
>>                  >
>>                  > >>>>> projects just got started. Later he worked at
>>                 Cloudera, initiating and
>>                  >
>>                  > >>>>> leading the development of Hive on Spark
>>                 project in the communities and
>>                  >
>>                  > >>>>> across many organizations. Prior to joining
>>                 Alibaba, he worked at Uber
>>                  >
>>                  > >>>>> where he promoted Hive on Spark to all Uber's
>>                 SQL on Hadoop workload and
>>                  > >>>>> significantly improved Uber's cluster 
>> efficiency.
>>                  > >>>>>
>>                  > >>>>>
>>                  > >>>>>
>>                  > >>>>>
>>                  > >>>>> --
>>                  >
>>                  > >>>>> "So you have to trust that the dots will
>>                 somehow connect in your future."
>>                  > >>>>>
>>                  > >>>>>
>>                  > >>>>> --
>>                  >
>>                  > >>>>> "So you have to trust that the dots will
>>                 somehow connect in your future."
>>                  > >>>>>
>>                  >
>>                  >
>>
>
>


Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by Timo Walther <tw...@apache.org>.
Hi Eron,

thank you very much for the contributions. I merged the first little bug 
fixes. For the remaining PRs I think we can review and merge them soon. 
As you said, the code is agnostic to the details of the ExternalCatalog 
interface and I don't expect bigger merge conflicts in the near future.

However, exposing the current external catalog interfaces to SQL Client 
users would make it even more difficult to change the interfaces in the 
future. So maybe I would first wait until the general catalog discussion 
is over and the FLIP has been created. This should happen shortly.

We should definitely coordinate the efforts better in the future to 
avoid duplicate work.

Thanks,
Timo


Am 07.01.19 um 00:24 schrieb Eron Wright:
> Thanks Timo for merging a couple of the PRs.   Are you also able to 
> review the others that I mentioned?  Xuefu I would like to incorporate 
> your feedback too.
>
> Check out this short demonstration of using a catalog in SQL Client:
> https://asciinema.org/a/C8xuAjmZSxCuApgFgZQyeIHuo
>
> Thanks again!
>
> On Thu, Jan 3, 2019 at 9:37 AM Eron Wright <eronwright@gmail.com 
> <ma...@gmail.com>> wrote:
>
>     Would a couple folks raise their hand to make a review pass thru
>     the 6 PRs listed above?  It is a lovely stack of PRs that is 'all
>     green' at the moment.   I would be happy to open follow-on PRs to
>     rapidly align with other efforts.
>
>     Note that the code is agnostic to the details of the
>     ExternalCatalog interface; the code would not be obsolete if/when
>     the catalog interface is enhanced as per the design doc.
>
>
>
>     On Wed, Jan 2, 2019 at 1:35 PM Eron Wright <eronwright@gmail.com
>     <ma...@gmail.com>> wrote:
>
>         I propose that the community review and merge the PRs that I
>         posted, and then evolve the design thru 1.8 and beyond.   I
>         think having a basic infrastructure in place now will
>         accelerate the effort, do you agree?
>
>         Thanks again!
>
>         On Wed, Jan 2, 2019 at 11:20 AM Zhang, Xuefu
>         <xuefu.z@alibaba-inc.com <ma...@alibaba-inc.com>> wrote:
>
>             Hi Eron,
>
>             Happy New Year!
>
>             Thank you very much for your contribution, especially
>             during the holidays. While I'm encouraged by your work, I'd
>             also like to share my thoughts on how to move forward.
>
>             First, please note that the design discussion is still
>             finalizing, and we expect some moderate changes,
>             especially around TableFactories. Another pending change
>             is our decision to shy away from Scala, which will
>             impact our work.
>
>             Secondly, while your work seemed to be about plugging
>             catalog definitions into the execution environment, which
>             is less impacted by the TableFactory change, I did notice some
>             duplication between your work and ours. This is no big deal,
>             but going forward, we should probably communicate better
>             on the work assignment so as to avoid any
>             possible duplication of work. On the other hand, I think
>             some of your work is interesting and valuable for
>             inclusion once we finalize the overall design.
>
>             Thus, please continue your research and experiment and let
>             us know when you start working on anything so we can
>             better coordinate.
>
>             Thanks again for your interest and contributions.
>
>             Thanks,
>             Xuefu
>
>
>
>                 ------------------------------------------------------------------
>                 From:Eron Wright <eronwright@gmail.com
>                 <ma...@gmail.com>>
>                 Sent At:2019 Jan. 1 (Tue.) 18:39
>                 To:dev <dev@flink.apache.org
>                 <ma...@flink.apache.org>>; Xuefu
>                 <xuefu.z@alibaba-inc.com <ma...@alibaba-inc.com>>
>                 Cc:Xiaowei Jiang <xiaoweij@gmail.com
>                 <ma...@gmail.com>>; twalthr
>                 <twalthr@apache.org <ma...@apache.org>>;
>                 piotr <piotr@data-artisans.com
>                 <ma...@data-artisans.com>>; Fabian Hueske
>                 <fhueske@gmail.com <ma...@gmail.com>>;
>                 suez1224 <suez1224@gmail.com
>                 <ma...@gmail.com>>; Bowen Li
>                 <bowenli86@gmail.com <ma...@gmail.com>>
>                 Subject:Re: [DISCUSS] Integrate Flink SQL well with
>                 Hive ecosystem
>
>                 Hi folks, there are clearly some incremental steps to be
>                 taken to introduce catalog support to SQL Client,
>                 complementary to what is proposed in the Flink-Hive
>                 Metastore design doc.  I was quietly working on this
>                 over the holidays.   I posted some new sub-tasks, PRs,
>                 and sample code to FLINK-10744.
>
>                 What inspired me to get involved is that the catalog
>                 interface seems like a great way to encapsulate a
>                 'library' of Flink tables and functions. For example,
>                 the NYC Taxi dataset (TaxiRides, TaxiFares, various
>                 UDFs) may be nicely encapsulated as a catalog
>                 (TaxiData).  Such a library should be fully consumable
>                 in SQL Client.
>
>                 I implemented the above. Some highlights:
>                 1. A fully-worked example of using the Taxi dataset in
>                 SQL Client via an environment file.
>                 - an ASCII video showing the SQL Client in action:
>                 https://asciinema.org/a/C8xuAjmZSxCuApgFgZQyeIHuo
>
>                 - the corresponding environment file (will be even
>                 more concise once 'FLINK-10696 Catalog UDFs' is merged):
>                 https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/dist/conf/sql-client-defaults.yaml
>
>                 - the typed API for standalone table applications:
>                 https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/java/com/dataartisans/flinktraining/examples/table_java/examples/ViaCatalog.java#L50
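[Editor's sketch] As an illustration of the environment-file approach described above, a catalog entry might look roughly like the following. The key names and the implementation class are assumptions for illustration only, not the finalized schema from the linked PRs:

```yaml
# Hypothetical sketch of a 'catalogs' section in sql-client-defaults.yaml.
# Key names and the implementation class are illustrative assumptions.
catalogs:
  - name: taxidata                        # name used to reference the catalog in SQL
    catalog:
      type: custom                        # assumed discriminator for the catalog factory
      class: com.example.TaxiDataCatalog  # hypothetical catalog implementation class
```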
>
>                 2. Implementation of the core catalog descriptor and
>                 factory.  I realize that some renames may later occur
>                 as per the design doc, and would be happy to do that
>                 as a follow-up.
>                 https://github.com/apache/flink/pull/7390
>
>                 3. Implementation of a connect-style API on
>                 TableEnvironment to use catalog descriptor.
>                 https://github.com/apache/flink/pull/7392
>
>                 4. Integration into SQL-Client's environment file:
>                 https://github.com/apache/flink/pull/7393
>
>                 I realize that the overall Hive integration is still
>                 evolving, but I believe that these PRs are a good
>                 stepping stone. Here's the list (in bottom-up order):
>                 - https://github.com/apache/flink/pull/7386
>                 - https://github.com/apache/flink/pull/7388
>                 - https://github.com/apache/flink/pull/7389
>                 - https://github.com/apache/flink/pull/7390
>                 - https://github.com/apache/flink/pull/7392
>                 - https://github.com/apache/flink/pull/7393
>
>                 Thanks and enjoy 2019!
>                 Eron W
>
>
>                 On Sun, Nov 18, 2018 at 3:04 PM Zhang, Xuefu
>                 <xuefu.z@alibaba-inc.com> wrote:
>                 Hi Xiaowei,
>
>                 Thanks for bringing up the question. In the current
>                 design, the properties for meta objects are meant to
>                 cover anything that's specific to a particular catalog
>                 and agnostic to Flink. Anything that is common (such
>                 as the schema for tables, the query text for views, and the UDF
>                 classname) is abstracted as a member of the respective
>                 classes. However, this is still in discussion, and
>                 Timo and I will go over this and provide an update.
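[Editor's sketch] The split described above (typed members for common attributes, a free-form map for catalog-specific properties) can be sketched in plain Java. The class and method names here are illustrative assumptions, not the Flink API under discussion:

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical sketch, not the Flink API under discussion: common
// attributes (here, a view's query text) are typed members, while
// catalog-specific, Flink-agnostic settings live in a free-form
// string-to-string properties map.
public class CatalogView {
    private final String queryText;               // common attribute, typed member
    private final Map<String, String> properties; // catalog-specific key/values

    public CatalogView(String queryText, Map<String, String> properties) {
        this.queryText = queryText;
        this.properties = Collections.unmodifiableMap(properties);
    }

    public String getQueryText() { return queryText; }

    public Map<String, String> getProperties() { return properties; }
}
```

A Hive-backed catalog could, for example, keep storage-format details in the map without Flink having to model them as typed fields.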
>
>                 Please note that UDF is a little more involved than
>                 what the current design doc shows. I'm still refining
>                 this part.
>
>                 Thanks,
>                 Xuefu
>
>
>                 ------------------------------------------------------------------
>                 Sender: Xiaowei Jiang <xiaoweij@gmail.com>
>                 Sent at: 2018 Nov 18 (Sun) 15:17
>                 Recipient: dev <dev@flink.apache.org>
>                 Cc: Xuefu <xuefu.z@alibaba-inc.com>; twalthr <twalthr@apache.org>; piotr <piotr@data-artisans.com>; Fabian Hueske <fhueske@gmail.com>; suez1224 <suez1224@gmail.com>
>                 Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
>                 Thanks Xuefu for the detailed design doc! One question
>                 on the properties associated with the catalog objects.
>                 Are we going to leave them completely free-form, or are
>                 we going to set some standard for them? I think that
>                 the answer may depend on whether we want to explore
>                 catalog-specific optimization opportunities. In any case, I
>                 think that it might be helpful to standardize as much
>                 as possible into strongly typed classes and leave
>                 these properties for catalog-specific things. But I
>                 think that we can do it in steps.
>
>                 Xiaowei
>                 On Fri, Nov 16, 2018 at 4:00 AM Bowen Li
>                 <bowenli86@gmail.com> wrote:
>                 Thanks for continuing to improve the overall design,
>                 Xuefu! It looks quite
>                  good to me now.
>
>                  It would be nice if the cc-ed Flink committers could help
>                 review and confirm!
>
>
>
>                  One minor suggestion: since the last section of
>                 the design doc already touches on
>                  some new SQL statements, shall we add another section
>                 to our doc and
>                  formalize the new SQL statements in SQL Client and
>                 TableEnvironment that
>                  will come along naturally with our design? Here
>                 are some that the
>                  design doc mentioned and some that I came up with:
>
>                  To be added:
>
>                     - USE <catalog> - set default catalog
>                     - USE <catalog.schema> - set default schema
>                     - SHOW CATALOGS - show all registered catalogs
>                     - SHOW SCHEMAS [FROM catalog] - list schemas in
>                 the current default
>                     catalog or the specified catalog
>                     - DESCRIBE VIEW view - show the view's definition
>                 in CatalogView
>                     - SHOW VIEWS [FROM schema/catalog.schema] - show
>                 views from current or a
>                     specified schema.
>
>                     (DDLs that can be addressed by either our design
>                 or Shuyi's DDL design)
>
>                     - CREATE/DROP/ALTER SCHEMA schema
>                     - CREATE/DROP/ALTER CATALOG catalog
>
>                  To be modified:
>
>                     - SHOW TABLES [FROM schema/catalog.schema] - show
>                 tables from current or
>                     a specified schema. Add 'from schema' to existing
>                 'SHOW TABLES' statement
>                     - SHOW FUNCTIONS [FROM schema/catalog.schema] -
>                 show functions from
>                     current or a specified schema. Add 'from schema'
>                 to the existing 'SHOW FUNCTIONS'
>                     statement
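[Editor's sketch] Taken together, a SQL Client session combining the statements listed above might look like the sketch below. The catalog, schema, and view names (myhive, sales, daily_totals) are hypothetical, and the grammar was still under discussion at this point in the thread:

```sql
-- Illustrative only: names are hypothetical and the grammar was not final.
SHOW CATALOGS;                     -- list all registered catalogs
USE myhive;                        -- set 'myhive' as the default catalog
SHOW SCHEMAS FROM myhive;          -- list schemas in a specific catalog
USE myhive.sales;                  -- set the default schema
SHOW TABLES FROM myhive.sales;     -- tables in a specified schema
SHOW FUNCTIONS FROM myhive.sales;  -- functions in a specified schema
DESCRIBE VIEW daily_totals;        -- show the view's definition
```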
>
>
>                  Thanks, Bowen
>
>
>
>                  On Wed, Nov 14, 2018 at 10:39 PM Zhang, Xuefu
>                 <xuefu.z@alibaba-inc.com>
>                  wrote:
>
>                  > Thanks, Bowen, for catching the error. I have
>                 granted comment permission
>                  > for the link.
>                  >
>                  > I also updated the doc with the latest class
>                 definitions. Everyone is
>                  > encouraged to review and comment.
>                  >
>                  > Thanks,
>                  > Xuefu
>                  >
>                  >
>                 ------------------------------------------------------------------
>                  > Sender: Bowen Li <bowenli86@gmail.com>
>                  > Sent at: 2018 Nov 14 (Wed) 06:44
>                  > Recipient: Xuefu <xuefu.z@alibaba-inc.com>
>                  > Cc: piotr <piotr@data-artisans.com>; dev <dev@flink.apache.org>; Shuyi Chen <suez1224@gmail.com>
>                  > Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>                  >
>                  > Hi Xuefu,
>                  >
>                  > Currently the new design doc
>                  >
>                 <https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit>
>                  > is in "view only" mode, and people cannot leave
>                 comments. Can you please
>                  > change it to "can comment" or "can edit" mode?
>                  >
>                  > Thanks, Bowen
>                  >
>                  >
>                  > On Mon, Nov 12, 2018 at 9:51 PM Zhang, Xuefu
>                 <xuefu.z@alibaba-inc.com>
>                  > wrote:
>                  > Hi Piotr
>                  >
>                  > I have extracted the API portion of the design, and
>                 the Google doc is here
>                  >
>                 <https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit?usp=sharing>.
>                  > Please review and provide your feedback.
>                  >
>                  > Thanks,
>                  > Xuefu
>                  >
>                  >
>                 ------------------------------------------------------------------
>                  > Sender: Xuefu <xuefu.z@alibaba-inc.com>
>                  > Sent at: 2018 Nov 12 (Mon) 12:43
>                  > Recipient: Piotr Nowojski <piotr@data-artisans.com>; dev <dev@flink.apache.org>
>                  > Cc: Bowen Li <bowenli86@gmail.com>; Shuyi Chen <suez1224@gmail.com>
>                  > Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>                  >
>                  > Hi Piotr,
>                  >
>                  > That sounds good to me. Let's close all the open
>                 questions (there are a couple of them) in the Google
>                 doc, and I should be able to quickly split it into
>                 the three proposals as you suggested.
>                  >
>                  > Thanks,
>                  > Xuefu
>                  >
>                  >
>                 ------------------------------------------------------------------
>                  > Sender: Piotr Nowojski <piotr@data-artisans.com>
>                  > Sent at: 2018 Nov 9 (Fri) 22:46
>                  > Recipient: dev <dev@flink.apache.org>; Xuefu <xuefu.z@alibaba-inc.com>
>                  > Cc: Bowen Li <bowenli86@gmail.com>; Shuyi Chen <suez1224@gmail.com>
>                  > Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>                  >
>                  > Hi,
>                  >
>                  >
>                  > Yes, it seems like the best solution. Maybe someone
>                 else can also suggest whether we can split it further?
>                 Maybe changes to the interface in one doc, reading
>                 from the Hive metastore in another, and finally storing our
>                 meta information in the Hive metastore in a third?
>                  >
>                  > Piotrek
>                  >
>                  > > On 9 Nov 2018, at 01:44, Zhang, Xuefu
>                 <xuefu.z@alibaba-inc.com> wrote:
>                  > >
>                  > > Hi Piotr,
>                  > >
>                  > > That seems to be good idea!
>                  > >
>                  >
>                  > > Since the google doc for the design is currently
>                 under extensive review, I will leave it as it is for
>                 now. However, I'll convert it to two different FLIPs
>                 when the time comes.
>                  > >
>                  > > How does it sound to you?
>                  > >
>                  > > Thanks,
>                  > > Xuefu
>                  > >
>                  > >
>                  > >
>                 ------------------------------------------------------------------
>                  > > Sender: Piotr Nowojski <piotr@data-artisans.com>
>                  > > Sent at: 2018 Nov 9 (Fri) 02:31
>                  > > Recipient: dev <dev@flink.apache.org>
>                  > > Cc: Bowen Li <bowenli86@gmail.com>; Xuefu <xuefu.z@alibaba-inc.com>; Shuyi Chen <suez1224@gmail.com>
>                  > > Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>                  > >
>                  > > Hi,
>                  > >
>                  >
>                  > > Maybe we should split this topic (and the design
>                 doc) into a couple of smaller ones, hopefully
>                 independent. The questions that Fabian has asked
>                 have, for example, very little to do with reading
>                 metadata from the Hive Metastore.
>                  > >
>                  > > Piotrek
>                  > >
>                  > >> On 7 Nov 2018, at 14:27, Fabian Hueske
>                 <fhueske@gmail.com> wrote:
>                  > >>
>                  > >> Hi Xuefu and all,
>                  > >>
>                  > >> Thanks for sharing this design document!
>                  >
>                  > >> I'm very much in favor of restructuring /
>                 reworking the catalog handling in
>                  > >> Flink SQL as outlined in the document.
>                  >
>                  > >> Most changes described in the design document
>                 seem to be rather general and
>                  > >> not specifically related to the Hive integration.
>                  > >>
>                  >
>                  > >> IMO, there are some aspects, especially those at
>                 the boundary of Hive and
>                  > >> Flink, that need a bit more discussion. For example
>                  > >>
>                  > >> * What does it take to make Flink schema
>                 compatible with Hive schema?
>                  > >> * How will Flink tables (descriptors) be stored
>                 in HMS?
>                  > >> * How do both Hive catalogs differ? Could they
>                 be integrated into a
>                  > >> single one? When to use which one?
>                  >
>                  > >> * What meta information is provided by HMS? What
>                 of this can be leveraged
>                  > >> by Flink?
>                  > >>
>                  > >> Thank you,
>                  > >> Fabian
>                  > >>
>                  > >> On Fri, 2 Nov 2018 at 00:31, Bowen
>                 Li <bowenli86@gmail.com> wrote:
>                  > >>
>                  > >>> After taking a look at how other discussion
>                 threads work, I think it's
>                  > >>> actually fine to just keep our discussion here.
>                 It's up to you, Xuefu.
>                  > >>>
>                  > >>> The google doc LGTM. I left some minor comments.
>                  > >>>
>                  > >>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li
>                 <bowenli86@gmail.com> wrote:
>                  > >>>
>                  > >>>> Hi all,
>                  > >>>>
>                  > >>>> As Xuefu has published the design doc on
>                 Google, I agree with Shuyi's
>                  >
>                  > >>>> suggestion that we probably should start a new
>                 email thread like "[DISCUSS]
>                  >
>                  > >>>> ... Hive integration design ..." on only dev
>                 mailing list for community
>                  > >>>> devs to review. The current thread goes to
>                 both the dev and user lists.
>                  > >>>>
>                  >
>                  > >>>> This email thread is more like validating the
>                 general idea and direction
>                  >
>                  > >>>> with the community, and it's been pretty long
>                 and crowded so far. Since
>                  >
>                  > >>>> everyone is in favor of the idea, we can move
>                 forward with another thread to
>                  > >>>> discuss and finalize the design.
>                  > >>>>
>                  > >>>> Thanks,
>                  > >>>> Bowen
>                  > >>>>
>                  > >>>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu <
>                  > xuefu.z@alibaba-inc.com>
>                  > >>>> wrote:
>                  > >>>>
>                  > >>>>> Hi Shuyi,
>                  > >>>>>
>                  >
>                  > >>>>> Good idea. Actually the PDF was converted
>                 from a Google doc. Here is its
>                  > >>>>> link:
>                  > >>>>>
>                  > >>>>>
>                  >
>                 https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
>                  > >>>>> Once we reach an agreement, I can convert it
>                 to a FLIP.
>                  > >>>>>
>                  > >>>>> Thanks,
>                  > >>>>> Xuefu
>                  > >>>>>
>                  > >>>>>
>                  > >>>>>
>                  > >>>>>
>                 ------------------------------------------------------------------
>                  > >>>>> Sender: Shuyi Chen <suez1224@gmail.com>
>                  > >>>>> Sent at: 2018 Nov 1 (Thu) 02:47
>                  > >>>>> Recipient: Xuefu <xuefu.z@alibaba-inc.com>
>                  > >>>>> Cc: vino yang <yanghua1127@gmail.com>; Fabian Hueske <fhueske@gmail.com>; dev <dev@flink.apache.org>; user <user@flink.apache.org>
>                  > >>>>> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>                  > >>>>>
>                  > >>>>> Hi Xuefu,
>                  > >>>>>
>                  >
>                  > >>>>> Thanks a lot for driving this big effort. I
>                 would suggest converting your
>                  >
>                  > >>>>> proposal and design doc into a Google doc,
>                 and sharing it on the dev mailing
>                  >
>                  > >>>>> list for the community to review and comment
>                 with title like "[DISCUSS] ...
>                  >
>                  > >>>>> Hive integration design ..." . Once
>                 approved, we can document it as a FLIP
>                  >
>                  > >>>>> (Flink Improvement Proposals), and use JIRAs
>                 to track the implementations.
>                  > >>>>> What do you think?
>                  > >>>>>
>                  > >>>>> Shuyi
>                  > >>>>>
>                  > >>>>> On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu <
>                  > xuefu.z@alibaba-inc.com>
>                  > >>>>> wrote:
>                  > >>>>> Hi all,
>                  > >>>>>
>                  > >>>>> I have also shared a design doc on Hive
>                 metastore integration that is
>                  >
>                  > >>>>> attached here and also to FLINK-10556[1].
>                 Please kindly review and share
>                  > >>>>> your feedback.
>                  > >>>>>
>                  > >>>>>
>                  > >>>>> Thanks,
>                  > >>>>> Xuefu
>                  > >>>>>
>                  > >>>>> [1]
>                 https://issues.apache.org/jira/browse/FLINK-10556
>                  > >>>>>
>                 ------------------------------------------------------------------
>                  > >>>>> Sender: Xuefu <xuefu.z@alibaba-inc.com>
>                  > >>>>> Sent at: 2018 Oct 25 (Thu) 01:08
>                  > >>>>> Recipient: Xuefu <xuefu.z@alibaba-inc.com>; Shuyi Chen <suez1224@gmail.com>
>                  > >>>>> Cc: yanghua1127 <yanghua1127@gmail.com>; Fabian Hueske <fhueske@gmail.com>; dev <dev@flink.apache.org>; user <user@flink.apache.org>
>                  > >>>>> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>                  > >>>>>
>                  > >>>>> Hi all,
>                  > >>>>>
>                  > >>>>> To wrap up the discussion, I have attached a
>                 PDF describing the
>                  >
>                  > >>>>> proposal, which is also attached to
>                 FLINK-10556 [1]. Please feel free to
>                  > >>>>> watch that JIRA to track the progress.
>                  > >>>>>
>                  > >>>>> Please also let me know if you have
>                 additional comments or questions.
>                  > >>>>>
>                  > >>>>> Thanks,
>                  > >>>>> Xuefu
>                  > >>>>>
>                  > >>>>> [1]
>                 https://issues.apache.org/jira/browse/FLINK-10556
>                  > >>>>>
>                  > >>>>>
>                  > >>>>>
>                 ------------------------------------------------------------------
>                  > >>>>> Sender: Xuefu <xuefu.z@alibaba-inc.com>
>                  > >>>>> Sent at: 2018 Oct 16 (Tue) 03:40
>                  > >>>>> Recipient: Shuyi Chen <suez1224@gmail.com>
>                  > >>>>> Cc: yanghua1127 <yanghua1127@gmail.com>; Fabian Hueske <fhueske@gmail.com>; dev <dev@flink.apache.org>; user <user@flink.apache.org>
>                  > >>>>> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>                  > >>>>>
>                  > >>>>> Hi Shuyi,
>                  > >>>>>
>                  >
>                  > >>>>> Thank you for your input. Yes, I agree with
>                 a phased approach and would like
>                  >
>                  > >>>>> to move forward fast. :) We did some work
>                 internally on DDL utilizing babel
>                  > >>>>> parser in Calcite. While babel makes
>                 Calcite's grammar extensible, at
>                  > >>>>> first impression it still seems too
>                 cumbersome for a project when too
>                  >
>                  > >>>>> many extensions are made. It's even
>                 challenging to find where the extension
>                  >
>                  > >>>>> is needed! It would certainly be better if
>                 Calcite could magically support
>                  >
>                  > >>>>> Hive QL by just turning on a flag, such as
>                 that for MYSQL_5. I can also
>                  >
>                  > >>>>> see that this could mean a lot of work on
>                 Calcite. Nevertheless, I will
>                  >
>                  > >>>>> bring up the discussion over there and see
>                 what their community thinks.
>                  > >>>>>
>                  > >>>>> Would you mind sharing more info about the
>                 DDL proposal that you
>                  > >>>>> mentioned? We can certainly collaborate on this.
>                  > >>>>>
>                  > >>>>> Thanks,
>                  > >>>>> Xuefu
>                  > >>>>>
>                  > >>>>>
>                  > >>>>> Sender: Shuyi Chen <suez1224@gmail.com>
>                  > >>>>> Sent at: 2018 Oct 14 (Sun) 08:30
>                  > >>>>> Recipient: Xuefu <xuefu.z@alibaba-inc.com>
>                  > >>>>> Cc: yanghua1127 <yanghua1127@gmail.com>; Fabian Hueske <fhueske@gmail.com>; dev <dev@flink.apache.org>; user <user@flink.apache.org>
>                  > >>>>> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>                  > >>>>>
>                  > >>>>> Welcome to the community and thanks for the
>                 great proposal, Xuefu! I
>                  >
>                  > >>>>> think the proposal can be divided into 2
>                 stages: making Flink support
>                  >
>                  > >>>>> Hive features, and making Hive work with
>                 Flink. I agree with Timo on
>                  >
>                  > >>>>> starting with a smaller scope, so we can make
>                 progress faster. As for [6],
>                  >
>                  > >>>>> a proposal for DDL is already in progress,
>                 and will come after the unified
>                  >
>                  > >>>>> SQL connector API is done. For supporting
>                 Hive syntax, we might need to
>                  > >>>>> work with the Calcite community, and a recent
>                 effort called babel (
>                  > >>>>>
>                 https://issues.apache.org/jira/browse/CALCITE-2280) in
>                 Calcite might
>                  > >>>>> help here.
>                  > >>>>>
>                  > >>>>> Thanks
>                  > >>>>> Shuyi
>                  > >>>>>
>                  > >>>>> On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <
>                  > xuefu.z@alibaba-inc.com>
>                  > >>>>> wrote:
>                  > >>>>> Hi Fabian/Vno,
>                  > >>>>>
>                  >
>                  > >>>>> Thank you very much for your encouragement
>                 and inquiry. Sorry that I didn't
>                  >
>                  > >>>>> see Fabian's email until I read Vino's
>                 response just now. (Somehow Fabian's
>                  > >>>>> went to the spam folder.)
>                  > >>>>>
>                  >
>                  > >>>>> My proposal contains long-term and
>                 short-term goals. Nevertheless, the
>                  > >>>>> effort will focus on the following areas,
>                 including Fabian's list:
>                  > >>>>>
>                  > >>>>> 1. Hive metastore connectivity - This covers
>                 both read/write access,
>                  >
>                  > >>>>> which means Flink can make full use of Hive's
>                 metastore as its catalog (at
>                  > >>>>> least for the batch but can extend for
>                 streaming as well).
>                  >
>                  > >>>>> 2. Metadata compatibility - Objects
>                 (databases, tables, partitions, etc)
>                  >
>                  > >>>>> created by Hive can be understood by Flink
>                 and the reverse direction is
>                  > >>>>> true also.
>                  > >>>>> 3. Data compatibility - Similar to #2, data
>                 produced by Hive can be
>                  > >>>>> consumed by Flink and vice versa.
>                  >
>                  > >>>>> 4. Support Hive UDFs - For all Hive's native
>                 udfs, Flink either provides
>                  > >>>>> its own implementation or makes Hive's
>                 implementation work in Flink.
>                  > >>>>> Further, for user-created UDFs in Hive, Flink
>                 SQL should provide a
>                  >
>                  > >>>>> mechanism allowing users to import them into
>                 Flink without any code change
>                  > >>>>> required.
>                  > >>>>> 5. Data types - Flink SQL should support all
>                 data types that are
>                  > >>>>> available in Hive.
>                  > >>>>> 6. SQL Language - Flink SQL should support
>                 SQL standard (such as
>                  >
>                  > >>>>> SQL2003) with extension to support Hive's
>                 syntax and language features,
>                  > >>>>> around DDL, DML, and SELECT queries.
>                  >
>                  > >>>>> 7.  SQL CLI - this is currently developing in
>                 Flink but more effort is
>                  > >>>>> needed.
>                  >
>                  > >>>>> 8. Server - provide a server that's
>                 compatible with Hive's HiveServer2
>                  >
>                  > >>>>> in thrift APIs, such that HiveServer2 users
>                 can reuse their existing client
>                  > >>>>> (such as beeline) but connect to Flink's
>                 thrift server instead.
>                  >
>                  > >>>>> 9. JDBC/ODBC drivers - Flink may provide its
>                 own JDBC/ODBC drivers for
>                  > >>>>> other applications to use to connect to its
>                 thrift server.
>                  > >>>>> 10. Support other user customizations in
>                 Hive, such as Hive SerDes,
>                  > >>>>> storage handlers, etc.
>                  >
>                  > >>>>> 11. Better task failure tolerance and task
>                 scheduling at Flink runtime.
>                  > >>>>>
>                  > >>>>> As you can see, achieving all those requires
>                 significant effort
>                  >
>                  > >>>>> across all layers in Flink. However, a
>                 short-term goal could include only
>                  >
>                  > >>>>> core areas (such as 1, 2, 4, 5, 6, 7) or
>                 start  at a smaller scope (such as
>                  > >>>>> #3, #6).
>                  > >>>>>
>                  >
>                  > >>>>> Please share your further thoughts. If we
>                 generally agree that this is
>                  >
>                  > >>>>> the right direction, I could come up with a
>                 formal proposal quickly and
>                  > >>>>> then we can follow up with broader discussions.
>                  > >>>>>
>                  > >>>>> Thanks,
>                  > >>>>> Xuefu
>                  > >>>>>
>                  > >>>>>
>                  > >>>>>
>                  > >>>>>
> > >>>>> ------------------------------------------------------------------
> > >>>>> Sender:vino yang <yanghua1127@gmail.com>
> > >>>>> Sent at:2018 Oct 11 (Thu) 09:45
> > >>>>> Recipient:Fabian Hueske <fhueske@gmail.com>
> > >>>>> Cc:dev <dev@flink.apache.org>; Xuefu <xuefu.z@alibaba-inc.com>; user <user@flink.apache.org>
> > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> > >>>>>
> > >>>>> Hi Xuefu,
> > >>>>>
> > >>>>> Appreciate this proposal, and like Fabian, it would look better if you
> > >>>>> can give more details of the plan.
> > >>>>>
> > >>>>> Thanks, vino.
> > >>>>>
> > >>>>> On Wed, Oct 10, 2018 at 5:27 PM, Fabian Hueske <fhueske@gmail.com> wrote:
> > >>>>> Hi Xuefu,
> > >>>>>
> > >>>>> Welcome to the Flink community and thanks for starting this discussion!
> > >>>>> Better Hive integration would be really great!
> > >>>>> Can you go into details of what you are proposing? I can think of a
> > >>>>> couple ways to improve Flink in that regard:
> > >>>>>
> > >>>>> * Support for Hive UDFs
> > >>>>> * Support for Hive metadata catalog
> > >>>>> * Support for HiveQL syntax
> > >>>>> * ???
> > >>>>>
> > >>>>> Best, Fabian
> > >>>>>
> > >>>>> On Tue, Oct 9, 2018 at 19:22, Zhang, Xuefu <xuefu.z@alibaba-inc.com> wrote:
> > >>>>> Hi all,
> > >>>>>
> > >>>>> Along with the community's effort, inside Alibaba we have explored
> > >>>>> Flink's potential as an execution engine not just for stream processing
> > >>>>> but also for batch processing. We are encouraged by our findings and
> > >>>>> have initiated our effort to make Flink's SQL capabilities full-fledged.
> > >>>>> When comparing what's available in Flink to the offerings from
> > >>>>> competitive data processing engines, we identified a major gap in
> > >>>>> Flink: a well integration with Hive ecosystem. This is crucial to the
> > >>>>> success of Flink SQL and batch due to the well-established data
> > >>>>> ecosystem around Hive. Therefore, we have done some initial work along
> > >>>>> this direction but there are still a lot of effort needed.
> > >>>>>
> > >>>>> We have two strategies in mind. The first one is to make Flink SQL
> > >>>>> full-fledged and well-integrated with Hive ecosystem. This is a similar
> > >>>>> approach to what Spark SQL adopted. The second strategy is to make Hive
> > >>>>> itself work with Flink, similar to the proposal in [1]. Each approach
> > >>>>> bears its pros and cons, but they don't need to be mutually exclusive
> > >>>>> with each targeting at different users and use cases. We believe that
> > >>>>> both will promote a much greater adoption of Flink beyond stream
> > >>>>> processing.
> > >>>>>
> > >>>>> We have been focused on the first approach and would like to showcase
> > >>>>> Flink's batch and SQL capabilities with Flink SQL. However, we have
> > >>>>> also planned to start strategy #2 as the follow-up effort.
> > >>>>>
> > >>>>> I'm completely new to Flink (with a short bio [2] below), though many
> > >>>>> of my colleagues here at Alibaba are long-time contributors.
> > >>>>> Nevertheless, I'd like to share our thoughts and invite your early
> > >>>>> feedback. At the same time, I am working on a detailed proposal on
> > >>>>> Flink SQL's integration with Hive ecosystem, which will be also shared
> > >>>>> when ready.
> > >>>>>
> > >>>>> While the ideas are simple, each approach will demand significant
> > >>>>> effort, more than what we can afford. Thus, the input and contributions
> > >>>>> from the communities are greatly welcome and appreciated.
> > >>>>>
> > >>>>> Regards,
> > >>>>>
> > >>>>> Xuefu
> > >>>>>
> > >>>>> References:
> > >>>>>
> > >>>>> [1] https://issues.apache.org/jira/browse/HIVE-10712
> > >>>>> [2] Xuefu Zhang is a long-time open source veteran, worked or working
> > >>>>> on many projects under Apache Foundation, of which he is also an
> > >>>>> honored member. About 10 years ago he worked in the Hadoop team at
> > >>>>> Yahoo where the projects just got started. Later he worked at Cloudera,
> > >>>>> initiating and leading the development of Hive on Spark project in the
> > >>>>> communities and across many organizations. Prior to joining Alibaba, he
> > >>>>> worked at Uber where he promoted Hive on Spark to all Uber's SQL on
> > >>>>> Hadoop workload and significantly improved Uber's cluster efficiency.
> > >>>>>
> > >>>>> --
> > >>>>> "So you have to trust that the dots will somehow connect in your future."
>


Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by Eron Wright <er...@gmail.com>.
Thanks Timo for merging a couple of the PRs. Are you also able to review
the others that I mentioned? Xuefu, I would like to incorporate your
feedback too.

Check out this short demonstration of using a catalog in SQL Client:
https://asciinema.org/a/C8xuAjmZSxCuApgFgZQyeIHuo
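
As a side note for readers skimming the thread, the "catalog as a reusable
library of tables" idea can be sketched in a few lines of plain Java. This is
only an illustration; it deliberately does not use Flink's actual
ExternalCatalog interface, and every class and method name below is invented:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Minimal stand-in for a catalog that bundles named table schemas,
// analogous to packaging TaxiRides/TaxiFares as a reusable library.
// None of these types mirror Flink's real catalog API.
public class InMemoryCatalog {
    private final String name;
    private final Map<String, String[]> tables = new HashMap<>();

    public InMemoryCatalog(String name) {
        this.name = name;
    }

    public String getName() {
        return name;
    }

    // Register a table under this catalog with its column names.
    public void registerTable(String tableName, String... columns) {
        tables.put(tableName, columns);
    }

    // Look up a table's columns; empty if the table is unknown.
    public Optional<String[]> getTable(String tableName) {
        return Optional.ofNullable(tables.get(tableName));
    }

    public static void main(String[] args) {
        InMemoryCatalog taxiData = new InMemoryCatalog("TaxiData");
        taxiData.registerTable("TaxiRides", "rideId", "startTime", "endTime");
        taxiData.registerTable("TaxiFares", "rideId", "fare", "tip");

        // A SQL client could resolve "TaxiData.TaxiRides" through the catalog.
        System.out.println("catalog " + taxiData.getName()
                + " has TaxiRides: " + taxiData.getTable("TaxiRides").isPresent());
    }
}
```

A real catalog interface additionally covers views, functions, and free-form
properties, as discussed later in this thread.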

Thanks again!

On Thu, Jan 3, 2019 at 9:37 AM Eron Wright <er...@gmail.com> wrote:

> Would a couple of folks raise their hands to make a review pass through the
> 6 PRs listed above? It is a lovely stack of PRs that is 'all green' at the
> moment. I would be happy to open follow-on PRs to rapidly align with
> other efforts.
>
> Note that the code is agnostic to the details of the ExternalCatalog
> interface; the code would not be obsolete if/when the catalog interface is
> enhanced as per the design doc.
>
>
>
> On Wed, Jan 2, 2019 at 1:35 PM Eron Wright <er...@gmail.com> wrote:
>
>> I propose that the community review and merge the PRs that I posted, and
>> then evolve the design through 1.8 and beyond. I think having a basic
>> infrastructure in place now will accelerate the effort; do you agree?
>>
>> Thanks again!
>>
>> On Wed, Jan 2, 2019 at 11:20 AM Zhang, Xuefu <xu...@alibaba-inc.com>
>> wrote:
>>
>>> Hi Eron,
>>>
>>> Happy New Year!
>>>
>>> Thank you very much for your contribution, especially during the
>>> holidays. While I'm encouraged by your work, I'd also like to share my
>>> thoughts on how to move forward.
>>>
>>> First, please note that the design discussion is still being finalized, and
>>> we expect some moderate changes, especially around TableFactories. Another
>>> pending change is our decision to shy away from Scala, which will impact
>>> our work.
>>>
>>> Secondly, while your work seems to be about plugging catalog definitions
>>> into the execution environment, which is less impacted by the TableFactory
>>> change, I did notice some duplication between your work and ours. This is no
>>> big deal, but going forward, we should probably have better communication on
>>> the work assignment so as to avoid any possible duplication of work. On the
>>> other hand, I think some of your work is interesting and valuable for
>>> inclusion once we finalize the overall design.
>>>
>>> Thus, please continue your research and experiment and let us know when
>>> you start working on anything so we can better coordinate.
>>>
>>> Thanks again for your interest and contributions.
>>>
>>> Thanks,
>>> Xuefu
>>>
>>>
>>>
>>> ------------------------------------------------------------------
>>> From:Eron Wright <er...@gmail.com>
>>> Sent At:2019 Jan. 1 (Tue.) 18:39
>>> To:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>
>>> Cc:Xiaowei Jiang <xi...@gmail.com>; twalthr <tw...@apache.org>;
>>> piotr <pi...@data-artisans.com>; Fabian Hueske <fh...@gmail.com>;
>>> suez1224 <su...@gmail.com>; Bowen Li <bo...@gmail.com>
>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>
>>> Hi folks, there are clearly some incremental steps to be taken to
>>> introduce catalog support to SQL Client, complementary to what is proposed
>>> in the Flink-Hive Metastore design doc.  I was quietly working on this over
>>> the holidays.   I posted some new sub-tasks, PRs, and sample code
>>> to FLINK-10744.
>>>
>>> What inspired me to get involved is that the catalog interface seems
>>> like a great way to encapsulate a 'library' of Flink tables and functions.
>>> For example, the NYC Taxi dataset (TaxiRides, TaxiFares, various UDFs) may
>>> be nicely encapsulated as a catalog (TaxiData).   Such a library should be
>>> fully consumable in SQL Client.
>>>
>>> I implemented the above.  Some highlights:
>>>
>>> 1. A fully-worked example of using the Taxi dataset in SQL Client via an
>>> environment file.
>>> - an ASCII video showing the SQL Client in action:
>>> https://asciinema.org/a/C8xuAjmZSxCuApgFgZQyeIHuo
>>>
>>> - the corresponding environment file (will be even more concise once
>>> 'FLINK-10696 Catalog UDFs' is merged):
>>> https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/dist/conf/sql-client-defaults.yaml
>>>
>>> - the typed API for standalone table applications:
>>> https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/java/com/dataartisans/flinktraining/examples/table_java/examples/ViaCatalog.java#L50
>>>
>>> 2. Implementation of the core catalog descriptor and factory.  I realize
>>> that some renames may later occur as per the design doc, and would be happy
>>> to do that as a follow-up.
>>> https://github.com/apache/flink/pull/7390
>>>
>>> 3. Implementation of a connect-style API on TableEnvironment to use
>>> catalog descriptor.
>>> https://github.com/apache/flink/pull/7392
>>>
>>> 4. Integration into SQL-Client's environment file:
>>> https://github.com/apache/flink/pull/7393
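>>>
>>> As a rough idea, the catalog entry in such an environment file could look
>>> something like the following (illustrative only; these field names are
>>> guesses, not the final descriptor syntax, which may change as the design
>>> doc settles):
>>>
>>> ```yaml
>>> catalogs:
>>>   - name: taxidata                        # catalog name visible in SQL Client
>>>     catalog:
>>>       type: custom                        # hypothetical type key
>>>       class: com.example.TaxiDataCatalog  # hypothetical factory class
>>> ```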
>>>
>>> I realize that the overall Hive integration is still evolving, but I
>>> believe that these PRs are a good stepping stone. Here's the list (in
>>> bottom-up order):
>>> - https://github.com/apache/flink/pull/7386
>>> - https://github.com/apache/flink/pull/7388
>>> - https://github.com/apache/flink/pull/7389
>>> - https://github.com/apache/flink/pull/7390
>>> - https://github.com/apache/flink/pull/7392
>>> - https://github.com/apache/flink/pull/7393
>>>
>>> Thanks and enjoy 2019!
>>> Eron W
>>>
>>>
>>> On Sun, Nov 18, 2018 at 3:04 PM Zhang, Xuefu <xu...@alibaba-inc.com>
>>> wrote:
>>> Hi Xiaowei,
>>>
>>> Thanks for bringing up the question. In the current design, the
>>> properties for meta objects are meant to cover anything that's specific to
>>> a particular catalog and agnostic to Flink. Anything that is common (such
>>> as the schema for tables, query text for views, and UDF classname) is
>>> abstracted as members of the respective classes. However, this is still in
>>> discussion, and Timo and I will go over this and provide an update.
>>>
>>> Please note that UDF is a little more involved than what the current
>>> design doc shows. I'm still refining this part.
>>>
>>> Thanks,
>>> Xuefu
>>>
>>>
>>> ------------------------------------------------------------------
>>> Sender:Xiaowei Jiang <xi...@gmail.com>
>>> Sent at:2018 Nov 18 (Sun) 15:17
>>> Recipient:dev <de...@flink.apache.org>
>>> Cc:Xuefu <xu...@alibaba-inc.com>; twalthr <tw...@apache.org>; piotr
>>> <pi...@data-artisans.com>; Fabian Hueske <fh...@gmail.com>; suez1224 <
>>> suez1224@gmail.com>
>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>
>>> Thanks Xuefu for the detailed design doc! One question on the properties
>>> associated with the catalog objects: are we going to leave them completely
>>> free form, or are we going to set some standard for that? I think that the
>>> answer may depend on whether we want to explore catalog-specific optimization
>>> opportunities. In any case, I think that it might be helpful to
>>> standardize as much as possible into strongly typed classes and leave
>>> these properties for catalog-specific things. But I think that we can do it
>>> in steps.
>>>
>>> Xiaowei
>>> On Fri, Nov 16, 2018 at 4:00 AM Bowen Li <bo...@gmail.com> wrote:
>>> Thanks for continuing to improve the overall design, Xuefu! It looks quite
>>>  good to me now.
>>>
>>>  It would be nice if the cc-ed Flink committers could help to review and
>>> confirm!
>>>
>>>
>>>
>>>  One minor suggestion: Since the last section of the design doc already
>>> touches
>>>  some new SQL statements, shall we add another section to our doc and
>>>  formalize the new SQL statements in SQL Client and TableEnvironment that
>>>  will come along naturally with our design? Here are some that the
>>>  design doc mentioned and some that I came up with:
>>>
>>>  To be added:
>>>
>>>     - USE <catalog> - set default catalog
>>>     - USE <catalog.schema> - set default schema
>>>     - SHOW CATALOGS - show all registered catalogs
>>>     - SHOW SCHEMAS [FROM catalog] - list schemas in the current default
>>>     catalog or the specified catalog
>>>     - DESCRIBE VIEW view - show the view's definition in CatalogView
>>>     - SHOW VIEWS [FROM schema/catalog.schema] - show views from current
>>> or a
>>>     specified schema.
>>>
>>>     (DDLs that can be addressed by either our design or Shuyi's DDL
>>> design)
>>>
>>>     - CREATE/DROP/ALTER SCHEMA schema
>>>     - CREATE/DROP/ALTER CATALOG catalog
>>>
>>>  To be modified:
>>>
>>>     - SHOW TABLES [FROM schema/catalog.schema] - show tables from
>>> current or
>>>     a specified schema. Add 'from schema' to existing 'SHOW TABLES'
>>> statement
>>>     - SHOW FUNCTIONS [FROM schema/catalog.schema] - show functions from
>>>     current or a specified schema. Add 'from schema' to the existing 'SHOW
>>> FUNCTIONS'
>>>     statement
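>>>
>>>  A possible SQL Client session using the statements above (illustrative
>>>  only; the catalog, schema, and view names are invented, and the exact
>>>  syntax is still to be settled in the design doc):
>>>
>>>  ```sql
>>>  -- pick a default catalog, then a default schema
>>>  USE hive_catalog;
>>>  USE hive_catalog.sales_db;
>>>
>>>  -- discovery
>>>  SHOW CATALOGS;
>>>  SHOW SCHEMAS FROM hive_catalog;
>>>  SHOW TABLES FROM hive_catalog.sales_db;
>>>  SHOW FUNCTIONS FROM hive_catalog.sales_db;
>>>
>>>  -- inspect a view's definition
>>>  DESCRIBE VIEW daily_totals;
>>>  ```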
>>>
>>>
>>>  Thanks, Bowen
>>>
>>>
>>>
>>>  On Wed, Nov 14, 2018 at 10:39 PM Zhang, Xuefu <xu...@alibaba-inc.com>
>>>  wrote:
>>>
>>>  > Thanks, Bowen, for catching the error. I have granted comment
>>> permission
>>>  > with the link.
>>>  >
>>>  > I also updated the doc with the latest class definitions. Everyone is
>>>  > encouraged to review and comment.
>>>  >
>>>  > Thanks,
>>>  > Xuefu
>>>  >
>>>  > ------------------------------------------------------------------
>>>  > Sender:Bowen Li <bo...@gmail.com>
>>>  > Sent at:2018 Nov 14 (Wed) 06:44
>>>  > Recipient:Xuefu <xu...@alibaba-inc.com>
>>>  > Cc:piotr <pi...@data-artisans.com>; dev <de...@flink.apache.org>; Shuyi
>>>  > Chen <su...@gmail.com>
>>>  > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>  >
>>>  > Hi Xuefu,
>>>  >
>>>  > Currently the new design doc
>>>  > <
>>> https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit
>>> >
>>>  > is in "view only" mode, and people cannot leave comments. Can you
>>> please
>>>  > change it to "can comment" or "can edit" mode?
>>>  >
>>>  > Thanks, Bowen
>>>  >
>>>  >
>>>  > On Mon, Nov 12, 2018 at 9:51 PM Zhang, Xuefu <xuefu.z@alibaba-inc.com
>>> >
>>>  > wrote:
>>>  > Hi Piotr
>>>  >
>>>  > I have extracted the API portion of  the design and the google doc is
>>> here
>>>  > <
>>> https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit?usp=sharing
>>> >.
>>>  > Please review and provide your feedback.
>>>  >
>>>  > Thanks,
>>>  > Xuefu
>>>  >
>>>  > ------------------------------------------------------------------
>>>  > Sender:Xuefu <xu...@alibaba-inc.com>
>>>  > Sent at:2018 Nov 12 (Mon) 12:43
>>>  > Recipient:Piotr Nowojski <pi...@data-artisans.com>; dev <
>>>  > dev@flink.apache.org>
>>>  > Cc:Bowen Li <bo...@gmail.com>; Shuyi Chen <su...@gmail.com>
>>>  > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>  >
>>>  > Hi Piotr,
>>>  >
>>>  > That sounds good to me. Let's close all the open questions (there
>>> are a
>>>  > couple of them) in the Google doc, and I should be able to quickly
>>> split
>>>  > it into the three proposals as you suggested.
>>>  >
>>>  > Thanks,
>>>  > Xuefu
>>>  >
>>>  > ------------------------------------------------------------------
>>>  > Sender:Piotr Nowojski <pi...@data-artisans.com>
>>>  > Sent at:2018 Nov 9 (Fri) 22:46
>>>  > Recipient:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>
>>>  > Cc:Bowen Li <bo...@gmail.com>; Shuyi Chen <su...@gmail.com>
>>>  > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>  >
>>>  > Hi,
>>>  >
>>>  >
>>>  > Yes, it seems like the best solution. Maybe someone else can also
>>> suggest whether we can split it further? Maybe changes in the interface in one
>>> doc, reading from the Hive metastore in another, and finally storing our meta
>>> information in the Hive metastore?
>>>  >
>>>  > Piotrek
>>>  >
>>>  > > On 9 Nov 2018, at 01:44, Zhang, Xuefu <xu...@alibaba-inc.com>
>>> wrote:
>>>  > >
>>>  > > Hi Piotr,
>>>  > >
>>>  > > That seems to be good idea!
>>>  > >
>>>  >
>>>  > > Since the google doc for the design is currently under extensive
>>> review, I will leave it as it is for now. However, I'll convert it to two
>>> different FLIPs when the time comes.
>>>  > >
>>>  > > How does it sound to you?
>>>  > >
>>>  > > Thanks,
>>>  > > Xuefu
>>>  > >
>>>  > >
>>>  > > ------------------------------------------------------------------
>>>  > > Sender:Piotr Nowojski <pi...@data-artisans.com>
>>>  > > Sent at:2018 Nov 9 (Fri) 02:31
>>>  > > Recipient:dev <de...@flink.apache.org>
>>>  > > Cc:Bowen Li <bo...@gmail.com>; Xuefu <xuefu.z@alibaba-inc.com
>>>  > >; Shuyi Chen <su...@gmail.com>
>>>  > > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>  > >
>>>  > > Hi,
>>>  > >
>>>  >
>>>  > > Maybe we should split this topic (and the design doc) into a couple
>>> of smaller, hopefully independent ones. The questions that you have asked
>>> Fabian, for example, have very little to do with reading metadata from the Hive
>>> Meta Store.
>>>  > >
>>>  > > Piotrek
>>>  > >
>>>  > >> On 7 Nov 2018, at 14:27, Fabian Hueske <fh...@gmail.com> wrote:
>>>  > >>
>>>  > >> Hi Xuefu and all,
>>>  > >>
>>>  > >> Thanks for sharing this design document!
>>>  >
>>>  > >> I'm very much in favor of restructuring / reworking the catalog
>>> handling in
>>>  > >> Flink SQL as outlined in the document.
>>>  >
>>>  > >> Most changes described in the design document seem to be rather
>>> general and
>>>  > >> not specifically related to the Hive integration.
>>>  > >>
>>>  >
>>>  > >> IMO, there are some aspects, especially those at the boundary of
>>> Hive and
>>>  > >> Flink, that need a bit more discussion. For example
>>>  > >>
>>>  > >> * What does it take to make Flink schema compatible with Hive
>>> schema?
>>>  > >> * How will Flink tables (descriptors) be stored in HMS?
>>>  > >> * How do both Hive catalogs differ? Could they be integrated into
>>> to a
>>>  > >> single one? When to use which one?
>>>  >
>>>  > >> * What meta information is provided by HMS? What of this can be
>>> leveraged
>>>  > >> by Flink?
>>>  > >>
>>>  > >> Thank you,
>>>  > >> Fabian
>>>  > >>
>>>  > >> On Fri, Nov 2, 2018 at 00:31, Bowen Li <
>>> bowenli86@gmail.com
>>>  > >:
>>>  > >>
>>>  > >>> After taking a look at how other discussion threads work, I think
>>> it's
>>>  > >>> actually fine just keep our discussion here. It's up to you,
>>> Xuefu.
>>>  > >>>
>>>  > >>> The google doc LGTM. I left some minor comments.
>>>  > >>>
>>>  > >>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li <bo...@gmail.com>
>>> wrote:
>>>  > >>>
>>>  > >>>> Hi all,
>>>  > >>>>
>>>  > >>>> As Xuefu has published the design doc on google, I agree with
>>> Shuyi's
>>>  >
>>>  > >>>> suggestion that we probably should start a new email thread like
>>> "[DISCUSS]
>>>  >
>>>  > >>>> ... Hive integration design ..." on only the dev mailing list for
>>> community
>>>  > >>>> devs to review. The current thread sends to both dev and user
>>> list.
>>>  > >>>>
>>>  >
>>>  > >>>> This email thread is more like validating the general idea and
>>> direction
>>>  >
>>>  > >>>> with the community, and it's been pretty long and crowded so
>>> far. Since
>>>  >
>>>  > >>>> everyone is in favor of the idea, we can move forward with another
>>> thread to
>>>  > >>>> discuss and finalize the design.
>>>  > >>>>
>>>  > >>>> Thanks,
>>>  > >>>> Bowen
>>>  > >>>>
>>>  > >>>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu <
>>>  > xuefu.z@alibaba-inc.com>
>>>  > >>>> wrote:
>>>  > >>>>
>>>  > >>>>> Hi Shuyi,
>>>  > >>>>>
>>>  >
>>>  > >>>>> Good idea. Actually the PDF was converted from a google doc.
>>> Here is its
>>>  > >>>>> link:
>>>  > >>>>>
>>>  > >>>>>
>>>  >
>>> https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
>>>  > >>>>> Once we reach an agreement, I can convert it to a FLIP.
>>>  > >>>>>
>>>  > >>>>> Thanks,
>>>  > >>>>> Xuefu
>>>  > >>>>>
>>>  > >>>>>
>>>  > >>>>>
>>>  > >>>>>
>>> ------------------------------------------------------------------
>>>  > >>>>> Sender:Shuyi Chen <su...@gmail.com>
>>>  > >>>>> Sent at:2018 Nov 1 (Thu) 02:47
>>>  > >>>>> Recipient:Xuefu <xu...@alibaba-inc.com>
>>>  > >>>>> Cc:vino yang <ya...@gmail.com>; Fabian Hueske <
>>>  > fhueske@gmail.com>;
>>>  > >>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>>>  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive
>>> ecosystem
>>>  > >>>>>
>>>  > >>>>> Hi Xuefu,
>>>  > >>>>>
>>>  >
>>>  > >>>>> Thanks a lot for driving this big effort. I would suggest
>>> converting your
>>>  >
>>>  > >>>>> proposal and design doc into a google doc, and share it on the
>>> dev mailing
>>>  >
>>>  > >>>>> list for the community to review and comment with a title like
>>> "[DISCUSS] ...
>>>  >
>>>  > >>>>> Hive integration design ...". Once approved, we can document
>>> it as a FLIP
>>>  >
>>>  > >>>>> (Flink Improvement Proposals), and use JIRAs to track the
>>> implementations.
>>>  > >>>>> What do you think?
>>>  > >>>>>
>>>  > >>>>> Shuyi
>>>  > >>>>>
>>>  > >>>>> On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu <
>>>  > xuefu.z@alibaba-inc.com>
>>>  > >>>>> wrote:
>>>  > >>>>> Hi all,
>>>  > >>>>>
>>>  > >>>>> I have also shared a design doc on Hive metastore integration
>>> that is
>>>  >
>>>  > >>>>> attached here and also to FLINK-10556[1]. Please kindly review
>>> and share
>>>  > >>>>> your feedback.
>>>  > >>>>>
>>>  > >>>>>
>>>  > >>>>> Thanks,
>>>  > >>>>> Xuefu
>>>  > >>>>>
>>>  > >>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
>>>  > >>>>>
>>> ------------------------------------------------------------------
>>>  > >>>>> Sender:Xuefu <xu...@alibaba-inc.com>
>>>  > >>>>> Sent at:2018 Oct 25 (Thu) 01:08
>>>  > >>>>> Recipient:Xuefu <xu...@alibaba-inc.com>; Shuyi Chen <
>>>  > >>>>> suez1224@gmail.com>
>>>  > >>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <
>>>  > fhueske@gmail.com>;
>>>  > >>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>>>  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive
>>> ecosystem
>>>  > >>>>>
>>>  > >>>>> Hi all,
>>>  > >>>>>
>>>  > >>>>> To wrap up the discussion, I have attached a PDF describing the
>>>  >
>>>  > >>>>> proposal, which is also attached to FLINK-10556 [1]. Please
>>> feel free to
>>>  > >>>>> watch that JIRA to track the progress.
>>>  > >>>>>
>>>  > >>>>> Please also let me know if you have additional comments or
>>> questions.
>>>  > >>>>>
>>>  > >>>>> Thanks,
>>>  > >>>>> Xuefu
>>>  > >>>>>
>>>  > >>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
>>>  > >>>>>
>>>  > >>>>>
>>>  > >>>>>
>>> ------------------------------------------------------------------
>>>  > >>>>> Sender:Xuefu <xu...@alibaba-inc.com>
>>>  > >>>>> Sent at:2018 Oct 16 (Tue) 03:40
>>>  > >>>>> Recipient:Shuyi Chen <su...@gmail.com>
>>>  > >>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <
>>>  > fhueske@gmail.com>;
>>>  > >>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>>>  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive
>>> ecosystem
>>>  > >>>>>
>>>  > >>>>> Hi Shuyi,
>>>  > >>>>>
>>>  >
>>>  > >>>>> Thank you for your input. Yes, I agree with a phased approach
>>> and would like
>>>  >
>>>  > >>>>> to move forward fast. :) We did some work internally on DDL
>>> utilizing babel
>>>  > >>>>> parser in Calcite. While babel makes Calcite's grammar
>>> extensible, at
>>>  > >>>>> first impression it still seems too cumbersome for a project
>>> when too
>>>  >
many extensions are made. It's even challenging to find where
>>> the extension
>>>  >
>>>  > >>>>> is needed! It would be certainly better if Calcite can
>>> magically support
>>>  >
>>>  > >>>>> Hive QL by just turning on a flag, such as that for MYSQL_5. I
>>> can also
>>>  >
>>>  > >>>>> see that this could mean a lot of work on Calcite.
>>> Nevertheless, I will
>>>  >
>>>  > >>>>> bring up the discussion over there and to see what their
>>> community thinks.
>>>  > >>>>>
>>>  > >>>>> Would you mind sharing more info about the proposal on DDL that you
>>>  > >>>>> mentioned? We can certainly collaborate on this.
>>>  > >>>>>
>>>  > >>>>> Thanks,
>>>  > >>>>> Xuefu
>>>  > >>>>>
>>>  > >>>>>
>>> ------------------------------------------------------------------
>>>  > >>>>> Sender:Shuyi Chen <su...@gmail.com>
>>>  > >>>>> Sent at:2018 Oct 14 (Sun) 08:30
>>>  > >>>>> Recipient:Xuefu <xu...@alibaba-inc.com>
>>>  > >>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <
>>>  > fhueske@gmail.com>;
>>>  > >>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>>>  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive
>>> ecosystem
>>>  > >>>>>
>>>  > >>>>> Welcome to the community and thanks for the great proposal,
>>> Xuefu! I
>>>  >
>>>  > >>>>> think the proposal can be divided into 2 stages: making Flink
>>> support
>>>  >
>>>  > >>>>> Hive features, and making Hive work with Flink. I agreed with
>>> Timo on
>>>  >
>>>  > >>>>> starting with a smaller scope, so we can make progress faster.
>>> As for [6],
>>>  >
>>>  > >>>>> a proposal for DDL is already in progress, and will come after
>>> the unified
>>>  >
>>>  > >>>>> SQL connector API is done. For supporting Hive syntax, we might
>>> need to
>>>  > >>>>> work with the Calcite community, and a recent effort called
>>> babel (
>>>  > >>>>> https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite
>>> might
>>>  > >>>>> help here.
>>>  > >>>>>
>>>  > >>>>> Thanks
>>>  > >>>>> Shuyi
>>>  > >>>>>
>>>  > >>>>> On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <
>>>  > xuefu.z@alibaba-inc.com>
>>>  > >>>>> wrote:
>>>  > >>>>> Hi Fabian/Vino,
>>>  > >>>>>
>>>  >
>>>  > >>>>> Thank you very much for your encouragement and inquiry. Sorry that
>>> I didn't
>>>  >
>>>  > >>>>> see Fabian's email until I read Vino's response just now.
>>> (Somehow Fabian's
>>>  > >>>>> went to the spam folder.)
>>>  > >>>>>
>>>  >
>>>  > >>>>> My proposal contains long-term and short-term goals.
>>> Nevertheless, the
>>>  > >>>>> effort will focus on the following areas, including Fabian's
>>> list:
>>>  > >>>>>
>>>  > >>>>> 1. Hive metastore connectivity - This covers both read/write
>>> access,
>>>  >
>>>  > >>>>> which means Flink can make full use of Hive's metastore as its
>>> catalog (at
>>>  > >>>>> least for batch, but this can extend to streaming as well).
>>>  >
>>>  > >>>>> 2. Metadata compatibility - Objects (databases, tables,
>>> partitions, etc)
>>>  >
>>>  > >>>>> created by Hive can be understood by Flink and the reverse
>>> direction is
>>>  > >>>>> true also.
>>>  > >>>>> 3. Data compatibility - Similar to #2, data produced by Hive
>>> can be
>>>  > >>>>> consumed by Flink and vice versa.
>>>  >
>>>  > >>>>> 4. Support Hive UDFs - For all Hive's native udfs, Flink either
>>> provides
>>>  > >>>>> its own implementation or make Hive's implementation work in
>>> Flink.
>>>  > >>>>> Further, for user created UDFs in Hive, Flink SQL should
>>> provide a
>>>  >
>>>  > >>>>> mechanism allowing users to import them into Flink without any
>>> code change
>>>  > >>>>> required.
>>>  > >>>>> 5. Data types - Flink SQL should support all data types that
>>> are
>>>  > >>>>> available in Hive.
>>>  > >>>>> 6. SQL Language - Flink SQL should support SQL standard (such as
>>>  >
>>>  > >>>>> SQL2003) with extension to support Hive's syntax and language
>>> features,
>>>  > >>>>> around DDL, DML, and SELECT queries.
>>>  >
>>>  > >>>>> 7. SQL CLI - this is currently being developed in Flink, but more
>>> effort is
>>>  > >>>>> needed.
>>>  >
>>>  > >>>>> 8. Server - provide a server that's compatible with Hive's
>>> HiveServer2
>>>  >
>>>  > >>>>> in thrift APIs, such that HiveServer2 users can reuse their
>>> existing client
>>>  > >>>>> (such as beeline) but connect to Flink's thrift server instead.
>>>  >
>>>  > >>>>> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC
>>> drivers for
>>>  > >>>>> other applications to connect to its thrift server.
>>>  > >>>>> 10. Support other users' customizations in Hive, such as Hive
>>> Serdes,
>>>  > >>>>> storage handlers, etc.
>>>  >
>>>  > >>>>> 11. Better task failure tolerance and task scheduling at Flink
>>> runtime.
>>>  > >>>>>
>>>  > >>>>> As you can see, achieving all those requires significant effort
>>>  >
>>>  > >>>>> across all layers in Flink. However, a short-term goal could
>>> include only
>>>  >
>>>  > >>>>> core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller
>>> scope (such as
>>>  > >>>>> #3, #6).
>>>  > >>>>>
>>>  >
>>>  > >>>>> Please share your further thoughts. If we generally agree that
>>> this is
>>>  >
>>>  > >>>>> the right direction, I could come up with a formal proposal
>>> quickly and
>>>  > >>>>> then we can follow up with broader discussions.
>>>  > >>>>>
>>>  > >>>>> Thanks,
>>>  > >>>>> Xuefu
>>>  > >>>>>
>>>  > >>>>>
>>>  > >>>>>
>>>  > >>>>>
>>> ------------------------------------------------------------------
>>>  > >>>>> Sender:vino yang <ya...@gmail.com>
>>>  > >>>>> Sent at:2018 Oct 11 (Thu) 09:45
>>>  > >>>>> Recipient:Fabian Hueske <fh...@gmail.com>
>>>  > >>>>> Cc:dev <de...@flink.apache.org>; Xuefu <xuefu.z@alibaba-inc.com
>>>  > >; user <
>>>  > >>>>> user@flink.apache.org>
>>>  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive
>>> ecosystem
>>>  > >>>>>
>>>  > >>>>> Hi Xuefu,
>>>  > >>>>>
>>>  >
>>>  > >>>>> I appreciate this proposal, and like Fabian, I think it would be
>>> better if you
>>>  > >>>>> could give more details of the plan.
>>>  > >>>>>
>>>  > >>>>> Thanks, vino.
>>>  > >>>>>
>>>  > >>>>> Fabian Hueske <fh...@gmail.com> 于2018年10月10日周三 下午5:27写道:
>>>  > >>>>> Hi Xuefu,
>>>  > >>>>>
>>>  >
>>>  > >>>>> Welcome to the Flink community and thanks for starting this
>>> discussion!
>>>  > >>>>> Better Hive integration would be really great!
>>>  > >>>>> Can you go into details of what you are proposing? I can think
>>> of a
>>>  > >>>>> couple ways to improve Flink in that regard:
>>>  > >>>>>
>>>  > >>>>> * Support for Hive UDFs
>>>  > >>>>> * Support for Hive metadata catalog
>>>  > >>>>> * Support for HiveQL syntax
>>>  > >>>>> * ???
>>>  > >>>>>
>>>  > >>>>> Best, Fabian
>>>  > >>>>>
>>>  > >>>>> Am Di., 9. Okt. 2018 um 19:22 Uhr schrieb Zhang, Xuefu <
>>>  > >>>>> xuefu.z@alibaba-inc.com>:
>>>  > >>>>> Hi all,
>>>  > >>>>>
>>>  > >>>>> Along with the community's effort, inside Alibaba we have
>>> explored
>>>  >
>>>  > >>>>> Flink's potential as an execution engine not just for stream
>>> processing but
>>>  > >>>>> also for batch processing. We are encouraged by our findings
>>> and have
>>>  >
>>>  > >>>>> initiated our effort to make Flink's SQL capabilities
>>> full-fledged. When
>>>  >
>>>  > >>>>> comparing what's available in Flink to the offerings from
>>> competitive data
>>>  >
>>>  > >>>>> processing engines, we identified a major gap in Flink: solid
>>> integration
>>>  >
>>>  > >>>>> with the Hive ecosystem. This is crucial to the success of Flink
>>> SQL and batch
>>>  >
>>>  > >>>>> due to the well-established data ecosystem around Hive.
>>> Therefore, we have
>>>  >
>>>  > >>>>> done some initial work along this direction but there are still
>>> a lot of
>>>  > >>>>> effort needed.
>>>  > >>>>>
>>>  > >>>>> We have two strategies in mind. The first one is to make Flink
>>> SQL
>>>  >
>>> full-fledged and well-integrated with the Hive ecosystem. This is a
>>> similar
>>>  >
>>>  > >>>>> approach to what Spark SQL adopted. The second strategy is to
>>> make Hive
>>>  >
>>>  > >>>>> itself work with Flink, similar to the proposal in [1]. Each
>>> approach bears
>>>  >
>>>  > >>>>> its pros and cons, but they don’t need to be mutually exclusive
>>> with each
>>>  > >>>>> targeting different users and use cases. We believe that
>>> both will
>>>  > >>>>> promote a much greater adoption of Flink beyond stream
>>> processing.
>>>  > >>>>>
>>>  > >>>>> We have been focused on the first approach and would like to
>>> showcase
>>>  >
>>>  > >>>>> Flink's batch and SQL capabilities with Flink SQL. However, we
>>> have also
>>>  > >>>>> planned to start strategy #2 as the follow-up effort.
>>>  > >>>>>
>>>  >
>>>  > >>>>> I'm completely new to Flink (with a short bio [2] below),
>>> though many
>>>  >
>>>  > >>>>> of my colleagues here at Alibaba are long-time contributors.
>>> Nevertheless,
>>>  >
>>>  > >>>>> I'd like to share our thoughts and invite your early feedback.
>>> At the same
>>>  >
>>>  > >>>>> time, I am working on a detailed proposal on Flink SQL's
>>> integration with
>>>  > >>>>> Hive ecosystem, which will also be shared when ready.
>>>  > >>>>>
>>>  > >>>>> While the ideas are simple, each approach will demand
>>> significant
>>>  >
>>>  > >>>>> effort, more than what we can afford. Thus, the input and
>>> contributions
>>>  > >>>>> from the communities are greatly welcome and appreciated.
>>>  > >>>>>
>>>  > >>>>> Regards,
>>>  > >>>>>
>>>  > >>>>>
>>>  > >>>>> Xuefu
>>>  > >>>>>
>>>  > >>>>> References:
>>>  > >>>>>
>>>  > >>>>> [1] https://issues.apache.org/jira/browse/HIVE-10712
>>>  >
>>>  > >>>>> [2] Xuefu Zhang is a long-time open source veteran who has worked
>>> or is working on
>>>  > >>>>> many projects under the Apache Software Foundation, of which he is also an
>>> honored
>>>  >
>>>  > >>>>> member. About 10 years ago he worked in the Hadoop team at
>>> Yahoo where the
>>>  >
>>>  > >>>>> projects just got started. Later he worked at Cloudera,
>>> initiating and
>>>  >
>>>  > >>>>> leading the development of Hive on Spark project in the
>>> communities and
>>>  >
>>>  > >>>>> across many organizations. Prior to joining Alibaba, he worked
>>> at Uber
>>>  >
>>>  > >>>>> where he promoted Hive on Spark to all Uber's SQL on Hadoop
>>> workload and
>>>  > >>>>> significantly improved Uber's cluster efficiency.
>>>  > >>>>>
>>>  > >>>>>
>>>  > >>>>>
>>>  > >>>>>
>>>  > >>>>> --
>>>  >
>>>  > >>>>> "So you have to trust that the dots will somehow connect in
>>> your future."
>>>  > >>>>>
>>>  > >>>>>
>>>  > >>>>> --
>>>  >
>>>  > >>>>> "So you have to trust that the dots will somehow connect in
>>> your future."
>>>  > >>>>>
>>>  >
>>>  >
>>>
>>>

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by Eron Wright <er...@gmail.com>.
Would a couple of folks raise their hands to make a review pass through the
6 PRs listed above? It is a lovely stack of PRs that is 'all green' at the
moment. I would be happy to open follow-on PRs to rapidly align with other
efforts.

Note that the code is agnostic to the details of the ExternalCatalog
interface; the code would not be obsolete if/when the catalog interface is
enhanced as per the design doc.



On Wed, Jan 2, 2019 at 1:35 PM Eron Wright <er...@gmail.com> wrote:

> I propose that the community review and merge the PRs that I posted, and
> then evolve the design through 1.8 and beyond. I think having a basic
> infrastructure in place now will accelerate the effort; do you agree?
>
> Thanks again!
>
> On Wed, Jan 2, 2019 at 11:20 AM Zhang, Xuefu <xu...@alibaba-inc.com>
> wrote:
>
>> Hi Eron,
>>
>> Happy New Year!
>>
>> Thank you very much for your contribution, especially during the
>> holidays. While I'm encouraged by your work, I'd also like to share my
>> thoughts on how to move forward.
>>
>> First, please note that the design discussion is still being finalized, and
>> we expect some moderate changes, especially around TableFactories. Another
>> pending change is our decision to shy away from Scala, which will impact
>> our work.
>>
>> Secondly, while your work seems to be about plugging catalog definitions
>> into the execution environment, which is less impacted by the TableFactory
>> change, I did notice some duplication between your work and ours. This is
>> no big deal, but going forward, we should probably communicate better on
>> work assignments so as to avoid any possible duplication of work. On the
>> other hand, I think some of your work is interesting and valuable for
>> inclusion once we finalize the overall design.
>>
>> Thus, please continue your research and experiment and let us know when
>> you start working on anything so we can better coordinate.
>>
>> Thanks again for your interest and contributions.
>>
>> Thanks,
>> Xuefu
>>
>>
>>
>> ------------------------------------------------------------------
>> From:Eron Wright <er...@gmail.com>
>> Sent At:2019 Jan. 1 (Tue.) 18:39
>> To:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>
>> Cc:Xiaowei Jiang <xi...@gmail.com>; twalthr <tw...@apache.org>;
>> piotr <pi...@data-artisans.com>; Fabian Hueske <fh...@gmail.com>;
>> suez1224 <su...@gmail.com>; Bowen Li <bo...@gmail.com>
>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>
>> Hi folks, there are clearly some incremental steps to be taken to introduce
>> catalog support to SQL Client, complementary to what is proposed in the
>> Flink-Hive Metastore design doc.  I was quietly working on this over the
>> holidays.   I posted some new sub-tasks, PRs, and sample code
>> to FLINK-10744.
>>
>> What inspired me to get involved is that the catalog interface seems like
>> a great way to encapsulate a 'library' of Flink tables and functions.  For
>> example, the NYC Taxi dataset (TaxiRides, TaxiFares, various UDFs) may be
>> nicely encapsulated as a catalog (TaxiData).   Such a library should be
>> fully consumable in SQL Client.
>>
>> I implemented the above.  Some highlights:
>>
>> 1. A fully-worked example of using the Taxi dataset in SQL Client via an
>> environment file.
>> - an ASCII video showing the SQL Client in action:
>> https://asciinema.org/a/C8xuAjmZSxCuApgFgZQyeIHuo
>>
>> - the corresponding environment file (will be even more concise once
>> 'FLINK-10696 Catalog UDFs' is merged):
>> *https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/dist/conf/sql-client-defaults.yaml
>> <https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/dist/conf/sql-client-defaults.yaml>*
>>
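To give a feel for what such an environment file entry could contain, here is a rough, hypothetical sketch; the key names (`type`, `class`, `properties`) and all values are assumptions for illustration, not the format from the linked PRs:

```yaml
# Hypothetical sketch of a catalog entry in a SQL Client environment file.
# Key names and values are illustrative; the actual format is defined by
# the catalog-descriptor design under discussion.
catalogs:
  - name: taxidata                # catalog name referenced from queries
    catalog:
      type: custom                # assumed discriminator for a catalog factory
      class: com.example.TaxiDataCatalogFactory  # hypothetical factory class
      properties:
        data-path: /tmp/taxi      # free-form, catalog-specific property
```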
>> - the typed API for standalone table applications:
>> *https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/java/com/dataartisans/flinktraining/examples/table_java/examples/ViaCatalog.java#L50
>> <https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/java/com/dataartisans/flinktraining/examples/table_java/examples/ViaCatalog.java#L50>*
>>
>> 2. Implementation of the core catalog descriptor and factory.  I realize
>> that some renames may later occur as per the design doc, and would be happy
>> to do that as a follow-up.
>> https://github.com/apache/flink/pull/7390
>>
>> 3. Implementation of a connect-style API on TableEnvironment to use
>> catalog descriptor.
>> https://github.com/apache/flink/pull/7392
>>
>> 4. Integration into SQL-Client's environment file:
>> https://github.com/apache/flink/pull/7393
>>
>> I realize that the overall Hive integration is still evolving, but I
>> believe that these PRs are a good stepping stone. Here's the list (in
>> bottom-up order):
>> - https://github.com/apache/flink/pull/7386
>> - https://github.com/apache/flink/pull/7388
>> - https://github.com/apache/flink/pull/7389
>> - https://github.com/apache/flink/pull/7390
>> - https://github.com/apache/flink/pull/7392
>> - https://github.com/apache/flink/pull/7393
>>
>> Thanks and enjoy 2019!
>> Eron W
>>
>>
>> On Sun, Nov 18, 2018 at 3:04 PM Zhang, Xuefu <xu...@alibaba-inc.com>
>> wrote:
>> Hi Xiaowei,
>>
>> Thanks for bringing up the question. In the current design, the
>> properties for meta objects are meant to cover anything that's specific to
>> a particular catalog and agnostic to Flink. Anything that is common (such
>> as schema for tables, query text for views, and UDF classname) is
>> abstracted into members of the respective classes. However, this is still in
>> discussion, and Timo and I will go over this and provide an update.
>>
>> Please note that UDF is a little more involved than what the current
>> design doc shows. I'm still refining this part.
>>
>> Thanks,
>> Xuefu
>>
>>
>> ------------------------------------------------------------------
>> Sender:Xiaowei Jiang <xi...@gmail.com>
>> Sent at:2018 Nov 18 (Sun) 15:17
>> Recipient:dev <de...@flink.apache.org>
>> Cc:Xuefu <xu...@alibaba-inc.com>; twalthr <tw...@apache.org>; piotr <
>> piotr@data-artisans.com>; Fabian Hueske <fh...@gmail.com>; suez1224 <
>> suez1224@gmail.com>
>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>
>> Thanks Xuefu for the detailed design doc! One question on the properties
>> associated with the catalog objects. Are we going to leave them completely
>> free-form, or are we going to set some standard for them? I think the
>> answer may depend on whether we want to explore catalog-specific
>> optimization opportunities. In any case, it might be helpful to
>> standardize as much as possible into strongly typed classes and leave
>> these properties for catalog-specific things. But I think we can do it
>> in steps.
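To make the typed-vs-free-form split concrete, a minimal sketch could look like the following; the class and member names are illustrative only and not part of the actual design doc:

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical sketch (not the proposed Flink API): common metadata such as
// the table name and column names live in strongly typed members, while a
// free-form map is reserved for catalog-specific properties.
public class CatalogTableSketch {
    private final String name;
    private final String[] columnNames;            // common, strongly typed
    private final Map<String, String> properties;  // catalog-specific, free-form

    public CatalogTableSketch(String name, String[] columnNames,
                              Map<String, String> properties) {
        this.name = name;
        this.columnNames = columnNames;
        this.properties = properties;
    }

    public String getName() { return name; }
    public String[] getColumnNames() { return columnNames; }
    public Map<String, String> getProperties() {
        return Collections.unmodifiableMap(properties);
    }

    public static void main(String[] args) {
        CatalogTableSketch table = new CatalogTableSketch(
                "taxi_rides",
                new String[]{"ride_id", "event_time"},
                Map.of("hive.storage.format", "ORC"));  // engine-specific detail
        System.out.println(table.getName() + " has "
                + table.getColumnNames().length + " columns");
    }
}
```

Here the schema is visible to Flink's planner through typed accessors, while anything the planner does not need to understand stays in the properties map.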
>>
>> Xiaowei
>> On Fri, Nov 16, 2018 at 4:00 AM Bowen Li <bo...@gmail.com> wrote:
>> Thanks for continuing to improve the overall design, Xuefu! It looks quite
>>  good to me now.
>>
>>  It would be nice if the cc-ed Flink committers could help review and confirm!
>>
>>
>>
>>  One minor suggestion: Since the last section of the design doc already
>> touches on
>>  some new SQL statements, shall we add another section to our doc and
>>  formalize the new SQL statements in SQL Client and TableEnvironment that
>>  will come along naturally with our design? Here are some that the
>>  design doc mentioned and some that I came up with:
>>
>>  To be added:
>>
>>     - USE <catalog> - set default catalog
>>     - USE <catalog.schema> - set default schema
>>     - SHOW CATALOGS - show all registered catalogs
>>     - SHOW SCHEMAS [FROM catalog] - list schemas in the current default
>>     catalog or the specified catalog
>>     - DESCRIBE VIEW view - show the view's definition in CatalogView
>>     - SHOW VIEWS [FROM schema/catalog.schema] - show views from current
>> or a
>>     specified schema.
>>
>>     (DDLs that can be addressed by either our design or Shuyi's DDL
>> design)
>>
>>     - CREATE/DROP/ALTER SCHEMA schema
>>     - CREATE/DROP/ALTER CATALOG catalog
>>
>>  To be modified:
>>
>>     - SHOW TABLES [FROM schema/catalog.schema] - show tables from current
>> or
>>     a specified schema. Add 'from schema' to existing 'SHOW TABLES'
>> statement
>>     - SHOW FUNCTIONS [FROM schema/catalog.schema] - show functions from
>>     current or a specified schema. Add 'from schema' to existing 'SHOW
>> FUNCTIONS'
>>     statement
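To illustrate how these statements might fit together, a hypothetical SQL Client session could look like the following; the catalog, schema, and view names are made up for the sketch:

```sql
-- Hypothetical session; 'myhive', 'analytics', and 'daily_rides' are
-- illustrative names, not part of the design doc.
SHOW CATALOGS;                 -- list all registered catalogs
USE myhive;                    -- set 'myhive' as the default catalog
SHOW SCHEMAS;                  -- list schemas in the default catalog
USE myhive.analytics;          -- set the default schema
SHOW TABLES FROM analytics;    -- tables in a specific schema
DESCRIBE VIEW daily_rides;     -- show the view's definition
```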
>>
>>
>>  Thanks, Bowen
>>
>>
>>
>>  On Wed, Nov 14, 2018 at 10:39 PM Zhang, Xuefu <xu...@alibaba-inc.com>
>>  wrote:
>>
>>  > Thanks, Bowen, for catching the error. I have granted comment
>> permission
>>  > via the link.
>>  >
>>  > I also updated the doc with the latest class definitions. Everyone is
>>  > encouraged to review and comment.
>>  >
>>  > Thanks,
>>  > Xuefu
>>  >
>>  > ------------------------------------------------------------------
>>  > Sender:Bowen Li <bo...@gmail.com>
>>  > Sent at:2018 Nov 14 (Wed) 06:44
>>  > Recipient:Xuefu <xu...@alibaba-inc.com>
>>  > Cc:piotr <pi...@data-artisans.com>; dev <de...@flink.apache.org>; Shuyi
>>  > Chen <su...@gmail.com>
>>  > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>  >
>>  > Hi Xuefu,
>>  >
>>  > Currently the new design doc
>>  > <
>> https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit
>> >
>>  > is on “view only" mode, and people cannot leave comments. Can you
>> please
>>  > change it to "can comment" or "can edit" mode?
>>  >
>>  > Thanks, Bowen
>>  >
>>  >
>>  > On Mon, Nov 12, 2018 at 9:51 PM Zhang, Xuefu <xu...@alibaba-inc.com>
>>  > wrote:
>>  > Hi Piotr
>>  >
>>  > I have extracted the API portion of  the design and the google doc is
>> here
>>  > <
>> https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit?usp=sharing
>> >.
>>  > Please review and provide your feedback.
>>  >
>>  > Thanks,
>>  > Xuefu
>>  >
>>  > ------------------------------------------------------------------
>>  > Sender:Xuefu <xu...@alibaba-inc.com>
>>  > Sent at:2018 Nov 12 (Mon) 12:43
>>  > Recipient:Piotr Nowojski <pi...@data-artisans.com>; dev <
>>  > dev@flink.apache.org>
>>  > Cc:Bowen Li <bo...@gmail.com>; Shuyi Chen <su...@gmail.com>
>>  > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>  >
>>  > Hi Piotr,
>>  >
>>  > That sounds good to me. Let's close all the open questions (there are
>> a
>>  > couple of them) in the Google doc, and I should be able to quickly
>> split
>>  > it into the three proposals as you suggested.
>>  >
>>  > Thanks,
>>  > Xuefu
>>  >
>>  > ------------------------------------------------------------------
>>  > Sender:Piotr Nowojski <pi...@data-artisans.com>
>>  > Sent at:2018 Nov 9 (Fri) 22:46
>>  > Recipient:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>
>>  > Cc:Bowen Li <bo...@gmail.com>; Shuyi Chen <su...@gmail.com>
>>  > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>  >
>>  > Hi,
>>  >
>>  >
>>  > Yes, it seems like the best solution. Maybe someone else can also
>> suggest whether we can split it further? Maybe changes to the interface in one
>> doc, reading from the Hive metastore in another, and finally storing our meta
>> information in the Hive metastore?
>>  >
>>  > Piotrek
>>  >
>>  > > On 9 Nov 2018, at 01:44, Zhang, Xuefu <xu...@alibaba-inc.com>
>> wrote:
>>  > >
>>  > > Hi Piotr,
>>  > >
>>  > > That seems to be good idea!
>>  > >
>>  >
>>  > > Since the google doc for the design is currently under extensive
>> review, I will leave it as it is for now. However, I'll convert it to two
>> different FLIPs when the time comes.
>>  > >
>>  > > How does it sound to you?
>>  > >
>>  > > Thanks,
>>  > > Xuefu
>>  > >
>>  > >
>>  > > ------------------------------------------------------------------
>>  > > Sender:Piotr Nowojski <pi...@data-artisans.com>
>>  > > Sent at:2018 Nov 9 (Fri) 02:31
>>  > > Recipient:dev <de...@flink.apache.org>
>>  > > Cc:Bowen Li <bo...@gmail.com>; Xuefu <xuefu.z@alibaba-inc.com
>>  > >; Shuyi Chen <su...@gmail.com>
>>  > > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>  > >
>>  > > Hi,
>>  > >
>>  >
>>  > > Maybe we should split this topic (and the design doc) into a couple of
>> smaller ones, hopefully independent. The questions that you have asked
>> Fabian, for example, have very little to do with reading metadata from the
>> Hive Metastore.
>>  > >
>>  > > Piotrek
>>  > >
>>  > >> On 7 Nov 2018, at 14:27, Fabian Hueske <fh...@gmail.com> wrote:
>>  > >>
>>  > >> Hi Xuefu and all,
>>  > >>
>>  > >> Thanks for sharing this design document!
>>  >
>>  > >> I'm very much in favor of restructuring / reworking the catalog
>> handling in
>>  > >> Flink SQL as outlined in the document.
>>  >
>>  > >> Most changes described in the design document seem to be rather
>> general and
>>  > >> not specifically related to the Hive integration.
>>  > >>
>>  >
>>  > >> IMO, there are some aspects, especially those at the boundary of
>> Hive and
>>  > >> Flink, that need a bit more discussion. For example
>>  > >>
>>  > >> * What does it take to make Flink schema compatible with Hive
>> schema?
>>  > >> * How will Flink tables (descriptors) be stored in HMS?
>>  > >> * How do both Hive catalogs differ? Could they be integrated into
>> a
>>  > >> single one? When to use which one?
>>  >
>>  > >> * What meta information is provided by HMS? What of this can be
>> leveraged
>>  > >> by Flink?
>>  > >>
>>  > >> Thank you,
>>  > >> Fabian
>>  > >>
>>  > >> Am Fr., 2. Nov. 2018 um 00:31 Uhr schrieb Bowen Li <
>> bowenli86@gmail.com
>>  > >:
>>  > >>
>>  > >>> After taking a look at how other discussion threads work, I think
>> it's
>>  > >>> actually fine to just keep our discussion here. It's up to you, Xuefu.
>>  > >>>
>>  > >>> The google doc LGTM. I left some minor comments.
>>  > >>>
>>  > >>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li <bo...@gmail.com>
>> wrote:
>>  > >>>
>>  > >>>> Hi all,
>>  > >>>>
>>  > >>>> As Xuefu has published the design doc on google, I agree with
>> Shuyi's
>>  >
>>  > >>>> suggestion that we probably should start a new email thread like
>> "[DISCUSS]
>>  >
>>  > >>>> ... Hive integration design ..." on only the dev mailing list for
>> community
>>  > >>>> devs to review. The current thread sends to both dev and user
>> list.
>>  > >>>>
>>  >
>>  > >>>> This email thread is more like validating the general idea and
>> direction
>>  >
>>  > >>>> with the community, and it's been pretty long and crowded so far.
>> Since
>>  >
>>  > >>>> everyone is in favor of the idea, we can move forward with another
>> thread to
>>  > >>>> discuss and finalize the design.
>>  > >>>>
>>  > >>>> Thanks,
>>  > >>>> Bowen
>>  > >>>>
>>  > >>>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu <
>>  > xuefu.z@alibaba-inc.com>
>>  > >>>> wrote:
>>  > >>>>
>>  > >>>>> Hi Shuyi,
>>  > >>>>>
>>  >
>>  > >>>>> Good idea. Actually the PDF was converted from a google doc.
>> Here is its
>>  > >>>>> link:
>>  > >>>>>
>>  > >>>>>
>>  >
>> https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
>>  > >>>>> Once we reach an agreement, I can convert it to a FLIP.
>>  > >>>>>
>>  > >>>>> Thanks,
>>  > >>>>> Xuefu
>>  > >>>>>
>>  > >>>>>
>>  > >>>>>
>>  > >>>>>
>> ------------------------------------------------------------------
>>  > >>>>> Sender:Shuyi Chen <su...@gmail.com>
>>  > >>>>> Sent at:2018 Nov 1 (Thu) 02:47
>>  > >>>>> Recipient:Xuefu <xu...@alibaba-inc.com>
>>  > >>>>> Cc:vino yang <ya...@gmail.com>; Fabian Hueske <
>>  > fhueske@gmail.com>;
>>  > >>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>>  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive
>> ecosystem
>>  > >>>>>
>>  > >>>>> Hi Xuefu,
>>  > >>>>>
>>  >
>>  > >>>>> Thanks a lot for driving this big effort. I would suggest
>> converting your
>>  >
>>  > >>>>> proposal and design doc into a google doc, and share it on the
>> dev mailing
>>  >
>>  > >>>>> list for the community to review and comment on, with a title like
>> "[DISCUSS] ...
>>  >
>>  > >>>>> Hive integration design ...". Once approved, we can document
>> it as a FLIP
>>  >
>>  > >>>>> (Flink Improvement Proposals), and use JIRAs to track the
>> implementations.
>>  > >>>>> What do you think?
>>  > >>>>>
>>  > >>>>> Shuyi
>>  > >>>>>
>>  > >>>>> On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu <
>>  > xuefu.z@alibaba-inc.com>
>>  > >>>>> wrote:
>>  > >>>>> Hi all,
>>  > >>>>>
>>  > >>>>> I have also shared a design doc on Hive metastore integration
>> that is
>>  >
>>  > >>>>> attached here and also to FLINK-10556[1]. Please kindly review
>> and share
>>  > >>>>> your feedback.
>>  > >>>>>
>>  > >>>>>
>>  > >>>>> Thanks,
>>  > >>>>> Xuefu
>>  > >>>>>
>>  > >>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
>>  > >>>>>
>> ------------------------------------------------------------------
>>  > >>>>> Sender:Xuefu <xu...@alibaba-inc.com>
>>  > >>>>> Sent at:2018 Oct 25 (Thu) 01:08
>>  > >>>>> Recipient:Xuefu <xu...@alibaba-inc.com>; Shuyi Chen <
>>  > >>>>> suez1224@gmail.com>
>>  > >>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <
>>  > fhueske@gmail.com>;
>>  > >>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>>  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive
>> ecosystem
>>  > >>>>>
>>  > >>>>> Hi all,
>>  > >>>>>
>>  > >>>>> To wrap up the discussion, I have attached a PDF describing the
>>  >
>>  > >>>>> proposal, which is also attached to FLINK-10556 [1]. Please feel
>> free to
>>  > >>>>> watch that JIRA to track the progress.
>>  > >>>>>
>>  > >>>>> Please also let me know if you have additional comments or
>> questions.
>>  > >>>>>
>>  > >>>>> Thanks,
>>  > >>>>> Xuefu
>>  > >>>>>
>>  > >>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
>>  > >>>>>
>>  > >>>>>
>>  > >>>>>
>> ------------------------------------------------------------------
>>  > >>>>> Sender:Xuefu <xu...@alibaba-inc.com>
>>  > >>>>> Sent at:2018 Oct 16 (Tue) 03:40
>>  > >>>>> Recipient:Shuyi Chen <su...@gmail.com>
>>  > >>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <
>>  > fhueske@gmail.com>;
>>  > >>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>>  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive
>> ecosystem
>>  > >>>>>
>>  > >>>>> Hi Shuyi,
>>  > >>>>>
>>  >
>>  > >>>>> Thank you for your input. Yes, I agree with a phased approach
>> and would like
>>  >
>>  > >>>>> to move forward fast. :) We did some work internally on DDL
>> utilizing babel
>>  > >>>>> parser in Calcite. While babel makes Calcite's grammar
>> extensible, at
>>  > >>>>> first impression it still seems too cumbersome for a project
>> when too
>>  >
>>  > >>>>> many extensions are made. It's even challenging to find where
>> the extension
>>  >
>>  > >>>>> is needed! It would certainly be better if Calcite could magically
>> support
>>  >
>>  > >>>>> Hive QL by just turning on a flag, such as that for MYSQL_5. I
>> can also
>>  >
>>  > >>>>> see that this could mean a lot of work on Calcite. Nevertheless,
>> I will
>>  >
>>  > >>>>> bring up the discussion over there and see what their
>> community thinks.
>>  > >>>>>
>>  > >>>>> Would you mind sharing more info about the proposal on DDL that you
>>  > >>>>> mentioned? We can certainly collaborate on this.
>>  > >>>>>
>>  > >>>>> Thanks,
>>  > >>>>> Xuefu
>>  > >>>>>
>>  > >>>>>
>> ------------------------------------------------------------------
>>  > >>>>> Sender:Shuyi Chen <su...@gmail.com>
>>  > >>>>> Sent at:2018 Oct 14 (Sun) 08:30
>>  > >>>>> Recipient:Xuefu <xu...@alibaba-inc.com>
>>  > >>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <
>>  > fhueske@gmail.com>;
>>  > >>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>>  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive
>> ecosystem
>>  > >>>>>
>>  > >>>>> Welcome to the community and thanks for the great proposal,
>> Xuefu! I
>>  >
>>  > >>>>> think the proposal can be divided into 2 stages: making Flink
>> support
>>  >
>>  > >>>>> Hive features, and making Hive work with Flink. I agree with
>> Timo on
>>  >
>>  > >>>>> starting with a smaller scope, so we can make progress faster.
>> As for [6],
>>  >
>>  > >>>>> a proposal for DDL is already in progress, and will come after
>> the unified
>>  >
>>  > >>>>> SQL connector API is done. For supporting Hive syntax, we might
>> need to
>>  > >>>>> work with the Calcite community, and a recent effort called
>> babel (
>>  > >>>>> https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite
>> might
>>  > >>>>> help here.
>>  > >>>>>
>>  > >>>>> Thanks
>>  > >>>>> Shuyi
>>  > >>>>>
>>  > >>>>> On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <
>>  > xuefu.z@alibaba-inc.com>
>>  > >>>>> wrote:
>>  > >>>>> Hi Fabian/Vino,
>>  > >>>>>
>>  >
>>  > >>>>> Thank you very much for your encouragement and inquiry. Sorry that I
>> didn't
>>  >
>>  > >>>>> see Fabian's email until I read Vino's response just now.
>> (Somehow Fabian's
>>  > >>>>> went to the spam folder.)
>>  > >>>>>
>>  >
>>  > >>>>> My proposal contains long-term and short-term goals.
>> Nevertheless, the
>>  > >>>>> effort will focus on the following areas, including Fabian's
>> list:
>>  > >>>>>
>>  > >>>>> 1. Hive metastore connectivity - This covers both read/write
>> access,
>>  >
>>  > >>>>> which means Flink can make full use of Hive's metastore as its
>> catalog (at
>>  > >>>>> least for batch, but this can be extended to streaming as well).
>>  >
>>  > >>>>> 2. Metadata compatibility - Objects (databases, tables,
>> partitions, etc)
>>  >
>>  > >>>>> created by Hive can be understood by Flink, and the reverse is
>>  > >>>>> also true.
>>  > >>>>> 3. Data compatibility - Similar to #2, data produced by Hive can
>> be
>>  > >>>>> consumed by Flink and vice versa.
>>  >
>>  > >>>>> 4. Support Hive UDFs - For all of Hive's native UDFs, Flink either
>> provides
>>  > >>>>> its own implementation or makes Hive's implementation work in
>> Flink.
>>  > >>>>> Further, for user-created UDFs in Hive, Flink SQL should provide
>> a
>>  >
>>  > >>>>> mechanism allowing users to import them into Flink without any
>> code change
>>  > >>>>> required.
>>  > >>>>> 5. Data types - Flink SQL should support all data types that are
>>  > >>>>> available in Hive.
>>  > >>>>> 6. SQL Language - Flink SQL should support SQL standard (such as
>>  >
>>  > >>>>> SQL2003) with extension to support Hive's syntax and language
>> features,
>>  > >>>>> around DDL, DML, and SELECT queries.
>>  >
>>  > >>>>> 7. SQL CLI - this is currently being developed in Flink, but more
>> effort is
>>  > >>>>> needed.
>>  >
>>  > >>>>> 8. Server - provide a server that's compatible with Hive's
>> HiveServer2
>>  >
>>  > >>>>> in thrift APIs, such that HiveServer2 users can reuse their
>> existing client
>>  > >>>>> (such as beeline) but connect to Flink's thrift server instead.
>>  >
>>  > >>>>> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC
>> drivers for
>>  > >>>>> other applications to connect to its thrift server.
>>  > >>>>> 10. Support other users' customizations in Hive, such as Hive
>> Serdes,
>>  > >>>>> storage handlers, etc.
>>  >
>>  > >>>>> 11. Better task failure tolerance and task scheduling at Flink
>> runtime.
>>  > >>>>>
>>  > >>>>> As you can see, achieving all those requires significant effort
>>  >
>>  > >>>>> across all layers in Flink. However, a short-term goal could
>> include only
>>  >
>>  > >>>>> core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller
>> scope (such as
>>  > >>>>> #3, #6).
>>  > >>>>>
>>  >
>>  > >>>>> Please share your further thoughts. If we generally agree that
>> this is
>>  >
>>  > >>>>> the right direction, I could come up with a formal proposal
>> quickly and
>>  > >>>>> then we can follow up with broader discussions.
>>  > >>>>>
>>  > >>>>> Thanks,
>>  > >>>>> Xuefu
>>  > >>>>>
>>  > >>>>>
>>  > >>>>>
>>  > >>>>>
>> ------------------------------------------------------------------
>>  > >>>>> Sender:vino yang <ya...@gmail.com>
>>  > >>>>> Sent at:2018 Oct 11 (Thu) 09:45
>>  > >>>>> Recipient:Fabian Hueske <fh...@gmail.com>
>>  > >>>>> Cc:dev <de...@flink.apache.org>; Xuefu <xuefu.z@alibaba-inc.com
>>  > >; user <
>>  > >>>>> user@flink.apache.org>
>>  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive
>> ecosystem
>>  > >>>>>
>>  > >>>>> Hi Xuefu,
>>  > >>>>>
>>  >
>>  > >>>>> I appreciate this proposal, and like Fabian, I think it would be
>> better if you
>>  > >>>>> could give more details of the plan.
>>  > >>>>>
>>  > >>>>> Thanks, vino.
>>  > >>>>>
>>  > >>>>> Fabian Hueske <fh...@gmail.com> 于2018年10月10日周三 下午5:27写道:
>>  > >>>>> Hi Xuefu,
>>  > >>>>>
>>  >
>>  > >>>>> Welcome to the Flink community and thanks for starting this
>> discussion!
>>  > >>>>> Better Hive integration would be really great!
>>  > >>>>> Can you go into details of what you are proposing? I can think
>> of a
>>  > >>>>> couple ways to improve Flink in that regard:
>>  > >>>>>
>>  > >>>>> * Support for Hive UDFs
>>  > >>>>> * Support for Hive metadata catalog
>>  > >>>>> * Support for HiveQL syntax
>>  > >>>>> * ???
>>  > >>>>>
>>  > >>>>> Best, Fabian
>>  > >>>>>
>>  > >>>>> Am Di., 9. Okt. 2018 um 19:22 Uhr schrieb Zhang, Xuefu <
>>  > >>>>> xuefu.z@alibaba-inc.com>:
>>  > >>>>> Hi all,
>>  > >>>>>
>>  > >>>>> Along with the community's effort, inside Alibaba we have
>> explored
>>  >
>>  > >>>>> Flink's potential as an execution engine not just for stream
>> processing but
>>  > >>>>> also for batch processing. We are encouraged by our findings and
>> have
>>  >
>>  > >>>>> initiated our effort to make Flink's SQL capabilities
>> full-fledged. When
>>  >
>>  > >>>>> comparing what's available in Flink to the offerings from
>> competitive data
>>  >
>>  > >>>>> processing engines, we identified a major gap in Flink: good
>> integration
>>  >
>>  > >>>>> with the Hive ecosystem. This is crucial to the success of Flink SQL
>> and batch
>>  >
>>  > >>>>> due to the well-established data ecosystem around Hive.
>> Therefore, we have
>>  >
>>  > >>>>> done some initial work in this direction, but a lot of
>> effort is
>>  > >>>>> still needed.
>>  > >>>>>
>>  > >>>>> We have two strategies in mind. The first one is to make Flink
>> SQL
>>  >
>>  > >>>>> full-fledged and well-integrated with Hive ecosystem. This is a
>> similar
>>  >
>>  > >>>>> approach to what Spark SQL adopted. The second strategy is to
>> make Hive
>>  >
>>  > >>>>> itself work with Flink, similar to the proposal in [1]. Each
>> approach bears
>>  >
>>  > >>>>> its pros and cons, but they don’t need to be mutually exclusive,
>> with each
>>  > >>>>> targeting different users and use cases. We believe that both
>> will
>>  > >>>>> promote a much greater adoption of Flink beyond stream
>> processing.
>>  > >>>>>
>>  > >>>>> We have been focused on the first approach and would like to
>> showcase
>>  >
>>  > >>>>> Flink's batch and SQL capabilities with Flink SQL. However, we
>> have also
>>  > >>>>> planned to start strategy #2 as the follow-up effort.
>>  > >>>>>
>>  >
>>  > >>>>> I'm completely new to Flink (short bio [2] below),
>> though many
>>  >
>>  > >>>>> of my colleagues here at Alibaba are long-time contributors.
>> Nevertheless,
>>  >
>>  > >>>>> I'd like to share our thoughts and invite your early feedback.
>> At the same
>>  >
>>  > >>>>> time, I am working on a detailed proposal on Flink SQL's
>> integration with
>>  > >>>>> Hive ecosystem, which will also be shared when ready.
>>  > >>>>>
>>  > >>>>> While the ideas are simple, each approach will demand significant
>>  >
>>  > >>>>> effort, more than what we can afford. Thus, the input and
>> contributions
>>  > >>>>> from the communities are greatly welcome and appreciated.
>>  > >>>>>
>>  > >>>>> Regards,
>>  > >>>>>
>>  > >>>>>
>>  > >>>>> Xuefu
>>  > >>>>>
>>  > >>>>> References:
>>  > >>>>>
>>  > >>>>> [1] https://issues.apache.org/jira/browse/HIVE-10712
>>  >
>>  > >>>>> [2] Xuefu Zhang is a long-time open source veteran who has worked
>> or is working on
>>  > >>>>> many projects under the Apache Software Foundation, of which he is also an
>> honored
>>  >
>>  > >>>>> member. About 10 years ago he worked in the Hadoop team at Yahoo
>> where the
>>  >
>>  > >>>>> projects just got started. Later he worked at Cloudera,
>> initiating and
>>  >
>>  > >>>>> leading the development of Hive on Spark project in the
>> communities and
>>  >
>>  > >>>>> across many organizations. Prior to joining Alibaba, he worked
>> at Uber
>>  >
>>  > >>>>> where he promoted Hive on Spark to all Uber's SQL on Hadoop
>> workload and
>>  > >>>>> significantly improved Uber's cluster efficiency.
>>  > >>>>>
>>  > >>>>>
>>  > >>>>>
>>  > >>>>>
>>  > >>>>> --
>>  >
>>  > >>>>> "So you have to trust that the dots will somehow connect in your
>> future."
>>  > >>>>>
>>  > >>>>>
>>  > >>>>> --
>>  >
>>  > >>>>> "So you have to trust that the dots will somehow connect in your
>> future."
>>  > >>>>>
>>  >
>>  >
>>
>>

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by Eron Wright <er...@gmail.com>.
I propose that the community review and merge the PRs that I posted, and
then evolve the design through 1.8 and beyond. I think having a basic
infrastructure in place now will accelerate the effort; do you agree?

Thanks again!

On Wed, Jan 2, 2019 at 11:20 AM Zhang, Xuefu <xu...@alibaba-inc.com>
wrote:

> Hi Eron,
>
> Happy New Year!
>
> Thank you very much for your contribution, especially during the holidays.
> While I'm encouraged by your work, I'd also like to share my thoughts on how
> to move forward.
>
> First, please note that the design discussion is still being finalized, and we
> expect some moderate changes, especially around TableFactories. Another
> pending change is our decision to shy away from Scala, which will impact our
> work.
>
> Secondly, while your work seems to be about plugging catalog definitions into
> the execution environment, which is less affected by the TableFactory change, I
> did notice some duplication between your work and ours. This is no big deal, but
> going forward, we should probably communicate better on the work
> assignment so as to avoid any possible duplication of effort. On the other
> hand, I think some of your work is interesting and valuable for inclusion
> once we finalize the overall design.
>
> Thus, please continue your research and experimentation, and let us know when
> you start working on anything so we can better coordinate.
>
> Thanks again for your interest and contributions.
>
> Thanks,
> Xuefu
>
>
>
> ------------------------------------------------------------------
> From:Eron Wright <er...@gmail.com>
> Sent At:2019 Jan. 1 (Tue.) 18:39
> To:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>
> Cc:Xiaowei Jiang <xi...@gmail.com>; twalthr <tw...@apache.org>;
> piotr <pi...@data-artisans.com>; Fabian Hueske <fh...@gmail.com>;
> suez1224 <su...@gmail.com>; Bowen Li <bo...@gmail.com>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi folks, there are clearly some incremental steps to be taken to introduce
> catalog support to SQL Client, complementary to what is proposed in the
> Flink-Hive Metastore design doc. I was quietly working on this over the
> holidays. I posted some new sub-tasks, PRs, and sample code
> to FLINK-10744.
>
> What inspired me to get involved is that the catalog interface seems like
> a great way to encapsulate a 'library' of Flink tables and functions.  For
> example, the NYC Taxi dataset (TaxiRides, TaxiFares, various UDFs) may be
> nicely encapsulated as a catalog (TaxiData). Such a library should be
> fully consumable in SQL Client.
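[Editor's note] The catalog-as-library idea above can be sketched in a few lines of plain Python. This is only an illustration of the concept, not Flink's actual Catalog API; the TaxiData/TaxiRides/TaxiFares names come from the email, while the class and method names below are invented for the sketch:

```python
# Hedged sketch: a catalog that bundles a "library" of tables and UDFs
# under one name, so the whole set can be consumed together (as the email
# suggests for SQL Client). Names here are hypothetical, not Flink's API.

class TableDescriptor:
    def __init__(self, name, schema):
        self.name = name        # table name, e.g. "TaxiRides"
        self.schema = schema    # column -> type, free-form for the sketch

class LibraryCatalog:
    """Encapsulates a set of tables and functions under one catalog name."""
    def __init__(self, name):
        self.name = name
        self._tables = {}
        self._functions = {}

    def register_table(self, table):
        self._tables[table.name] = table

    def register_function(self, name, fn):
        self._functions[name] = fn

    def list_tables(self):
        return sorted(self._tables)

    def get_function(self, name):
        return self._functions[name]

# Build a TaxiData-style catalog as described in the email.
taxi = LibraryCatalog("TaxiData")
taxi.register_table(TableDescriptor("TaxiRides", {"rideId": "BIGINT"}))
taxi.register_table(TableDescriptor("TaxiFares", {"rideId": "BIGINT"}))
taxi.register_function("toCellId", lambda lon, lat: (round(lon), round(lat)))

print(taxi.list_tables())  # ['TaxiFares', 'TaxiRides']
```

The point of the sketch is the encapsulation: once the catalog object exists, an environment file or a table program only needs to reference "TaxiData" to pull in the whole library.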
>
> I implemented the above.  Some highlights:
>
> 1. A fully-worked example of using the Taxi dataset in SQL Client via an
> environment file.
> - an ASCII video showing the SQL Client in action:
> https://asciinema.org/a/C8xuAjmZSxCuApgFgZQyeIHuo
>
> - the corresponding environment file (will be even more concise once
> 'FLINK-10696 Catalog UDFs' is merged):
> *https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/dist/conf/sql-client-defaults.yaml
> <https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/dist/conf/sql-client-defaults.yaml>*
>
> - the typed API for standalone table applications:
> *https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/java/com/dataartisans/flinktraining/examples/table_java/examples/ViaCatalog.java#L50
> <https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/java/com/dataartisans/flinktraining/examples/table_java/examples/ViaCatalog.java#L50>*
>
> 2. Implementation of the core catalog descriptor and factory.  I realize
> that some renames may later occur as per the design doc, and would be happy
> to do that as a follow-up.
> https://github.com/apache/flink/pull/7390
>
> 3. Implementation of a connect-style API on TableEnvironment to use
> catalog descriptor.
> https://github.com/apache/flink/pull/7392
>
> 4. Integration into SQL-Client's environment file:
> https://github.com/apache/flink/pull/7393
>
> I realize that the overall Hive integration is still evolving, but I
> believe that these PRs are a good stepping stone. Here's the list (in
> bottom-up order):
> - https://github.com/apache/flink/pull/7386
> - https://github.com/apache/flink/pull/7388
> - https://github.com/apache/flink/pull/7389
> - https://github.com/apache/flink/pull/7390
> - https://github.com/apache/flink/pull/7392
> - https://github.com/apache/flink/pull/7393
>
> Thanks and enjoy 2019!
> Eron W
>
>
> On Sun, Nov 18, 2018 at 3:04 PM Zhang, Xuefu <xu...@alibaba-inc.com>
> wrote:
> Hi Xiaowei,
>
> Thanks for bringing up the question. In the current design, the properties
> for meta objects are meant to cover anything that's specific to a
> particular catalog and agnostic to Flink. Anything that is common (such as
> schema for tables, query text for views, and udf classname) are abstracted
> as members of the respective classes. However, this is still in discussion,
> and Timo and I will go over this and provide an update.
>
> Please note that UDF is a little more involved than what the current
> design doc shows. I'm still refining this part.
>
> Thanks,
> Xuefu
>
>
> ------------------------------------------------------------------
> Sender:Xiaowei Jiang <xi...@gmail.com>
> Sent at:2018 Nov 18 (Sun) 15:17
> Recipient:dev <de...@flink.apache.org>
> Cc:Xuefu <xu...@alibaba-inc.com>; twalthr <tw...@apache.org>; piotr <
> piotr@data-artisans.com>; Fabian Hueske <fh...@gmail.com>; suez1224 <
> suez1224@gmail.com>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Thanks Xuefu for the detailed design doc! One question on the properties
> associated with the catalog objects. Are we going to leave them completely
> free-form, or are we going to set some standard for them? I think that the
> answer may depend on whether we want to explore catalog-specific optimization
> opportunities. In any case, I think that it might be helpful to
> standardize as much as possible into strongly typed classes and leave
> these properties for catalog-specific things. But I think that we can do it
> in steps.
>
> Xiaowei
> On Fri, Nov 16, 2018 at 4:00 AM Bowen Li <bo...@gmail.com> wrote:
> Thanks for keeping on improving the overall design, Xuefu! It looks quite
>  good to me now.
>
>  It would be nice if the cc-ed Flink committers could help review and confirm!
>
>
>
>  One minor suggestion: Since the last section of the design doc already touches
>  some new SQL statements, shall we add another section in our doc and
>  formalize the new SQL statements in SQL Client and TableEnvironment that
>  will come along naturally with our design? Here are some that the
>  design doc mentioned and some that I came up with:
>
>  To be added:
>
>     - USE <catalog> - set default catalog
>     - USE <catalog.schema> - set default schema
>     - SHOW CATALOGS - show all registered catalogs
>     - SHOW SCHEMAS [FROM catalog] - list schemas in the current default
>     catalog or the specified catalog
>     - DESCRIBE VIEW view - show the view's definition in CatalogView
>     - SHOW VIEWS [FROM schema/catalog.schema] - show views from current or
> a
>     specified schema.
>
>     (DDLs that can be addressed by either our design or Shuyi's DDL design)
>
>     - CREATE/DROP/ALTER SCHEMA schema
>     - CREATE/DROP/ALTER CATALOG catalog
>
>  To be modified:
>
>     - SHOW TABLES [FROM schema/catalog.schema] - show tables from current
> or
>     a specified schema. Add 'from schema' to existing 'SHOW TABLES'
> statement
>     - SHOW FUNCTIONS [FROM schema/catalog.schema] - show functions from
>     current or a specified schema. Add 'from schema' to existing 'SHOW
> FUNCTIONS'
>     statement
>
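[Editor's note] The USE/SHOW semantics listed above can be modeled with a tiny in-memory sketch. This is an assumption-laden illustration, not the SQL Client implementation; all names below are invented:

```python
# Minimal in-memory model of the proposed USE / SHOW statement semantics.
# Hypothetical sketch only - not the actual Flink SQL Client code.

class CatalogManager:
    def __init__(self, catalogs):
        # catalogs: {catalog_name: {schema_name: [table_names]}}
        self.catalogs = catalogs
        self.current_catalog = None
        self.current_schema = None

    def use(self, path):
        parts = path.split(".")
        self.current_catalog = parts[0]        # USE <catalog>
        if len(parts) == 2:
            self.current_schema = parts[1]     # USE <catalog.schema>

    def show_catalogs(self):
        return sorted(self.catalogs)           # SHOW CATALOGS

    def show_schemas(self, catalog=None):
        # SHOW SCHEMAS [FROM catalog] - default catalog unless one is given
        return sorted(self.catalogs[catalog or self.current_catalog])

    def show_tables(self, schema=None):
        # SHOW TABLES [FROM schema] - current schema unless one is given
        cat = self.catalogs[self.current_catalog]
        return sorted(cat[schema or self.current_schema])

mgr = CatalogManager({"hive": {"default": ["orders", "users"]},
                      "flink": {"default": []}})
mgr.use("hive.default")
print(mgr.show_catalogs())   # ['flink', 'hive']
print(mgr.show_tables())     # ['orders', 'users']
```

The sketch shows why the two forms of USE matter: the optional FROM clauses in the SHOW statements only have defaults to fall back on once a current catalog and schema have been set.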
>
>  Thanks, Bowen
>
>
>
>  On Wed, Nov 14, 2018 at 10:39 PM Zhang, Xuefu <xu...@alibaba-inc.com>
>  wrote:
>
>  > Thanks, Bowen, for catching the error. I have granted comment permission
>  > on the link.
>  >
>  > I also updated the doc with the latest class definitions. Everyone is
>  > encouraged to review and comment.
>  >
>  > Thanks,
>  > Xuefu
>  >
>  > ------------------------------------------------------------------
>  > Sender:Bowen Li <bo...@gmail.com>
>  > Sent at:2018 Nov 14 (Wed) 06:44
>  > Recipient:Xuefu <xu...@alibaba-inc.com>
>  > Cc:piotr <pi...@data-artisans.com>; dev <de...@flink.apache.org>; Shuyi
>  > Chen <su...@gmail.com>
>  > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>  >
>  > Hi Xuefu,
>  >
>  > Currently the new design doc
>  > <
> https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit
> >
>  > is on “view only" mode, and people cannot leave comments. Can you please
>  > change it to "can comment" or "can edit" mode?
>  >
>  > Thanks, Bowen
>  >
>  >
>  > On Mon, Nov 12, 2018 at 9:51 PM Zhang, Xuefu <xu...@alibaba-inc.com>
>  > wrote:
>  > Hi Piotr
>  >
>  > I have extracted the API portion of  the design and the google doc is
> here
>  > <
> https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit?usp=sharing
> >.
>  > Please review and provide your feedback.
>  >
>  > Thanks,
>  > Xuefu
>  >
>  > ------------------------------------------------------------------
>  > Sender:Xuefu <xu...@alibaba-inc.com>
>  > Sent at:2018 Nov 12 (Mon) 12:43
>  > Recipient:Piotr Nowojski <pi...@data-artisans.com>; dev <
>  > dev@flink.apache.org>
>  > Cc:Bowen Li <bo...@gmail.com>; Shuyi Chen <su...@gmail.com>
>  > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>  >
>  > Hi Piotr,
>  >
>  > That sounds good to me. Let's close all the open questions (there are a
>  > couple of them) in the Google doc, and I should be able to quickly split
>  > it into the three proposals as you suggested.
>  >
>  > Thanks,
>  > Xuefu
>  >
>  > ------------------------------------------------------------------
>  > Sender:Piotr Nowojski <pi...@data-artisans.com>
>  > Sent at:2018 Nov 9 (Fri) 22:46
>  > Recipient:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>
>  > Cc:Bowen Li <bo...@gmail.com>; Shuyi Chen <su...@gmail.com>
>  > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>  >
>  > Hi,
>  >
>  >
>  > Yes, it seems like the best solution. Maybe someone else can also
> suggest whether we can split it further? Maybe changes in the interface in one
> doc, reading from the Hive metastore in another, and finally storing our meta
> information in the Hive metastore?
>  >
>  > Piotrek
>  >
>  > > On 9 Nov 2018, at 01:44, Zhang, Xuefu <xu...@alibaba-inc.com>
> wrote:
>  > >
>  > > Hi Piotr,
>  > >
>  > > That seems to be good idea!
>  > >
>  >
>  > > Since the google doc for the design is currently under extensive
> review, I will leave it as it is for now. However, I'll convert it to two
> different FLIPs when the time comes.
>  > >
>  > > How does it sound to you?
>  > >
>  > > Thanks,
>  > > Xuefu
>  > >
>  > >
>  > > ------------------------------------------------------------------
>  > > Sender:Piotr Nowojski <pi...@data-artisans.com>
>  > > Sent at:2018 Nov 9 (Fri) 02:31
>  > > Recipient:dev <de...@flink.apache.org>
>  > > Cc:Bowen Li <bo...@gmail.com>; Xuefu <xuefu.z@alibaba-inc.com
>  > >; Shuyi Chen <su...@gmail.com>
>  > > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>  > >
>  > > Hi,
>  > >
>  >
>  > > Maybe we should split this topic (and the design doc) into a couple of
> smaller ones, hopefully independent. The questions that you have asked
> Fabian have, for example, very little to do with reading metadata from the Hive
> Metastore.
>  > >
>  > > Piotrek
>  > >
>  > >> On 7 Nov 2018, at 14:27, Fabian Hueske <fh...@gmail.com> wrote:
>  > >>
>  > >> Hi Xuefu and all,
>  > >>
>  > >> Thanks for sharing this design document!
>  >
>  > >> I'm very much in favor of restructuring / reworking the catalog
> handling in
>  > >> Flink SQL as outlined in the document.
>  >
>  > >> Most changes described in the design document seem to be rather
> general and
>  > >> not specifically related to the Hive integration.
>  > >>
>  >
>  > >> IMO, there are some aspects, especially those at the boundary of
> Hive and
>  > >> Flink, that need a bit more discussion. For example
>  > >>
>  > >> * What does it take to make Flink schema compatible with Hive schema?
>  > >> * How will Flink tables (descriptors) be stored in HMS?
>  > >> * How do both Hive catalogs differ? Could they be integrated into
> a
>  > >> single one? When to use which one?
>  >
>  > >> * What meta information is provided by HMS? What of this can be
> leveraged
>  > >> by Flink?
>  > >>
>  > >> Thank you,
>  > >> Fabian
>  > >>
>  > >> Am Fr., 2. Nov. 2018 um 00:31 Uhr schrieb Bowen Li <
> bowenli86@gmail.com
>  > >:
>  > >>
>  > >>> After taking a look at how other discussion threads work, I think
> it's
>  > >>> actually fine to just keep our discussion here. It's up to you, Xuefu.
>  > >>>
>  > >>> The google doc LGTM. I left some minor comments.
>  > >>>
>  > >>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li <bo...@gmail.com>
> wrote:
>  > >>>
>  > >>>> Hi all,
>  > >>>>
>  > >>>> As Xuefu has published the design doc on Google Docs, I agree with
> Shuyi's
>  >
>  > >>>> suggestion that we probably should start a new email thread like
> "[DISCUSS]
>  >
>  > >>>> ... Hive integration design ..." on only the dev mailing list for
> community
>  > >>>> devs to review. The current thread goes to both the dev and user lists.
>  > >>>>
>  >
>  > >>>> This email thread is more like validating the general idea and
> direction
>  >
>  > >>>> with the community, and it's been pretty long and crowded so far.
> Since
>  >
>  > >>>> everyone is in favor of the idea, we can move forward with another
> thread to
>  > >>>> discuss and finalize the design.
>  > >>>>
>  > >>>> Thanks,
>  > >>>> Bowen
>  > >>>>
>  > >>>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu <
>  > xuefu.z@alibaba-inc.com>
>  > >>>> wrote:
>  > >>>>
>  > >>>>> Hi Shuyi,
>  > >>>>>
>  >
>  > >>>>> Good idea. Actually the PDF was converted from a google doc. Here
> is its
>  > >>>>> link:
>  > >>>>>
>  > >>>>>
>  >
> https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
>  > >>>>> Once we reach an agreement, I can convert it to a FLIP.
>  > >>>>>
>  > >>>>> Thanks,
>  > >>>>> Xuefu
>  > >>>>>
>  > >>>>>
>  > >>>>>
>  > >>>>> ------------------------------------------------------------------
>  > >>>>> Sender:Shuyi Chen <su...@gmail.com>
>  > >>>>> Sent at:2018 Nov 1 (Thu) 02:47
>  > >>>>> Recipient:Xuefu <xu...@alibaba-inc.com>
>  > >>>>> Cc:vino yang <ya...@gmail.com>; Fabian Hueske <
>  > fhueske@gmail.com>;
>  > >>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>  > >>>>>
>  > >>>>> Hi Xuefu,
>  > >>>>>
>  >
>  > >>>>> Thanks a lot for driving this big effort. I would suggest converting
> your
>  >
>  > >>>>> proposal and design doc into a google doc, and share it on the
> dev mailing
>  >
>  > >>>>> list for the community to review and comment with title like
> "[DISCUSS] ...
>  >
>  > >>>>> Hive integration design ...". Once approved, we can document it
> as a FLIP
>  >
>  > >>>>> (Flink Improvement Proposals), and use JIRAs to track the
> implementations.
>  > >>>>> What do you think?
>  > >>>>>
>  > >>>>> Shuyi
>  > >>>>>
>  > >>>>> On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu <
>  > xuefu.z@alibaba-inc.com>
>  > >>>>> wrote:
>  > >>>>> Hi all,
>  > >>>>>
>  > >>>>> I have also shared a design doc on Hive metastore integration
> that is
>  >
>  > >>>>> attached here and also to FLINK-10556[1]. Please kindly review
> and share
>  > >>>>> your feedback.
>  > >>>>>
>  > >>>>>
>  > >>>>> Thanks,
>  > >>>>> Xuefu
>  > >>>>>
>  > >>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
>  > >>>>> ------------------------------------------------------------------
>  > >>>>> Sender:Xuefu <xu...@alibaba-inc.com>
>  > >>>>> Sent at:2018 Oct 25 (Thu) 01:08
>  > >>>>> Recipient:Xuefu <xu...@alibaba-inc.com>; Shuyi Chen <
>  > >>>>> suez1224@gmail.com>
>  > >>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <
>  > fhueske@gmail.com>;
>  > >>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>  > >>>>>
>  > >>>>> Hi all,
>  > >>>>>
>  > >>>>> To wrap up the discussion, I have attached a PDF describing the
>  >
>  > >>>>> proposal, which is also attached to FLINK-10556 [1]. Please feel
> free to
>  > >>>>> watch that JIRA to track the progress.
>  > >>>>>
>  > >>>>> Please also let me know if you have additional comments or
> questions.
>  > >>>>>
>  > >>>>> Thanks,
>  > >>>>> Xuefu
>  > >>>>>
>  > >>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
>  > >>>>>
>  > >>>>>
>  > >>>>> ------------------------------------------------------------------
>  > >>>>> Sender:Xuefu <xu...@alibaba-inc.com>
>  > >>>>> Sent at:2018 Oct 16 (Tue) 03:40
>  > >>>>> Recipient:Shuyi Chen <su...@gmail.com>
>  > >>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <
>  > fhueske@gmail.com>;
>  > >>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>  > >>>>>
>  > >>>>> Hi Shuyi,
>  > >>>>>
>  >
>  > >>>>> Thank you for your input. Yes, I agree with a phased approach
> and would like
>  >
>  > >>>>> to move forward fast. :) We did some work internally on DDL
> utilizing babel
>  > >>>>> parser in Calcite. While babel makes Calcite's grammar
> extensible, at
>  > >>>>> first impression it still seems too cumbersome for a project when
> too
>  >
>  > >>>>> many extensions are made. It's even challenging to find where the
> extension
>  >
>  > >>>>> is needed! It would certainly be better if Calcite could magically
> support
>  >
>  > >>>>> Hive QL by just turning on a flag, such as that for MYSQL_5. I
> can also
>  >
>  > >>>>> see that this could mean a lot of work on Calcite. Nevertheless,
> I will
>  >
>  > >>>>> bring up the discussion over there and see what their
> community thinks.
>  > >>>>>
>  > >>>>> Would you mind sharing more info about the proposal on DDL that you
>  > >>>>> mentioned? We can certainly collaborate on this.
>  > >>>>>
>  > >>>>> Thanks,
>  > >>>>> Xuefu
>  > >>>>>
>  > >>>>> ------------------------------------------------------------------
>  > >>>>> Sender:Shuyi Chen <su...@gmail.com>
>  > >>>>> Sent at:2018 Oct 14 (Sun) 08:30
>  > >>>>> Recipient:Xuefu <xu...@alibaba-inc.com>
>  > >>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <
>  > fhueske@gmail.com>;
>  > >>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>  > >>>>>
>  > >>>>> Welcome to the community and thanks for the great proposal,
> Xuefu! I
>  >
>  > >>>>> think the proposal can be divided into 2 stages: making Flink
> support
>  >
>  > >>>>> Hive features, and making Hive work with Flink. I agree with
> Timo on
>  >
>  > >>>>> starting with a smaller scope, so we can make progress faster. As
> for [6],
>  >
>  > >>>>> a proposal for DDL is already in progress, and will come after
> the unified
>  >
>  > >>>>> SQL connector API is done. For supporting Hive syntax, we might
> need to
>  > >>>>> work with the Calcite community, and a recent effort called babel
> (
>  > >>>>> https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite
> might
>  > >>>>> help here.
>  > >>>>>
>  > >>>>> Thanks
>  > >>>>> Shuyi
>  > >>>>>
>  > >>>>> On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <
>  > xuefu.z@alibaba-inc.com>
>  > >>>>> wrote:
>  > >>>>> Hi Fabian/Vino,
>  > >>>>>
>  >
>  > >>>>> Thank you very much for your encouragement and inquiry. Sorry that I
> didn't
>  >
>  > >>>>> see Fabian's email until I read Vino's response just now.
> (Somehow Fabian's
>  > >>>>> went to the spam folder.)
>  > >>>>>
>  >
>  > >>>>> My proposal contains long-term and short-term goals.
> Nevertheless, the
>  > >>>>> effort will focus on the following areas, including Fabian's list:
>  > >>>>>
>  > >>>>> 1. Hive metastore connectivity - This covers both read/write
> access,
>  >
>  > >>>>> which means Flink can make full use of Hive's metastore as its
> catalog (at
>  > >>>>> least for batch, but this can be extended for streaming as well).
>  >
>  > >>>>> 2. Metadata compatibility - Objects (databases, tables,
> partitions, etc)
>  >
>  > >>>>> created by Hive can be understood by Flink, and the reverse
> direction is
>  > >>>>> true as well.
>  > >>>>> 3. Data compatibility - Similar to #2, data produced by Hive can
> be
>  > >>>>> consumed by Flink and vice versa.
>  >
>  > >>>>> 4. Support Hive UDFs - For all Hive's native udfs, Flink either
> provides
>  > >>>>> its own implementation or makes Hive's implementation work in
> Flink.
>  > >>>>> Further, for user created UDFs in Hive, Flink SQL should provide a
>  >
>  > >>>>> mechanism allowing users to import them into Flink without any
> code change
>  > >>>>> required.
>  > >>>>> 5. Data types -  Flink SQL should support all data types that are
>  > >>>>> available in Hive.
>  > >>>>> 6. SQL Language - Flink SQL should support SQL standard (such as
>  >
>  > >>>>> SQL2003) with extension to support Hive's syntax and language
> features,
>  > >>>>> around DDL, DML, and SELECT queries.
>  >
>  > >>>>> 7. SQL CLI - this is currently being developed in Flink but more
> effort is
>  > >>>>> needed.
>  >
>  > >>>>> 8. Server - provide a server that's compatible with Hive's
> HiveServer2
>  >
>  > >>>>> in thrift APIs, such that HiveServer2 users can reuse their
> existing client
>  > >>>>> (such as beeline) but connect to Flink's thrift server instead.
>  >
>  > >>>>> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC
> drivers for
>  > >>>>> other applications to use to connect to its thrift server
>  > >>>>> 10. Support other users' customizations in Hive, such as Hive
> Serdes,
>  > >>>>> storage handlers, etc.
>  >
>  > >>>>> 11. Better task failure tolerance and task scheduling at Flink
> runtime.
>  > >>>>>
>  > >>>>> As you can see, achieving all those requires significant effort
> and
>  >
>  > >>>>> across all layers in Flink. However, a short-term goal could
> include only
>  >
>  > >>>>> core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller
> scope (such as
>  > >>>>> #3, #6).
>  > >>>>>
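[Editor's note] Point 5 above (data types) amounts to a translation table between Hive and Flink SQL type names. The sketch below is illustrative only; the pairs shown are common equivalences assumed for the example, not an official or complete mapping:

```python
# Illustrative Hive -> Flink SQL type-name mapping (point 5 above).
# The pairs are assumed common equivalences, not an official table.

HIVE_TO_FLINK = {
    "TINYINT": "TINYINT",
    "SMALLINT": "SMALLINT",
    "INT": "INT",
    "BIGINT": "BIGINT",
    "FLOAT": "FLOAT",
    "DOUBLE": "DOUBLE",
    "BOOLEAN": "BOOLEAN",
    "STRING": "VARCHAR",
    "BINARY": "VARBINARY",
    "TIMESTAMP": "TIMESTAMP",
    "DATE": "DATE",
}

def map_hive_type(hive_type):
    """Translate a Hive type name, handling ARRAY<...> recursively."""
    t = hive_type.strip().upper()
    if t.startswith("ARRAY<") and t.endswith(">"):
        return "ARRAY<" + map_hive_type(t[6:-1]) + ">"
    try:
        return HIVE_TO_FLINK[t]
    except KeyError:
        raise ValueError("unsupported Hive type: " + hive_type)

print(map_hive_type("string"))      # VARCHAR
print(map_hive_type("array<int>"))  # ARRAY<INT>
```

A real implementation would also need the parameterized and nested types (DECIMAL(p,s), MAP, STRUCT), which is where most of the effort in point 5 would lie.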
>  >
>  > >>>>> Please share your further thoughts. If we generally agree that
> this is
>  >
>  > >>>>> the right direction, I could come up with a formal proposal
> quickly and
>  > >>>>> then we can follow up with broader discussions.
>  > >>>>>
>  > >>>>> Thanks,
>  > >>>>> Xuefu
>  > >>>>>
>  > >>>>>
>  > >>>>>
>  > >>>>> ------------------------------------------------------------------
>  > >>>>> Sender:vino yang <ya...@gmail.com>
>  > >>>>> Sent at:2018 Oct 11 (Thu) 09:45
>  > >>>>> Recipient:Fabian Hueske <fh...@gmail.com>
>  > >>>>> Cc:dev <de...@flink.apache.org>; Xuefu <xuefu.z@alibaba-inc.com
>  > >; user <
>  > >>>>> user@flink.apache.org>
>  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>  > >>>>>
>  > >>>>> Hi Xuefu,
>  > >>>>>
>  >
>  > >>>>> Appreciate this proposal. Like Fabian, I think it would be better
> if you
>  > >>>>> could give more details of the plan.
>  > >>>>>
>  > >>>>> Thanks, vino.
>  > >>>>>
>  > >>>>> Fabian Hueske <fh...@gmail.com> 于2018年10月10日周三 下午5:27写道:
>  > >>>>> Hi Xuefu,
>  > >>>>>
>  >
>  > >>>>> Welcome to the Flink community and thanks for starting this
> discussion!
>  > >>>>> Better Hive integration would be really great!
>  > >>>>> Can you go into details of what you are proposing? I can think of
> a
>  > >>>>> couple ways to improve Flink in that regard:
>  > >>>>>
>  > >>>>> * Support for Hive UDFs
>  > >>>>> * Support for Hive metadata catalog
>  > >>>>> * Support for HiveQL syntax
>  > >>>>> * ???
>  > >>>>>
>  > >>>>> Best, Fabian
>  > >>>>>
>  > >>>>> Am Di., 9. Okt. 2018 um 19:22 Uhr schrieb Zhang, Xuefu <
>  > >>>>> xuefu.z@alibaba-inc.com>:
>  > >>>>> Hi all,
>  > >>>>>
>  > >>>>> Along with the community's effort, inside Alibaba we have explored
>  >
>  > >>>>> Flink's potential as an execution engine not just for stream
> processing but
>  > >>>>> also for batch processing. We are encouraged by our findings and
> have
>  >
>  > >>>>> initiated our effort to make Flink's SQL capabilities
> full-fledged. When
>  >
>  > >>>>> comparing what's available in Flink to the offerings from
> competitive data
>  >
>  > >>>>> processing engines, we identified a major gap in Flink: good
> integration
>  >
>  > >>>>> with the Hive ecosystem. This is crucial to the success of Flink SQL
> and batch
>  >
>  > >>>>> due to the well-established data ecosystem around Hive.
> Therefore, we have
>  >
>  > >>>>> done some initial work in this direction, but a lot of effort
> is still
>  > >>>>> needed.
>  > >>>>>
>  > >>>>> We have two strategies in mind. The first one is to make Flink SQL
>  >
>  > >>>>> full-fledged and well-integrated with Hive ecosystem. This is a
> similar
>  >
>  > >>>>> approach to what Spark SQL adopted. The second strategy is to
> make Hive
>  >
>  > >>>>> itself work with Flink, similar to the proposal in [1]. Each
> approach bears
>  >
>  > >>>>> its pros and cons, but they don’t need to be mutually exclusive,
> with each
>  > >>>>> targeting different users and use cases. We believe that both
> will
>  > >>>>> promote a much greater adoption of Flink beyond stream processing.
>  > >>>>>
>  > >>>>> We have been focused on the first approach and would like to
> showcase
>  >
>  > >>>>> Flink's batch and SQL capabilities with Flink SQL. However, we
> have also
>  > >>>>> planned to start strategy #2 as the follow-up effort.
>  > >>>>>
>  >
>  > >>>>> I'm completely new to Flink (short bio [2] below), though
> many
>  >
>  > >>>>> of my colleagues here at Alibaba are long-time contributors.
> Nevertheless,
>  >
>  > >>>>> I'd like to share our thoughts and invite your early feedback. At
> the same
>  >
>  > >>>>> time, I am working on a detailed proposal on Flink SQL's
> integration with
>  > >>>>> Hive ecosystem, which will also be shared when ready.
>  > >>>>>
>  > >>>>> While the ideas are simple, each approach will demand significant
>  >
>  > >>>>> effort, more than what we can afford. Thus, the input and
> contributions
>  > >>>>> from the communities are greatly welcome and appreciated.
>  > >>>>>
>  > >>>>> Regards,
>  > >>>>>
>  > >>>>>
>  > >>>>> Xuefu
>  > >>>>>
>  > >>>>> References:
>  > >>>>>
>  > >>>>> [1] https://issues.apache.org/jira/browse/HIVE-10712
>  >
>  > >>>>> [2] Xuefu Zhang is a long-time open source veteran who has worked
> or is working on
>  > >>>>> many projects under the Apache Software Foundation, of which he is also an
> honored
>  >
>  > >>>>> member. About 10 years ago he worked in the Hadoop team at Yahoo
> where the
>  >
>  > >>>>> projects just got started. Later he worked at Cloudera,
> initiating and
>  >
>  > >>>>> leading the development of Hive on Spark project in the
> communities and
>  >
>  > >>>>> across many organizations. Prior to joining Alibaba, he worked at
> Uber
>  >
>  > >>>>> where he promoted Hive on Spark to all Uber's SQL on Hadoop
> workload and
>  > >>>>> significantly improved Uber's cluster efficiency.
>  > >>>>>
>  > >>>>>
>  > >>>>>
>  > >>>>>
>  > >>>>> --
>  >
>  > >>>>> "So you have to trust that the dots will somehow connect in your
> future."
>  > >>>>>
>  > >>>>>
>  > >>>>>
>  >
>  >
>
>

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by "Zhang, Xuefu" <xu...@alibaba-inc.com>.
Hi Eron,

Happy New Year!

Thank you very much for your contribution, especially during the holidays. While I'm encouraged by your work, I'd also like to share my thoughts on how to move forward.

First, please note that the design discussion is still being finalized, and we expect some moderate changes, especially around TableFactories. Another pending change is our decision to move away from Scala, which will impact our work.

Secondly, while your work seems to be about plugging catalog definitions into the execution environment, which is less affected by the TableFactory change, I did notice some overlap between your work and ours. This is no big deal, but going forward we should probably communicate better about work assignments so as to avoid any possible duplication. On the other hand, I think some of your work is interesting and valuable for inclusion once we finalize the overall design.

Thus, please continue your research and experiments, and let us know when you start working on anything so we can coordinate better.

Thanks again for your interest and contributions.

Thanks,
Xuefu




------------------------------------------------------------------
From:Eron Wright <er...@gmail.com>
Sent At:2019 Jan. 1 (Tue.) 18:39
To:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>
Cc:Xiaowei Jiang <xi...@gmail.com>; twalthr <tw...@apache.org>; piotr <pi...@data-artisans.com>; Fabian Hueske <fh...@gmail.com>; suez1224 <su...@gmail.com>; Bowen Li <bo...@gmail.com>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi folks, there are clearly some incremental steps to be taken to introduce catalog support to SQL Client, complementary to what is proposed in the Flink-Hive Metastore design doc. I was quietly working on this over the holidays. I posted some new sub-tasks, PRs, and sample code to FLINK-10744.

What inspired me to get involved is that the catalog interface seems like a great way to encapsulate a 'library' of Flink tables and functions.  For example, the NYC Taxi dataset (TaxiRides, TaxiFares, various UDFs) may be nicely encapsulated as a catalog (TaxiData).   Such a library should be fully consumable in SQL Client.

I implemented the above.  Some highlights:
1. A fully-worked example of using the Taxi dataset in SQL Client via an environment file.
- an ASCII video showing the SQL Client in action:
https://asciinema.org/a/C8xuAjmZSxCuApgFgZQyeIHuo

- the corresponding environment file (will be even more concise once 'FLINK-10696 Catalog UDFs' is merged):
https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/dist/conf/sql-client-defaults.yaml

- the typed API for standalone table applications:
https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/java/com/dataartisans/flinktraining/examples/table_java/examples/ViaCatalog.java#L50

2. Implementation of the core catalog descriptor and factory.  I realize that some renames may later occur as per the design doc, and would be happy to do that as a follow-up.
https://github.com/apache/flink/pull/7390

3. Implementation of a connect-style API on TableEnvironment to use catalog descriptor.
https://github.com/apache/flink/pull/7392

4. Integration into SQL-Client's environment file:
https://github.com/apache/flink/pull/7393
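For readers following along, a connect-style registration as described in item 3 might look roughly like the sketch below. The descriptor class, method names, and property keys here are illustrative assumptions only, not the API in the linked PRs:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a builder-style catalog descriptor, in the spirit
// of a connect()-style API: the descriptor collects settings and flattens
// them into the string properties a catalog factory would consume.
// All names here are illustrative, not the merged Flink API.
class CatalogDescriptorSketch {
    private final String type;
    private final Map<String, String> props = new HashMap<>();

    CatalogDescriptorSketch(String type) {
        this.type = type;
    }

    // Adds one catalog-specific setting; returns this for chaining.
    CatalogDescriptorSketch property(String key, String value) {
        props.put(key, value);
        return this;
    }

    // Flattens the descriptor into flat string properties, including the type.
    Map<String, String> toProperties() {
        Map<String, String> all = new HashMap<>(props);
        all.put("catalog.type", type);
        return all;
    }
}
```

Under this sketch, a table environment would hand the flattened map to a matching catalog factory, mirroring how connector descriptors already work in the Table API.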

I realize that the overall Hive integration is still evolving, but I believe that these PRs are a good stepping stone. Here's the list (in bottom-up order):
- https://github.com/apache/flink/pull/7386
- https://github.com/apache/flink/pull/7388
- https://github.com/apache/flink/pull/7389
- https://github.com/apache/flink/pull/7390
- https://github.com/apache/flink/pull/7392
- https://github.com/apache/flink/pull/7393

Thanks and enjoy 2019!
Eron W


On Sun, Nov 18, 2018 at 3:04 PM Zhang, Xuefu <xu...@alibaba-inc.com> wrote:
Hi Xiaowei,

 Thanks for bringing up the question. In the current design, the properties for meta objects are meant to cover anything that's specific to a particular catalog and agnostic to Flink. Anything that is common (such as the schema for tables, query text for views, and the UDF class name) is abstracted as members of the respective classes. However, this is still in discussion, and Timo and I will go over this and provide an update.

 Please note that UDF is a little more involved than what the current design doc shows. I'm still refining this part.
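To make the typed-members-plus-properties split above concrete, here is a rough sketch (all class and field names are hypothetical, not the proposed design): metadata that Flink must understand, such as the table schema, is a typed member, while catalog-specific settings stay in an opaque string map.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical sketch only: Flink-understood metadata (the schema) is kept
// as typed members, while catalog-specific settings remain in a free-form
// properties map that Flink treats as opaque and passes through unchanged.
class CatalogTableSketch {
    private final String[] fieldNames;            // column names
    private final String[] fieldTypes;            // column types, as strings for brevity
    private final Map<String, String> properties; // catalog-specific, opaque to Flink

    CatalogTableSketch(String[] fieldNames, String[] fieldTypes,
                       Map<String, String> properties) {
        this.fieldNames = fieldNames;
        this.fieldTypes = fieldTypes;
        this.properties = Collections.unmodifiableMap(properties);
    }

    String[] getFieldNames() { return fieldNames; }
    String[] getFieldTypes() { return fieldTypes; }
    Map<String, String> getProperties() { return properties; }
}
```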

 Thanks,
 Xuefu


 ------------------------------------------------------------------
 Sender:Xiaowei Jiang <xi...@gmail.com>
 Sent at:2018 Nov 18 (Sun) 15:17
 Recipient:dev <de...@flink.apache.org>
 Cc:Xuefu <xu...@alibaba-inc.com>; twalthr <tw...@apache.org>; piotr <pi...@data-artisans.com>; Fabian Hueske <fh...@gmail.com>; suez1224 <su...@gmail.com>
 Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

 Thanks Xuefu for the detailed design doc! One question on the properties associated with the catalog objects. Are we going to leave them completely free-form, or are we going to set some standard for them? I think the answer may depend on whether we want to explore catalog-specific optimization opportunities. In any case, I think it might be helpful to standardize as much as possible into strongly typed classes and leave these properties for catalog-specific things. But I think we can do it in steps.

 Xiaowei
 On Fri, Nov 16, 2018 at 4:00 AM Bowen Li <bo...@gmail.com> wrote:
 Thanks for keeping on improving the overall design, Xuefu! It looks quite
  good to me now.

  It would be nice if the cc-ed Flink committers could help review and confirm!



  One minor suggestion: Since the last section of the design doc already touches
  some new SQL statements, shall we add another section to our doc and
  formalize the new SQL statements in SQL Client and TableEnvironment that
  are going to come along naturally with our design? Here are some that the
  design doc mentioned and some that I came up with:

  To be added:

     - USE <catalog> - set default catalog
     - USE <catalog.schema> - set default schema
     - SHOW CATALOGS - show all registered catalogs
     - SHOW SCHEMAS [FROM catalog] - list schemas in the current default
     catalog or the specified catalog
     - DESCRIBE VIEW view - show the view's definition in CatalogView
     - SHOW VIEWS [FROM schema/catalog.schema] - show views from current or a
     specified schema.

     (DDLs that can be addressed by either our design or Shuyi's DDL design)

     - CREATE/DROP/ALTER SCHEMA schema
     - CREATE/DROP/ALTER CATALOG catalog

  To be modified:

     - SHOW TABLES [FROM schema/catalog.schema] - show tables from current or
     a specified schema. Add 'from schema' to existing 'SHOW TABLES' statement
     - SHOW FUNCTIONS [FROM schema/catalog.schema] - show functions from
     current or a specified schema. Add 'from schema' to existing 'SHOW FUNCTIONS'
     statement
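As a rough illustration of the name resolution these statements imply (all class and method names below are hypothetical, not part of any proposal in this thread): USE would update the session defaults, and partial table paths would be expanded against them.

```java
// Hypothetical sketch of how USE <catalog> / USE <catalog.schema> session
// defaults could feed into resolving a table reference. Not the actual
// Flink API; names and defaults here are invented for illustration.
class NameResolverSketch {
    private String defaultCatalog = "builtin";
    private String defaultSchema = "default";

    // USE <catalog>: set the default catalog, keep the default schema.
    void useCatalog(String catalog) {
        this.defaultCatalog = catalog;
    }

    // USE <catalog.schema>: set both defaults.
    void useSchema(String catalog, String schema) {
        this.defaultCatalog = catalog;
        this.defaultSchema = schema;
    }

    // Expands a possibly-partial table path to the full catalog.schema.table form.
    String resolve(String path) {
        String[] parts = path.split("\\.");
        switch (parts.length) {
            case 1: return defaultCatalog + "." + defaultSchema + "." + parts[0];
            case 2: return defaultCatalog + "." + parts[0] + "." + parts[1];
            case 3: return path;
            default: throw new IllegalArgumentException("Bad table path: " + path);
        }
    }
}
```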


  Thanks, Bowen



  On Wed, Nov 14, 2018 at 10:39 PM Zhang, Xuefu <xu...@alibaba-inc.com>
  wrote:

  > Thanks, Bowen, for catching the error. I have granted comment permission
  > with the link.
  >
  > I also updated the doc with the latest class definitions. Everyone is
  > encouraged to review and comment.
  >
  > Thanks,
  > Xuefu
  >
  > ------------------------------------------------------------------
  > Sender:Bowen Li <bo...@gmail.com>
  > Sent at:2018 Nov 14 (Wed) 06:44
  > Recipient:Xuefu <xu...@alibaba-inc.com>
  > Cc:piotr <pi...@data-artisans.com>; dev <de...@flink.apache.org>; Shuyi
  > Chen <su...@gmail.com>
  > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
  >
  > Hi Xuefu,
  >
  > Currently the new design doc
  > <https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit>
  > is on “view only" mode, and people cannot leave comments. Can you please
  > change it to "can comment" or "can edit" mode?
  >
  > Thanks, Bowen
  >
  >
  > On Mon, Nov 12, 2018 at 9:51 PM Zhang, Xuefu <xu...@alibaba-inc.com>
  > wrote:
  > Hi Piotr
  >
  > I have extracted the API portion of  the design and the google doc is here
  > <https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit?usp=sharing>.
  > Please review and provide your feedback.
  >
  > Thanks,
  > Xuefu
  >
  > ------------------------------------------------------------------
  > Sender:Xuefu <xu...@alibaba-inc.com>
  > Sent at:2018 Nov 12 (Mon) 12:43
  > Recipient:Piotr Nowojski <pi...@data-artisans.com>; dev <
  > dev@flink.apache.org>
  > Cc:Bowen Li <bo...@gmail.com>; Shuyi Chen <su...@gmail.com>
  > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
  >
  > Hi Piotr,
  >
  > That sounds good to me. Let's close all the open questions ((there are a
  > couple of them)) in the Google doc and I should be able to quickly split
  > it into the three proposals as you suggested.
  >
  > Thanks,
  > Xuefu
  >
  > ------------------------------------------------------------------
  > Sender:Piotr Nowojski <pi...@data-artisans.com>
  > Sent at:2018 Nov 9 (Fri) 22:46
  > Recipient:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>
  > Cc:Bowen Li <bo...@gmail.com>; Shuyi Chen <su...@gmail.com>
  > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
  >
  > Hi,
  >
  >
  > Yes, it seems like the best solution. Maybe someone else can also suggests if we can split it further? Maybe changes in the interface in one doc, reading from hive meta store another and final storing our meta informations in hive meta store?
  >
  > Piotrek
  >
  > > On 9 Nov 2018, at 01:44, Zhang, Xuefu <xu...@alibaba-inc.com> wrote:
  > >
  > > Hi Piotr,
  > >
  > > That seems to be good idea!
  > >
  >
  > > Since the google doc for the design is currently under extensive review, I will leave it as it is for now. However, I'll convert it to two different FLIPs when the time comes.
  > >
  > > How does it sound to you?
  > >
  > > Thanks,
  > > Xuefu
  > >
  > >
  > > ------------------------------------------------------------------
  > > Sender:Piotr Nowojski <pi...@data-artisans.com>
  > > Sent at:2018 Nov 9 (Fri) 02:31
  > > Recipient:dev <de...@flink.apache.org>
  > > Cc:Bowen Li <bo...@gmail.com>; Xuefu <xuefu.z@alibaba-inc.com
  > >; Shuyi Chen <su...@gmail.com>
  > > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
  > >
  > > Hi,
  > >
  >
  > > Maybe we should split this topic (and the design doc) into couple of smaller ones, hopefully independent. The questions that you have asked Fabian have for example very little to do with reading metadata from Hive Meta Store?
  > >
  > > Piotrek
  > >
  > >> On 7 Nov 2018, at 14:27, Fabian Hueske <fh...@gmail.com> wrote:
  > >>
  > >> Hi Xuefu and all,
  > >>
  > >> Thanks for sharing this design document!
  >
  > >> I'm very much in favor of restructuring / reworking the catalog handling in
  > >> Flink SQL as outlined in the document.
  >
  > >> Most changes described in the design document seem to be rather general and
  > >> not specifically related to the Hive integration.
  > >>
  >
  > >> IMO, there are some aspects, especially those at the boundary of Hive and
  > >> Flink, that need a bit more discussion. For example
  > >>
  > >> * What does it take to make Flink schema compatible with Hive schema?
  > >> * How will Flink tables (descriptors) be stored in HMS?
  > >> * How do both Hive catalogs differ? Could they be integrated into to a
  > >> single one? When to use which one?
  >
  > >> * What meta information is provided by HMS? What of this can be leveraged
  > >> by Flink?
  > >>
  > >> Thank you,
  > >> Fabian
  > >>
  > >> Am Fr., 2. Nov. 2018 um 00:31 Uhr schrieb Bowen Li <bowenli86@gmail.com
  > >:
  > >>
  > >>> After taking a look at how other discussion threads work, I think it's
  > >>> actually fine just keep our discussion here. It's up to you, Xuefu.
  > >>>
  > >>> The google doc LGTM. I left some minor comments.
  > >>>
  > >>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li <bo...@gmail.com> wrote:
  > >>>
  > >>>> Hi all,
  > >>>>
  > >>>> As Xuefu has published the design doc on google, I agree with Shuyi's
  >
  > >>>> suggestion that we probably should start a new email thread like "[DISCUSS]
  >
  > >>>> ... Hive integration design ..." on only dev mailing list for community
  > >>>> devs to review. The current thread sends to both dev and user list.
  > >>>>
  >
  > >>>> This email thread is more like validating the general idea and direction
  >
  > >>>> with the community, and it's been pretty long and crowded so far. Since
  >
  > >>>> everyone is pro for the idea, we can move forward with another thread to
  > >>>> discuss and finalize the design.
  > >>>>
  > >>>> Thanks,
  > >>>> Bowen
  > >>>>
  > >>>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu <
  > xuefu.z@alibaba-inc.com>
  > >>>> wrote:
  > >>>>
  > >>>>> Hi Shuiyi,
  > >>>>>
  >
  > >>>>> Good idea. Actually the PDF was converted from a google doc. Here is its
  > >>>>> link:
  > >>>>>
  > >>>>>
  > https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
  > >>>>> Once we reach an agreement, I can convert it to a FLIP.
  > >>>>>
  > >>>>> Thanks,
  > >>>>> Xuefu
  > >>>>>
  > >>>>>
  > >>>>>
  > >>>>> ------------------------------------------------------------------
  > >>>>> Sender:Shuyi Chen <su...@gmail.com>
  > >>>>> Sent at:2018 Nov 1 (Thu) 02:47
  > >>>>> Recipient:Xuefu <xu...@alibaba-inc.com>
  > >>>>> Cc:vino yang <ya...@gmail.com>; Fabian Hueske <
  > fhueske@gmail.com>;
  > >>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
  > >>>>>
  > >>>>> Hi Xuefu,
  > >>>>>
  >
  > >>>>> Thanks a lot for driving this big effort. I would suggest convert your
  >
  > >>>>> proposal and design doc into a google doc, and share it on the dev mailing
  >
  > >>>>> list for the community to review and comment with title like "[DISCUSS] ...
  >
  > >>>>> Hive integration design ..." . Once approved,  we can document it as a FLIP
  >
  > >>>>> (Flink Improvement Proposals), and use JIRAs to track the implementations.
  > >>>>> What do you think?
  > >>>>>
  > >>>>> Shuyi
  > >>>>>
  > >>>>> On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu <
  > xuefu.z@alibaba-inc.com>
  > >>>>> wrote:
  > >>>>> Hi all,
  > >>>>>
  > >>>>> I have also shared a design doc on Hive metastore integration that is
  >
  > >>>>> attached here and also to FLINK-10556[1]. Please kindly review and share
  > >>>>> your feedback.
  > >>>>>
  > >>>>>
  > >>>>> Thanks,
  > >>>>> Xuefu
  > >>>>>
  > >>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
  > >>>>> ------------------------------------------------------------------
  > >>>>> Sender:Xuefu <xu...@alibaba-inc.com>
  > >>>>> Sent at:2018 Oct 25 (Thu) 01:08
  > >>>>> Recipient:Xuefu <xu...@alibaba-inc.com>; Shuyi Chen <
  > >>>>> suez1224@gmail.com>
  > >>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <
  > fhueske@gmail.com>;
  > >>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
  > >>>>>
  > >>>>> Hi all,
  > >>>>>
  > >>>>> To wrap up the discussion, I have attached a PDF describing the
  >
  > >>>>> proposal, which is also attached to FLINK-10556 [1]. Please feel free to
  > >>>>> watch that JIRA to track the progress.
  > >>>>>
  > >>>>> Please also let me know if you have additional comments or questions.
  > >>>>>
  > >>>>> Thanks,
  > >>>>> Xuefu
  > >>>>>
  > >>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
  > >>>>>
  > >>>>>
  > >>>>> ------------------------------------------------------------------
  > >>>>> Sender:Xuefu <xu...@alibaba-inc.com>
  > >>>>> Sent at:2018 Oct 16 (Tue) 03:40
  > >>>>> Recipient:Shuyi Chen <su...@gmail.com>
  > >>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <
  > fhueske@gmail.com>;
  > >>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
  > >>>>>
  > >>>>> Hi Shuyi,
  > >>>>>
  >
  > >>>>> Thank you for your input. Yes, I agreed with a phased approach and like
  >
  > >>>>> to move forward fast. :) We did some work internally on DDL utilizing babel
  > >>>>> parser in Calcite. While babel makes Calcite's grammar extensible, at
  > >>>>> first impression it still seems too cumbersome for a project when too
  >
  > >>>>> much extensions are made. It's even challenging to find where the extension
  >
  > >>>>> is needed! It would be certainly better if Calcite can magically support
  >
  > >>>>> Hive QL by just turning on a flag, such as that for MYSQL_5. I can also
  >
  > >>>>> see that this could mean a lot of work on Calcite. Nevertheless, I will
  >
  > >>>>> bring up the discussion over there and to see what their community thinks.
  > >>>>>
  > >>>>> Would mind to share more info about the proposal on DDL that you
  > >>>>> mentioned? We can certainly collaborate on this.
  > >>>>>
  > >>>>> Thanks,
  > >>>>> Xuefu
  > >>>>>
  > >>>>> ------------------------------------------------------------------
  > >>>>> Sender:Shuyi Chen <su...@gmail.com>
  > >>>>> Sent at:2018 Oct 14 (Sun) 08:30
  > >>>>> Recipient:Xuefu <xu...@alibaba-inc.com>
  > >>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <
  > fhueske@gmail.com>;
  > >>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
  > >>>>>
  > >>>>> Welcome to the community and thanks for the great proposal, Xuefu! I
  >
  > >>>>> think the proposal can be divided into 2 stages: making Flink to support
  >
  > >>>>> Hive features, and make Hive to work with Flink. I agreed with Timo that on
  >
  > >>>>> starting with a smaller scope, so we can make progress faster. As for [6],
  >
  > >>>>> a proposal for DDL is already in progress, and will come after the unified
  >
  > >>>>> SQL connector API is done. For supporting Hive syntax, we might need to
  > >>>>> work with the Calcite community, and a recent effort called babel (
  > >>>>> https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite might
  > >>>>> help here.
  > >>>>>
  > >>>>> Thanks
  > >>>>> Shuyi
  > >>>>>
  > >>>>> On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <
  > xuefu.z@alibaba-inc.com>
  > >>>>> wrote:
  > >>>>> Hi Fabian/Vno,
  > >>>>>
  >
  > >>>>> Thank you very much for your encouragement inquiry. Sorry that I didn't
  >
  > >>>>> see Fabian's email until I read Vino's response just now. (Somehow Fabian's
  > >>>>> went to the spam folder.)
  > >>>>>
  >
  > >>>>> My proposal contains long-term and short-terms goals. Nevertheless, the
  > >>>>> effort will focus on the following areas, including Fabian's list:
  > >>>>>
  > >>>>> 1. Hive metastore connectivity - This covers both read/write access,
  >
  > >>>>> which means Flink can make full use of Hive's metastore as its catalog (at
  > >>>>> least for the batch but can extend for streaming as well).
  >
  > >>>>> 2. Metadata compatibility - Objects (databases, tables, partitions, etc)
  >
  > >>>>> created by Hive can be understood by Flink and the reverse direction is
  > >>>>> true also.
  > >>>>> 3. Data compatibility - Similar to #2, data produced by Hive can be
  > >>>>> consumed by Flink and vise versa.
  >
  > >>>>> 4. Support Hive UDFs - For all Hive's native udfs, Flink either provides
  > >>>>> its own implementation or make Hive's implementation work in Flink.
  > >>>>> Further, for user created UDFs in Hive, Flink SQL should provide a
  >
  > >>>>> mechanism allowing user to import them into Flink without any code change
  > >>>>> required.
  > >>>>> 5. Data types -  Flink SQL should support all data types that are
  > >>>>> available in Hive.
  > >>>>> 6. SQL Language - Flink SQL should support SQL standard (such as
  >
  > >>>>> SQL2003) with extension to support Hive's syntax and language features,
  > >>>>> around DDL, DML, and SELECT queries.
  >
  > >>>>> 7.  SQL CLI - this is currently developing in Flink but more effort is
  > >>>>> needed.
  >
  > >>>>> 8. Server - provide a server that's compatible with Hive's HiverServer2
  >
  > >>>>> in thrift APIs, such that HiveServer2 users can reuse their existing client
  > >>>>> (such as beeline) but connect to Flink's thrift server instead.
  >
  > >>>>> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for
  > >>>>> other application to use to connect to its thrift server
  > >>>>> 10. Support other user's customizations in Hive, such as Hive Serdes,
  > >>>>> storage handlers, etc.
  >
  > >>>>> 11. Better task failure tolerance and task scheduling at Flink runtime.
  > >>>>>
  > >>>>> As you can see, achieving all those requires significant effort and
  >
  > >>>>> across all layers in Flink. However, a short-term goal could  include only
  >
  > >>>>> core areas (such as 1, 2, 4, 5, 6, 7) or start  at a smaller scope (such as
  > >>>>> #3, #6).
  > >>>>>
  >
  > >>>>> Please share your further thoughts. If we generally agree that this is
  >
  > >>>>> the right direction, I could come up with a formal proposal quickly and
  > >>>>> then we can follow up with broader discussions.
  > >>>>>
  > >>>>> Thanks,
  > >>>>> Xuefu
  > >>>>>
  > >>>>>
  > >>>>>
  > >>>>> ------------------------------------------------------------------
  > >>>>> Sender:vino yang <ya...@gmail.com>
  > >>>>> Sent at:2018 Oct 11 (Thu) 09:45
  > >>>>> Recipient:Fabian Hueske <fh...@gmail.com>
  > >>>>> Cc:dev <de...@flink.apache.org>; Xuefu <xuefu.z@alibaba-inc.com
  > >; user <
  > >>>>> user@flink.apache.org>
  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
  > >>>>>
  > >>>>> Hi Xuefu,
  > >>>>>
  >
  > >>>>> Appreciate this proposal, and like Fabian, it would look better if you
  > >>>>> can give more details of the plan.
  > >>>>>
  > >>>>> Thanks, vino.
  > >>>>>
  > >>>>> Fabian Hueske <fh...@gmail.com> 于2018年10月10日周三 下午5:27写道:
  > >>>>> Hi Xuefu,
  > >>>>>
  >
  > >>>>> Welcome to the Flink community and thanks for starting this discussion!
  > >>>>> Better Hive integration would be really great!
  > >>>>> Can you go into details of what you are proposing? I can think of a
  > >>>>> couple ways to improve Flink in that regard:
  > >>>>>
  > >>>>> * Support for Hive UDFs
  > >>>>> * Support for Hive metadata catalog
  > >>>>> * Support for HiveQL syntax
  > >>>>> * ???
  > >>>>>
  > >>>>> Best, Fabian
  > >>>>>
  > >>>>> Am Di., 9. Okt. 2018 um 19:22 Uhr schrieb Zhang, Xuefu <
  > >>>>> xuefu.z@alibaba-inc.com>:
  > >>>>> Hi all,
  > >>>>>
  > >>>>> Along with the community's effort, inside Alibaba we have explored
  >
  > >>>>> Flink's potential as an execution engine not just for stream processing but
  > >>>>> also for batch processing. We are encouraged by our findings and have
  >
  > >>>>> initiated our effort to make Flink's SQL capabilities full-fledged. When
  >
  > >>>>> comparing what's available in Flink to the offerings from competitive data
  >
  > >>>>> processing engines, we identified a major gap in Flink: a well integration
  >
  > >>>>> with Hive ecosystem. This is crucial to the success of Flink SQL and batch
  >
  > >>>>> due to the well-established data ecosystem around Hive. Therefore, we have
  >
  > >>>>> done some initial work along this direction but there are still a lot of
  > >>>>> effort needed.
  > >>>>>
  > >>>>> We have two strategies in mind. The first one is to make Flink SQL
  >
  > >>>>> full-fledged and well-integrated with Hive ecosystem. This is a similar
  >
  > >>>>> approach to what Spark SQL adopted. The second strategy is to make Hive
  >
  > >>>>> itself work with Flink, similar to the proposal in [1]. Each approach bears
  >
  > >>>>> its pros and cons, but they don’t need to be mutually exclusive with each
  > >>>>> targeting at different users and use cases. We believe that both will
  > >>>>> promote a much greater adoption of Flink beyond stream processing.
  > >>>>>
  > >>>>> We have been focused on the first approach and would like to showcase
  >
  > >>>>> Flink's batch and SQL capabilities with Flink SQL. However, we have also
  > >>>>> planned to start strategy #2 as the follow-up effort.
  > >>>>>
  >
  > >>>>> I'm completely new to Flink(, with a short bio [2] below), though many
  >
  > >>>>> of my colleagues here at Alibaba are long-time contributors. Nevertheless,
  >
  > >>>>> I'd like to share our thoughts and invite your early feedback. At the same
  >
  > >>>>> time, I am working on a detailed proposal on Flink SQL's integration with
  > >>>>> Hive ecosystem, which will be also shared when ready.
  > >>>>>
  > >>>>> While the ideas are simple, each approach will demand significant
  >
  > >>>>> effort, more than what we can afford. Thus, the input and contributions
  > >>>>> from the communities are greatly welcome and appreciated.
  > >>>>>
  > >>>>> Regards,
  > >>>>>
  > >>>>>
  > >>>>> Xuefu
  > >>>>>
  > >>>>> References:
  > >>>>>
  > >>>>> [1] https://issues.apache.org/jira/browse/HIVE-10712
  >
  > >>>>> [2] Xuefu Zhang is a long-time open source veteran, worked or working on
  > >>>>> many projects under Apache Foundation, of which he is also an honored
  >
  > >>>>> member. About 10 years ago he worked in the Hadoop team at Yahoo where the
  >
  > >>>>> projects just got started. Later he worked at Cloudera, initiating and
  >
  > >>>>> leading the development of Hive on Spark project in the communities and
  >
  > >>>>> across many organizations. Prior to joining Alibaba, he worked at Uber
  >
  > >>>>> where he promoted Hive on Spark to all Uber's SQL on Hadoop workload and
  > >>>>> significantly improved Uber's cluster efficiency.
  > >>>>>
  > >>>>>
  > >>>>>
  > >>>>>
  > >>>>> --
  >
  > >>>>> "So you have to trust that the dots will somehow connect in your future."
  > >>>>>
  > >>>>>
  > >>>>>
  >
  >


Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by Eron Wright <er...@gmail.com>.
Hi folks, there are clearly some incremental steps to be taken to introduce
catalog support to SQL Client, complementary to what is proposed in the
Flink-Hive Metastore design doc.  I was quietly working on this over the
holidays.   I posted some new sub-tasks, PRs, and sample code
to FLINK-10744.

What inspired me to get involved is that the catalog interface seems like a
great way to encapsulate a 'library' of Flink tables and functions.  For
example, the NYC Taxi dataset (TaxiRides, TaxiFares, various UDFs) may be
nicely encapsulated as a catalog (TaxiData).   Such a library should be
fully consumable in SQL Client.

I implemented the above.  Some highlights:

1. A fully-worked example of using the Taxi dataset in SQL Client via an
environment file.
- an ASCII video showing the SQL Client in action:
https://asciinema.org/a/C8xuAjmZSxCuApgFgZQyeIHuo

- the corresponding environment file (will be even more concise once
'FLINK-10696 Catalog UDFs' is merged):
https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/dist/conf/sql-client-defaults.yaml

- the typed API for standalone table applications:
https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/java/com/dataartisans/flinktraining/examples/table_java/examples/ViaCatalog.java#L50

2. Implementation of the core catalog descriptor and factory.  I realize
that some renames may later occur as per the design doc, and would be happy
to do that as a follow-up.
https://github.com/apache/flink/pull/7390

3. Implementation of a connect-style API on TableEnvironment to use catalog
descriptor.
https://github.com/apache/flink/pull/7392

4. Integration into SQL-Client's environment file:
https://github.com/apache/flink/pull/7393

I realize that the overall Hive integration is still evolving, but I
believe that these PRs are a good stepping stone. Here's the list (in
bottom-up order):
- https://github.com/apache/flink/pull/7386
- https://github.com/apache/flink/pull/7388
- https://github.com/apache/flink/pull/7389
- https://github.com/apache/flink/pull/7390
- https://github.com/apache/flink/pull/7392
- https://github.com/apache/flink/pull/7393

Thanks and enjoy 2019!
Eron W


On Sun, Nov 18, 2018 at 3:04 PM Zhang, Xuefu <xu...@alibaba-inc.com>
wrote:

> Hi Xiaowei,
>
> Thanks for bringing up the question. In the current design, the properties
> for meta objects are meant to cover anything that's specific to a
> particular catalog and agnostic to Flink. Anything that is common (such as
> schema for tables, query text for views, and udf classname) are abstracted
> as members of the respective classes. However, this is still in discussion,
> and Timo and I will go over this and provide an update.
>
> Please note that UDF is a little more involved than what the current
> design doc shows. I'm still refining this part.
>
> Thanks,
> Xuefu
>
>
> ------------------------------------------------------------------
> Sender:Xiaowei Jiang <xi...@gmail.com>
> Sent at:2018 Nov 18 (Sun) 15:17
> Recipient:dev <de...@flink.apache.org>
> Cc:Xuefu <xu...@alibaba-inc.com>; twalthr <tw...@apache.org>; piotr <
> piotr@data-artisans.com>; Fabian Hueske <fh...@gmail.com>; suez1224 <
> suez1224@gmail.com>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Thanks Xuefu for the detailed design doc! One question on the properties
> associated with the catalog objects. Are we going to leave them completely
> free form or we are going to set some standard for that? I think that the
> answer may depend on if we want to explore catalog specific optimization
> opportunities. In any case, I think that it might be helpful for
> standardize as much as possible into strongly typed classes and use leave
> these properties for catalog specific things. But I think that we can do it
> in steps.
>
> Xiaowei
> On Fri, Nov 16, 2018 at 4:00 AM Bowen Li <bo...@gmail.com> wrote:
> Thanks for keeping on improving the overall design, Xuefu! It looks quite
>  good to me now.
>
>  Would be nice that cc-ed Flink committers can help to review and confirm!
>
>
>
>  One minor suggestion: Since the last section of design doc already touches
>  some new sql statements, shall we add another section in our doc and
>  formalize the new sql statements in SQL Client and TableEnvironment that
>  are gonna come along naturally with our design? Here are some that the
>  design doc mentioned and some that I came up with:
>
>  To be added:
>
>     - USE <catalog> - set default catalog
>     - USE <catalog.schema> - set default schema
>     - SHOW CATALOGS - show all registered catalogs
>     - SHOW SCHEMAS [FROM catalog] - list schemas in the current default
>     catalog or the specified catalog
>     - DESCRIBE VIEW view - show the view's definition in CatalogView
>     - SHOW VIEWS [FROM schema/catalog.schema] - show views from current or
> a
>     specified schema.
>
>     (DDLs that can be addressed by either our design or Shuyi's DDL design)
>
>     - CREATE/DROP/ALTER SCHEMA schema
>     - CREATE/DROP/ALTER CATALOG catalog
>
>  To be modified:
>
>     - SHOW TABLES [FROM schema/catalog.schema] - show tables from the current
>     or a specified schema. Add 'FROM schema' to the existing 'SHOW TABLES'
>     statement.
>     - SHOW FUNCTIONS [FROM schema/catalog.schema] - show functions from the
>     current or a specified schema. Add 'FROM schema' to the existing
>     'SHOW FUNCTIONS' statement.
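As a rough, hypothetical sketch of the session semantics behind the USE and SHOW statements proposed above (this is not Flink code; all class and method names are invented for illustration):

```python
# Hypothetical sketch: USE mutates session state (default catalog/schema),
# while the SHOW variants are read-only lookups against registered catalogs.

class CatalogManager:
    def __init__(self):
        self.catalogs = {}            # catalog name -> {schema name: set of tables}
        self.current_catalog = None
        self.current_schema = None

    def register(self, catalog, schemas):
        self.catalogs[catalog] = schemas

    def use(self, path):
        # USE <catalog> or USE <catalog.schema>
        parts = path.split(".")
        self.current_catalog = parts[0]
        if len(parts) > 1:
            self.current_schema = parts[1]

    def show_catalogs(self):
        # SHOW CATALOGS
        return sorted(self.catalogs)

    def show_schemas(self, catalog=None):
        # SHOW SCHEMAS [FROM catalog] - falls back to the session default
        return sorted(self.catalogs[catalog or self.current_catalog])

mgr = CatalogManager()
mgr.register("hive", {"default": {"orders"}, "marketing": set()})
mgr.use("hive.default")
print(mgr.show_catalogs())    # ['hive']
print(mgr.show_schemas())     # ['default', 'marketing']
```

A real implementation would live in the TableEnvironment / SQL Client layer; the sketch only illustrates that USE changes session defaults and the SHOW statements resolve against them.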
>
>
>  Thanks, Bowen
>
>
>
>  On Wed, Nov 14, 2018 at 10:39 PM Zhang, Xuefu <xu...@alibaba-inc.com>
>  wrote:
>
>  > Thanks, Bowen, for catching the error. I have granted comment permission
>  > with the link.
>  >
>  > I also updated the doc with the latest class definitions. Everyone is
>  > encouraged to review and comment.
>  >
>  > Thanks,
>  > Xuefu
>  >
>  > ------------------------------------------------------------------
>  > Sender:Bowen Li <bo...@gmail.com>
>  > Sent at:2018 Nov 14 (Wed) 06:44
>  > Recipient:Xuefu <xu...@alibaba-inc.com>
>  > Cc:piotr <pi...@data-artisans.com>; dev <de...@flink.apache.org>; Shuyi
>  > Chen <su...@gmail.com>
>  > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>  >
>  > Hi Xuefu,
>  >
>  > Currently the new design doc
>  > <https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit>
>  > is on "view only" mode, and people cannot leave comments. Can you please
>  > change it to "can comment" or "can edit" mode?
>  >
>  > Thanks, Bowen
>  >
>  >
>  > On Mon, Nov 12, 2018 at 9:51 PM Zhang, Xuefu <xu...@alibaba-inc.com>
>  > wrote:
>  > Hi Piotr
>  >
>  > I have extracted the API portion of the design, and the google doc is here
>  > <https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit?usp=sharing>.
>  > Please review and provide your feedback.
>  >
>  > Thanks,
>  > Xuefu
>  >
>  > ------------------------------------------------------------------
>  > Sender:Xuefu <xu...@alibaba-inc.com>
>  > Sent at:2018 Nov 12 (Mon) 12:43
>  > Recipient:Piotr Nowojski <pi...@data-artisans.com>; dev <
>  > dev@flink.apache.org>
>  > Cc:Bowen Li <bo...@gmail.com>; Shuyi Chen <su...@gmail.com>
>  > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>  >
>  > Hi Piotr,
>  >
>  > That sounds good to me. Let's close all the open questions (there are a
>  > couple of them) in the Google doc, and I should be able to quickly split
>  > it into the three proposals as you suggested.
>  >
>  > Thanks,
>  > Xuefu
>  >
>  > ------------------------------------------------------------------
>  > Sender:Piotr Nowojski <pi...@data-artisans.com>
>  > Sent at:2018 Nov 9 (Fri) 22:46
>  > Recipient:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>
>  > Cc:Bowen Li <bo...@gmail.com>; Shuyi Chen <su...@gmail.com>
>  > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>  >
>  > Hi,
>  >
>  >
>  > Yes, that seems like the best solution. Maybe someone else can also suggest
>  > whether we can split it further? Perhaps changes to the interface in one doc,
>  > reading from the Hive metastore in another, and finally storing our meta
>  > information in the Hive metastore?
>  >
>  > Piotrek
>  >
>  > > On 9 Nov 2018, at 01:44, Zhang, Xuefu <xu...@alibaba-inc.com>
> wrote:
>  > >
>  > > Hi Piotr,
>  > >
>  > > That seems to be a good idea!
>  > >
>  >
>  > > Since the google doc for the design is currently under extensive
> review, I will leave it as it is for now. However, I'll convert it to two
> different FLIPs when the time comes.
>  > >
>  > > How does it sound to you?
>  > >
>  > > Thanks,
>  > > Xuefu
>  > >
>  > >
>  > > ------------------------------------------------------------------
>  > > Sender:Piotr Nowojski <pi...@data-artisans.com>
>  > > Sent at:2018 Nov 9 (Fri) 02:31
>  > > Recipient:dev <de...@flink.apache.org>
>  > > Cc:Bowen Li <bo...@gmail.com>; Xuefu <xuefu.z@alibaba-inc.com
>  > >; Shuyi Chen <su...@gmail.com>
>  > > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>  > >
>  > > Hi,
>  > >
>  >
>  > > Maybe we should split this topic (and the design doc) into a couple of
>  > > smaller, hopefully independent ones. The questions that you have asked
>  > > Fabian, for example, have very little to do with reading metadata from the
>  > > Hive Metastore?
>  > >
>  > > Piotrek
>  > >
>  > >> On 7 Nov 2018, at 14:27, Fabian Hueske <fh...@gmail.com> wrote:
>  > >>
>  > >> Hi Xuefu and all,
>  > >>
>  > >> Thanks for sharing this design document!
>  >
>  > >> I'm very much in favor of restructuring / reworking the catalog
> handling in
>  > >> Flink SQL as outlined in the document.
>  >
>  > >> Most changes described in the design document seem to be rather
> general and
>  > >> not specifically related to the Hive integration.
>  > >>
>  >
>  > >> IMO, there are some aspects, especially those at the boundary of
> Hive and
>  > >> Flink, that need a bit more discussion. For example
>  > >>
>  > >> * What does it take to make Flink schema compatible with Hive schema?
>  > >> * How will Flink tables (descriptors) be stored in HMS?
>  > >> * How do both Hive catalogs differ? Could they be integrated into a
>  > >> single one? When to use which one?
>  >
>  > >> * What meta information is provided by HMS? What of this can be
> leveraged
>  > >> by Flink?
>  > >>
>  > >> Thank you,
>  > >> Fabian
>  > >>
>  > >> Am Fr., 2. Nov. 2018 um 00:31 Uhr schrieb Bowen Li <
> bowenli86@gmail.com
>  > >:
>  > >>
>  > >>> After taking a look at how other discussion threads work, I think it's
>  > >>> actually fine to just keep our discussion here. It's up to you, Xuefu.
>  > >>>
>  > >>> The google doc LGTM. I left some minor comments.
>  > >>>
>  > >>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li <bo...@gmail.com>
> wrote:
>  > >>>
>  > >>>> Hi all,
>  > >>>>
>  > >>>> As Xuefu has published the design doc on google, I agree with
> Shuyi's
>  >
>  > >>>> suggestion that we probably should start a new email thread like
> "[DISCUSS]
>  >
>  > >>>> ... Hive integration design ..." on only dev mailing list for
> community
>  > >>>> devs to review. The current thread sends to both dev and user list.
>  > >>>>
>  >
>  > >>>> This email thread is more like validating the general idea and
> direction
>  >
>  > >>>> with the community, and it's been pretty long and crowded so far.
> Since
>  >
>  > >>>> everyone is pro for the idea, we can move forward with another
> thread to
>  > >>>> discuss and finalize the design.
>  > >>>>
>  > >>>> Thanks,
>  > >>>> Bowen
>  > >>>>
>  > >>>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu <
>  > xuefu.z@alibaba-inc.com>
>  > >>>> wrote:
>  > >>>>
>  > >>>>> Hi Shuyi,
>  > >>>>>
>  >
>  > >>>>> Good idea. Actually the PDF was converted from a google doc. Here
> is its
>  > >>>>> link:
>  > >>>>>
>  > >>>>>
>  >
> https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
>  > >>>>> Once we reach an agreement, I can convert it to a FLIP.
>  > >>>>>
>  > >>>>> Thanks,
>  > >>>>> Xuefu
>  > >>>>>
>  > >>>>>
>  > >>>>>
>  > >>>>> ------------------------------------------------------------------
>  > >>>>> Sender:Shuyi Chen <su...@gmail.com>
>  > >>>>> Sent at:2018 Nov 1 (Thu) 02:47
>  > >>>>> Recipient:Xuefu <xu...@alibaba-inc.com>
>  > >>>>> Cc:vino yang <ya...@gmail.com>; Fabian Hueske <
>  > fhueske@gmail.com>;
>  > >>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>  > >>>>>
>  > >>>>> Hi Xuefu,
>  > >>>>>
>  >
>  > >>>>> Thanks a lot for driving this big effort. I would suggest converting your
>  > >>>>> proposal and design doc into a google doc, and sharing it on the dev mailing
>  > >>>>> list for the community to review and comment with a title like "[DISCUSS] ...
>  > >>>>> Hive integration design ...". Once approved, we can document it as a FLIP
>  > >>>>> (Flink Improvement Proposal), and use JIRAs to track the implementations.
>  > >>>>> What do you think?
>  > >>>>>
>  > >>>>> Shuyi
>  > >>>>>
>  > >>>>> On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu <
>  > xuefu.z@alibaba-inc.com>
>  > >>>>> wrote:
>  > >>>>> Hi all,
>  > >>>>>
>  > >>>>> I have also shared a design doc on Hive metastore integration
> that is
>  >
>  > >>>>> attached here and also to FLINK-10556[1]. Please kindly review
> and share
>  > >>>>> your feedback.
>  > >>>>>
>  > >>>>>
>  > >>>>> Thanks,
>  > >>>>> Xuefu
>  > >>>>>
>  > >>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
>  > >>>>> ------------------------------------------------------------------
>  > >>>>> Sender:Xuefu <xu...@alibaba-inc.com>
>  > >>>>> Sent at:2018 Oct 25 (Thu) 01:08
>  > >>>>> Recipient:Xuefu <xu...@alibaba-inc.com>; Shuyi Chen <
>  > >>>>> suez1224@gmail.com>
>  > >>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <
>  > fhueske@gmail.com>;
>  > >>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>  > >>>>>
>  > >>>>> Hi all,
>  > >>>>>
>  > >>>>> To wrap up the discussion, I have attached a PDF describing the
>  >
>  > >>>>> proposal, which is also attached to FLINK-10556 [1]. Please feel
> free to
>  > >>>>> watch that JIRA to track the progress.
>  > >>>>>
>  > >>>>> Please also let me know if you have additional comments or
> questions.
>  > >>>>>
>  > >>>>> Thanks,
>  > >>>>> Xuefu
>  > >>>>>
>  > >>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
>  > >>>>>
>  > >>>>>
>  > >>>>> ------------------------------------------------------------------
>  > >>>>> Sender:Xuefu <xu...@alibaba-inc.com>
>  > >>>>> Sent at:2018 Oct 16 (Tue) 03:40
>  > >>>>> Recipient:Shuyi Chen <su...@gmail.com>
>  > >>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <
>  > fhueske@gmail.com>;
>  > >>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>  > >>>>>
>  > >>>>> Hi Shuyi,
>  > >>>>>
>  >
>  > >>>>> Thank you for your input. Yes, I agree with a phased approach and would
>  > >>>>> like to move forward fast. :) We did some work internally on DDL utilizing
>  > >>>>> the babel parser in Calcite. While babel makes Calcite's grammar
>  > >>>>> extensible, at first impression it still seems too cumbersome for a
>  > >>>>> project when too many extensions are made. It's even challenging to find
>  > >>>>> where the extension is needed! It would certainly be better if Calcite
>  > >>>>> could magically support Hive QL by just turning on a flag, such as that
>  > >>>>> for MYSQL_5. I can also see that this could mean a lot of work on Calcite.
>  > >>>>> Nevertheless, I will bring up the discussion over there and see what their
>  > >>>>> community thinks.
>  > >>>>>
>  > >>>>> Would you mind sharing more info about the proposal on DDL that you
>  > >>>>> mentioned? We can certainly collaborate on this.
>  > >>>>>
>  > >>>>> Thanks,
>  > >>>>> Xuefu
>  > >>>>>
>  > >>>>> ------------------------------------------------------------------
>  > >>>>> Sender:Shuyi Chen <su...@gmail.com>
>  > >>>>> Sent at:2018 Oct 14 (Sun) 08:30
>  > >>>>> Recipient:Xuefu <xu...@alibaba-inc.com>
>  > >>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <
>  > fhueske@gmail.com>;
>  > >>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>  > >>>>>
>  > >>>>> Welcome to the community and thanks for the great proposal, Xuefu! I
>  > >>>>> think the proposal can be divided into 2 stages: making Flink support
>  > >>>>> Hive features, and making Hive work with Flink. I agree with Timo on
>  > >>>>> starting with a smaller scope, so we can make progress faster. As for [6],
>  > >>>>> a proposal for DDL is already in progress, and will come after the unified
>  > >>>>> SQL connector API is done. For supporting Hive syntax, we might need to
>  > >>>>> work with the Calcite community, and a recent effort called babel
>  > >>>>> (https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite might
>  > >>>>> help here.
>  > >>>>>
>  > >>>>> Thanks
>  > >>>>> Shuyi
>  > >>>>>
>  > >>>>> On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <
>  > xuefu.z@alibaba-inc.com>
>  > >>>>> wrote:
>  > >>>>> Hi Fabian/Vino,
>  > >>>>>
>  > >>>>> Thank you very much for your encouragement and inquiry. Sorry that I
>  > >>>>> didn't see Fabian's email until I read Vino's response just now. (Somehow
>  > >>>>> Fabian's went to the spam folder.)
>  > >>>>>
>  >
>  > >>>>> My proposal contains long-term and short-term goals. Nevertheless, the
>  > >>>>> effort will focus on the following areas, including Fabian's list:
>  > >>>>>
>  > >>>>> 1. Hive metastore connectivity - This covers both read/write access,
>  > >>>>> which means Flink can make full use of Hive's metastore as its catalog
>  > >>>>> (at least for batch, but this can be extended for streaming as well).
>  > >>>>> 2. Metadata compatibility - Objects (databases, tables, partitions, etc.)
>  > >>>>> created by Hive can be understood by Flink, and the reverse direction is
>  > >>>>> true also.
>  > >>>>> 3. Data compatibility - Similar to #2, data produced by Hive can be
>  > >>>>> consumed by Flink and vice versa.
>  > >>>>> 4. Support Hive UDFs - For all of Hive's native udfs, Flink either
>  > >>>>> provides its own implementation or makes Hive's implementation work in
>  > >>>>> Flink. Further, for user-created UDFs in Hive, Flink SQL should provide a
>  > >>>>> mechanism allowing users to import them into Flink without any code
>  > >>>>> change required.
>  > >>>>> 5. Data types - Flink SQL should support all data types that are
>  > >>>>> available in Hive.
>  > >>>>> 6. SQL language - Flink SQL should support the SQL standard (such as
>  > >>>>> SQL:2003) with extensions to support Hive's syntax and language features,
>  > >>>>> around DDL, DML, and SELECT queries.
>  > >>>>> 7. SQL CLI - this is currently being developed in Flink, but more effort
>  > >>>>> is needed.
>  > >>>>> 8. Server - provide a server that's compatible with Hive's HiveServer2
>  > >>>>> in thrift APIs, such that HiveServer2 users can reuse their existing
>  > >>>>> clients (such as beeline) but connect to Flink's thrift server instead.
>  > >>>>> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for
>  > >>>>> other applications to use to connect to its thrift server.
>  > >>>>> 10. Support other user customizations in Hive, such as Hive SerDes,
>  > >>>>> storage handlers, etc.
>  > >>>>> 11. Better task failure tolerance and task scheduling at Flink runtime.
>  > >>>>>
>  > >>>>> As you can see, achieving all of this requires significant effort across
>  > >>>>> all layers in Flink. However, a short-term goal could include only the
>  > >>>>> core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope (such as
>  > >>>>> #3, #6).
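For point 4 in the list above, here is a minimal, hypothetical sketch of the "import user UDFs without code change" idea. Real Hive UDFs are Java classes with an evaluate() method; a plain Python class stands in for one here, and all names are invented for illustration:

```python
# Hypothetical sketch: the engine registers an existing UDF class by name
# and adapts its evaluate() method to the engine's call convention, so the
# UDF author does not have to modify any code.

class UpperUdf:
    # Stands in for a user's existing Hive UDF class.
    def evaluate(self, s):
        return s.upper()

class FunctionCatalog:
    def __init__(self):
        self._functions = {}

    def register_external(self, name, udf_class):
        # Instantiate the external UDF and store its evaluate() as the
        # engine-callable function body.
        instance = udf_class()
        self._functions[name] = instance.evaluate

    def call(self, name, *args):
        return self._functions[name](*args)

catalog = FunctionCatalog()
catalog.register_external("my_upper", UpperUdf)
print(catalog.call("my_upper", "flink"))    # prints FLINK
```

In the real feature, the registration step would load the user's Hive UDF jar and wrap the Java class behind Flink's function interface; the sketch only shows the adapter shape.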
>  > >>>>>
>  >
>  > >>>>> Please share your further thoughts. If we generally agree that this is
>  > >>>>> the right direction, I could come up with a formal proposal quickly and
>  > >>>>> then we can follow up with broader discussions.
>  > >>>>>
>  > >>>>> Thanks,
>  > >>>>> Xuefu
>  > >>>>>
>  > >>>>>
>  > >>>>>
>  > >>>>> ------------------------------------------------------------------
>  > >>>>> Sender:vino yang <ya...@gmail.com>
>  > >>>>> Sent at:2018 Oct 11 (Thu) 09:45
>  > >>>>> Recipient:Fabian Hueske <fh...@gmail.com>
>  > >>>>> Cc:dev <de...@flink.apache.org>; Xuefu <xuefu.z@alibaba-inc.com
>  > >; user <
>  > >>>>> user@flink.apache.org>
>  > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>  > >>>>>
>  > >>>>> Hi Xuefu,
>  > >>>>>
>  >
>  > >>>>> I appreciate this proposal, and like Fabian, I think it would be better
>  > >>>>> if you could give more details of the plan.
>  > >>>>>
>  > >>>>> Thanks, vino.
>  > >>>>>
>  > >>>>> Fabian Hueske <fh...@gmail.com> 于2018年10月10日周三 下午5:27写道:
>  > >>>>> Hi Xuefu,
>  > >>>>>
>  >
>  > >>>>> Welcome to the Flink community and thanks for starting this
> discussion!
>  > >>>>> Better Hive integration would be really great!
>  > >>>>> Can you go into details of what you are proposing? I can think of a
>  > >>>>> couple of ways to improve Flink in that regard:
>  > >>>>>
>  > >>>>> * Support for Hive UDFs
>  > >>>>> * Support for Hive metadata catalog
>  > >>>>> * Support for HiveQL syntax
>  > >>>>> * ???
>  > >>>>>
>  > >>>>> Best, Fabian
>  > >>>>>
>  > >>>>> Am Di., 9. Okt. 2018 um 19:22 Uhr schrieb Zhang, Xuefu <
>  > >>>>> xuefu.z@alibaba-inc.com>:
>  > >>>>> Hi all,
>  > >>>>>
>  > >>>>> Along with the community's effort, inside Alibaba we have explored
>  > >>>>> Flink's potential as an execution engine not just for stream processing
>  > >>>>> but also for batch processing. We are encouraged by our findings and have
>  > >>>>> initiated our effort to make Flink's SQL capabilities full-fledged. When
>  > >>>>> comparing what's available in Flink to the offerings from competing data
>  > >>>>> processing engines, we identified a major gap in Flink: good integration
>  > >>>>> with the Hive ecosystem. This is crucial to the success of Flink SQL and
>  > >>>>> batch due to the well-established data ecosystem around Hive. Therefore,
>  > >>>>> we have done some initial work along this direction, but a lot of effort
>  > >>>>> is still needed.
>  > >>>>>
>  > >>>>> We have two strategies in mind. The first one is to make Flink SQL
>  > >>>>> full-fledged and well-integrated with the Hive ecosystem. This is a
>  > >>>>> similar approach to what Spark SQL adopted. The second strategy is to
>  > >>>>> make Hive itself work with Flink, similar to the proposal in [1]. Each
>  > >>>>> approach bears its pros and cons, but they don't need to be mutually
>  > >>>>> exclusive, with each targeting different users and use cases. We believe
>  > >>>>> that both will promote a much greater adoption of Flink beyond stream
>  > >>>>> processing.
>  > >>>>>
>  > >>>>> We have been focused on the first approach and would like to showcase
>  > >>>>> Flink's batch and SQL capabilities with Flink SQL. However, we have also
>  > >>>>> planned to start strategy #2 as a follow-up effort.
>  > >>>>>
>  > >>>>> I'm completely new to Flink (with a short bio [2] below), though many of
>  > >>>>> my colleagues here at Alibaba are long-time contributors. Nevertheless,
>  > >>>>> I'd like to share our thoughts and invite your early feedback. At the
>  > >>>>> same time, I am working on a detailed proposal on Flink SQL's integration
>  > >>>>> with the Hive ecosystem, which will also be shared when ready.
>  > >>>>>
>  > >>>>> While the ideas are simple, each approach will demand significant
>  > >>>>> effort, more than what we can afford. Thus, input and contributions from
>  > >>>>> the communities are greatly welcome and appreciated.
>  > >>>>>
>  > >>>>> Regards,
>  > >>>>>
>  > >>>>>
>  > >>>>> Xuefu
>  > >>>>>
>  > >>>>> References:
>  > >>>>>
>  > >>>>> [1] https://issues.apache.org/jira/browse/HIVE-10712
>  >
>  > >>>>> [2] Xuefu Zhang is a long-time open source veteran who has worked on
>  > >>>>> many projects under the Apache Foundation, of which he is also an honored
>  > >>>>> member. About 10 years ago he worked in the Hadoop team at Yahoo, when
>  > >>>>> the projects had just gotten started. Later he worked at Cloudera,
>  > >>>>> initiating and leading the development of the Hive on Spark project in
>  > >>>>> the communities and across many organizations. Prior to joining Alibaba,
>  > >>>>> he worked at Uber, where he promoted Hive on Spark for all of Uber's
>  > >>>>> SQL-on-Hadoop workload and significantly improved Uber's cluster
>  > >>>>> efficiency.
>  > >>>>>
>  > >>>>>
>  > >>>>>
>  > >>>>>
>  > >>>>> --
>  >
>  > >>>>> "So you have to trust that the dots will somehow connect in your
> future."
>  > >>>>>
>  > >>>>>
>  > >>>>> --
>  >
>  > >>>>> "So you have to trust that the dots will somehow connect in your
> future."
>  > >>>>>
>  >
>  >
>
>

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by "Zhang, Xuefu" <xu...@alibaba-inc.com>.
Hi Xiaowei,

Thanks for bringing up the question. In the current design, the properties for meta objects are meant to cover anything that's specific to a particular catalog and agnostic to Flink. Anything that is common (such as the schema for tables, the query text for views, and the udf classname) is abstracted as a member of the respective class. However, this is still under discussion, and Timo and I will go over it and provide an update.
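A minimal sketch of this split (hypothetical names; the actual class definitions are in the design doc and still under discussion): common attributes such as the table schema are strongly typed members, while catalog-specific settings go into a free-form properties map.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical sketch of the typed-members vs. free-form-properties split:
# anything Flink itself understands is a typed field; anything specific to
# one catalog stays in an opaque string-to-string map.

@dataclass
class TableSchema:
    column_names: List[str]
    column_types: List[str]

@dataclass
class CatalogTable:
    name: str
    schema: TableSchema                                        # common: typed, known to Flink
    properties: Dict[str, str] = field(default_factory=dict)   # catalog-specific, opaque

t = CatalogTable(
    name="orders",
    schema=TableSchema(["id", "amount"], ["BIGINT", "DOUBLE"]),
    properties={"hive.storage.format": "ORC"},  # invented key, passed through untouched
)
```

The planner can then rely on the typed fields (e.g. the schema) for optimization, while each catalog implementation is free to interpret its own properties.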

Please note that UDF support is a little more involved than what the current design doc shows. I'm still refining this part.

Thanks,
Xuefu


------------------------------------------------------------------
Sender:Xiaowei Jiang <xi...@gmail.com>
Sent at:2018 Nov 18 (Sun) 15:17
Recipient:dev <de...@flink.apache.org>
Cc:Xuefu <xu...@alibaba-inc.com>; twalthr <tw...@apache.org>; piotr <pi...@data-artisans.com>; Fabian Hueske <fh...@gmail.com>; suez1224 <su...@gmail.com>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Thanks Xuefu for the detailed design doc! One question on the properties associated with the catalog objects. Are we going to leave them completely free form or we are going to set some standard for that? I think that the answer may depend on if we want to explore catalog specific optimization opportunities. In any case, I think that it might be helpful for standardize as much as possible into strongly typed classes and use leave these properties for catalog specific things. But I think that we can do it in steps.

Xiaowei
On Fri, Nov 16, 2018 at 4:00 AM Bowen Li <bo...@gmail.com> wrote:
Thanks for keeping on improving the overall design, Xuefu! It looks quite
 good to me now.

 Would be nice that cc-ed Flink committers can help to review and confirm!



 One minor suggestion: Since the last section of design doc already touches
 some new sql statements, shall we add another section in our doc and
 formalize the new sql statements in SQL Client and TableEnvironment that
 are gonna come along naturally with our design? Here are some that the
 design doc mentioned and some that I came up with:

 To be added:

    - USE <catalog> - set default catalog
    - USE <catalog.schema> - set default schema
    - SHOW CATALOGS - show all registered catalogs
    - SHOW SCHEMAS [FROM catalog] - list schemas in the current default
    catalog or the specified catalog
    - DESCRIBE VIEW view - show the view's definition in CatalogView
    - SHOW VIEWS [FROM schema/catalog.schema] - show views from current or a
    specified schema.

    (DDLs that can be addressed by either our design or Shuyi's DDL design)

    - CREATE/DROP/ALTER SCHEMA schema
    - CREATE/DROP/ALTER CATALOG catalog

 To be modified:

    - SHOW TABLES [FROM schema/catalog.schema] - show tables from current or
    a specified schema. Add 'from schema' to existing 'SHOW TABLES' statement
    - SHOW FUNCTIONS [FROM schema/catalog.schema] - show functions from
    current or a specified schema. Add 'from schema' to existing 'SHOW TABLES'
    statement'


 Thanks, Bowen



 On Wed, Nov 14, 2018 at 10:39 PM Zhang, Xuefu <xu...@alibaba-inc.com>
 wrote:

 > Thanks, Bowen, for catching the error. I have granted comment permission
 > with the link.
 >
 > I also updated the doc with the latest class definitions. Everyone is
 > encouraged to review and comment.
 >
 > Thanks,
 > Xuefu
 >
 > ------------------------------------------------------------------
 > Sender:Bowen Li <bo...@gmail.com>
 > Sent at:2018 Nov 14 (Wed) 06:44
 > Recipient:Xuefu <xu...@alibaba-inc.com>
 > Cc:piotr <pi...@data-artisans.com>; dev <de...@flink.apache.org>; Shuyi
 > Chen <su...@gmail.com>
 > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
 >
 > Hi Xuefu,
 >
 > Currently the new design doc
 > <https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit>
 > is on “view only" mode, and people cannot leave comments. Can you please
 > change it to "can comment" or "can edit" mode?
 >
 > Thanks, Bowen
 >
 >
 > On Mon, Nov 12, 2018 at 9:51 PM Zhang, Xuefu <xu...@alibaba-inc.com>
 > wrote:
 > Hi Piotr
 >
 > I have extracted the API portion of  the design and the google doc is here
 > <https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit?usp=sharing>.
 > Please review and provide your feedback.
 >
 > Thanks,
 > Xuefu
 >
 > ------------------------------------------------------------------
 > Sender:Xuefu <xu...@alibaba-inc.com>
 > Sent at:2018 Nov 12 (Mon) 12:43
 > Recipient:Piotr Nowojski <pi...@data-artisans.com>; dev <
 > dev@flink.apache.org>
 > Cc:Bowen Li <bo...@gmail.com>; Shuyi Chen <su...@gmail.com>
 > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
 >
 > Hi Piotr,
 >
 > That sounds good to me. Let's close all the open questions ((there are a
 > couple of them)) in the Google doc and I should be able to quickly split
 > it into the three proposals as you suggested.
 >
 > Thanks,
 > Xuefu
 >
 > ------------------------------------------------------------------
 > Sender:Piotr Nowojski <pi...@data-artisans.com>
 > Sent at:2018 Nov 9 (Fri) 22:46
 > Recipient:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>
 > Cc:Bowen Li <bo...@gmail.com>; Shuyi Chen <su...@gmail.com>
 > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
 >
 > Hi,
 >
 >
 > Yes, it seems like the best solution. Maybe someone else can also suggests if we can split it further? Maybe changes in the interface in one doc, reading from hive meta store another and final storing our meta informations in hive meta store?
 >
 > Piotrek
 >
 > > On 9 Nov 2018, at 01:44, Zhang, Xuefu <xu...@alibaba-inc.com> wrote:
 > >
 > > Hi Piotr,
 > >
 > > That seems to be good idea!
 > >
 >
 > > Since the google doc for the design is currently under extensive review, I will leave it as it is for now. However, I'll convert it to two different FLIPs when the time comes.
 > >
 > > How does it sound to you?
 > >
 > > Thanks,
 > > Xuefu
 > >
 > >
 > > ------------------------------------------------------------------
 > > Sender:Piotr Nowojski <pi...@data-artisans.com>
 > > Sent at:2018 Nov 9 (Fri) 02:31
 > > Recipient:dev <de...@flink.apache.org>
 > > Cc:Bowen Li <bo...@gmail.com>; Xuefu <xuefu.z@alibaba-inc.com
 > >; Shuyi Chen <su...@gmail.com>
 > > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
 > >
 > > Hi,
 > >
 >
 > > Maybe we should split this topic (and the design doc) into couple of smaller ones, hopefully independent. The questions that you have asked Fabian have for example very little to do with reading metadata from Hive Meta Store?
 > >
 > > Piotrek
 > >
 > >> On 7 Nov 2018, at 14:27, Fabian Hueske <fh...@gmail.com> wrote:
 > >>
 > >> Hi Xuefu and all,
 > >>
 > >> Thanks for sharing this design document!
 >
 > >> I'm very much in favor of restructuring / reworking the catalog handling in
 > >> Flink SQL as outlined in the document.
 >
 > >> Most changes described in the design document seem to be rather general and
 > >> not specifically related to the Hive integration.
 > >>
 >
 > >> IMO, there are some aspects, especially those at the boundary of Hive and
 > >> Flink, that need a bit more discussion. For example
 > >>
 > >> * What does it take to make Flink schema compatible with Hive schema?
 > >> * How will Flink tables (descriptors) be stored in HMS?
 > >> * How do both Hive catalogs differ? Could they be integrated into to a
 > >> single one? When to use which one?
 >
 > >> * What meta information is provided by HMS? What of this can be leveraged
 > >> by Flink?
 > >>
 > >> Thank you,
 > >> Fabian
 > >>
 > >> Am Fr., 2. Nov. 2018 um 00:31 Uhr schrieb Bowen Li <bowenli86@gmail.com
 > >:
 > >>
 > >>> After taking a look at how other discussion threads work, I think it's
 > >>> actually fine just keep our discussion here. It's up to you, Xuefu.
 > >>>
 > >>> The google doc LGTM. I left some minor comments.
 > >>>
 > >>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li <bo...@gmail.com> wrote:
 > >>>
 > >>>> Hi all,
 > >>>>
 > >>>> As Xuefu has published the design doc on google, I agree with Shuyi's
 >
 > >>>> suggestion that we probably should start a new email thread like "[DISCUSS]
 >
 > >>>> ... Hive integration design ..." on only dev mailing list for community
 > >>>> devs to review. The current thread sends to both dev and user list.
 > >>>>
 >
 > >>>> This email thread is more like validating the general idea and direction
 >
 > >>>> with the community, and it's been pretty long and crowded so far. Since
 >
 > >>>> everyone is pro for the idea, we can move forward with another thread to
 > >>>> discuss and finalize the design.
 > >>>>
 > >>>> Thanks,
 > >>>> Bowen
 > >>>>
 > >>>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu <
 > xuefu.z@alibaba-inc.com>
 > >>>> wrote:
 > >>>>
 > >>>>> Hi Shuiyi,
 > >>>>>
 >
 > >>>>> Good idea. Actually the PDF was converted from a google doc. Here is its
 > >>>>> link:
 > >>>>>
 > >>>>>
 > https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
 > >>>>> Once we reach an agreement, I can convert it to a FLIP.
 > >>>>>
 > >>>>> Thanks,
 > >>>>> Xuefu
 > >>>>>
 > >>>>>
 > >>>>>
 > >>>>> ------------------------------------------------------------------
 > >>>>> Sender:Shuyi Chen <su...@gmail.com>
 > >>>>> Sent at:2018 Nov 1 (Thu) 02:47
 > >>>>> Recipient:Xuefu <xu...@alibaba-inc.com>
 > >>>>> Cc:vino yang <ya...@gmail.com>; Fabian Hueske <
 > fhueske@gmail.com>;
 > >>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
 > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
 > >>>>>
 > >>>>> Hi Xuefu,
 > >>>>>
 >
 > >>>>> Thanks a lot for driving this big effort. I would suggest convert your
 >
 > >>>>> proposal and design doc into a google doc, and share it on the dev mailing
 >
 > >>>>> list for the community to review and comment with title like "[DISCUSS] ...
 >
 > >>>>> Hive integration design ..." . Once approved,  we can document it as a FLIP
 >
 > >>>>> (Flink Improvement Proposals), and use JIRAs to track the implementations.
 > >>>>> What do you think?
 > >>>>>
 > >>>>> Shuyi
 > >>>>>
 > >>>>> On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu <
 > xuefu.z@alibaba-inc.com>
 > >>>>> wrote:
 > >>>>> Hi all,
 > >>>>>
 > >>>>> I have also shared a design doc on Hive metastore integration that is
 >
 > >>>>> attached here and also to FLINK-10556[1]. Please kindly review and share
 > >>>>> your feedback.
 > >>>>>
 > >>>>>
 > >>>>> Thanks,
 > >>>>> Xuefu
 > >>>>>
 > >>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
 > >>>>> ------------------------------------------------------------------
 > >>>>> Sender:Xuefu <xu...@alibaba-inc.com>
 > >>>>> Sent at:2018 Oct 25 (Thu) 01:08
 > >>>>> Recipient:Xuefu <xu...@alibaba-inc.com>; Shuyi Chen <
 > >>>>> suez1224@gmail.com>
 > >>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <
 > fhueske@gmail.com>;
 > >>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
 > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
 > >>>>>
 > >>>>> Hi all,
 > >>>>>
 > >>>>> To wrap up the discussion, I have attached a PDF describing the
 >
 > >>>>> proposal, which is also attached to FLINK-10556 [1]. Please feel free to
 > >>>>> watch that JIRA to track the progress.
 > >>>>>
 > >>>>> Please also let me know if you have additional comments or questions.
 > >>>>>
 > >>>>> Thanks,
 > >>>>> Xuefu
 > >>>>>
 > >>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
 > >>>>>
 > >>>>>
 > >>>>> ------------------------------------------------------------------
 > >>>>> Sender:Xuefu <xu...@alibaba-inc.com>
 > >>>>> Sent at:2018 Oct 16 (Tue) 03:40
 > >>>>> Recipient:Shuyi Chen <su...@gmail.com>
 > >>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <
 > fhueske@gmail.com>;
 > >>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
 > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
 > >>>>>
 > >>>>> Hi Shuyi,
 > >>>>>
 >
 > >>>>> Thank you for your input. Yes, I agree with a phased approach and would
 > >>>>> like to move forward fast. :) We did some work internally on DDL utilizing
 > >>>>> the babel parser in Calcite. While babel makes Calcite's grammar extensible,
 > >>>>> at first impression it still seems too cumbersome for a project when too
 > >>>>> many extensions are made. It's even challenging to find where the extension
 > >>>>> is needed! It would certainly be better if Calcite could magically support
 > >>>>> Hive QL by just turning on a flag, such as that for MYSQL_5. I can also
 > >>>>> see that this could mean a lot of work on Calcite. Nevertheless, I will
 > >>>>> bring up the discussion over there to see what their community thinks.
 > >>>>>
 > >>>>> Would you mind sharing more info about the proposal on DDL that you
 > >>>>> mentioned? We can certainly collaborate on this.
 > >>>>>
 > >>>>> Thanks,
 > >>>>> Xuefu
 > >>>>>
 > >>>>> ------------------------------------------------------------------
 > >>>>> Sender:Shuyi Chen <su...@gmail.com>
 > >>>>> Sent at:2018 Oct 14 (Sun) 08:30
 > >>>>> Recipient:Xuefu <xu...@alibaba-inc.com>
 > >>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <
 > fhueske@gmail.com>;
 > >>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
 > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
 > >>>>>
 > >>>>> Welcome to the community and thanks for the great proposal, Xuefu! I
 > >>>>> think the proposal can be divided into 2 stages: making Flink support
 > >>>>> Hive features, and making Hive work with Flink. I agree with Timo on
 > >>>>> starting with a smaller scope, so we can make progress faster. As for [6],
 > >>>>> a proposal for DDL is already in progress, and will come after the unified
 > >>>>> SQL connector API is done. For supporting Hive syntax, we might need to
 > >>>>> work with the Calcite community, and a recent effort called babel (
 > >>>>> https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite might
 > >>>>> help here.
 > >>>>>
 > >>>>> Thanks
 > >>>>> Shuyi
 > >>>>>
 > >>>>> On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <
 > xuefu.z@alibaba-inc.com>
 > >>>>> wrote:
 > >>>>> Hi Fabian/Vno,
 > >>>>>
 >
 > >>>>> Thank you very much for your encouraging inquiry. Sorry that I didn't
 > >>>>> see Fabian's email until I read Vino's response just now. (Somehow Fabian's
 > >>>>> went to the spam folder.)
 > >>>>>
 >
 > >>>>> My proposal contains long-term and short-term goals. Nevertheless, the
 > >>>>> effort will focus on the following areas, including Fabian's list:
 > >>>>>
 > >>>>> 1. Hive metastore connectivity - This covers both read/write access,
 > >>>>> which means Flink can make full use of Hive's metastore as its catalog (at
 > >>>>> least for batch, but it can be extended for streaming as well).
 > >>>>> 2. Metadata compatibility - Objects (databases, tables, partitions, etc.)
 > >>>>> created by Hive can be understood by Flink, and the reverse direction is
 > >>>>> true as well.
 > >>>>> 3. Data compatibility - Similar to #2, data produced by Hive can be
 > >>>>> consumed by Flink and vice versa.
 > >>>>> 4. Support Hive UDFs - For all of Hive's native UDFs, Flink either provides
 > >>>>> its own implementation or makes Hive's implementation work in Flink.
 > >>>>> Further, for user-created UDFs in Hive, Flink SQL should provide a
 > >>>>> mechanism allowing users to import them into Flink without any code change
 > >>>>> required.
 > >>>>> 5. Data types - Flink SQL should support all data types that are
 > >>>>> available in Hive.
 > >>>>> 6. SQL Language - Flink SQL should support the SQL standard (such as
 > >>>>> SQL:2003) with extensions to support Hive's syntax and language features,
 > >>>>> around DDL, DML, and SELECT queries.
 > >>>>> 7. SQL CLI - this is currently under development in Flink, but more
 > >>>>> effort is needed.
 > >>>>> 8. Server - provide a server that's compatible with Hive's HiveServer2
 > >>>>> thrift APIs, such that HiveServer2 users can reuse their existing clients
 > >>>>> (such as beeline) but connect to Flink's thrift server instead.
 > >>>>> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for
 > >>>>> other applications to use to connect to its thrift server.
 > >>>>> 10. Support other user customizations in Hive, such as Hive SerDes,
 > >>>>> storage handlers, etc.
 > >>>>> 11. Better task failure tolerance and task scheduling in the Flink runtime.
 > >>>>>
 > >>>>> As you can see, achieving all of those requires significant effort
 > >>>>> across all layers of Flink. However, a short-term goal could include only
 > >>>>> core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope (such as
 > >>>>> #3, #6).
 > >>>>>
 >
 > >>>>> Please share your further thoughts. If we generally agree that this is
 >
 > >>>>> the right direction, I could come up with a formal proposal quickly and
 > >>>>> then we can follow up with broader discussions.
 > >>>>>
 > >>>>> Thanks,
 > >>>>> Xuefu
 > >>>>>
 > >>>>>
 > >>>>>
 > >>>>> ------------------------------------------------------------------
 > >>>>> Sender:vino yang <ya...@gmail.com>
 > >>>>> Sent at:2018 Oct 11 (Thu) 09:45
 > >>>>> Recipient:Fabian Hueske <fh...@gmail.com>
 > >>>>> Cc:dev <de...@flink.apache.org>; Xuefu <xuefu.z@alibaba-inc.com
 > >; user <
 > >>>>> user@flink.apache.org>
 > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
 > >>>>>
 > >>>>> Hi Xuefu,
 > >>>>>
 >
 > >>>>> I appreciate this proposal, and like Fabian, I think it would be better
 > >>>>> if you can give more details of the plan.
 > >>>>>
 > >>>>> Thanks, vino.
 > >>>>>
 > >>>>> Fabian Hueske <fh...@gmail.com> 于2018年10月10日周三 下午5:27写道:
 > >>>>> Hi Xuefu,
 > >>>>>
 >
 > >>>>> Welcome to the Flink community and thanks for starting this discussion!
 > >>>>> Better Hive integration would be really great!
 > >>>>> Can you go into details of what you are proposing? I can think of a
 > >>>>> couple ways to improve Flink in that regard:
 > >>>>>
 > >>>>> * Support for Hive UDFs
 > >>>>> * Support for Hive metadata catalog
 > >>>>> * Support for HiveQL syntax
 > >>>>> * ???
 > >>>>>
 > >>>>> Best, Fabian
 > >>>>>
 > >>>>> Am Di., 9. Okt. 2018 um 19:22 Uhr schrieb Zhang, Xuefu <
 > >>>>> xuefu.z@alibaba-inc.com>:
 > >>>>> Hi all,
 > >>>>>
 > >>>>> Along with the community's effort, inside Alibaba we have explored
 > >>>>> Flink's potential as an execution engine not just for stream processing but
 > >>>>> also for batch processing. We are encouraged by our findings and have
 > >>>>> initiated our effort to make Flink's SQL capabilities full-fledged. When
 > >>>>> comparing what's available in Flink to the offerings from competing data
 > >>>>> processing engines, we identified a major gap in Flink: good integration
 > >>>>> with the Hive ecosystem. This is crucial to the success of Flink SQL and
 > >>>>> batch due to the well-established data ecosystem around Hive. Therefore,
 > >>>>> we have done some initial work in this direction, but a lot of effort is
 > >>>>> still needed.
 > >>>>>
 > >>>>> We have two strategies in mind. The first one is to make Flink SQL
 > >>>>> full-fledged and well-integrated with the Hive ecosystem. This is a similar
 > >>>>> approach to what Spark SQL adopted. The second strategy is to make Hive
 > >>>>> itself work with Flink, similar to the proposal in [1]. Each approach bears
 > >>>>> its pros and cons, but they don't need to be mutually exclusive, with each
 > >>>>> targeting different users and use cases. We believe that both will
 > >>>>> promote a much greater adoption of Flink beyond stream processing.
 > >>>>>
 > >>>>> We have been focused on the first approach and would like to showcase
 >
 > >>>>> Flink's batch and SQL capabilities with Flink SQL. However, we have also
 > >>>>> planned to start strategy #2 as the follow-up effort.
 > >>>>>
 >
 > >>>>> I'm completely new to Flink (with a short bio [2] below), though many
 > >>>>> of my colleagues here at Alibaba are long-time contributors. Nevertheless,
 > >>>>> I'd like to share our thoughts and invite your early feedback. At the same
 > >>>>> time, I am working on a detailed proposal on Flink SQL's integration with
 > >>>>> the Hive ecosystem, which will also be shared when ready.
 > >>>>>
 > >>>>> While the ideas are simple, each approach will demand significant
 >
 > >>>>> effort, more than what we can afford. Thus, the input and contributions
 > >>>>> from the communities are greatly welcome and appreciated.
 > >>>>>
 > >>>>> Regards,
 > >>>>>
 > >>>>>
 > >>>>> Xuefu
 > >>>>>
 > >>>>> References:
 > >>>>>
 > >>>>> [1] https://issues.apache.org/jira/browse/HIVE-10712
 >
 > >>>>> [2] Xuefu Zhang is a long-time open source veteran who has worked on
 > >>>>> many projects under the Apache Foundation, of which he is also an honored
 > >>>>> member. About 10 years ago he worked on the Hadoop team at Yahoo, where the
 > >>>>> projects were just getting started. Later he worked at Cloudera, initiating
 > >>>>> and leading the development of the Hive on Spark project in the communities
 > >>>>> and across many organizations. Prior to joining Alibaba, he worked at Uber,
 > >>>>> where he promoted Hive on Spark for all of Uber's SQL on Hadoop workloads
 > >>>>> and significantly improved Uber's cluster efficiency.
 > >>>>>
 > >>>>>
 > >>>>>
 > >>>>>
 > >>>>> --
 >
 > >>>>> "So you have to trust that the dots will somehow connect in your future."
 > >>>>>
 > >>>>>
 > >>>>> --
 >
 > >>>>> "So you have to trust that the dots will somehow connect in your future."
 > >>>>>
 >
 >


Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by Xiaowei Jiang <xi...@gmail.com>.
Thanks Xuefu for the detailed design doc! One question on the properties
associated with the catalog objects: are we going to leave them completely
free-form, or are we going to set some standard for them? I think the answer
may depend on whether we want to explore catalog-specific optimization
opportunities. In any case, I think it might be helpful to standardize as
much as possible into strongly typed classes and leave these properties for
catalog-specific things. But I think that we can do it in steps.
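
To make the distinction concrete, the split between standardized, strongly
typed fields and a free-form property map could look like the following
minimal sketch. The class and field names here are illustrative assumptions,
not Flink's actual catalog API:

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a catalog table object: what every catalog must know is held
// in typed fields, while catalog-specific knobs live in a free-form map.
public final class CatalogTableSketch {
    private final String name;
    private final Map<String, String> schema;     // column name -> type string
    private final Map<String, String> properties; // catalog-specific, free form

    public CatalogTableSketch(String name, Map<String, String> schema,
                              Map<String, String> properties) {
        this.name = name;
        this.schema = Collections.unmodifiableMap(new LinkedHashMap<>(schema));
        this.properties = Collections.unmodifiableMap(new LinkedHashMap<>(properties));
    }

    public String getName() { return name; }
    public Map<String, String> getSchema() { return schema; }
    public Map<String, String> getProperties() { return properties; }

    public static void main(String[] args) {
        Map<String, String> schema = new LinkedHashMap<>();
        schema.put("user_id", "BIGINT");
        schema.put("event_time", "TIMESTAMP");

        Map<String, String> props = new LinkedHashMap<>();
        // Only catalog-specific settings go in the free-form map.
        props.put("hive.storage.format", "ORC");

        CatalogTableSketch t = new CatalogTableSketch("events", schema, props);
        System.out.println(t.getName() + " has " + t.getSchema().size()
                + " typed columns and " + t.getProperties().size()
                + " catalog-specific properties");
    }
}
```

The point of the sketch is that a planner can rely on the typed fields for
optimization, while each catalog keeps its own extras in the property map.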

Xiaowei

On Fri, Nov 16, 2018 at 4:00 AM Bowen Li <bo...@gmail.com> wrote:

> Thanks for continuing to improve the overall design, Xuefu! It looks quite
> good to me now.
>
> It would be nice if the cc-ed Flink committers could help review and confirm!
>
>
>
> One minor suggestion: Since the last section of the design doc already touches
> on some new SQL statements, shall we add another section to our doc and
> formalize the new SQL statements in SQL Client and TableEnvironment that
> will come along naturally with our design? Here are some that the
> design doc mentioned and some that I came up with:
>
> To be added:
>
>    - USE <catalog> - set default catalog
>    - USE <catalog.schema> - set default schema
>    - SHOW CATALOGS - show all registered catalogs
>    - SHOW SCHEMAS [FROM catalog] - list schemas in the current default
>    catalog or the specified catalog
>    - DESCRIBE VIEW view - show the view's definition in CatalogView
>    - SHOW VIEWS [FROM schema/catalog.schema] - show views from current or a
>    specified schema.
>
>    (DDLs that can be addressed by either our design or Shuyi's DDL design)
>
>    - CREATE/DROP/ALTER SCHEMA schema
>    - CREATE/DROP/ALTER CATALOG catalog
>
> To be modified:
>
>    - SHOW TABLES [FROM schema/catalog.schema] - show tables from the current
>    or a specified schema. Add 'FROM schema' to the existing 'SHOW TABLES'
>    statement
>    - SHOW FUNCTIONS [FROM schema/catalog.schema] - show functions from the
>    current or a specified schema. Add 'FROM schema' to the existing
>    'SHOW FUNCTIONS' statement
>
>
> Thanks, Bowen
>
>
>
> On Wed, Nov 14, 2018 at 10:39 PM Zhang, Xuefu <xu...@alibaba-inc.com>
> wrote:
>
> > Thanks, Bowen, for catching the error. I have granted comment permission
> > for the link.
> >
> > I also updated the doc with the latest class definitions. Everyone is
> > encouraged to review and comment.
> >
> > Thanks,
> > Xuefu
> >
> > ------------------------------------------------------------------
> > Sender:Bowen Li <bo...@gmail.com>
> > Sent at:2018 Nov 14 (Wed) 06:44
> > Recipient:Xuefu <xu...@alibaba-inc.com>
> > Cc:piotr <pi...@data-artisans.com>; dev <de...@flink.apache.org>; Shuyi
> > Chen <su...@gmail.com>
> > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> >
> > Hi Xuefu,
> >
> > Currently the new design doc
> > <
> https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit
> >
> > is on “view only" mode, and people cannot leave comments. Can you please
> > change it to "can comment" or "can edit" mode?
> >
> > Thanks, Bowen
> >
> >
> > On Mon, Nov 12, 2018 at 9:51 PM Zhang, Xuefu <xu...@alibaba-inc.com>
> > wrote:
> > Hi Piotr
> >
> > I have extracted the API portion of  the design and the google doc is
> here
> > <
> https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit?usp=sharing
> >.
> > Please review and provide your feedback.
> >
> > Thanks,
> > Xuefu
> >
> > ------------------------------------------------------------------
> > Sender:Xuefu <xu...@alibaba-inc.com>
> > Sent at:2018 Nov 12 (Mon) 12:43
> > Recipient:Piotr Nowojski <pi...@data-artisans.com>; dev <
> > dev@flink.apache.org>
> > Cc:Bowen Li <bo...@gmail.com>; Shuyi Chen <su...@gmail.com>
> > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> >
> > Hi Piotr,
> >
> > That sounds good to me. Let's close all the open questions (there are a
> > couple of them) in the Google doc, and I should be able to quickly split
> > it into the three proposals as you suggested.
> >
> > Thanks,
> > Xuefu
> >
> > ------------------------------------------------------------------
> > Sender:Piotr Nowojski <pi...@data-artisans.com>
> > Sent at:2018 Nov 9 (Fri) 22:46
> > Recipient:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>
> > Cc:Bowen Li <bo...@gmail.com>; Shuyi Chen <su...@gmail.com>
> > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> >
> > Hi,
> >
> >
> > Yes, it seems like the best solution. Maybe someone else can also
> suggest whether we can split it further? Maybe changes to the interface in one
> doc, reading from the Hive metastore in another, and finally storing our meta
> information in the Hive metastore?
> >
> > Piotrek
> >
> > > On 9 Nov 2018, at 01:44, Zhang, Xuefu <xu...@alibaba-inc.com> wrote:
> > >
> > > Hi Piotr,
> > >
> > > That seems to be good idea!
> > >
> >
> > > Since the google doc for the design is currently under extensive
> review, I will leave it as it is for now. However, I'll convert it to two
> different FLIPs when the time comes.
> > >
> > > How does it sound to you?
> > >
> > > Thanks,
> > > Xuefu
> > >
> > >
> > > ------------------------------------------------------------------
> > > Sender:Piotr Nowojski <pi...@data-artisans.com>
> > > Sent at:2018 Nov 9 (Fri) 02:31
> > > Recipient:dev <de...@flink.apache.org>
> > > Cc:Bowen Li <bo...@gmail.com>; Xuefu <xuefu.z@alibaba-inc.com
> > >; Shuyi Chen <su...@gmail.com>
> > > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> > >
> > > Hi,
> > >
> >
> > Maybe we should split this topic (and the design doc) into a couple of
> smaller ones, hopefully independent. The questions that you have asked
> Fabian, for example, have very little to do with reading metadata from the
> Hive Meta Store.
> > >
> > > Piotrek
> > >
> > >> On 7 Nov 2018, at 14:27, Fabian Hueske <fh...@gmail.com> wrote:
> > >>
> > >> Hi Xuefu and all,
> > >>
> > >> Thanks for sharing this design document!
> >
> > >> I'm very much in favor of restructuring / reworking the catalog
> handling in
> > >> Flink SQL as outlined in the document.
> >
> > >> Most changes described in the design document seem to be rather
> general and
> > >> not specifically related to the Hive integration.
> > >>
> >
> > >> IMO, there are some aspects, especially those at the boundary of Hive
> and
> > >> Flink, that need a bit more discussion. For example
> > >>
> > >> * What does it take to make Flink schema compatible with Hive schema?
> > >> * How will Flink tables (descriptors) be stored in HMS?
> > >> * How do the two Hive catalogs differ? Could they be integrated into a
> > >> single one? When to use which one?
> >
> > >> * What meta information is provided by HMS? What of this can be
> leveraged
> > >> by Flink?
> > >>
> > >> Thank you,
> > >> Fabian
> > >>
> > >> Am Fr., 2. Nov. 2018 um 00:31 Uhr schrieb Bowen Li <
> bowenli86@gmail.com
> > >:
> > >>
> > >>> After taking a look at how other discussion threads work, I think it's
> > >>> actually fine to just keep our discussion here. It's up to you, Xuefu.
> > >>>
> > >>> The google doc LGTM. I left some minor comments.
> > >>>
> > >>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li <bo...@gmail.com>
> wrote:
> > >>>
> > >>>> Hi all,
> > >>>>
> > >>>> As Xuefu has published the design doc on google, I agree with Shuyi's
> > >>>> suggestion that we should probably start a new email thread like
> > >>>> "[DISCUSS] ... Hive integration design ..." on the dev mailing list
> > >>>> only, for community devs to review. The current thread goes to both
> > >>>> the dev and user lists.
> > >>>>
> > >>>> This email thread is more about validating the general idea and
> > >>>> direction with the community, and it's been pretty long and crowded so
> > >>>> far. Since everyone is in favor of the idea, we can move forward with
> > >>>> another thread to discuss and finalize the design.
> > >>>>
> > >>>> Thanks,
> > >>>> Bowen
> > >>>>
> > >>>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu <
> > xuefu.z@alibaba-inc.com>
> > >>>> wrote:
> > >>>>
> > >>>>> Hi Shuiyi,
> > >>>>>
> >
> > >>>>> Good idea. Actually the PDF was converted from a google doc. Here
> is its
> > >>>>> link:
> > >>>>>
> > >>>>>
> >
> https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
> > >>>>> Once we reach an agreement, I can convert it to a FLIP.
> > >>>>>
> > >>>>> Thanks,
> > >>>>> Xuefu
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> ------------------------------------------------------------------
> > >>>>> Sender:Shuyi Chen <su...@gmail.com>
> > >>>>> Sent at:2018 Nov 1 (Thu) 02:47
> > >>>>> Recipient:Xuefu <xu...@alibaba-inc.com>
> > >>>>> Cc:vino yang <ya...@gmail.com>; Fabian Hueske <
> > fhueske@gmail.com>;
> > >>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
> > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> > >>>>>
> > >>>>> Hi Xuefu,
> > >>>>>
> >
> > >>>>> Thanks a lot for driving this big effort. I would suggest convert
> your
> >
> > >>>>> proposal and design doc into a google doc, and share it on the dev
> mailing
> >
> > >>>>> list for the community to review and comment with title like
> "[DISCUSS] ...
> >
> > >>>>> Hive integration design ..." . Once approved,  we can document it
> as a FLIP
> >
> > >>>>> (Flink Improvement Proposals), and use JIRAs to track the
> implementations.
> > >>>>> What do you think?
> > >>>>>
> > >>>>> Shuyi
> > >>>>>
> > >>>>> On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu <
> > xuefu.z@alibaba-inc.com>
> > >>>>> wrote:
> > >>>>> Hi all,
> > >>>>>
> > >>>>> I have also shared a design doc on Hive metastore integration that
> is
> >
> > >>>>> attached here and also to FLINK-10556[1]. Please kindly review and
> share
> > >>>>> your feedback.
> > >>>>>
> > >>>>>
> > >>>>> Thanks,
> > >>>>> Xuefu
> > >>>>>
> > >>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
> > >>>>> ------------------------------------------------------------------
> > >>>>> Sender:Xuefu <xu...@alibaba-inc.com>
> > >>>>> Sent at:2018 Oct 25 (Thu) 01:08
> > >>>>> Recipient:Xuefu <xu...@alibaba-inc.com>; Shuyi Chen <
> > >>>>> suez1224@gmail.com>
> > >>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <
> > fhueske@gmail.com>;
> > >>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
> > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> > >>>>>
> > >>>>> Hi all,
> > >>>>>
> > >>>>> To wrap up the discussion, I have attached a PDF describing the
> >
> > >>>>> proposal, which is also attached to FLINK-10556 [1]. Please feel
> free to
> > >>>>> watch that JIRA to track the progress.
> > >>>>>
> > >>>>> Please also let me know if you have additional comments or
> questions.
> > >>>>>
> > >>>>> Thanks,
> > >>>>> Xuefu
> > >>>>>
> > >>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
> > >>>>>
> > >>>>>
> > >>>>> ------------------------------------------------------------------
> > >>>>> Sender:Xuefu <xu...@alibaba-inc.com>
> > >>>>> Sent at:2018 Oct 16 (Tue) 03:40
> > >>>>> Recipient:Shuyi Chen <su...@gmail.com>
> > >>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <
> > fhueske@gmail.com>;
> > >>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
> > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> > >>>>>
> > >>>>> Hi Shuyi,
> > >>>>>
> >
> > >>>>> Thank you for your input. Yes, I agreed with a phased approach and
> like
> >
> > >>>>> to move forward fast. :) We did some work internally on DDL
> utilizing babel
> > >>>>> parser in Calcite. While babel makes Calcite's grammar extensible,
> at
> > >>>>> first impression it still seems too cumbersome for a project when
> too
> >
> > >>>>> much extensions are made. It's even challenging to find where the
> extension
> >
> > >>>>> is needed! It would be certainly better if Calcite can magically
> support
> >
> > >>>>> Hive QL by just turning on a flag, such as that for MYSQL_5. I can
> also
> >
> > >>>>> see that this could mean a lot of work on Calcite. Nevertheless, I
> will
> >
> > >>>>> bring up the discussion over there and to see what their community
> thinks.
> > >>>>>
> > >>>>> Would mind to share more info about the proposal on DDL that you
> > >>>>> mentioned? We can certainly collaborate on this.
> > >>>>>
> > >>>>> Thanks,
> > >>>>> Xuefu
> > >>>>>
> > >>>>> ------------------------------------------------------------------
> > >>>>> Sender:Shuyi Chen <su...@gmail.com>
> > >>>>> Sent at:2018 Oct 14 (Sun) 08:30
> > >>>>> Recipient:Xuefu <xu...@alibaba-inc.com>
> > >>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <
> > fhueske@gmail.com>;
> > >>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
> > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> > >>>>>
> > >>>>> Welcome to the community and thanks for the great proposal, Xuefu!
> I
> >
> > >>>>> think the proposal can be divided into 2 stages: making Flink to
> support
> >
> > >>>>> Hive features, and make Hive to work with Flink. I agreed with
> Timo that on
> >
> > >>>>> starting with a smaller scope, so we can make progress faster. As
> for [6],
> >
> > >>>>> a proposal for DDL is already in progress, and will come after the
> unified
> >
> > >>>>> SQL connector API is done. For supporting Hive syntax, we might
> need to
> > >>>>> work with the Calcite community, and a recent effort called babel (
> > >>>>> https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite
> might
> > >>>>> help here.
> > >>>>>
> > >>>>> Thanks
> > >>>>> Shuyi
> > >>>>>
> > >>>>> On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <
> > xuefu.z@alibaba-inc.com>
> > >>>>> wrote:
> > >>>>> Hi Fabian/Vno,
> > >>>>>
> >
> > >>>>> Thank you very much for your encouragement inquiry. Sorry that I
> didn't
> >
> > >>>>> see Fabian's email until I read Vino's response just now. (Somehow
> Fabian's
> > >>>>> went to the spam folder.)
> > >>>>>
> >
> > >>>>> My proposal contains long-term and short-terms goals.
> Nevertheless, the
> > >>>>> effort will focus on the following areas, including Fabian's list:
> > >>>>>
> > >>>>> 1. Hive metastore connectivity - This covers both read/write
> access,
> >
> > >>>>> which means Flink can make full use of Hive's metastore as its
> catalog (at
> > >>>>> least for the batch but can extend for streaming as well).
> >
> > >>>>> 2. Metadata compatibility - Objects (databases, tables,
> partitions, etc)
> >
> > >>>>> created by Hive can be understood by Flink and the reverse
> direction is
> > >>>>> true also.
> > >>>>> 3. Data compatibility - Similar to #2, data produced by Hive can be
> > >>>>> consumed by Flink and vise versa.
> >
> > >>>>> 4. Support Hive UDFs - For all Hive's native udfs, Flink either
> provides
> > >>>>> its own implementation or make Hive's implementation work in Flink.
> > >>>>> Further, for user created UDFs in Hive, Flink SQL should provide a
> >
> > >>>>> mechanism allowing user to import them into Flink without any code
> change
> > >>>>> required.
> > >>>>> 5. Data types -  Flink SQL should support all data types that are
> > >>>>> available in Hive.
> > >>>>> 6. SQL Language - Flink SQL should support SQL standard (such as
> >
> > >>>>> SQL2003) with extension to support Hive's syntax and language
> features,
> > >>>>> around DDL, DML, and SELECT queries.
> >
> > >>>>> 7.  SQL CLI - this is currently developing in Flink but more
> effort is
> > >>>>> needed.
> >
> > >>>>> 8. Server - provide a server that's compatible with Hive's
> HiverServer2
> >
> > >>>>> in thrift APIs, such that HiveServer2 users can reuse their
> existing client
> > >>>>> (such as beeline) but connect to Flink's thrift server instead.
> >
> > >>>>> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers
> for
> > >>>>> other application to use to connect to its thrift server
> > >>>>> 10. Support other user's customizations in Hive, such as Hive
> Serdes,
> > >>>>> storage handlers, etc.
> >
> > >>>>> 11. Better task failure tolerance and task scheduling at Flink
> runtime.
> > >>>>>
> > >>>>> As you can see, achieving all those requires significant effort and
> >
> > >>>>> across all layers in Flink. However, a short-term goal could
> include only
> >
> > >>>>> core areas (such as 1, 2, 4, 5, 6, 7) or start  at a smaller scope
> (such as
> > >>>>> #3, #6).
> > >>>>>
> >
> > >>>>> Please share your further thoughts. If we generally agree that
> this is
> >
> > >>>>> the right direction, I could come up with a formal proposal
> quickly and
> > >>>>> then we can follow up with broader discussions.
> > >>>>>
> > >>>>> Thanks,
> > >>>>> Xuefu
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> ------------------------------------------------------------------
> > >>>>> Sender:vino yang <ya...@gmail.com>
> > >>>>> Sent at:2018 Oct 11 (Thu) 09:45
> > >>>>> Recipient:Fabian Hueske <fh...@gmail.com>
> > >>>>> Cc:dev <de...@flink.apache.org>; Xuefu <xuefu.z@alibaba-inc.com
> > >; user <
> > >>>>> user@flink.apache.org>
> > >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> > >>>>>
> > >>>>> Hi Xuefu,
> > >>>>>
> >
> > >>>>> Appreciate this proposal, and like Fabian, it would look better if
> you
> > >>>>> can give more details of the plan.
> > >>>>>
> > >>>>> Thanks, vino.
> > >>>>>
> > >>>>> Fabian Hueske <fh...@gmail.com> 于2018年10月10日周三 下午5:27写道:
> > >>>>> Hi Xuefu,
> > >>>>>
> >
> > >>>>> Welcome to the Flink community and thanks for starting this
> discussion!
> > >>>>> Better Hive integration would be really great!
> > >>>>> Can you go into details of what you are proposing? I can think of a
> > >>>>> couple ways to improve Flink in that regard:
> > >>>>>
> > >>>>> * Support for Hive UDFs
> > >>>>> * Support for Hive metadata catalog
> > >>>>> * Support for HiveQL syntax
> > >>>>> * ???
> > >>>>>
> > >>>>> Best, Fabian
> > >>>>>
> > >>>>> Am Di., 9. Okt. 2018 um 19:22 Uhr schrieb Zhang, Xuefu <
> > >>>>> xuefu.z@alibaba-inc.com>:
> > >>>>> Hi all,
> > >>>>>
> > >>>>> Along with the community's effort, inside Alibaba we have explored
> >
> > >>>>> Flink's potential as an execution engine not just for stream
> processing but
> > >>>>> also for batch processing. We are encouraged by our findings and
> have
> >
> > >>>>> initiated our effort to make Flink's SQL capabilities
> full-fledged. When
> >
> > >>>>> comparing what's available in Flink to the offerings from
> competitive data
> >
> > >>>>> processing engines, we identified a major gap in Flink: a well
> integration
> >
> > >>>>> with Hive ecosystem. This is crucial to the success of Flink SQL
> and batch
> >
> > >>>>> due to the well-established data ecosystem around Hive. Therefore,
> we have
> >
> > >>>>> done some initial work along this direction but there are still a
> lot of
> > >>>>> effort needed.
> > >>>>>
> > >>>>> We have two strategies in mind. The first one is to make Flink SQL
> >
> > >>>>> full-fledged and well-integrated with Hive ecosystem. This is a
> similar
> >
> > >>>>> approach to what Spark SQL adopted. The second strategy is to make
> Hive
> >
> > >>>>> itself work with Flink, similar to the proposal in [1]. Each
> approach bears
> >
> > >>>>> its pros and cons, but they don’t need to be mutually exclusive
> with each
> > >>>>> targeting at different users and use cases. We believe that both
> will
> > >>>>> promote a much greater adoption of Flink beyond stream processing.
> > >>>>>
> > >>>>> We have been focused on the first approach and would like to
> showcase
> >
> > >>>>> Flink's batch and SQL capabilities with Flink SQL. However, we
> have also
> > >>>>> planned to start strategy #2 as the follow-up effort.
> > >>>>>
> >
> > >>>>> I'm completely new to Flink(, with a short bio [2] below), though
> many
> >
> > >>>>> of my colleagues here at Alibaba are long-time contributors.
> Nevertheless,
> >
> > >>>>> I'd like to share our thoughts and invite your early feedback. At
> the same
> >
> > >>>>> time, I am working on a detailed proposal on Flink SQL's
> integration with
> > >>>>> Hive ecosystem, which will be also shared when ready.
> > >>>>>
> > >>>>> While the ideas are simple, each approach will demand significant
> >
> > >>>>> effort, more than what we can afford. Thus, the input and
> contributions
> > >>>>> from the communities are greatly welcome and appreciated.
> > >>>>>
> > >>>>> Regards,
> > >>>>>
> > >>>>>
> > >>>>> Xuefu
> > >>>>>
> > >>>>> References:
> > >>>>>
> > >>>>> [1] https://issues.apache.org/jira/browse/HIVE-10712
> >
> > >>>>> [2] Xuefu Zhang is a long-time open source veteran, worked or
> working on
> > >>>>> many projects under Apache Foundation, of which he is also an
> honored
> >
> > >>>>> member. About 10 years ago he worked in the Hadoop team at Yahoo
> where the
> >
> > >>>>> projects just got started. Later he worked at Cloudera, initiating
> and
> >
> > >>>>> leading the development of Hive on Spark project in the
> communities and
> >
> > >>>>> across many organizations. Prior to joining Alibaba, he worked at
> Uber
> >
> > >>>>> where he promoted Hive on Spark to all Uber's SQL on Hadoop
> workload and
> > >>>>> significantly improved Uber's cluster efficiency.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> --
> >
> > >>>>> "So you have to trust that the dots will somehow connect in your
> future."
> > >>>>>
> > >>>>>
> > >>>>> --
> >
> > >>>>> "So you have to trust that the dots will somehow connect in your
> future."
> > >>>>>
> >
> >
>

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by Bowen Li <bo...@gmail.com>.
Thanks for continuing to improve the overall design, Xuefu! It looks quite
good to me now.

It would be nice if the cc-ed Flink committers could help review and confirm!



One minor suggestion: Since the last section of the design doc already touches
on some new SQL statements, shall we add another section to our doc and
formalize the new SQL statements in SQL Client and TableEnvironment that
will naturally come along with our design? Here are some that the
design doc mentioned and some that I came up with:

To be added:

   - USE <catalog> - set default catalog
   - USE <catalog.schema> - set default schema
   - SHOW CATALOGS - show all registered catalogs
   - SHOW SCHEMAS [FROM catalog] - list schemas in the current default
   catalog or the specified catalog
   - DESCRIBE VIEW view - show the view's definition in CatalogView
   - SHOW VIEWS [FROM schema/catalog.schema] - show views from current or a
   specified schema.

   (DDLs that can be addressed by either our design or Shuyi's DDL design)

   - CREATE/DROP/ALTER SCHEMA schema
   - CREATE/DROP/ALTER CATALOG catalog

To be modified:

   - SHOW TABLES [FROM schema/catalog.schema] - show tables from current or
   a specified schema. Add 'from schema' to existing 'SHOW TABLES' statement
   - SHOW FUNCTIONS [FROM schema/catalog.schema] - show functions from
   current or a specified schema. Add 'from schema' to existing 'SHOW
   FUNCTIONS' statement
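
The USE statements above imply a catalog/schema defaulting rule whenever a
table name is resolved. A minimal sketch of that resolution logic (in Python,
purely illustrative: the default names and the exact precedence are my
assumptions, not the proposal's final design):

```python
def resolve(name, default_catalog="builtin", default_schema="default"):
    """Resolve a possibly-partial table name against the USE'd defaults.

    Illustrative only; 'builtin'/'default' are assumed placeholder names.
      'tbl'         -> (default_catalog, default_schema, 'tbl')
      'sch.tbl'     -> (default_catalog, 'sch', 'tbl')
      'cat.sch.tbl' -> ('cat', 'sch', 'tbl')
    """
    parts = name.split(".")
    if len(parts) == 1:
        return (default_catalog, default_schema, parts[0])
    if len(parts) == 2:
        return (default_catalog, parts[0], parts[1])
    if len(parts) == 3:
        return (parts[0], parts[1], parts[2])
    raise ValueError("invalid table path: " + name)
```

A `USE catalog` statement would then simply update `default_catalog` (and
`USE catalog.schema` both defaults) for subsequent resolutions.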


Thanks, Bowen



On Wed, Nov 14, 2018 at 10:39 PM Zhang, Xuefu <xu...@alibaba-inc.com>
wrote:

> Thanks, Bowen, for catching the error. I have granted comment permission
> with the link.
>
> I also updated the doc with the latest class definitions. Everyone is
> encouraged to review and comment.
>
> Thanks,
> Xuefu
>
> ------------------------------------------------------------------
> Sender:Bowen Li <bo...@gmail.com>
> Sent at:2018 Nov 14 (Wed) 06:44
> Recipient:Xuefu <xu...@alibaba-inc.com>
> Cc:piotr <pi...@data-artisans.com>; dev <de...@flink.apache.org>; Shuyi
> Chen <su...@gmail.com>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi Xuefu,
>
> Currently the new design doc
> <https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit>
> is on “view only" mode, and people cannot leave comments. Can you please
> change it to "can comment" or "can edit" mode?
>
> Thanks, Bowen
>
>
> On Mon, Nov 12, 2018 at 9:51 PM Zhang, Xuefu <xu...@alibaba-inc.com>
> wrote:
> Hi Piotr
>
> I have extracted the API portion of  the design and the google doc is here
> <https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit?usp=sharing>.
> Please review and provide your feedback.
>
> Thanks,
> Xuefu
>
> ------------------------------------------------------------------
> Sender:Xuefu <xu...@alibaba-inc.com>
> Sent at:2018 Nov 12 (Mon) 12:43
> Recipient:Piotr Nowojski <pi...@data-artisans.com>; dev <
> dev@flink.apache.org>
> Cc:Bowen Li <bo...@gmail.com>; Shuyi Chen <su...@gmail.com>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi Piotr,
>
> That sounds good to me. Let's close all the open questions (there are a
> couple of them) in the Google doc, and I should be able to quickly split
> it into the three proposals as you suggested.
>
> Thanks,
> Xuefu
>
> ------------------------------------------------------------------
> Sender:Piotr Nowojski <pi...@data-artisans.com>
> Sent at:2018 Nov 9 (Fri) 22:46
> Recipient:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>
> Cc:Bowen Li <bo...@gmail.com>; Shuyi Chen <su...@gmail.com>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi,
>
>
> Yes, it seems like the best solution. Maybe someone else can also suggest whether we can split it further? Maybe changes in the interfaces in one doc, reading from the Hive metastore in another, and finally storing our meta information in the Hive metastore?
>
> Piotrek
>
> > On 9 Nov 2018, at 01:44, Zhang, Xuefu <xu...@alibaba-inc.com> wrote:
> >
> > Hi Piotr,
> >
> > That seems to be good idea!
> >
>
> > Since the google doc for the design is currently under extensive review, I will leave it as it is for now. However, I'll convert it to two different FLIPs when the time comes.
> >
> > How does it sound to you?
> >
> > Thanks,
> > Xuefu
> >
> >
> > ------------------------------------------------------------------
> > Sender:Piotr Nowojski <pi...@data-artisans.com>
> > Sent at:2018 Nov 9 (Fri) 02:31
> > Recipient:dev <de...@flink.apache.org>
> > Cc:Bowen Li <bo...@gmail.com>; Xuefu <xuefu.z@alibaba-inc.com>; Shuyi Chen <su...@gmail.com>
> > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> >
> > Hi,
> >
>
> > Maybe we should split this topic (and the design doc) into couple of smaller ones, hopefully independent. The questions that you have asked Fabian have for example very little to do with reading metadata from Hive Meta Store?
> >
> > Piotrek
> >
> >> On 7 Nov 2018, at 14:27, Fabian Hueske <fh...@gmail.com> wrote:
> >>
> >> Hi Xuefu and all,
> >>
> >> Thanks for sharing this design document!
>
> >> I'm very much in favor of restructuring / reworking the catalog handling in
> >> Flink SQL as outlined in the document.
>
> >> Most changes described in the design document seem to be rather general and
> >> not specifically related to the Hive integration.
> >>
>
> >> IMO, there are some aspects, especially those at the boundary of Hive and
> >> Flink, that need a bit more discussion. For example
> >>
> >> * What does it take to make Flink schema compatible with Hive schema?
> >> * How will Flink tables (descriptors) be stored in HMS?
> >> * How do both Hive catalogs differ? Could they be integrated into to a
> >> single one? When to use which one?
>
> >> * What meta information is provided by HMS? What of this can be leveraged
> >> by Flink?
> >>
> >> Thank you,
> >> Fabian
> >>
> >> Am Fr., 2. Nov. 2018 um 00:31 Uhr schrieb Bowen Li <bowenli86@gmail.com>:
> >>
> >>> After taking a look at how other discussion threads work, I think it's
> >>> actually fine just keep our discussion here. It's up to you, Xuefu.
> >>>
> >>> The google doc LGTM. I left some minor comments.
> >>>
> >>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li <bo...@gmail.com> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> As Xuefu has published the design doc on google, I agree with Shuyi's
>
> >>>> suggestion that we probably should start a new email thread like "[DISCUSS]
>
> >>>> ... Hive integration design ..." on only dev mailing list for community
> >>>> devs to review. The current thread sends to both dev and user list.
> >>>>
>
> >>>> This email thread is more like validating the general idea and direction
>
> >>>> with the community, and it's been pretty long and crowded so far. Since
>
> >>>> everyone is pro for the idea, we can move forward with another thread to
> >>>> discuss and finalize the design.
> >>>>
> >>>> Thanks,
> >>>> Bowen
> >>>>
> >>>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu <xuefu.z@alibaba-inc.com>
> >>>> wrote:
> >>>>
> >>>>> Hi Shuiyi,
> >>>>>
>
> >>>>> Good idea. Actually the PDF was converted from a google doc. Here is its
> >>>>> link:
> >>>>>
> >>>>> https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
> >>>>> Once we reach an agreement, I can convert it to a FLIP.
> >>>>>
> >>>>> Thanks,
> >>>>> Xuefu
> >>>>>
> >>>>>
> >>>>>
> >>>>> ------------------------------------------------------------------
> >>>>> Sender:Shuyi Chen <su...@gmail.com>
> >>>>> Sent at:2018 Nov 1 (Thu) 02:47
> >>>>> Recipient:Xuefu <xu...@alibaba-inc.com>
> >>>>> Cc:vino yang <ya...@gmail.com>; Fabian Hueske <fhueske@gmail.com>;
> >>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
> >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> >>>>>
> >>>>> Hi Xuefu,
> >>>>>
>
> >>>>> Thanks a lot for driving this big effort. I would suggest convert your
>
> >>>>> proposal and design doc into a google doc, and share it on the dev mailing
>
> >>>>> list for the community to review and comment with title like "[DISCUSS] ...
>
> >>>>> Hive integration design ..." . Once approved,  we can document it as a FLIP
>
> >>>>> (Flink Improvement Proposals), and use JIRAs to track the implementations.
> >>>>> What do you think?
> >>>>>
> >>>>> Shuyi
> >>>>>
> >>>>> On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu <xuefu.z@alibaba-inc.com>
> >>>>> wrote:
> >>>>> Hi all,
> >>>>>
> >>>>> I have also shared a design doc on Hive metastore integration that is
>
> >>>>> attached here and also to FLINK-10556[1]. Please kindly review and share
> >>>>> your feedback.
> >>>>>
> >>>>>
> >>>>> Thanks,
> >>>>> Xuefu
> >>>>>
> >>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
> >>>>> ------------------------------------------------------------------
> >>>>> Sender:Xuefu <xu...@alibaba-inc.com>
> >>>>> Sent at:2018 Oct 25 (Thu) 01:08
> >>>>> Recipient:Xuefu <xu...@alibaba-inc.com>; Shuyi Chen <
> >>>>> suez1224@gmail.com>
> >>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fhueske@gmail.com>;
> >>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
> >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> >>>>>
> >>>>> Hi all,
> >>>>>
> >>>>> To wrap up the discussion, I have attached a PDF describing the
>
> >>>>> proposal, which is also attached to FLINK-10556 [1]. Please feel free to
> >>>>> watch that JIRA to track the progress.
> >>>>>
> >>>>> Please also let me know if you have additional comments or questions.
> >>>>>
> >>>>> Thanks,
> >>>>> Xuefu
> >>>>>
> >>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
> >>>>>
> >>>>>
> >>>>> ------------------------------------------------------------------
> >>>>> Sender:Xuefu <xu...@alibaba-inc.com>
> >>>>> Sent at:2018 Oct 16 (Tue) 03:40
> >>>>> Recipient:Shuyi Chen <su...@gmail.com>
> >>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fhueske@gmail.com>;
> >>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
> >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> >>>>>
> >>>>> Hi Shuyi,
> >>>>>
>
> >>>>> Thank you for your input. Yes, I agree with a phased approach and would
> >>>>> like to move forward fast. :) We did some work internally on DDL utilizing the babel
> >>>>> parser in Calcite. While babel makes Calcite's grammar extensible, at
> >>>>> first impression it still seems too cumbersome for a project when too
> >>>>> many extensions are made. It's even challenging to find where the extension
>
> >>>>> is needed! It would certainly be better if Calcite could magically support
> >>>>> Hive QL by just turning on a flag, such as that for MYSQL_5. I can also
> >>>>> see that this could mean a lot of work on Calcite. Nevertheless, I will
> >>>>> bring up the discussion over there and see what their community thinks.
> >>>>>
> >>>>> Would you mind sharing more info about the DDL proposal that you
> >>>>> mentioned? We can certainly collaborate on this.
> >>>>>
> >>>>> Thanks,
> >>>>> Xuefu
> >>>>>
> >>>>> ------------------------------------------------------------------
> >>>>> Sender:Shuyi Chen <su...@gmail.com>
> >>>>> Sent at:2018 Oct 14 (Sun) 08:30
> >>>>> Recipient:Xuefu <xu...@alibaba-inc.com>
> >>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fhueske@gmail.com>;
> >>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
> >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> >>>>>
> >>>>> Welcome to the community and thanks for the great proposal, Xuefu! I
>
> >>>>> think the proposal can be divided into 2 stages: making Flink support
> >>>>> Hive features, and making Hive work with Flink. I agree with Timo on
> >>>>> starting with a smaller scope, so we can make progress faster. As for [6],
>
> >>>>> a proposal for DDL is already in progress, and will come after the unified
>
> >>>>> SQL connector API is done. For supporting Hive syntax, we might need to
> >>>>> work with the Calcite community, and a recent effort called babel (
> >>>>> https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite might
> >>>>> help here.
> >>>>>
> >>>>> Thanks
> >>>>> Shuyi
> >>>>>
> >>>>> On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <xuefu.z@alibaba-inc.com>
> >>>>> wrote:
> >>>>> Hi Fabian/Vno,
> >>>>>
>
> >>>>> Thank you very much for your encouragement and inquiry. Sorry that I didn't
>
> >>>>> see Fabian's email until I read Vino's response just now. (Somehow Fabian's
> >>>>> went to the spam folder.)
> >>>>>
>
> >>>>> My proposal contains long-term and short-terms goals. Nevertheless, the
> >>>>> effort will focus on the following areas, including Fabian's list:
> >>>>>
> >>>>> 1. Hive metastore connectivity - This covers both read/write access,
>
> >>>>> which means Flink can make full use of Hive's metastore as its catalog (at
> >>>>> least for the batch but can extend for streaming as well).
>
> >>>>> 2. Metadata compatibility - Objects (databases, tables, partitions, etc)
>
> >>>>> created by Hive can be understood by Flink and the reverse direction is
> >>>>> true also.
> >>>>> 3. Data compatibility - Similar to #2, data produced by Hive can be
> >>>>> consumed by Flink and vice versa.
>
> >>>>> 4. Support Hive UDFs - For all Hive's native UDFs, Flink either provides
> >>>>> its own implementation or makes Hive's implementation work in Flink.
> >>>>> Further, for user-created UDFs in Hive, Flink SQL should provide a
> >>>>> mechanism allowing users to import them into Flink without any code change
> >>>>> required.
> >>>>> 5. Data types -  Flink SQL should support all data types that are
> >>>>> available in Hive.
> >>>>> 6. SQL Language - Flink SQL should support SQL standard (such as
>
> >>>>> SQL2003) with extension to support Hive's syntax and language features,
> >>>>> around DDL, DML, and SELECT queries.
>
> >>>>> 7.  SQL CLI - this is currently developing in Flink but more effort is
> >>>>> needed.
>
> >>>>> 8. Server - provide a server that's compatible with Hive's HiveServer2
>
> >>>>> in thrift APIs, such that HiveServer2 users can reuse their existing client
> >>>>> (such as beeline) but connect to Flink's thrift server instead.
>
> >>>>> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for
> >>>>> other application to use to connect to its thrift server
> >>>>> 10. Support other user's customizations in Hive, such as Hive Serdes,
> >>>>> storage handlers, etc.
>
> >>>>> 11. Better task failure tolerance and task scheduling at Flink runtime.
> >>>>>
> >>>>> As you can see, achieving all those requires significant effort and
>
> >>>>> across all layers in Flink. However, a short-term goal could  include only
>
> >>>>> core areas (such as 1, 2, 4, 5, 6, 7) or start  at a smaller scope (such as
> >>>>> #3, #6).
> >>>>>
>
> >>>>> Please share your further thoughts. If we generally agree that this is
>
> >>>>> the right direction, I could come up with a formal proposal quickly and
> >>>>> then we can follow up with broader discussions.
> >>>>>
> >>>>> Thanks,
> >>>>> Xuefu
> >>>>>
> >>>>>
> >>>>>
> >>>>> ------------------------------------------------------------------
> >>>>> Sender:vino yang <ya...@gmail.com>
> >>>>> Sent at:2018 Oct 11 (Thu) 09:45
> >>>>> Recipient:Fabian Hueske <fh...@gmail.com>
> >>>>> Cc:dev <de...@flink.apache.org>; Xuefu <xuefu.z@alibaba-inc.com>; user <
> >>>>> user@flink.apache.org>
> >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> >>>>>
> >>>>> Hi Xuefu,
> >>>>>
>
> >>>>> I appreciate this proposal, and like Fabian, I think it would be better if
> >>>>> you could give more details of the plan.
> >>>>>
> >>>>> Thanks, vino.
> >>>>>
> >>>>> Fabian Hueske <fh...@gmail.com> 于2018年10月10日周三 下午5:27写道:
> >>>>> Hi Xuefu,
> >>>>>
>
> >>>>> Welcome to the Flink community and thanks for starting this discussion!
> >>>>> Better Hive integration would be really great!
> >>>>> Can you go into details of what you are proposing? I can think of a
> >>>>> couple ways to improve Flink in that regard:
> >>>>>
> >>>>> * Support for Hive UDFs
> >>>>> * Support for Hive metadata catalog
> >>>>> * Support for HiveQL syntax
> >>>>> * ???
> >>>>>
> >>>>> Best, Fabian
> >>>>>
> >>>>> Am Di., 9. Okt. 2018 um 19:22 Uhr schrieb Zhang, Xuefu <
> >>>>> xuefu.z@alibaba-inc.com>:
> >>>>> Hi all,
> >>>>>
> >>>>> Along with the community's effort, inside Alibaba we have explored
>
> >>>>> Flink's potential as an execution engine not just for stream processing but
> >>>>> also for batch processing. We are encouraged by our findings and have
>
> >>>>> initiated our effort to make Flink's SQL capabilities full-fledged. When
>
> >>>>> comparing what's available in Flink to the offerings from competitive data
>
> >>>>> processing engines, we identified a major gap in Flink: a well integration
>
> >>>>> with Hive ecosystem. This is crucial to the success of Flink SQL and batch
>
> >>>>> due to the well-established data ecosystem around Hive. Therefore, we have
>
> >>>>> done some initial work along this direction but there are still a lot of
> >>>>> effort needed.
> >>>>>
> >>>>> We have two strategies in mind. The first one is to make Flink SQL
>
> >>>>> full-fledged and well-integrated with Hive ecosystem. This is a similar
>
> >>>>> approach to what Spark SQL adopted. The second strategy is to make Hive
>
> >>>>> itself work with Flink, similar to the proposal in [1]. Each approach bears
>
> >>>>> its pros and cons, but they don’t need to be mutually exclusive with each
> >>>>> targeting at different users and use cases. We believe that both will
> >>>>> promote a much greater adoption of Flink beyond stream processing.
> >>>>>
> >>>>> We have been focused on the first approach and would like to showcase
>
> >>>>> Flink's batch and SQL capabilities with Flink SQL. However, we have also
> >>>>> planned to start strategy #2 as the follow-up effort.
> >>>>>
>
> >>>>> I'm completely new to Flink(, with a short bio [2] below), though many
>
> >>>>> of my colleagues here at Alibaba are long-time contributors. Nevertheless,
>
> >>>>> I'd like to share our thoughts and invite your early feedback. At the same
>
> >>>>> time, I am working on a detailed proposal on Flink SQL's integration with
> >>>>> Hive ecosystem, which will be also shared when ready.
> >>>>>
> >>>>> While the ideas are simple, each approach will demand significant
>
> >>>>> effort, more than what we can afford. Thus, the input and contributions
> >>>>> from the communities are greatly welcome and appreciated.
> >>>>>
> >>>>> Regards,
> >>>>>
> >>>>>
> >>>>> Xuefu
> >>>>>
> >>>>> References:
> >>>>>
> >>>>> [1] https://issues.apache.org/jira/browse/HIVE-10712
>
> >>>>> [2] Xuefu Zhang is a long-time open source veteran, worked or working on
> >>>>> many projects under Apache Foundation, of which he is also an honored
>
> >>>>> member. About 10 years ago he worked in the Hadoop team at Yahoo where the
>
> >>>>> projects just got started. Later he worked at Cloudera, initiating and
>
> >>>>> leading the development of Hive on Spark project in the communities and
>
> >>>>> across many organizations. Prior to joining Alibaba, he worked at Uber
>
> >>>>> where he promoted Hive on Spark to all Uber's SQL on Hadoop workload and
> >>>>> significantly improved Uber's cluster efficiency.
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
>
> >>>>> "So you have to trust that the dots will somehow connect in your future."
> >>>>>
> >>>>>
> >>>>> --
>
> >>>>> "So you have to trust that the dots will somehow connect in your future."
> >>>>>
>
>

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by "Zhang, Xuefu" <xu...@alibaba-inc.com>.
Thanks, Bowen, for catching the error. I have granted comment permission with the link.

I also updated the doc with the latest class definitions. Everyone is encouraged to review and comment.

Thanks,
Xuefu


------------------------------------------------------------------
Sender:Bowen Li <bo...@gmail.com>
Sent at:2018 Nov 14 (Wed) 06:44
Recipient:Xuefu <xu...@alibaba-inc.com>
Cc:piotr <pi...@data-artisans.com>; dev <de...@flink.apache.org>; Shuyi Chen <su...@gmail.com>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Xuefu,

Currently the new design doc is on “view only" mode, and people cannot leave comments. Can you please change it to "can comment" or "can edit" mode?

Thanks, Bowen


On Mon, Nov 12, 2018 at 9:51 PM Zhang, Xuefu <xu...@alibaba-inc.com> wrote:

Hi Piotr

I have extracted the API portion of  the design and the google doc is here. Please review and provide your feedback.

Thanks,
Xuefu

------------------------------------------------------------------
Sender:Xuefu <xu...@alibaba-inc.com>
Sent at:2018 Nov 12 (Mon) 12:43
Recipient:Piotr Nowojski <pi...@data-artisans.com>; dev <de...@flink.apache.org>
Cc:Bowen Li <bo...@gmail.com>; Shuyi Chen <su...@gmail.com>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Piotr,

That sounds good to me. Let's close all the open questions (there are a couple of them) in the Google doc, and I should be able to quickly split it into the three proposals as you suggested.

Thanks,
Xuefu

------------------------------------------------------------------
Sender:Piotr Nowojski <pi...@data-artisans.com>
Sent at:2018 Nov 9 (Fri) 22:46
Recipient:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>
Cc:Bowen Li <bo...@gmail.com>; Shuyi Chen <su...@gmail.com>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi,

Yes, it seems like the best solution. Maybe someone else can also suggest whether we can split it further? Maybe changes in the interfaces in one doc, reading from the Hive metastore in another, and finally storing our meta information in the Hive metastore?

Piotrek

> On 9 Nov 2018, at 01:44, Zhang, Xuefu <xu...@alibaba-inc.com> wrote:
> 
> Hi Piotr,
> 
> That seems to be good idea!
> 
> Since the google doc for the design is currently under extensive review, I will leave it as it is for now. However, I'll convert it to two different FLIPs when the time comes.
> 
> How does it sound to you?
> 
> Thanks,
> Xuefu
> 
> 
> ------------------------------------------------------------------
> Sender:Piotr Nowojski <pi...@data-artisans.com>
> Sent at:2018 Nov 9 (Fri) 02:31
> Recipient:dev <de...@flink.apache.org>
> Cc:Bowen Li <bo...@gmail.com>; Xuefu <xu...@alibaba-inc.com>; Shuyi Chen <su...@gmail.com>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> 
> Hi,
> 
> Maybe we should split this topic (and the design doc) into couple of smaller ones, hopefully independent. The questions that you have asked Fabian have for example very little to do with reading metadata from Hive Meta Store?
> 
> Piotrek 
> 
>> On 7 Nov 2018, at 14:27, Fabian Hueske <fh...@gmail.com> wrote:
>> 
>> Hi Xuefu and all,
>> 
>> Thanks for sharing this design document!
>> I'm very much in favor of restructuring / reworking the catalog handling in
>> Flink SQL as outlined in the document.
>> Most changes described in the design document seem to be rather general and
>> not specifically related to the Hive integration.
>> 
>> IMO, there are some aspects, especially those at the boundary of Hive and
>> Flink, that need a bit more discussion. For example
>> 
>> * What does it take to make Flink schema compatible with Hive schema?
>> * How will Flink tables (descriptors) be stored in HMS?
>> * How do both Hive catalogs differ? Could they be integrated into to a
>> single one? When to use which one?
>> * What meta information is provided by HMS? What of this can be leveraged
>> by Flink?
>> 
>> Thank you,
>> Fabian
>> 
>> Am Fr., 2. Nov. 2018 um 00:31 Uhr schrieb Bowen Li <bo...@gmail.com>:
>> 
>>> After taking a look at how other discussion threads work, I think it's
>>> actually fine just keep our discussion here. It's up to you, Xuefu.
>>> 
>>> The google doc LGTM. I left some minor comments.
>>> 
>>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li <bo...@gmail.com> wrote:
>>> 
>>>> Hi all,
>>>> 
>>>> As Xuefu has published the design doc on google, I agree with Shuyi's
>>>> suggestion that we probably should start a new email thread like "[DISCUSS]
>>>> ... Hive integration design ..." on only dev mailing list for community
>>>> devs to review. The current thread sends to both dev and user list.
>>>> 
>>>> This email thread is more like validating the general idea and direction
>>>> with the community, and it's been pretty long and crowded so far. Since
>>>> everyone is pro for the idea, we can move forward with another thread to
>>>> discuss and finalize the design.
>>>> 
>>>> Thanks,
>>>> Bowen
>>>> 
>>>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu <xu...@alibaba-inc.com>
>>>> wrote:
>>>> 
>>>>> Hi Shuiyi,
>>>>> 
>>>>> Good idea. Actually the PDF was converted from a google doc. Here is its
>>>>> link:
>>>>> 
>>>>> https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
>>>>> Once we reach an agreement, I can convert it to a FLIP.
>>>>> 
>>>>> Thanks,
>>>>> Xuefu
>>>>> 
>>>>> 
>>>>> 
>>>>> ------------------------------------------------------------------
>>>>> Sender:Shuyi Chen <su...@gmail.com>
>>>>> Sent at:2018 Nov 1 (Thu) 02:47
>>>>> Recipient:Xuefu <xu...@alibaba-inc.com>
>>>>> Cc:vino yang <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
>>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>>> 
>>>>> Hi Xuefu,
>>>>> 
>>>>> Thanks a lot for driving this big effort. I would suggest convert your
>>>>> proposal and design doc into a google doc, and share it on the dev mailing
>>>>> list for the community to review and comment with title like "[DISCUSS] ...
>>>>> Hive integration design ..." . Once approved,  we can document it as a FLIP
>>>>> (Flink Improvement Proposals), and use JIRAs to track the implementations.
>>>>> What do you think?
>>>>> 
>>>>> Shuyi
>>>>> 
>>>>> On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu <xu...@alibaba-inc.com>
>>>>> wrote:
>>>>> Hi all,
>>>>> 
>>>>> I have also shared a design doc on Hive metastore integration that is
>>>>> attached here and also to FLINK-10556[1]. Please kindly review and share
>>>>> your feedback.
>>>>> 
>>>>> 
>>>>> Thanks,
>>>>> Xuefu
>>>>> 
>>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
>>>>> ------------------------------------------------------------------
>>>>> Sender:Xuefu <xu...@alibaba-inc.com>
>>>>> Sent at:2018 Oct 25 (Thu) 01:08
>>>>> Recipient:Xuefu <xu...@alibaba-inc.com>; Shuyi Chen <
>>>>> suez1224@gmail.com>
>>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
>>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>>> 
>>>>> Hi all,
>>>>> 
>>>>> To wrap up the discussion, I have attached a PDF describing the
>>>>> proposal, which is also attached to FLINK-10556 [1]. Please feel free to
>>>>> watch that JIRA to track the progress.
>>>>> 
>>>>> Please also let me know if you have additional comments or questions.
>>>>> 
>>>>> Thanks,
>>>>> Xuefu
>>>>> 
>>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
>>>>> 
>>>>> 
>>>>> ------------------------------------------------------------------
>>>>> Sender:Xuefu <xu...@alibaba-inc.com>
>>>>> Sent at:2018 Oct 16 (Tue) 03:40
>>>>> Recipient:Shuyi Chen <su...@gmail.com>
>>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
>>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>>> 
>>>>> Hi Shuyi,
>>>>> 
>>>>> Thank you for your input. Yes, I agree with a phased approach and would
>>>>> like to move forward fast. :) We did some work internally on DDL utilizing the babel
>>>>> parser in Calcite. While babel makes Calcite's grammar extensible, at
>>>>> first impression it still seems too cumbersome for a project when too
>>>>> many extensions are made. It's even challenging to find where the extension
>>>>> is needed! It would certainly be better if Calcite could magically support
>>>>> Hive QL by just turning on a flag, such as that for MYSQL_5. I can also
>>>>> see that this could mean a lot of work on Calcite. Nevertheless, I will
>>>>> bring up the discussion over there and see what their community thinks.
>>>>> 
>>>>> Would you mind sharing more info about the DDL proposal that you
>>>>> mentioned? We can certainly collaborate on this.
>>>>> 
>>>>> Thanks,
>>>>> Xuefu
>>>>> 
>>>>> ------------------------------------------------------------------
>>>>> Sender:Shuyi Chen <su...@gmail.com>
>>>>> Sent at:2018 Oct 14 (Sun) 08:30
>>>>> Recipient:Xuefu <xu...@alibaba-inc.com>
>>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
>>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>>> 
>>>>> Welcome to the community and thanks for the great proposal, Xuefu! I
>>>>> think the proposal can be divided into 2 stages: making Flink support
>>>>> Hive features, and making Hive work with Flink. I agree with Timo on
>>>>> starting with a smaller scope, so we can make progress faster. As for [6],
>>>>> a proposal for DDL is already in progress, and will come after the unified
>>>>> SQL connector API is done. For supporting Hive syntax, we might need to
>>>>> work with the Calcite community, and a recent effort called babel (
>>>>> https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite might
>>>>> help here.
>>>>> 
>>>>> Thanks
>>>>> Shuyi
>>>>> 
>>>>> On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <xu...@alibaba-inc.com>
>>>>> wrote:
>>>>> Hi Fabian/Vno,
>>>>> 
>>>>> Thank you very much for your encouragement and inquiry. Sorry that I didn't
>>>>> see Fabian's email until I read Vino's response just now. (Somehow Fabian's
>>>>> went to the spam folder.)
>>>>> 
>>>>> My proposal contains long-term and short-terms goals. Nevertheless, the
>>>>> effort will focus on the following areas, including Fabian's list:
>>>>> 
>>>>> 1. Hive metastore connectivity - This covers both read/write access,
>>>>> which means Flink can make full use of Hive's metastore as its catalog (at
>>>>> least for the batch but can extend for streaming as well).
>>>>> 2. Metadata compatibility - Objects (databases, tables, partitions, etc)
>>>>> created by Hive can be understood by Flink and the reverse direction is
>>>>> true also.
>>>>> 3. Data compatibility - Similar to #2, data produced by Hive can be
>>>>> consumed by Flink and vice versa.
>>>>> 4. Support Hive UDFs - For all Hive's native UDFs, Flink either provides
>>>>> its own implementation or makes Hive's implementation work in Flink.
>>>>> Further, for user-created UDFs in Hive, Flink SQL should provide a
>>>>> mechanism allowing users to import them into Flink without any code change
>>>>> required.
>>>>> 5. Data types -  Flink SQL should support all data types that are
>>>>> available in Hive.
>>>>> 6. SQL Language - Flink SQL should support SQL standard (such as
>>>>> SQL2003) with extension to support Hive's syntax and language features,
>>>>> around DDL, DML, and SELECT queries.
>>>>> 7.  SQL CLI - this is currently developing in Flink but more effort is
>>>>> needed.
>>>>> 8. Server - provide a server that's compatible with Hive's HiveServer2
>>>>> in thrift APIs, such that HiveServer2 users can reuse their existing client
>>>>> (such as beeline) but connect to Flink's thrift server instead.
>>>>> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for
>>>>> other application to use to connect to its thrift server
>>>>> 10. Support other user's customizations in Hive, such as Hive Serdes,
>>>>> storage handlers, etc.
>>>>> 11. Better task failure tolerance and task scheduling at Flink runtime.
>>>>> 
>>>>> As you can see, achieving all of these requires significant effort
>>>>> across all layers in Flink. However, a short-term goal could include
>>>>> only the core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller
>>>>> scope (such as #3, #6).
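The type support in item 5 boils down to a translation between Hive's type system and Flink SQL's. As a purely illustrative sketch (the function and mapping table below are hypothetical, not Flink's actual API), such a translation layer might look like:

```python
# Hypothetical sketch of item 5: translating Hive type names to Flink SQL
# type names. The mapping below is illustrative, not Flink's implementation.
HIVE_TO_FLINK = {
    "tinyint": "TINYINT", "smallint": "SMALLINT", "int": "INT",
    "bigint": "BIGINT", "float": "FLOAT", "double": "DOUBLE",
    "boolean": "BOOLEAN", "string": "VARCHAR", "binary": "VARBINARY",
    "timestamp": "TIMESTAMP", "date": "DATE",
}

def to_flink_type(hive_type: str) -> str:
    """Translate a Hive type name, keeping precision arguments if present."""
    base, sep, args = hive_type.strip().lower().partition("(")
    if base in ("decimal", "varchar", "char"):
        # Parameterized types carry their arguments over unchanged.
        return base.upper() + (sep + args if sep else "")
    if base in HIVE_TO_FLINK:
        return HIVE_TO_FLINK[base]
    raise ValueError("unsupported Hive type: " + hive_type)
```

Note that anything outside the mapping is rejected rather than silently coerced; a real implementation would also need to handle nested types such as array, map, and struct.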
>>>>> 
>>>>> Please share your further thoughts. If we generally agree that this is
>>>>> the right direction, I could come up with a formal proposal quickly and
>>>>> then we can follow up with broader discussions.
>>>>> 
>>>>> Thanks,
>>>>> Xuefu
>>>>> 
>>>>> 
>>>>> 
>>>>> ------------------------------------------------------------------
>>>>> Sender:vino yang <ya...@gmail.com>
>>>>> Sent at:2018 Oct 11 (Thu) 09:45
>>>>> Recipient:Fabian Hueske <fh...@gmail.com>
>>>>> Cc:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>; user <
>>>>> user@flink.apache.org>
>>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>>> 
>>>>> Hi Xuefu,
>>>>> 
>>>>> I appreciate this proposal and, like Fabian, think it would be better if
>>>>> you could give more details of the plan.
>>>>> 
>>>>> Thanks, vino.
>>>>> 
>>>>> Fabian Hueske <fh...@gmail.com> wrote on Wed, Oct 10, 2018 at 5:27 PM:
>>>>> Hi Xuefu,
>>>>> 
>>>>> Welcome to the Flink community and thanks for starting this discussion!
>>>>> Better Hive integration would be really great!
>>>>> Can you go into details of what you are proposing? I can think of a
>>>>> couple ways to improve Flink in that regard:
>>>>> 
>>>>> * Support for Hive UDFs
>>>>> * Support for Hive metadata catalog
>>>>> * Support for HiveQL syntax
>>>>> * ???
>>>>> 
>>>>> Best, Fabian
>>>>> 
>>>>> Am Di., 9. Okt. 2018 um 19:22 Uhr schrieb Zhang, Xuefu <
>>>>> xuefu.z@alibaba-inc.com>:
>>>>> Hi all,
>>>>> 
>>>>> Along with the community's effort, inside Alibaba we have explored
>>>>> Flink's potential as an execution engine not just for stream processing but
>>>>> also for batch processing. We are encouraged by our findings and have
>>>>> initiated our effort to make Flink's SQL capabilities full-fledged. When
>>>>> comparing what's available in Flink to the offerings from competitive data
>>>>> processing engines, we identified a major gap in Flink: good integration
>>>>> with the Hive ecosystem. This is crucial to the success of Flink SQL and
>>>>> batch due to the well-established data ecosystem around Hive. Therefore, we
>>>>> have done some initial work along this direction, but there is still a lot
>>>>> of effort needed.
>>>>> 
>>>>> We have two strategies in mind. The first one is to make Flink SQL
>>>>> full-fledged and well-integrated with Hive ecosystem. This is a similar
>>>>> approach to what Spark SQL adopted. The second strategy is to make Hive
>>>>> itself work with Flink, similar to the proposal in [1]. Each approach bears
>>>>> its pros and cons, but they don’t need to be mutually exclusive, with each
>>>>> targeting different users and use cases. We believe that both will
>>>>> promote a much greater adoption of Flink beyond stream processing.
>>>>> 
>>>>> We have been focused on the first approach and would like to showcase
>>>>> Flink's batch and SQL capabilities with Flink SQL. However, we have also
>>>>> planned to start strategy #2 as the follow-up effort.
>>>>> 
>>>>> I'm completely new to Flink (with a short bio [2] below), though many
>>>>> of my colleagues here at Alibaba are long-time contributors. Nevertheless,
>>>>> I'd like to share our thoughts and invite your early feedback. At the same
>>>>> time, I am working on a detailed proposal on Flink SQL's integration with
>>>>> Hive ecosystem, which will also be shared when ready.
>>>>> 
>>>>> While the ideas are simple, each approach will demand significant
>>>>> effort, more than what we can afford. Thus, the input and contributions
>>>>> from the communities are greatly welcome and appreciated.
>>>>> 
>>>>> Regards,
>>>>> 
>>>>> 
>>>>> Xuefu
>>>>> 
>>>>> References:
>>>>> 
>>>>> [1] https://issues.apache.org/jira/browse/HIVE-10712
>>>>> [2] Xuefu Zhang is a long-time open source veteran who has worked on many
>>>>> projects under the Apache Software Foundation, of which he is also an
>>>>> honored member. About 10 years ago he worked on the Hadoop team at Yahoo,
>>>>> where the projects had just got started. Later he worked at Cloudera,
>>>>> initiating and leading the development of the Hive on Spark project in the
>>>>> communities and across many organizations. Prior to joining Alibaba, he
>>>>> worked at Uber, where he promoted Hive on Spark for all of Uber's
>>>>> SQL-on-Hadoop workloads and significantly improved Uber's cluster efficiency.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> "So you have to trust that the dots will somehow connect in your future."
>>>>> 

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by Bowen Li <bo...@gmail.com>.
Hi Xuefu,

Currently the new design doc
<https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit>
is on “view only" mode, and people cannot leave comments. Can you please
change it to "can comment" or "can edit" mode?

Thanks, Bowen


On Mon, Nov 12, 2018 at 9:51 PM Zhang, Xuefu <xu...@alibaba-inc.com>
wrote:

> Hi Piotr
>
> I have extracted the API portion of the design and the google doc is here
> <https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit?usp=sharing>.
> Please review and provide your feedback.
>
> Thanks,
> Xuefu
>
> ------------------------------------------------------------------
> Sender:Xuefu <xu...@alibaba-inc.com>
> Sent at:2018 Nov 12 (Mon) 12:43
> Recipient:Piotr Nowojski <pi...@data-artisans.com>; dev <
> dev@flink.apache.org>
> Cc:Bowen Li <bo...@gmail.com>; Shuyi Chen <su...@gmail.com>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi Piotr,
>
> That sounds good to me. Let's close all the open questions (there are a
> couple of them) in the Google doc and I should be able to quickly split
> it into the three proposals as you suggested.
>
> Thanks,
> Xuefu
>
> ------------------------------------------------------------------
> Sender:Piotr Nowojski <pi...@data-artisans.com>
> Sent at:2018 Nov 9 (Fri) 22:46
> Recipient:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>
> Cc:Bowen Li <bo...@gmail.com>; Shuyi Chen <su...@gmail.com>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi,
>
>
> Yes, it seems like the best solution. Maybe someone else can also suggest whether we can split it further? Maybe changes to the interface in one doc, reading from the Hive metastore in another, and finally storing our meta information in the Hive metastore?
>
> Piotrek
>
> > On 9 Nov 2018, at 01:44, Zhang, Xuefu <xu...@alibaba-inc.com> wrote:
> >
> > Hi Piotr,
> >
> > That seems to be a good idea!
> >
>
> > Since the google doc for the design is currently under extensive review, I will leave it as it is for now. However, I'll convert it to two different FLIPs when the time comes.
> >
> > How does it sound to you?
> >
> > Thanks,
> > Xuefu
> >
> >
> > ------------------------------------------------------------------
> > Sender:Piotr Nowojski <pi...@data-artisans.com>
> > Sent at:2018 Nov 9 (Fri) 02:31
> > Recipient:dev <de...@flink.apache.org>
> > Cc:Bowen Li <bo...@gmail.com>; Xuefu <xuefu.z@alibaba-inc.com
> >; Shuyi Chen <su...@gmail.com>
> > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> >
> > Hi,
> >
>
> > Maybe we should split this topic (and the design doc) into a couple of smaller ones, hopefully independent. The questions that you have asked Fabian, for example, have very little to do with reading metadata from the Hive Metastore?
> >
> > Piotrek
> >
> >> On 7 Nov 2018, at 14:27, Fabian Hueske <fh...@gmail.com> wrote:
> >>
> >> Hi Xuefu and all,
> >>
> >> Thanks for sharing this design document!
>
> >> I'm very much in favor of restructuring / reworking the catalog handling in
> >> Flink SQL as outlined in the document.
>
> >> Most changes described in the design document seem to be rather general and
> >> not specifically related to the Hive integration.
> >>
>
> >> IMO, there are some aspects, especially those at the boundary of Hive and
> >> Flink, that need a bit more discussion. For example
> >>
> >> * What does it take to make Flink schema compatible with Hive schema?
> >> * How will Flink tables (descriptors) be stored in HMS?
> >> * How do both Hive catalogs differ? Could they be integrated into a
> >> single one? When to use which one?
>
> >> * What meta information is provided by HMS? What of this can be leveraged
> >> by Flink?
> >>
> >> Thank you,
> >> Fabian
> >>
> >> Am Fr., 2. Nov. 2018 um 00:31 Uhr schrieb Bowen Li <bowenli86@gmail.com
> >:
> >>
> >>> After taking a look at how other discussion threads work, I think it's
> >>> actually fine to just keep our discussion here. It's up to you, Xuefu.
> >>>
> >>> The google doc LGTM. I left some minor comments.
> >>>
> >>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li <bo...@gmail.com> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> As Xuefu has published the design doc on google, I agree with Shuyi's
>
> >>>> suggestion that we probably should start a new email thread like "[DISCUSS]
>
> >>>> ... Hive integration design ..." on only dev mailing list for community
> >>>> devs to review. The current thread sends to both dev and user list.
> >>>>
>
> >>>> This email thread is more like validating the general idea and direction
>
> >>>> with the community, and it's been pretty long and crowded so far. Since
>
> >>>> everyone is in favor of the idea, we can move forward with another thread to
> >>>> discuss and finalize the design.
> >>>>
> >>>> Thanks,
> >>>> Bowen
> >>>>
> >>>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu <
> xuefu.z@alibaba-inc.com>
> >>>> wrote:
> >>>>
> >>>>> Hi Shuyi,
> >>>>>
>
> >>>>> Good idea. Actually the PDF was converted from a google doc. Here is its
> >>>>> link:
> >>>>>
> >>>>>
> https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
> >>>>> Once we reach an agreement, I can convert it to a FLIP.
> >>>>>
> >>>>> Thanks,
> >>>>> Xuefu
> >>>>>
> >>>>>
> >>>>>
> >>>>> ------------------------------------------------------------------
> >>>>> Sender:Shuyi Chen <su...@gmail.com>
> >>>>> Sent at:2018 Nov 1 (Thu) 02:47
> >>>>> Recipient:Xuefu <xu...@alibaba-inc.com>
> >>>>> Cc:vino yang <ya...@gmail.com>; Fabian Hueske <
> fhueske@gmail.com>;
> >>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
> >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> >>>>>
> >>>>> Hi Xuefu,
> >>>>>
>
> >>>>> Thanks a lot for driving this big effort. I would suggest converting your
>
> >>>>> proposal and design doc into a google doc, and share it on the dev mailing
>
> >>>>> list for the community to review and comment with a title like "[DISCUSS] ...
>
> >>>>> Hive integration design ..." . Once approved, we can document it as a FLIP
>
> >>>>> (Flink Improvement Proposals), and use JIRAs to track the implementations.
> >>>>> What do you think?
> >>>>>
> >>>>> Shuyi
> >>>>>
> >>>>> On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu <
> xuefu.z@alibaba-inc.com>
> >>>>> wrote:
> >>>>> Hi all,
> >>>>>
> >>>>> I have also shared a design doc on Hive metastore integration that is
>
> >>>>> attached here and also to FLINK-10556[1]. Please kindly review and share
> >>>>> your feedback.
> >>>>>
> >>>>>
> >>>>> Thanks,
> >>>>> Xuefu
> >>>>>
> >>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
> >>>>> ------------------------------------------------------------------
> >>>>> Sender:Xuefu <xu...@alibaba-inc.com>
> >>>>> Sent at:2018 Oct 25 (Thu) 01:08
> >>>>> Recipient:Xuefu <xu...@alibaba-inc.com>; Shuyi Chen <
> >>>>> suez1224@gmail.com>
> >>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <
> fhueske@gmail.com>;
> >>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
> >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> >>>>>
> >>>>> Hi all,
> >>>>>
> >>>>> To wrap up the discussion, I have attached a PDF describing the
>
> >>>>> proposal, which is also attached to FLINK-10556 [1]. Please feel free to
> >>>>> watch that JIRA to track the progress.
> >>>>>
> >>>>> Please also let me know if you have additional comments or questions.
> >>>>>
> >>>>> Thanks,
> >>>>> Xuefu
> >>>>>
> >>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
> >>>>>
> >>>>>
> >>>>> ------------------------------------------------------------------
> >>>>> Sender:Xuefu <xu...@alibaba-inc.com>
> >>>>> Sent at:2018 Oct 16 (Tue) 03:40
> >>>>> Recipient:Shuyi Chen <su...@gmail.com>
> >>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <
> fhueske@gmail.com>;
> >>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
> >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> >>>>>
> >>>>> Hi Shuyi,
> >>>>>
>
> >>>>> Thank you for your input. Yes, I agree with a phased approach and would like
>
> >>>>> to move forward fast. :) We did some work internally on DDL utilizing babel
> >>>>> parser in Calcite. While babel makes Calcite's grammar extensible, at
> >>>>> first impression it still seems too cumbersome for a project when too
>
> >>>>> many extensions are made. It's even challenging to find where the extension
>
> >>>>> is needed! It would be certainly better if Calcite can magically support
>
> >>>>> Hive QL by just turning on a flag, such as that for MYSQL_5. I can also
>
> >>>>> see that this could mean a lot of work on Calcite. Nevertheless, I will
>
> >>>>> bring up the discussion over there and see what their community thinks.
> >>>>>
> >>>>> Would you mind sharing more info about the proposal on DDL that you
> >>>>> mentioned? We can certainly collaborate on this.
> >>>>>
> >>>>> Thanks,
> >>>>> Xuefu
> >>>>>
> >>>>> ------------------------------------------------------------------
> >>>>> Sender:Shuyi Chen <su...@gmail.com>
> >>>>> Sent at:2018 Oct 14 (Sun) 08:30
> >>>>> Recipient:Xuefu <xu...@alibaba-inc.com>
> >>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <
> fhueske@gmail.com>;
> >>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
> >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> >>>>>
> >>>>> Welcome to the community and thanks for the great proposal, Xuefu! I
>
> >>>>> think the proposal can be divided into 2 stages: making Flink support
>
> >>>>> Hive features, and making Hive work with Flink. I agree with Timo on
>
> >>>>> starting with a smaller scope, so we can make progress faster. As for [6],
>
> >>>>> a proposal for DDL is already in progress, and will come after the unified
>
> >>>>> SQL connector API is done. For supporting Hive syntax, we might need to
> >>>>> work with the Calcite community, and a recent effort called babel (
> >>>>> https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite might
> >>>>> help here.
> >>>>>
> >>>>> Thanks
> >>>>> Shuyi
> >>>>>
> >>>>> On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <
> xuefu.z@alibaba-inc.com>
> >>>>> wrote:
> >>>>> Hi Fabian/Vno,
> >>>>>
>
> >>>>> Thank you very much for your encouragement and inquiry. Sorry that I didn't
>
> >>>>> see Fabian's email until I read Vino's response just now. (Somehow Fabian's
> >>>>> went to the spam folder.)
> >>>>>
>
> >>>>> My proposal contains long-term and short-term goals. Nevertheless, the
> >>>>> effort will focus on the following areas, including Fabian's list:
> >>>>>
> >>>>> 1. Hive metastore connectivity - This covers both read/write access,
>
> >>>>> which means Flink can make full use of Hive's metastore as its catalog (at
> >>>>> least for batch, but this can extend to streaming as well).
>
> >>>>> 2. Metadata compatibility - Objects (databases, tables, partitions, etc)
>
> >>>>> created by Hive can be understood by Flink and the reverse direction is
> >>>>> true also.
> >>>>> 3. Data compatibility - Similar to #2, data produced by Hive can be
> >>>>> consumed by Flink and vice versa.
>
> >>>>> 4. Support Hive UDFs - For all of Hive's native UDFs, Flink either provides
> >>>>> its own implementation or makes Hive's implementation work in Flink.
> >>>>> Further, for user-created UDFs in Hive, Flink SQL should provide a
>
> >>>>> mechanism allowing users to import them into Flink without any code change
> >>>>> required.
> >>>>> 5. Data types - Flink SQL should support all data types that are
> >>>>> available in Hive.
> >>>>> 6. SQL Language - Flink SQL should support the SQL standard (such as
>
> >>>>> SQL:2003) with extensions to support Hive's syntax and language features,
> >>>>> around DDL, DML, and SELECT queries.
>
> >>>>> 7. SQL CLI - this is currently under development in Flink, but more effort is
> >>>>> needed.
>
> >>>>> 8. Server - provide a server that's compatible with Hive's HiveServer2
>
> >>>>> in its Thrift APIs, such that HiveServer2 users can reuse their existing
> >>>>> clients (such as Beeline) but connect to Flink's Thrift server instead.
>
> >>>>> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for
> >>>>> other applications to connect to its Thrift server.
> >>>>> 10. Support other user customizations in Hive, such as Hive SerDes,
> >>>>> storage handlers, etc.
>
> >>>>> 11. Better task failure tolerance and task scheduling at Flink runtime.
> >>>>>
> >>>>> As you can see, achieving all of these requires significant effort
>
> >>>>> across all layers in Flink. However, a short-term goal could include only the
>
> >>>>> core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope (such as
> >>>>> #3, #6).
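Item 3 in the list above (data compatibility) can be made concrete with Hive's default text storage format: Hive's LazySimpleSerDe writes rows with \x01 (Control-A) as the field delimiter and the literal string \N for SQL NULL, so any engine honoring those conventions can consume Hive-produced text files. A minimal, purely illustrative reader (the function name is hypothetical, not part of any actual Flink API):

```python
# Illustrative sketch of item 3: reading a row written by Hive's default
# text SerDe (field delimiter \x01, NULL encoded as the literal string \N).
FIELD_DELIM = "\x01"
NULL_STRING = "\\N"

def parse_hive_text_row(line: str):
    """Split one line of a Hive text-format file into column values."""
    return [None if field == NULL_STRING else field
            for field in line.rstrip("\n").split(FIELD_DELIM)]
```

A real implementation would also have to honor table-level SerDe properties (custom delimiters, escape characters), which is where item 10 comes in.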
> >>>>>
>
> >>>>> Please share your further thoughts. If we generally agree that this is
>
> >>>>> the right direction, I could come up with a formal proposal quickly and
> >>>>> then we can follow up with broader discussions.
> >>>>>
> >>>>> Thanks,
> >>>>> Xuefu
> >>>>>
> >>>>>
> >>>>>
> >>>>> ------------------------------------------------------------------
> >>>>> Sender:vino yang <ya...@gmail.com>
> >>>>> Sent at:2018 Oct 11 (Thu) 09:45
> >>>>> Recipient:Fabian Hueske <fh...@gmail.com>
> >>>>> Cc:dev <de...@flink.apache.org>; Xuefu <xuefu.z@alibaba-inc.com
> >; user <
> >>>>> user@flink.apache.org>
> >>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> >>>>>
> >>>>> Hi Xuefu,
> >>>>>
>
> >>>>> I appreciate this proposal and, like Fabian, think it would be better if
> >>>>> you could give more details of the plan.
> >>>>>
> >>>>> Thanks, vino.
> >>>>>
> >>>>> Fabian Hueske <fh...@gmail.com> wrote on Wed, Oct 10, 2018 at 5:27 PM:
> >>>>> Hi Xuefu,
> >>>>>
>
> >>>>> Welcome to the Flink community and thanks for starting this discussion!
> >>>>> Better Hive integration would be really great!
> >>>>> Can you go into details of what you are proposing? I can think of a
> >>>>> couple ways to improve Flink in that regard:
> >>>>>
> >>>>> * Support for Hive UDFs
> >>>>> * Support for Hive metadata catalog
> >>>>> * Support for HiveQL syntax
> >>>>> * ???
> >>>>>
> >>>>> Best, Fabian
> >>>>>
> >>>>> Am Di., 9. Okt. 2018 um 19:22 Uhr schrieb Zhang, Xuefu <
> >>>>> xuefu.z@alibaba-inc.com>:
> >>>>> Hi all,
> >>>>>
> >>>>> Along with the community's effort, inside Alibaba we have explored
>
> >>>>> Flink's potential as an execution engine not just for stream processing but
> >>>>> also for batch processing. We are encouraged by our findings and have
>
> >>>>> initiated our effort to make Flink's SQL capabilities full-fledged. When
>
> >>>>> comparing what's available in Flink to the offerings from competitive data
>
> >>>>> processing engines, we identified a major gap in Flink: good integration
>
> >>>>> with the Hive ecosystem. This is crucial to the success of Flink SQL and batch
>
> >>>>> due to the well-established data ecosystem around Hive. Therefore, we have
>
> >>>>> done some initial work along this direction, but there is still a lot of
> >>>>> effort needed.
> >>>>>
> >>>>> We have two strategies in mind. The first one is to make Flink SQL
>
> >>>>> full-fledged and well-integrated with Hive ecosystem. This is a similar
>
> >>>>> approach to what Spark SQL adopted. The second strategy is to make Hive
>
> >>>>> itself work with Flink, similar to the proposal in [1]. Each approach bears
>
> >>>>> its pros and cons, but they don’t need to be mutually exclusive, with each
> >>>>> targeting different users and use cases. We believe that both will
> >>>>> promote a much greater adoption of Flink beyond stream processing.
> >>>>>
> >>>>> We have been focused on the first approach and would like to showcase
>
> >>>>> Flink's batch and SQL capabilities with Flink SQL. However, we have also
> >>>>> planned to start strategy #2 as the follow-up effort.
> >>>>>
>
> >>>>> I'm completely new to Flink (with a short bio [2] below), though many
>
> >>>>> of my colleagues here at Alibaba are long-time contributors. Nevertheless,
>
> >>>>> I'd like to share our thoughts and invite your early feedback. At the same
>
> >>>>> time, I am working on a detailed proposal on Flink SQL's integration with
> >>>>> Hive ecosystem, which will also be shared when ready.
> >>>>>
> >>>>> While the ideas are simple, each approach will demand significant
>
> >>>>> effort, more than what we can afford. Thus, the input and contributions
> >>>>> from the communities are greatly welcome and appreciated.
> >>>>>
> >>>>> Regards,
> >>>>>
> >>>>>
> >>>>> Xuefu
> >>>>>
> >>>>> References:
> >>>>>
> >>>>> [1] https://issues.apache.org/jira/browse/HIVE-10712
>
> >>>>> [2] Xuefu Zhang is a long-time open source veteran who has worked on
> >>>>> many projects under the Apache Software Foundation, of which he is also an honored
>
> >>>>> member. About 10 years ago he worked on the Hadoop team at Yahoo, where the
>
> >>>>> projects had just got started. Later he worked at Cloudera, initiating and
>
> >>>>> leading the development of the Hive on Spark project in the communities and
>
> >>>>> across many organizations. Prior to joining Alibaba, he worked at Uber
>
> >>>>> where he promoted Hive on Spark for all of Uber's SQL-on-Hadoop workloads and
> >>>>> significantly improved Uber's cluster efficiency.
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
>
> >>>>> "So you have to trust that the dots will somehow connect in your future."
> >>>>>
> >>>>>
>
>

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by "Zhang, Xuefu" <xu...@alibaba-inc.com>.
Hi Piotr

I have extracted the API portion of the design and the google doc is here. Please review and provide your feedback.

Thanks,
Xuefu


------------------------------------------------------------------
Sender:Xuefu <xu...@alibaba-inc.com>
Sent at:2018 Nov 12 (Mon) 12:43
Recipient:Piotr Nowojski <pi...@data-artisans.com>; dev <de...@flink.apache.org>
Cc:Bowen Li <bo...@gmail.com>; Shuyi Chen <su...@gmail.com>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Piotr,

That sounds good to me. Let's close all the open questions (there are a couple of them) in the Google doc and I should be able to quickly split it into the three proposals as you suggested.

Thanks,
Xuefu


------------------------------------------------------------------
Sender:Piotr Nowojski <pi...@data-artisans.com>
Sent at:2018 Nov 9 (Fri) 22:46
Recipient:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>
Cc:Bowen Li <bo...@gmail.com>; Shuyi Chen <su...@gmail.com>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi,

Yes, it seems like the best solution. Maybe someone else can also suggest whether we can split it further? Maybe changes to the interface in one doc, reading from the Hive metastore in another, and finally storing our meta information in the Hive metastore?

Piotrek

> On 9 Nov 2018, at 01:44, Zhang, Xuefu <xu...@alibaba-inc.com> wrote:
> 
> Hi Piotr,
> 
> That seems to be a good idea!
> 
> Since the google doc for the design is currently under extensive review, I will leave it as it is for now. However, I'll convert it to two different FLIPs when the time comes.
> 
> How does it sound to you?
> 
> Thanks,
> Xuefu
> 
> 
> ------------------------------------------------------------------
> Sender:Piotr Nowojski <pi...@data-artisans.com>
> Sent at:2018 Nov 9 (Fri) 02:31
> Recipient:dev <de...@flink.apache.org>
> Cc:Bowen Li <bo...@gmail.com>; Xuefu <xu...@alibaba-inc.com>; Shuyi Chen <su...@gmail.com>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> 
> Hi,
> 
> Maybe we should split this topic (and the design doc) into a couple of smaller ones, hopefully independent. The questions that you have asked Fabian, for example, have very little to do with reading metadata from the Hive Metastore?
> 
> Piotrek 
> 
>> On 7 Nov 2018, at 14:27, Fabian Hueske <fh...@gmail.com> wrote:
>> 
>> Hi Xuefu and all,
>> 
>> Thanks for sharing this design document!
>> I'm very much in favor of restructuring / reworking the catalog handling in
>> Flink SQL as outlined in the document.
>> Most changes described in the design document seem to be rather general and
>> not specifically related to the Hive integration.
>> 
>> IMO, there are some aspects, especially those at the boundary of Hive and
>> Flink, that need a bit more discussion. For example
>> 
>> * What does it take to make Flink schema compatible with Hive schema?
>> * How will Flink tables (descriptors) be stored in HMS?
>> * How do both Hive catalogs differ? Could they be integrated into a
>> single one? When to use which one?
>> * What meta information is provided by HMS? What of this can be leveraged
>> by Flink?
>> 
>> Thank you,
>> Fabian
>> 
>> Am Fr., 2. Nov. 2018 um 00:31 Uhr schrieb Bowen Li <bo...@gmail.com>:
>> 
>>> After taking a look at how other discussion threads work, I think it's
>>> actually fine to just keep our discussion here. It's up to you, Xuefu.
>>> 
>>> The google doc LGTM. I left some minor comments.
>>> 
>>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li <bo...@gmail.com> wrote:
>>> 
>>>> Hi all,
>>>> 
>>>> As Xuefu has published the design doc on google, I agree with Shuyi's
>>>> suggestion that we probably should start a new email thread like "[DISCUSS]
>>>> ... Hive integration design ..." on only dev mailing list for community
>>>> devs to review. The current thread sends to both dev and user list.
>>>> 
>>>> This email thread is more like validating the general idea and direction
>>>> with the community, and it's been pretty long and crowded so far. Since
>>>> everyone is in favor of the idea, we can move forward with another thread to
>>>> discuss and finalize the design.
>>>> 
>>>> Thanks,
>>>> Bowen
>>>> 
>>>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu <xu...@alibaba-inc.com>
>>>> wrote:
>>>> 
>>>>> Hi Shuyi,
>>>>> 
>>>>> Good idea. Actually the PDF was converted from a google doc. Here is its
>>>>> link:
>>>>> 
>>>>> https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
>>>>> Once we reach an agreement, I can convert it to a FLIP.
>>>>> 
>>>>> Thanks,
>>>>> Xuefu
>>>>> 
>>>>> 
>>>>> 
>>>>> ------------------------------------------------------------------
>>>>> Sender:Shuyi Chen <su...@gmail.com>
>>>>> Sent at:2018 Nov 1 (Thu) 02:47
>>>>> Recipient:Xuefu <xu...@alibaba-inc.com>
>>>>> Cc:vino yang <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
>>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>>> 
>>>>> Hi Xuefu,
>>>>> 
>>>>> Thanks a lot for driving this big effort. I would suggest converting your
>>>>> proposal and design doc into a google doc, and share it on the dev mailing
>>>>> list for the community to review and comment with a title like "[DISCUSS] ...
>>>>> Hive integration design ..." . Once approved, we can document it as a FLIP
>>>>> (Flink Improvement Proposals), and use JIRAs to track the implementations.
>>>>> What do you think?
>>>>> 
>>>>> Shuyi
>>>>> 
>>>>> On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu <xu...@alibaba-inc.com>
>>>>> wrote:
>>>>> Hi all,
>>>>> 
>>>>> I have also shared a design doc on Hive metastore integration that is
>>>>> attached here and also to FLINK-10556[1]. Please kindly review and share
>>>>> your feedback.
>>>>> 
>>>>> 
>>>>> Thanks,
>>>>> Xuefu
>>>>> 
>>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
>>>>> ------------------------------------------------------------------
>>>>> Sender:Xuefu <xu...@alibaba-inc.com>
>>>>> Sent at:2018 Oct 25 (Thu) 01:08
>>>>> Recipient:Xuefu <xu...@alibaba-inc.com>; Shuyi Chen <
>>>>> suez1224@gmail.com>
>>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
>>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>>> 
>>>>> Hi all,
>>>>> 
>>>>> To wrap up the discussion, I have attached a PDF describing the
>>>>> proposal, which is also attached to FLINK-10556 [1]. Please feel free to
>>>>> watch that JIRA to track the progress.
>>>>> 
>>>>> Please also let me know if you have additional comments or questions.
>>>>> 
>>>>> Thanks,
>>>>> Xuefu
>>>>> 
>>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
>>>>> 
>>>>> 
>>>>> ------------------------------------------------------------------
>>>>> Sender:Xuefu <xu...@alibaba-inc.com>
>>>>> Sent at:2018 Oct 16 (Tue) 03:40
>>>>> Recipient:Shuyi Chen <su...@gmail.com>
>>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
>>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>>> 
>>>>> Hi Shuyi,
>>>>> 
>>>>> Thank you for your input. Yes, I agree with a phased approach and would like
>>>>> to move forward fast. :) We did some work internally on DDL utilizing babel
>>>>> parser in Calcite. While babel makes Calcite's grammar extensible, at
>>>>> first impression it still seems too cumbersome for a project when too
>>>>> many extensions are made. It's even challenging to find where the extension
>>>>> is needed! It would be certainly better if Calcite can magically support
>>>>> Hive QL by just turning on a flag, such as that for MYSQL_5. I can also
>>>>> see that this could mean a lot of work on Calcite. Nevertheless, I will
>>>>> bring up the discussion over there and see what their community thinks.
>>>>> 
>>>>> Would you mind sharing more info about the proposal on DDL that you
>>>>> mentioned? We can certainly collaborate on this.
>>>>> 
>>>>> Thanks,
>>>>> Xuefu
>>>>> 
>>>>> ------------------------------------------------------------------
>>>>> Sender:Shuyi Chen <su...@gmail.com>
>>>>> Sent at:2018 Oct 14 (Sun) 08:30
>>>>> Recipient:Xuefu <xu...@alibaba-inc.com>
>>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
>>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>>> 
>>>>> Welcome to the community and thanks for the great proposal, Xuefu! I
>>>>> think the proposal can be divided into 2 stages: making Flink support
>>>>> Hive features, and making Hive work with Flink. I agree with Timo on
>>>>> starting with a smaller scope, so we can make progress faster. As for [6],
>>>>> a proposal for DDL is already in progress, and will come after the unified
>>>>> SQL connector API is done. For supporting Hive syntax, we might need to
>>>>> work with the Calcite community, and a recent effort called babel (
>>>>> https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite might
>>>>> help here.
>>>>> 
>>>>> Thanks
>>>>> Shuyi
>>>>> 
>>>>> On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <xu...@alibaba-inc.com>
>>>>> wrote:
>>>>> Hi Fabian/Vno,
>>>>> 
>>>>> Thank you very much for your encouragement and inquiry. Sorry that I didn't
>>>>> see Fabian's email until I read Vino's response just now. (Somehow Fabian's
>>>>> went to the spam folder.)
>>>>> 
>>>>> My proposal contains long-term and short-term goals. Nevertheless, the
>>>>> effort will focus on the following areas, including Fabian's list:
>>>>> 
>>>>> 1. Hive metastore connectivity - This covers both read and write access,
>>>>> which means Flink can make full use of Hive's metastore as its catalog (at
>>>>> least for batch, but this can be extended to streaming as well).
>>>>> 2. Metadata compatibility - Objects (databases, tables, partitions, etc.)
>>>>> created by Hive can be understood by Flink, and the reverse is true as
>>>>> well.
>>>>> 3. Data compatibility - Similar to #2, data produced by Hive can be
>>>>> consumed by Flink and vice versa.
>>>>> 4. Support Hive UDFs - For all of Hive's native UDFs, Flink either
>>>>> provides its own implementation or makes Hive's implementation work in
>>>>> Flink. Further, for user-created UDFs in Hive, Flink SQL should provide a
>>>>> mechanism allowing users to import them into Flink without any code change
>>>>> required.
>>>>> 5. Data types - Flink SQL should support all data types that are
>>>>> available in Hive.
>>>>> 6. SQL language - Flink SQL should support the SQL standard (such as
>>>>> SQL:2003) with extensions to support Hive's syntax and language features,
>>>>> around DDL, DML, and SELECT queries.
>>>>> 7. SQL CLI - this is currently under development in Flink, but more
>>>>> effort is needed.
>>>>> 8. Server - provide a server that's compatible with Hive's HiveServer2
>>>>> Thrift APIs, such that HiveServer2 users can reuse their existing clients
>>>>> (such as Beeline) but connect to Flink's Thrift server instead.
>>>>> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for
>>>>> other applications to use to connect to its Thrift server.
>>>>> 10. Support other user customizations in Hive, such as Hive SerDes,
>>>>> storage handlers, etc.
>>>>> 11. Better task fault tolerance and task scheduling in the Flink runtime.
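To make items 1, 2 and 5 above concrete, here is a minimal, self-contained sketch of the metastore-as-catalog idea with a type mapping. Every type and method name below is invented for illustration; real code would talk to Hive's metastore over its Thrift API and produce actual Flink table descriptors:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy stand-in for items 1, 2 and 5: Hive metastore objects
// (database -> table -> columns) exposed through a Flink-side catalog,
// with Hive type names mapped to Flink SQL types. Illustration only.
public class ToyHiveCatalog {

    // Item 5: a (partial, hypothetical) Hive-to-Flink type mapping.
    public static final Map<String, String> HIVE_TO_FLINK_TYPE = Map.of(
        "int", "INT",
        "bigint", "BIGINT",
        "string", "VARCHAR",
        "double", "DOUBLE",
        "boolean", "BOOLEAN",
        "timestamp", "TIMESTAMP"
    );

    // Items 1-2: an in-memory stand-in for what a catalog would fetch
    // from the metastore (column name -> Hive type name, in order).
    public final Map<String, Map<String, LinkedHashMap<String, String>>> store = new HashMap<>();

    public void createTable(String db, String table, LinkedHashMap<String, String> hiveSchema) {
        store.computeIfAbsent(db, k -> new HashMap<>()).put(table, hiveSchema);
    }

    // A Flink-side view of a Hive table: same columns, mapped types.
    public LinkedHashMap<String, String> getFlinkSchema(String db, String table) {
        LinkedHashMap<String, String> flinkSchema = new LinkedHashMap<>();
        store.get(db).get(table).forEach((col, hiveType) ->
            flinkSchema.put(col, HIVE_TO_FLINK_TYPE.getOrDefault(hiveType, "ANY")));
        return flinkSchema;
    }

    public static void main(String[] args) {
        ToyHiveCatalog catalog = new ToyHiveCatalog();
        LinkedHashMap<String, String> schema = new LinkedHashMap<>();
        schema.put("id", "bigint");
        schema.put("name", "string");
        catalog.createTable("default", "users", schema);
        System.out.println(catalog.getFlinkSchema("default", "users"));
        // {id=BIGINT, name=VARCHAR}
    }
}
```

The `getOrDefault(..., "ANY")` fallback marks the open design question in item 5: what Flink should do with Hive types it has no native equivalent for.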
>>>>> 
>>>>> As you can see, achieving all of this requires significant effort across
>>>>> all layers of Flink. However, a short-term goal could include only the
>>>>> core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope (such as
>>>>> #3, #6).
>>>>> 
>>>>> Please share your further thoughts. If we generally agree that this is
>>>>> the right direction, I could come up with a formal proposal quickly and
>>>>> then we can follow up with broader discussions.
>>>>> 
>>>>> Thanks,
>>>>> Xuefu
>>>>> 
>>>>> 
>>>>> 
>>>>> ------------------------------------------------------------------
>>>>> Sender:vino yang <ya...@gmail.com>
>>>>> Sent at:2018 Oct 11 (Thu) 09:45
>>>>> Recipient:Fabian Hueske <fh...@gmail.com>
>>>>> Cc:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>; user <
>>>>> user@flink.apache.org>
>>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>>> 
>>>>> Hi Xuefu,
>>>>> 
>>>>> Appreciate this proposal; like Fabian, I think it would be better if you
>>>>> could give more details of the plan.
>>>>> 
>>>>> Thanks, vino.
>>>>> 
>>>>> Fabian Hueske <fh...@gmail.com> 于2018年10月10日周三 下午5:27写道:
>>>>> Hi Xuefu,
>>>>> 
>>>>> Welcome to the Flink community and thanks for starting this discussion!
>>>>> Better Hive integration would be really great!
>>>>> Can you go into details of what you are proposing? I can think of a
>>>>> couple ways to improve Flink in that regard:
>>>>> 
>>>>> * Support for Hive UDFs
>>>>> * Support for Hive metadata catalog
>>>>> * Support for HiveQL syntax
>>>>> * ???
>>>>> 
>>>>> Best, Fabian
>>>>> 
>>>>> Am Di., 9. Okt. 2018 um 19:22 Uhr schrieb Zhang, Xuefu <
>>>>> xuefu.z@alibaba-inc.com>:
>>>>> Hi all,
>>>>> 
>>>>> Along with the community's effort, inside Alibaba we have explored
>>>>> Flink's potential as an execution engine not just for stream processing but
>>>>> also for batch processing. We are encouraged by our findings and have
>>>>> initiated our effort to make Flink's SQL capabilities full-fledged. When
>>>>> comparing what's available in Flink to the offerings from competing data
>>>>> processing engines, we identified a major gap in Flink: good integration
>>>>> with the Hive ecosystem. This is crucial to the success of Flink SQL and
>>>>> batch due to the well-established data ecosystem around Hive. Therefore,
>>>>> we have done some initial work in this direction, but a lot of effort is
>>>>> still needed.
>>>>> 
>>>>> We have two strategies in mind. The first one is to make Flink SQL
>>>>> full-fledged and well-integrated with the Hive ecosystem. This is similar
>>>>> to the approach Spark SQL adopted. The second strategy is to make Hive
>>>>> itself work with Flink, similar to the proposal in [1]. Each approach has
>>>>> its pros and cons, but they don't need to be mutually exclusive, with each
>>>>> targeting different users and use cases. We believe that both will
>>>>> promote a much greater adoption of Flink beyond stream processing.
>>>>> 
>>>>> We have been focused on the first approach and would like to showcase
>>>>> Flink's batch and SQL capabilities with Flink SQL. However, we have also
>>>>> planned to start strategy #2 as the follow-up effort.
>>>>> 
>>>>> I'm completely new to Flink (with a short bio [2] below), though many
>>>>> of my colleagues here at Alibaba are long-time contributors. Nevertheless,
>>>>> I'd like to share our thoughts and invite your early feedback. At the same
>>>>> time, I am working on a detailed proposal on Flink SQL's integration with
>>>>> the Hive ecosystem, which will also be shared when ready.
>>>>> 
>>>>> While the ideas are simple, each approach will demand significant
>>>>> effort, more than what we can afford alone. Thus, input and contributions
>>>>> from the community are greatly welcome and appreciated.
>>>>> 
>>>>> Regards,
>>>>> 
>>>>> 
>>>>> Xuefu
>>>>> 
>>>>> References:
>>>>> 
>>>>> [1] https://issues.apache.org/jira/browse/HIVE-10712
>>>>> [2] Xuefu Zhang is a long-time open source veteran who has worked on
>>>>> many projects under the Apache Software Foundation, of which he is also an
>>>>> honored member. About 10 years ago he worked on the Hadoop team at Yahoo,
>>>>> where the projects had just gotten started. Later he worked at Cloudera,
>>>>> initiating and leading the development of the Hive on Spark project across
>>>>> the communities and many organizations. Prior to joining Alibaba, he
>>>>> worked at Uber, where he rolled out Hive on Spark across all of Uber's
>>>>> SQL-on-Hadoop workloads and significantly improved Uber's cluster
>>>>> efficiency.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> "So you have to trust that the dots will somehow connect in your future."
>>>>> 
>>>>> 

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by "Zhang, Xuefu" <xu...@alibaba-inc.com>.
Hi Piotr,

That sounds good to me. Let's close all the open questions (there are a couple of them) in the Google doc, and I should be able to quickly split it into the three proposals as you suggested.

Thanks,
Xuefu


------------------------------------------------------------------
Sender:Piotr Nowojski <pi...@data-artisans.com>
Sent at:2018 Nov 9 (Fri) 22:46
Recipient:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>
Cc:Bowen Li <bo...@gmail.com>; Shuyi Chen <su...@gmail.com>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi,

Yes, it seems like the best solution. Maybe someone else can also suggest whether we can split it further? Maybe the interface changes in one doc, reading from the Hive metastore in another, and finally storing our meta information in the Hive metastore?

Piotrek

> On 9 Nov 2018, at 01:44, Zhang, Xuefu <xu...@alibaba-inc.com> wrote:
> 
> Hi Piotr,
> 
> That seems to be a good idea!
> 
> Since the google doc for the design is currently under extensive review, I will leave it as it is for now. However, I'll convert it to two different FLIPs when the time comes.
> 
> How does it sound to you?
> 
> Thanks,
> Xuefu
> 
> 
> ------------------------------------------------------------------
> Sender:Piotr Nowojski <pi...@data-artisans.com>
> Sent at:2018 Nov 9 (Fri) 02:31
> Recipient:dev <de...@flink.apache.org>
> Cc:Bowen Li <bo...@gmail.com>; Xuefu <xu...@alibaba-inc.com>; Shuyi Chen <su...@gmail.com>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> 
> Hi,
> 
> Maybe we should split this topic (and the design doc) into a couple of smaller, hopefully independent ones. The questions that you have asked Fabian, for example, have very little to do with reading metadata from the Hive Metastore.
> 
> Piotrek 
> 
>> On 7 Nov 2018, at 14:27, Fabian Hueske <fh...@gmail.com> wrote:
>> 
>> Hi Xuefu and all,
>> 
>> Thanks for sharing this design document!
>> I'm very much in favor of restructuring / reworking the catalog handling in
>> Flink SQL as outlined in the document.
>> Most changes described in the design document seem to be rather general and
>> not specifically related to the Hive integration.
>> 
>> IMO, there are some aspects, especially those at the boundary of Hive and
>> Flink, that need a bit more discussion. For example
>> 
>> * What does it take to make Flink schema compatible with Hive schema?
>> * How will Flink tables (descriptors) be stored in HMS?
>> * How do both Hive catalogs differ? Could they be integrated into a
>> single one? When to use which one?
>> * What meta information is provided by HMS? What of this can be leveraged
>> by Flink?
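On Fabian's second question, one possible shape of an answer (a sketch only, not an agreed design — the `flink.` prefix and all names here are invented for illustration) is to flatten a Flink table descriptor into the string-to-string parameter map that an HMS table already carries:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of storing Flink table descriptors in HMS: the descriptor's
// key/value properties are written into the metastore table's parameter
// map under a reserved prefix, and read back by stripping it.
public class DescriptorRoundTrip {
    static final String PREFIX = "flink.";

    public static Map<String, String> toHmsParameters(Map<String, String> descriptor) {
        Map<String, String> params = new HashMap<>();
        descriptor.forEach((k, v) -> params.put(PREFIX + k, v));
        return params;
    }

    public static Map<String, String> fromHmsParameters(Map<String, String> params) {
        Map<String, String> descriptor = new HashMap<>();
        params.forEach((k, v) -> {
            if (k.startsWith(PREFIX)) {
                descriptor.put(k.substring(PREFIX.length()), v);
            }
        });
        return descriptor;
    }

    public static void main(String[] args) {
        Map<String, String> descriptor = Map.of(
            "connector.type", "kafka",
            "format.type", "json");
        // A round trip through HMS parameters must recover the descriptor.
        Map<String, String> stored = toHmsParameters(descriptor);
        System.out.println(fromHmsParameters(stored).equals(descriptor)); // true
    }
}
```

The prefix also gives a cheap answer to the schema-compatibility question above: a table with `flink.*` parameters is Flink-authored, while a plain Hive table has none and is read through the type mapping instead.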
>> 
>> Thank you,
>> Fabian
>> 
>> Am Fr., 2. Nov. 2018 um 00:31 Uhr schrieb Bowen Li <bo...@gmail.com>:
>> 
>>> After taking a look at how other discussion threads work, I think it's
>>> actually fine to just keep our discussion here. It's up to you, Xuefu.
>>> 
>>> The google doc LGTM. I left some minor comments.
>>> 
>>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li <bo...@gmail.com> wrote:
>>> 
>>>> Hi all,
>>>> 
>>>> As Xuefu has published the design doc on google, I agree with Shuyi's
>>>> suggestion that we probably should start a new email thread like "[DISCUSS]
>>>> ... Hive integration design ..." on only dev mailing list for community
>>>> devs to review. The current thread sends to both dev and user list.
>>>> 
>>>> This email thread is more like validating the general idea and direction
>>>> with the community, and it's been pretty long and crowded so far. Since
>>>> everyone is pro for the idea, we can move forward with another thread to
>>>> discuss and finalize the design.
>>>> 
>>>> Thanks,
>>>> Bowen
>>>> 
>>>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu <xu...@alibaba-inc.com>
>>>> wrote:
>>>> 
>>>>> Hi Shuyi,
>>>>> 
>>>>> Good idea. Actually the PDF was converted from a google doc. Here is its
>>>>> link:
>>>>> 
>>>>> https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
>>>>> Once we reach an agreement, I can convert it to a FLIP.
>>>>> 
>>>>> Thanks,
>>>>> Xuefu
>>>>> 
>>>>> 
>>>>> 
>>>>> ------------------------------------------------------------------
>>>>> Sender:Shuyi Chen <su...@gmail.com>
>>>>> Sent at:2018 Nov 1 (Thu) 02:47
>>>>> Recipient:Xuefu <xu...@alibaba-inc.com>
>>>>> Cc:vino yang <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
>>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>>> 
>>>>> Hi Xuefu,
>>>>> 
>>>>> Thanks a lot for driving this big effort. I would suggest converting your
>>>>> proposal and design doc into a Google doc and sharing it on the dev mailing
>>>>> list for the community to review and comment on, with a title like "[DISCUSS]
>>>>> ... Hive integration design ...". Once approved, we can document it as a FLIP
>>>>> (Flink Improvement Proposals) and use JIRAs to track the implementations.
>>>>> What do you think?
>>>>> 
>>>>> Shuyi
>>>>> 
>>>>> On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu <xu...@alibaba-inc.com>
>>>>> wrote:
>>>>> Hi all,
>>>>> 
>>>>> I have also shared a design doc on Hive metastore integration that is
>>>>> attached here and also to FLINK-10556 [1]. Please kindly review and share
>>>>> your feedback.
>>>>> 
>>>>> 
>>>>> Thanks,
>>>>> Xuefu
>>>>> 
>>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
>>>>> ------------------------------------------------------------------
>>>>> Sender:Xuefu <xu...@alibaba-inc.com>
>>>>> Sent at:2018 Oct 25 (Thu) 01:08
>>>>> Recipient:Xuefu <xu...@alibaba-inc.com>; Shuyi Chen <
>>>>> suez1224@gmail.com>
>>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
>>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>>> 
>>>>> Hi all,
>>>>> 
>>>>> To wrap up the discussion, I have attached a PDF describing the
>>>>> proposal, which is also attached to FLINK-10556 [1]. Please feel free to
>>>>> watch that JIRA to track the progress.
>>>>> 
>>>>> Please also let me know if you have additional comments or questions.
>>>>> 
>>>>> Thanks,
>>>>> Xuefu
>>>>> 
>>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
>>>>> 
>>>>> 
>>>>> ------------------------------------------------------------------
>>>>> Sender:Xuefu <xu...@alibaba-inc.com>
>>>>> Sent at:2018 Oct 16 (Tue) 03:40
>>>>> Recipient:Shuyi Chen <su...@gmail.com>
>>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
>>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>>> 
>>>>> Hi Shuyi,
>>>>> 
>>>>> Thank you for your input. Yes, I agree with a phased approach and would
>>>>> like to move forward fast. :) We did some work internally on DDL utilizing
>>>>> the babel parser in Calcite. While babel makes Calcite's grammar
>>>>> extensible, at first impression it still seems too cumbersome for a
>>>>> project when too many extensions are made. It's even challenging to find
>>>>> where the extension is needed! It would certainly be better if Calcite
>>>>> could magically support Hive QL just by turning on a flag, such as the one
>>>>> for MYSQL_5. I can also see that this could mean a lot of work on Calcite.
>>>>> Nevertheless, I will bring up the discussion over there to see what their
>>>>> community thinks.
>>>>> 
>>>>> Would you mind sharing more info about the DDL proposal that you
>>>>> mentioned? We can certainly collaborate on this.
>>>>> 
>>>>> Thanks,
>>>>> Xuefu
>>>>> 

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by Piotr Nowojski <pi...@data-artisans.com>.
Hi,

Yes, it seems like the best solution. Maybe someone else can also suggest whether we can split it further? Maybe the interface changes in one doc, reading from the Hive metastore in another, and finally storing our meta information in the Hive metastore?

Piotrek

> On 9 Nov 2018, at 01:44, Zhang, Xuefu <xu...@alibaba-inc.com> wrote:
> 
> Hi Piotr,
> 
> That seems to be good idea!
> 
> Since the google doc for the design is currently under extensive review, I will leave it as it is for now. However, I'll convert it to two different FLIPs when the time comes.
> 
> How does it sound to you?
> 
> Thanks,
> Xuefu
> 
> 
> ------------------------------------------------------------------
> Sender:Piotr Nowojski <pi...@data-artisans.com>
> Sent at:2018 Nov 9 (Fri) 02:31
> Recipient:dev <de...@flink.apache.org>
> Cc:Bowen Li <bo...@gmail.com>; Xuefu <xu...@alibaba-inc.com>; Shuyi Chen <su...@gmail.com>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> 
> Hi,
> 
> Maybe we should split this topic (and the design doc) into couple of smaller ones, hopefully independent. The questions that you have asked Fabian have for example very little to do with reading metadata from Hive Meta Store?
> 
> Piotrek 
> 
>> On 7 Nov 2018, at 14:27, Fabian Hueske <fh...@gmail.com> wrote:
>> 
>> Hi Xuefu and all,
>> 
>> Thanks for sharing this design document!
>> I'm very much in favor of restructuring / reworking the catalog handling in
>> Flink SQL as outlined in the document.
>> Most changes described in the design document seem to be rather general and
>> not specifically related to the Hive integration.
>> 
>> IMO, there are some aspects, especially those at the boundary of Hive and
>> Flink, that need a bit more discussion. For example
>> 
>> * What does it take to make Flink schema compatible with Hive schema?
>> * How will Flink tables (descriptors) be stored in HMS?
>> * How do both Hive catalogs differ? Could they be integrated into to a
>> single one? When to use which one?
>> * What meta information is provided by HMS? What of this can be leveraged
>> by Flink?
>> 
>> Thank you,
>> Fabian
>> 
>> Am Fr., 2. Nov. 2018 um 00:31 Uhr schrieb Bowen Li <bo...@gmail.com>:
>> 
>>> After taking a look at how other discussion threads work, I think it's
>>> actually fine just keep our discussion here. It's up to you, Xuefu.
>>> 
>>> The google doc LGTM. I left some minor comments.
>>> 
>>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li <bo...@gmail.com> wrote:
>>> 
>>>> Hi all,
>>>> 
>>>> As Xuefu has published the design doc on google, I agree with Shuyi's
>>>> suggestion that we probably should start a new email thread like "[DISCUSS]
>>>> ... Hive integration design ..." on only dev mailing list for community
>>>> devs to review. The current thread sends to both dev and user list.
>>>> 
>>>> This email thread is more like validating the general idea and direction
>>>> with the community, and it's been pretty long and crowded so far. Since
>>>> everyone is pro for the idea, we can move forward with another thread to
>>>> discuss and finalize the design.
>>>> 
>>>> Thanks,
>>>> Bowen
>>>> 
>>>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu <xu...@alibaba-inc.com>
>>>> wrote:
>>>> 
>>>>> Hi Shuiyi,
>>>>> 
>>>>> Good idea. Actually the PDF was converted from a google doc. Here is its
>>>>> link:
>>>>> 
>>>>> https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
>>>>> Once we reach an agreement, I can convert it to a FLIP.
>>>>> 
>>>>> Thanks,
>>>>> Xuefu
>>>>> 
>>>>> 
>>>>> 
>>>>> ------------------------------------------------------------------
>>>>> Sender:Shuyi Chen <su...@gmail.com>
>>>>> Sent at:2018 Nov 1 (Thu) 02:47
>>>>> Recipient:Xuefu <xu...@alibaba-inc.com>
>>>>> Cc:vino yang <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
>>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>>> 
>>>>> Hi Xuefu,
>>>>> 
>>>>> Thanks a lot for driving this big effort. I would suggest convert your
>>>>> proposal and design doc into a google doc, and share it on the dev mailing
>>>>> list for the community to review and comment with title like "[DISCUSS] ...
>>>>> Hive integration design ..." . Once approved,  we can document it as a FLIP
>>>>> (Flink Improvement Proposals), and use JIRAs to track the implementations.
>>>>> What do you think?
>>>>> 
>>>>> Shuyi
>>>>> 
>>>>> On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu <xu...@alibaba-inc.com>
>>>>> wrote:
>>>>> Hi all,
>>>>> 
>>>>> I have also shared a design doc on Hive metastore integration that is
>>>>> attached here and also to FLINK-10556[1]. Please kindly review and share
>>>>> your feedback.
>>>>> 
>>>>> 
>>>>> Thanks,
>>>>> Xuefu
>>>>> 
>>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
>>>>> ------------------------------------------------------------------
>>>>> Sender:Xuefu <xu...@alibaba-inc.com>
>>>>> Sent at:2018 Oct 25 (Thu) 01:08
>>>>> Recipient:Xuefu <xu...@alibaba-inc.com>; Shuyi Chen <
>>>>> suez1224@gmail.com>
>>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
>>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>>> 
>>>>> Hi all,
>>>>> 
>>>>> To wrap up the discussion, I have attached a PDF describing the
>>>>> proposal, which is also attached to FLINK-10556 [1]. Please feel free to
>>>>> watch that JIRA to track the progress.
>>>>> 
>>>>> Please also let me know if you have additional comments or questions.
>>>>> 
>>>>> Thanks,
>>>>> Xuefu
>>>>> 
>>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
>>>>> 
>>>>> 
>>>>> ------------------------------------------------------------------
>>>>> Sender:Xuefu <xu...@alibaba-inc.com>
>>>>> Sent at:2018 Oct 16 (Tue) 03:40
>>>>> Recipient:Shuyi Chen <su...@gmail.com>
>>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
>>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>>> 
>>>>> Hi Shuyi,
>>>>> 
>>>>> Thank you for your input. Yes, I agreed with a phased approach and like
>>>>> to move forward fast. :) We did some work internally on DDL utilizing babel
>>>>> parser in Calcite. While babel makes Calcite's grammar extensible, at
>>>>> first impression it still seems too cumbersome for a project when too
>>>>> much extensions are made. It's even challenging to find where the extension
>>>>> is needed! It would be certainly better if Calcite can magically support
>>>>> Hive QL by just turning on a flag, such as that for MYSQL_5. I can also
>>>>> see that this could mean a lot of work on Calcite. Nevertheless, I will
>>>>> bring up the discussion over there and to see what their community thinks.
>>>>> 
>>>>> Would mind to share more info about the proposal on DDL that you
>>>>> mentioned? We can certainly collaborate on this.
>>>>> 
>>>>> Thanks,
>>>>> Xuefu
>>>>> 
>>>>> ------------------------------------------------------------------
>>>>> Sender:Shuyi Chen <su...@gmail.com>
>>>>> Sent at:2018 Oct 14 (Sun) 08:30
>>>>> Recipient:Xuefu <xu...@alibaba-inc.com>
>>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
>>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>>> 
>>>>> Welcome to the community and thanks for the great proposal, Xuefu! I
>>>>> think the proposal can be divided into 2 stages: making Flink to support
>>>>> Hive features, and make Hive to work with Flink. I agreed with Timo that on
>>>>> starting with a smaller scope, so we can make progress faster. As for [6],
>>>>> a proposal for DDL is already in progress, and will come after the unified
>>>>> SQL connector API is done. For supporting Hive syntax, we might need to
>>>>> work with the Calcite community, and a recent effort called babel (
>>>>> https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite might
>>>>> help here.
>>>>> 
>>>>> Thanks
>>>>> Shuyi
>>>>> 
>>>>> On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <xu...@alibaba-inc.com>
>>>>> wrote:
>>>>> Hi Fabian/Vno,
>>>>> 
>>>>> Thank you very much for your encouragement inquiry. Sorry that I didn't
>>>>> see Fabian's email until I read Vino's response just now. (Somehow Fabian's
>>>>> went to the spam folder.)
>>>>> 
>>>>> My proposal contains long-term and short-terms goals. Nevertheless, the
>>>>> effort will focus on the following areas, including Fabian's list:
>>>>> 
>>>>> 1. Hive metastore connectivity - This covers both read and write access,
>>>>> which means Flink can make full use of Hive's metastore as its catalog (at
>>>>> least for batch, but this can be extended to streaming as well).
>>>>> 2. Metadata compatibility - Objects (databases, tables, partitions, etc.)
>>>>> created by Hive can be understood by Flink, and the reverse is true as
>>>>> well.
>>>>> 3. Data compatibility - Similar to #2, data produced by Hive can be
>>>>> consumed by Flink and vice versa.
>>>>> 4. Support Hive UDFs - For all of Hive's native UDFs, Flink either provides
>>>>> its own implementation or makes Hive's implementation work in Flink.
>>>>> Further, for user-created UDFs in Hive, Flink SQL should provide a
>>>>> mechanism allowing users to import them into Flink without any code change
>>>>> required.
>>>>> 5. Data types - Flink SQL should support all data types that are
>>>>> available in Hive.
>>>>> 6. SQL Language - Flink SQL should support the SQL standard (such as
>>>>> SQL:2003) with extensions to support Hive's syntax and language features,
>>>>> around DDL, DML, and SELECT queries.
>>>>> 7. SQL CLI - This is currently under development in Flink, but more effort
>>>>> is needed.
>>>>> 8. Server - Provide a server that's compatible with Hive's HiveServer2
>>>>> in its thrift APIs, such that HiveServer2 users can reuse their existing
>>>>> clients (such as beeline) but connect to Flink's thrift server instead.
>>>>> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for
>>>>> other applications to connect to its thrift server.
>>>>> 10. Support other user customizations in Hive, such as Hive SerDes,
>>>>> storage handlers, etc.
>>>>> 11. Better task failure tolerance and task scheduling in the Flink runtime.
>>>>> 
>>>>> As you can see, achieving all of this requires significant effort across
>>>>> all layers of Flink. However, a short-term goal could include only the
>>>>> core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope (such as
>>>>> #3, #6).
>>>>> 
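As a purely illustrative sketch of item 5 in the list above (the names and exact pairings here are assumptions for discussion, not an actual Flink or Hive mapping table), a Hive-to-Flink SQL type mapping might start out like this:

```python
# Hedged sketch: one possible Hive -> Flink SQL type mapping.
# The pairings below are illustrative assumptions, not Flink's real table.
HIVE_TO_FLINK = {
    "TINYINT": "TINYINT",
    "SMALLINT": "SMALLINT",
    "INT": "INT",
    "BIGINT": "BIGINT",
    "FLOAT": "FLOAT",
    "DOUBLE": "DOUBLE",
    "BOOLEAN": "BOOLEAN",
    "STRING": "VARCHAR",   # Hive STRING carries no length bound
    "BINARY": "VARBINARY",
    "DATE": "DATE",
    "TIMESTAMP": "TIMESTAMP",
}

def to_flink_type(hive_type: str) -> str:
    """Map a Hive type name to a Flink SQL type name (sketch only)."""
    t = hive_type.strip().upper()
    if t.startswith("DECIMAL"):  # parameterized types pass through, e.g. DECIMAL(10,2)
        return t
    if t not in HIVE_TO_FLINK:
        raise ValueError(f"unmapped Hive type: {hive_type}")
    return HIVE_TO_FLINK[t]

print(to_flink_type("string"))         # VARCHAR
print(to_flink_type("DECIMAL(10,2)"))  # DECIMAL(10,2)
```

A real implementation would also have to cover complex types (ARRAY, MAP, STRUCT) and precision/scale handling, which is exactly where the compatibility work in items 2 and 5 lies.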
>>>>> Please share your further thoughts. If we generally agree that this is
>>>>> the right direction, I could come up with a formal proposal quickly and
>>>>> then we can follow up with broader discussions.
>>>>> 
>>>>> Thanks,
>>>>> Xuefu
>>>>> 
>>>>> 
>>>>> 
>>>>> ------------------------------------------------------------------
>>>>> Sender:vino yang <ya...@gmail.com>
>>>>> Sent at:2018 Oct 11 (Thu) 09:45
>>>>> Recipient:Fabian Hueske <fh...@gmail.com>
>>>>> Cc:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>; user <
>>>>> user@flink.apache.org>
>>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>>> 
>>>>> Hi Xuefu,
>>>>> 
>>>>> I appreciate this proposal, and like Fabian, I think it would be better
>>>>> if you could give more details of the plan.
>>>>> 
>>>>> Thanks, vino.
>>>>> 
>>>>> Fabian Hueske <fh...@gmail.com> 于2018年10月10日周三 下午5:27写道:
>>>>> Hi Xuefu,
>>>>> 
>>>>> Welcome to the Flink community and thanks for starting this discussion!
>>>>> Better Hive integration would be really great!
>>>>> Can you go into details of what you are proposing? I can think of a
>>>>> couple of ways to improve Flink in that regard:
>>>>> 
>>>>> * Support for Hive UDFs
>>>>> * Support for Hive metadata catalog
>>>>> * Support for HiveQL syntax
>>>>> * ???
>>>>> 
>>>>> Best, Fabian
>>>>> 
>>>>> Am Di., 9. Okt. 2018 um 19:22 Uhr schrieb Zhang, Xuefu <
>>>>> xuefu.z@alibaba-inc.com>:
>>>>> Hi all,
>>>>> 
>>>>> Along with the community's effort, inside Alibaba we have explored
>>>>> Flink's potential as an execution engine not just for stream processing but
>>>>> also for batch processing. We are encouraged by our findings and have
>>>>> initiated our effort to make Flink's SQL capabilities full-fledged. When
>>>>> comparing what's available in Flink to the offerings from competitive data
>>>>> processing engines, we identified a major gap in Flink: good integration
>>>>> with the Hive ecosystem. This is crucial to the success of Flink SQL and
>>>>> batch due to the well-established data ecosystem around Hive. Therefore,
>>>>> we have done some initial work in this direction, but there is still a lot
>>>>> of effort needed.
>>>>> 
>>>>> We have two strategies in mind. The first one is to make Flink SQL
>>>>> full-fledged and well-integrated with Hive ecosystem. This is a similar
>>>>> approach to what Spark SQL adopted. The second strategy is to make Hive
>>>>> itself work with Flink, similar to the proposal in [1]. Each approach bears
>>>>> its pros and cons, but they don’t need to be mutually exclusive, with each
>>>>> targeting different users and use cases. We believe that both will
>>>>> promote a much greater adoption of Flink beyond stream processing.
>>>>> 
>>>>> We have been focused on the first approach and would like to showcase
>>>>> Flink's batch and SQL capabilities with Flink SQL. However, we have also
>>>>> planned to start strategy #2 as the follow-up effort.
>>>>> 
>>>>> I'm completely new to Flink (a short bio [2] is below), though many
>>>>> of my colleagues here at Alibaba are long-time contributors. Nevertheless,
>>>>> I'd like to share our thoughts and invite your early feedback. At the same
>>>>> time, I am working on a detailed proposal on Flink SQL's integration with
>>>>> the Hive ecosystem, which will also be shared when ready.
>>>>> 
>>>>> While the ideas are simple, each approach will demand significant
>>>>> effort, more than what we can afford. Thus, the input and contributions
>>>>> from the communities are greatly welcome and appreciated.
>>>>> 
>>>>> Regards,
>>>>> 
>>>>> 
>>>>> Xuefu
>>>>> 
>>>>> References:
>>>>> 
>>>>> [1] https://issues.apache.org/jira/browse/HIVE-10712
>>>>> [2] Xuefu Zhang is a long-time open source veteran who has worked or is
>>>>> working on many projects under the Apache Software Foundation, of which he
>>>>> is also an honored member. About 10 years ago he worked on the Hadoop team
>>>>> at Yahoo, where those projects had just gotten started. Later he worked at
>>>>> Cloudera, initiating and leading the development of the Hive on Spark
>>>>> project in the community and across many organizations. Prior to joining
>>>>> Alibaba, he worked at Uber, where he moved all of Uber's SQL-on-Hadoop
>>>>> workload to Hive on Spark and significantly improved Uber's cluster
>>>>> efficiency.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> "So you have to trust that the dots will somehow connect in your future."
>>>>> 
>>>>> 


Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by "Zhang, Xuefu" <xu...@alibaba-inc.com>.
Hi Piotr,

That seems to be a good idea!

Since the google doc for the design is currently under extensive review, I will leave it as it is for now. However, I'll convert it to two different FLIPs when the time comes.

How does it sound to you?

Thanks,
Xuefu


------------------------------------------------------------------
Sender:Piotr Nowojski <pi...@data-artisans.com>
Sent at:2018 Nov 9 (Fri) 02:31
Recipient:dev <de...@flink.apache.org>
Cc:Bowen Li <bo...@gmail.com>; Xuefu <xu...@alibaba-inc.com>; Shuyi Chen <su...@gmail.com>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi,

Maybe we should split this topic (and the design doc) into a couple of smaller, hopefully independent ones. For example, the questions that you have asked, Fabian, have very little to do with reading metadata from the Hive Metastore?

Piotrek 

> On 7 Nov 2018, at 14:27, Fabian Hueske <fh...@gmail.com> wrote:
> 
> Hi Xuefu and all,
> 
> Thanks for sharing this design document!
> I'm very much in favor of restructuring / reworking the catalog handling in
> Flink SQL as outlined in the document.
> Most changes described in the design document seem to be rather general and
> not specifically related to the Hive integration.
> 
> IMO, there are some aspects, especially those at the boundary of Hive and
> Flink, that need a bit more discussion. For example
> 
> * What does it take to make Flink schema compatible with Hive schema?
> * How will Flink tables (descriptors) be stored in HMS?
> * How do the two Hive catalogs differ? Could they be integrated into a
> single one? When to use which one?
> * What meta information is provided by HMS? What of this can be leveraged
> by Flink?
> 
> Thank you,
> Fabian
> 
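One conceivable answer to the second question above (how Flink table descriptors could be stored in HMS) is to flatten the descriptor's properties into the string-to-string parameter map that an HMS table already carries, under a reserved key prefix. The prefix and helper names below are assumptions for illustration, not an agreed design:

```python
FLINK_PREFIX = "flink."  # assumed reserved prefix; illustrative only

def to_hms_parameters(flink_props: dict) -> dict:
    """Flatten Flink table descriptor properties into an HMS-style
    string->string parameter map."""
    return {FLINK_PREFIX + k: str(v) for k, v in flink_props.items()}

def from_hms_parameters(params: dict) -> dict:
    """Recover only the Flink-owned keys, ignoring Hive's own parameters."""
    return {k[len(FLINK_PREFIX):]: v
            for k, v in params.items()
            if k.startswith(FLINK_PREFIX)}

props = {"connector.type": "kafka", "format.type": "json"}
stored = to_hms_parameters(props)
stored["transient_lastDdlTime"] = "1541030400"  # a typical Hive-owned key
assert from_hms_parameters(stored) == props     # round-trips cleanly
```

The prefix makes Flink-only tables distinguishable from genuine Hive tables, which bears directly on the question of whether one or two catalogs are needed.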
> Am Fr., 2. Nov. 2018 um 00:31 Uhr schrieb Bowen Li <bo...@gmail.com>:
> 
>> After taking a look at how other discussion threads work, I think it's
>> actually fine to just keep our discussion here. It's up to you, Xuefu.
>> 
>> The google doc LGTM. I left some minor comments.
>> 
>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li <bo...@gmail.com> wrote:
>> 
>>> Hi all,
>>> 
>>> As Xuefu has published the design doc on google, I agree with Shuyi's
>>> suggestion that we probably should start a new email thread like "[DISCUSS]
>>> ... Hive integration design ..." on only dev mailing list for community
>>> devs to review. The current thread sends to both dev and user list.
>>> 
>>> This email thread is more like validating the general idea and direction
>>> with the community, and it's been pretty long and crowded so far. Since
>>> everyone is pro for the idea, we can move forward with another thread to
>>> discuss and finalize the design.
>>> 
>>> Thanks,
>>> Bowen
>>> 
>>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu <xu...@alibaba-inc.com>
>>> wrote:
>>> 
>>>> Hi Shuyi,
>>>> 
>>>> Good idea. Actually the PDF was converted from a google doc. Here is its
>>>> link:
>>>> 
>>>> https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
>>>> Once we reach an agreement, I can convert it to a FLIP.
>>>> 
>>>> Thanks,
>>>> Xuefu
>>>> 
>>>> 
>>>> 
>>>> ------------------------------------------------------------------
>>>> Sender:Shuyi Chen <su...@gmail.com>
>>>> Sent at:2018 Nov 1 (Thu) 02:47
>>>> Recipient:Xuefu <xu...@alibaba-inc.com>
>>>> Cc:vino yang <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>> 
>>>> Hi Xuefu,
>>>> 
>>>> Thanks a lot for driving this big effort. I would suggest converting your
>>>> proposal and design doc into a google doc, and sharing it on the dev mailing
>>>> list for the community to review and comment on, with a title like "[DISCUSS]
>>>> ... Hive integration design ...". Once approved, we can document it as a FLIP
>>>> (Flink Improvement Proposal), and use JIRAs to track the implementations.
>>>> What do you think?
>>>> 
>>>> Shuyi
>>>> 
>>>> On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu <xu...@alibaba-inc.com>
>>>> wrote:
>>>> Hi all,
>>>> 
>>>> I have also shared a design doc on Hive metastore integration that is
>>>> attached here and also to FLINK-10556[1]. Please kindly review and share
>>>> your feedback.
>>>> 
>>>> 
>>>> Thanks,
>>>> Xuefu
>>>> 
>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
>>>> ------------------------------------------------------------------
>>>> Sender:Xuefu <xu...@alibaba-inc.com>
>>>> Sent at:2018 Oct 25 (Thu) 01:08
>>>> Recipient:Xuefu <xu...@alibaba-inc.com>; Shuyi Chen <
>>>> suez1224@gmail.com>
>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>> 
>>>> Hi all,
>>>> 
>>>> To wrap up the discussion, I have attached a PDF describing the
>>>> proposal, which is also attached to FLINK-10556 [1]. Please feel free to
>>>> watch that JIRA to track the progress.
>>>> 
>>>> Please also let me know if you have additional comments or questions.
>>>> 
>>>> Thanks,
>>>> Xuefu
>>>> 
>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
>>>> 
>>>> 
>>>> ------------------------------------------------------------------
>>>> Sender:Xuefu <xu...@alibaba-inc.com>
>>>> Sent at:2018 Oct 16 (Tue) 03:40
>>>> Recipient:Shuyi Chen <su...@gmail.com>
>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>> 
>>>> Hi Shuyi,
>>>> 
>>>> Thank you for your input. Yes, I agree with a phased approach and would
>>>> like to move forward fast. :) We did some work internally on DDL utilizing
>>>> the Babel parser in Calcite. While Babel makes Calcite's grammar extensible,
>>>> at first impression it still seems too cumbersome for a project when too
>>>> many extensions are made. It's even challenging to find where the extension
>>>> is needed! It would certainly be better if Calcite could magically support
>>>> Hive QL by just turning on a flag, such as the one for MYSQL_5. I can also
>>>> see that this could mean a lot of work on the Calcite side. Nevertheless, I
>>>> will bring up the discussion over there and see what their community thinks.
>>>> 
>>>> Would you mind sharing more info about the proposal on DDL that you
>>>> mentioned? We can certainly collaborate on this.
>>>> 
>>>> Thanks,
>>>> Xuefu
>>>> 
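To make the conformance-flag idea above concrete in miniature (this is a toy sketch in Python, not Calcite's actual API): a single dialect flag can gate a rewrite such as turning Hive/MySQL-style backtick identifiers into standard double-quoted identifiers before parsing.

```python
import re

def normalize_identifiers(sql: str, conformance: str) -> str:
    """Toy illustration of a parser 'conformance' switch: under a
    hypothetical HIVE mode, backtick-quoted identifiers are rewritten
    to standard SQL double-quoted identifiers; other modes pass through."""
    if conformance == "HIVE":
        return re.sub(r"`([^`]*)`", r'"\1"', sql)
    return sql

print(normalize_identifiers("SELECT `col a` FROM `db`.`t`", "HIVE"))
# SELECT "col a" FROM "db"."t"
```

In Calcite itself this would of course be grammar-level work rather than string rewriting; the point is only that flipping one switch, like the existing MYSQL_5 conformance, is far more ergonomic than threading extensions through the grammar by hand.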
>>>> ------------------------------------------------------------------
>>>> Sender:Shuyi Chen <su...@gmail.com>
>>>> Sent at:2018 Oct 14 (Sun) 08:30
>>>> Recipient:Xuefu <xu...@alibaba-inc.com>
>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>> 
>>>> Welcome to the community and thanks for the great proposal, Xuefu! I
>>>> think the proposal can be divided into 2 stages: making Flink support
>>>> Hive features, and making Hive work with Flink. I agree with Timo on
>>>> starting with a smaller scope, so we can make progress faster. As for [6],
>>>> a proposal for DDL is already in progress, and will come after the unified
>>>> SQL connector API is done. For supporting Hive syntax, we might need to
>>>> work with the Calcite community, and a recent effort called babel (
>>>> https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite might
>>>> help here.
>>>> 
>>>> Thanks
>>>> Shuyi
>>>> 

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by Piotr Nowojski <pi...@data-artisans.com>.
Hi,

Maybe we should split this topic (and the design doc) into a couple of smaller, hopefully independent ones. For example, the questions that you have asked, Fabian, have very little to do with reading metadata from the Hive Metastore?

Piotrek 

> On 7 Nov 2018, at 14:27, Fabian Hueske <fh...@gmail.com> wrote:
> 
> Hi Xuefu and all,
> 
> Thanks for sharing this design document!
> I'm very much in favor of restructuring / reworking the catalog handling in
> Flink SQL as outlined in the document.
> Most changes described in the design document seem to be rather general and
> not specifically related to the Hive integration.
> 
> IMO, there are some aspects, especially those at the boundary of Hive and
> Flink, that need a bit more discussion. For example
> 
> * What does it take to make Flink schema compatible with Hive schema?
> * How will Flink tables (descriptors) be stored in HMS?
> * How do both Hive catalogs differ? Could they be integrated into to a
> single one? When to use which one?
> * What meta information is provided by HMS? What of this can be leveraged
> by Flink?
> 
> Thank you,
> Fabian
> 
> Am Fr., 2. Nov. 2018 um 00:31 Uhr schrieb Bowen Li <bo...@gmail.com>:
> 
>> After taking a look at how other discussion threads work, I think it's
>> actually fine just keep our discussion here. It's up to you, Xuefu.
>> 
>> The google doc LGTM. I left some minor comments.
>> 
>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li <bo...@gmail.com> wrote:
>> 
>>> Hi all,
>>> 
>>> As Xuefu has published the design doc on google, I agree with Shuyi's
>>> suggestion that we probably should start a new email thread like "[DISCUSS]
>>> ... Hive integration design ..." on only dev mailing list for community
>>> devs to review. The current thread sends to both dev and user list.
>>> 
>>> This email thread is more like validating the general idea and direction
>>> with the community, and it's been pretty long and crowded so far. Since
>>> everyone is pro for the idea, we can move forward with another thread to
>>> discuss and finalize the design.
>>> 
>>> Thanks,
>>> Bowen
>>> 
>>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu <xu...@alibaba-inc.com>
>>> wrote:
>>> 
>>>> Hi Shuiyi,
>>>> 
>>>> Good idea. Actually the PDF was converted from a google doc. Here is its
>>>> link:
>>>> 
>>>> https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
>>>> Once we reach an agreement, I can convert it to a FLIP.
>>>> 
>>>> Thanks,
>>>> Xuefu
>>>> 
>>>> 
>>>> 
>>>> ------------------------------------------------------------------
>>>> Sender:Shuyi Chen <su...@gmail.com>
>>>> Sent at:2018 Nov 1 (Thu) 02:47
>>>> Recipient:Xuefu <xu...@alibaba-inc.com>
>>>> Cc:vino yang <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>> 
>>>> Hi Xuefu,
>>>> 
>>>> Thanks a lot for driving this big effort. I would suggest convert your
>>>> proposal and design doc into a google doc, and share it on the dev mailing
>>>> list for the community to review and comment with title like "[DISCUSS] ...
>>>> Hive integration design ..." . Once approved,  we can document it as a FLIP
>>>> (Flink Improvement Proposals), and use JIRAs to track the implementations.
>>>> What do you think?
>>>> 
>>>> Shuyi
>>>> 
>>>> On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu <xu...@alibaba-inc.com>
>>>> wrote:
>>>> Hi all,
>>>> 
>>>> I have also shared a design doc on Hive metastore integration that is
>>>> attached here and also to FLINK-10556[1]. Please kindly review and share
>>>> your feedback.
>>>> 
>>>> 
>>>> Thanks,
>>>> Xuefu
>>>> 
>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
>>>> ------------------------------------------------------------------
>>>> Sender:Xuefu <xu...@alibaba-inc.com>
>>>> Sent at:2018 Oct 25 (Thu) 01:08
>>>> Recipient:Xuefu <xu...@alibaba-inc.com>; Shuyi Chen <
>>>> suez1224@gmail.com>
>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>> 
>>>> Hi all,
>>>> 
>>>> To wrap up the discussion, I have attached a PDF describing the
>>>> proposal, which is also attached to FLINK-10556 [1]. Please feel free to
>>>> watch that JIRA to track the progress.
>>>> 
>>>> Please also let me know if you have additional comments or questions.
>>>> 
>>>> Thanks,
>>>> Xuefu
>>>> 
>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
>>>> 
>>>> 
>>>> ------------------------------------------------------------------
>>>> Sender:Xuefu <xu...@alibaba-inc.com>
>>>> Sent at:2018 Oct 16 (Tue) 03:40
>>>> Recipient:Shuyi Chen <su...@gmail.com>
>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>> 
>>>> Hi Shuyi,
>>>> 
>>>> Thank you for your input. Yes, I agreed with a phased approach and like
>>>> to move forward fast. :) We did some work internally on DDL utilizing babel
>>>> parser in Calcite. While babel makes Calcite's grammar extensible, at
>>>> first impression it still seems too cumbersome for a project when too
>>>> much extensions are made. It's even challenging to find where the extension
>>>> is needed! It would be certainly better if Calcite can magically support
>>>> Hive QL by just turning on a flag, such as that for MYSQL_5. I can also
>>>> see that this could mean a lot of work on Calcite. Nevertheless, I will
>>>> bring up the discussion over there and to see what their community thinks.
>>>> 
>>>> Would mind to share more info about the proposal on DDL that you
>>>> mentioned? We can certainly collaborate on this.
>>>> 
>>>> Thanks,
>>>> Xuefu
>>>> 
>>>> ------------------------------------------------------------------
>>>> Sender:Shuyi Chen <su...@gmail.com>
>>>> Sent at:2018 Oct 14 (Sun) 08:30
>>>> Recipient:Xuefu <xu...@alibaba-inc.com>
>>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
>>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>> 
>>>> Welcome to the community and thanks for the great proposal, Xuefu! I
>>>> think the proposal can be divided into 2 stages: making Flink to support
>>>> Hive features, and make Hive to work with Flink. I agreed with Timo that on
>>>> starting with a smaller scope, so we can make progress faster. As for [6],
>>>> a proposal for DDL is already in progress, and will come after the unified
>>>> SQL connector API is done. For supporting Hive syntax, we might need to
>>>> work with the Calcite community, and a recent effort called babel (
>>>> https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite might
>>>> help here.
>>>> 
>>>> Thanks
>>>> Shuyi
>>>> 
>>>> On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <xu...@alibaba-inc.com>
>>>> wrote:
>>>> Hi Fabian/Vno,
>>>> 
>>>> Thank you very much for your encouragement inquiry. Sorry that I didn't
>>>> see Fabian's email until I read Vino's response just now. (Somehow Fabian's
>>>> went to the spam folder.)
>>>> 
>>>> My proposal contains long-term and short-terms goals. Nevertheless, the
>>>> effort will focus on the following areas, including Fabian's list:
>>>> 
>>>> 1. Hive metastore connectivity - This covers both read/write access,
>>>> which means Flink can make full use of Hive's metastore as its catalog (at
>>>> least for the batch but can extend for streaming as well).
>>>> 2. Metadata compatibility - Objects (databases, tables, partitions, etc)
>>>> created by Hive can be understood by Flink and the reverse direction is
>>>> true also.
>>>> 3. Data compatibility - Similar to #2, data produced by Hive can be
>>>> consumed by Flink and vise versa.
>>>> 4. Support Hive UDFs - For all Hive's native udfs, Flink either provides
>>>> its own implementation or make Hive's implementation work in Flink.
>>>> Further, for user created UDFs in Hive, Flink SQL should provide a
>>>> mechanism allowing user to import them into Flink without any code change
>>>> required.
>>>> 5. Data types -  Flink SQL should support all data types that are
>>>> available in Hive.
>>>> 6. SQL Language - Flink SQL should support SQL standard (such as
>>>> SQL2003) with extension to support Hive's syntax and language features,
>>>> around DDL, DML, and SELECT queries.
>>>> 7.  SQL CLI - this is currently developing in Flink but more effort is
>>>> needed.
>>>> 8. Server - provide a server that's compatible with Hive's HiverServer2
>>>> in thrift APIs, such that HiveServer2 users can reuse their existing client
>>>> (such as beeline) but connect to Flink's thrift server instead.
>>>> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for
>>>> other application to use to connect to its thrift server
>>>> 10. Support other user's customizations in Hive, such as Hive Serdes,
>>>> storage handlers, etc.
>>>> 11. Better task failure tolerance and task scheduling at Flink runtime.
>>>> 
>>>> As you can see, achieving all those requires significant effort and
>>>> across all layers in Flink. However, a short-term goal could  include only
>>>> core areas (such as 1, 2, 4, 5, 6, 7) or start  at a smaller scope (such as
>>>> #3, #6).
>>>> 
>>>> Please share your further thoughts. If we generally agree that this is
>>>> the right direction, I could come up with a formal proposal quickly and
>>>> then we can follow up with broader discussions.
>>>> 
>>>> Thanks,
>>>> Xuefu
>>>> 
>>>> 
>>>> 
>>>> ------------------------------------------------------------------
>>>> Sender:vino yang <ya...@gmail.com>
>>>> Sent at:2018 Oct 11 (Thu) 09:45
>>>> Recipient:Fabian Hueske <fh...@gmail.com>
>>>> Cc:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>; user <
>>>> user@flink.apache.org>
>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>> 
>>>> Hi Xuefu,
>>>> 
>>>> Appreciate this proposal, and like Fabian, it would look better if you
>>>> can give more details of the plan.
>>>> 
>>>> Thanks, vino.
>>>> 
>>>> Fabian Hueske <fh...@gmail.com> 于2018年10月10日周三 下午5:27写道:
>>>> Hi Xuefu,
>>>> 
>>>> Welcome to the Flink community and thanks for starting this discussion!
>>>> Better Hive integration would be really great!
>>>> Can you go into details of what you are proposing? I can think of a
>>>> couple ways to improve Flink in that regard:
>>>> 
>>>> * Support for Hive UDFs
>>>> * Support for Hive metadata catalog
>>>> * Support for HiveQL syntax
>>>> * ???
>>>> 
>>>> Best, Fabian
>>>> 


Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by Fabian Hueske <fh...@gmail.com>.
Hi Xuefu and all,

Thanks for sharing this design document!
I'm very much in favor of restructuring / reworking the catalog handling in
Flink SQL as outlined in the document.
Most changes described in the design document seem to be rather general and
not specifically related to the Hive integration.

IMO, there are some aspects, especially those at the boundary of Hive and
Flink, that need a bit more discussion. For example:

* What does it take to make Flink schema compatible with Hive schema?
* How will Flink tables (descriptors) be stored in HMS?
* How do both Hive catalogs differ? Could they be integrated into a
single one? When to use which one?
* What meta information is provided by HMS? Which parts of it can be
leveraged by Flink?
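To make the second question concrete: HMS only stores flat string-to-string table parameters, so storing a Flink table descriptor there implies some flattening convention. A minimal sketch of one such convention (the `flink.` key prefix and the descriptor shape are illustrative assumptions, not an existing Flink or HMS API):

```python
# Sketch: one possible convention for storing a Flink table descriptor in the
# Hive Metastore, which only allows flat string-to-string table parameters.
# The "flink." prefix and key layout are assumptions for illustration.

PREFIX = "flink."

def descriptor_to_hms_params(descriptor: dict) -> dict:
    """Flatten a nested descriptor into HMS-style string parameters."""
    params = {}

    def walk(node, path):
        if isinstance(node, dict):
            for key, value in node.items():
                walk(value, path + [key])
        else:
            # HMS parameter values must be strings
            params[PREFIX + ".".join(path)] = str(node)

    walk(descriptor, [])
    return params

def hms_params_to_descriptor(params: dict) -> dict:
    """Rebuild the nested descriptor from flat parameters (values stay strings)."""
    descriptor = {}
    for key, value in params.items():
        if not key.startswith(PREFIX):
            continue  # ignore parameters owned by Hive itself
        node = descriptor
        *parents, leaf = key[len(PREFIX):].split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return descriptor

desc = {"connector": {"type": "kafka", "topic": "clicks"}, "format": {"type": "json"}}
params = descriptor_to_hms_params(desc)
```

Prefixing the keys would keep Flink-owned properties separate from parameters that Hive itself writes, so a Hive client listing the table is not confused by them.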

Thank you,
Fabian

Am Fr., 2. Nov. 2018 um 00:31 Uhr schrieb Bowen Li <bo...@gmail.com>:

> After taking a look at how other discussion threads work, I think it's
> actually fine to just keep our discussion here. It's up to you, Xuefu.
>
> The Google doc LGTM. I left some minor comments.
>
> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li <bo...@gmail.com> wrote:
>
>> Hi all,
>>
>> As Xuefu has published the design doc on Google, I agree with Shuyi's
>> suggestion that we should probably start a new email thread like "[DISCUSS]
>> ... Hive integration design ..." on only the dev mailing list for community
>> devs to review. The current thread goes to both the dev and user lists.
>>
>> This email thread is more about validating the general idea and direction
>> with the community, and it's been pretty long and crowded so far. Since
>> everyone is in favor of the idea, we can move forward with another thread
>> to discuss and finalize the design.
>>
>> Thanks,
>> Bowen
>>
>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu <xu...@alibaba-inc.com>
>> wrote:
>>
>>> Hi Shuyi,
>>>
>>> Good idea. Actually the PDF was converted from a google doc. Here is its
>>> link:
>>>
>>> https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
>>> Once we reach an agreement, I can convert it to a FLIP.
>>>
>>> Thanks,
>>> Xuefu
>>>
>>>
>>>
>>> ------------------------------------------------------------------
>>> Sender:Shuyi Chen <su...@gmail.com>
>>> Sent at:2018 Nov 1 (Thu) 02:47
>>> Recipient:Xuefu <xu...@alibaba-inc.com>
>>> Cc:vino yang <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>
>>> Hi Xuefu,
>>>
>>> Thanks a lot for driving this big effort. I would suggest converting your
>>> proposal and design doc into a Google doc and sharing it on the dev mailing
>>> list for the community to review and comment, with a title like "[DISCUSS]
>>> ... Hive integration design ...". Once approved, we can document it as a
>>> FLIP (Flink Improvement Proposal) and use JIRAs to track the
>>> implementations. What do you think?
>>>
>>> Shuyi
>>>
>>> On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu <xu...@alibaba-inc.com>
>>> wrote:
>>> Hi all,
>>>
>>> I have also shared a design doc on Hive metastore integration that is
>>> attached here and also to FLINK-10556[1]. Please kindly review and share
>>> your feedback.
>>>
>>>
>>> Thanks,
>>> Xuefu
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
>>> ------------------------------------------------------------------
>>> Sender:Xuefu <xu...@alibaba-inc.com>
>>> Sent at:2018 Oct 25 (Thu) 01:08
>>> Recipient:Xuefu <xu...@alibaba-inc.com>; Shuyi Chen <
>>> suez1224@gmail.com>
>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>
>>> Hi all,
>>>
>>> To wrap up the discussion, I have attached a PDF describing the
>>> proposal, which is also attached to FLINK-10556 [1]. Please feel free to
>>> watch that JIRA to track the progress.
>>>
>>> Please also let me know if you have additional comments or questions.
>>>
>>> Thanks,
>>> Xuefu
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
>>>
>>>
>>> ------------------------------------------------------------------
>>> Sender:Xuefu <xu...@alibaba-inc.com>
>>> Sent at:2018 Oct 16 (Tue) 03:40
>>> Recipient:Shuyi Chen <su...@gmail.com>
>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>
>>> Hi Shuyi,
>>>
>>> Thank you for your input. Yes, I agree with a phased approach and would
>>> like to move forward fast. :) We did some work internally on DDL utilizing
>>> the babel parser in Calcite. While babel makes Calcite's grammar
>>> extensible, at first impression it still seems too cumbersome for a
>>> project when too many extensions are made. It's even challenging to find
>>> where the extension is needed! It would certainly be better if Calcite
>>> could magically support HiveQL by just turning on a flag, such as the one
>>> for MYSQL_5. I can also see that this could mean a lot of work on Calcite.
>>> Nevertheless, I will bring up the discussion over there and see what their
>>> community thinks.
>>>
>>> Would you mind sharing more info about the DDL proposal that you
>>> mentioned? We can certainly collaborate on this.
>>>
>>> Thanks,
>>> Xuefu
>>>
>>> ------------------------------------------------------------------
>>> Sender:Shuyi Chen <su...@gmail.com>
>>> Sent at:2018 Oct 14 (Sun) 08:30
>>> Recipient:Xuefu <xu...@alibaba-inc.com>
>>> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
>>> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>
>>> Welcome to the community and thanks for the great proposal, Xuefu! I
>>> think the proposal can be divided into 2 stages: making Flink support
>>> Hive features, and making Hive work with Flink. I agree with Timo on
>>> starting with a smaller scope, so we can make progress faster. As for [6],
>>> a proposal for DDL is already in progress, and will come after the unified
>>> SQL connector API is done. For supporting Hive syntax, we might need to
>>> work with the Calcite community, and a recent effort called babel (
>>> https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite might
>>> help here.
>>>
>>> Thanks
>>> Shuyi
>>>
>>> On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <xu...@alibaba-inc.com>
>>> wrote:
>>> Hi Fabian/Vino,
>>>
>>> Thank you very much for your encouragement and inquiry. Sorry that I
>>> didn't see Fabian's email until I read Vino's response just now. (Somehow
>>> Fabian's went to the spam folder.)
>>>
>>> My proposal contains long-term and short-term goals. Nevertheless, the
>>> effort will focus on the following areas, including Fabian's list:
>>>
>>> 1. Hive metastore connectivity - This covers both read/write access,
>>> which means Flink can make full use of Hive's metastore as its catalog (at
>>> least for batch, but this can be extended for streaming as well).
>>> 2. Metadata compatibility - Objects (databases, tables, partitions, etc.)
>>> created by Hive can be understood by Flink, and the reverse direction is
>>> true also.
>>> 3. Data compatibility - Similar to #2, data produced by Hive can be
>>> consumed by Flink and vice versa.
>>> 4. Support Hive UDFs - For all of Hive's native UDFs, Flink either
>>> provides its own implementation or makes Hive's implementation work in
>>> Flink. Further, for user-created UDFs in Hive, Flink SQL should provide a
>>> mechanism allowing users to import them into Flink without any code change
>>> required.
>>> 5. Data types - Flink SQL should support all data types that are
>>> available in Hive.
>>> 6. SQL language - Flink SQL should support the SQL standard (such as
>>> SQL:2003) with extensions to support Hive's syntax and language features,
>>> around DDL, DML, and SELECT queries.
>>> 7. SQL CLI - This is currently in development in Flink, but more effort
>>> is needed.
>>> 8. Server - Provide a server that's compatible with Hive's HiveServer2
>>> thrift APIs, such that HiveServer2 users can reuse their existing clients
>>> (such as beeline) but connect to Flink's thrift server instead.
>>> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for
>>> other applications to connect to its thrift server.
>>> 10. Support other user customizations in Hive, such as Hive SerDes,
>>> storage handlers, etc.
>>> 11. Better task failure tolerance and task scheduling in the Flink
>>> runtime.
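Items 2 and 5 above hinge on a type mapping between the two systems. A rough sketch of what such a translation table could look like (the Hive type names are real; the mapping function and the Flink-side spellings are simplified placeholders, not the actual Flink type API):

```python
# Sketch of a Hive-to-Flink type mapping, as items 2 and 5 imply.
# The Flink-side names are simplified placeholders for illustration.

HIVE_TO_FLINK = {
    "tinyint": "TINYINT", "smallint": "SMALLINT", "int": "INT",
    "bigint": "BIGINT", "float": "FLOAT", "double": "DOUBLE",
    "boolean": "BOOLEAN", "string": "STRING", "binary": "BYTES",
    "date": "DATE", "timestamp": "TIMESTAMP",
}

def map_hive_type(hive_type: str) -> str:
    """Translate a Hive column type into a Flink SQL type string, passing
    precision/length through for parameterized types like decimal(10,2)."""
    base, sep, args = hive_type.lower().partition("(")
    if base in ("decimal", "varchar", "char"):
        return base.upper() + sep + args  # keep precision/length as-is
    if base in HIVE_TO_FLINK:
        return HIVE_TO_FLINK[base]
    raise ValueError(f"no mapping yet for Hive type: {hive_type}")

print(map_hive_type("decimal(10,2)"))  # DECIMAL(10,2)
```

The interesting work is in the unmapped cases: Hive's complex types (array, map, struct, union) and any precision or timezone semantics that differ between the two systems.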
>>>
>>> As you can see, achieving all those requires significant effort across
>>> all layers of Flink. However, a short-term goal could include only the
>>> core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope (such as
>>> #3, #6).
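Item 4 in the list above is essentially an adapter problem: keep the user's Hive UDF untouched and bridge the two calling conventions. A minimal sketch under that assumption (`HiveUpper` is a stand-in UDF; the wrapper names are hypothetical, not an actual Flink API):

```python
# Sketch of the adapter idea behind item 4: wrap an existing Hive UDF object
# so its evaluate() method can be called through a Flink-style eval() method,
# with no change to the user's UDF code. Names here are illustrative.

class HiveUpper:
    """Stand-in for a Hive UDF with the conventional evaluate() method."""
    def evaluate(self, s):
        return None if s is None else s.upper()

class HiveUdfWrapper:
    """Adapts a Hive-style UDF to a Flink-style scalar function: same logic,
    different calling convention."""
    def __init__(self, hive_udf):
        self.hive_udf = hive_udf

    def eval(self, *args):
        # A real adapter would also convert between Hive's ObjectInspector
        # type handling and Flink's type system at this point.
        return self.hive_udf.evaluate(*args)

func = HiveUdfWrapper(HiveUpper())
print(func.eval("flink"))  # FLINK
```

Most of the actual work in such an adapter lies in the type conversion, not the method dispatch.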
>>>
>>> Please share your further thoughts. If we generally agree that this is
>>> the right direction, I could come up with a formal proposal quickly and
>>> then we can follow up with broader discussions.
>>>
>>> Thanks,
>>> Xuefu
>>>

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by Bowen Li <bo...@gmail.com>.
After taking a look at how other discussion threads work, I think it's
actually fine to just keep our discussion here. It's up to you, Xuefu.

The Google doc LGTM. I left some minor comments.


Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by Bowen Li <bo...@gmail.com>.
Hi all,

As Xuefu has published the design doc on Google, I agree with Shuyi's
suggestion that we should probably start a new email thread like "[DISCUSS]
... Hive integration design ..." on only the dev mailing list for community
devs to review. The current thread goes to both the dev and user lists.

This email thread is more about validating the general idea and direction
with the community, and it's been pretty long and crowded so far. Since
everyone is in favor of the idea, we can move forward with another thread to
discuss and finalize the design.

Thanks,
Bowen

On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu <xu...@alibaba-inc.com>
wrote:

> Hi Shuiyi,
>
> Good idea. Actually the PDF was converted from a google doc. Here is its
> link:
>
> https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
> Once we reach an agreement, I can convert it to a FLIP.
>
> Thanks,
> Xuefu
>
>
>
> ------------------------------------------------------------------
> Sender:Shuyi Chen <su...@gmail.com>
> Sent at:2018 Nov 1 (Thu) 02:47
> Recipient:Xuefu <xu...@alibaba-inc.com>
> Cc:vino yang <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi Xuefu,
>
> Thanks a lot for driving this big effort. I would suggest convert your
> proposal and design doc into a google doc, and share it on the dev mailing
> list for the community to review and comment with title like "[DISCUSS] ...
> Hive integration design ..." . Once approved,  we can document it as a FLIP
> (Flink Improvement Proposals), and use JIRAs to track the implementations.
> What do you think?
>
> Shuyi
>
> On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu <xu...@alibaba-inc.com>
> wrote:
> Hi all,
>
> I have also shared a design doc on Hive metastore integration that is
> attached here and also to FLINK-10556[1]. Please kindly review and share
> your feedback.
>
>
> Thanks,
> Xuefu
>
> [1] https://issues.apache.org/jira/browse/FLINK-10556
> ------------------------------------------------------------------
> Sender:Xuefu <xu...@alibaba-inc.com>
> Sent at:2018 Oct 25 (Thu) 01:08
> Recipient:Xuefu <xu...@alibaba-inc.com>; Shuyi Chen <su...@gmail.com>
> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi all,
>
> To wrap up the discussion, I have attached a PDF describing the proposal,
> which is also attached to FLINK-10556 [1]. Please feel free to watch that
> JIRA to track the progress.
>
> Please also let me know if you have additional comments or questions.
>
> Thanks,
> Xuefu
>
> [1] https://issues.apache.org/jira/browse/FLINK-10556
>
>
> ------------------------------------------------------------------
> Sender:Xuefu <xu...@alibaba-inc.com>
> Sent at:2018 Oct 16 (Tue) 03:40
> Recipient:Shuyi Chen <su...@gmail.com>
> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi Shuyi,
>
> Thank you for your input. Yes, I agreed with a phased approach and like to
> move forward fast. :) We did some work internally on DDL utilizing babel
> parser in Calcite. While babel makes Calcite's grammar extensible, at
> first impression it still seems too cumbersome for a project when too
> much extensions are made. It's even challenging to find where the extension
> is needed! It would be certainly better if Calcite can magically support
> Hive QL by just turning on a flag, such as that for MYSQL_5. I can also
> see that this could mean a lot of work on Calcite. Nevertheless, I will
> bring up the discussion over there and to see what their community thinks.
>
> Would mind to share more info about the proposal on DDL that you
> mentioned? We can certainly collaborate on this.
>
> Thanks,
> Xuefu
>
> ------------------------------------------------------------------
> Sender:Shuyi Chen <su...@gmail.com>
> Sent at:2018 Oct 14 (Sun) 08:30
> Recipient:Xuefu <xu...@alibaba-inc.com>
> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Welcome to the community and thanks for the great proposal, Xuefu! I think
> the proposal can be divided into two stages: making Flink support Hive
> features, and making Hive work with Flink. I agree with Timo on starting
> with a smaller scope, so we can make progress faster. As for [6], a
> proposal for DDL is already in progress, and will come after the unified
> SQL connector API is done. For supporting Hive syntax, we might need to
> work with the Calcite community, and a recent effort called Babel (
> https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite might help
> here.
>
> Thanks
> Shuyi
>
> On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <xu...@alibaba-inc.com>
> wrote:
> Hi Fabian/Vino,
>
> Thank you very much for your encouragement and inquiry. Sorry that I
> didn't see Fabian's email until I read Vino's response just now. (Somehow
> Fabian's email went to the spam folder.)
>
> My proposal contains long-term and short-term goals. Nevertheless, the
> effort will focus on the following areas, including Fabian's list:
>
> 1. Hive metastore connectivity - This covers both read/write access, which
> means Flink can make full use of Hive's metastore as its catalog (at least
> for batch, but this can be extended to streaming as well).
> 2. Metadata compatibility - Objects (databases, tables, partitions, etc.)
> created by Hive can be understood by Flink, and the reverse is also true.
> 3. Data compatibility - Similar to #2, data produced by Hive can be
> consumed by Flink and vice versa.
> 4. Support Hive UDFs - For all of Hive's native UDFs, Flink either
> provides its own implementation or makes Hive's implementation work in
> Flink. Further, for user-created UDFs in Hive, Flink SQL should provide a
> mechanism allowing users to import them into Flink without any code change
> required.
> 5. Data types - Flink SQL should support all data types that are available
> in Hive.
> 6. SQL language - Flink SQL should support the SQL standard (such as
> SQL:2003) with extensions to support Hive's syntax and language features,
> around DDL, DML, and SELECT queries.
> 7. SQL CLI - this is currently under development in Flink, but more effort
> is needed.
> 8. Server - provide a server that's compatible with Hive's HiveServer2
> thrift APIs, such that HiveServer2 users can reuse their existing clients
> (such as Beeline) but connect to Flink's thrift server instead.
> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for
> other applications to connect to its thrift server.
> 10. Support other user customizations in Hive, such as Hive SerDes,
> storage handlers, etc.
> 11. Better task failure tolerance and task scheduling in the Flink
> runtime.
>
> As you can see, achieving all of this requires significant effort across
> all layers in Flink. However, a short-term goal could include only core
> areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope (such as #3,
> #6).
>
> Please share your further thoughts. If we generally agree that this is the
> right direction, I could come up with a formal proposal quickly and then we
> can follow up with broader discussions.
>
> Thanks,
> Xuefu
>
>
>
> ------------------------------------------------------------------
> Sender:vino yang <ya...@gmail.com>
> Sent at:2018 Oct 11 (Thu) 09:45
> Recipient:Fabian Hueske <fh...@gmail.com>
> Cc:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>; user <
> user@flink.apache.org>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi Xuefu,
>
> I appreciate this proposal, and, like Fabian, I think it would be better
> if you could give more details of the plan.
>
> Thanks, vino.
>
> Fabian Hueske <fh...@gmail.com> 于2018年10月10日周三 下午5:27写道:
> Hi Xuefu,
>
> Welcome to the Flink community and thanks for starting this discussion!
> Better Hive integration would be really great!
> Can you go into details of what you are proposing? I can think of a couple
> ways to improve Flink in that regard:
>
> * Support for Hive UDFs
> * Support for Hive metadata catalog
> * Support for HiveQL syntax
> * ???
>
> Best, Fabian
>
> Am Di., 9. Okt. 2018 um 19:22 Uhr schrieb Zhang, Xuefu <
> xuefu.z@alibaba-inc.com>:
> Hi all,
>
> Along with the community's effort, inside Alibaba we have explored Flink's
> potential as an execution engine not just for stream processing but also
> for batch processing. We are encouraged by our findings and have initiated
> our effort to make Flink's SQL capabilities full-fledged. When comparing
> what's available in Flink to the offerings from competing data processing
> engines, we identified a major gap in Flink: good integration with the
> Hive ecosystem. This is crucial to the success of Flink SQL and batch
> processing, due to the well-established data ecosystem around Hive.
> Therefore, we have done some initial work in this direction, but there is
> still a lot of effort needed.
>
> We have two strategies in mind. The first is to make Flink SQL
> full-fledged and well-integrated with the Hive ecosystem, similar to the
> approach Spark SQL adopted. The second is to make Hive itself work with
> Flink, similar to the proposal in [1]. Each approach has its pros and
> cons, but they need not be mutually exclusive, with each targeting
> different users and use cases. We believe that both will promote a much
> greater adoption of Flink beyond stream processing.
>
> We have been focused on the first approach and would like to showcase
> Flink's batch and SQL capabilities with Flink SQL. However, we have also
> planned to start strategy #2 as the follow-up effort.
>
> I'm completely new to Flink (with a short bio [2] below), though many of
> my colleagues here at Alibaba are long-time contributors. Nevertheless,
> I'd like to share our thoughts and invite your early feedback. At the same
> time, I am working on a detailed proposal on Flink SQL's integration with
> the Hive ecosystem, which will also be shared when ready.
>
> While the ideas are simple, each approach will demand significant effort,
> more than what we can afford. Thus, input and contributions from the
> community are greatly welcome and appreciated.
>
> Regards,
>
>
> Xuefu
>
> References:
>
> [1] https://issues.apache.org/jira/browse/HIVE-10712
> [2] Xuefu Zhang is a long-time open source veteran who has worked on many
> projects under the Apache Software Foundation, of which he is also an
> honored member. About 10 years ago he worked on the Hadoop team at Yahoo,
> where the projects had just gotten started. Later he worked at Cloudera,
> initiating and leading the development of the Hive on Spark project in the
> community and across many organizations. Prior to joining Alibaba, he
> worked at Uber, where he moved all of Uber's SQL-on-Hadoop workloads to
> Hive on Spark and significantly improved Uber's cluster efficiency.
>
>
>
>
> --
> "So you have to trust that the dots will somehow connect in your future."
>
>
> --
> "So you have to trust that the dots will somehow connect in your future."
>
>

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by "Zhang, Xuefu" <xu...@alibaba-inc.com>.
Hi Shuyi,

Good idea. Actually, the PDF was converted from a Google doc. Here is its link:
https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
Once we reach an agreement, I can convert it to a FLIP.

Thanks,
Xuefu




------------------------------------------------------------------
Sender:Shuyi Chen <su...@gmail.com>
Sent at:2018 Nov 1 (Thu) 02:47
Recipient:Xuefu <xu...@alibaba-inc.com>
Cc:vino yang <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>; dev <de...@flink.apache.org>; user <us...@flink.apache.org>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Xuefu, 

Thanks a lot for driving this big effort. I would suggest converting your proposal and design doc into a Google doc and sharing it on the dev mailing list for the community to review and comment, with a title like "[DISCUSS] ... Hive integration design ...". Once approved, we can document it as a FLIP (Flink Improvement Proposal) and use JIRAs to track the implementations. What do you think?

Shuyi
On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu <xu...@alibaba-inc.com> wrote:

Hi all,

I have also shared a design doc on Hive metastore integration that is attached here and also to FLINK-10556[1]. Please kindly review and share your feedback.


Thanks,
Xuefu

[1] https://issues.apache.org/jira/browse/FLINK-10556
------------------------------------------------------------------
Sender:Xuefu <xu...@alibaba-inc.com>
Sent at:2018 Oct 25 (Thu) 01:08
Recipient:Xuefu <xu...@alibaba-inc.com>; Shuyi Chen <su...@gmail.com>
Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>; dev <de...@flink.apache.org>; user <us...@flink.apache.org>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi all,

To wrap up the discussion, I have attached a PDF describing the proposal, which is also attached to FLINK-10556 [1]. Please feel free to watch that JIRA to track the progress.

Please also let me know if you have additional comments or questions.

Thanks,
Xuefu

[1] https://issues.apache.org/jira/browse/FLINK-10556


------------------------------------------------------------------
Sender:Xuefu <xu...@alibaba-inc.com>
Sent at:2018 Oct 16 (Tue) 03:40
Recipient:Shuyi Chen <su...@gmail.com>
Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>; dev <de...@flink.apache.org>; user <us...@flink.apache.org>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Shuyi,

Thank you for your input. Yes, I agree with a phased approach and would like to move forward fast. :) We did some work internally on DDL utilizing the Babel parser in Calcite. While Babel makes Calcite's grammar extensible, at first impression it still seems too cumbersome for a project when too many extensions are made. It's even challenging to find where an extension is needed! It would certainly be better if Calcite could magically support HiveQL just by turning on a flag, such as the one for MYSQL_5. I can also see that this could mean a lot of work on Calcite. Nevertheless, I will bring up the discussion over there and see what their community thinks.
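To make the "turn on a flag" idea concrete, here is a toy Python sketch of dialect-gated parsing. It is purely illustrative (hypothetical function and clause names, not Calcite's Babel API): a conformance flag decides whether Hive-specific DDL clauses are tolerated, loosely analogous to Calcite's SqlConformanceEnum values such as MYSQL_5.

```python
# Toy illustration of a dialect/conformance flag for SQL parsing.
# Hypothetical sketch only -- NOT Calcite's actual API.
import re

# Clauses that exist in HiveQL DDL but not in standard SQL.
HIVE_ONLY_CLAUSES = ("STORED AS", "PARTITIONED BY", "ROW FORMAT")

def parse_statement(sql: str, dialect: str = "standard") -> dict:
    """Pretend-parse a DDL statement, rejecting Hive-specific syntax
    unless the 'hive' dialect flag is turned on."""
    upper = sql.upper()
    used = [c for c in HIVE_ONLY_CLAUSES if c in upper]
    if used and dialect != "hive":
        raise ValueError(f"unsupported clause(s) for dialect {dialect!r}: {used}")
    table = re.search(r"CREATE\s+TABLE\s+(\w+)", upper)
    return {
        "kind": "CREATE_TABLE" if table else "UNKNOWN",
        "table": table.group(1).lower() if table else None,
        "dialect": dialect,
    }
```

With dialect="hive", a statement like CREATE TABLE t (x INT) STORED AS ORC is accepted; under the default standard dialect it is rejected, which is the single-flag behavior the paragraph above wishes for.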

Would you mind sharing more info about the proposal on DDL that you mentioned? We can certainly collaborate on this.

Thanks,
Xuefu

------------------------------------------------------------------
Sender:Shuyi Chen <su...@gmail.com>
Sent at:2018 Oct 14 (Sun) 08:30
Recipient:Xuefu <xu...@alibaba-inc.com>
Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>; dev <de...@flink.apache.org>; user <us...@flink.apache.org>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Welcome to the community and thanks for the great proposal, Xuefu! I think the proposal can be divided into two stages: making Flink support Hive features, and making Hive work with Flink. I agree with Timo on starting with a smaller scope, so we can make progress faster. As for [6], a proposal for DDL is already in progress, and will come after the unified SQL connector API is done. For supporting Hive syntax, we might need to work with the Calcite community, and a recent effort called Babel (https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite might help here.

Thanks
Shuyi
On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <xu...@alibaba-inc.com> wrote:
Hi Fabian/Vino,

Thank you very much for your encouragement and inquiry. Sorry that I didn't see Fabian's email until I read Vino's response just now. (Somehow Fabian's email went to the spam folder.)

My proposal contains long-term and short-term goals. Nevertheless, the effort will focus on the following areas, including Fabian's list:

1. Hive metastore connectivity - This covers both read/write access, which means Flink can make full use of Hive's metastore as its catalog (at least for batch, but this can be extended to streaming as well).
2. Metadata compatibility - Objects (databases, tables, partitions, etc.) created by Hive can be understood by Flink, and the reverse is also true.
3. Data compatibility - Similar to #2, data produced by Hive can be consumed by Flink and vice versa.
4. Support Hive UDFs - For all of Hive's native UDFs, Flink either provides its own implementation or makes Hive's implementation work in Flink. Further, for user-created UDFs in Hive, Flink SQL should provide a mechanism allowing users to import them into Flink without any code change required.
5. Data types - Flink SQL should support all data types that are available in Hive.
6. SQL language - Flink SQL should support the SQL standard (such as SQL:2003) with extensions to support Hive's syntax and language features, around DDL, DML, and SELECT queries.
7. SQL CLI - this is currently under development in Flink, but more effort is needed.
8. Server - provide a server that's compatible with Hive's HiveServer2 thrift APIs, such that HiveServer2 users can reuse their existing clients (such as Beeline) but connect to Flink's thrift server instead.
9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for other applications to connect to its thrift server.
10. Support other user customizations in Hive, such as Hive SerDes, storage handlers, etc.
11. Better task failure tolerance and task scheduling in the Flink runtime.
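As a rough illustration of item 5 above, the following Python sketch shows what a Hive-to-Flink SQL type mapping table might look like. All names here are hypothetical and for illustration only; the real mapping would live inside Flink's type system and cover many more cases (complex types, parameter validation, etc.).

```python
# Hypothetical Hive -> Flink SQL type mapping (illustrative, not Flink's API).
HIVE_TO_FLINK = {
    "TINYINT": "TINYINT", "SMALLINT": "SMALLINT", "INT": "INT",
    "BIGINT": "BIGINT", "FLOAT": "FLOAT", "DOUBLE": "DOUBLE",
    "BOOLEAN": "BOOLEAN", "STRING": "VARCHAR", "BINARY": "VARBINARY",
    "TIMESTAMP": "TIMESTAMP", "DATE": "DATE",
}

def to_flink_type(hive_type: str) -> str:
    """Translate a Hive column type name to a Flink SQL type name,
    passing parameterized types like DECIMAL(10, 2) through unchanged."""
    base = hive_type.strip().upper()
    if base.startswith(("DECIMAL", "CHAR", "VARCHAR")):
        return base  # parameterized types carry their precision through
    try:
        return HIVE_TO_FLINK[base]
    except KeyError:
        raise ValueError(f"no Flink mapping for Hive type {hive_type!r}")
```

For example, a Hive STRING column would surface as VARCHAR on the Flink side, while a type with no defined mapping (say, UNIONTYPE) would be reported as unsupported rather than silently mistranslated.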

As you can see, achieving all of this requires significant effort across all layers in Flink. However, a short-term goal could include only core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope (such as #3, #6).

Please share your further thoughts. If we generally agree that this is the right direction, I could come up with a formal proposal quickly and then we can follow up with broader discussions.

Thanks,
Xuefu



------------------------------------------------------------------
Sender:vino yang <ya...@gmail.com>
Sent at:2018 Oct 11 (Thu) 09:45
Recipient:Fabian Hueske <fh...@gmail.com>
Cc:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>; user <us...@flink.apache.org>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Xuefu,

I appreciate this proposal, and, like Fabian, I think it would be better if you could give more details of the plan.

Thanks, vino.
Fabian Hueske <fh...@gmail.com> 于2018年10月10日周三 下午5:27写道:
Hi Xuefu,

Welcome to the Flink community and thanks for starting this discussion! Better Hive integration would be really great!
Can you go into details of what you are proposing? I can think of a couple ways to improve Flink in that regard:

* Support for Hive UDFs
* Support for Hive metadata catalog
* Support for HiveQL syntax
* ???

Best, Fabian

Am Di., 9. Okt. 2018 um 19:22 Uhr schrieb Zhang, Xuefu <xu...@alibaba-inc.com>:
Hi all,

Along with the community's effort, inside Alibaba we have explored Flink's potential as an execution engine not just for stream processing but also for batch processing. We are encouraged by our findings and have initiated our effort to make Flink's SQL capabilities full-fledged. When comparing what's available in Flink to the offerings from competing data processing engines, we identified a major gap in Flink: good integration with the Hive ecosystem. This is crucial to the success of Flink SQL and batch processing, due to the well-established data ecosystem around Hive. Therefore, we have done some initial work in this direction, but there is still a lot of effort needed.

We have two strategies in mind. The first is to make Flink SQL full-fledged and well-integrated with the Hive ecosystem, similar to the approach Spark SQL adopted. The second is to make Hive itself work with Flink, similar to the proposal in [1]. Each approach has its pros and cons, but they need not be mutually exclusive, with each targeting different users and use cases. We believe that both will promote a much greater adoption of Flink beyond stream processing.

 We have been focused on the first approach and would like to showcase Flink's batch and SQL capabilities with Flink SQL. However, we have also planned to start strategy #2 as the follow-up effort.

I'm completely new to Flink (with a short bio [2] below), though many of my colleagues here at Alibaba are long-time contributors. Nevertheless, I'd like to share our thoughts and invite your early feedback. At the same time, I am working on a detailed proposal on Flink SQL's integration with the Hive ecosystem, which will also be shared when ready.

While the ideas are simple, each approach will demand significant effort, more than what we can afford. Thus, input and contributions from the community are greatly welcome and appreciated.

 Regards,


 Xuefu

 References:

 [1] https://issues.apache.org/jira/browse/HIVE-10712
[2] Xuefu Zhang is a long-time open source veteran who has worked on many projects under the Apache Software Foundation, of which he is also an honored member. About 10 years ago he worked on the Hadoop team at Yahoo, where the projects had just gotten started. Later he worked at Cloudera, initiating and leading the development of the Hive on Spark project in the community and across many organizations. Prior to joining Alibaba, he worked at Uber, where he moved all of Uber's SQL-on-Hadoop workloads to Hive on Spark and significantly improved Uber's cluster efficiency.




-- 
"So you have to trust that the dots will somehow connect in your future."

-- 
"So you have to trust that the dots will somehow connect in your future." 

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by Shuyi Chen <su...@gmail.com>.
Hi Xuefu,

Thanks a lot for driving this big effort. I would suggest converting your
proposal and design doc into a Google doc and sharing it on the dev mailing
list for the community to review and comment, with a title like "[DISCUSS]
... Hive integration design ...". Once approved, we can document it as a
FLIP (Flink Improvement Proposal) and use JIRAs to track the
implementations. What do you think?

Shuyi

On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu <xu...@alibaba-inc.com>
wrote:

> Hi all,
>
> I have also shared a design doc on Hive metastore integration that is
> attached here and also to FLINK-10556[1]. Please kindly review and share
> your feedback.
>
>
> Thanks,
> Xuefu
>
> [1] https://issues.apache.org/jira/browse/FLINK-10556
>
> ------------------------------------------------------------------
> Sender:Xuefu <xu...@alibaba-inc.com>
> Sent at:2018 Oct 25 (Thu) 01:08
> Recipient:Xuefu <xu...@alibaba-inc.com>; Shuyi Chen <su...@gmail.com>
> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi all,
>
> To wrap up the discussion, I have attached a PDF describing the proposal,
> which is also attached to FLINK-10556 [1]. Please feel free to watch that
> JIRA to track the progress.
>
> Please also let me know if you have additional comments or questions.
>
> Thanks,
> Xuefu
>
> [1] https://issues.apache.org/jira/browse/FLINK-10556
>
>
> ------------------------------------------------------------------
> Sender:Xuefu <xu...@alibaba-inc.com>
> Sent at:2018 Oct 16 (Tue) 03:40
> Recipient:Shuyi Chen <su...@gmail.com>
> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi Shuyi,
>
> Thank you for your input. Yes, I agree with a phased approach and would
> like to move forward fast. :) We did some work internally on DDL utilizing
> the Babel parser in Calcite. While Babel makes Calcite's grammar
> extensible, at first impression it still seems too cumbersome for a
> project when too many extensions are made. It's even challenging to find
> where an extension is needed! It would certainly be better if Calcite
> could magically support HiveQL just by turning on a flag, such as the one
> for MYSQL_5. I can also see that this could mean a lot of work on Calcite.
> Nevertheless, I will bring up the discussion over there and see what their
> community thinks.
>
> Would you mind sharing more info about the proposal on DDL that you
> mentioned? We can certainly collaborate on this.
>
> Thanks,
> Xuefu
>
> ------------------------------------------------------------------
> Sender:Shuyi Chen <su...@gmail.com>
> Sent at:2018 Oct 14 (Sun) 08:30
> Recipient:Xuefu <xu...@alibaba-inc.com>
> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Welcome to the community and thanks for the great proposal, Xuefu! I think
> the proposal can be divided into two stages: making Flink support Hive
> features, and making Hive work with Flink. I agree with Timo on starting
> with a smaller scope, so we can make progress faster. As for [6], a
> proposal for DDL is already in progress, and will come after the unified
> SQL connector API is done. For supporting Hive syntax, we might need to
> work with the Calcite community, and a recent effort called Babel (
> https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite might help
> here.
>
> Thanks
> Shuyi
>
> On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <xu...@alibaba-inc.com>
> wrote:
> Hi Fabian/Vino,
>
> Thank you very much for your encouragement and inquiry. Sorry that I
> didn't see Fabian's email until I read Vino's response just now. (Somehow
> Fabian's email went to the spam folder.)
>
> My proposal contains long-term and short-terms goals. Nevertheless, the
> effort will focus on the following areas, including Fabian's list:
>
> 1. Hive metastore connectivity - This covers both read and write access,
> which means Flink can make full use of Hive's metastore as its catalog (at
> least for batch, but this can be extended to streaming as well).
> 2. Metadata compatibility - Objects (databases, tables, partitions, etc.)
> created by Hive can be understood by Flink, and the reverse is also true.
> 3. Data compatibility - Similar to #2, data produced by Hive can be
> consumed by Flink and vice versa.
> 4. Support for Hive UDFs - For all of Hive's native UDFs, Flink either
> provides its own implementation or makes Hive's implementation work in
> Flink. Further, for user-created UDFs in Hive, Flink SQL should provide a
> mechanism allowing users to import them into Flink without any code change
> required.
> 5. Data types - Flink SQL should support all data types that are available
> in Hive.
> 6. SQL language - Flink SQL should support the SQL standard (such as
> SQL:2003) with extensions to support Hive's syntax and language features,
> around DDL, DML, and SELECT queries.
> 7. SQL CLI - This is currently being developed in Flink, but more effort
> is needed.
> 8. Server - Provide a server that's compatible with Hive's HiveServer2
> thrift APIs, such that HiveServer2 users can reuse their existing clients
> (such as beeline) but connect to Flink's thrift server instead.
> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for
> other applications to connect to its thrift server.
> 10. Support for other user customizations in Hive, such as Hive SerDes,
> storage handlers, etc.
> 11. Better task failure tolerance and task scheduling in the Flink runtime.
>
> As you can see, achieving all of this requires significant effort across
> all layers of Flink. However, a short-term goal could include only the core
> areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope (such as #3,
> #6).
>
> Please share your further thoughts. If we generally agree that this is the
> right direction, I could come up with a formal proposal quickly and then we
> can follow up with broader discussions.
>
> Thanks,
> Xuefu
>
>
>
> ------------------------------------------------------------------
> Sender:vino yang <ya...@gmail.com>
> Sent at:2018 Oct 11 (Thu) 09:45
> Recipient:Fabian Hueske <fh...@gmail.com>
> Cc:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>; user <
> user@flink.apache.org>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi Xuefu,
>
> I appreciate this proposal, and like Fabian, I think it would be even
> better if you could give more details of the plan.
>
> Thanks, vino.
>
> Fabian Hueske <fh...@gmail.com> wrote on Wed, Oct 10, 2018 at 5:27 PM:
> Hi Xuefu,
>
> Welcome to the Flink community and thanks for starting this discussion!
> Better Hive integration would be really great!
> Can you go into details of what you are proposing? I can think of a couple
> ways to improve Flink in that regard:
>
> * Support for Hive UDFs
> * Support for Hive metadata catalog
> * Support for HiveQL syntax
> * ???
>
> Best, Fabian
>
> On Tue, Oct 9, 2018 at 19:22, Zhang, Xuefu <xuefu.z@alibaba-inc.com> wrote:
> Hi all,
>
> Along with the community's effort, inside Alibaba we have explored Flink's
> potential as an execution engine not just for stream processing but also
> for batch processing. We are encouraged by our findings and have initiated
> an effort to make Flink's SQL capabilities full-fledged. When comparing
> what's available in Flink to the offerings of competing data processing
> engines, we identified a major gap in Flink: good integration with the
> Hive ecosystem. This is crucial to the success of Flink SQL and batch due
> to the well-established data ecosystem around Hive. Therefore, we have
> done some initial work along this direction, but there is still a lot of
> effort needed.
>
> We have two strategies in mind. The first one is to make Flink SQL
> full-fledged and well integrated with the Hive ecosystem. This is similar
> to the approach Spark SQL adopted. The second strategy is to make Hive
> itself work with Flink, similar to the proposal in [1]. Each approach has
> its pros and cons, but they don't need to be mutually exclusive, with each
> targeting different users and use cases. We believe that both will promote
> a much greater adoption of Flink beyond stream processing.
>
> We have been focused on the first approach and would like to showcase
> Flink's batch and SQL capabilities with Flink SQL. However, we have also
> planned to start strategy #2 as the follow-up effort.
>
> I'm completely new to Flink (a short bio [2] is below), though many of my
> colleagues here at Alibaba are long-time contributors. Nevertheless, I'd
> like to share our thoughts and invite your early feedback. At the same
> time, I am working on a detailed proposal on Flink SQL's integration with
> the Hive ecosystem, which will also be shared when ready.
>
> While the ideas are simple, each approach will demand significant effort,
> more than what we can afford. Thus, the input and contributions from the
> communities are greatly welcome and appreciated.
>
> Regards,
>
>
> Xuefu
>
> References:
>
> [1] https://issues.apache.org/jira/browse/HIVE-10712
> [2] Xuefu Zhang is a long-time open source veteran who has worked on many
> projects under the Apache Foundation, of which he is also an honored
> member. About 10 years ago he worked on the Hadoop team at Yahoo when the
> projects were just getting started. Later he worked at Cloudera, initiating
> and leading the development of the Hive on Spark project in the community
> and across many organizations. Prior to joining Alibaba, he worked at Uber,
> where he rolled out Hive on Spark to all of Uber's SQL-on-Hadoop workload
> and significantly improved Uber's cluster efficiency.
>
>
>
>
> --
> "So you have to trust that the dots will somehow connect in your future."
>
>

-- 
"So you have to trust that the dots will somehow connect in your future."

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by Shuyi Chen <su...@gmail.com>.
Hi Xuefu,

Thanks a lot for driving this big effort. I would suggest converting your
proposal and design doc into a Google doc and sharing it on the dev mailing
list for the community to review and comment on, with a title like
"[DISCUSS] ... Hive integration design ...". Once approved, we can document
it as a FLIP (Flink Improvement Proposal) and use JIRAs to track the
implementation. What do you think?

Shuyi

On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu <xu...@alibaba-inc.com>
wrote:

> Hi all,
>
> I have also shared a design doc on Hive metastore integration that is
> attached here and also to FLINK-10556[1]. Please kindly review and share
> your feedback.
>
>
> Thanks,
> Xuefu
>
> [1] https://issues.apache.org/jira/browse/FLINK-10556
>
> ------------------------------------------------------------------
> Sender:Xuefu <xu...@alibaba-inc.com>
> Sent at:2018 Oct 25 (Thu) 01:08
> Recipient:Xuefu <xu...@alibaba-inc.com>; Shuyi Chen <su...@gmail.com>
> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi all,
>
> To wrap up the discussion, I have attached a PDF describing the proposal,
> which is also attached to FLINK-10556 [1]. Please feel free to watch that
> JIRA to track the progress.
>
> Please also let me know if you have additional comments or questions.
>
> Thanks,
> Xuefu
>
> [1] https://issues.apache.org/jira/browse/FLINK-10556
>
>
> ------------------------------------------------------------------
> Sender:Xuefu <xu...@alibaba-inc.com>
> Sent at:2018 Oct 16 (Tue) 03:40
> Recipient:Shuyi Chen <su...@gmail.com>
> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi Shuyi,
>
> Thank you for your input. Yes, I agree with a phased approach and would
> like to move forward fast. :) We did some work internally on DDL utilizing
> the babel parser in Calcite. While babel makes Calcite's grammar
> extensible, at first impression it still seems too cumbersome for a
> project when too many extensions are made. It's even challenging to find
> where the extension is needed! It would certainly be better if Calcite
> could magically support Hive QL by just turning on a flag, such as that
> for MYSQL_5. I can also see that this could mean a lot of work on Calcite.
> Nevertheless, I will bring up the discussion over there and see what their
> community thinks.
>
> Would you mind sharing more info about the DDL proposal you mentioned? We
> can certainly collaborate on this.
>
> Thanks,
> Xuefu
>
> ------------------------------------------------------------------
> Sender:Shuyi Chen <su...@gmail.com>
> Sent at:2018 Oct 14 (Sun) 08:30
> Recipient:Xuefu <xu...@alibaba-inc.com>
> Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>;
> dev <de...@flink.apache.org>; user <us...@flink.apache.org>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Welcome to the community and thanks for the great proposal, Xuefu! I think
> the proposal can be divided into 2 stages: making Flink support Hive
> features, and making Hive work with Flink. I agree with Timo on starting
> with a smaller scope, so we can make progress faster. As for [6], a
> proposal for DDL is already in progress, and will come after the unified
> SQL connector API is done. For supporting Hive syntax, we might need to
> work with the Calcite community, and a recent effort called babel (
> https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite might help
> here.
>
> Thanks
> Shuyi
>
> On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <xu...@alibaba-inc.com>
> wrote:
> Hi Fabian/Vino,
>
> Thank you very much for your encouragement and inquiry. Sorry that I
> didn't see Fabian's email until I read Vino's response just now. (Somehow
> Fabian's went to the spam folder.)
>
> My proposal contains long-term and short-term goals. Nevertheless, the
> effort will focus on the following areas, including Fabian's list:
>
> 1. Hive metastore connectivity - This covers both read and write access,
> which means Flink can make full use of Hive's metastore as its catalog (at
> least for batch, but this can be extended to streaming as well).
> 2. Metadata compatibility - Objects (databases, tables, partitions, etc.)
> created by Hive can be understood by Flink, and the reverse is also true.
> 3. Data compatibility - Similar to #2, data produced by Hive can be
> consumed by Flink and vice versa.
> 4. Support for Hive UDFs - For all of Hive's native UDFs, Flink either
> provides its own implementation or makes Hive's implementation work in
> Flink. Further, for user-created UDFs in Hive, Flink SQL should provide a
> mechanism allowing users to import them into Flink without any code change
> required.
> 5. Data types - Flink SQL should support all data types that are available
> in Hive.
> 6. SQL language - Flink SQL should support the SQL standard (such as
> SQL:2003) with extensions to support Hive's syntax and language features,
> around DDL, DML, and SELECT queries.
> 7. SQL CLI - This is currently being developed in Flink, but more effort
> is needed.
> 8. Server - Provide a server that's compatible with Hive's HiveServer2
> thrift APIs, such that HiveServer2 users can reuse their existing clients
> (such as beeline) but connect to Flink's thrift server instead.
> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for
> other applications to connect to its thrift server.
> 10. Support for other user customizations in Hive, such as Hive SerDes,
> storage handlers, etc.
> 11. Better task failure tolerance and task scheduling in the Flink runtime.
>
> As you can see, achieving all of this requires significant effort across
> all layers of Flink. However, a short-term goal could include only the core
> areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope (such as #3,
> #6).
>
> Please share your further thoughts. If we generally agree that this is the
> right direction, I could come up with a formal proposal quickly and then we
> can follow up with broader discussions.
>
> Thanks,
> Xuefu
>
>
>
> ------------------------------------------------------------------
> Sender:vino yang <ya...@gmail.com>
> Sent at:2018 Oct 11 (Thu) 09:45
> Recipient:Fabian Hueske <fh...@gmail.com>
> Cc:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>; user <
> user@flink.apache.org>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi Xuefu,
>
> I appreciate this proposal, and like Fabian, I think it would be even
> better if you could give more details of the plan.
>
> Thanks, vino.
>
> Fabian Hueske <fh...@gmail.com> wrote on Wed, Oct 10, 2018 at 5:27 PM:
> Hi Xuefu,
>
> Welcome to the Flink community and thanks for starting this discussion!
> Better Hive integration would be really great!
> Can you go into details of what you are proposing? I can think of a couple
> ways to improve Flink in that regard:
>
> * Support for Hive UDFs
> * Support for Hive metadata catalog
> * Support for HiveQL syntax
> * ???
>
> Best, Fabian
>
> On Tue, Oct 9, 2018 at 19:22, Zhang, Xuefu <xuefu.z@alibaba-inc.com> wrote:
> Hi all,
>
> Along with the community's effort, inside Alibaba we have explored Flink's
> potential as an execution engine not just for stream processing but also
> for batch processing. We are encouraged by our findings and have initiated
> an effort to make Flink's SQL capabilities full-fledged. When comparing
> what's available in Flink to the offerings of competing data processing
> engines, we identified a major gap in Flink: good integration with the
> Hive ecosystem. This is crucial to the success of Flink SQL and batch due
> to the well-established data ecosystem around Hive. Therefore, we have
> done some initial work along this direction, but there is still a lot of
> effort needed.
>
> We have two strategies in mind. The first one is to make Flink SQL
> full-fledged and well integrated with the Hive ecosystem. This is similar
> to the approach Spark SQL adopted. The second strategy is to make Hive
> itself work with Flink, similar to the proposal in [1]. Each approach has
> its pros and cons, but they don't need to be mutually exclusive, with each
> targeting different users and use cases. We believe that both will promote
> a much greater adoption of Flink beyond stream processing.
>
> We have been focused on the first approach and would like to showcase
> Flink's batch and SQL capabilities with Flink SQL. However, we have also
> planned to start strategy #2 as the follow-up effort.
>
> I'm completely new to Flink (a short bio [2] is below), though many of my
> colleagues here at Alibaba are long-time contributors. Nevertheless, I'd
> like to share our thoughts and invite your early feedback. At the same
> time, I am working on a detailed proposal on Flink SQL's integration with
> the Hive ecosystem, which will also be shared when ready.
>
> While the ideas are simple, each approach will demand significant effort,
> more than what we can afford. Thus, the input and contributions from the
> communities are greatly welcome and appreciated.
>
> Regards,
>
>
> Xuefu
>
> References:
>
> [1] https://issues.apache.org/jira/browse/HIVE-10712
> [2] Xuefu Zhang is a long-time open source veteran who has worked on many
> projects under the Apache Foundation, of which he is also an honored
> member. About 10 years ago he worked on the Hadoop team at Yahoo when the
> projects were just getting started. Later he worked at Cloudera, initiating
> and leading the development of the Hive on Spark project in the community
> and across many organizations. Prior to joining Alibaba, he worked at Uber,
> where he rolled out Hive on Spark to all of Uber's SQL-on-Hadoop workload
> and significantly improved Uber's cluster efficiency.
>
>
>
>
> --
> "So you have to trust that the dots will somehow connect in your future."
>
>

-- 
"So you have to trust that the dots will somehow connect in your future."

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by "Zhang, Xuefu" <xu...@alibaba-inc.com>.
Hi all,

I have also shared a design doc on Hive metastore integration that is attached here and also to FLINK-10556[1]. Please kindly review and share your feedback.


Thanks,
Xuefu

[1] https://issues.apache.org/jira/browse/FLINK-10556
------------------------------------------------------------------
Sender:Xuefu <xu...@alibaba-inc.com>
Sent at:2018 Oct 25 (Thu) 01:08
Recipient:Xuefu <xu...@alibaba-inc.com>; Shuyi Chen <su...@gmail.com>
Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>; dev <de...@flink.apache.org>; user <us...@flink.apache.org>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi all,

To wrap up the discussion, I have attached a PDF describing the proposal, which is also attached to FLINK-10556 [1]. Please feel free to watch that JIRA to track the progress.

Please also let me know if you have additional comments or questions.

Thanks,
Xuefu

[1] https://issues.apache.org/jira/browse/FLINK-10556



------------------------------------------------------------------
Sender:Xuefu <xu...@alibaba-inc.com>
Sent at:2018 Oct 16 (Tue) 03:40
Recipient:Shuyi Chen <su...@gmail.com>
Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>; dev <de...@flink.apache.org>; user <us...@flink.apache.org>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Shuyi,

Thank you for your input. Yes, I agree with a phased approach and would like to move forward fast. :) We did some work internally on DDL utilizing the babel parser in Calcite. While babel makes Calcite's grammar extensible, at first impression it still seems too cumbersome for a project when too many extensions are made. It's even challenging to find where the extension is needed! It would certainly be better if Calcite could magically support Hive QL by just turning on a flag, such as that for MYSQL_5. I can also see that this could mean a lot of work on Calcite. Nevertheless, I will bring up the discussion over there and see what their community thinks.

Would you mind sharing more info about the DDL proposal you mentioned? We can certainly collaborate on this.

Thanks,
Xuefu

------------------------------------------------------------------
Sender:Shuyi Chen <su...@gmail.com>
Sent at:2018 Oct 14 (Sun) 08:30
Recipient:Xuefu <xu...@alibaba-inc.com>
Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>; dev <de...@flink.apache.org>; user <us...@flink.apache.org>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Welcome to the community and thanks for the great proposal, Xuefu! I think the proposal can be divided into 2 stages: making Flink support Hive features, and making Hive work with Flink. I agree with Timo on starting with a smaller scope, so we can make progress faster. As for [6], a proposal for DDL is already in progress, and will come after the unified SQL connector API is done. For supporting Hive syntax, we might need to work with the Calcite community, and a recent effort called babel (https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite might help here.

Thanks
Shuyi
On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <xu...@alibaba-inc.com> wrote:
Hi Fabian/Vino,

Thank you very much for your encouragement and inquiry. Sorry that I didn't see Fabian's email until I read Vino's response just now. (Somehow Fabian's went to the spam folder.)

My proposal contains long-term and short-term goals. Nevertheless, the effort will focus on the following areas, including Fabian's list:

1. Hive metastore connectivity - This covers both read and write access, which means Flink can make full use of Hive's metastore as its catalog (at least for batch, but this can be extended to streaming as well).
2. Metadata compatibility - Objects (databases, tables, partitions, etc.) created by Hive can be understood by Flink, and the reverse is also true.
3. Data compatibility - Similar to #2, data produced by Hive can be consumed by Flink and vice versa.
4. Support for Hive UDFs - For all of Hive's native UDFs, Flink either provides its own implementation or makes Hive's implementation work in Flink. Further, for user-created UDFs in Hive, Flink SQL should provide a mechanism allowing users to import them into Flink without any code change required.
5. Data types - Flink SQL should support all data types that are available in Hive.
6. SQL language - Flink SQL should support the SQL standard (such as SQL:2003) with extensions to support Hive's syntax and language features, around DDL, DML, and SELECT queries.
7. SQL CLI - This is currently being developed in Flink, but more effort is needed.
8. Server - Provide a server that's compatible with Hive's HiveServer2 thrift APIs, such that HiveServer2 users can reuse their existing clients (such as beeline) but connect to Flink's thrift server instead.
9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for other applications to connect to its thrift server.
10. Support for other user customizations in Hive, such as Hive SerDes, storage handlers, etc.
11. Better task failure tolerance and task scheduling in the Flink runtime.
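Item 4 above essentially calls for an adapter layer between two function contracts, so existing Hive UDFs can be imported without code changes. The real work would wrap Hive's Java UDF interfaces behind Flink's function API; the sketch below is only a self-contained toy model of the idea, and every class and method name in it is invented for illustration:

```python
# Toy model of importing an existing "Hive-style" UDF into a host engine
# without changing the UDF's code. All names here are illustrative; a real
# adapter would bridge Hive's Java UDF interfaces to Flink's function API.

class HiveStyleUdf:
    """Stand-in for a user's existing Hive UDF: set up once, evaluate per row."""
    def initialize(self):
        self.prefix = "hello, "

    def evaluate(self, value):
        return self.prefix + value

class EngineScalarFunction:
    """Stand-in for the host engine's scalar-function contract."""
    def open(self):
        pass

    def eval(self, *args):
        raise NotImplementedError

class HiveUdfAdapter(EngineScalarFunction):
    """Bridges the two contracts so the Hive UDF is reused unchanged."""
    def __init__(self, hive_udf):
        self.hive_udf = hive_udf

    def open(self):
        # Map the engine's lifecycle hook onto the Hive UDF's setup hook.
        self.hive_udf.initialize()

    def eval(self, *args):
        # Delegate per-row evaluation to the wrapped UDF.
        return self.hive_udf.evaluate(*args)

adapter = HiveUdfAdapter(HiveStyleUdf())
adapter.open()
print(adapter.eval("flink"))  # -> hello, flink
```

The point of the pattern is that the user's UDF class is never touched; only the thin adapter knows about both contracts.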

As you can see, achieving all of this requires significant effort across all layers of Flink. However, a short-term goal could include only the core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope (such as #3, #6).
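Items 1 and 2 of that short-term core boil down to a two-way mapping between Hive metastore objects and the engine's own catalog model, where objects written by either side must stay readable by the other. Below is a minimal, self-contained sketch of such a mapping; the field names are simplified for illustration and do not reflect the actual metastore thrift schema:

```python
# Toy bridge between a "Hive metastore" table record and an engine-side
# table descriptor. Real metastore records are thrift objects; the field
# names below are simplified stand-ins for illustration only.

HIVE_TO_ENGINE_TYPES = {"string": "VARCHAR", "int": "INTEGER", "bigint": "BIGINT"}

def metastore_to_engine(hive_table):
    """Read direction: expose a Hive table through the engine's catalog."""
    return {
        "name": f"{hive_table['dbName']}.{hive_table['tableName']}",
        "schema": [(c, HIVE_TO_ENGINE_TYPES[t]) for c, t in hive_table["cols"]],
        "partition_keys": hive_table.get("partitionKeys", []),
    }

def engine_to_metastore(engine_table):
    """Write direction: objects the engine creates must stay readable by Hive."""
    engine_to_hive = {v: k for k, v in HIVE_TO_ENGINE_TYPES.items()}
    db, table = engine_table["name"].split(".")
    return {
        "dbName": db,
        "tableName": table,
        "cols": [(c, engine_to_hive[t]) for c, t in engine_table["schema"]],
        "partitionKeys": engine_table["partition_keys"],
    }

hive_table = {"dbName": "sales", "tableName": "orders",
              "cols": [("id", "bigint"), ("item", "string")],
              "partitionKeys": ["dt"]}
# Metadata compatibility in both directions means the round trip is lossless.
round_tripped = engine_to_metastore(metastore_to_engine(hive_table))
assert round_tripped == hive_table
print(round_tripped["tableName"])  # -> orders
```

The round-trip assertion is the essence of item 2: whatever either system writes, the other can read back unchanged.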

Please share your further thoughts. If we generally agree that this is the right direction, I could come up with a formal proposal quickly and then we can follow up with broader discussions.

Thanks,
Xuefu



------------------------------------------------------------------
Sender:vino yang <ya...@gmail.com>
Sent at:2018 Oct 11 (Thu) 09:45
Recipient:Fabian Hueske <fh...@gmail.com>
Cc:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>; user <us...@flink.apache.org>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Xuefu,

I appreciate this proposal, and like Fabian, I think it would be even better if you could give more details of the plan.

Thanks, vino.
Fabian Hueske <fh...@gmail.com> wrote on Wed, Oct 10, 2018 at 5:27 PM:
Hi Xuefu,

Welcome to the Flink community and thanks for starting this discussion! Better Hive integration would be really great!
Can you go into details of what you are proposing? I can think of a couple ways to improve Flink in that regard:

* Support for Hive UDFs
* Support for Hive metadata catalog
* Support for HiveQL syntax
* ???

Best, Fabian

On Tue, Oct 9, 2018 at 19:22, Zhang, Xuefu <xu...@alibaba-inc.com> wrote:
Hi all,

 Along with the community's effort, inside Alibaba we have explored Flink's potential as an execution engine not just for stream processing but also for batch processing. We are encouraged by our findings and have initiated an effort to make Flink's SQL capabilities full-fledged. When comparing what's available in Flink to the offerings of competing data processing engines, we identified a major gap in Flink: good integration with the Hive ecosystem. This is crucial to the success of Flink SQL and batch due to the well-established data ecosystem around Hive. Therefore, we have done some initial work along this direction, but there is still a lot of effort needed.

 We have two strategies in mind. The first one is to make Flink SQL full-fledged and well integrated with the Hive ecosystem. This is similar to the approach Spark SQL adopted. The second strategy is to make Hive itself work with Flink, similar to the proposal in [1]. Each approach has its pros and cons, but they don't need to be mutually exclusive, with each targeting different users and use cases. We believe that both will promote a much greater adoption of Flink beyond stream processing.

 We have been focused on the first approach and would like to showcase Flink's batch and SQL capabilities with Flink SQL. However, we have also planned to start strategy #2 as the follow-up effort.

 I'm completely new to Flink (a short bio [2] is below), though many of my colleagues here at Alibaba are long-time contributors. Nevertheless, I'd like to share our thoughts and invite your early feedback. At the same time, I am working on a detailed proposal on Flink SQL's integration with the Hive ecosystem, which will also be shared when ready.

 While the ideas are simple, each approach will demand significant effort, more than what we can afford. Thus, the input and contributions from the communities are greatly welcome and appreciated.

 Regards,


 Xuefu

 References:

 [1] https://issues.apache.org/jira/browse/HIVE-10712
 [2] Xuefu Zhang is a long-time open source veteran who has worked on many projects under the Apache Foundation, of which he is also an honored member. About 10 years ago he worked on the Hadoop team at Yahoo when the projects were just getting started. Later he worked at Cloudera, initiating and leading the development of the Hive on Spark project in the community and across many organizations. Prior to joining Alibaba, he worked at Uber, where he rolled out Hive on Spark to all of Uber's SQL-on-Hadoop workload and significantly improved Uber's cluster efficiency.




-- 
"So you have to trust that the dots will somehow connect in your future."

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by "Zhang, Xuefu" <xu...@alibaba-inc.com>.
Hi all,

I have also shared a design doc on Hive metastore integration that is attached here and also to FLINK-10556[1]. Please kindly review and share your feedback.


Thanks,
Xuefu

[1] https://issues.apache.org/jira/browse/FLINK-10556
------------------------------------------------------------------
Sender:Xuefu <xu...@alibaba-inc.com>
Sent at:2018 Oct 25 (Thu) 01:08
Recipient:Xuefu <xu...@alibaba-inc.com>; Shuyi Chen <su...@gmail.com>
Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>; dev <de...@flink.apache.org>; user <us...@flink.apache.org>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi all,

To wrap up the discussion, I have attached a PDF describing the proposal, which is also attached to FLINK-10556 [1]. Please feel free to watch that JIRA to track the progress.

Please also let me know if you have additional comments or questions.

Thanks,
Xuefu

[1] https://issues.apache.org/jira/browse/FLINK-10556



------------------------------------------------------------------
Sender:Xuefu <xu...@alibaba-inc.com>
Sent at:2018 Oct 16 (Tue) 03:40
Recipient:Shuyi Chen <su...@gmail.com>
Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>; dev <de...@flink.apache.org>; user <us...@flink.apache.org>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Shuyi,

Thank you for your input. Yes, I agreed with a phased approach and like to move forward fast. :) We did some work internally on DDL utilizing babel parser in Calcite. While babel makes Calcite's grammar extensible, at first impression it still seems too cumbersome for a project when too much extensions are made. It's even challenging to find where the extension is needed! It would be certainly better if Calcite can magically support Hive QL by just turning on a flag, such as that for MYSQL_5. I can also see that this could mean a lot of work on Calcite. Nevertheless, I will bring up the discussion over there and to see what their community thinks.

Would mind to share more info about the proposal on DDL that you mentioned? We can certainly collaborate on this.

Thanks,
Xuefu

------------------------------------------------------------------
Sender:Shuyi Chen <su...@gmail.com>
Sent at:2018 Oct 14 (Sun) 08:30
Recipient:Xuefu <xu...@alibaba-inc.com>
Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>; dev <de...@flink.apache.org>; user <us...@flink.apache.org>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Welcome to the community and thanks for the great proposal, Xuefu! I think the proposal can be divided into 2 stages: making Flink to support Hive features, and make Hive to work with Flink. I agreed with Timo that on starting with a smaller scope, so we can make progress faster. As for [6], a proposal for DDL is already in progress, and will come after the unified SQL connector API is done. For supporting Hive syntax, we might need to work with the Calcite community, and a recent effort called babel (https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite might help here.

Thanks
Shuyi
On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <xu...@alibaba-inc.com> wrote:
Hi Fabian/Vno,

Thank you very much for your encouragement inquiry. Sorry that I didn't see Fabian's email until I read Vino's response just now. (Somehow Fabian's went to the spam folder.)

My proposal contains long-term and short-terms goals. Nevertheless, the effort will focus on the following areas, including Fabian's list:

1. Hive metastore connectivity - This covers both read/write access, which means Flink can make full use of Hive's metastore as its catalog (at least for the batch but can extend for streaming as well).
2. Metadata compatibility - Objects (databases, tables, partitions, etc) created by Hive can be understood by Flink and the reverse direction is true also.
3. Data compatibility - Similar to #2, data produced by Hive can be consumed by Flink and vise versa.
4. Support Hive UDFs - For all Hive's native udfs, Flink either provides its own implementation or make Hive's implementation work in Flink. Further, for user created UDFs in Hive, Flink SQL should provide a mechanism allowing user to import them into Flink without any code change required.
5. Data types -  Flink SQL should support all data types that are available in Hive.
6. SQL Language - Flink SQL should support SQL standard (such as SQL2003) with extension to support Hive's syntax and language features, around DDL, DML, and SELECT queries.
7.  SQL CLI - this is currently developing in Flink but more effort is needed.
8. Server - provide a server that's compatible with Hive's HiverServer2 in thrift APIs, such that HiveServer2 users can reuse their existing client (such as beeline) but connect to Flink's thrift server instead.
9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for other application to use to connect to its thrift server
10. Support other user's customizations in Hive, such as Hive Serdes, storage handlers, etc.
11. Better task failure tolerance and task scheduling at Flink runtime.

As you can see, achieving all those requires significant effort and across all layers in Flink. However, a short-term goal could  include only core areas (such as 1, 2, 4, 5, 6, 7) or start  at a smaller scope (such as #3, #6).

Please share your further thoughts. If we generally agree that this is the right direction, I could come up with a formal proposal quickly and then we can follow up with broader discussions.

Thanks,
Xuefu



------------------------------------------------------------------
Sender:vino yang <ya...@gmail.com>
Sent at:2018 Oct 11 (Thu) 09:45
Recipient:Fabian Hueske <fh...@gmail.com>
Cc:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>; user <us...@flink.apache.org>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Xuefu,

Appreciate this proposal, and like Fabian, it would look better if you can give more details of the plan.

Thanks, vino.
Fabian Hueske <fh...@gmail.com> 于2018年10月10日周三 下午5:27写道:
Hi Xuefu,

Welcome to the Flink community and thanks for starting this discussion! Better Hive integration would be really great!
Can you go into details of what you are proposing? I can think of a couple ways to improve Flink in that regard:

* Support for Hive UDFs
* Support for Hive metadata catalog
* Support for HiveQL syntax
* ???

Best, Fabian

Am Di., 9. Okt. 2018 um 19:22 Uhr schrieb Zhang, Xuefu <xu...@alibaba-inc.com>:
Hi all,

 Along with the community's effort, inside Alibaba we have explored Flink's potential as an execution engine not just for stream processing but also for batch processing. We are encouraged by our findings and have initiated an effort to make Flink's SQL capabilities full-fledged. When comparing what's available in Flink to the offerings of competing data processing engines, we identified a major gap in Flink: good integration with the Hive ecosystem. This is crucial to the success of Flink SQL and batch processing due to the well-established data ecosystem around Hive. Therefore, we have done some initial work in this direction, but a lot more effort is needed.

 We have two strategies in mind. The first is to make Flink SQL full-fledged and well-integrated with the Hive ecosystem. This is similar to the approach that Spark SQL adopted. The second strategy is to make Hive itself work with Flink, similar to the proposal in [1]. Each approach has its pros and cons, but they don't need to be mutually exclusive, with each targeting different users and use cases. We believe that both will promote a much greater adoption of Flink beyond stream processing.

 We have been focused on the first approach and would like to showcase Flink's batch and SQL capabilities with Flink SQL. However, we have also planned to start on strategy #2 as a follow-up effort.

 I'm completely new to Flink (with a short bio [2] below), though many of my colleagues here at Alibaba are long-time contributors. Nevertheless, I'd like to share our thoughts and invite your early feedback. At the same time, I am working on a detailed proposal for Flink SQL's integration with the Hive ecosystem, which will also be shared when ready.

 While the ideas are simple, each approach will demand significant effort, more than what we can afford. Thus, input and contributions from the communities are greatly welcome and appreciated.

 Regards,


 Xuefu

 References:

 [1] https://issues.apache.org/jira/browse/HIVE-10712
 [2] Xuefu Zhang is a long-time open source veteran who has worked on many projects under the Apache Software Foundation, of which he is also an honored member. About 10 years ago he worked on the Hadoop team at Yahoo, when those projects had just gotten started. Later he worked at Cloudera, initiating and leading the development of the Hive on Spark project across the communities and many organizations. Prior to joining Alibaba, he worked at Uber, where he moved all of Uber's SQL-on-Hadoop workload to Hive on Spark and significantly improved Uber's cluster efficiency.




-- 
"So you have to trust that the dots will somehow connect in your future."

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by "Zhang, Xuefu" <xu...@alibaba-inc.com>.
Hi all,

To wrap up the discussion, I have attached a PDF describing the proposal, which is also attached to FLINK-10556 [1]. Please feel free to watch that JIRA to track the progress.

Please also let me know if you have additional comments or questions.

Thanks,
Xuefu

[1] https://issues.apache.org/jira/browse/FLINK-10556



------------------------------------------------------------------
Sender:Xuefu <xu...@alibaba-inc.com>
Sent at:2018 Oct 16 (Tue) 03:40
Recipient:Shuyi Chen <su...@gmail.com>
Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>; dev <de...@flink.apache.org>; user <us...@flink.apache.org>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Shuyi,

Thank you for your input. Yes, I agree with a phased approach and would like to move forward fast. :) We did some work internally on DDL utilizing the Babel parser in Calcite. While Babel makes Calcite's grammar extensible, at first impression it still seems too cumbersome for a project when too many extensions are made. It's even challenging to find where an extension is needed! It would certainly be better if Calcite could magically support HiveQL by just turning on a flag, such as the one for MYSQL_5. I can also see that this could mean a lot of work on Calcite. Nevertheless, I will bring up the discussion over there and see what their community thinks.

Would you mind sharing more info about the proposal on DDL that you mentioned? We can certainly collaborate on this.
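The "flag" idea being discussed here (a single dialect switch in the spirit of Calcite's MYSQL_5 conformance setting) can be sketched conceptually. Note this is plain illustrative Java, not Calcite's actual API; only identifier quoting is shown, one of many behaviors such a flag would have to control:

```java
public class DialectConfig {
    // Illustrative dialect switch in the spirit of Calcite's
    // conformance/lex flags (e.g. MYSQL_5). NOT Calcite's API,
    // just a sketch of the idea.
    public enum Dialect { STANDARD, MYSQL_5, HIVE }

    /** Identifier quoting character implied by the dialect flag. */
    public static char identifierQuote(Dialect dialect) {
        switch (dialect) {
            case MYSQL_5:
            case HIVE:
                return '`';   // MySQL and HiveQL quote identifiers with backticks
            case STANDARD:
            default:
                return '"';   // the SQL standard uses double quotes
        }
    }

    /** Quotes an identifier according to the dialect. */
    public static String quote(String identifier, Dialect dialect) {
        char q = identifierQuote(dialect);
        return q + identifier + q;
    }

    public static void main(String[] args) {
        System.out.println(quote("my_table", Dialect.HIVE));     // `my_table`
        System.out.println(quote("my_table", Dialect.STANDARD)); // "my_table"
    }
}
```

A real HiveQL mode would of course go far beyond quoting (reserved words, DDL grammar, type names), which is exactly why extending Calcite itself, rather than patching a fork, is attractive.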

Thanks,
Xuefu


------------------------------------------------------------------
Sender:Shuyi Chen <su...@gmail.com>
Sent at:2018 Oct 14 (Sun) 08:30
Recipient:Xuefu <xu...@alibaba-inc.com>
Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>; dev <de...@flink.apache.org>; user <us...@flink.apache.org>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Welcome to the community and thanks for the great proposal, Xuefu! I think the proposal can be divided into 2 stages: making Flink support Hive features, and making Hive work with Flink. I agree with Timo on starting with a smaller scope, so we can make progress faster. As for [6], a proposal for DDL is already in progress and will come after the unified SQL connector API is done. For supporting Hive syntax, we might need to work with the Calcite community, and a recent effort called Babel (https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite might help here.

Thanks
Shuyi



Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by "Zhang, Xuefu" <xu...@alibaba-inc.com>.
Hi Shuyi,

Thank you for your input. Yes, I agreed with a phased approach and like to move forward fast. :) We did some work internally on DDL utilizing babel parser in Calcite. While babel makes Calcite's grammar extensible, at first impression it still seems too cumbersome for a project when too much extensions are made. It's even challenging to find where the extension is needed! It would be certainly better if Calcite can magically support Hive QL by just turning on a flag, such as that for MYSQL_5. I can also see that this could mean a lot of work on Calcite. Nevertheless, I will bring up the discussion over there and to see what their community thinks.

Would mind to share more info about the proposal on DDL that you mentioned? We can certainly collaborate on this.

Thanks,
Xuefu


------------------------------------------------------------------
Sender:Shuyi Chen <su...@gmail.com>
Sent at:2018 Oct 14 (Sun) 08:30
Recipient:Xuefu <xu...@alibaba-inc.com>
Cc:yanghua1127 <ya...@gmail.com>; Fabian Hueske <fh...@gmail.com>; dev <de...@flink.apache.org>; user <us...@flink.apache.org>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Welcome to the community and thanks for the great proposal, Xuefu! I think the proposal can be divided into 2 stages: making Flink to support Hive features, and make Hive to work with Flink. I agreed with Timo that on starting with a smaller scope, so we can make progress faster. As for [6], a proposal for DDL is already in progress, and will come after the unified SQL connector API is done. For supporting Hive syntax, we might need to work with the Calcite community, and a recent effort called babel (https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite might help here.

Thanks
Shuyi
On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <xu...@alibaba-inc.com> wrote:

Hi Fabian/Vno,

Thank you very much for your encouragement inquiry. Sorry that I didn't see Fabian's email until I read Vino's response just now. (Somehow Fabian's went to the spam folder.)

My proposal contains long-term and short-terms goals. Nevertheless, the effort will focus on the following areas, including Fabian's list:

1. Hive metastore connectivity - This covers both read/write access, which means Flink can make full use of Hive's metastore as its catalog (at least for the batch but can extend for streaming as well).
2. Metadata compatibility - Objects (databases, tables, partitions, etc) created by Hive can be understood by Flink and the reverse direction is true also.
3. Data compatibility - Similar to #2, data produced by Hive can be consumed by Flink and vise versa.
4. Support Hive UDFs - For all Hive's native udfs, Flink either provides its own implementation or make Hive's implementation work in Flink. Further, for user created UDFs in Hive, Flink SQL should provide a mechanism allowing user to import them into Flink without any code change required.
5. Data types -  Flink SQL should support all data types that are available in Hive.
6. SQL Language - Flink SQL should support SQL standard (such as SQL2003) with extension to support Hive's syntax and language features, around DDL, DML, and SELECT queries.
7.  SQL CLI - this is currently developing in Flink but more effort is needed.
8. Server - provide a server that's compatible with Hive's HiverServer2 in thrift APIs, such that HiveServer2 users can reuse their existing client (such as beeline) but connect to Flink's thrift server instead.
9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for other application to use to connect to its thrift server
10. Support other user's customizations in Hive, such as Hive Serdes, storage handlers, etc.
11. Better task failure tolerance and task scheduling at Flink runtime.

As you can see, achieving all those requires significant effort and across all layers in Flink. However, a short-term goal could  include only core areas (such as 1, 2, 4, 5, 6, 7) or start  at a smaller scope (such as #3, #6).

Please share your further thoughts. If we generally agree that this is the right direction, I could come up with a formal proposal quickly, and then we can follow up with broader discussions.

Thanks,
Xuefu



------------------------------------------------------------------
Sender:vino yang <ya...@gmail.com>
Sent at:2018 Oct 11 (Thu) 09:45
Recipient:Fabian Hueske <fh...@gmail.com>
Cc:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>; user <us...@flink.apache.org>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Xuefu,

Appreciate this proposal, and like Fabian, it would look better if you can give more details of the plan.

Thanks, vino.
Fabian Hueske <fh...@gmail.com> wrote on Wed, Oct 10, 2018 at 5:27 PM:
Hi Xuefu,

Welcome to the Flink community and thanks for starting this discussion! Better Hive integration would be really great!
Can you go into details of what you are proposing? I can think of a couple ways to improve Flink in that regard:

* Support for Hive UDFs
* Support for Hive metadata catalog
* Support for HiveQL syntax
* ???

Best, Fabian

On Tue, Oct 9, 2018 at 7:22 PM, Zhang, Xuefu <xu...@alibaba-inc.com> wrote:
Hi all,

 Along with the community's effort, inside Alibaba we have explored Flink's potential as an execution engine not just for stream processing but also for batch processing. We are encouraged by our findings and have initiated our effort to make Flink's SQL capabilities full-fledged. When comparing what's available in Flink to the offerings from competitive data processing engines, we identified a major gap in Flink: a well integration with Hive ecosystem. This is crucial to the success of Flink SQL and batch due to the well-established data ecosystem around Hive. Therefore, we have done some initial work along this direction but there are still a lot of effort needed.

 We have two strategies in mind. The first one is to make Flink SQL full-fledged and well-integrated with Hive ecosystem. This is a similar approach to what Spark SQL adopted. The second strategy is to make Hive itself work with Flink, similar to the proposal in [1]. Each approach bears its pros and cons, but they don’t need to be mutually exclusive with each targeting at different users and use cases. We believe that both will promote a much greater adoption of Flink beyond stream processing.

 We have been focused on the first approach and would like to showcase Flink's batch and SQL capabilities with Flink SQL. However, we have also planned to start strategy #2 as the follow-up effort.

 I'm completely new to Flink(, with a short bio [2] below), though many of my colleagues here at Alibaba are long-time contributors. Nevertheless, I'd like to share our thoughts and invite your early feedback. At the same time, I am working on a detailed proposal on Flink SQL's integration with Hive ecosystem, which will be also shared when ready.

 While the ideas are simple, each approach will demand significant effort, more than what we can afford. Thus, the input and contributions from the communities are greatly welcome and appreciated.

 Regards,


 Xuefu

 References:

 [1] https://issues.apache.org/jira/browse/HIVE-10712
 [2] Xuefu Zhang is a long-time open source veteran, worked or working on many projects under Apache Foundation, of which he is also an honored member. About 10 years ago he worked in the Hadoop team at Yahoo where the projects just got started. Later he worked at Cloudera, initiating and leading the development of Hive on Spark project in the communities and across many organizations. Prior to joining Alibaba, he worked at Uber where he promoted Hive on Spark to all Uber's SQL on Hadoop workload and significantly improved Uber's cluster efficiency.




-- 
"So you have to trust that the dots will somehow connect in your future."

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by Shuyi Chen <su...@gmail.com>.
Welcome to the community and thanks for the great proposal, Xuefu! I think
the proposal can be divided into 2 stages: making Flink support Hive
features, and making Hive work with Flink. I agree with Timo on starting
with a smaller scope, so we can make progress faster. As for [6], a
proposal for DDL is already in progress and will come after the unified
SQL connector API is done. For supporting Hive syntax, we might need to
work with the Calcite community, and a recent effort called Babel (
https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite might help
here.

Thanks
Shuyi

On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu <xu...@alibaba-inc.com>
wrote:

> Hi Fabian/Vno,
>
> Thank you very much for your encouragement inquiry. Sorry that I didn't
> see Fabian's email until I read Vino's response just now. (Somehow Fabian's
> went to the spam folder.)
>
> My proposal contains long-term and short-terms goals. Nevertheless, the
> effort will focus on the following areas, including Fabian's list:
>
> 1. Hive metastore connectivity - This covers both read/write access, which
> means Flink can make full use of Hive's metastore as its catalog (at least
> for the batch but can extend for streaming as well).
> 2. Metadata compatibility - Objects (databases, tables, partitions, etc)
> created by Hive can be understood by Flink and the reverse direction is
> true also.
> 3. Data compatibility - Similar to #2, data produced by Hive can be
> consumed by Flink and vise versa.
> 4. Support Hive UDFs - For all Hive's native udfs, Flink either provides
> its own implementation or make Hive's implementation work in Flink.
> Further, for user created UDFs in Hive, Flink SQL should provide a
> mechanism allowing user to import them into Flink without any code change
> required.
> 5. Data types -  Flink SQL should support all data types that are
> available in Hive.
> 6. SQL Language - Flink SQL should support SQL standard (such as SQL2003)
> with extension to support Hive's syntax and language features, around DDL,
> DML, and SELECT queries.
> 7.  SQL CLI - this is currently developing in Flink but more effort is
> needed.
> 8. Server - provide a server that's compatible with Hive's HiverServer2 in
> thrift APIs, such that HiveServer2 users can reuse their existing client
> (such as beeline) but connect to Flink's thrift server instead.
> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for
> other application to use to connect to its thrift server
> 10. Support other user's customizations in Hive, such as Hive Serdes,
> storage handlers, etc.
> 11. Better task failure tolerance and task scheduling at Flink runtime.
>
> As you can see, achieving all those requires significant effort and across
> all layers in Flink. However, a short-term goal could  include only core
> areas (such as 1, 2, 4, 5, 6, 7) or start  at a smaller scope (such as #3,
> #6).
>
> Please share your further thoughts. If we generally agree that this is the
> right direction, I could come up with a formal proposal quickly and then we
> can follow up with broader discussions.
>
> Thanks,
> Xuefu
>
>
>
> ------------------------------------------------------------------
> Sender:vino yang <ya...@gmail.com>
> Sent at:2018 Oct 11 (Thu) 09:45
> Recipient:Fabian Hueske <fh...@gmail.com>
> Cc:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>; user <
> user@flink.apache.org>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>
> Hi Xuefu,
>
> Appreciate this proposal, and like Fabian, it would look better if you can
> give more details of the plan.
>
> Thanks, vino.
>
> Fabian Hueske <fh...@gmail.com> wrote on Wed, Oct 10, 2018 at 5:27 PM:
> Hi Xuefu,
>
> Welcome to the Flink community and thanks for starting this discussion!
> Better Hive integration would be really great!
> Can you go into details of what you are proposing? I can think of a couple
> ways to improve Flink in that regard:
>
> * Support for Hive UDFs
> * Support for Hive metadata catalog
> * Support for HiveQL syntax
> * ???
>
> Best, Fabian
>
> Am Di., 9. Okt. 2018 um 19:22 Uhr schrieb Zhang, Xuefu <
> xuefu.z@alibaba-inc.com>:
> Hi all,
>
> Along with the community's effort, inside Alibaba we have explored Flink's
> potential as an execution engine not just for stream processing but also
> for batch processing. We are encouraged by our findings and have initiated
> our effort to make Flink's SQL capabilities full-fledged. When comparing
> what's available in Flink to the offerings from competitive data processing
> engines, we identified a major gap in Flink: a well integration with Hive
> ecosystem. This is crucial to the success of Flink SQL and batch due to the
> well-established data ecosystem around Hive. Therefore, we have done some
> initial work along this direction but there are still a lot of effort
> needed.
>
> We have two strategies in mind. The first one is to make Flink SQL
> full-fledged and well-integrated with Hive ecosystem. This is a similar
> approach to what Spark SQL adopted. The second strategy is to make Hive
> itself work with Flink, similar to the proposal in [1]. Each approach bears
> its pros and cons, but they don’t need to be mutually exclusive with each
> targeting at different users and use cases. We believe that both will
> promote a much greater adoption of Flink beyond stream processing.
>
> We have been focused on the first approach and would like to showcase
> Flink's batch and SQL capabilities with Flink SQL. However, we have also
> planned to start strategy #2 as the follow-up effort.
>
> I'm completely new to Flink(, with a short bio [2] below), though many of
> my colleagues here at Alibaba are long-time contributors. Nevertheless, I'd
> like to share our thoughts and invite your early feedback. At the same
> time, I am working on a detailed proposal on Flink SQL's integration with
> Hive ecosystem, which will be also shared when ready.
>
> While the ideas are simple, each approach will demand significant effort,
> more than what we can afford. Thus, the input and contributions from the
> communities are greatly welcome and appreciated.
>
> Regards,
>
>
> Xuefu
>
> References:
>
> [1] https://issues.apache.org/jira/browse/HIVE-10712
> [2] Xuefu Zhang is a long-time open source veteran, worked or working on
> many projects under Apache Foundation, of which he is also an honored
> member. About 10 years ago he worked in the Hadoop team at Yahoo where the
> projects just got started. Later he worked at Cloudera, initiating and
> leading the development of Hive on Spark project in the communities and
> across many organizations. Prior to joining Alibaba, he worked at Uber
> where he promoted Hive on Spark to all Uber's SQL on Hadoop workload and
> significantly improved Uber's cluster efficiency.
>
>
>

-- 
"So you have to trust that the dots will somehow connect in your future."

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by Jörn Franke <jo...@gmail.com>.
Would it maybe make sense to provide Flink as an engine on Hive ("Flink-on-Hive")? E.g., to address 4, 5, 6, 8, 9, 10. This could be more loosely coupled than integrating Hive in all possible Flink core modules and thus introducing a very tight dependency on Hive in the core.
1, 2, 3 could be achieved via a connector based on the Flink Table API.
Just a proposal to start this endeavour as independent projects (Hive engine, connector) to avoid too tight coupling with Flink. Maybe in a more distant future, if the Hive integration is heavily demanded, one could then integrate it more tightly if needed.

What is meant by 11?
> Am 11.10.2018 um 05:01 schrieb Zhang, Xuefu <xu...@alibaba-inc.com>:
> 
> Hi Fabian/Vno,
> 
> Thank you very much for your encouragement inquiry. Sorry that I didn't see Fabian's email until I read Vino's response just now. (Somehow Fabian's went to the spam folder.)
> 
> My proposal contains long-term and short-terms goals. Nevertheless, the effort will focus on the following areas, including Fabian's list:
> 
> 1. Hive metastore connectivity - This covers both read/write access, which means Flink can make full use of Hive's metastore as its catalog (at least for the batch but can extend for streaming as well).
> 2. Metadata compatibility - Objects (databases, tables, partitions, etc) created by Hive can be understood by Flink and the reverse direction is true also.
> 3. Data compatibility - Similar to #2, data produced by Hive can be consumed by Flink and vise versa.
> 4. Support Hive UDFs - For all Hive's native udfs, Flink either provides its own implementation or make Hive's implementation work in Flink. Further, for user created UDFs in Hive, Flink SQL should provide a mechanism allowing user to import them into Flink without any code change required.
> 5. Data types -  Flink SQL should support all data types that are available in Hive.
> 6. SQL Language - Flink SQL should support SQL standard (such as SQL2003) with extension to support Hive's syntax and language features, around DDL, DML, and SELECT queries.
> 7.  SQL CLI - this is currently developing in Flink but more effort is needed.
> 8. Server - provide a server that's compatible with Hive's HiverServer2 in thrift APIs, such that HiveServer2 users can reuse their existing client (such as beeline) but connect to Flink's thrift server instead.
> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for other application to use to connect to its thrift server
> 10. Support other user's customizations in Hive, such as Hive Serdes, storage handlers, etc.
> 11. Better task failure tolerance and task scheduling at Flink runtime.
> 
> As you can see, achieving all those requires significant effort and across all layers in Flink. However, a short-term goal could  include only core areas (such as 1, 2, 4, 5, 6, 7) or start  at a smaller scope (such as #3, #6).
> 
> Please share your further thoughts. If we generally agree that this is the right direction, I could come up with a formal proposal quickly and then we can follow up with broader discussions.
> 
> Thanks,
> Xuefu
> 
> 
> 
> ------------------------------------------------------------------
> Sender:vino yang <ya...@gmail.com>
> Sent at:2018 Oct 11 (Thu) 09:45
> Recipient:Fabian Hueske <fh...@gmail.com>
> Cc:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>; user <us...@flink.apache.org>
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> 
> Hi Xuefu,
> 
> Appreciate this proposal, and like Fabian, it would look better if you can give more details of the plan.
> 
> Thanks, vino.
> 
> Fabian Hueske <fh...@gmail.com> 于2018年10月10日周三 下午5:27写道:
> Hi Xuefu,
> 
> Welcome to the Flink community and thanks for starting this discussion! Better Hive integration would be really great!
> Can you go into details of what you are proposing? I can think of a couple ways to improve Flink in that regard:
> 
> * Support for Hive UDFs
> * Support for Hive metadata catalog
> * Support for HiveQL syntax
> * ???
> 
> Best, Fabian
> 
> Am Di., 9. Okt. 2018 um 19:22 Uhr schrieb Zhang, Xuefu <xu...@alibaba-inc.com>:
> Hi all,
> 
> Along with the community's effort, inside Alibaba we have explored Flink's potential as an execution engine not just for stream processing but also for batch processing. We are encouraged by our findings and have initiated our effort to make Flink's SQL capabilities full-fledged. When comparing what's available in Flink to the offerings from competitive data processing engines, we identified a major gap in Flink: a well integration with Hive ecosystem. This is crucial to the success of Flink SQL and batch due to the well-established data ecosystem around Hive. Therefore, we have done some initial work along this direction but there are still a lot of effort needed.
> 
> We have two strategies in mind. The first one is to make Flink SQL full-fledged and well-integrated with Hive ecosystem. This is a similar approach to what Spark SQL adopted. The second strategy is to make Hive itself work with Flink, similar to the proposal in [1]. Each approach bears its pros and cons, but they don’t need to be mutually exclusive with each targeting at different users and use cases. We believe that both will promote a much greater adoption of Flink beyond stream processing.
> 
> We have been focused on the first approach and would like to showcase Flink's batch and SQL capabilities with Flink SQL. However, we have also planned to start strategy #2 as the follow-up effort.
> 
> I'm completely new to Flink(, with a short bio [2] below), though many of my colleagues here at Alibaba are long-time contributors. Nevertheless, I'd like to share our thoughts and invite your early feedback. At the same time, I am working on a detailed proposal on Flink SQL's integration with Hive ecosystem, which will be also shared when ready.
> 
> While the ideas are simple, each approach will demand significant effort, more than what we can afford. Thus, the input and contributions from the communities are greatly welcome and appreciated.
> 
> Regards,
> 
> 
> Xuefu
> 
> References:
> 
> [1] https://issues.apache.org/jira/browse/HIVE-10712
> [2] Xuefu Zhang is a long-time open source veteran, worked or working on many projects under Apache Foundation, of which he is also an honored member. About 10 years ago he worked in the Hadoop team at Yahoo where the projects just got started. Later he worked at Cloudera, initiating and leading the development of Hive on Spark project in the communities and across many organizations. Prior to joining Alibaba, he worked at Uber where he promoted Hive on Spark to all Uber's SQL on Hadoop workload and significantly improved Uber's cluster efficiency.
> 
> 

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by "Zhang, Xuefu" <xu...@alibaba-inc.com>.
Hi Fabian/Vno,

Thank you very much for your encouragement inquiry. Sorry that I didn't see Fabian's email until I read Vino's response just now. (Somehow Fabian's went to the spam folder.)

My proposal contains long-term and short-terms goals. Nevertheless, the effort will focus on the following areas, including Fabian's list:

1. Hive metastore connectivity - This covers both read and write access, meaning Flink can make full use of Hive's metastore as its catalog (at least for batch, though this can be extended to streaming as well).
2. Metadata compatibility - Objects (databases, tables, partitions, etc.) created by Hive can be understood by Flink, and the reverse is also true.
3. Data compatibility - Similar to #2, data produced by Hive can be consumed by Flink, and vice versa.
4. Support for Hive UDFs - For all of Hive's native UDFs, Flink either provides its own implementation or makes Hive's implementation work in Flink. Further, for user-created UDFs in Hive, Flink SQL should provide a mechanism that allows users to import them into Flink without requiring any code changes.
5. Data types - Flink SQL should support all data types that are available in Hive.
6. SQL language - Flink SQL should support the SQL standard (such as SQL:2003) with extensions to support Hive's syntax and language features, covering DDL, DML, and SELECT queries.
7. SQL CLI - This is currently being developed in Flink, but more effort is needed.
8. Server - Provide a server that's compatible with HiveServer2's Thrift APIs, such that HiveServer2 users can reuse their existing clients (such as Beeline) but connect to Flink's Thrift server instead.
9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for other applications to use to connect to its Thrift server.
10. Support for other user customizations in Hive, such as Hive SerDes, storage handlers, etc.
11. Better task failure tolerance and task scheduling in the Flink runtime.

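To make item #1 more concrete, here is a purely hypothetical sketch of what working against a Hive-backed catalog from the Flink SQL CLI might look like. The catalog name, database, table, and the exact statements are illustrative assumptions for discussion, not an existing Flink API:

```sql
-- Hypothetical sketch: a Flink SQL session using the Hive metastore as its
-- catalog. All names and syntax here are illustrative, not an existing API.
USE CATALOG myhive;           -- a catalog backed by the Hive metastore
USE sales_db;                 -- a database originally created by Hive

-- A table created and populated by Hive would be directly queryable,
-- with no re-declaration of its schema on the Flink side:
SELECT region, SUM(amount) AS total
FROM orders                   -- Hive-managed table in Hive's data format
GROUP BY region;
```

The point of the sketch is that items #1-#3 together would let existing Hive metadata and data be used from Flink SQL as-is, which is what makes the catalog integration valuable.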
As you can see, achieving all of this requires significant effort across all layers of Flink. However, a short-term goal could include only the core areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope (such as #3, #6).

Please share your further thoughts. If we generally agree that this is the right direction, I can come up with a formal proposal quickly, and then we can follow up with broader discussions.

Thanks,
Xuefu




------------------------------------------------------------------
Sender:vino yang <ya...@gmail.com>
Sent at:2018 Oct 11 (Thu) 09:45
Recipient:Fabian Hueske <fh...@gmail.com>
Cc:dev <de...@flink.apache.org>; Xuefu <xu...@alibaba-inc.com>; user <us...@flink.apache.org>
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Xuefu,

I appreciate this proposal, and like Fabian, I think it would be helpful if you could give more details of the plan.

Thanks, vino.
Fabian Hueske <fh...@gmail.com> wrote on Wed, Oct 10, 2018 at 17:27:

Hi Xuefu,

Welcome to the Flink community and thanks for starting this discussion! Better Hive integration would be really great!
Can you go into details of what you are proposing? I can think of a couple of ways to improve Flink in that regard:

* Support for Hive UDFs
* Support for Hive metadata catalog
* Support for HiveQL syntax
* ???

Best, Fabian

On Tue, Oct 9, 2018 at 19:22, Zhang, Xuefu <xu...@alibaba-inc.com> wrote:
Hi all,

 Along with the community's effort, inside Alibaba we have explored Flink's potential as an execution engine not just for stream processing but also for batch processing. We are encouraged by our findings and have initiated our effort to make Flink's SQL capabilities full-fledged. When comparing what's available in Flink to the offerings from competitive data processing engines, we identified a major gap in Flink: a well integration with Hive ecosystem. This is crucial to the success of Flink SQL and batch due to the well-established data ecosystem around Hive. Therefore, we have done some initial work along this direction but there are still a lot of effort needed.

 We have two strategies in mind. The first one is to make Flink SQL full-fledged and well-integrated with Hive ecosystem. This is a similar approach to what Spark SQL adopted. The second strategy is to make Hive itself work with Flink, similar to the proposal in [1]. Each approach bears its pros and cons, but they don’t need to be mutually exclusive with each targeting at different users and use cases. We believe that both will promote a much greater adoption of Flink beyond stream processing.

 We have been focused on the first approach and would like to showcase Flink's batch and SQL capabilities with Flink SQL. However, we have also planned to start strategy #2 as the follow-up effort.

 I'm completely new to Flink(, with a short bio [2] below), though many of my colleagues here at Alibaba are long-time contributors. Nevertheless, I'd like to share our thoughts and invite your early feedback. At the same time, I am working on a detailed proposal on Flink SQL's integration with Hive ecosystem, which will be also shared when ready.

 While the ideas are simple, each approach will demand significant effort, more than what we can afford. Thus, the input and contributions from the communities are greatly welcome and appreciated.

 Regards,


 Xuefu

 References:

 [1] https://issues.apache.org/jira/browse/HIVE-10712
 [2] Xuefu Zhang is a long-time open source veteran, worked or working on many projects under Apache Foundation, of which he is also an honored member. About 10 years ago he worked in the Hadoop team at Yahoo where the projects just got started. Later he worked at Cloudera, initiating and leading the development of Hive on Spark project in the communities and across many organizations. Prior to joining Alibaba, he worked at Uber where he promoted Hive on Spark to all Uber's SQL on Hadoop workload and significantly improved Uber's cluster efficiency.



Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by vino yang <ya...@gmail.com>.
Hi Xuefu,

I appreciate this proposal, and like Fabian, I think it would be helpful if you could
give more details of the plan.

Thanks, vino.

Fabian Hueske <fh...@gmail.com> wrote on Wed, Oct 10, 2018 at 17:27:

> Hi Xuefu,
>
> Welcome to the Flink community and thanks for starting this discussion!
> Better Hive integration would be really great!
> Can you go into details of what you are proposing? I can think of a couple
> ways to improve Flink in that regard:
>
> * Support for Hive UDFs
> * Support for Hive metadata catalog
> * Support for HiveQL syntax
> * ???
>
> Best, Fabian
>
> On Tue, Oct 9, 2018 at 19:22, Zhang, Xuefu <xuefu.z@alibaba-inc.com> wrote:
>
>> Hi all,
>>
>> Along with the community's effort, inside Alibaba we have explored
>> Flink's potential as an execution engine not just for stream processing but
>> also for batch processing. We are encouraged by our findings and have
>> initiated our effort to make Flink's SQL capabilities full-fledged. When
>> comparing what's available in Flink to the offerings from competitive data
>> processing engines, we identified a major gap in Flink: a well integration
>> with Hive ecosystem. This is crucial to the success of Flink SQL and batch
>> due to the well-established data ecosystem around Hive. Therefore, we have
>> done some initial work along this direction but there are still a lot of
>> effort needed.
>>
>> We have two strategies in mind. The first one is to make Flink SQL
>> full-fledged and well-integrated with Hive ecosystem. This is a similar
>> approach to what Spark SQL adopted. The second strategy is to make Hive
>> itself work with Flink, similar to the proposal in [1]. Each approach bears
>> its pros and cons, but they don’t need to be mutually exclusive with each
>> targeting at different users and use cases. We believe that both will
>> promote a much greater adoption of Flink beyond stream processing.
>>
>> We have been focused on the first approach and would like to showcase
>> Flink's batch and SQL capabilities with Flink SQL. However, we have also
>> planned to start strategy #2 as the follow-up effort.
>>
>> I'm completely new to Flink(, with a short bio [2] below), though many of
>> my colleagues here at Alibaba are long-time contributors. Nevertheless, I'd
>> like to share our thoughts and invite your early feedback. At the same
>> time, I am working on a detailed proposal on Flink SQL's integration with
>> Hive ecosystem, which will be also shared when ready.
>>
>> While the ideas are simple, each approach will demand significant effort,
>> more than what we can afford. Thus, the input and contributions from the
>> communities are greatly welcome and appreciated.
>>
>> Regards,
>>
>>
>> Xuefu
>>
>> References:
>>
>> [1] https://issues.apache.org/jira/browse/HIVE-10712
>> [2] Xuefu Zhang is a long-time open source veteran, worked or working on
>> many projects under Apache Foundation, of which he is also an honored
>> member. About 10 years ago he worked in the Hadoop team at Yahoo where the
>> projects just got started. Later he worked at Cloudera, initiating and
>> leading the development of Hive on Spark project in the communities and
>> across many organizations. Prior to joining Alibaba, he worked at Uber
>> where he promoted Hive on Spark to all Uber's SQL on Hadoop workload and
>> significantly improved Uber's cluster efficiency.
>>
>>
>>

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Posted by Fabian Hueske <fh...@gmail.com>.
Hi Xuefu,

Welcome to the Flink community and thanks for starting this discussion!
Better Hive integration would be really great!
Can you go into details of what you are proposing? I can think of a couple of
ways to improve Flink in that regard:

* Support for Hive UDFs
* Support for Hive metadata catalog
* Support for HiveQL syntax
* ???

Best, Fabian

On Tue, Oct 9, 2018 at 19:22, Zhang, Xuefu <xuefu.z@alibaba-inc.com> wrote:

> Hi all,
>
> Along with the community's effort, inside Alibaba we have explored Flink's
> potential as an execution engine not just for stream processing but also
> for batch processing. We are encouraged by our findings and have initiated
> our effort to make Flink's SQL capabilities full-fledged. When comparing
> what's available in Flink to the offerings from competitive data processing
> engines, we identified a major gap in Flink: a well integration with Hive
> ecosystem. This is crucial to the success of Flink SQL and batch due to the
> well-established data ecosystem around Hive. Therefore, we have done some
> initial work along this direction but there are still a lot of effort
> needed.
>
> We have two strategies in mind. The first one is to make Flink SQL
> full-fledged and well-integrated with Hive ecosystem. This is a similar
> approach to what Spark SQL adopted. The second strategy is to make Hive
> itself work with Flink, similar to the proposal in [1]. Each approach bears
> its pros and cons, but they don’t need to be mutually exclusive with each
> targeting at different users and use cases. We believe that both will
> promote a much greater adoption of Flink beyond stream processing.
>
> We have been focused on the first approach and would like to showcase
> Flink's batch and SQL capabilities with Flink SQL. However, we have also
> planned to start strategy #2 as the follow-up effort.
>
> I'm completely new to Flink(, with a short bio [2] below), though many of
> my colleagues here at Alibaba are long-time contributors. Nevertheless, I'd
> like to share our thoughts and invite your early feedback. At the same
> time, I am working on a detailed proposal on Flink SQL's integration with
> Hive ecosystem, which will be also shared when ready.
>
> While the ideas are simple, each approach will demand significant effort,
> more than what we can afford. Thus, the input and contributions from the
> communities are greatly welcome and appreciated.
>
> Regards,
>
>
> Xuefu
>
> References:
>
> [1] https://issues.apache.org/jira/browse/HIVE-10712
> [2] Xuefu Zhang is a long-time open source veteran, worked or working on
> many projects under Apache Foundation, of which he is also an honored
> member. About 10 years ago he worked in the Hadoop team at Yahoo where the
> projects just got started. Later he worked at Cloudera, initiating and
> leading the development of Hive on Spark project in the communities and
> across many organizations. Prior to joining Alibaba, he worked at Uber
> where he promoted Hive on Spark to all Uber's SQL on Hadoop workload and
> significantly improved Uber's cluster efficiency.
>
>
>
