Posted to dev@flink.apache.org by jincheng sun <su...@gmail.com> on 2018/11/01 09:07:07 UTC

[DISCUSS] Enhancing the functionality and productivity of Table API

Hi all,

With continuous effort from the community, the Flink system has been
steadily improved, which has attracted more and more users. Flink SQL
is a canonical, widely used relational query language. However, there are
still some scenarios where Flink SQL fails to meet user needs in terms of
functionality and ease of use, such as:


   - In terms of functionality

     Iteration, user-defined window, user-defined join, user-defined
     GroupReduce, etc. Users cannot express them with SQL.

   - In terms of ease of use

      - Map - e.g. “dataStream.map(mapFun)”. Although “table.select(udf1(),
        udf2(), udf3()...)” can be used to accomplish the same function,
        with a map() function returning 100 columns one has to define or
        call 100 UDFs when using SQL, which is quite involved.

      - FlatMap - e.g. “dataStream.flatMap(flatMapFun)”. Similarly, it can
        be implemented with “table.join(udtf).select()”. However, it is
        obvious that the DataStream API is easier to use than SQL. (A
        sketch of both workarounds follows this list.)
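
To make these workarounds concrete, here is a minimal sketch in Java,
roughly following the UDF registration and UDTF-join patterns documented
for the Java Table API of that era. ParseField0, SplitTags, the input
table (assumed columns: id, line, tags), and tEnv (a TableEnvironment with
both functions registered under the names used below) are all hypothetical:

    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.TableEnvironment;
    import org.apache.flink.table.functions.ScalarFunction;
    import org.apache.flink.table.functions.TableFunction;

    public class Workarounds {

        // One scalar UDF yields one column, so a 100-column "map" needs
        // 100 such classes (or 100 calls); parseField1 .. parseField99
        // would follow the same pattern.
        public static class ParseField0 extends ScalarFunction {
            public String eval(String line) { return line.split(",")[0]; }
        }

        // Table function (UDTF) for the flatMap-style workaround:
        // one row in, n rows out.
        public static class SplitTags extends TableFunction<String> {
            public void eval(String tags) {
                for (String tag : tags.split(";")) {
                    collect(tag);
                }
            }
        }

        // "Map" via select: one UDF call per output column.
        public static Table map100Columns(Table input) {
            return input.select("parseField0(line), parseField1(line)"); // ... x100
        }

        // "FlatMap" via a UDTF join, then a projection.
        public static Table flatMapTags(Table input, TableEnvironment tEnv) {
            return input
                .join(new Table(tEnv, "splitTags(tags) as (tag)"))
                .select("id, tag");
        }
    }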


Due to the above two reasons, some users have to use the DataStream API or
the DataSet API. But when they do that, they lose the unification of batch
and streaming. They also lose the sophisticated optimizations from Flink
SQL, such as code generation, aggregate-join transpose, and multi-stage
aggregation (sketched below).
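
For readers unfamiliar with these optimizer terms, multi-stage aggregation
is a representative example of what the planner can do behind a relational
program but not behind a black-box UDF. A rough sketch; the clicks table
and its column names are hypothetical:

    // The user writes a single logical aggregation (old-style expression strings):
    Table counts = clicks
        .groupBy("userId")
        .select("userId, userId.count as cnt");

    // The planner may rewrite this into two physical stages to cut the
    // amount of shuffled data:
    //   stage 1 (before the shuffle): partial counts per userId in each task
    //   stage 2 (after the shuffle):  sum of the partial counts per userId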

We believe that enhancing functionality and productivity is vital for the
successful adoption of the Table API. To this end, the Table API still
requires more effort from every contributor in the community. We see a
great opportunity to improve our users' experience through this work. Any
feedback is welcome.

Regards,

Jincheng

Re: [DISCUSS] Enhancing the functionality and productivity of Table API

Posted by jincheng sun <su...@gmail.com>.
Hi Jiangjie,
Thanks a lot for your feedback, and also for our offline discussion!
Yes, you're right! The row-based APIs you mentioned are very friendly to
Flink users!
To follow the concepts of traditional databases, it may be more appropriate
to name the corresponding functions RowValued/TableValued functions. From
the perspective of return values, the Table API then has three kinds of
functions (a sketch follows the list below):

   - ColumnValuedFunction - ScalarFunction & AggregateFunction; the result
   is a column.
   - RowValuedFunction - the MapFunction I will propose is a
   RowValuedFunction; the result is a single row.
   - TableValuedFunction - the FlatMapFunction I will propose is a
   TableValuedFunction; the result is a table.
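
A minimal sketch of how the three kinds might look from the user's side.
The map()/flatMap() methods on Table and the RowValuedFunction /
TableValuedFunction base classes are hypothetical names used only to
illustrate the proposal, not an existing Flink API:

    // ColumnValuedFunction: today's ScalarFunction, one column per call.
    Table t1 = table.select("lowerCase(name), age");

    // RowValuedFunction (hypothetical): one call yields a whole row, so a
    // 100-column result needs one function instead of 100 scalar UDFs.
    Table t2 = table.map(new ParseLine());     // ParseLine extends RowValuedFunction

    // TableValuedFunction (hypothetical): one call yields 0..n rows.
    Table t3 = table.flatMap(new SplitTags()); // SplitTags extends TableValuedFunction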

The details will be described in the following FLIP/design doc.
Regarding the input types, I think we can support both column parameters
and row parameters. I believe this is consistent with what you meant - we
are on the same page, right?

I am glad you like the proposal, and I hope that we can work together to
advance the work!

Best,
Jincheng


Re: [DISCUSS] Enhancing the functionality and productivity of Table API

Posted by Becket Qin <be...@gmail.com>.
Thanks for the proposal, Jincheng.

This makes a lot of sense. As a programming interface, the Table API is
especially attractive because it supports both batch and stream. However,
the relational-only API often forces users to shoehorn their logic into a
bunch of user-defined functions. Introducing more flexible APIs (e.g.,
row-based APIs) to process records would really help here.

Besides the processing API, another useful improvement would be allowing
batch tables and stream tables to run in the same job, which is actually a
quite common scenario.

As you said, there is a lot of work that could be done here. I am looking
forward to the upcoming FLIPs.

Thanks,

Jiangjie (Becket) Qin


Re: [DISCUSS] Enhancing the functionality and productivity of Table API

Posted by jincheng sun <su...@gmail.com>.
Hi Timo,
I am very grateful for your feedback, and I was very excited to hear that
you have also been considering adding a process function to the Table API.

I agree that we should add support for a process function in the Table API;
this is actually part of my proposal for enhancing the functionality of the
Table API. In fact, supporting a ProcessFunction means supporting
user-defined operators. As you said, a ProcessFunction can implement any
logic, including user-defined windows, which leaves the user with enough
freedom and control. At the same time, a Co-ProcessFunction needs to be
supported, so that we can implement user-defined JOIN logic through the
Co-ProcessFunction. Of course, the Co-ProcessFunction also requires
introducing the concept of connect, which will introduce a new
ConnectedTable type in the Table API. And I also think this opens the Table
API to more event-driven applications.

Apart from its timer capability, a ProcessFunction is essentially
equivalent to a FlatMapFunction. So maybe we can support map and flatMap on
Table and support a process function on GroupedTable, because, due to
state, the timers of a ProcessFunction can only apply to a keyed stream. (A
hypothetical sketch follows.)
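
A hypothetical sketch of what this could look like; Table.map()/flatMap(),
GroupedTable.process(), and the function classes named here did not exist
in Flink and are invented purely to illustrate the idea:

    // Stateless row-at-a-time operations directly on a Table:
    Table mapped    = orders.map(new EnrichOrder());      // one row in, one row out
    Table flattened = orders.flatMap(new ExplodeItems()); // one row in, 0..n rows out

    // Stateful, timer-driven logic only after grouping, mirroring how the
    // DataStream API restricts ProcessFunction timers to KeyedStream:
    Table sessions = orders
        .groupBy("userId")
        .process(new SessionLogic()); // SessionLogic extends a hypothetical
                                      // TableProcessFunction and may register
                                      // timers and keyed state, analogous to
                                      // ProcessFunction#onTimer.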

You are right that ANSI SQL has difficulty expressing complex operator
logic such as a ProcessFunction. So once we decide to make these
enhancements to the Table API, it means that Flink SQL covers only ANSI-SQL
operations, while the Table API operations form a superset of SQL. The
Flink high-level API then includes a query language (SQL) and a powerful
programming language (the Table API). In this way, SQL serves the simple
ETL user groups, while the Table API serves user groups that need to
customize complex logic, and these users can still enjoy the benefit of the
query optimizer. Maybe we need more refinement and hard work to support
these functions, but this may be a good direction of effort.

Thanks,
Jincheng


Re: [DISCUSS] Enhancing the functionality and productivity of Table API

Posted by Timo Walther <tw...@apache.org>.
Hi Jincheng,

I have also thought several times about introducing a process function for
the Table API. This would allow defining more complex logic (custom
windows, timers, etc.) embedded in a relational API, with schema awareness
and optimization around the black box. Of course this would mean that the
Table API diverges from the SQL API; however, it would also open the Table
API to more event-driven applications.

Maybe it would be possible to define timers and firing logic using Table
API expressions and UDFs. During planning this would be treated as a
special Calc node. (A sketch of the idea follows.)
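
As a thought experiment, here is a sketch of what expression-based timers
might look like; emitAfter() and timeoutAlert() are invented names, not a
real or proposed API surface:

    // Hypothetical: the timer condition and firing logic are given as Table
    // API expressions and UDFs, so the planner can inspect them and fold
    // them into a special Calc node instead of a black-box operator.
    Table alerts = events
        .groupBy("deviceId")
        .emitAfter("lastEventTime + 5.minutes")                 // declarative timer
        .select("deviceId, timeoutAlert(deviceId, lastValue)"); // fired on timeout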

Just some ideas that might be interesting for new use cases.

Regards,
Timo


On 01.11.18 at 13:12, Aljoscha Krettek wrote:
> Hi Jincheng,
>
> these points sound very good! Are there any concrete proposals for changes? For example a FLIP/design document?
>
> See here for FLIPs: https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
>
> Best,
> Aljoscha

Re: [DISCUSS] Enhancing the functionality and productivity of Table API

Posted by Xiaowei Jiang <xi...@gmail.com>.
Hi Fabian,

I totally agree with you that we should incrementally improve TableAPI. We
don't suggest that we do anything drastic such as replacing DataSet API
yet. We should see how much we can achieve by extending TableAPI cleanly.
By then, we should see if there are any natural boundaries on how far we
can get. Hopefully it should be much easier for the community to reach
consensus at that point.

As you pointed out, the APIs in DataSet have mixed flavors. Some of them
are logical while others are very physical (e.g., MapPartition; see the
sketch below). We should really think about what the use cases for such
physical APIs are. Is it possible that their existence points to
limitations in our current optimizations? If we did better in our
optimizations, could we use more logical APIs to achieve the same tasks? We
did some investigation of the "hard" APIs in DataSet before. We will post
some of the lessons we learned and hope to get feedback from you.
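
For readers less familiar with the DataSet API, mapPartition is a good
example of the "physical" flavor: the user function sees an entire runtime
partition per call, so the result can depend on physical data placement and
the optimizer cannot reason through it. A minimal sketch, assuming input is
a DataSet<String>:

    import org.apache.flink.api.common.functions.MapPartitionFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.util.Collector;

    // Emits one count per physical partition -- a per-partition effect
    // that has no direct logical (relational) equivalent.
    DataSet<Long> counts = input.mapPartition(
        new MapPartitionFunction<String, Long>() {
            @Override
            public void mapPartition(Iterable<String> records, Collector<Long> out) {
                long count = 0;
                for (String ignored : records) {
                    count++;
                }
                out.collect(count);
            }
        });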

Iteration is a very complex and interesting topic. I think that we can live
with optimizations stopping at the boundary of iterations (at least
initially).

Regards,
Xiaowei

On Tue, Nov 6, 2018 at 6:21 PM Fabian Hueske <fh...@gmail.com> wrote:

> Thanks for the replies Xiaowei and others!
>
> You are right, I did not consider the batch optimization that would be
> missing if the DataSet API were ported to extend the DataStream API.
> By extending the scope of the Table API, we can gain holistic logical &
> physical optimization, which would be great!
> Is your plan to move all DataSet API functionality into the Table API?
> If so, do you envision any batch-related API in DataStream at all, or should
> this be done by converting a batch table to DataStream? I'm asking because
> if there were batch features in DataStream, we would need some
> optimization there as well.
>
> I think the proposed separation of Table API (stateless APIs) and
> DataStream (APIs that expose state handling) is a good idea.
> On a side note, the DataSet API discouraged state handling in user
> functions, so porting this to the Table API would be quite "natural".
>
> As I said before, I like that we can incrementally extend the Table API.
> Map and FlatMap functions do not seem too difficult.
> Reduce, GroupReduce, Combine, GroupCombine, MapPartition might be more
> tricky, esp. if we want to support retractions.
> Iterations should be a challenge. I assume that Calcite does not support
> iterations, so we probably need to split query / program and optimize parts
> separately (IIRC, this is also how Flink's own optimizer handles this).
> To what extent are you planning to support explicit physical operations
> like partitioning, sorting, or optimizer hints?
>
> I haven't had a look at the design document that you shared yet. Probably
> I'll find answers to some of my questions there ;-)
>
> Regarding the question of SQL or Table API, I agree that extending the
> scope of the Table API does not limit the scope for SQL.
> By adding more operations to the Table API, we can expand it to use
> cases that are not well served by SQL.
> As others have said, we'll of course continue to extend and improve Flink's
> SQL support (within the bounds of the standard).
>
> Best, Fabian
>
> On Tue, Nov 6, 2018 at 10:09 AM, jincheng sun <sunjincheng121@gmail.com> wrote:
>
> > Hi Jark,
> > Glad to see your feedback!
> > That's correct - the proposal is aiming to extend the functionality of
> > the Table API! I like adding "drop" to fit the use case you mentioned.
> > Not only that: for a 100-column Table whose UDF needs all 100 columns,
> > we don't want to define eval as eval(column0...column99); we prefer to
> > define eval as eval(Row), used like this: table.select(udf(*)). We also
> > need to consider whether we package the columns as a row. In a scenario
> > like this, we classify it as a column operation, and list the changes to
> > the column operation after the map/flatMap/agg/flatAgg phase is
> > completed. Currently, Xiaowei has started a thread outlining what we are
> > proposing. Please see the detail in the mail thread:
> >
> > https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB
> >
> > At this stage the Table API Enhancement Outline is as follows:
> >
> > https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit?usp=sharing
> >
> > Please let us know if you have further thoughts or feedback!
> >
> > Thanks,
> > Jincheng
> >
> > Jark Wu <im...@gmail.com> wrote on Tue, Nov 6, 2018 at 3:35 PM:
> >
> > > Hi Jincheng,
> > >
> > > Thanks for your proposal. I think it is a helpful enhancement for the
> > > Table API and a solid step forward for it.
> > > It doesn't weaken SQL or DataStream, because the conversion between
> > > DataStream and Table still works.
> > > People with advanced cases (e.g. complex and fine-grained state
> > > control) can go with DataStream, but most general cases can stay in
> > > the Table API. This work is aiming to extend the functionality of the
> > > Table API, to extend the usage scenarios, and to help the Table API
> > > become a more widely used API.
> > >
> > > For example, someone wants to drop one column from a 100-column Table.
> > > Currently, we have to convert the Table to a DataStream and use a
> > > MapFunction to do that, or select the remaining 99 columns using the
> > > Table.select API. But if we support a Table.drop() method in the Table
> > > API, it will be a very convenient method that lets users stay in Table.
> > >
> > > Looking forward to the more detailed design and further discussion.
> > >
> > > Regards,
> > > Jark
> > >
> > > jincheng sun <su...@gmail.com> wrote on Tue, Nov 6, 2018 at 1:05 PM:
> > >
> > > > Hi Rong Rong,
> > > >
> > > > Sorry for the late reply, and thanks for your feedback! We will
> > > > continue to add more convenience features to the Table API, such as
> > > > map, flatMap, agg, flatAgg, iteration, etc. And I am very happy that
> > > > you are interested in this proposal. As this is long-term, continuous
> > > > work, we will push it forward in stages. Currently, Xiaowei has
> > > > started a thread outlining what we are proposing. Please see the
> > > > detail in the mail thread:
> > > >
> > > > https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB
> > > >
> > > > The Table API Enhancement Outline is as follows:
> > > >
> > > > https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit?usp=sharing
> > > >
> > > > Please let us know if you have further thoughts or feedback!
> > > >
> > > > Thanks,
> > > > Jincheng
> > > >
> > > > Fabian Hueske <fh...@gmail.com> wrote on Mon, Nov 5, 2018 at 7:03 PM:
> > > >
> > > > > Hi Jincheng,
> > > > >
> > > > > Thanks for this interesting proposal.
> > > > > I like that we can push this effort forward in a very fine-grained
> > > > > manner, i.e., incrementally adding more APIs to the Table API.
> > > > >
> > > > > However, I also have a few questions / concerns.
> > > > > Today, the Table API is tightly integrated with the DataSet and
> > > > > DataStream APIs. It is very easy to convert a Table into a DataSet
> > > > > or DataStream and vice versa. This means it is already easy to
> > > > > combine custom logic and relational operations. What I like is that
> > > > > several aspects are clearly separated, like retraction and
> > > > > timestamp handling (see below), and that all libraries on
> > > > > DataStream/DataSet can be easily combined with relational
> > > > > operations.
> > > > > I can see that adding more functionality to the Table API would
> > > > > remove the distinction between DataSet and DataStream. However,
> > > > > wouldn't we get a similar benefit by extending the DataStream API
> > > > > with proper support for bounded streams (as is the long-term goal
> > > > > of Flink)?
> > > > > I'm also a bit skeptical about the optimization opportunities we
> > > > > would gain. Map/FlatMap UDFs are black boxes that cannot be easily
> > > > > removed without additional information (I did some research on
> > > > > this a few years ago [1]).
> > > > >
> > > > > Moreover, I think there are a few tricky details that need to be
> > > > > resolved to enable a good integration.
> > > > >
> > > > > 1) How to deal with retraction messages? The DataStream API does
> > > > > not have a notion of retractions. How would a MapFunction or
> > > > > FlatMapFunction handle retraction? Do they need to be aware of the
> > > > > change flag? Custom windowing and aggregation logic would certainly
> > > > > need to have that information.
> > > > > 2) How to deal with timestamps? The DataStream API does not give
> > > > > access to timestamps. In the Table API / SQL these are exposed as
> > > > > regular attributes. How can we ensure that timestamp attributes
> > > > > remain valid (i.e. aligned with watermarks) if the output is
> > > > > produced by arbitrary code?
> > > > > There might be more issues of this kind.
> > > > >
> > > > > My main question would be how much we would gain with this
> > > > > proposal over a tight integration of Table API and DataStream API,
> > > > > assuming that batch functionality is moved to DataStream.
> > > > >
> > > > > Best, Fabian
> > > > >
> > > > > [1] http://stratosphere.eu/assets/papers/openingTheBlackBoxes_12.pdf
> > > > >
> > > > >
> > > > > On Mon, Nov 5, 2018 at 2:49 AM, Rong Rong <walterddr@gmail.com> wrote:
> > > > >
> > > > > > Hi Jincheng,
> > > > > >
> > > > > > Thank you for the proposal! I think being able to define a
> > > > > > process / co-process function in the Table API definitely opens
> > > > > > up a whole new level of applications using a unified API.
> > > > > >
> > > > > > In addition, as Tzu-Li and Hequn have mentioned, the optimization
> > > > > > layer of the Table API will already bring additional benefit over
> > > > > > directly programming on top of the DataStream/DataSet API. I am
> > > > > > very interested in and looking forward to seeing the support for
> > > > > > more complex use cases, especially iterations. It will enable the
> > > > > > Table API to define much broader, event-driven use cases such as
> > > > > > real-time ML prediction/training.
> > > > > >
> > > > > > As Timo mentioned, this will make the Table API diverge from the
> > > > > > SQL API. But from my experience the Table API was always giving
> > > > > > me the impression of being a more sophisticated, syntax-aware way
> > > > > > to express relational operations. Looking forward to further
> > > > > > discussion and collaboration on the FLIP doc.
> > > > > >
> > > > > > --
> > > > > > Rong
> > > > > >
> > > > > > On Sun, Nov 4, 2018 at 5:22 PM jincheng sun <sunjincheng121@gmail.com> wrote:
> > > > > >
> > > > > > > Hi tison,
> > > > > > >
> > > > > > > Thanks a lot for your feedback!
> > > > > > > I am very happy to see that community contributors agree to
> > > > > > > enhance the Table API. This is long-term, continuous work; we
> > > > > > > will push it forward in stages. We will soon complete the
> > > > > > > enhancement list of the first phase, and we can then go into
> > > > > > > deep discussion in the google doc. Thanks again for joining
> > > > > > > this very important discussion of the Flink Table API.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Jincheng
> > > > > > >
> > > > > > > Tzu-Li Chen <wa...@gmail.com> wrote on Fri, Nov 2, 2018 at 1:49 PM:
> > > > > > >
> > > > > > > > Hi Jincheng,
> > > > > > > >
> > > > > > > > Thanks a lot for your proposal! I find it a good starting
> > > > > > > > point for internal optimization work and for helping Flink
> > > > > > > > become more user-friendly.
> > > > > > > >
> > > > > > > > AFAIK, DataStream is currently the most popular API, with
> > > > > > > > which Flink users describe their logic in detail. From a more
> > > > > > > > internal view, the conversion from DataStream to JobGraph is
> > > > > > > > quite mechanical and hard to optimize. So when users program
> > > > > > > > with DataStream, they have to learn more internals and spend
> > > > > > > > a lot of time tuning for performance.
> > > > > > > > With your proposal, we provide enhanced functionality in the
> > > > > > > > Table API, so that users can describe their job easily at the
> > > > > > > > Table level. This gives Flink developers an opportunity to
> > > > > > > > introduce an optimization phase while transforming the user
> > > > > > > > program (described by the Table API) to the internal
> > > > > > > > representation.
> > > > > > > >
> > > > > > > > Given a user who wants to start using Flink with simple ETL,
> > > > > > > > pipelining, or analytics, he would find it most naturally
> > > > > > > > described by SQL/Table API. Further, as mentioned by @hequn:
> > > > > > > >
> > > > > > > > > SQL is a widely used language. It follows standards, is a
> > > > > > > > > descriptive language, and is easy to use
> > > > > > > >
> > > > > > > > Thus we can expect that with the enhancement of the SQL/Table
> > > > > > > > API, Flink becomes more friendly to users.
> > > > > > > >
> > > > > > > > Looking forward to the design doc/FLIP!
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > tison.
> > > > > > > >
> > > > > > > >
> > > > > > > > jincheng sun <su...@gmail.com> wrote on Fri, Nov 2, 2018 at 11:46 AM:
> > > > > > > >
> > > > > > > > > Hi Hequn,
> > > > > > > > > Thanks for your feedback! And also thanks for our offline
> > > > > > > > > discussion! You are right, unification of batch and
> > > > > > > > > streaming is very important for the Flink API.
> > > > > > > > > We will provide a more detailed design later. Please let me
> > > > > > > > > know if you have further thoughts or feedback.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Jincheng
> > > > > > > > >
> > > > > > > > > Hequn Cheng <ch...@gmail.com> wrote on Fri, Nov 2, 2018 at 10:02 AM:
> > > > > > > > >
> > > > > > > > > > Hi Jincheng,
> > > > > > > > > >
> > > > > > > > > > Thanks a lot for your proposal. It is very encouraging!
> > > > > > > > > >
> > > > > > > > > > As we all know, SQL is a widely used language. It follows
> > > > > > > > > > standards, is a descriptive language, and is easy to use.
> > > > > > > > > > A powerful feature of SQL is that it supports
> > > > > > > > > > optimization. Users only need to care about the logic of
> > > > > > > > > > the program; the underlying optimizer will help users
> > > > > > > > > > optimize the performance of the program. However, in
> > > > > > > > > > terms of functionality and ease of use, SQL will be
> > > > > > > > > > limited in some scenarios, as described in Jincheng's
> > > > > > > > > > proposal.
> > > > > > > > > >
> > > > > > > > > > Correspondingly, the DataStream/DataSet API can provide
> > > > > > > > > > powerful functionality. Users can write a
> > > > > > > > > > ProcessFunction/CoProcessFunction and get timers.
> > > > > > > > > > Compared with SQL, it provides more functionality and
> > > > > > > > > > flexibility. However, it does not support optimization
> > > > > > > > > > like SQL. Meanwhile, the DataStream/DataSet APIs have not
> > > > > > > > > > been unified, which means that for the same logic, users
> > > > > > > > > > need to write a job for each of stream and batch.
> > > > > > > > > >
> > > > > > > > > > With the Table API, I think we can combine the advantages
> > > > > > > > > > of both. Users can easily write relational operations and
> > > > > > > > > > enjoy optimization. At the same time, it supports more
> > > > > > > > > > functionality and ease of use. Looking forward to the
> > > > > > > > > > detailed design/FLIP.
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Hequn
> > > > > > > > > >
> > > > > > > > > > On Fri, Nov 2, 2018 at 9:48 AM Shaoxuan Wang <wshaoxuan@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Aljoscha,
> > > > > > > > > > > Glad that you like the proposal. We have completed the
> > > > > > > > > > > prototype of most of the newly proposed functionality.
> > > > > > > > > > > Once we collect the feedback from the community, we
> > > > > > > > > > > will come up with a concrete FLIP/design doc.
> > > > > > > > > > >
> > > > > > > > > > > Regards,
> > > > > > > > > > > Shaoxuan
> > > > > > > > > > >

Re: [DISCUSS] Enhancing the functionality and productivity of Table API

Posted by SHI Xiaogang <sh...@gmail.com>.
Hi all,

Thank you for your replies and comments.

I have similar considerations to Piotrek's. My opinion is that two APIs are
enough for Flink: a declarative one (SQL) and an imperative one
(DataStream). From my perspective, most users prefer SQL most of the time
and turn to DataStream when the logic is complex. TableAPI, which falls
somewhere between SQL and DataStream and thus obtains some benefits from
both sides, is rarely the choice at present, because it still requires
programming effort in ad-hoc scenarios and is short of expressiveness for
complex logic.

I agree that more optimization opportunities can be gained from the Table
API than from DataStream. But I think the benefits are marginal. The
obstacle is the side effects in imperative languages. Symbolic programming
can help to some extent, but as it is still declarative, it is not easy to
deal with those side effects. SCOPE did some good work on the optimization
opportunities gained from program analysis of UDFs [1, 2], which may help
in understanding the problem.

I also agree that enhancing the Table API does not mean weakening SQL. But
given the LIMITED resources in the community, I think we should focus our
work on SQL instead of the Table API at present. Many users are looking
forward to Flink SQL providing good performance for both streaming and
batch processing. I think that demand is more urgent and should be put in
the first place.

[1]  Spotting Code Optimizations in Data-Parallel Pipelines through
PeriSCOPE. In OSDI 2012.
[2]  Optimizing Data Shuffling in Data-Parallel Computation by
Understanding User-Defined Functions. In NSDI 2012.

Regards,
Xiaogang

Piotr Nowojski <pi...@data-artisans.com> wrote on Tue, Nov 6, 2018 at 9:24 PM:

> Hi,
>
> What is our intended division/border between Table API and DataSet or
> DataStream? If we want Table API to drift away from SQL that would be a
> valid question.
>
> > Another distinguishing feature of DataStream API is that users get direct
> > access to state/statebackend which we intentionally avoided in Table API
>
> Do we really want to make Table API an equivalent of DataSet/DataStream
> API but without a state? Drawing boundary in such way would make it more
> difficult for users to pick the right tool if for many use cases they could
> use both. What if then at some point of time they came to conclusion that
> they need a small state access for something? If that’s our intended end
> goal separation between Table API and DataStream API, it would be very very
> weird having two very similar APIs, that have tons of small differences,
> but are basically equivalent modulo state accesses.
>
> Maybe instead of duplicating efforts and work between our different APIs
> we should more focus on either interoperability or unifying them?
>
> For example if we would like to differentiate those APIs because of the
> presence/lack of optimiser, maybe the APIs should be the same, but there
> should be a way to tell whether the UDF/operator is deterministic, has side
> effects, etc. And if such operator is found in the plan, the nodes below
> and above could still be subject to regular optimisation rules.
>
> Piotrek

Re: [DISCUSS] Enhancing the functionality and productivity of Table API

Posted by jincheng sun <su...@gmail.com>.
Hi Fabian,

Yes, timers are not only a difference between the Table API and
DataStream, but also a difference between DataStream and DataSet. We need
to unify batch and streaming in the Table API, so the difference regarding
timers needs to be considered in depth. :)

Thanks, Jincheng


Fabian Hueske <fh...@gmail.com> wrote on Thu, Nov 15, 2018 at 8:58 PM:

> Thanks Jincheng,
>
> That makes sense to me.
> Another differentiation of Table API and DataStream API would be the access
> to the timer service.
> The DataStream API can register and act on timers while the Table API would
> not have this feature.
>
> Best, Fabian

Re: [DISCUSS] Enhancing the functionality and productivity of Table API

Posted by Fabian Hueske <fh...@gmail.com>.
Thanks Jincheng,

That makes sense to me.
Another differentiation of Table API and DataStream API would be the access
to the timer service.
The DataStream API can register and act on timers while the Table API would
not have this feature.
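
To make the asymmetry concrete, here is a minimal sketch of the direct
timer access that the DataStream API already exposes via
KeyedProcessFunction (Event and Alert are made-up types for illustration);
the Table API would have no equivalent hook:

    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.util.Collector;

    // Illustrative only: Event/Alert stand in for user-defined types.
    public class TimeoutFunction
            extends KeyedProcessFunction<String, Event, Alert> {

        @Override
        public void processElement(Event event, Context ctx,
                Collector<Alert> out) throws Exception {
            // register a timer 10 seconds after this event's timestamp
            ctx.timerService().registerEventTimeTimer(
                ctx.timestamp() + 10_000L);
        }

        @Override
        public void onTimer(long timestamp, OnTimerContext ctx,
                Collector<Alert> out) throws Exception {
            // fires once the watermark passes the registered time
            out.collect(new Alert(ctx.getCurrentKey(), timestamp));
        }
    }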

Best, Fabian


Re: [DISCUSS] Enhancing the functionality and productivity of Table API

Posted by jincheng sun <su...@gmail.com>.
Hi Piotrek, Fabian:

I am very glad to see your reply. Thank you very much, Piotrek, for asking
such good questions. I will share my opinion:


   - The Table API enhancements that I proposed are aimed at user
   friendliness. After the enhancement, the Table API will keep the
   characteristics of Table API & SQL, such as being declarative and
   enabling optimization.


   - For the difference between the DataStream API and the Table API, I
   think there are two points:
      -  a). State management: DataStream API users can use the state API
      to perform state operations directly, while Table API/SQL users can
      only access state indirectly, e.g., through DataView.
      -  b). Physical operations: the DataStream API can call physical
      operations such as rebalance()/rescale()/shuffle(), while the Table
      API relies on the optimizer to choose the underlying strategy. If
      necessary, we can follow the database hint mechanism and add hints
      to the Table API to influence physical operations, but I don't
      recommend adding operations such as
      rebalance()/rescale()/shuffle()/sortPartition() to the Table API
      itself.
   - About the management of time attributes, we can continue the
   discussion in the TableAPI Enhancement Outline thread:
   https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB
   .


   - Regarding the proposed enhanced Table API, my core goal is to solve
   the usability problem. At this stage, the following four APIs will be
   proposed:
      - Table.map
      - Table.flatMap
      - GroupedTable.agg
      - GroupedTable.flatAgg

                Adding the above APIs is for ease of use. Taking map as an
example: Map - e.g. “dataStream.map(mapFun)”. Although “table.select(udf1(),
udf2(), udf3()....)” can be used to accomplish the same function, with a
map() function returning 100 columns one has to define or call 100 UDFs
when using SQL, which is quite involved (see the sketch below).
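
As a purely hypothetical sketch (Table.map() is the proposed API and does
not exist in Flink yet; ParseEvent is an invented row-valued function),
the contrast would look roughly like this:

    import org.apache.flink.table.functions.ScalarFunction;
    import org.apache.flink.types.Row;

    // Takes the whole input row and returns a row with 100 derived fields.
    public class ParseEvent extends ScalarFunction {
        public Row eval(Row in) {
            Row out = new Row(100);
            // out.setField(0, ...); ...; out.setField(99, ...);
            return out;
        }
    }

    // Today, each output column needs its own UDF call:
    //   table.select("f0(line), f1(line), ..., f99(line)")
    // With the proposal, one call produces all of them:
    //   table.map(new ParseEvent())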

I totally agree that we have to discuss the API changes in depth and let
our community's APIs continue to develop in the right direction. Thanks
again for the reply, and looking forward to your feedback! :)

Best,
Jincheng

Fabian Hueske <fh...@gmail.com> wrote on Tue, Nov 13, 2018 at 9:31 PM:

> Yes, that is my understanding as well.
>
> Manual time management would be another difference.
> Something still to be discussed would be whether (or to what extent) it
> would be possible to define the physical execution plan with hints or
> methods like partitionByHash and sortPartition.
>
> Best, Fabian

Re: [DISCUSS] Enhancing the functionality and productivity of Table API

Posted by Fabian Hueske <fh...@gmail.com>.
Yes, that is my understanding as well.

Manual time management would be another difference.
Something still to be discussed would be whether (or to what extent) it
would be possible to define the physical execution plan with hints or
methods like partitionByHash and sortPartition.
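
For reference, here is a minimal sketch of what those explicit physical
methods look like in today's DataSet API (the data and field indexes are
illustrative only); the open question above is whether the Table API
should expose anything comparable, e.g. via hints:

    import org.apache.flink.api.common.operators.Order;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    DataSet<Tuple2<String, Integer>> input =
        env.fromElements(Tuple2.of("a", 3), Tuple2.of("b", 1), Tuple2.of("a", 2));

    // Explicitly shape the physical plan: hash-partition on field 0,
    // then sort within each partition on field 1.
    DataSet<Tuple2<String, Integer>> prepared = input
        .partitionByHash(0)
        .sortPartition(1, Order.DESCENDING);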

Best, Fabian

Piotr Nowojski <pi...@data-artisans.com> wrote on Tue, Nov 13, 2018 at 13:57:

> Hi,
>
> > This thread is meant to enhance the functionalities of TableAPI. I don't
> > think that anyone is suggesting either reducing the effort in SQL or
> > DataStream. So let's focus on how we can enhance TableAPI.
>
> I wasn’t thinking about that. As I said before, I was raising a question:
> what should the Table API look like in the future if we want to diverge
> it more and more from SQL? It looks to me that the more or less consensus
> is that it should be expanded and evolve in parallel to the DataStream
> API, but in order to better suit different needs, with the following
> differences:
> - declarative
> - predefined schema/types
> - no custom state operations (?)
> - optimisations allowed by the above points
>
> Piotrek

Re: [DISCUSS] Enhancing the functionality and productivity of Table API

Posted by Piotr Nowojski <pi...@data-artisans.com>.
Hi,

> This thread is meant to enhance the functionalities of TableAPI. I don't
> think that anyone is suggesting either reducing the effort in SQL or
> DataStream. So let's focus on how we can enhance TableAPI.

I wasn’t thinking about that. As I said before, I was raising a question: what should the Table API look like in the future if we want to diverge it more and more from SQL? It looks to me that the more or less consensus is that it should be expanded and evolve in parallel to the DataStream API, but in order to better suit different needs, with the following differences:
- declarative
- predefined schema/types
- no custom state operations (?)
- optimisations allowed by the above points

Piotrek



Re: [DISCUSS] Enhancing the functionality and productivity of Table API

Posted by Xiaowei Jiang <xi...@gmail.com>.
Hi Piotr:

I want to clarify one thing first: I think that we will keep the
interoperability between TableAPI and DataStream in any case, so users can
switch between the two whenever needed. Given that, it would still be very
helpful if users could use one API to achieve most of what they do.
Currently, TableAPI/SQL is good enough for most data-analytics kinds of
scenarios, but there are some limitations that, when removed, will help us
go even further in this direction. An initial list of these is provided in
another thread. These are natural extensions to TableAPI which we need to
do just for the sake of making TableAPI more usable.
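
For illustration, this round-tripping already works with today's API (the
example stream and field names are made up):

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.TableEnvironment;
    import org.apache.flink.table.api.java.StreamTableEnvironment;
    import org.apache.flink.types.Row;

    StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();
    StreamTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env);

    DataStream<Tuple2<String, Long>> events =
        env.fromElements(Tuple2.of("alice", 1L), Tuple2.of("bob", 2L));

    // DataStream -> Table: continue declaratively, with optimization
    Table totals = tEnv.fromDataStream(events, "name, cnt")
        .groupBy("name")
        .select("name, cnt.sum as total");

    // Table -> DataStream: drop back down for custom logic when needed
    DataStream<Tuple2<Boolean, Row>> result =
        tEnv.toRetractStream(totals, Row.class);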

TableAPI and SQL share the same underlying implementation, so enhancement
in one will end up helping the other. I don't see them as competitive.
TableAPI is easier to extend because we have a bit more freedom in
adding new functionalities. In reality, TableAPI can be mixed with SQL as
well.

On the implementation side, I agree that Table API/SQL and DataStream
should try to share as much as possible. But that question is orthogonal to
the API discussion.

This thread is meant to enhance the functionalities of TableAPI. I don't
think that anyone is suggesting either reducing the effort in SQL or
DataStream. So let's focus on how we can enhance TableAPI.

Regards,
Xiaowei

Re: [DISCUSS] Enhancing the functionality and productivity of Table API

Posted by Shaoxuan Wang <ws...@gmail.com>.
Hi all,

Thanks for the feedback. I enjoyed the discussions, especially the ones
between Fabian and Xiaowei. I think it well revealed the motivations and
design pros/cons behind this proposal. Enhancing tableAPI will not affect
and limit the improvements on Flink SQL (as well as DataStream). Actually
Alibaba is the biggest fun of Flink SQL and our contributions to Flink SQL
will not cease or reduce because of this.

As a supplement to the motivations, I would like to share some of our
experience and lessons learned at Alibaba. In the past 1-2 years we
upgraded most of our production jobs (including data analysis as well as AI
pipelines) to run on top of the SQL API. Besides SQL, we also provide the
TableAPI to our users and have received lots of interest and new requests.
We extended the TableAPI with some new functionalities, and it indeed
helped a lot (in terms of ease of use as well as performance) in many
cases. This motivates us to contribute it back to the community. Xiaowei
opened another email thread and listed a few things that we are proposing
to add to the TableAPI as a first step. Please take a look. Let us move
the discussions on the design proposal to that thread.

Thanks,
Shaoxuan


On Tue, Nov 6, 2018 at 9:24 PM Piotr Nowojski <pi...@data-artisans.com>
wrote:

> Hi,
>
> What is our intended division/border between the Table API and DataSet or
> DataStream? If we want the Table API to drift away from SQL, that becomes
> a valid question.
>
> > Another distinguishing feature of DataStream API is that users get direct
> > access to state/statebackend which we intensionally avoided in Table API
>
> Do we really want to make the Table API an equivalent of the
> DataSet/DataStream API but without state? Drawing the boundary in such a
> way would make it more difficult for users to pick the right tool if for
> many use cases they could use both. What if at some point they came to the
> conclusion that they need a little state access for something? If that is
> our intended end-goal separation between the Table API and the DataStream
> API, it would be very, very weird to have two very similar APIs that have
> tons of small differences but are basically equivalent modulo state access.
>
> Maybe instead of duplicating effort and work between our different APIs,
> we should focus more on either interoperability or unifying them?
>
> For example, if we would like to differentiate those APIs because of the
> presence/lack of an optimiser, maybe the APIs should be the same, but
> there should be a way to tell whether the UDF/operator is deterministic,
> has side effects, etc. And if such an operator is found in the plan, the
> nodes below and above it could still be subject to regular optimisation
> rules.
>
> Piotrek
>
> > On 6 Nov 2018, at 14:04, Fabian Hueske <fh...@gmail.com> wrote:
> >
> > Hi,
> >
> > An analysis of orthogonal functions would be great!
> > There is certainly some overlap in the functions provided by the DataSet
> > API.
> >
> > In the past, I found that having low-level functions helped a lot to
> > efficiently implement complex logic.
> > Without partitionByHash, sortPartition, sort, mapPartition, combine, etc.
> > it
> > would not be possible to (efficiently) implement certain operators for
> the
> > Table API, SQL or the Cascading-On-Flink port that I did a while back.
> > I could imagine that these APIs would be useful to implement DSLs on top
> of
> > the Table API, such as Gelly.
> >
> > Anyway, I completely agree that these physical operators should not be
> the
> > first step.
> > If we find that these methods are not needed, even better!
> >
> > Let's try to keep this thread focused on the general proposal of
> extending
> > the scope of the Table API and keep the discussion of concrete proposal
> > that Xiaowei shared in the other thread (and the design doc).
> > That will help to keep all related comments in one place ;-)
> >
> > Best, Fabian
> >
> >
> > On Tue., Nov. 6, 2018 at 13:01, jincheng sun <
> > sunjincheng121@gmail.com> wrote:
> >
> >> Hi Fabian,
> >> Thank you for your deep thoughts in this regard. I think most of the
> >> questions you mentioned are well worth in-depth discussion! I want to
> >> share thoughts on the following questions:
> >>
> >> 1. Do we need to move all DataSet API functionality into the Table API?
> >> I think most of the DataSet functionality should be added to the
> >> TableAPI, such as map, flatMap, groupReduce, etc., because these are
> >> very easy for users to use.
> >>
> >> 2. Do we support explicit physical operations like partitioning, sorting
> >> or optimizer hints?
> >> I think we do not want to add the physical operations, e.g.
> >> sortPartition, partitionCustom, etc. From my point of view, those
> >> physical operations are used for optimization, which can be addressed by
> >> hints (I think we should add a hints feature to both the TableAPI and
> >> SQL).
> >>
> >> 3. Do we want to support retractions in iterations?
> >> I think supporting iteration is a very complicated feature. I am not
> >> sure, but I think the iteration may be implemented following the current
> >> batch mode, with retraction temporarily not supported, assuming that the
> >> training data will not be updated within the current iteration; the
> >> updated data will be used in the next iteration. So I think we should
> >> have an in-depth discussion in a new thread.
> >>
> >> BTW, I see that you have left very useful comments in the Google doc:
> >>
> >>
> https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit#
> >>
> >> Thanks again for both your mail feedback and doc comments!
> >>
> >> Best,
> >> Jincheng
> >>
> >>
> >>
> >>> Fabian Hueske <fh...@gmail.com> wrote on Tue, Nov 6, 2018 at 6:21 PM:
> >>
> >>> Thanks for the replies Xiaowei and others!
> >>>
> >>> You are right, I did not consider the batch optimization that would be
> >>> missing if the DataSet API would be ported to extend the DataStream
> API.
> >>> By extending the scope of the Table API, we can gain a holistic
> logical &
> >>> physical optimization which would be great!
> >>> Is your plan to move all DataSet API functionality into the Table API?
> >>> If so, do you envision any batch-related API in DataStream at all or
> >> should
> >>> this be done by converting a batch table to DataStream? I'm asking
> >> because
> >>> if there would be batch features in DataStream, we would need some
> >>> optimization there as well.
> >>>
> >>> I think the proposed separation of Table API (stateless APIs) and
> >>> DataStream (APIs that expose state handling) is a good idea.
> >>> On a side note, the DataSet API discouraged state handling in user
> >>> functions, so porting this to the Table API would be quite "natural".
> >>>
> >>> As I said before, I like that we can incrementally extend the Table
> API.
> >>> Map and FlatMap functions do not seem too difficult.
> >>> Reduce, GroupReduce, Combine, GroupCombine, MapPartition might be more
> >>> tricky, esp. if we want to support retractions.
> >>> Iterations should be a challenge. I assume that Calcite does not
> support
> >>> iterations, so we probably need to split query / program and optimize
> >> parts
> >>> separately (IIRC, this is also how Flink's own optimizer handles this).
> >>> To what extent are you planning to support explicit physical operations
> >>> like partitioning, sorting or optimizer hints?
> >>>
> >>> I haven't had a look in the design document that you shared. Probably,
> I
> >>> find answers to some of my questions there ;-)
> >>>
> >>> Regarding the question of SQL or Table API, I agree that extending the
> >>> scope of the Table API does not limit the scope for SQL.
> >>> By adding more operations to the Table API we can expand it to use case
> >>> that are not well-served by SQL.
> >>> As others have said, we'll of course continue to extend and improve
> >> Flink's
> >>> SQL support (within the bounds of the standard).
> >>>
> >>> Best, Fabian
> >>>
> >>> On Tue., Nov. 6, 2018 at 10:09, jincheng sun <
> >>> sunjincheng121@gmail.com> wrote:
> >>>
> >>>> Hi Jark,
> >>>> Glad to see your feedback!
> >>>> That's correct. The proposal aims to extend the functionality of the
> >>>> Table API! I like adding "drop" to fit the use case you mentioned. Not
> >>>> only that: if we have a 100-column Table and our UDF needs all 100
> >>>> columns, we don't want to define eval as eval(column0...column99); we
> >>>> prefer to define eval as eval(Row), used like this:
> >>>> table.select(udf(*)). So we also need to consider whether to package
> >>>> the columns as a row. In a scenario like this, we classify it as a
> >>>> column operation, and list the changes to the column operations after
> >>>> the map/flatMap/agg/flatAgg phase is completed. Currently, Xiaowei has
> >>>> started a thread outlining what we are proposing. Please see the
> >>>> detail in the mail thread:
> >>>>
> >>>>
> >>>
> >>
> https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB
> >>>> <
> >>>>
> >>>
> >>
> https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB
> >>>>>
> >>>> .
> >>>>
> >>>> At this stage the Table API Enhancement Outline is as follows:
> >>>>
> >>>>
> >>>
> >>
> https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit?usp=sharing
> >>>>
> >>>> Please let us know if you have further thoughts or feedback!
> >>>>
> >>>> Thanks,
> >>>> Jincheng
> >>>>
> >>>>
> >>>> Jark Wu <im...@gmail.com> wrote on Tue, Nov 6, 2018 at 3:35 PM:
> >>>>
> >>>>> Hi jincheng,
> >>>>>
> >>>>> Thanks for your proposal. I think it is a helpful enhancement for
> >>>>> TableAPI and a solid step forward for TableAPI.
> >>>>> It doesn't weaken SQL or DataStream, because the conversion between
> >>>>> DataStream and Table still works.
> >>>>> People with advanced cases (e.g. complex and fine-grained state
> >>>>> control) can go with DataStream,
> >>>>> but most general cases can stay in TableAPI. This work aims to extend
> >>>>> the functionality of TableAPI,
> >>>>> to extend its usage scenarios, and to help TableAPI become a more
> >>>>> widely used API.
> >>>>>
> >>>>> For example, someone may want to drop one column from a 100-column
> >>>>> Table. Currently, we have to convert
> >>>>> the Table to a DataStream and use a MapFunction to do that, or select
> >>>>> the remaining 99 columns using the Table.select API.
> >>>>> But if we support a Table.drop() method in TableAPI, it will be a
> >>>>> very convenient method and let users stay in Table.
> >>>>>
> >>>>> Looking forward to the more detailed design and further discussion.
> >>>>>
> >>>>> Regards,
> >>>>> Jark
> >>>>>
> >>>>> jincheng sun <su...@gmail.com> wrote on Tue, Nov 6, 2018 at 1:05 PM:
> >>>>>
> >>>>>> Hi Rong Rong,
> >>>>>>
> >>>>>> Sorry for the late reply, and thanks for your feedback! We will
> >>>>>> continue to add more convenience features to the TableAPI, such as
> >>>>>> map, flatMap, agg, flatAgg, iteration, etc. And I am very happy that
> >>>>>> you are interested in this proposal. Since this is long-term,
> >>>>>> continuous work, we will push it forward in stages. Currently,
> >>>>>> Xiaowei has started a thread outlining what we are proposing. Please
> >>>>>> see the detail in the mail thread:
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB
> >>>>>> <
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB
> >>>>>>>
> >>>>>> .
> >>>>>>
> >>>>>> The Table API Enhancement Outline is as follows:
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit?usp=sharing
> >>>>>>
> >>>>>> Please let us know if you have further thoughts or feedback!
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Jincheng
> >>>>>>
> >>>>>> Fabian Hueske <fh...@gmail.com> wrote on Mon, Nov 5, 2018 at 7:03 PM:
> >>>>>>
> >>>>>>> Hi Jincheng,
> >>>>>>>
> >>>>>>> Thanks for this interesting proposal.
> >>>>>>> I like that we can push this effort forward in a very
> >> fine-grained
> >>>>>> manner,
> >>>>>>> i.e., incrementally adding more APIs to the Table API.
> >>>>>>>
> >>>>>>> However, I also have a few questions / concerns.
> >>>>>>> Today, the Table API is tightly integrated with the DataSet and
> >>>>>> DataStream
> >>>>>>> APIs. It is very easy to convert a Table into a DataSet or
> >>> DataStream
> >>>>> and
> >>>>>>> vice versa. This means it is already easy to combine custom logic
> >>>>>>> and relational operations. What I like is that several aspects are
> >>>> clearly
> >>>>>>> separated like retraction and timestamp handling (see below) +
> >> all
> >>>>>>> libraries on DataStream/DataSet can be easily combined with
> >>>> relational
> >>>>>>> operations.
> >>>>>>> I can see that adding more functionality to the Table API would
> >>>> remove
> >>>>>> the
> >>>>>>> distinction between DataSet and DataStream. However, wouldn't we
> >>> get
> >>>> a
> >>>>>>> similar benefit by extending the DataStream API for proper
> >> support
> >>>> for
> >>>>>>> bounded streams (as is the long-term goal of Flink)?
> >>>>>>> I'm also a bit skeptical about the optimization opportunities we
> >>>> would
> >>>>>>> gain. Map/FlatMap UDFs are black boxes that cannot be easily
> >>> removed
> >>>>>>> without additional information (I did some research on this a few
> >>>> years
> >>>>>> ago
> >>>>>>> [1]).
> >>>>>>>
> >>>>>>> Moreover, I think there are a few tricky details that need to be
> >>>>> resolved
> >>>>>>> to enable a good integration.
> >>>>>>>
> >>>>>>> 1) How to deal with retraction messages? The DataStream API does
> >>> not
> >>>>>> have a
> >>>>>>> notion of retractions. How would a MapFunction or FlatMapFunction
> >>>>> handle
> >>>>>>> retraction? Do they need to be aware of the change flag? Custom
> >>>>> windowing
> >>>>>>> and aggregation logic would certainly need to have that
> >>> information.
> >>>>>>> 2) How to deal with timestamps? The DataStream API does not give
> >>>> access
> >>>>>> to
> >>>>>>> timestamps. In the Table API / SQL these are exposed as regular
> >>>>>> attributes.
> >>>>>>> How can we ensure that timestamp attributes remain valid (i.e.
> >>>> aligned
> >>>>>> with
> >>>>>>> watermarks) if the output is produced by arbitrary code?
> >>>>>>> There might be more issues of this kind.
> >>>>>>>
> >>>>>>> My main question would be how much would we gain with this
> >> proposal
> >>>>> over
> >>>>>> a
> >>>>>>> tight integration of Table API and DataStream API, assuming that
> >>>> batch
> >>>>>>> functionality is moved to DataStream?
> >>>>>>>
> >>>>>>> Best, Fabian
> >>>>>>>
> >>>>>>> [1]
> >>> http://stratosphere.eu/assets/papers/openingTheBlackBoxes_12.pdf
> >>>>>>>
> >>>>>>>
> >>>>>>> On Mon., Nov. 5, 2018 at 02:49, Rong Rong <
> >>>>>>> walterddr@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Hi Jincheng,
> >>>>>>>>
> >>>>>>>> Thank you for the proposal! I think being able to define a
> >>> process
> >>>> /
> >>>>>>>> co-process function in table API definitely opens up a whole
> >> new
> >>>>> level
> >>>>>> of
> >>>>>>>> applications using a unified API.
> >>>>>>>>
> >>>>>>>> In addition, as Tzu-Li and Hequn have mentioned, the benefit of
> >>>>>>>> the optimization layer of the Table API will already bring in
> >>>>>>>> additional benefit over directly programming on top of the
> >>>>>>>> DataStream/DataSet API. I am very interested in and looking
> >>>>>>>> forward to seeing the support for more complex use cases,
> >>>>>>>> especially iterations. It will enable the Table API to define much
> >>>>>>>> broader, event-driven use cases such as real-time ML
> >>>>>>>> prediction/training.
> >>>>>>>>
> >>>>>>>> As Timo mentioned, this will make the Table API diverge from the
> >>>>>>>> SQL API. But from my experience the Table API was always giving me
> >>>>>>>> the impression of being a more sophisticated, syntax-aware way to
> >>>>>>>> express relational operations.
> >>>>>>>> Looking forward to further discussion and collaborations on the
> >>>>>>>> FLIP doc.
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Rong
> >>>>>>>>
> >>>>>>>> On Sun, Nov 4, 2018 at 5:22 PM jincheng sun <
> >>>>> sunjincheng121@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hi tison,
> >>>>>>>>>
> >>>>>>>>> Thanks a lot for your feedback!
> >>>>>>>>> I am very happy to see that community contributors agree to
> >>>>>>>>> enhance the TableAPI. This is long-term, continuous work; we will
> >>>>>>>>> push it forward in stages. We will soon complete the enhancement
> >>>>>>>>> list for the first phase, and we can then go into deep discussion
> >>>>>>>>> in the Google doc. Thanks again for joining this very important
> >>>>>>>>> discussion of the Flink Table API.
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Jincheng
> >>>>>>>>>
> >>>>>>>>> Tzu-Li Chen <wa...@gmail.com> wrote on Fri, Nov 2, 2018 at 1:49 PM:
> >>>>>>>>>
> >>>>>>>>>> Hi jincheng,
> >>>>>>>>>>
> >>>>>>>>>> Thanks a lot for your proposal! I find it is a good starting
> >>>>>>>>>> point for internal optimization work, and it will help Flink
> >>>>>>>>>> become more user-friendly.
> >>>>>>>>>>
> >>>>>>>>>> AFAIK, DataStream is currently the most popular API, with which
> >>>>>>>>>> Flink users describe their logic in full detail.
> >>>>>>>>>> From a more internal view, the conversion from DataStream to
> >>>>>>>>>> JobGraph is quite mechanical and hard to optimize. So when users
> >>>>>>>>>> program with DataStream, they have to learn more internals and
> >>>>>>>>>> spend a lot of time tuning for performance.
> >>>>>>>>>> With your proposal, we provide enhanced functionality in the
> >>>>>>>>>> Table API, so that users can describe their jobs easily at the
> >>>>>>>>>> Table level. This gives Flink developers an opportunity to
> >>>>>>>>>> introduce an optimization phase while transforming the user
> >>>>>>>>>> program (described by the Table API) into the internal
> >>>>>>>>>> representation.
> >>>>>>>>>>
> >>>>>>>>>> Given a user who wants to start using Flink with simple ETL,
> >>>>>>>>>> pipelining or analytics, he would find it is most naturally
> >>>>>>>>>> described by the SQL/Table API. Further, as mentioned by @hequn,
> >>>>>>>>>>
> >>>>>>>>>> SQL is a widely used language. It follows standards, is a
> >>>>>>>>>>> descriptive language, and is easy to use
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> thus we could expect with the enhancement of SQL/Table API,
> >>>> Flink
> >>>>>>>>>> becomes more friendly to users.
> >>>>>>>>>>
> >>>>>>>>>> Looking forward to the design doc/FLIP!
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> tison.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> jincheng sun <su...@gmail.com> wrote on Fri, Nov 2, 2018
> >>>>>>>>>> at 11:46 AM:
> >>>>>>>>>>
> >>>>>>>>>>> Hi Hequn,
> >>>>>>>>>>> Thanks for your feedback! And also thanks for our offline
> >>>>>>>>>>> discussion!
> >>>>>>>>>>> You are right, unification of batch and streaming is very
> >>>>>>>>>>> important for the Flink APIs.
> >>>>>>>>>>> We will provide a more detailed design later. Please let me
> >>>>>>>>>>> know if you have further thoughts or feedback.
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Jincheng
> >>>>>>>>>>>
> >>>>>>>>>>> Hequn Cheng <ch...@gmail.com> wrote on Fri, Nov 2, 2018
> >>>>>>>>>>> at 10:02 AM:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi Jincheng,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks a lot for your proposal. It is very encouraging!
> >>>>>>>>>>>>
> >>>>>>>>>>>> As we all know, SQL is a widely used language. It follows
> >>>>>>>>>>>> standards, is a descriptive language, and is easy to use. A
> >>>>>>>>>>>> powerful feature of SQL is that it supports optimization:
> >>>>>>>>>>>> users only need to care about the logic of the program; the
> >>>>>>>>>>>> underlying optimizer will help users optimize the performance
> >>>>>>>>>>>> of the program. However, in terms of functionality and ease of
> >>>>>>>>>>>> use, in some scenarios SQL will be limited, as described in
> >>>>>>>>>>>> Jincheng's proposal.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Correspondingly, the DataStream/DataSet APIs provide powerful
> >>>>>>>>>>>> functionality. Users can write a
> >>>>>>>>>>>> ProcessFunction/CoProcessFunction and use timers. Compared
> >>>>>>>>>>>> with SQL, they provide more functionality and flexibility.
> >>>>>>>>>>>> However, they do not support optimization like SQL does.
> >>>>>>>>>>>> Meanwhile, the DataStream/DataSet APIs have not been unified,
> >>>>>>>>>>>> which means that for the same logic, users need to write a
> >>>>>>>>>>>> separate job for streaming and for batch.
> >>>>>>>>>>>>
> >>>>>>>>>>>> With the Table API, I think we can combine the advantages of
> >>>>>>>>>>>> both. Users can easily write relational operations and enjoy
> >>>>>>>>>>>> optimization. At the same time, it supports more functionality
> >>>>>>>>>>>> and ease of use. Looking forward to the detailed design/FLIP.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Hequn
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Fri, Nov 2, 2018 at 9:48 AM Shaoxuan Wang <
> >>>>>>> wshaoxuan@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Aljoscha,
> >>>>>>>>>>>>> Glad that you like the proposal. We have completed
> >> the
> >>>>>>> prototype
> >>>>>>>> of
> >>>>>>>>>>> most
> >>>>>>>>>>>>> new proposed functionalities. Once collect the
> >> feedback
> >>>>> from
> >>>>>>>>>> community,
> >>>>>>>>>>>> we
> >>>>>>>>>>>>> will come up with a concrete FLIP/design doc.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>> Shaoxuan
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Thu, Nov 1, 2018 at 8:12 PM Aljoscha Krettek <
> >>>>>>>>> aljoscha@apache.org
> >>>>>>>>>>>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Jincheng,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> these points sound very good! Are there any
> >> concrete
> >>>>>>> proposals
> >>>>>>>>> for
> >>>>>>>>>>>>>> changes? For example a FLIP/design document?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> See here for FLIPs:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>> Aljoscha
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On 1. Nov 2018, at 12:51, jincheng sun <
> >>>>>>>>> sunjincheng121@gmail.com
> >>>>>>>>>>>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> *--------I am sorry for the formatting of the
> >> email
> >>>>>>> content.
> >>>>>>>> I
> >>>>>>>>>>>> reformat
> >>>>>>>>>>>>>>> the **content** as follows-----------*
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> *Hi ALL,*
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> With the continuous efforts from the community,
> >> the
> >>>>> Flink
> >>>>>>>>> system
> >>>>>>>>>>> has
> >>>>>>>>>>>>> been
> >>>>>>>>>>>>>>> continuously improved, which has attracted more
> >> and
> >>>>> more
> >>>>>>>> users.
> >>>>>>>>>>> Flink
> >>>>>>>>>>>>> SQL
> >>>>>>>>>>>>>>> is a canonical, widely used relational query
> >>>> language.
> >>>>>>>> However,
> >>>>>>>>>>> there
> >>>>>>>>>>>>> are
> >>>>>>>>>>>>>>> still some scenarios where Flink SQL failed to
> >> meet
> >>>>> user
> >>>>>>>> needs
> >>>>>>>>> in
> >>>>>>>>>>>> terms
> >>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>> functionality and ease of use, such as:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> *1. In terms of functionality*
> >>>>>>>>>>>>>>>   Iteration, user-defined window, user-defined
> >>> join,
> >>>>>>>>>> user-defined
> >>>>>>>>>>>>>>> GroupReduce, etc. Users cannot express them with
> >>> SQL;
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> *2. In terms of ease of use*
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>  - Map - e.g. “dataStream.map(mapFun)”. Although
> >>>>>>>>>>>> “table.select(udf1(),
> >>>>>>>>>>>>>>>  udf2(), udf3()....)” can be used to accomplish
> >>> the
> >>>>> same
> >>>>>>>>>>> function.,
> >>>>>>>>>>>>>> with a
> >>>>>>>>>>>>>>>  map() function returning 100 columns, one has
> >> to
> >>>>> define
> >>>>>>> or
> >>>>>>>>> call
> >>>>>>>>>>> 100
> >>>>>>>>>>>>>> UDFs
> >>>>>>>>>>>>>>>  when using SQL, which is quite involved.
> >>>>>>>>>>>>>>>  - FlatMap -  e.g.
> >>> “dataStream.flatmap(flatMapFun)”.
> >>>>>>>> Similarly,
> >>>>>>>>>> it
> >>>>>>>>>>>> can
> >>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>  implemented with “table.join(udtf).select()”.
> >>>>> However,
> >>>>>> it
> >>>>>>>> is
> >>>>>>>>>>>> obvious
> >>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>  dataStream is easier to use than SQL.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Due to the above two reasons, some users have to
> >>> use
> >>>>> the
> >>>>>>>>>> DataStream
> >>>>>>>>>>>> API
> >>>>>>>>>>>>>> or
> >>>>>>>>>>>>>>> the DataSet API. But when they do that, they lose
> >>> the
> >>>>>>>>> unification
> >>>>>>>>>>> of
> >>>>>>>>>>>>>> batch
> >>>>>>>>>>>>>>> and streaming. They will also lose the
> >>> sophisticated
> >>>>>>>>>> optimizations
> >>>>>>>>>>>> such
> >>>>>>>>>>>>>> as
> >>>>>>>>>>>>>>> codegen, aggregate join transpose and multi-stage
> >>> agg
> >>>>>> from
> >>>>>>>>> Flink
> >>>>>>>>>>> SQL.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> We believe that enhancing the functionality and
> >>>>>>> productivity
> >>>>>>>> is
> >>>>>>>>>>> vital
> >>>>>>>>>>>>> for
> >>>>>>>>>>>>>>> the successful adoption of Table API. To this
> >> end,
> >>>>> Table
> >>>>>>> API
> >>>>>>>>>> still
> >>>>>>>>>>>>>>> requires more efforts from every contributor in
> >> the
> >>>>>>>> community.
> >>>>>>>>> We
> >>>>>>>>>>> see
> >>>>>>>>>>>>>> great
> >>>>>>>>>>>>>>> opportunity in improving our user’s experience
> >> from
> >>>>> this
> >>>>>>>> work.
> >>>>>>>>>> Any
> >>>>>>>>>>>>>> feedback
> >>>>>>>>>>>>>>> is welcome.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Jincheng
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> jincheng sun <su...@gmail.com>
> >>>> wrote on Thu, Nov 1, 2018
> >>>> at 5:07 PM:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> With the continuous efforts from the community,
> >>> the
> >>>>>> Flink
> >>>>>>>>> system
> >>>>>>>>>>> has
> >>>>>>>>>>>>>> been
> >>>>>>>>>>>>>>>> continuously improved, which has attracted more
> >>> and
> >>>>> more
> >>>>>>>>> users.
> >>>>>>>>>>>> Flink
> >>>>>>>>>>>>>> SQL
> >>>>>>>>>>>>>>>> is a canonical, widely used relational query
> >>>> language.
> >>>>>>>>> However,
> >>>>>>>>>>>> there
> >>>>>>>>>>>>>> are
> >>>>>>>>>>>>>>>> still some scenarios where Flink SQL failed to
> >>> meet
> >>>>> user
> >>>>>>>> needs
> >>>>>>>>>> in
> >>>>>>>>>>>>> terms
> >>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>> functionality and ease of use, such as:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>  -
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>  In terms of functionality
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Iteration, user-defined window, user-defined
> >> join,
> >>>>>>>>> user-defined
> >>>>>>>>>>>>>>>> GroupReduce, etc. Users cannot express them with
> >>>> SQL;
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>  -
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>  In terms of ease of use
> >>>>>>>>>>>>>>>>  -
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>     Map - e.g. “dataStream.map(mapFun)”.
> >> Although
> >>>>>>>>>>>>> “table.select(udf1(),
> >>>>>>>>>>>>>>>>     udf2(), udf3()....)” can be used to
> >>> accomplish
> >>>>> the
> >>>>>>> same
> >>>>>>>>>>>>> function.,
> >>>>>>>>>>>>>> with a
> >>>>>>>>>>>>>>>>     map() function returning 100 columns, one
> >> has
> >>>> to
> >>>>>>> define
> >>>>>>>>> or
> >>>>>>>>>>> call
> >>>>>>>>>>>>>> 100 UDFs
> >>>>>>>>>>>>>>>>     when using SQL, which is quite involved.
> >>>>>>>>>>>>>>>>     -
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>     FlatMap -  e.g.
> >>>> “dataStream.flatmap(flatMapFun)”.
> >>>>>>>>> Similarly,
> >>>>>>>>>>> it
> >>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>     be implemented with
> >>>> “table.join(udtf).select()”.
> >>>>>>>> However,
> >>>>>>>>>> it
> >>>>>>>>>>> is
> >>>>>>>>>>>>>> obvious
> >>>>>>>>>>>>>>>>     that datastream is easier to use than SQL.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Due to the above two reasons, some users have to
> >>> use
> >>>>> the
> >>>>>>>>>>> DataStream
> >>>>>>>>>>>>> API
> >>>>>>>>>>>>>> or
> >>>>>>>>>>>>>>>> the DataSet API. But when they do that, they
> >> lose
> >>>> the
> >>>>>>>>>> unification
> >>>>>>>>>>> of
> >>>>>>>>>>>>>> batch
> >>>>>>>>>>>>>>>> and streaming. They will also lose the
> >>> sophisticated
> >>>>>>>>>> optimizations
> >>>>>>>>>>>>> such
> >>>>>>>>>>>>>> as
> >>>>>>>>>>>>>>>> codegen, aggregate join transpose  and
> >> multi-stage
> >>>> agg
> >>>>>>> from
> >>>>>>>>>> Flink
> >>>>>>>>>>>> SQL.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> We believe that enhancing the functionality and
> >>>>>>> productivity
> >>>>>>>>> is
> >>>>>>>>>>>> vital
> >>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>> the successful adoption of Table API. To this
> >> end,
> >>>>>> Table
> >>>>>>>> API
> >>>>>>>>>>> still
> >>>>>>>>>>>>>>>> requires more efforts from every contributor in
> >>> the
> >>>>>>>> community.
> >>>>>>>>>> We
> >>>>>>>>>>>> see
> >>>>>>>>>>>>>> great
> >>>>>>>>>>>>>>>> opportunity in improving our user’s experience
> >>> from
> >>>>> this
> >>>>>>>> work.
> >>>>>>>>>> Any
> >>>>>>>>>>>>>> feedback
> >>>>>>>>>>>>>>>> is welcome.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Jincheng
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
>
>

Re: [DISCUSS] Enhancing the functionality and productivity of Table API

Posted by Piotr Nowojski <pi...@data-artisans.com>.
Hi,

What is our intended division/border between the Table API and DataSet or DataStream? If we want the Table API to drift away from SQL, that becomes a valid question.

> Another distinguishing feature of DataStream API is that users get direct
> access to state/statebackend which we intensionally avoided in Table API

Do we really want to make the Table API an equivalent of the DataSet/DataStream API but without state? Drawing the boundary in such a way would make it more difficult for users to pick the right tool if for many use cases they could use both. What if at some point they came to the conclusion that they need a little state access for something? If that is our intended end-goal separation between the Table API and the DataStream API, it would be very, very weird to have two very similar APIs that have tons of small differences but are basically equivalent modulo state access.
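
For reference, this is the kind of direct state access the DataStream API
exposes today and the Table API deliberately hides (a minimal keyed-state
sketch in Java; imports omitted):

    // Count events per key with explicit, user-managed state.
    // Used as: stream.keyBy(0).flatMap(new CountPerKey());
    public class CountPerKey extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {
        private transient ValueState<Long> count;

        @Override
        public void open(Configuration parameters) {
            count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
        }

        @Override
        public void flatMap(Tuple2<String, Long> in, Collector<Tuple2<String, Long>> out) throws Exception {
            long updated = (count.value() == null ? 0L : count.value()) + 1L;
            count.update(updated);
            out.collect(Tuple2.of(in.f0, updated));
        }
    }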

Maybe instead of duplicating effort and work between our different APIs, we should focus more on either interoperability or unifying them?

For example, if we would like to differentiate those APIs because of the presence/lack of an optimiser, maybe the APIs should be the same, but there should be a way to tell whether the UDF/operator is deterministic, has side effects, etc. And if such an operator is found in the plan, the nodes below and above it could still be subject to regular optimisation rules.
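
Purely as a sketch of what such a hint could look like (the annotation
below is invented for illustration and does not exist in Flink):

    // Hypothetical marker: promises the planner that this UDF is pure,
    // so relational operators around it may still be reordered.
    @Deterministic(hasSideEffects = false)  // invented, not a real Flink API
    public class NormalizeUrl extends ScalarFunction {
        public String eval(String url) {
            return url == null ? null : url.trim().toLowerCase();
        }
    }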

Piotrek

> On 6 Nov 2018, at 14:04, Fabian Hueske <fh...@gmail.com> wrote:
> 
> Hi,
> 
> An analysis of orthogonal functions would be great!
> There is certainly some overlap in the functions provided by the DataSet
> API.
> 
> In the past, I found that having low-level functions helped a lot to
> efficiently implement complex logic.
> Without partitionByHash, sortPartition, sort, mapPartition, combine, etc. it
> would not be possible to (efficiently) implement certain operators for the
> Table API, SQL or the Cascading-On-Flink port that I did a while back.
> I could imagine that these APIs would be useful to implement DSLs on top of
> the Table API, such as Gelly.
> 
> Anyway, I completely agree that these physical operators should not be the
> first step.
> If we find that these methods are not needed, even better!
> 
> Let's try to keep this thread focused on the general proposal of extending
> the scope of the Table API and keep the discussion of concrete proposal
> that Xiaowei shared in the other thread (and the design doc).
> That will help to keep all related comments in one place ;-)
> 
> Best, Fabian
> 
> 
> On Tue., Nov. 6, 2018 at 13:01, jincheng sun <
> sunjincheng121@gmail.com> wrote:
> 
>> Hi Fabian,
>> Thank you for your deep thoughts in this regard. I think most of the
>> questions you mentioned are well worth in-depth discussion! I want to
>> share thoughts on the following questions:
>> 
>> 1. Do we need to move all DataSet API functionality into the Table API?
>> I think most of the DataSet functionality should be added to the TableAPI,
>> such as map, flatMap, groupReduce, etc., because these are very easy for
>> users to use.
>> 
>> 2. Do we support explicit physical operations like partitioning, sorting
>> or optimizer hints?
>> I think we do not want to add the physical operations, e.g.
>> sortPartition, partitionCustom, etc. From my point of view, those physical
>> operations are used for optimization, which can be addressed by hints (I
>> think we should add a hints feature to both the TableAPI and SQL).
>> 
>> 3. Do we want to support retractions in iterations?
>> I think supporting iteration is a very complicated feature. I am not
>> sure, but I think the iteration may be implemented following the current
>> batch mode, with retraction temporarily not supported, assuming that the
>> training data will not be updated within the current iteration; the
>> updated data will be used in the next iteration. So I think we should have
>> an in-depth discussion in a new thread.
>> 
>> BTW, I see that you have left very useful comments in the Google doc:
>> 
>> https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit#
>> 
>> Thanks again for both your mail feedback and doc comments!
>> 
>> Best,
>> Jincheng
>> 
>> 
>> 
>> Fabian Hueske <fh...@gmail.com> wrote on Tue, Nov 6, 2018 at 6:21 PM:
>> 
>>> Thanks for the replies Xiaowei and others!
>>> 
>>> You are right, I did not consider the batch optimization that would be
>>> missing if the DataSet API would be ported to extend the DataStream API.
>>> By extending the scope of the Table API, we can gain a holistic logical &
>>> physical optimization which would be great!
>>> Is your plan to move all DataSet API functionality into the Table API?
>>> If so, do you envision any batch-related API in DataStream at all or
>> should
>>> this be done by converting a batch table to DataStream? I'm asking
>> because
>>> if there would be batch features in DataStream, we would need some
>>> optimization there as well.
>>> 
>>> I think the proposed separation of Table API (stateless APIs) and
>>> DataStream (APIs that expose state handling) is a good idea.
>>> On a side note, the DataSet API discouraged state handling in user
>>> functions, so porting this to the Table API would be quite "natural".
>>> 
>>> As I said before, I like that we can incrementally extend the Table API.
>>> Map and FlatMap functions do not seem too difficult.
>>> Reduce, GroupReduce, Combine, GroupCombine, MapPartition might be more
>>> tricky, esp. if we want to support retractions.
>>> Iterations should be a challenge. I assume that Calcite does not support
>>> iterations, so we probably need to split query / program and optimize
>> parts
>>> separately (IIRC, this is also how Flink's own optimizer handles this).
>>> To what extent are you planning to support explicit physical operations
>>> like partitioning, sorting or optimizer hints?
>>> 
>>> I haven't had a look in the design document that you shared. Probably, I
>>> find answers to some of my questions there ;-)
>>> 
>>> Regarding the question of SQL or Table API, I agree that extending the
>>> scope of the Table API does not limit the scope for SQL.
>>> By adding more operations to the Table API we can expand it to use case
>>> that are not well-served by SQL.
>>> As others have said, we'll of course continue to extend and improve
>> Flink's
>>> SQL support (within the bounds of the standard).
>>> 
>>> Best, Fabian
>>> 
>>> On Tue., Nov. 6, 2018 at 10:09, jincheng sun <
>>> sunjincheng121@gmail.com> wrote:
>>> 
>>>> Hi Jark,
>>>> Glad to see your feedback!
>>>> That's correct. The proposal aims to extend the functionality of the
>>>> Table API! I like adding "drop" to fit the use case you mentioned. Not
>>>> only that: if we have a 100-column Table and our UDF needs all 100
>>>> columns, we don't want to define eval as eval(column0...column99); we
>>>> prefer to define eval as eval(Row), used like this:
>>>> table.select(udf(*)). So we also need to consider whether to package
>>>> the columns as a row. In a scenario like this, we classify it as a
>>>> column operation, and list the changes to the column operations after
>>>> the map/flatMap/agg/flatAgg phase is completed. Currently, Xiaowei has
>>>> started a thread outlining what we are proposing. Please see the detail
>>>> in the mail thread:
>>>> 
>>>> 
>>> 
>> https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB
>>>> <
>>>> 
>>> 
>> https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB
>>>>> 
>>>> .
>>>> 
>>>> At this stage the Table API Enhancement Outline is as follows:
>>>> 
>>>> 
>>> 
>> https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit?usp=sharing
>>>> 
>>>> Please let us know if you have further thoughts or feedback!
>>>> 
>>>> Thanks,
>>>> Jincheng
>>>> 
>>>> 
>>>> Jark Wu <im...@gmail.com> wrote on Tue, Nov 6, 2018 at 3:35 PM:
>>>> 
>>>>> Hi jincheng,
>>>>> 
>>>>> Thanks for your proposal. I think it is a helpful enhancement for
>>>>> TableAPI and a solid step forward for TableAPI.
>>>>> It doesn't weaken SQL or DataStream, because the conversion between
>>>>> DataStream and Table still works.
>>>>> People with advanced cases (e.g. complex and fine-grained state
>>>>> control) can go with DataStream,
>>>>> but most general cases can stay in TableAPI. This work aims to extend
>>>>> the functionality of TableAPI,
>>>>> to extend its usage scenarios, and to help TableAPI become a more
>>>>> widely used API.
>>>>> 
>>>>> For example, someone may want to drop one column from a 100-column
>>>>> Table. Currently, we have to convert
>>>>> the Table to a DataStream and use a MapFunction to do that, or select
>>>>> the remaining 99 columns using the Table.select API.
>>>>> But if we support a Table.drop() method in TableAPI, it will be a very
>>>>> convenient method and let users stay in Table.
>>>>> 
>>>>> Looking forward to the more detailed design and further discussion.
>>>>> 
>>>>> Regards,
>>>>> Jark
>>>>> 
>>>>> jincheng sun <su...@gmail.com> wrote on Tue, Nov 6, 2018 at 1:05 PM:
>>>>> 
>>>>>> Hi Rong Rong,
>>>>>> 
>>>>>> Sorry for the late reply, and thanks for your feedback! We will
>>>>>> continue to add more convenience features to the TableAPI, such as
>>>>>> map, flatMap, agg, flatAgg, iteration, etc. And I am very happy that
>>>>>> you are interested in this proposal. Since this is long-term,
>>>>>> continuous work, we will push it forward in stages. Currently, Xiaowei
>>>>>> has started a thread outlining what we are proposing. Please see the
>>>>>> detail in the mail thread:
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB
>>>>>> <
>>>>>> 
>>>>> 
>>>> 
>>> 
>> https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB
>>>>>>> 
>>>>>> .
>>>>>> 
>>>>>> The Table API Enhancement Outline is as follows:
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit?usp=sharing
>>>>>> 
>>>>>> Please let us know if you have further thoughts or feedback!
>>>>>> 
>>>>>> Thanks,
>>>>>> Jincheng
>>>>>> 
>>>>>> Fabian Hueske <fh...@gmail.com> wrote on Mon, Nov 5, 2018 at 7:03 PM:
>>>>>> 
>>>>>>> Hi Jincheng,
>>>>>>> 
>>>>>>> Thanks for this interesting proposal.
>>>>>>> I like that we can push this effort forward in a very
>> fine-grained
>>>>>> manner,
>>>>>>> i.e., incrementally adding more APIs to the Table API.
>>>>>>> 
>>>>>>> However, I also have a few questions / concerns.
>>>>>>> Today, the Table API is tightly integrated with the DataSet and
>>>>>> DataStream
>>>>>>> APIs. It is very easy to convert a Table into a DataSet or
>>> DataStream
>>>>> and
>>>>>>> vice versa. This means it is already easy to combine custom logic
>>>>>>> and relational operations. What I like is that several aspects are
>>>> clearly
>>>>>>> separated like retraction and timestamp handling (see below) +
>> all
>>>>>>> libraries on DataStream/DataSet can be easily combined with
>>>> relational
>>>>>>> operations.
>>>>>>> I can see that adding more functionality to the Table API would
>>>> remove
>>>>>> the
>>>>>>> distinction between DataSet and DataStream. However, wouldn't we
>>> get
>>>> a
>>>>>>> similar benefit by extending the DataStream API for proper
>> support
>>>> for
>>>>>>> bounded streams (as is the long-term goal of Flink)?
>>>>>>> I'm also a bit skeptical about the optimization opportunities we
>>>> would
>>>>>>> gain. Map/FlatMap UDFs are black boxes that cannot be easily
>>> removed
>>>>>>> without additional information (I did some research on this a few
>>>> years
>>>>>> ago
>>>>>>> [1]).
>>>>>>> 
>>>>>>> Moreover, I think there are a few tricky details that need to be
>>>>> resolved
>>>>>>> to enable a good integration.
>>>>>>> 
>>>>>>> 1) How to deal with retraction messages? The DataStream API does
>>> not
>>>>>> have a
>>>>>>> notion of retractions. How would a MapFunction or FlatMapFunction
>>>>> handle
>>>>>>> retraction? Do they need to be aware of the change flag? Custom
>>>>> windowing
>>>>>>> and aggregation logic would certainly need to have that
>>> information.
>>>>>>> 2) How to deal with timestamps? The DataStream API does not give
>>>> access
>>>>>> to
>>>>>>> timestamps. In the Table API / SQL these are exposed as regular
>>>>>> attributes.
>>>>>>> How can we ensure that timestamp attributes remain valid (i.e.
>>>> aligned
>>>>>> with
>>>>>>> watermarks) if the output is produced by arbitrary code?
>>>>>>> There might be more issues of this kind.
>>>>>>> 
>>>>>>> My main question would be how much would we gain with this
>> proposal
>>>>> over
>>>>>> a
>>>>>>> tight integration of Table API and DataStream API, assuming that
>>>> batch
>>>>>>> functionality is moved to DataStream?
>>>>>>> 
>>>>>>> Best, Fabian
>>>>>>> 
>>>>>>> [1]
>>> http://stratosphere.eu/assets/papers/openingTheBlackBoxes_12.pdf
>>>>>>> 
>>>>>>> 
>>>>>>> On Mon., Nov. 5, 2018 at 02:49, Rong Rong <
>>>>>>> walterddr@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Hi Jincheng,
>>>>>>>> 
>>>>>>>> Thank you for the proposal! I think being able to define a
>>> process
>>>> /
>>>>>>>> co-process function in table API definitely opens up a whole
>> new
>>>>> level
>>>>>> of
>>>>>>>> applications using a unified API.
>>>>>>>> 
>>>>>>>> In addition, as Tzu-Li and Hequn have mentioned, the benefit of the
>>>>>>>> optimization layer of the Table API will already bring in additional
>>>>>>>> benefit over directly programming on top of the DataStream/DataSet
>>>>>>>> API. I am very interested in and looking forward to seeing the
>>>>>>>> support for more complex use cases, especially iterations. It will
>>>>>>>> enable the Table API to define much broader, event-driven use cases
>>>>>>>> such as real-time ML prediction/training.
>>>>>>>> 
>>>>>>>> As Timo mentioned, this will make the Table API diverge from the SQL
>>>>>>>> API. But from my experience the Table API was always giving me the
>>>>>>>> impression of being a more sophisticated, syntax-aware way to
>>>>>>>> express relational operations.
>>>>>>>> Looking forward to further discussion and collaborations on the
>>>>>>>> FLIP doc.
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Rong
>>>>>>>> 
>>>>>>>> On Sun, Nov 4, 2018 at 5:22 PM jincheng sun <
>>>>> sunjincheng121@gmail.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Hi tison,
>>>>>>>>> 
>>>>>>>>> Thanks a lot for your feedback!
>>>>>>>>> I am very happy to see that community contributors agree to
>>>>>>>>> enhance the TableAPI. This is long-term, continuous work; we will
>>>>>>>>> push it forward in stages. We will soon complete the enhancement
>>>>>>>>> list for the first phase, and we can then go into deep discussion
>>>>>>>>> in the Google doc. Thanks again for joining this very important
>>>>>>>>> discussion of the Flink Table API.
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Jincheng
>>>>>>>>> 
>>>>>>>>> Tzu-Li Chen <wa...@gmail.com> wrote on Fri, Nov 2, 2018 at 1:49 PM:
>>>>>>>>> 
>>>>>>>>>> Hi jincheng,
>>>>>>>>>> 
>>>>>>>>>> Thanks a lot for your proposal! I find it is a good starting point
>>>>>>>>>> for internal optimization work, and it will help Flink become more
>>>>>>>>>> user-friendly.
>>>>>>>>>> 
>>>>>>>>>> AFAIK, DataStream is currently the most popular API, with which
>>>>>>>>>> Flink users describe their logic in full detail.
>>>>>>>>>> From a more internal view, the conversion from DataStream to
>>>>>>>>>> JobGraph is quite mechanical and hard to optimize. So when users
>>>>>>>>>> program with DataStream, they have to learn more internals and
>>>>>>>>>> spend a lot of time tuning for performance.
>>>>>>>>>> With your proposal, we provide enhanced functionality in the Table
>>>>>>>>>> API, so that users can describe their jobs easily at the Table
>>>>>>>>>> level. This gives Flink developers an opportunity to introduce an
>>>>>>>>>> optimization phase while transforming the user program (described
>>>>>>>>>> by the Table API) into the internal representation.
>>>>>>>>>> 
>>>>>>>>>> Given a user who wants to start using Flink with simple ETL,
>>>>>>>>>> pipelining or analytics, he would find it is most naturally
>>>>>>>>>> described by the SQL/Table API. Further, as mentioned by @hequn,
>>>>>>>>>> 
>>>>>>>>>> SQL is a widely used language. It follows standards, is a
>>>>>>>>>>> descriptive language, and is easy to use
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> thus we could expect with the enhancement of SQL/Table API,
>>>> Flink
>>>>>>>>>> becomes more friendly to users.
>>>>>>>>>> 
>>>>>>>>>> Looking forward to the design doc/FLIP!
>>>>>>>>>> 
>>>>>>>>>> Best,
>>>>>>>>>> tison.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> jincheng sun <su...@gmail.com> wrote on Fri, Nov 2, 2018
>>>>>>>>>> at 11:46 AM:
>>>>>>>>>> 
>>>>>>>>>>> Hi Hequn,
>>>>>>>>>>> Thanks for your feedback! And also thanks for our offline
>>>>>>>>>>> discussion!
>>>>>>>>>>> You are right, unification of batch and streaming is very
>>>>>>>>>>> important for the Flink APIs.
>>>>>>>>>>> We will provide a more detailed design later. Please let me know
>>>>>>>>>>> if you have further thoughts or feedback.
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Jincheng
>>>>>>>>>>> 
>>>>>>>>>>> Hequn Cheng <ch...@gmail.com> wrote on Fri, Nov 2, 2018
>>>>>>>>>>> at 10:02 AM:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi Jincheng,
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks a lot for your proposal. It is very encouraging!
>>>>>>>>>>>> 
>>>>>>>>>>>> As we all know, SQL is a widely used language. It follows
>>>>>>>>>>>> standards, is a descriptive language, and is easy to use. A
>>>>>>>>>>>> powerful feature of SQL is that it supports optimization: users
>>>>>>>>>>>> only need to care about the logic of the program; the underlying
>>>>>>>>>>>> optimizer will help users optimize the performance of the
>>>>>>>>>>>> program. However, in terms of functionality and ease of use, in
>>>>>>>>>>>> some scenarios SQL will be limited, as described in Jincheng's
>>>>>>>>>>>> proposal.
>>>>>>>>>>>> 
>>>>>>>>>>>> Correspondingly, the DataStream/DataSet APIs provide powerful
>>>>>>>>>>>> functionality. Users can write a
>>>>>>>>>>>> ProcessFunction/CoProcessFunction and use timers. Compared with
>>>>>>>>>>>> SQL, they provide more functionality and flexibility. However,
>>>>>>>>>>>> they do not support optimization like SQL does. Meanwhile, the
>>>>>>>>>>>> DataStream/DataSet APIs have not been unified, which means that
>>>>>>>>>>>> for the same logic, users need to write a separate job for
>>>>>>>>>>>> streaming and for batch.
>>>>>>>>>>>> 
>>>>>>>>>>>> With the Table API, I think we can combine the advantages of
>>>>>>>>>>>> both. Users can easily write relational operations and enjoy
>>>>>>>>>>>> optimization. At the same time, it supports more functionality
>>>>>>>>>>>> and ease of use. Looking forward to the detailed design/FLIP.
>>>>>>>>>>>> 
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Hequn
>>>>>>>>>>>> 
>>>>>>>>>>>> On Fri, Nov 2, 2018 at 9:48 AM Shaoxuan Wang <
>>>>>>> wshaoxuan@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi Aljoscha,
>>>>>>>>>>>>> Glad that you like the proposal. We have completed
>> the
>>>>>>> prototype
>>>>>>>> of
>>>>>>>>>>> most
>>>>>>>>>>>>> new proposed functionalities. Once collect the
>> feedback
>>>>> from
>>>>>>>>>> community,
>>>>>>>>>>>> we
>>>>>>>>>>>>> will come up with a concrete FLIP/design doc.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Shaoxuan
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Thu, Nov 1, 2018 at 8:12 PM Aljoscha Krettek <
>>>>>>>>> aljoscha@apache.org
>>>>>>>>>>> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi Jincheng,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> these points sound very good! Are there any
>> concrete
>>>>>>> proposals
>>>>>>>>> for
>>>>>>>>>>>>>> changes? For example a FLIP/design document?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> See here for FLIPs:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Aljoscha
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On 1. Nov 2018, at 12:51, jincheng sun <
>>>>>>>>> sunjincheng121@gmail.com
>>>>>>>>>>> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> *--------I am sorry for the formatting of the
>> email
>>>>>>> content.
>>>>>>>> I
>>>>>>>>>>>> reformat
>>>>>>>>>>>>>>> the **content** as follows-----------*
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> *Hi ALL,*
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> With the continuous efforts from the community,
>> the
>>>>> Flink
>>>>>>>>> system
>>>>>>>>>>> has
>>>>>>>>>>>>> been
>>>>>>>>>>>>>>> continuously improved, which has attracted more
>> and
>>>>> more
>>>>>>>> users.
>>>>>>>>>>> Flink
>>>>>>>>>>>>> SQL
>>>>>>>>>>>>>>> is a canonical, widely used relational query
>>>> language.
>>>>>>>> However,
>>>>>>>>>>> there
>>>>>>>>>>>>> are
>>>>>>>>>>>>>>> still some scenarios where Flink SQL failed to
>> meet
>>>>> user
>>>>>>>> needs
>>>>>>>>> in
>>>>>>>>>>>> terms
>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>> functionality and ease of use, such as:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> *1. In terms of functionality*
>>>>>>>>>>>>>>>   Iteration, user-defined window, user-defined
>>> join,
>>>>>>>>>> user-defined
>>>>>>>>>>>>>>> GroupReduce, etc. Users cannot express them with
>>> SQL;
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> *2. In terms of ease of use*
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>  - Map - e.g. “dataStream.map(mapFun)”. Although
>>>>>>>>>>>> “table.select(udf1(),
>>>>>>>>>>>>>>>  udf2(), udf3()....)” can be used to accomplish
>>> the
>>>>> same
>>>>>>>>>>> function.,
>>>>>>>>>>>>>> with a
>>>>>>>>>>>>>>>  map() function returning 100 columns, one has
>> to
>>>>> define
>>>>>>> or
>>>>>>>>> call
>>>>>>>>>>> 100
>>>>>>>>>>>>>> UDFs
>>>>>>>>>>>>>>>  when using SQL, which is quite involved.
>>>>>>>>>>>>>>>  - FlatMap -  e.g.
>>> “dataStream.flatmap(flatMapFun)”.
>>>>>>>> Similarly,
>>>>>>>>>> it
>>>>>>>>>>>> can
>>>>>>>>>>>>> be
>>>>>>>>>>>>>>>  implemented with “table.join(udtf).select()”.
>>>>> However,
>>>>>> it
>>>>>>>> is
>>>>>>>>>>>> obvious
>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>  dataStream is easier to use than SQL.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Due to the above two reasons, some users have to
>>> use
>>>>> the
>>>>>>>>>> DataStream
>>>>>>>>>>>> API
>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>> the DataSet API. But when they do that, they lose
>>> the
>>>>>>>>> unification
>>>>>>>>>>> of
>>>>>>>>>>>>>> batch
>>>>>>>>>>>>>>> and streaming. They will also lose the
>>> sophisticated
>>>>>>>>>> optimizations
>>>>>>>>>>>> such
>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>> codegen, aggregate join transpose and multi-stage
>>> agg
>>>>>> from
>>>>>>>>> Flink
>>>>>>>>>>> SQL.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> We believe that enhancing the functionality and
>>>>>>> productivity
>>>>>>>> is
>>>>>>>>>>> vital
>>>>>>>>>>>>> for
>>>>>>>>>>>>>>> the successful adoption of Table API. To this
>> end,
>>>>> Table
>>>>>>> API
>>>>>>>>>> still
>>>>>>>>>>>>>>> requires more efforts from every contributor in
>> the
>>>>>>>> community.
>>>>>>>>> We
>>>>>>>>>>> see
>>>>>>>>>>>>>> great
>>>>>>>>>>>>>>> opportunity in improving our user’s experience
>> from
>>>>> this
>>>>>>>> work.
>>>>>>>>>> Any
>>>>>>>>>>>>>> feedback
>>>>>>>>>>>>>>> is welcome.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Jincheng
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> jincheng sun <su...@gmail.com>
>>>> wrote on Thu, Nov 1, 2018
>>>> at 5:07 PM:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> With the continuous efforts from the community,
>>> the
>>>>>> Flink
>>>>>>>>> system
>>>>>>>>>>> has
>>>>>>>>>>>>>> been
>>>>>>>>>>>>>>>> continuously improved, which has attracted more
>>> and
>>>>> more
>>>>>>>>> users.
>>>>>>>>>>>> Flink
>>>>>>>>>>>>>> SQL
>>>>>>>>>>>>>>>> is a canonical, widely used relational query
>>>> language.
>>>>>>>>> However,
>>>>>>>>>>>> there
>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>> still some scenarios where Flink SQL failed to
>>> meet
>>>>> user
>>>>>>>> needs
>>>>>>>>>> in
>>>>>>>>>>>>> terms
>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>> functionality and ease of use, such as:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>  -
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>  In terms of functionality
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Iteration, user-defined window, user-defined
>> join,
>>>>>>>>> user-defined
>>>>>>>>>>>>>>>> GroupReduce, etc. Users cannot express them with
>>>> SQL;
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>  -
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>  In terms of ease of use
>>>>>>>>>>>>>>>>  -
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>     Map - e.g. “dataStream.map(mapFun)”.
>> Although
>>>>>>>>>>>>> “table.select(udf1(),
>>>>>>>>>>>>>>>>     udf2(), udf3()....)” can be used to
>>> accomplish
>>>>> the
>>>>>>> same
>>>>>>>>>>>>> function.,
>>>>>>>>>>>>>> with a
>>>>>>>>>>>>>>>>     map() function returning 100 columns, one
>> has
>>>> to
>>>>>>> define
>>>>>>>>> or
>>>>>>>>>>> call
>>>>>>>>>>>>>> 100 UDFs
>>>>>>>>>>>>>>>>     when using SQL, which is quite involved.
>>>>>>>>>>>>>>>>     -
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>     FlatMap -  e.g.
>>>> “dataStream.flatmap(flatMapFun)”.
>>>>>>>>> Similarly,
>>>>>>>>>>> it
>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>     be implemented with
>>>> “table.join(udtf).select()”.
>>>>>>>> However,
>>>>>>>>>> it
>>>>>>>>>>> is
>>>>>>>>>>>>>> obvious
>>>>>>>>>>>>>>>>     that datastream is easier to use than SQL.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Due to the above two reasons, some users have to
>>> use
>>>>> the
>>>>>>>>>>> DataStream
>>>>>>>>>>>>> API
>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>> the DataSet API. But when they do that, they
>> lose
>>>> the
>>>>>>>>>> unification
>>>>>>>>>>> of
>>>>>>>>>>>>>> batch
>>>>>>>>>>>>>>>> and streaming. They will also lose the
>>> sophisticated
>>>>>>>>>> optimizations
>>>>>>>>>>>>> such
>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>> codegen, aggregate join transpose  and
>> multi-stage
>>>> agg
>>>>>>> from
>>>>>>>>>> Flink
>>>>>>>>>>>> SQL.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> We believe that enhancing the functionality and
>>>>>>> productivity
>>>>>>>>> is
>>>>>>>>>>>> vital
>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>> the successful adoption of Table API. To this
>> end,
>>>>>> Table
>>>>>>>> API
>>>>>>>>>>> still
>>>>>>>>>>>>>>>> requires more efforts from every contributor in
>>> the
>>>>>>>> community.
>>>>>>>>>> We
>>>>>>>>>>>> see
>>>>>>>>>>>>>> great
>>>>>>>>>>>>>>>> opportunity in improving our user’s experience
>>> from
>>>>> this
>>>>>>>> work.
>>>>>>>>>> Any
>>>>>>>>>>>>>> feedback
>>>>>>>>>>>>>>>> is welcome.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Jincheng
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 


Re: [DISCUSS] Enhancing the functionality and productivity of Table API

Posted by Fabian Hueske <fh...@gmail.com>.
Hi,

An analysis of orthogonal functions would be great!
There is certainly some overlap in the functions provided by the DataSet
API.

In the past, I found that having low-level functions helped a lot to
efficiently implement complex logic.
Without partitionByHash, sortPartition, sort, mapPartition, combine, etc., it
would not be possible to (efficiently) implement certain operators for the
Table API, SQL, or the Cascading-On-Flink port that I did a while back.
I could imagine that these APIs would be useful to implement DSLs on top of
the Table API, such as Gelly.
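
For readers who have not used these low-level methods, here is a minimal
sketch of the kind of DataSet program meant here. The operator names
(partitionByHash, sortPartition, mapPartition) are the existing DataSet API;
the tuple data and the per-partition logic are made up for the example.

    import org.apache.flink.api.common.functions.MapPartitionFunction;
    import org.apache.flink.api.common.operators.Order;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.util.Collector;

    public class LowLevelOpsSketch {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            DataSet<Tuple2<String, Integer>> input = env.fromElements(
                    Tuple2.of("a", 1), Tuple2.of("b", 2), Tuple2.of("a", 3));

            // Explicit physical operations: hash-partition by the key field,
            // sort inside each partition, then process whole partitions at once.
            input.partitionByHash(0)
                 .sortPartition(1, Order.ASCENDING)
                 .mapPartition(new MapPartitionFunction<Tuple2<String, Integer>, String>() {
                     @Override
                     public void mapPartition(Iterable<Tuple2<String, Integer>> values,
                                              Collector<String> out) {
                         for (Tuple2<String, Integer> v : values) {
                             out.collect(v.f0 + ":" + v.f1);
                         }
                     }
                 })
                 .print();
        }
    }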

Anyway, I completely agree that these physical operators should not be the
first step.
If we find that these methods are not needed, even better!

Let's try to keep this thread focused on the general proposal of extending
the scope of the Table API, and keep the discussion of the concrete proposal
that Xiaowei shared in the other thread (and the design doc).
That will help to keep all related comments in one place ;-)

Best, Fabian


On Tue, Nov 6, 2018 at 13:01, jincheng sun <
sunjincheng121@gmail.com> wrote:

> Hi Fabian,
> Thank you for your deep thoughts in this regard. I think most of the
> questions you mentioned are well worth in-depth discussion! I want to share
> my thoughts on the following questions:
>
> 1. Do we need to move all DataSet API functionality into the Table API?
> I think most of the DataSet functionality should be added to the Table API,
> such as map, flatMap, groupReduce, etc., because these are very easy for
> users to use.
>
> 2. Do we support explicit physical operations like partitioning, sorting, or
> optimizer hints?
> I think we do not want to add the physical operations, e.g.
> sortPartition, partitionCustom, etc. From my point of view, those physical
> operations are used for optimization, which can be solved by hints (I think
> we should add a hints feature to both the Table API and SQL).
>
> 3. Do we want to support retractions in iteration?
> I think supporting iteration is a very complicated feature. I am not sure,
> but I think iteration may be implemented following the current batch mode,
> with retraction temporarily not supported, assuming that the training data
> will not be updated within the current iteration; the updated data will be
> used in the next iteration. So I think we should have an in-depth discussion
> in a new thread.
>
> BTW, I see that you have left very useful comments in the Google
> doc:
>
> https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit#
>
> Thanks again for both your mail feedback and doc comments!
>
> Best,
> Jincheng
>
>
>
> Fabian Hueske <fh...@gmail.com> wrote on Tue, Nov 6, 2018 at 6:21 PM:
>
> > Thanks for the replies Xiaowei and others!
> >
> > You are right, I did not consider the batch optimization that would be
> > missing if the DataSet API would be ported to extend the DataStream API.
> > By extending the scope of the Table API, we can gain a holistic logical &
> > physical optimization which would be great!
> > Is your plan to move all DataSet API functionality into the Table API?
> > If so, do you envision any batch-related API in DataStream at all, or
> > should this be done by converting a batch table to DataStream? I'm asking
> > because if there would be batch features in DataStream, we would need some
> > optimization there as well.
> >
> > I think the proposed separation of Table API (stateless APIs) and
> > DataStream (APIs that expose state handling) is a good idea.
> > On a side note, the DataSet API discouraged state handling in user
> > functions, so porting this to the Table API would be quite "natural".
> >
> > As I said before, I like that we can incrementally extend the Table API.
> > Map and FlatMap functions do not seem too difficult.
> > Reduce, GroupReduce, Combine, GroupCombine, MapPartition might be more
> > tricky, esp. if we want to support retractions.
> > Iterations will be a challenge. I assume that Calcite does not support
> > iterations, so we probably need to split query / program and optimize parts
> > separately (IIRC, this is also how Flink's own optimizer handles this).
> > To what extent are you planning to support explicit physical operations
> > like partitioning, sorting or optimizer hints?
> >
> > I haven't had a look at the design document that you shared yet. Probably,
> > I'll find answers to some of my questions there ;-)
> >
> > Regarding the question of SQL or Table API, I agree that extending the
> > scope of the Table API does not limit the scope for SQL.
> > By adding more operations to the Table API we can expand it to use cases
> > that are not well-served by SQL.
> > As others have said, we'll of course continue to extend and improve
> > Flink's SQL support (within the bounds of the standard).
> >
> > Best, Fabian
> >
> > On Tue, Nov 6, 2018 at 10:09, jincheng sun <
> > sunjincheng121@gmail.com> wrote:
> >
> > > Hi Jark,
> > > Glad to see your feedback!
> > > That's correct, the proposal is aiming to extend the functionality of the
> > > Table API! I like adding "drop" to fit the use case you mentioned. Not
> > > only that: given a 100-column Table whose UDF needs all 100 columns, we
> > > don't want to define the eval method as eval(column0...column99); we
> > > prefer to define eval as eval(Row) and use it like this:
> > > table.select(udf(*)). We also need to consider whether we package the
> > > columns as a row. In a scenario like this, we classify it as a column
> > > operation, and list the changes to the column operations after the
> > > map/flatMap/agg/flatAgg phase is completed. Currently, Xiaowei has started
> > > a thread outlining what we are proposing. Please see the detail in the
> > > mail thread:
> > >
> > > https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB
> > >
> > > At this stage, the Table API Enhancement Outline is as follows:
> > >
> > > https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit?usp=sharing
> > >
> > > Please let us know if you have further thoughts or feedback!
> > >
> > > Thanks,
> > > Jincheng
> > >
> > >
> > > Jark Wu <im...@gmail.com> wrote on Tue, Nov 6, 2018 at 3:35 PM:
> > >
> > > > Hi Jincheng,
> > > >
> > > > Thanks for your proposal. I think it is a helpful enhancement for the
> > > > Table API and a solid step forward.
> > > > It doesn't weaken SQL or DataStream, because the conversion between
> > > > DataStream and Table still works.
> > > > People with advanced cases (e.g. complex and fine-grained state control)
> > > > can go with DataStream, but most general cases can stay in the Table API.
> > > > This work aims to extend the functionality of the Table API, to extend
> > > > its usage scenarios, and to help the Table API become a more widely used
> > > > API.
> > > >
> > > > For example, someone may want to drop one column from a 100-column
> > > > Table. Currently, we have to convert the Table to a DataStream and use a
> > > > MapFunction to do that, or select the remaining 99 columns using the
> > > > Table.select API. But if we supported a Table.drop() method in the Table
> > > > API, it would be a very convenient method and let users stay in Table.
> > > >
> > > > Looking forward to the more detailed design and further discussion.
> > > >
> > > > Regards,
> > > > Jark
> > > >
> > > > jincheng sun <su...@gmail.com> wrote on Tue, Nov 6, 2018 at 1:05 PM:
> > > >
> > > > > Hi Rong Rong,
> > > > >
> > > > > Sorry for the late reply, and thanks for your feedback! We will continue
> > > > > to add more convenience features to the Table API, such as map, flatMap,
> > > > > agg, flatAgg, iteration, etc. I am very happy that you are interested in
> > > > > this proposal. Since this is long-term, continuous work, we will push it
> > > > > in stages. Currently, Xiaowei has started a thread outlining what we are
> > > > > proposing. Please see the detail in the mail thread:
> > > > >
> > > > > https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB
> > > > >
> > > > > The Table API Enhancement Outline is as follows:
> > > > >
> > > > > https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit?usp=sharing
> > > > >
> > > > > Please let us know if you have further thoughts or feedback!
> > > > >
> > > > > Thanks,
> > > > > Jincheng
> > > > >
> > > > > Fabian Hueske <fh...@gmail.com> wrote on Mon, Nov 5, 2018 at 7:03 PM:
> > > > >
> > > > > > Hi Jincheng,
> > > > > >
> > > > > > Thanks for this interesting proposal.
> > > > > > I like that we can push this effort forward in a very
> fine-grained
> > > > > manner,
> > > > > > i.e., incrementally adding more APIs to the Table API.
> > > > > >
> > > > > > However, I also have a few questions / concerns.
> > > > > > Today, the Table API is tightly integrated with the DataSet and
> > > > > > DataStream APIs. It is very easy to convert a Table into a DataSet or
> > > > > > DataStream and vice versa. This means it is already easy to combine
> > > > > > custom logic and relational operations. What I like is that several
> > > > > > aspects are clearly separated, like retraction and timestamp handling
> > > > > > (see below), and all libraries on DataStream/DataSet can be easily
> > > > > > combined with relational operations.
> > > > > > I can see that adding more functionality to the Table API would
> > > remove
> > > > > the
> > > > > > distinction between DataSet and DataStream. However, wouldn't we
> > get
> > > a
> > > > > > similar benefit by extending the DataStream API for proper
> support
> > > for
> > > > > > bounded streams (as is the long-term goal of Flink)?
> > > > > > I'm also a bit skeptical about the optimization opportunities we
> > > would
> > > > > > gain. Map/FlatMap UDFs are black boxes that cannot be easily
> > removed
> > > > > > without additional information (I did some research on this a few
> > > years
> > > > > ago
> > > > > > [1]).
> > > > > >
> > > > > > Moreover, I think there are a few tricky details that need to be
> > > > resolved
> > > > > > to enable a good integration.
> > > > > >
> > > > > > 1) How to deal with retraction messages? The DataStream API does
> > not
> > > > > have a
> > > > > > notion of retractions. How would a MapFunction or FlatMapFunction
> > > > handle
> > > > > > retraction? Do they need to be aware of the change flag? Custom
> > > > windowing
> > > > > > and aggregation logic would certainly need to have that
> > information.
> > > > > > 2) How to deal with timestamps? The DataStream API does not give
> > > access
> > > > > to
> > > > > > timestamps. In the Table API / SQL these are exposed as regular
> > > > > attributes.
> > > > > > How can we ensure that timestamp attributes remain valid (i.e.
> > > aligned
> > > > > with
> > > > > > watermarks) if the output is produced by arbitrary code?
> > > > > > There might be more issues of this kind.
> > > > > >
> > > > > > My main question would be how much would we gain with this
> proposal
> > > > over
> > > > > a
> > > > > > tight integration of Table API and DataStream API, assuming that
> > > batch
> > > > > > functionality is moved to DataStream?
> > > > > >
> > > > > > Best, Fabian
> > > > > >
> > > > > > [1]
> > http://stratosphere.eu/assets/papers/openingTheBlackBoxes_12.pdf
> > > > > >
> > > > > >
> > > > > > On Mon, Nov 5, 2018 at 02:49, Rong Rong <walterddr@gmail.com> wrote:
> > > > > >
> > > > > > > Hi Jincheng,
> > > > > > >
> > > > > > > Thank you for the proposal! I think being able to define a
> > process
> > > /
> > > > > > > co-process function in table API definitely opens up a whole
> new
> > > > level
> > > > > of
> > > > > > > applications using a unified API.
> > > > > > >
> > > > > > > In addition, as Tzu-Li and Hequn have mentioned, the optimization
> > > > > > > layer of the Table API will already bring additional benefit over
> > > > > > > directly programming on top of the DataStream/DataSet API. I am very
> > > > > > > interested in and looking forward to seeing the support for more
> > > > > > > complex use cases, especially iterations. It will enable the Table
> > > > > > > API to define much broader, event-driven use cases such as real-time
> > > > > > > ML prediction/training.
> > > > > > >
> > > > > > > As Timo mentioned, this will make the Table API diverge from the SQL
> > > > > > > API. But from my experience, the Table API always gave me the
> > > > > > > impression of being a more sophisticated, syntax-aware way to express
> > > > > > > relational operations.
> > > > > > > Looking forward to further discussion and collaborations on the
> > > FLIP
> > > > > doc.
> > > > > > >
> > > > > > > --
> > > > > > > Rong
> > > > > > >
> > > > > > > On Sun, Nov 4, 2018 at 5:22 PM jincheng sun <
> > > > sunjincheng121@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi tison,
> > > > > > > >
> > > > > > > > Thanks a lot for your feedback!
> > > > > > > > I am very happy to see that community contributors agree to
> > > > > > > > enhancing the Table API. This is long-term, continuous work; we will
> > > > > > > > push it in stages. We will soon complete the enhancement list of the
> > > > > > > > first phase, and we can then go into deep discussion in the Google
> > > > > > > > doc. Thanks again for joining this very important discussion of the
> > > > > > > > Flink Table API.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Jincheng
> > > > > > > >
> > > > > > > > Tzu-Li Chen <wa...@gmail.com> wrote on Fri, Nov 2, 2018 at 1:49 PM:
> > > > > > > >
> > > > > > > > > Hi Jincheng,
> > > > > > > > >
> > > > > > > > > Thanks a lot for your proposal! I find it a good starting point
> > > > > > > > > for internal optimization work and for helping Flink become more
> > > > > > > > > user-friendly.
> > > > > > > > >
> > > > > > > > > AFAIK, DataStream is currently the most popular API, and Flink
> > > > > > > > > users have to describe their logic with it in great detail.
> > > > > > > > > From a more internal view, the conversion from DataStream to
> > > > > > > > > JobGraph is quite mechanical and hard to optimize. So when
> > > > > > > > > users program with DataStream, they have to learn more internals
> > > > > > > > > and spend a lot of time tuning for performance.
> > > > > > > > > With your proposal, we provide enhanced functionality in the
> > > > > > > > > Table API, so that users can easily describe their job at the
> > > > > > > > > Table level. This gives Flink developers an opportunity to
> > > > > > > > > introduce an optimization phase while transforming the user
> > > > > > > > > program (described by the Table API) into the internal
> > > > > > > > > representation.
> > > > > > > > >
> > > > > > > > > Given a user who wants to start using Flink for simple ETL,
> > > > > > > > > pipelining, or analytics, he would find it most naturally described
> > > > > > > > > by the SQL/Table API. Further, as mentioned by @hequn,
> > > > > > > > >
> > > > > > > > > SQL is a widely used language. It follows standards, is a
> > > > > > > > > > descriptive language, and is easy to use
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > thus we can expect that, with the enhancement of the SQL/Table
> > > > > > > > > API, Flink becomes friendlier to users.
> > > > > > > > >
> > > > > > > > > Looking forward to the design doc/FLIP!
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > tison.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > jincheng sun <su...@gmail.com> wrote on Fri, Nov 2, 2018 at 11:46 AM:
> > > > > > > > >
> > > > > > > > > > Hi Hequn,
> > > > > > > > > > Thanks for your feedback! And also thanks for our offline
> > > > > > discussion!
> > > > > > > > > > You are right, unification of batch and streaming is very
> > > > > > > > > > important for the Flink API.
> > > > > > > > > > We will provide a more detailed design later. Please let me know
> > > > > > > > > > if you have further thoughts or feedback.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Jincheng
> > > > > > > > > >
> > > > > > > > > > Hequn Cheng <ch...@gmail.com> wrote on Fri, Nov 2, 2018 at 10:02 AM:
> > > > > > > > > >
> > > > > > > > > > > Hi Jincheng,
> > > > > > > > > > >
> > > > > > > > > > > Thanks a lot for your proposal. It is very encouraging!
> > > > > > > > > > >
> > > > > > > > > > > As we all know, SQL is a widely used language. It
> follows
> > > > > > > standards,
> > > > > > > > > is a
> > > > > > > > > > > descriptive language, and is easy to use. A powerful
> > > feature
> > > > of
> > > > > > SQL
> > > > > > > > is
> > > > > > > > > > that
> > > > > > > > > > > it supports optimization. Users only need to care about
> > the
> > > > > logic
> > > > > > > of
> > > > > > > > > the
> > > > > > > > > > > program. The underlying optimizer will help users
> > optimize
> > > > the
> > > > > > > > > > performance
> > > > > > > > > > > of the program. However, in terms of functionality and ease
> > > > > > > > > > > of use, SQL is limited in some scenarios, as described in
> > > > > > > > > > > Jincheng's proposal.
> > > > > > > > > > >
> > > > > > > > > > > Correspondingly, the DataStream/DataSet API provides powerful
> > > > > > > > > > > functionality. Users can write a
> > > > > > > > > > > ProcessFunction/CoProcessFunction and use timers. Compared
> > > > > > > > > > > with SQL, it provides more functionality and flexibility.
> > > > > > > > > > > However, it does not support optimization like SQL does.
> > > > > > > > > > > Meanwhile, the DataStream and DataSet APIs have not been
> > > > > > > > > > > unified, which means that, for the same logic, users need to
> > > > > > > > > > > write a separate job for stream and for batch.
> > > > > > > > > > >
> > > > > > > > > > > With the Table API, I think we can combine the advantages of
> > > > > > > > > > > both. Users can easily write relational operations and enjoy
> > > > > > > > > > > optimization. At the same time, it supports more functionality
> > > > > > > > > > > and ease of use. Looking forward to the detailed design/FLIP.
> > > > > > > > > > >
> > > > > > > > > > > Best,
> > > > > > > > > > > Hequn
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Nov 2, 2018 at 9:48 AM Shaoxuan Wang <
> > > > > > wshaoxuan@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Aljoscha,
> > > > > > > > > > > > Glad that you like the proposal. We have completed the
> > > > > > > > > > > > prototype of most of the newly proposed functionality. Once
> > > > > > > > > > > > we collect the feedback from the community, we will come up
> > > > > > > > > > > > with a concrete FLIP/design doc.
> > > > > > > > > > > >
> > > > > > > > > > > > Regards,
> > > > > > > > > > > > Shaoxuan
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Nov 1, 2018 at 8:12 PM Aljoscha Krettek <
> > > > > > > > aljoscha@apache.org
> > > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Jincheng,
> > > > > > > > > > > > >
> > > > > > > > > > > > > these points sound very good! Are there any
> concrete
> > > > > > proposals
> > > > > > > > for
> > > > > > > > > > > > > changes? For example a FLIP/design document?
> > > > > > > > > > > > >
> > > > > > > > > > > > > See here for FLIPs:
> > > > > > > > > > > > >
> > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
> > > > > > > > > > > > >
> > > > > > > > > > > > > Best,
> > > > > > > > > > > > > Aljoscha
> > > > > > > > > > > > >
> > > > > > > > > > > > > > On 1. Nov 2018, at 12:51, jincheng sun <
> > > > > > > > sunjincheng121@gmail.com
> > > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > *--------I am sorry for the formatting of the
> email
> > > > > > content.
> > > > > > > I
> > > > > > > > > > > reformat
> > > > > > > > > > > > > > the **content** as follows-----------*
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > *Hi ALL,*
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > With the continuous efforts from the community,
> the
> > > > Flink
> > > > > > > > system
> > > > > > > > > > has
> > > > > > > > > > > > been
> > > > > > > > > > > > > > continuously improved, which has attracted more
> and
> > > > more
> > > > > > > users.
> > > > > > > > > > Flink
> > > > > > > > > > > > SQL
> > > > > > > > > > > > > > is a canonical, widely used relational query
> > > language.
> > > > > > > However,
> > > > > > > > > > there
> > > > > > > > > > > > are
> > > > > > > > > > > > > > still some scenarios where Flink SQL failed to
> meet
> > > > user
> > > > > > > needs
> > > > > > > > in
> > > > > > > > > > > terms
> > > > > > > > > > > > > of
> > > > > > > > > > > > > > functionality and ease of use, such as:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > *1. In terms of functionality*
> > > > > > > > > > > > > >    Iteration, user-defined window, user-defined
> > join,
> > > > > > > > > user-defined
> > > > > > > > > > > > > > GroupReduce, etc. Users cannot express them with
> > SQL;
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > *2. In terms of ease of use*
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >   - Map - e.g. “dataStream.map(mapFun)”. Although
> > > > > > > > > > > “table.select(udf1(),
> > > > > > > > > > > > > >   udf2(), udf3()....)” can be used to accomplish
> > the
> > > > same
> > > > > > > > > > function.,
> > > > > > > > > > > > > with a
> > > > > > > > > > > > > >   map() function returning 100 columns, one has
> to
> > > > define
> > > > > > or
> > > > > > > > call
> > > > > > > > > > 100
> > > > > > > > > > > > > UDFs
> > > > > > > > > > > > > >   when using SQL, which is quite involved.
> > > > > > > > > > > > > >   - FlatMap -  e.g.
> > “dataStrem.flatmap(flatMapFun)”.
> > > > > > > Similarly,
> > > > > > > > > it
> > > > > > > > > > > can
> > > > > > > > > > > > be
> > > > > > > > > > > > > >   implemented with “table.join(udtf).select()”.
> > > > However,
> > > > > it
> > > > > > > is
> > > > > > > > > > > obvious
> > > > > > > > > > > > > that
> > > > > > > > > > > > > >   dataStream is easier to use than SQL.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Due to the above two reasons, some users have to
> > use
> > > > the
> > > > > > > > > DataStream
> > > > > > > > > > > API
> > > > > > > > > > > > > or
> > > > > > > > > > > > > > the DataSet API. But when they do that, they lose
> > the
> > > > > > > > unification
> > > > > > > > > > of
> > > > > > > > > > > > > batch
> > > > > > > > > > > > > > and streaming. They will also lose the
> > sophisticated
> > > > > > > > > optimizations
> > > > > > > > > > > such
> > > > > > > > > > > > > as
> > > > > > > > > > > > > > codegen, aggregate join transpose and multi-stage
> > agg
> > > > > from
> > > > > > > > Flink
> > > > > > > > > > SQL.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > We believe that enhancing the functionality and
> > > > > > productivity
> > > > > > > is
> > > > > > > > > > vital
> > > > > > > > > > > > for
> > > > > > > > > > > > > > the successful adoption of Table API. To this
> end,
> > > > Table
> > > > > > API
> > > > > > > > > still
> > > > > > > > > > > > > > requires more efforts from every contributor in
> the
> > > > > > > community.
> > > > > > > > We
> > > > > > > > > > see
> > > > > > > > > > > > > great
> > > > > > > > > > > > > > opportunity in improving our user’s experience
> from
> > > > this
> > > > > > > work.
> > > > > > > > > Any
> > > > > > > > > > > > > feedback
> > > > > > > > > > > > > > is welcome.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Jincheng
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > jincheng sun <su...@gmail.com> wrote on Thu, Nov 1, 2018 at 5:07 PM:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >> Hi all,
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> With the continuous efforts from the community,
> > the
> > > > > Flink
> > > > > > > > system
> > > > > > > > > > has
> > > > > > > > > > > > > been
> > > > > > > > > > > > > >> continuously improved, which has attracted more
> > and
> > > > more
> > > > > > > > users.
> > > > > > > > > > > Flink
> > > > > > > > > > > > > SQL
> > > > > > > > > > > > > >> is a canonical, widely used relational query
> > > language.
> > > > > > > > However,
> > > > > > > > > > > there
> > > > > > > > > > > > > are
> > > > > > > > > > > > > >> still some scenarios where Flink SQL failed to
> > meet
> > > > user
> > > > > > > needs
> > > > > > > > > in
> > > > > > > > > > > > terms
> > > > > > > > > > > > > of
> > > > > > > > > > > > > >> functionality and ease of use, such as:
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >>   -
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >>   In terms of functionality
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> Iteration, user-defined window, user-defined
> join,
> > > > > > > > user-defined
> > > > > > > > > > > > > >> GroupReduce, etc. Users cannot express them with
> > > SQL;
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >>   -
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >>   In terms of ease of use
> > > > > > > > > > > > > >>   -
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >>      Map - e.g. “dataStream.map(mapFun)”.
> Although
> > > > > > > > > > > > “table.select(udf1(),
> > > > > > > > > > > > > >>      udf2(), udf3()....)” can be used to
> > accomplish
> > > > the
> > > > > > same
> > > > > > > > > > > > function.,
> > > > > > > > > > > > > with a
> > > > > > > > > > > > > >>      map() function returning 100 columns, one
> has
> > > to
> > > > > > define
> > > > > > > > or
> > > > > > > > > > call
> > > > > > > > > > > > > 100 UDFs
> > > > > > > > > > > > > >>      when using SQL, which is quite involved.
> > > > > > > > > > > > > >>      -
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >>      FlatMap -  e.g.
> > > “dataStrem.flatmap(flatMapFun)”.
> > > > > > > > Similarly,
> > > > > > > > > > it
> > > > > > > > > > > > can
> > > > > > > > > > > > > >>      be implemented with
> > > “table.join(udtf).select()”.
> > > > > > > However,
> > > > > > > > > it
> > > > > > > > > > is
> > > > > > > > > > > > > obvious
> > > > > > > > > > > > > >>      that datastream is easier to use than SQL.
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> Due to the above two reasons, some users have to
> > use
> > > > the
> > > > > > > > > > DataStream
> > > > > > > > > > > > API
> > > > > > > > > > > > > or
> > > > > > > > > > > > > >> the DataSet API. But when they do that, they
> lose
> > > the
> > > > > > > > > unification
> > > > > > > > > > of
> > > > > > > > > > > > > batch
> > > > > > > > > > > > > >> and streaming. They will also lose the
> > sophisticated
> > > > > > > > > optimizations
> > > > > > > > > > > > such
> > > > > > > > > > > > > as
> > > > > > > > > > > > > >> codegen, aggregate join transpose  and
> multi-stage
> > > agg
> > > > > > from
> > > > > > > > > Flink
> > > > > > > > > > > SQL.
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> We believe that enhancing the functionality and
> > > > > > productivity
> > > > > > > > is
> > > > > > > > > > > vital
> > > > > > > > > > > > > for
> > > > > > > > > > > > > >> the successful adoption of Table API. To this
> end,
> > > > > Table
> > > > > > > API
> > > > > > > > > > still
> > > > > > > > > > > > > >> requires more efforts from every contributor in
> > the
> > > > > > > community.
> > > > > > > > > We
> > > > > > > > > > > see
> > > > > > > > > > > > > great
> > > > > > > > > > > > > >> opportunity in improving our user’s experience
> > from
> > > > this
> > > > > > > work.
> > > > > > > > > Any
> > > > > > > > > > > > > feedback
> > > > > > > > > > > > > >> is welcome.
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> Regards,
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> Jincheng
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >>
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Enhancing the functionality and productivity of Table API

Posted by jincheng sun <su...@gmail.com>.
Hi Fabian,
Thank you for your deep thoughts in this regard. I think most of the
questions you mentioned are well worth in-depth discussion! I want to share
my thoughts on the following questions:

1. Do we need to move all DataSet API functionality into the Table API?
I think most of the DataSet functionality should be added to the Table API,
such as map, flatMap, groupReduce, etc., because these are very easy for
users to use.
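
To make this concrete, here is a minimal sketch of the row-based style meant
here. Note that Table.map() is proposed, hypothetical API and does not exist
in Flink today; only ScalarFunction and Row are existing classes, and the
ParseEvent function is made up for the example.

    import org.apache.flink.table.functions.ScalarFunction;
    import org.apache.flink.types.Row;

    // A row-based user function: the whole input row comes in as one Row
    // value and one Row comes back out, instead of 100 column arguments.
    public class ParseEvent extends ScalarFunction {
        public Row eval(Row in) {
            Row out = new Row(2);
            out.setField(0, in.getField(0));                        // keep the key
            out.setField(1, String.valueOf(in.getField(1)).trim()); // clean a value
            return out;
        }
    }

    // Proposed usage (hypothetical, for illustration only):
    // Table result = table.map(new ParseEvent());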

2. Do we support explicit physical operations like partitioning, sorting, or
optimizer hints?
I think we do not want to add the physical operations, e.g.
sortPartition, partitionCustom, etc. From my point of view, those physical
operations are used for optimization, which can be solved by hints (I think
we should add a hints feature to both the Table API and SQL).

3. Do we want to support retractions in iteration?
I think supporting iteration is a very complicated feature. I am not sure,
but I think iteration may be implemented following the current batch mode,
with retraction temporarily not supported, assuming that the training data
will not be updated within the current iteration; the updated data will be
used in the next iteration. So I think we should have an in-depth discussion
in a new thread.
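
For reference, the "current batch mode" iteration mentioned above is Flink's
existing DataSet bulk iteration, where data is only updated between
supersteps. A minimal sketch (the step computation is made up):

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.operators.IterativeDataSet;

    public class BulkIterationSketch {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Start a bulk iteration with at most 10 supersteps.
            IterativeDataSet<Integer> initial = env.fromElements(0).iterate(10);

            // The step function; a real ML job would update model parameters here.
            DataSet<Integer> step = initial.map(i -> i + 1);

            // Feed the step result back; results only change between supersteps,
            // which matches the "no retraction within an iteration" assumption.
            DataSet<Integer> result = initial.closeWith(step);
            result.print();
        }
    }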

BTW, I see that you have left very useful comments in the Google
doc:
https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit#

Thanks again for both your mail feedback and doc comments!

Best,
Jincheng



Fabian Hueske <fh...@gmail.com> wrote on Tue, Nov 6, 2018 at 6:21 PM:

> Thanks for the replies Xiaowei and others!
>
> You are right, I did not consider the batch optimization that would be
> missing if the DataSet API would be ported to extend the DataStream API.
> By extending the scope of the Table API, we can gain a holistic logical &
> physical optimization which would be great!
> Is your plan to move all DataSet API functionality into the Table API?
> If so, do you envision any batch-related API in DataStream at all, or should
> this be done by converting a batch table to DataStream? I'm asking because
> if there would be batch features in DataStream, we would need some
> optimization there as well.
>
> I think the proposed separation of Table API (stateless APIs) and
> DataStream (APIs that expose state handling) is a good idea.
> On a side note, the DataSet API discouraged state handling in user functions,
> so porting this to the Table API would be quite "natural".
>
> As I said before, I like that we can incrementally extend the Table API.
> Map and FlatMap functions do not seem too difficult.
> Reduce, GroupReduce, Combine, GroupCombine, MapPartition might be more
> tricky, esp. if we want to support retractions.
> Iterations will be a challenge. I assume that Calcite does not support
> iterations, so we probably need to split query / program and optimize parts
> separately (IIRC, this is also how Flink's own optimizer handles this).
> To what extent are you planning to support explicit physical operations
> like partitioning, sorting or optimizer hints?
>
> I haven't had a look at the design document that you shared yet. Probably,
> I'll find answers to some of my questions there ;-)
>
> Regarding the question of SQL or Table API, I agree that extending the
> scope of the Table API does not limit the scope for SQL.
> By adding more operations to the Table API we can expand it to use cases
> that are not well-served by SQL.
> As others have said, we'll of course continue to extend and improve Flink's
> SQL support (within the bounds of the standard).
>
> Best, Fabian
>
> On Tue, Nov 6, 2018 at 10:09, jincheng sun <
> sunjincheng121@gmail.com> wrote:
>
> > Hi Jark,
> > Glad to see your feedback!
> > That's correct, the proposal is aiming to extend the functionality of the
> > Table API! I like adding "drop" to fit the use case you mentioned. Not only
> > that: given a 100-column Table whose UDF needs all 100 columns, we don't
> > want to define the eval method as eval(column0...column99); we prefer to
> > define eval as eval(Row) and use it like this: table.select(udf(*)). We
> > also need to consider whether we package the columns as a row. In a
> > scenario like this, we classify it as a column operation, and list the
> > changes to the column operations after the map/flatMap/agg/flatAgg phase
> > is completed. Currently, Xiaowei has started a thread outlining what we
> > are proposing. Please see the detail in the mail thread:
> >
> > https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB
> >
> > At this stage, the Table API Enhancement Outline is as follows:
> >
> > https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit?usp=sharing
> >
> > Please let us know if you have further thoughts or feedback!
> >
> > Thanks,
> > Jincheng
> >
> >
> > Jark Wu <im...@gmail.com> wrote on Tue, Nov 6, 2018 at 3:35 PM:
> >
> > > Hi Jincheng,
> > >
> > > Thanks for your proposal. I think it is a helpful enhancement for the
> > > Table API and a solid step forward.
> > > It doesn't weaken SQL or DataStream, because the conversion between
> > > DataStream and Table still works.
> > > People with advanced cases (e.g. complex and fine-grained state control)
> > > can go with DataStream, but most general cases can stay in the Table API.
> > > This work aims to extend the functionality of the Table API, to extend
> > > its usage scenarios, and to help the Table API become a more widely used
> > > API.
> > >
> > > For example, someone may want to drop one column from a 100-column Table.
> > > Currently, we have to convert the Table to a DataStream and use a
> > > MapFunction to do that, or select the remaining 99 columns using the
> > > Table.select API. But if we supported a Table.drop() method in the Table
> > > API, it would be a very convenient method and let users stay in Table.
> > >
> > > Looking forward to the more detailed design and further discussion.
> > >
> > > Regards,
> > > Jark
> > >
> > > jincheng sun <su...@gmail.com> wrote on Tue, Nov 6, 2018 at 1:05 PM:
> > >
> > > > Hi Rong Rong,
> > > >
> > > > Sorry for the late reply, and thanks for your feedback! We will continue
> > > > to add more convenience features to the Table API, such as map, flatMap,
> > > > agg, flatAgg, iteration, etc. I am very happy that you are interested in
> > > > this proposal. Since this is long-term, continuous work, we will push it
> > > > in stages. Currently, Xiaowei has started a thread outlining what we are
> > > > proposing. Please see the detail in the mail thread:
> > > >
> > > > https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB
> > > >
> > > > The Table API Enhancement Outline is as follows:
> > > >
> > > > https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit?usp=sharing
> > > >
> > > > Please let us know if you have further thoughts or feedback!
> > > >
> > > > Thanks,
> > > > Jincheng
> > > >
> > > > Fabian Hueske <fh...@gmail.com> wrote on Mon, Nov 5, 2018 at 7:03 PM:
> > > >
> > > > > Hi Jincheng,
> > > > >
> > > > > Thanks for this interesting proposal.
> > > > > I like that we can push this effort forward in a very fine-grained
> > > > manner,
> > > > > i.e., incrementally adding more APIs to the Table API.
> > > > >
> > > > > However, I also have a few questions / concerns.
> > > > > Today, the Table API is tightly integrated with the DataSet and
> > > > > DataStream APIs. It is very easy to convert a Table into a DataSet or
> > > > > DataStream and vice versa. This means it is already easy to combine
> > > > > custom logic and relational operations. What I like is that several
> > > > > aspects are clearly separated, like retraction and timestamp handling
> > > > > (see below), and all libraries on DataStream/DataSet can be easily
> > > > > combined with relational operations.
> > > > > I can see that adding more functionality to the Table API would
> > remove
> > > > the
> > > > > distinction between DataSet and DataStream. However, wouldn't we
> get
> > a
> > > > > similar benefit by extending the DataStream API for proper support
> > for
> > > > > bounded streams (as is the long-term goal of Flink)?
> > > > > I'm also a bit skeptical about the optimization opportunities we
> > would
> > > > > gain. Map/FlatMap UDFs are black boxes that cannot be easily
> removed
> > > > > without additional information (I did some research on this a few
> > years
> > > > ago
> > > > > [1]).
> > > > >
> > > > > Moreover, I think there are a few tricky details that need to be
> > > resolved
> > > > > to enable a good integration.
> > > > >
> > > > > 1) How to deal with retraction messages? The DataStream API does
> not
> > > > have a
> > > > > notion of retractions. How would a MapFunction or FlatMapFunction
> > > handle
> > > > > retraction? Do they need to be aware of the change flag? Custom
> > > windowing
> > > > > and aggregation logic would certainly need to have that
> information.
> > > > > 2) How to deal with timestamps? The DataStream API does not give
> > access
> > > > to
> > > > > timestamps. In the Table API / SQL these are exposed as regular
> > > > attributes.
> > > > > How can we ensure that timestamp attributes remain valid (i.e.
> > aligned
> > > > with
> > > > > watermarks) if the output is produced by arbitrary code?
> > > > > There might be more issues of this kind.
> > > > >
> > > > > My main question would be how much would we gain with this proposal
> > > over
> > > > a
> > > > > tight integration of Table API and DataStream API, assuming that
> > batch
> > > > > functionality is moved to DataStream?
> > > > >
> > > > > Best, Fabian
> > > > >
> > > > > [1]
> http://stratosphere.eu/assets/papers/openingTheBlackBoxes_12.pdf
> > > > >
> > > > >
> > > > > On Mon, Nov 5, 2018 at 02:49, Rong Rong <walterddr@gmail.com> wrote:
> > > > >
> > > > > > Hi Jincheng,
> > > > > >
> > > > > > Thank you for the proposal! I think being able to define a
> process
> > /
> > > > > > co-process function in table API definitely opens up a whole new
> > > level
> > > > of
> > > > > > applications using a unified API.
> > > > > >
> > > > > > In addition, as Tzu-Li and Hequn have mentioned, the optimization
> > > > > > layer of the Table API will already bring additional benefit over
> > > > > > directly programming on top of the DataStream/DataSet API. I am very
> > > > > > interested in and looking forward to seeing the support for more
> > > > > > complex use cases, especially iterations. It will enable the Table API
> > > > > > to define much broader, event-driven use cases such as real-time ML
> > > > > > prediction/training.
> > > > > >
> > > > > > As Timo mentioned, this will make the Table API diverge from the SQL
> > > > > > API. But from my experience, the Table API always gave me the
> > > > > > impression of being a more sophisticated, syntax-aware way to express
> > > > > > relational operations.
> > > > > > Looking forward to further discussion and collaborations on the
> > FLIP
> > > > doc.
> > > > > >
> > > > > > --
> > > > > > Rong
> > > > > >
> > > > > > On Sun, Nov 4, 2018 at 5:22 PM jincheng sun <
> > > sunjincheng121@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi tison,
> > > > > > >
> > > > > > > Thanks a lot for your feedback!
> > > > > > > I am very happy to see that community contributors agree to
> > > > > > > enhancing the Table API. This is long-term, continuous work; we will
> > > > > > > push it in stages. We will soon complete the enhancement list of the
> > > > > > > first phase, and we can then go into deep discussion in the Google
> > > > > > > doc. Thanks again for joining this very important discussion of the
> > > > > > > Flink Table API.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Jincheng
> > > > > > >
> > > > > > > Tzu-Li Chen <wa...@gmail.com> wrote on Fri, Nov 2, 2018 at 1:49 PM:
> > > > > > >
> > > > > > > > Hi Jincheng,
> > > > > > > >
> > > > > > > > Thanks a lot for your proposal! I find it a good starting point
> > > > > > > > for internal optimization work and for helping Flink become more
> > > > > > > > user-friendly.
> > > > > > > >
> > > > > > > > AFAIK, DataStream is currently the most popular API, and Flink
> > > > > > > > users have to describe their logic with it in great detail.
> > > > > > > > From a more internal view, the conversion from DataStream to
> > > > > > > > JobGraph is quite mechanical and hard to optimize. So when
> > > > > > > > users program with DataStream, they have to learn more internals
> > > > > > > > and spend a lot of time tuning for performance.
> > > > > > > > With your proposal, we provide enhanced functionality in the
> > > > > > > > Table API, so that users can easily describe their job at the
> > > > > > > > Table level. This gives Flink developers an opportunity to
> > > > > > > > introduce an optimization phase while transforming the user
> > > > > > > > program (described by the Table API) into the internal
> > > > > > > > representation.
> > > > > > > >
> > > > > > > > Given a user who wants to start using Flink for simple ETL,
> > > > > > > > pipelining, or analytics, he would find it most naturally described
> > > > > > > > by the SQL/Table API. Further, as mentioned by @hequn,
> > > > > > > >
> > > > > > > > SQL is a widely used language. It follows standards, is a
> > > > > > > > > descriptive language, and is easy to use
> > > > > > > >
> > > > > > > >
> > > > > > > > thus we can expect that, with the enhancement of the SQL/Table
> > > > > > > > API, Flink becomes friendlier to users.
> > > > > > > >
> > > > > > > > Looking forward to the design doc/FLIP!
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > tison.
> > > > > > > >
> > > > > > > >
> > > > > > > > jincheng sun <su...@gmail.com> wrote on Fri, Nov 2, 2018 at 11:46 AM:
> > > > > > > >
> > > > > > > > > Hi Hequn,
> > > > > > > > > Thanks for your feedback! And also thanks for our offline
> > > > > discussion!
> > > > > > > > > You are right, unification of batch and streaming is very
> > > > > > > > > important for the Flink API.
> > > > > > > > > We will provide a more detailed design later. Please let me know
> > > > > > > > > if you have further thoughts or feedback.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Jincheng
> > > > > > > > >
> > > > > > > > > Hequn Cheng <ch...@gmail.com> wrote on Fri, Nov 2, 2018 at 10:02 AM:
> > > > > > > > >
> > > > > > > > > > Hi Jincheng,
> > > > > > > > > >
> > > > > > > > > > Thanks a lot for your proposal. It is very encouraging!
> > > > > > > > > >
> > > > > > > > > > As we all know, SQL is a widely used language. It follows
> > > > > > standards,
> > > > > > > > is a
> > > > > > > > > > descriptive language, and is easy to use. A powerful
> > feature
> > > of
> > > > > SQL
> > > > > > > is
> > > > > > > > > that
> > > > > > > > > > it supports optimization. Users only need to care about
> the
> > > > logic
> > > > > > of
> > > > > > > > the
> > > > > > > > > > program. The underlying optimizer will help users
> optimize
> > > the
> > > > > > > > > performance
> > > > > > > > > > of the program. However, in terms of functionality and ease
> > > > > > > > > > of use, SQL is limited in some scenarios, as described in
> > > > > > > > > > Jincheng's proposal.
> > > > > > > > > >
> > > > > > > > > > Correspondingly, the DataStream/DataSet API provides powerful
> > > > > > > > > > functionality. Users can write a
> > > > > > > > > > ProcessFunction/CoProcessFunction and use timers. Compared
> > > > > > > > > > with SQL, it provides more functionality and flexibility.
> > > > > > > > > > However, it does not support optimization like SQL does.
> > > > > > > > > > Meanwhile, the DataStream and DataSet APIs have not been
> > > > > > > > > > unified, which means that, for the same logic, users need to
> > > > > > > > > > write a separate job for stream and for batch.
> > > > > > > > > >
> > > > > > > > > > With the Table API, I think we can combine the advantages of
> > > > > > > > > > both. Users can easily write relational operations and enjoy
> > > > > > > > > > optimization. At the same time, it supports more functionality
> > > > > > > > > > and ease of use. Looking forward to the detailed design/FLIP.
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Hequn
> > > > > > > > > >
> > > > > > > > > > On Fri, Nov 2, 2018 at 9:48 AM Shaoxuan Wang <
> > > > > wshaoxuan@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Aljoscha,
> > > > > > > > > > > Glad that you like the proposal. We have completed the
> > > > > > > > > > > prototype of most of the newly proposed functionality. Once
> > > > > > > > > > > we collect the feedback from the community, we will come up
> > > > > > > > > > > with a concrete FLIP/design doc.
> > > > > > > > > > >
> > > > > > > > > > > Regards,
> > > > > > > > > > > Shaoxuan
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Nov 1, 2018 at 8:12 PM Aljoscha Krettek <
> > > > > > > aljoscha@apache.org
> > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Jincheng,
> > > > > > > > > > > >
> > > > > > > > > > > > these points sound very good! Are there any concrete
> > > > > proposals
> > > > > > > for
> > > > > > > > > > > > changes? For example a FLIP/design document?
> > > > > > > > > > > >
> > > > > > > > > > > > See here for FLIPs:
> > > > > > > > > > > >
> > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
> > > > > > > > > > > >
> > > > > > > > > > > > Best,
> > > > > > > > > > > > Aljoscha
> > > > > > > > > > > >
> > > > > > > > > > > > > On 1. Nov 2018, at 12:51, jincheng sun <
> > > > > > > sunjincheng121@gmail.com
> > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > *--------I am sorry for the formatting of the email
> > > > > content.
> > > > > > I
> > > > > > > > > > reformat
> > > > > > > > > > > > > the **content** as follows-----------*
> > > > > > > > > > > > >
> > > > > > > > > > > > > *Hi ALL,*
> > > > > > > > > > > > >
> > > > > > > > > > > > > With the continuous efforts from the community, the
> > > Flink
> > > > > > > system
> > > > > > > > > has
> > > > > > > > > > > been
> > > > > > > > > > > > > continuously improved, which has attracted more and
> > > more
> > > > > > users.
> > > > > > > > > Flink
> > > > > > > > > > > SQL
> > > > > > > > > > > > > is a canonical, widely used relational query
> > language.
> > > > > > However,
> > > > > > > > > there
> > > > > > > > > > > are
> > > > > > > > > > > > > still some scenarios where Flink SQL failed to meet
> > > user
> > > > > > needs
> > > > > > > in
> > > > > > > > > > terms
> > > > > > > > > > > > of
> > > > > > > > > > > > > functionality and ease of use, such as:
> > > > > > > > > > > > >
> > > > > > > > > > > > > *1. In terms of functionality*
> > > > > > > > > > > > >    Iteration, user-defined window, user-defined
> join,
> > > > > > > > user-defined
> > > > > > > > > > > > > GroupReduce, etc. Users cannot express them with
> SQL;
> > > > > > > > > > > > >
> > > > > > > > > > > > > *2. In terms of ease of use*
> > > > > > > > > > > > >
> > > > > > > > > > > > >   - Map - e.g. “dataStream.map(mapFun)”. Although
> > > > > > > > > > “table.select(udf1(),
> > > > > > > > > > > > >   udf2(), udf3()....)” can be used to accomplish
> the
> > > same
> > > > > > > > > function.,
> > > > > > > > > > > > with a
> > > > > > > > > > > > >   map() function returning 100 columns, one has to
> > > define
> > > > > or
> > > > > > > call
> > > > > > > > > 100
> > > > > > > > > > > > UDFs
> > > > > > > > > > > > >   when using SQL, which is quite involved.
> > > > > > > > > > > > >   - FlatMap -  e.g.
> “dataStrem.flatmap(flatMapFun)”.
> > > > > > Similarly,
> > > > > > > > it
> > > > > > > > > > can
> > > > > > > > > > > be
> > > > > > > > > > > > >   implemented with “table.join(udtf).select()”.
> > > However,
> > > > it
> > > > > > is
> > > > > > > > > > obvious
> > > > > > > > > > > > that
> > > > > > > > > > > > >   dataStream is easier to use than SQL.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Due to the above two reasons, some users have to
> use
> > > the
> > > > > > > > DataStream
> > > > > > > > > > API
> > > > > > > > > > > > or
> > > > > > > > > > > > > the DataSet API. But when they do that, they lose
> the
> > > > > > > unification
> > > > > > > > > of
> > > > > > > > > > > > batch
> > > > > > > > > > > > > and streaming. They will also lose the
> sophisticated
> > > > > > > > optimizations
> > > > > > > > > > such
> > > > > > > > > > > > as
> > > > > > > > > > > > > codegen, aggregate join transpose and multi-stage
> agg
> > > > from
> > > > > > > Flink
> > > > > > > > > SQL.
> > > > > > > > > > > > >
> > > > > > > > > > > > > We believe that enhancing the functionality and
> > > > > productivity
> > > > > > is
> > > > > > > > > vital
> > > > > > > > > > > for
> > > > > > > > > > > > > the successful adoption of Table API. To this end,
> > > Table
> > > > > API
> > > > > > > > still
> > > > > > > > > > > > > requires more efforts from every contributor in the
> > > > > > community.
> > > > > > > We
> > > > > > > > > see
> > > > > > > > > > > > great
> > > > > > > > > > > > > opportunity in improving our user’s experience from
> > > this
> > > > > > work.
> > > > > > > > Any
> > > > > > > > > > > > feedback
> > > > > > > > > > > > > is welcome.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Regards,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Jincheng
> > > > > > > > > > > > >
> > > > > > > > > > > > > jincheng sun <su...@gmail.com> wrote on Thu, Nov 1, 2018 at 5:07 PM:
> > > > > > > > > > > > >
> > > > > > > > > > > > >> Hi all,
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> With the continuous efforts from the community,
> the
> > > > Flink
> > > > > > > system
> > > > > > > > > has
> > > > > > > > > > > > been

Re: [DISCUSS] Enhancing the functionality and productivity of Table API

Posted by Fabian Hueske <fh...@gmail.com>.
Thanks for the replies Xiaowei and others!

You are right, I did not consider the batch optimization that would be
missing if the DataSet API were ported to extend the DataStream API.
By extending the scope of the Table API, we can gain a holistic logical &
physical optimization, which would be great!
Is your plan to move all DataSet API functionality into the Table API?
If so, do you envision any batch-related API in DataStream at all, or
should this be done by converting a batch table to a DataStream? I'm asking
because if there were batch features in DataStream, we would need some
optimization there as well.

I think the proposed separation of Table API (stateless APIs) and
DataStream (APIs that expose state handling) is a good idea.
On a side note, the DataSet API discouraged state handling in user
functions, so porting it to the Table API would be quite "natural".

As I said before, I like that we can incrementally extend the Table API.
Map and FlatMap functions do not seem too difficult (see the sketch below).
Reduce, GroupReduce, Combine, GroupCombine, and MapPartition might be more
tricky, especially if we want to support retractions.
Iterations will be a challenge. I assume that Calcite does not support
iterations, so we would probably need to split the query / program and
optimize the parts separately (IIRC, this is also how Flink's own optimizer
handles this).
To what extent are you planning to support explicit physical operations
like partitioning, sorting, or optimizer hints?
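
For illustration, a minimal sketch of what a row-based map() on Table
could look like. The table.map() method is hypothetical, proposed API;
MapFunction and Row are existing Flink classes, and "table" is assumed
to be an existing Table:

    // Hypothetical: map over whole rows instead of one UDF per column.
    Table result = table.map(new MapFunction<Row, Row>() {
        @Override
        public Row map(Row in) {
            Row out = new Row(2);
            out.setField(0, in.getField(0)); // forward a key column
            out.setField(1, in.getArity());  // derive a new column
            return out;
        }
    });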

I haven't had a look at the design document that you shared yet. I'll
probably find answers to some of my questions there ;-)

Regarding the question of SQL or Table API, I agree that extending the
scope of the Table API does not limit the scope of SQL.
By adding more operations to the Table API, we can expand it to use cases
that are not well served by SQL.
As others have said, we'll of course continue to extend and improve Flink's
SQL support (within the bounds of the standard).

Best, Fabian

Re: [DISCUSS] Enhancing the functionality and productivity of Table API

Posted by jincheng sun <su...@gmail.com>.
Hi Jark,
Glad to see your feedback!
That's correct, the proposal is aiming to extend the functionality of the
Table API! I like adding "drop" to fit the use case you mentioned. Not only
that: if we have a 100-column Table and our UDF needs all 100 columns, we
don't want to define the eval method as eval(column0...column99); we prefer
to define it as eval(Row) and use it like this: table.select(udf(*)) (see
the sketch below). We also need to consider whether to pack the columns
into a row. In a scenario like this, we classify it as a column operation,
and list the changes to the column operation after the
map/flatMap/agg/flatAgg phase is completed. Currently, Xiaowei has started
a thread outlining what we are proposing. Please see the detail in the mail
thread:
https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB
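
For illustration, a minimal sketch of such a row-based UDF. ScalarFunction
and Row are existing Flink classes; the table.select(udf(*)) call is the
proposed syntax, not current API:

    // One eval(Row) method instead of eval(column0, ..., column99).
    public class SumAllColumns extends ScalarFunction {
        public double eval(Row row) {
            double sum = 0.0;
            for (int i = 0; i < row.getArity(); i++) {
                Object field = row.getField(i);
                if (field instanceof Number) {
                    sum += ((Number) field).doubleValue();
                }
            }
            return sum;
        }
    }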

At this stage the Table API Enhancement Outline is as follows:
https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit?usp=sharing

Please let us know if you have further thoughts or feedback!

Thanks,
Jincheng

Re: [DISCUSS] Enhancing the functionality and productivity of Table API

Posted by Jark Wu <im...@gmail.com>.
Hi Jincheng,

Thanks for your proposal. I think it is a helpful enhancement for the
TableAPI and a solid step forward for it.
It doesn't weaken SQL or DataStream, because the conversion between
DataStream and Table still works (see the round-trip sketch below).
People with advanced cases (e.g. complex and fine-grained state control)
can go with DataStream,
but most general cases can stay in the TableAPI. This work aims to extend
the functionality of the TableAPI,
to broaden its usage scenarios, and to help the TableAPI become a more
widely used API.
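
For completeness, the existing round-trip looks roughly like this, assuming
a StreamTableEnvironment tableEnv and a DataStream<Row> named input (field
names are made up):

    // DataStream -> Table
    Table table = tableEnv.fromDataStream(input, "user, amount");

    // Table -> DataStream; the Boolean flag marks insert vs. retract
    DataStream<Tuple2<Boolean, Row>> out =
        tableEnv.toRetractStream(table, Row.class);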

For example, someone wants to drop one column from a 100-column Table.
Currently, we have to convert the Table to a DataStream and use a
MapFunction to do that, or select the remaining 99 columns using the
Table.select API.
But if we supported a Table.drop() method in the TableAPI, it would be a
very convenient way to let users stay in Table.
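
For illustration, assuming a Table named table with columns a1 ... a100
(Table.drop() is the proposed method, not existing API):

    // Today, option 1: enumerate the 99 columns to keep.
    Table pruned = table.select("a1, a2, a3"); // ... and so on through a99

    // Today, option 2: convert to a DataStream and drop the field there
    // with a MapFunction that copies every field except a100.
    DataStream<Row> rows = tableEnv.toAppendStream(table, Row.class);

    // Proposed: drop just the unwanted column and stay in the Table API.
    Table result = table.drop("a100");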

Looking forward to the more detailed design and further discussion.

Regards,
Jark

> > > > user-defined
> > > > > > > > > >> GroupReduce, etc. Users cannot express them with SQL;
> > > > > > > > > >>
> > > > > > > > > >>   -
> > > > > > > > > >>
> > > > > > > > > >>   In terms of ease of use
> > > > > > > > > >>   -
> > > > > > > > > >>
> > > > > > > > > >>      Map - e.g. “dataStream.map(mapFun)”. Although
> > > > > > > > “table.select(udf1(),
> > > > > > > > > >>      udf2(), udf3()....)” can be used to accomplish the
> > same
> > > > > > > > function.,
> > > > > > > > > with a
> > > > > > > > > >>      map() function returning 100 columns, one has to
> > define
> > > > or
> > > > > > call
> > > > > > > > > 100 UDFs
> > > > > > > > > >>      when using SQL, which is quite involved.
> > > > > > > > > >>      -
> > > > > > > > > >>
> > > > > > > > > >>      FlatMap -  e.g. “dataStrem.flatmap(flatMapFun)”.
> > > > Similarly,
> > > > > > it
> > > > > > > > can
> > > > > > > > > >>      be implemented with “table.join(udtf).select()”.
> > > However,
> > > > > it
> > > > > > is
> > > > > > > > > obvious
> > > > > > > > > >>      that datastream is easier to use than SQL.
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >> Due to the above two reasons, some users have to use the
> > > > > > DataStream
> > > > > > > > API
> > > > > > > > > or
> > > > > > > > > >> the DataSet API. But when they do that, they lose the
> > > > > unification
> > > > > > of
> > > > > > > > > batch
> > > > > > > > > >> and streaming. They will also lose the sophisticated
> > > > > optimizations
> > > > > > > > such
> > > > > > > > > as
> > > > > > > > > >> codegen, aggregate join transpose  and multi-stage agg
> > from
> > > > > Flink
> > > > > > > SQL.
> > > > > > > > > >>
> > > > > > > > > >> We believe that enhancing the functionality and
> > productivity
> > > > is
> > > > > > > vital
> > > > > > > > > for
> > > > > > > > > >> the successful adoption of Table API. To this end,
> Table
> > > API
> > > > > > still
> > > > > > > > > >> requires more efforts from every contributor in the
> > > community.
> > > > > We
> > > > > > > see
> > > > > > > > > great
> > > > > > > > > >> opportunity in improving our user’s experience from this
> > > work.
> > > > > Any
> > > > > > > > > feedback
> > > > > > > > > >> is welcome.
> > > > > > > > > >>
> > > > > > > > > >> Regards,
> > > > > > > > > >>
> > > > > > > > > >> Jincheng
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Enhancing the functionality and productivity of Table API

Posted by jincheng sun <su...@gmail.com>.
Hi Rong Rong,

Sorry for the late reply, and thanks for your feedback! We will continue
to add more convenience features to the Table API, such as map, flatMap,
agg, flatAgg, iteration, etc. I am very happy that you are interested in
this proposal. Since this is a long-term, continuous effort, we will push
it forward in stages. Xiaowei has started a thread outlining what we are
proposing. Please see the details in the mail thread:
https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB

The Table API Enhancement Outline is as follows:
https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit?usp=sharing
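
As a small taste of the direction, here is a rough sketch with
hypothetical method names (Table.map() below is proposed, not an existing
API; the ScalarFunction follows today's UDF style):

    import org.apache.flink.table.functions.ScalarFunction;
    import org.apache.flink.types.Row;

    // One row-valued function produces all output columns at once,
    // instead of one UDF call per output column.
    public class ParseLog extends ScalarFunction {
        public Row eval(String raw) {
            String[] f = raw.split(",");
            return Row.of(f[0], f[1], Integer.valueOf(f[2]));
        }
    }

    // today: one expression per output column
    //   table.select("udf1(line), udf2(line), udf3(line), ...")
    // proposed (hypothetical): a single row-valued function
    //   table.map(new ParseLog())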

Please let us know if you have further thoughts or feedback!

Thanks,
Jincheng

On Mon, Nov 5, 2018 at 7:03 PM Fabian Hueske <fh...@gmail.com> wrote:

> Hi Jincheng,
>
> Thanks for this interesting proposal.
> I like that we can push this effort forward in a very fine-grained manner,
> i.e., incrementally adding more APIs to the Table API.
>
> However, I also have a few questions / concerns.
> Today, the Table API is tightly integrated with the DataSet and DataStream
> APIs. It is very easy to convert a Table into a DataSet or DataStream and
> vice versa. This means it is already easy to combine custom logic and
> relational operations. What I like is that several aspects are clearly
> separated like retraction and timestamp handling (see below) + all
> libraries on DataStream/DataSet can be easily combined with relational
> operations.
> I can see that adding more functionality to the Table API would remove the
> distinction between DataSet and DataStream. However, wouldn't we get a
> similar benefit by extending the DataStream API for proper support for
> bounded streams (as is the long-term goal of Flink)?
> I'm also a bit skeptical about the optimization opportunities we would
> gain. Map/FlatMap UDFs are black boxes that cannot be easily removed
> without additional information (I did some research on this a few years ago
> [1]).
>
> Moreover, I think there are a few tricky details that need to be resolved
> to enable a good integration.
>
> 1) How to deal with retraction messages? The DataStream API does not have a
> notion of retractions. How would a MapFunction or FlatMapFunction handle
> retraction? Do they need to be aware of the change flag? Custom windowing
> and aggregation logic would certainly need to have that information.
> 2) How to deal with timestamps? The DataStream API does not give access to
> timestamps. In the Table API / SQL these are exposed as regular attributes.
> How can we ensure that timestamp attributes remain valid (i.e. aligned with
> watermarks) if the output is produced by arbitrary code?
> There might be more issues of this kind.
>
> My main question would be how much would we gain with this proposal over a
> tight integration of Table API and DataStream API, assuming that batch
> functionality is moved to DataStream?
>
> Best, Fabian
>
> [1] http://stratosphere.eu/assets/papers/openingTheBlackBoxes_12.pdf

Re: [DISCUSS] Enhancing the functionality and productivity of Table API

Posted by Becket Qin <be...@gmail.com>.
Hi Xiaogang,

Thanks for the comments. Please see the responses inline below.

On Tue, Nov 6, 2018 at 11:28 AM SHI Xiaogang <sh...@gmail.com> wrote:

> Hi all,
>
> I think it's good to enhance the functionality and productivity of the
> Table API, but I still think SQL + DataStream is the better choice from a
> user-experience standpoint:
> 1. The unification of batch and stream processing is very attractive, and
> many of our users are moving their batch-processing applications to Flink.
> These users are familiar with SQL, and most of their applications are
> written in SQL-like languages. In ad-hoc or interactive scenarios, where
> users write new queries according to the results of previous queries, SQL
> is much more efficient than the Table API.
>
As of now, both SQL and the Table API support batch and streaming at the
same time. However, their scopes are a little different. SQL is a query
language that perfectly fits scenarios such as OLAP/BI, but it falls short
when it comes to cases like ML/graph processing, where a programming
interface is needed. As you may have noticed, fundamental features such as
iteration are missing from SQL, and people rarely write algorithms in SQL.
By the same argument you make in (2), such users want a programming
interface rather than a query language. As Xiaowei mentioned, we are
extending the Table API to a wider range of use cases, so enhancing the
Table API is necessary to that goal.
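
(A toy illustration of that gap, with table names and schema assumed: in a
programming API the refinement loop below is natural, while in SQL it
would have to be unrolled by hand into N nested queries.)

    import org.apache.flink.table.api.Table;

    // Hypothetical sketch of an iterative program, e.g. a training loop.
    // tEnv is an existing TableEnvironment; "initial_model" is assumed.
    Table model = tEnv.scan("initial_model");
    for (int i = 0; i < 10; i++) {
        // each round derives a new model table from the previous one
        model = model.groupBy("key").select("key, value.avg as value");
    }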

> 2. Though the Table API is also declarative, most users typically refuse
> to learn another language, not to mention a fast-evolving one.
> 3. Much optimization work remains for both the Table API and SQL. Given
> that most data processing systems, including databases, MapReduce-like
> systems, and continuous-query systems, have done a lot of work on
> optimizing SQL queries, it seems easier to adopt that existing work in
> Flink with SQL than with the Table API.
>
WRT optimization: at the end of the day, SQL and the Table API share
exactly the same query-optimization path, so any optimization made for SQL
is available to the Table API at no additional cost. The additional
optimization opportunity Xiaowei mentioned is specifically for UDFs, which
are black boxes today and may be optimized via symbolic expressions.
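
(To make the black-box point concrete, an assumed example, not from the
thread: a predicate hidden inside a UDF is opaque to the planner, while
the same predicate written as an expression can be pushed down.)

    import org.apache.flink.table.functions.ScalarFunction;

    // The optimizer cannot look inside eval(), so a filter expressed
    // through this UDF cannot be pushed below a join; the expression
    // form in the last comment line can.
    public class IsAdult extends ScalarFunction {
        public boolean eval(int age) {
            return age >= 18;
        }
    }

    // opaque to the planner:   table.filter("isAdult(age)")
    // visible to the planner:  table.filter("age >= 18")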

> 4. DataStream is still welcome, as users can handle complex logic with
> its flexible interfaces. It is also widely adopted in mission-critical
> applications, where users apply a lot of "tricky" optimizations to achieve
> optimal performance. Most of these tricks are typically not possible in
> the Table API or SQL.

Completely agree that DataStream is important for many advanced cases, and
we should keep investing in it to make it stronger. Enhancing the Table API
does not mean getting rid of DataStream. The Table API is a high-level API
with which users only need to focus on their business logic, while the
DataStream API is a lower-level API for users who want strong control over
all the execution details. Both are valuable, each to different users.

>


> Regards,
> Xiaogang
>
> On Mon, Nov 5, 2018 at 8:34 PM Xiaowei Jiang <xi...@gmail.com> wrote:
>
> > Hi Fabian, these are great questions! I have some quick thoughts on
> > some of these.
> >
> > Optimization opportunities: I think you are right that UDFs are more
> > like black boxes today. However, this can change if we let users
> > develop UDFs symbolically in the future (i.e., Flink will look inside
> > the UDF code, understand it, and potentially do codegen to execute it).
> > This will open the door for potential optimizations.
> >
> > Moving batch functionality to DataStream: I actually think that moving
> > batch functionality to the Table API is probably a better idea.
> > Currently, the DataStream API is very deterministic and imperative. On
> > the other hand, the DataSet API has an optimizer and can choose very
> > different execution plans. Another distinguishing feature of the
> > DataStream API is that users get direct access to state/state backends,
> > which we have intentionally avoided in the Table API so far. The
> > introduction of state is probably the biggest innovation by Flink. At
> > the same time, abusing state may also be the largest source of
> > reliability/performance issues. Shielding users from dealing with state
> > directly is a key advantage of the Table API. I think this is probably
> > a good boundary between DataStream and the Table API: if users HAVE to
> > manipulate state explicitly, go with DataStream; otherwise, go with the
> > Table API.
> >
> > Because Flink is extending into more and more scenarios (e.g., batch,
> > streaming & micro-services), we may inevitably end up with multiple
> > APIs. It appears that data-analytics (batch & streaming) applications
> > can be well served by the Table API, which not only unifies batch and
> > streaming but also relieves users from dealing with state explicitly.
> > On the other hand, DataStream is very convenient and powerful for
> > micro-service kinds of applications, because explicit state access may
> > be necessary in such cases.
> >
> > We will start a thread outlining what we are proposing in the Table API.
> >
> > Regards,
> > Xiaowei

Re: [DISCUSS] Enhancing the functionality and productivity of Table API

Posted by jincheng sun <su...@gmail.com>.
Hi Xiaogang,

Thanks for your feedback! I will share my thoughts here:

First, enhancing the Table API does not mean weakening SQL. We also need to
enhance the functionality of SQL, as in @Xuefu's ongoing integration of the
Hive SQL ecosystem.

In addition, SQL and the Table API are two different forms of Flink's
high-level API, and users can choose between them according to their own
preferences and habits.

Regarding #3: SQL and the Table API share the same optimization framework,
so both can reuse the optimization rules developed for traditional
databases.

Regarding #4: You are right. Whether or not we enhance the Table API, we
have to retain the DataStream API, because DataStream gives users enough
control.

That said, keeping Flink SQL on the ANSI SQL standard is worth insisting
on, and ANSI SQL can hardly express convenient operations such as
map/flatMap. So we want to add map/flatMap to the Table API, to improve its
ease of use and the overall Flink user experience. What do you think?
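
For illustration, a minimal sketch of the flatMap case (Table.flatMap()
is hypothetical; the TableFunction below follows today's UDTF pattern):

    import org.apache.flink.table.functions.TableFunction;

    // One input row fans out into many output rows.
    public class Split extends TableFunction<String> {
        public void eval(String line) {
            for (String word : line.split(" ")) {
                collect(word);
            }
        }
    }

    // today, as noted in the proposal: table.join(udtf).select(...)
    // proposed (hypothetical):         table.flatMap(new Split())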

Thanks,
Jincheng

On Tue, Nov 6, 2018 at 11:28 AM SHI Xiaogang <sh...@gmail.com> wrote:

> Hi all,
>
> I think it's good to enhance the functionality and productivity of the
> Table API, but I still think SQL + DataStream is the better choice from a
> user-experience standpoint:
> 1. The unification of batch and stream processing is very attractive, and
> many of our users are moving their batch-processing applications to Flink.
> These users are familiar with SQL, and most of their applications are
> written in SQL-like languages. In ad-hoc or interactive scenarios, where
> users write new queries according to the results of previous queries, SQL
> is much more efficient than the Table API.
> 2. Though the Table API is also declarative, most users typically refuse
> to learn another language, not to mention a fast-evolving one.
> 3. Much optimization work remains for both the Table API and SQL. Given
> that most data processing systems, including databases, MapReduce-like
> systems, and continuous-query systems, have done a lot of work on
> optimizing SQL queries, it seems easier to adopt that existing work in
> Flink with SQL than with the Table API.
> 4. DataStream is still welcome, as users can handle complex logic with
> its flexible interfaces. It is also widely adopted in mission-critical
> applications, where users apply a lot of "tricky" optimizations to achieve
> optimal performance. Most of these tricks are typically not possible in
> the Table API or SQL.
>
> Regards,
> Xiaogang
>
> Xiaowei Jiang <xi...@gmail.com> 于2018年11月5日周一 下午8:34写道:
>
> > Hi Fabian, these are great questions! I have some quick thoughts on some
> of
> > these.
> >
> > Optimization opportunities: I think that you are right UDFs are more like
> > blackboxes today. However this can change if we let user develop UDFs
> > symbolically in the future (i.e., Flink will look inside the UDF code,
> > understand it and potentially do codegen to execute it). This will open
> the
> > door for potential optimizations.
> >
> > Moving batch functionality to DataStream: I actually think that moving
> > batch functionality to Table API is probably a better idea. Currently,
> > DataStream API is very deterministic and imperative. On the other hand,
> > DataSet API has an optimizer and can choose very different execution
> plans.
> > Another distinguishing feature of DataStream API is that users get direct
> > access to state/statebackend which we intensionally avoided in Table API
> so
> > far. The introduction of states is probably the biggest innovation by
> > Flink. At the same time, abusing states may also be the largest source
> for
> > reliability/performance issues. Shielding users away from dealing with
> > state directly is a key advantage in Table API. I think this is probably
> a
> > good boundary between DataStream and Table API. If users HAVE to
> manipulate
> > state explicitly, go with DataStream, otherwise, go with Table API.
> >
> > Because Flink is extending into more and more scenarios (e.g., batch,
> > streaming & micro-service), we may inevitably end up with multiple APIs.
> It
> > appears that data analytics (batch & streaming) related applications can
> be
> > well served with Table API which not only unifies batch and streaming,
> but
> > also relieves the users from dealing with states explicitly. On the other
> > hand, DataStream is very convenient and powerful for micro-service kind
> of
> > applications because explicitly state access may be necessary in such
> > cases.
> >
> > We will start a threading outlining what we are proposing in Table API.
> >
> > Regards,
> > Xiaowei
> >
> > On Mon, Nov 5, 2018 at 7:03 PM Fabian Hueske <fh...@gmail.com> wrote:
> >
> > > Hi Jincheng,
> > >
> > > Thanks for this interesting proposal.
> > > I like that we can push this effort forward in a very fine-grained
> > manner,
> > > i.e., incrementally adding more APIs to the Table API.
> > >
> > > However, I also have a few questions / concerns.
> > > Today, the Table API is tightly integrated with the DataSet and
> > DataStream
> > > APIs. It is very easy to convert a Table into a DataSet or DataStream
> and
> > > vice versa. This mean it is already easy to combine custom logic an
> > > relational operations. What I like is that several aspects are clearly
> > > separated like retraction and timestamp handling (see below) + all
> > > libraries on DataStream/DataSet can be easily combined with relational
> > > operations.
> > > I can see that adding more functionality to the Table API would remove
> > the
> > > distinction between DataSet and DataStream. However, wouldn't we get a
> > > similar benefit by extending the DataStream API for proper support for
> > > bounded streams (as is the long-term goal of Flink)?
> > > I'm also a bit skeptical about the optimization opportunities we would
> > > gain. Map/FlatMap UDFs are black boxes that cannot be easily removed
> > > without additional information (I did some research on this a few years
> > ago
> > > [1]).
> > >
> > > Moreover, I think there are a few tricky details that need to be
> resolved
> > > to enable a good integration.
> > >
> > > 1) How to deal with retraction messages? The DataStream API does not
> > have a
> > > notion of retractions. How would a MapFunction or FlatMapFunction
> handle
> > > retraction? Do they need to be aware of the change flag? Custom
> windowing
> > > and aggregation logic would certainly need to have that information.
> > > 2) How to deal with timestamps? The DataStream API does not give access
> > to
> > > timestamps. In the Table API / SQL these are exposed as regular
> > attributes.
> > > How can we ensure that timestamp attributes remain valid (i.e. aligned
> > with
> > > watermarks) if the output is produced by arbitrary code?
> > > There might be more issues of this kind.
> > >
> > > My main question would be how much would we gain with this proposal
> over
> > a
> > > tight integration of Table API and DataStream API, assuming that batch
> > > functionality is moved to DataStream?
> > >
> > > Best, Fabian
> > >
> > > [1] http://stratosphere.eu/assets/papers/openingTheBlackBoxes_12.pdf
> > >
> > >
> > > Am Mo., 5. Nov. 2018 um 02:49 Uhr schrieb Rong Rong <
> walterddr@gmail.com
> > >:
> > >
> > > > Hi Jincheng,
> > > >
> > > > Thank you for the proposal! I think being able to define a process /
> > > > co-process function in table API definitely opens up a whole new
> level
> > of
> > > > applications using a unified API.
> > > >
> > > > In addition, as Tzu-Li and Hequn have mentioned, the benefit of
> > > > optimization layer of Table API will already bring in additional
> > benefit
> > > > over directly programming on top of DataStream/DataSet API. I am very
> > > > interested an looking forward to seeing the support for more complex
> > use
> > > > cases, especially iterations. It will enable table API to define much
> > > > broader, event-driven use cases such as real-time ML
> > prediction/training.
> > > >
> > > > As Timo mentioned, This will make Table API diverge from the SQL API.
> > But
> > > > as from my experience Table API was always giving me the impression
> to
> > > be a
> > > > more sophisticated, syntactic-aware way to express relational
> > operations.
> > > > Looking forward to further discussion and collaborations on the FLIP
> > doc.
> > > >
> > > > --
> > > > Rong
> > > >
> > > > On Sun, Nov 4, 2018 at 5:22 PM jincheng sun <
> sunjincheng121@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi tison,
> > > > >
> > > > > Thanks a lot for your feedback!
> > > > > I am very happy to see that community contributors agree to
> enhanced
> > > the
> > > > > TableAPI. This work is a long-term continuous work, we will push it
> > in
> > > > > stages, we will soon complete  the enhanced list of the first
> phase,
> > we
> > > > can
> > > > > go deep discussion  in google doc. thanks again for joining on the
> > very
> > > > > important discussion of the Flink Table API.
> > > > >
> > > > > Thanks,
> > > > > Jincheng

Re: [DISCUSS] Enhancing the functionality and productivity of Table API

Posted by SHI Xiaogang <sh...@gmail.com>.
Hi all,

I think it's good to enhance the functionality and productivity of the
Table API, but I still think SQL + DataStream is a better choice from a
user-experience perspective:
1. The unification of batch and stream processing is very attractive, and
many of our users are moving their batch-processing applications to Flink.
These users are familiar with SQL and most of their applications are
written in SQL-like languages. In ad-hoc or interactive scenarios where
users write new queries based on the results of previous queries, SQL is
much more efficient than the Table API.
2. Though the Table API is also declarative, most users typically refuse to
learn another language, not to mention a fast-evolving one.
3. There is still much work to be done on the optimization of both Table
API and SQL. Given that most data processing systems, including databases,
MapReduce-like systems and continuous-query systems, have done a lot of
work on the optimization of SQL queries, it seems easier to adopt existing
work in Flink with SQL than with the Table API.
4. DataStream is still welcome because users can handle complex logic
through its flexible interfaces. It is also widely adopted in
mission-critical applications where users apply many "tricky" optimizations
to achieve optimal performance. Most of these tricks are typically not
possible in the Table API or SQL.

Regards,
Xiaogang


Re: [DISCUSS] Enhancing the functionality and productivity of Table API

Posted by Xiaowei Jiang <xi...@gmail.com>.
Hi Fabian, these are great questions! I have some quick thoughts on some of
these.

Optimization opportunities: I think you are right that UDFs are more like
black boxes today. However, this can change if we let users develop UDFs
symbolically in the future (i.e., Flink will look inside the UDF code,
understand it, and potentially do codegen to execute it). This will open
the door for potential optimizations.
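
For illustration, a minimal sketch of such a black-box UDF with the Flink
1.x ScalarFunction interface; the class, table, and field names below are
invented for the example:

    import org.apache.flink.table.functions.ScalarFunction;

    // To the planner this class is opaque: it cannot see that eval() is
    // pure arithmetic, so it cannot fuse or push down the computation the
    // way it could for a built-in expression.
    public class AddOne extends ScalarFunction {
        public int eval(int x) {
            return x + 1;
        }
    }

    // Registration and use (hypothetical table/field names):
    // tableEnv.registerFunction("addOne", new AddOne());
    // Table result = orders.select("id, addOne(amount)");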

Moving batch functionality to DataStream: I actually think that moving
batch functionality to the Table API is probably a better idea. Currently,
the DataStream API is very deterministic and imperative. On the other hand,
the DataSet API has an optimizer and can choose very different execution
plans. Another distinguishing feature of the DataStream API is that users
get direct access to state/statebackends, which we intentionally avoided in
the Table API so far. The introduction of state is probably the biggest
innovation by Flink. At the same time, abusing state may also be the
largest source of reliability/performance issues. Shielding users from
dealing with state directly is a key advantage of the Table API. I think
this is probably a good boundary between DataStream and the Table API: if
users HAVE to manipulate state explicitly, go with DataStream; otherwise,
go with the Table API.
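
A minimal sketch of that boundary, assuming the Flink 1.x APIs (the class,
key, field, and table names are invented): the DataStream version manages
its state by hand, while the Table API expresses the same running
aggregation without touching state at all.

    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.util.Collector;

    // DataStream side: state is explicit and user-managed.
    public class RunningSum extends
            KeyedProcessFunction<String, Tuple2<String, Long>, Tuple2<String, Long>> {

        private transient ValueState<Long> sum;

        @Override
        public void open(Configuration parameters) {
            sum = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("sum", Long.class));
        }

        @Override
        public void processElement(Tuple2<String, Long> in, Context ctx,
                Collector<Tuple2<String, Long>> out) throws Exception {
            Long current = sum.value();
            long updated = (current == null ? 0L : current) + in.f1;
            sum.update(updated);
            out.collect(Tuple2.of(in.f0, updated));
        }
    }

    // Table API side: the same logic; the state behind the running
    // aggregate is implicit and managed entirely by the runtime:
    // Table result = input.groupBy("key").select("key, amount.sum as total");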

Because Flink is extending into more and more scenarios (e.g., batch,
streaming & micro-services), we may inevitably end up with multiple APIs.
It appears that data analytics (batch & streaming) applications can be well
served by the Table API, which not only unifies batch and streaming but
also relieves users from dealing with state explicitly. On the other hand,
DataStream is very convenient and powerful for micro-service kinds of
applications because explicit state access may be necessary in such cases.

We will start a thread outlining what we are proposing in the Table API.

Regards,
Xiaowei


Re: [DISCUSS] Enhancing the functionality and productivity of Table API

Posted by Fabian Hueske <fh...@gmail.com>.
Hi Jincheng,

Thanks for this interesting proposal.
I like that we can push this effort forward in a very fine-grained manner,
i.e., incrementally adding more APIs to the Table API.

However, I also have a few questions / concerns.
Today, the Table API is tightly integrated with the DataSet and DataStream
APIs. It is very easy to convert a Table into a DataSet or DataStream and
vice versa. This means it is already easy to combine custom logic and
relational operations. What I like is that several aspects, like retraction
and timestamp handling (see below), are clearly separated, and all
libraries on DataStream/DataSet can be easily combined with relational
operations.
I can see that adding more functionality to the Table API would remove the
distinction between DataSet and DataStream. However, wouldn't we get a
similar benefit by extending the DataStream API with proper support for
bounded streams (as is the long-term goal of Flink)?
I'm also a bit skeptical about the optimization opportunities we would
gain. Map/FlatMap UDFs are black boxes that cannot be easily removed
without additional information (I did some research on this a few years ago
[1]).
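
For readers unfamiliar with the bridge mentioned above, a rough round-trip
sketch, assuming a Flink 1.x StreamTableEnvironment (the class, stream,
table, and field names are invented):

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.java.StreamTableEnvironment;
    import org.apache.flink.types.Row;

    public class BridgeSketch {
        // A DataStream becomes a Table, relational operations apply, and
        // the result goes back to a DataStream for arbitrary custom logic.
        static DataStream<Row> roundTrip(StreamTableEnvironment tableEnv,
                                         DataStream<Tuple2<String, Long>> stream) {
            Table t = tableEnv.fromDataStream(stream, "user, amount");
            Table filtered = t.filter("amount > 100").select("user, amount");
            return tableEnv.toAppendStream(filtered, Row.class);
        }
    }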

Moreover, I think there are a few tricky details that need to be resolved
to enable a good integration.

1) How to deal with retraction messages? The DataStream API does not have a
notion of retractions. How would a MapFunction or FlatMapFunction handle
retraction? Do they need to be aware of the change flag? Custom windowing
and aggregation logic would certainly need to have that information.
2) How to deal with timestamps? The DataStream API does not give access to
timestamps. In the Table API / SQL these are exposed as regular attributes.
How can we ensure that timestamp attributes remain valid (i.e. aligned with
watermarks) if the output is produced by arbitrary code?
There might be more issues of this kind.
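
On question 1, the change flag does surface today when a Table leaves the
relational world via the retract-stream conversion; a rough sketch,
assuming the Flink 1.x bridge API (the class, method, and table names are
invented):

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.java.StreamTableEnvironment;
    import org.apache.flink.types.Row;

    public class RetractionSketch {
        // `result` is assumed to be a continuously updating table, e.g.
        // the output of a non-windowed aggregation.
        static DataStream<String> toChangelog(StreamTableEnvironment tableEnv,
                                              Table result) {
            // The Boolean is the change flag: true = add, false = retract.
            DataStream<Tuple2<Boolean, Row>> retractStream =
                    tableEnv.toRetractStream(result, Row.class);

            // Any user function downstream must interpret the flag itself;
            // the same question would apply to a map()/flatMap() inside the
            // Table API.
            return retractStream.map(
                    new MapFunction<Tuple2<Boolean, Row>, String>() {
                        @Override
                        public String map(Tuple2<Boolean, Row> change) {
                            return (change.f0 ? "+ " : "- ") + change.f1;
                        }
                    });
        }
    }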

My main question would be how much we would gain with this proposal over a
tight integration of the Table API and DataStream API, assuming that batch
functionality is moved to DataStream.

Best, Fabian

[1] http://stratosphere.eu/assets/papers/openingTheBlackBoxes_12.pdf


Am Mo., 5. Nov. 2018 um 02:49 Uhr schrieb Rong Rong <wa...@gmail.com>:

> Hi Jincheng,
>
> Thank you for the proposal! I think being able to define a process /
> co-process function in the Table API definitely opens up a whole new level
> of applications using a unified API.
>
> In addition, as Tzu-Li and Hequn have mentioned, the optimization layer
> of the Table API will already bring additional benefit over directly
> programming on top of the DataStream/DataSet API. I am very interested in
> and looking forward to seeing the support for more complex use cases,
> especially iterations. It will enable the Table API to define much
> broader, event-driven use cases such as real-time ML prediction/training.
>
> As Timo mentioned, this will make the Table API diverge from the SQL API.
> But from my experience the Table API has always given me the impression of
> being a more sophisticated, syntax-aware way to express relational
> operations. Looking forward to further discussion and collaborations on
> the FLIP doc.
>
> --
> Rong
>
> On Sun, Nov 4, 2018 at 5:22 PM jincheng sun <su...@gmail.com>
> wrote:
>
> > Hi tison,
> >
> > Thanks a lot for your feedback!
> > I am very happy to see that community contributors agree to enhanced the
> > TableAPI. This work is a long-term continuous work, we will push it in
> > stages, we will soon complete  the enhanced list of the first phase, we
> can
> > go deep discussion  in google doc. thanks again for joining on the very
> > important discussion of the Flink Table API.
> >
> > Thanks,
> > Jincheng
> >
> > Tzu-Li Chen <wa...@gmail.com> 于2018年11月2日周五 下午1:49写道:
> >
> > > Hi jingchengm
> > >
> > > Thanks a lot for your proposal! I find it is a good start point for
> > > internal optimization works and help Flink to be more
> > > user-friendly.
> > >
> > > AFAIK, DataStream is the most popular API currently that Flink
> > > users should describe their logic with detailed logic.
> > > From a more internal view the conversion from DataStream to
> > > JobGraph is quite mechanically and hard to be optimized. So when
> > > users program with DataStream, they have to learn more internals
> > > and spend a lot of time to tune for performance.
> > > With your proposal, we provide enhanced functionality of Table API,
> > > so that users can describe their job easily on Table aspect. This gives
> > > an opportunity to Flink developers to introduce an optimize phase
> > > while transforming user program(described by Table API) to internal
> > > representation.
> > >
> > > Given a user who want to start using Flink with simple ETL, pipelining
> > > or analytics, he would find it is most naturally described by SQL/Table
> > > API. Further, as mentioned by @hequn,
> > >
> > > SQL is a widely used language. It follows standards, is a
> > > > descriptive language, and is easy to use
> > >
> > >
> > > thus we could expect with the enhancement of SQL/Table API, Flink
> > > becomes more friendly to users.
> > >
> > > Looking forward to the design doc/FLIP!
> > >
> > > Best,
> > > tison.
> > >
> > >
> > > jincheng sun <su...@gmail.com> 于2018年11月2日周五 上午11:46写道:
> > >
> > > > Hi Hequn,
> > > > Thanks for your feedback! And also thanks for our offline discussion!
> > > > You are right, unification of batch and streaming is very important
> for
> > > > flink API.
> > > > We will provide more detailed design later, Please let me know if you
> > > have
> > > > further thoughts or feedback.
> > > >
> > > > Thanks,
> > > > Jincheng
> > > >
> > > > Hequn Cheng <ch...@gmail.com> 于2018年11月2日周五 上午10:02写道:
> > > >
> > > > > Hi Jincheng,
> > > > >
> > > > > Thanks a lot for your proposal. It is very encouraging!
> > > > >
> > > > > As we all know, SQL is a widely used language. It follows
> standards,
> > > is a
> > > > > descriptive language, and is easy to use. A powerful feature of SQL
> > is
> > > > that
> > > > > it supports optimization. Users only need to care about the logic
> of
> > > the
> > > > > program. The underlying optimizer will help users optimize the
> > > > performance
> > > > > of the program. However, in terms of functionality and ease of use,
> > in
> > > > some
> > > > > scenarios sql will be limited, as described in Jincheng's proposal.
> > > > >
> > > > > Correspondingly, the DataStream/DataSet api can provide powerful
> > > > > functionalities. Users can write ProcessFunction/CoProcessFunction
> > and
> > > > get
> > > > > the timer. Compared with SQL, it provides more functionalities and
> > > > > flexibilities. However, it does not support optimization like SQL.
> > > > > Meanwhile, DataStream/DataSet api has not been unified which means,
> > for
> > > > the
> > > > > same logic, users need to write a job for each stream and batch.
> > > > >
> > > > > With TableApi, I think we can combine the advantages of both. Users
> > can
> > > > > easily write relational operations and enjoy optimization. At the
> > same
> > > > > time, it supports more functionality and ease of use. Looking
> forward
> > > to
> > > > > the detailed design/FLIP.
> > > > >
> > > > > Best,
> > > > > Hequn
> > > > >

Re: [DISCUSS] Enhancing the functionality and productivity of Table API

Posted by Rong Rong <wa...@gmail.com>.
Hi Jincheng,

Thank you for the proposal! I think being able to define a process /
co-process function in the Table API definitely opens up a whole new
level of applications using a unified API.
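
To make this concrete, here is a purely hypothetical sketch of what such
a row-based operation could look like next to today's column-based
select() (map() on Table and MyRowMapFunction are illustrative names
only, not a decided API; "orders" is assumed to be an existing Table and
parse1/parse2/parse3 stand for registered ScalarFunctions):

    // Today: one scalar UDF call per output column.
    Table parsed = orders.select("parse1(raw), parse2(raw), parse3(raw)");

    // Proposed style: a single row-based function that produces all
    // output columns at once, analogous to dataStream.map(mapFun).
    Table parsed2 = orders.map(new MyRowMapFunction());

A single row-based function would replace the many per-column UDF
definitions that the proposal complains about.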

In addition, as Tzu-Li and Hequn have mentioned, the optimization layer
of the Table API already brings an additional benefit over programming
directly on top of the DataStream/DataSet API. I am very interested in,
and looking forward to, seeing support for more complex use cases,
especially iterations. It will enable the Table API to cover much
broader, event-driven use cases such as real-time ML prediction/training.

As Timo mentioned, this will make the Table API diverge from the SQL
API. But in my experience, the Table API has always given me the
impression of being a more sophisticated, syntax-aware way to express
relational operations. Looking forward to further discussion and
collaboration on the FLIP doc.

--
Rong


Re: [DISCUSS] Enhancing the functionality and productivity of Table API

Posted by jincheng sun <su...@gmail.com>.
Hi tison,

Thanks a lot for your feedback!
I am very happy to see that community contributors agree to enhancing
the Table API. This is a long-term, continuous effort that we will push
forward in stages. We will soon complete the enhancement list for the
first phase, and we can then go into a deeper discussion in the Google
doc. Thanks again for joining this very important discussion of the
Flink Table API.

Thanks,
Jincheng


Re: [DISCUSS] Enhancing the functionality and productivity of Table API

Posted by Tzu-Li Chen <wa...@gmail.com>.
Hi jincheng,

Thanks a lot for your proposal! I find it a good starting point for
internal optimization work, and it will help Flink become more
user-friendly.

AFAIK, DataStream is currently the most popular API, and one with which
Flink users have to describe their logic in full detail. From a more
internal point of view, the conversion from DataStream to JobGraph is
quite mechanical and hard to optimize. So when users program with
DataStream, they have to learn more internals and spend a lot of time
tuning for performance.
With your proposal, we provide enhanced functionality in the Table API,
so that users can describe their job easily at the Table level. This
gives Flink developers an opportunity to introduce an optimization phase
while transforming the user program (described with the Table API) into
the internal representation.
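
As a small sketch of what that buys us (assuming the current Java Table
API; the "Orders" table and its fields are made up for illustration),
the planner is free to rewrite the declarative program below, e.g. push
the filter towards the source or split the aggregation into stages,
without the user tuning anything by hand:

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.TableEnvironment;
    import org.apache.flink.table.api.java.StreamTableEnvironment;

    StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();
    StreamTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env);

    // The optimizer, not the user, decides how this plan is executed.
    Table result = tEnv.scan("Orders")   // assumes a registered table
        .filter("amount > 100")
        .groupBy("user")
        .select("user, amount.sum as total");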

Given a user who wants to start using Flink for simple ETL, pipelining,
or analytics, they would find their job most naturally described by the
SQL/Table API. Further, as mentioned by @hequn,

> SQL is a widely used language. It follows standards, is a
> descriptive language, and is easy to use

thus we could expect that, with the enhancement of the SQL/Table API,
Flink becomes friendlier to users.

Looking forward to the design doc/FLIP!

Best,
tison.



Re: [DISCUSS] Enhancing the functionality and productivity of Table API

Posted by jincheng sun <su...@gmail.com>.
Hi Hequn,
Thanks for your feedback! And also thanks for our offline discussion!
You are right, unification of batch and streaming is very important for
the Flink API.
We will provide a more detailed design later. Please let me know if you
have further thoughts or feedback.

Thanks,
Jincheng


Re: [DISCUSS] Enhancing the functionality and productivity of Table API

Posted by Hequn Cheng <ch...@gmail.com>.
Hi Jincheng,

Thanks a lot for your proposal. It is very encouraging!

As we all know, SQL is a widely used language. It follows standards, is
a descriptive language, and is easy to use. A powerful feature of SQL is
that it supports optimization: users only need to care about the logic
of the program, and the underlying optimizer will optimize the
performance of the program for them. However, in terms of functionality
and ease of use, SQL is limited in some scenarios, as described in
Jincheng's proposal.

Correspondingly, the DataStream/DataSet API provides powerful
functionality. Users can write a ProcessFunction/CoProcessFunction and
get access to timers. Compared with SQL, it provides more functionality
and flexibility. However, it does not support optimization the way SQL
does. Meanwhile, the DataStream and DataSet APIs have not been unified,
which means that for the same logic, users need to write one job for
streaming and another for batch.
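
For example, the small sketch below (the types and the 30-second timeout
are made up for illustration) registers a per-record event-time timer,
which is natural in a ProcessFunction but has no SQL counterpart today:

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.functions.ProcessFunction;
    import org.apache.flink.util.Collector;

    // Emits an alert once the watermark passes 30s after an event's
    // timestamp. Timers require running the function on a keyed stream.
    public class TimeoutAlert
            extends ProcessFunction<Tuple2<String, Long>, String> {

        @Override
        public void processElement(
                Tuple2<String, Long> event, Context ctx,
                Collector<String> out) {
            // Register an event-time timer relative to this element's
            // timestamp.
            ctx.timerService().registerEventTimeTimer(
                ctx.timestamp() + 30000L);
        }

        @Override
        public void onTimer(
                long timestamp, OnTimerContext ctx, Collector<String> out) {
            out.collect("timeout fired at " + timestamp);
        }
    }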

With the Table API, I think we can combine the advantages of both: users
can easily write relational operations and enjoy optimization, while at
the same time it supports more functionality and is easier to use.
Looking forward to the detailed design/FLIP.

Best,
Hequn


Re: [DISCUSS] Enhancing the functionality and productivity of Table API

Posted by Shaoxuan Wang <ws...@gmail.com>.
Hi Aljoscha,
Glad that you like the proposal. We have completed a prototype of most
of the newly proposed functionalities. Once we collect the feedback from
the community, we will come up with a concrete FLIP/design doc.

Regards,
Shaoxuan



Re: [DISCUSS] Enhancing the functionality and productivity of Table API

Posted by Aljoscha Krettek <al...@apache.org>.
Yes, that makes sense!


Re: [DISCUSS] Enhancing the functionality and productivity of Table API

Posted by jincheng sun <su...@gmail.com>.
Hi, Aljoscha,

Thanks for your feedback and suggestions. I think you are right: a
detailed design/FLIP is very necessary. Before writing the detailed
design or opening a FLIP, I would like to hear the community's views on
enhancing the functionality and productivity of the Table API, to ensure
that it is worth the effort. If most community members agree with my
proposal, I will list the changes and discuss them with all community
members. Does that make sense to you?

Thanks,
Jincheng

Aljoscha Krettek <al...@apache.org> 于2018年11月1日周四 下午8:12写道:

> Hi Jincheng,
>
> These points sound very good! Are there any concrete proposals for
> changes? For example, a FLIP/design document?
>
> See here for FLIPs:
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
>
> Best,
> Aljoscha
>

Re: [DISCUSS] Enhancing the functionality and productivity of Table API

Posted by Aljoscha Krettek <al...@apache.org>.
Hi Jincheng,

These points sound very good! Are there any concrete proposals for changes? For example, a FLIP/design document?

See here for FLIPs: https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals

Best,
Aljoscha 

> On 1. Nov 2018, at 12:51, jincheng sun <su...@gmail.com> wrote:
> 
> *--------I am sorry for the formatting of the email content. I have
> reformatted the **content** as follows-----------*
> 
> *Hi ALL,*
> 
> With the continuous efforts from the community, the Flink system has been
> continuously improved, which has attracted more and more users. Flink SQL
> is a canonical, widely used relational query language. However, there are
> still some scenarios where Flink SQL fails to meet user needs in terms of
> functionality and ease of use, such as:
> 
> *1. In terms of functionality*
>    Iteration, user-defined window, user-defined join, user-defined
> GroupReduce, etc. Users cannot express them with SQL;
> 
> *2. In terms of ease of use*
> 
>   - Map - e.g. “dataStream.map(mapFun)”. Although “table.select(udf1(),
>   udf2(), udf3()...)” can be used to accomplish the same function, with a
>   map() function returning 100 columns, one has to define or call 100 UDFs
>   when using SQL, which is quite involved.
>   - FlatMap - e.g. “dataStream.flatMap(flatMapFun)”. Similarly, it can be
>   implemented with “table.join(udtf).select()”. However, it is obvious that
>   the DataStream API is easier to use than SQL.
> 
> Due to the above two reasons, some users have to use the DataStream API or
> the DataSet API. But when they do that, they lose the unification of batch
> and streaming. They will also lose the sophisticated optimizations such as
> codegen, aggregate join transpose and multi-stage agg from Flink SQL.
> 
> We believe that enhancing the functionality and productivity is vital for
> the successful adoption of Table API. To this end, Table API still
> requires more effort from every contributor in the community. We see great
> opportunity in improving our users’ experience from this work. Any feedback
> is welcome.
> 
> Regards,
> 
> Jincheng
> 


Re: [DISCUSS] Enhancing the functionality and productivity of Table API

Posted by jincheng sun <su...@gmail.com>.
*--------I am sorry for the formatting of the email content. I have
reformatted the **content** as follows-----------*

*Hi ALL,*

With the continuous efforts from the community, the Flink system has been
continuously improved, which has attracted more and more users. Flink SQL
is a canonical, widely used relational query language. However, there are
still some scenarios where Flink SQL fails to meet user needs in terms of
functionality and ease of use, such as:

*1. In terms of functionality*
    Iteration, user-defined window, user-defined join, user-defined
GroupReduce, etc. Users cannot express them with SQL (a GroupReduce
sketch follows);
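
For instance, a per-group, order-sensitive scan with an early exit is
natural to write as a DataSet GroupReduce but has no direct counterpart in
plain SQL aggregates. A minimal sketch (the class name, threshold and
sample data are made up for illustration):

import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.common.operators.Order;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class GroupReduceExample {
  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    DataSet<Tuple2<String, Long>> input = env.fromElements(
        Tuple2.of("a", 10L), Tuple2.of("a", 200L), Tuple2.of("a", 300L),
        Tuple2.of("b", 5L), Tuple2.of("b", 50L));

    // Per group: emit values in ascending order and stop after the first
    // value above a threshold. The data-dependent number of output rows
    // per group is what makes this hard to express as a SQL aggregate.
    input.groupBy(0)
        .sortGroup(1, Order.ASCENDING)
        .reduceGroup(
            new GroupReduceFunction<Tuple2<String, Long>, Tuple2<String, Long>>() {
              @Override
              public void reduce(Iterable<Tuple2<String, Long>> values,
                                 Collector<Tuple2<String, Long>> out) {
                for (Tuple2<String, Long> v : values) {
                  out.collect(v);
                  if (v.f1 > 100L) {
                    return; // early exit for this group
                  }
                }
              }
            })
        .print();
  }
}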

*2. In terms of ease of use*

   - Map - e.g. “dataStream.map(mapFun)”. Although “table.select(udf1(),
   udf2(), udf3()...)” can be used to accomplish the same function, with a
   map() function returning 100 columns one has to define or call 100 UDFs
   when using SQL, which is quite involved.
   - FlatMap - e.g. “dataStream.flatMap(flatMapFun)”. Similarly, it can be
   implemented with “table.join(udtf).select()”. However, it is obvious that
   the DataStream API is easier to use than SQL. (A sketch contrasting the
   two styles follows this list.)
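
To make the contrast concrete, here is a minimal sketch. The
ScalarFunction part below uses the existing API; the “table.map(...)”
part is purely hypothetical and only illustrates the proposed row-based
style (the method name and the row-valued function are assumptions, not
existing Flink API):

import org.apache.flink.table.functions.ScalarFunction;

// Today: one ScalarFunction per derived column. With 100 output columns,
// 100 such functions have to be defined, registered and listed in select().
public class ParseField1 extends ScalarFunction {
    public String eval(String line) {
        return line.split(",")[0];
    }
}
// ... ParseField2 through ParseField100 ...

// tableEnv.registerFunction("parseField1", new ParseField1());
// Table result = table.select("parseField1(line), parseField2(line), ...");

// Proposed style (hypothetical): a single row-valued function builds the
// whole output row in one call, mirroring dataStream.map(mapFun):
//
//   Table result = table.map(new Parser());  // Parser returns a Row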

Due to the above two reasons, some users have to use the DataStream API or
the DataSet API. But when they do that, they lose the unification of batch
and streaming. They also lose sophisticated optimizations such as codegen,
aggregate join transpose and multi-stage agg from Flink SQL, as the sketch
below illustrates.
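
Behind a single declarative line such as
“orders.groupBy("userId").select("userId, amount.sum")”, the planner is
free to apply codegen or rewrite the job into a multi-stage aggregation,
while a hand-written DataStream job runs exactly as written. A minimal
sketch (the class name and sample data are made up for illustration):

import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class HandRolledAgg {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();

    // (userId, amount) records; sample data for illustration.
    DataStream<Tuple2<String, Long>> orders = env.fromElements(
        Tuple2.of("u1", 10L), Tuple2.of("u1", 20L), Tuple2.of("u2", 7L));

    // Hand-written per-key sum: executed exactly as written, with no
    // automatic pre-aggregation or other planner rewrites.
    orders.keyBy(0)
        .reduce(new ReduceFunction<Tuple2<String, Long>>() {
          @Override
          public Tuple2<String, Long> reduce(Tuple2<String, Long> a,
                                             Tuple2<String, Long> b) {
            return Tuple2.of(a.f0, a.f1 + b.f1);
          }
        })
        .print();

    env.execute("hand-rolled aggregation");
  }
}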

We believe that enhancing the functionality and productivity is vital for
the successful adoption of the Table API. To this end, the Table API still
requires more effort from every contributor in the community. We see a great
opportunity in improving our users’ experience through this work. Any feedback
is welcome.

Regards,

Jincheng
