You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@flink.apache.org by Fabian Hueske <fh...@gmail.com> on 2017/06/15 14:29:19 UTC

[DISCUSS] Table API / SQL features for Flink 1.4.0

Hi everybody,

I would like to start a discussion about the targeted feature set of the
Table API / SQL for Flink 1.4.0.
Flink 1.3.0 was released about 2 weeks ago and we have 2.5 months (~11
weeks, until begin of September) left until the feature freeze for Flink
1.4.0.

I think it makes sense to start with a collection of desired features. Once
we have a list of requested features, we might want to prioritize and maybe
also assign responsibilities.

When we prioritize, we should keep in mind that:
- we want to have a consistent API. Larger features should be developed in
a feature branch first.
- the next months are typical time for vacations
- we have been bottlenecked by committer resources in the last release.

I think the following features would be a nice addition to the current
state:

- Conversion of a stream into an upsert table (with retraction, updating to
the last row per key)
- Joins for streaming tables
  - Stream-Stream (time-range predicate) there is already a PR for
processing time joins
  - Table-Table (with retraction)
- Support for late arriving records in group window aggregations
- Exposing a keyed result table as queryable state

Which features are others looking for?

Cheers,
Fabian

Re: [DISCUSS] Table API / SQL features for Flink 1.4.0

Posted by Fabian Hueske <fh...@gmail.com>.

Hi everybody,

Shaoxuan, Timo, and I compiled a list of features from the replies to this
thread, features that didn't make it into 1.3, and some additional ones.
We also graded them by importance, tried to assess the effort, and added
links to JIRAs (some existed already others were created) and existing PRs.

Please have a look at the list and give feedback if you want to add a
feature to the list or do not agree with the importance or effort
assessment either by replying to this thread or commenting on the document.

-> https://docs.google.com/document/d/1I7YuF6lfxyuyPIve_
VtLDNMfSKcepZJbnZfVueqlQN4

Thanks,
Fabian




2017-06-21 23:55 GMT+02:00 Fabian Hueske <fh...@gmail.com>:

> Hi Haohui,
>
> thanks for your input!
>
> Can you describe the semantics of the join you'd like to see in Flink 1.4?
> I can think of three types of joins that match your description:
> 1) `table` is an external table stored in an external database (redis,
> cassandra, MySQL, etc) and we join with the current value that is in that
> table. This could be implemented with an async TableFunction (based on
> Flink's AsyncFunction).
> 2) `table` is static: In this case we need support for side-inputs to read
> the whole table before starting to process the other (streaming) side.
> There is a FLIP [1] for side inputs. I don't know what's the status of this
> feature though.
> 3) `table` changing and each record of the stream should be joined with
> the most recent update (but no future updates). In this case, the query is
> more complex to express and requires some time-bound logic which is quite
> cumbersome to express in SQL. I think this is a very important type of
> join, but IMO it is more challenging to implement than the other joins. We
> had also a discussion about this type of join on the dev ML a few months
> back [2].
>
> Which type of join are you looking for (external table, static table,
> dynamic table)?
>
> Regarding the bottleneck of committers, the situation should become a bit
> better in the near future as we have two committer more working on the
> relational APIs (Jark is spending more time here and Shaoxuan recently
> became a committer). However, we will of course continue to encourage and
> help contributors to earn the merits to become committers and grow the
> number of committers.
>
> Thank you very much,
> Fabian
>
> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-
> 17+Side+Inputs+for+DataStream+API
> [2] http://apache-flink-mailing-list-archive.1008284.n3.
> nabble.com/STREAM-SQL-inner-queries-tp15585.html
>

Re: [DISCUSS] Table API / SQL features for Flink 1.4.0

Posted by Fabian Hueske <fh...@gmail.com>.

Hi Haohui,

thanks for your input!

Can you describe the semantics of the join you'd like to see in Flink 1.4?
I can think of three types of joins that match your description:
1) `table` is an external table stored in an external database (redis,
cassandra, MySQL, etc) and we join with the current value that is in that
table. This could be implemented with an async TableFunction (based on
Flink's AsyncFunction).
2) `table` is static: In this case we need support for side-inputs to read
the whole table before starting to process the other (streaming) side.
There is a FLIP [1] for side inputs. I don't know what's the status of this
feature though.
3) `table` changing and each record of the stream should be joined with the
most recent update (but no future updates). In this case, the query is more
complex to express and requires some time-bound logic which is quite
cumbersome to express in SQL. I think this is a very important type of
join, but IMO it is more challenging to implement than the other joins. We
had also a discussion about this type of join on the dev ML a few months
back [2].

Which type of join are you looking for (external table, static table,
dynamic table)?

Regarding the bottleneck of committers, the situation should become a bit
better in the near future as we have two committer more working on the
relational APIs (Jark is spending more time here and Shaoxuan recently
became a committer). However, we will of course continue to encourage and
help contributors to earn the merits to become committers and grow the
number of committers.

Thank you very much,
Fabian

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+Side+Inputs+for+DataStream+API
[2]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/STREAM-SQL-inner-queries-tp15585.html

Re: [DISCUSS] Table API / SQL features for Flink 1.4.0

Posted by Haohui Mai <ri...@gmail.com>.

Hi,

We are interested in building the simplest case of stream-table joins --
essentially calling stream.map(x => (x, table.get(x)). It solves the use
cases of augmenting the streams with the information of the database. The
operation itself can be batched for better performance.

We are happy to contribute to the the scalar functions as well as we
internally also share similar requirements.

Fabian mentioned that the development of Table / SQL API was bottlenecked
by committers, which shows that there are thriving developments happening
in the space. I think it is a good problem to have. :-)

I wonder, is it a good time to nominate new batches of committers and to
keep the momentum of developments?

Regards,
Haohui



On Fri, Jun 16, 2017 at 7:28 AM jincheng sun <su...@gmail.com>
wrote:

> Hi Fabian,
> Thanks for bring up this discuss.
> In order to enrich Flink's built-in scalar function, friendly user
> experience, I recommend adding as much scalar functions as possible in
> version 1.4 release. I have filed the JIRAs(
> https://issues.apache.org/jira/browse/FLINK-6810), and try my best to work
> on them.
>
> Of course, welcome anybody to add sub-tasks or take the JIRAs.
>
> Cheers,
> SunJincheng
>
> 2017-06-16 16:07 GMT+08:00 Fabian Hueske <fh...@gmail.com>:
>
> > Thanks for your response Shaoxuan,
> >
> > My "Table-table join with retraction" is probably the same as your
> > "unbounded stream-stream join with retraction".
> > Basically, a join between two dynamic tables with unique keys (either
> > because of an upsert stream->table conversion or an unbounded
> aggregation).
> >
> > Best, Fabian
> >
> > 2017-06-16 0:56 GMT+02:00 Shaoxuan Wang <ws...@gmail.com>:
> >
> > > Nice timing, Fabian!
> > >
> > > Your checklist aligns our plans very well. Here are the things we are
> > > working on & planning to contribute to release 1.4:
> > > 1. DDL (with property waterMark config for source-table, and emit
> config
> > on
> > > result-table)
> > > 2. unbounded stream-stream joins (with retraction supported)
> > > 3. backend state user interface for UDAGG
> > > 4. UDOP (as oppose to UDF(scalars to scalar)/UDTF(scalar to
> > > table)/UDAGG(table to scalar), this allows user to define a table to
> > table
> > > conversion business logic)
> > >
> > > Some of them already have PR/jira, while some are not. We will send out
> > the
> > > design doc for the missing ones very soon. Looking forward to the 1.4
> > > release.
> > >
> > > Btw, what is "Table-Table (with retraction)" you have mentioned in your
> > > plan?
> > >
> > > Regards,
> > > Shaoxuan
> > >
> > >
> > >
> > > On Thu, Jun 15, 2017 at 10:29 PM, Fabian Hueske <fh...@gmail.com>
> > wrote:
> > >
> > > > Hi everybody,
> > > >
> > > > I would like to start a discussion about the targeted feature set of
> > the
> > > > Table API / SQL for Flink 1.4.0.
> > > > Flink 1.3.0 was released about 2 weeks ago and we have 2.5 months
> (~11
> > > > weeks, until begin of September) left until the feature freeze for
> > Flink
> > > > 1.4.0.
> > > >
> > > > I think it makes sense to start with a collection of desired
> features.
> > > Once
> > > > we have a list of requested features, we might want to prioritize and
> > > maybe
> > > > also assign responsibilities.
> > > >
> > > > When we prioritize, we should keep in mind that:
> > > > - we want to have a consistent API. Larger features should be
> developed
> > > in
> > > > a feature branch first.
> > > > - the next months are typical time for vacations
> > > > - we have been bottlenecked by committer resources in the last
> release.
> > > >
> > > > I think the following features would be a nice addition to the
> current
> > > > state:
> > > >
> > > > - Conversion of a stream into an upsert table (with retraction,
> > updating
> > > to
> > > > the last row per key)
> > > > - Joins for streaming tables
> > > >   - Stream-Stream (time-range predicate) there is already a PR for
> > > > processing time joins
> > > >   - Table-Table (with retraction)
> > > > - Support for late arriving records in group window aggregations
> > > > - Exposing a keyed result table as queryable state
> > > >
> > > > Which features are others looking for?
> > > >
> > > > Cheers,
> > > > Fabian
> > > >
> > >
> >
>

Re: [DISCUSS] Table API / SQL features for Flink 1.4.0

Posted by jincheng sun <su...@gmail.com>.

Hi Fabian,
Thanks for bring up this discuss.
In order to enrich Flink's built-in scalar function, friendly user
experience, I recommend adding as much scalar functions as possible in
version 1.4 release. I have filed the JIRAs(
https://issues.apache.org/jira/browse/FLINK-6810), and try my best to work
on them.

Of course, welcome anybody to add sub-tasks or take the JIRAs.

Cheers,
SunJincheng

2017-06-16 16:07 GMT+08:00 Fabian Hueske <fh...@gmail.com>:

> Thanks for your response Shaoxuan,
>
> My "Table-table join with retraction" is probably the same as your
> "unbounded stream-stream join with retraction".
> Basically, a join between two dynamic tables with unique keys (either
> because of an upsert stream->table conversion or an unbounded aggregation).
>
> Best, Fabian
>
> 2017-06-16 0:56 GMT+02:00 Shaoxuan Wang <ws...@gmail.com>:
>
> > Nice timing, Fabian!
> >
> > Your checklist aligns our plans very well. Here are the things we are
> > working on & planning to contribute to release 1.4:
> > 1. DDL (with property waterMark config for source-table, and emit config
> on
> > result-table)
> > 2. unbounded stream-stream joins (with retraction supported)
> > 3. backend state user interface for UDAGG
> > 4. UDOP (as oppose to UDF(scalars to scalar)/UDTF(scalar to
> > table)/UDAGG(table to scalar), this allows user to define a table to
> table
> > conversion business logic)
> >
> > Some of them already have PR/jira, while some are not. We will send out
> the
> > design doc for the missing ones very soon. Looking forward to the 1.4
> > release.
> >
> > Btw, what is "Table-Table (with retraction)" you have mentioned in your
> > plan?
> >
> > Regards,
> > Shaoxuan
> >
> >
> >
> > On Thu, Jun 15, 2017 at 10:29 PM, Fabian Hueske <fh...@gmail.com>
> wrote:
> >
> > > Hi everybody,
> > >
> > > I would like to start a discussion about the targeted feature set of
> the
> > > Table API / SQL for Flink 1.4.0.
> > > Flink 1.3.0 was released about 2 weeks ago and we have 2.5 months (~11
> > > weeks, until begin of September) left until the feature freeze for
> Flink
> > > 1.4.0.
> > >
> > > I think it makes sense to start with a collection of desired features.
> > Once
> > > we have a list of requested features, we might want to prioritize and
> > maybe
> > > also assign responsibilities.
> > >
> > > When we prioritize, we should keep in mind that:
> > > - we want to have a consistent API. Larger features should be developed
> > in
> > > a feature branch first.
> > > - the next months are typical time for vacations
> > > - we have been bottlenecked by committer resources in the last release.
> > >
> > > I think the following features would be a nice addition to the current
> > > state:
> > >
> > > - Conversion of a stream into an upsert table (with retraction,
> updating
> > to
> > > the last row per key)
> > > - Joins for streaming tables
> > >   - Stream-Stream (time-range predicate) there is already a PR for
> > > processing time joins
> > >   - Table-Table (with retraction)
> > > - Support for late arriving records in group window aggregations
> > > - Exposing a keyed result table as queryable state
> > >
> > > Which features are others looking for?
> > >
> > > Cheers,
> > > Fabian
> > >
> >
>

Re: [DISCUSS] Table API / SQL features for Flink 1.4.0

Posted by Fabian Hueske <fh...@gmail.com>.

Thanks for your response Shaoxuan,

My "Table-table join with retraction" is probably the same as your
"unbounded stream-stream join with retraction".
Basically, a join between two dynamic tables with unique keys (either
because of an upsert stream->table conversion or an unbounded aggregation).

Best, Fabian

2017-06-16 0:56 GMT+02:00 Shaoxuan Wang <ws...@gmail.com>:

> Nice timing, Fabian!
>
> Your checklist aligns our plans very well. Here are the things we are
> working on & planning to contribute to release 1.4:
> 1. DDL (with property waterMark config for source-table, and emit config on
> result-table)
> 2. unbounded stream-stream joins (with retraction supported)
> 3. backend state user interface for UDAGG
> 4. UDOP (as oppose to UDF(scalars to scalar)/UDTF(scalar to
> table)/UDAGG(table to scalar), this allows user to define a table to table
> conversion business logic)
>
> Some of them already have PR/jira, while some are not. We will send out the
> design doc for the missing ones very soon. Looking forward to the 1.4
> release.
>
> Btw, what is "Table-Table (with retraction)" you have mentioned in your
> plan?
>
> Regards,
> Shaoxuan
>
>
>
> On Thu, Jun 15, 2017 at 10:29 PM, Fabian Hueske <fh...@gmail.com> wrote:
>
> > Hi everybody,
> >
> > I would like to start a discussion about the targeted feature set of the
> > Table API / SQL for Flink 1.4.0.
> > Flink 1.3.0 was released about 2 weeks ago and we have 2.5 months (~11
> > weeks, until begin of September) left until the feature freeze for Flink
> > 1.4.0.
> >
> > I think it makes sense to start with a collection of desired features.
> Once
> > we have a list of requested features, we might want to prioritize and
> maybe
> > also assign responsibilities.
> >
> > When we prioritize, we should keep in mind that:
> > - we want to have a consistent API. Larger features should be developed
> in
> > a feature branch first.
> > - the next months are typical time for vacations
> > - we have been bottlenecked by committer resources in the last release.
> >
> > I think the following features would be a nice addition to the current
> > state:
> >
> > - Conversion of a stream into an upsert table (with retraction, updating
> to
> > the last row per key)
> > - Joins for streaming tables
> >   - Stream-Stream (time-range predicate) there is already a PR for
> > processing time joins
> >   - Table-Table (with retraction)
> > - Support for late arriving records in group window aggregations
> > - Exposing a keyed result table as queryable state
> >
> > Which features are others looking for?
> >
> > Cheers,
> > Fabian
> >
>

Re: [DISCUSS] Table API / SQL features for Flink 1.4.0

Posted by Shaoxuan Wang <ws...@gmail.com>.

Nice timing, Fabian!

Your checklist aligns our plans very well. Here are the things we are
working on & planning to contribute to release 1.4:
1. DDL (with property waterMark config for source-table, and emit config on
result-table)
2. unbounded stream-stream joins (with retraction supported)
3. backend state user interface for UDAGG
4. UDOP (as oppose to UDF(scalars to scalar)/UDTF(scalar to
table)/UDAGG(table to scalar), this allows user to define a table to table
conversion business logic)

Some of them already have PR/jira, while some are not. We will send out the
design doc for the missing ones very soon. Looking forward to the 1.4
release.

Btw, what is "Table-Table (with retraction)" you have mentioned in your
plan?

Regards,
Shaoxuan

On Thu, Jun 15, 2017 at 10:29 PM, Fabian Hueske <fh...@gmail.com> wrote:

> Hi everybody,
>
> I would like to start a discussion about the targeted feature set of the
> Table API / SQL for Flink 1.4.0.
> Flink 1.3.0 was released about 2 weeks ago and we have 2.5 months (~11
> weeks, until begin of September) left until the feature freeze for Flink
> 1.4.0.
>
> I think it makes sense to start with a collection of desired features. Once
> we have a list of requested features, we might want to prioritize and maybe
> also assign responsibilities.
>
> When we prioritize, we should keep in mind that:
> - we want to have a consistent API. Larger features should be developed in
> a feature branch first.
> - the next months are typical time for vacations
> - we have been bottlenecked by committer resources in the last release.
>
> I think the following features would be a nice addition to the current
> state:
>
> - Conversion of a stream into an upsert table (with retraction, updating to
> the last row per key)
> - Joins for streaming tables
>   - Stream-Stream (time-range predicate) there is already a PR for
> processing time joins
>   - Table-Table (with retraction)
> - Support for late arriving records in group window aggregations
> - Exposing a keyed result table as queryable state
>
> Which features are others looking for?
>
> Cheers,
> Fabian
>